CN109583483A - Object detection method and system based on convolutional neural networks - Google Patents
- Publication number
- CN109583483A (application number CN201811347546.9A)
- Authority
- CN
- China
- Prior art keywords
- feature
- anchor
- feature map
- box
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to an object detection method and system based on convolutional neural networks, comprising: extracting convolutional feature maps of an input image with convolution kernels of multiple scales; adjusting the feature vector at each spatial position of the convolutional feature maps with fully connected layers to obtain first feature maps; concatenating them into a concatenated feature map and adjusting the information in each of its channels with fully connected layers to obtain a second feature map; setting anchor boxes of different scales and aspect ratios at each spatial position of the second feature map, where the coordinates and sizes of the anchor boxes are given in the coordinate system of the input image; projecting each anchor box onto the second feature map, extracting the features inside each projected anchor box with a region feature extraction operation, and selecting the anchor boxes that contain objects as target candidate boxes; and using a target recognition network to classify the objects in the target candidate boxes and regress the candidate boxes' precise locations and sizes.
Description
Technical field
The present invention relates to the technical field of computer vision, and in particular to an enhanced object candidate region generation method and apparatus, and an object detection method.
Background art

Object detection is one of the fundamental problems in computer vision, and with the development of deep convolutional neural networks its performance has improved greatly. The most common object detection methods based on convolutional neural networks follow a two-stage pipeline: a proposal generation network first produces candidate boxes (proposals), and a recognition network then classifies the proposals and refines them into the final bounding boxes.

In the two-stage pipeline, generating the candidate boxes is the most important step. Two classes of candidate box generation methods currently exist: one uses traditional hand-crafted features, the other generates candidate boxes with deep learning. The former obtains candidate boxes from superpixels, object edge information, and the like; the latter uses a fully convolutional network with anchor boxes (anchor boxes: a set of rectangles with predefined positions, scales, and aspect ratios, likewise hereinafter) to simultaneously predict the positions of candidate boxes and judge whether each candidate box contains an object.

Although deep-learning-based candidate box generation yields better candidate boxes than methods based on traditional hand-crafted features, the generated candidate boxes still contain a large amount of background rather than real objects. As a result, the bounding boxes finally produced by the second-stage recognition model cannot reach higher localization accuracy, which limits the improvement of detection performance. The main limitation of deep-learning-based generation is that a single-scale convolution kernel is used to extract features for objects of different scales, while anchor boxes of different scales at the same feature-map position share the same features, so the final result is sub-optimal.
Summary of the invention
Aiming at the limitation that deep-learning-based candidate box generation techniques extract features with a single-scale convolution kernel and share the same features among anchor boxes of different scales, which prevents object detection from reaching higher localization accuracy, the present invention proposes an object detection method based on convolutional neural networks, comprising:

Step 1: extracting convolutional feature maps of the input image with convolution kernels of multiple scales;

Step 2: adjusting the feature vector at each spatial position of the convolutional feature maps with fully connected layers to obtain first feature maps;

Step 3: concatenating the first feature maps into a concatenated feature map, and adjusting the information in each channel of the concatenated feature map with fully connected layers to obtain a second feature map;

Step 4: setting anchor boxes of different scales and aspect ratios at each spatial position of the second feature map, the coordinates and sizes of the anchor boxes being given in the coordinate system of the input image;

Step 5: projecting each anchor box onto the second feature map, extracting the features inside each projected anchor box with a region feature extraction operation, obtaining from those features the probability that the anchor box contains an object, and selecting target candidate boxes from all anchor boxes according to the probability values;

Step 6: using a target recognition network to classify the objects in the target candidate boxes and regress the candidate boxes' precise locations and sizes, determining the bounding box of each object from the precise location and size, and outputting the classification results and the bounding boxes as the detection results.
In the above object detection method, step 1 specifically extracts features in parallel with convolution operations using k different convolution kernels.
In the above object detection method, step 2 adjusts the feature vector at each spatial position of the convolutional feature maps specifically by:

ω_ij = F(d_ij)

o_ij = ω_ij ⊙ d_ij

where d_ij is the feature vector at spatial position (i, j) of a convolutional feature map, the nonlinear function F consists of three cascaded fully connected layers, ω_ij is the first adjustment coefficient, o_ij is the feature vector at spatial position (i, j) of the first feature map, and ⊙ denotes the element-wise product.
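The per-position adjustment of step 2 can be sketched in NumPy. The ReLU and sigmoid activations and the square weight shapes are illustrative assumptions; the patent only specifies three cascaded fully connected layers whose parameters are shared across spatial positions:

```python
import numpy as np

def spatial_adjust(feat, W1, W2, W3):
    """Adjust the C-dim feature vector d_ij at every position (i, j):
    w_ij = F(d_ij) via three FC layers (weights shared across positions),
    then o_ij = w_ij * d_ij element-wise."""
    H, W, C = feat.shape
    d = feat.reshape(-1, C)                  # one row per spatial position
    h = np.maximum(d @ W1, 0.0)              # FC 1 + ReLU (assumed)
    h = np.maximum(h @ W2, 0.0)              # FC 2 + ReLU (assumed)
    w = 1.0 / (1.0 + np.exp(-(h @ W3)))      # FC 3 + sigmoid gate (assumed)
    return (w * d).reshape(H, W, C)
```

Because the gate lies in (0, 1), each adjusted component never exceeds the original in magnitude, which matches the "enhance useful, suppress useless" reading of the coefficients.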
In the above object detection method, the adjustment of each channel of the concatenated feature map in step 3 specifically includes:

obtaining the feature descriptor a of the channels with global average pooling:

a = global_pooling(U)

where U is the concatenated feature map and global_pooling denotes global average pooling;

using three cascaded fully connected layers as the nonlinear function F to obtain the adjustment coefficient e of each channel:

e = F(a)

U′ = e ⊙ U

where ⊙ denotes the element-wise product and U′ is the second feature map.
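Under the same assumptions about activations, the channel adjustment of step 3 is a squeeze-and-excitation-style gate: one coefficient per channel, broadcast over all spatial positions. A minimal NumPy sketch:

```python
import numpy as np

def channel_adjust(U, W1, W2, W3):
    """a = global average pool of U over (H, W); e = F(a) via three FC
    layers; U' = e * U, broadcasting one coefficient per channel."""
    a = U.mean(axis=(0, 1))                  # a = global_pooling(U), shape (C,)
    h = np.maximum(a @ W1, 0.0)              # FC 1 + ReLU (assumed)
    h = np.maximum(h @ W2, 0.0)              # FC 2 + ReLU (assumed)
    e = 1.0 / (1.0 + np.exp(-(h @ W3)))      # FC 3 + sigmoid (assumed)
    return e * U                             # U' = e ⊙ U
```

Note the contrast with step 2: there each spatial position gets its own C-dim coefficient vector, here each channel gets a single scalar applied to all H × W values.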
In the above object detection method, step 5 includes: sorting the anchor boxes according to the probability values, filtering out duplicate anchor boxes with non-maximum suppression, and selecting the N target candidate boxes with the highest probabilities, N being a preset positive integer.
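Step 5's selection (sort by objectness, suppress duplicates, keep the top N) can be sketched as follows; the IoU threshold of 0.7 is an illustrative choice, not a value fixed by the patent:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def select_proposals(boxes, scores, n, iou_thresh=0.7):
    """Sort anchors by objectness score, greedily apply non-maximum
    suppression, and return the indices of the top-n surviving boxes."""
    order = np.argsort(scores)[::-1]         # highest probability first
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(int(i))
        if len(keep) == n:
            break
    return keep
```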
The invention also discloses an object detection system based on convolutional neural networks, comprising:

an extraction module, which extracts convolutional feature maps of the input image with convolution kernels of multiple scales;

a first adjustment module, which adjusts the feature vector at each spatial position of the convolutional feature maps with fully connected layers to obtain first feature maps;

a second adjustment module, which concatenates the first feature maps into a concatenated feature map and adjusts the information in each channel of the concatenated feature map with fully connected layers to obtain a second feature map;

an anchor setting module, which sets anchor boxes of different scales and aspect ratios at each spatial position of the second feature map, the coordinates and sizes of the anchor boxes being given in the coordinate system of the input image;

a candidate box selection module, which projects each anchor box onto the second feature map, extracts the features inside each projected anchor box with a region feature extraction operation, obtains from those features the probability that the anchor box contains an object, and selects target candidate boxes from all anchor boxes according to the probability values;

an object detection module, which uses a target recognition network to classify the objects in the target candidate boxes and regress the candidate boxes' precise locations and sizes, determines the bounding box of each object from the precise location and size, and outputs the classification results and the bounding boxes as the detection results.

In the above object detection system, the extraction module specifically extracts features in parallel with convolution operations using k different convolution kernels.

In the above object detection system, the first adjustment module adjusts the feature vector at each spatial position of the convolutional feature maps specifically by:

ω_ij = F(d_ij)

o_ij = ω_ij ⊙ d_ij

where d_ij is the feature vector at spatial position (i, j) of a convolutional feature map, the nonlinear function F consists of three cascaded fully connected layers, ω_ij is the first adjustment coefficient, o_ij is the feature vector at spatial position (i, j) of the first feature map, and ⊙ denotes the element-wise product.

In the above object detection system, the adjustment of each channel of the concatenated feature map by the second adjustment module specifically includes:

obtaining the feature descriptor a of the channels with global average pooling:

a = global_pooling(U)

where U is the concatenated feature map and global_pooling denotes global average pooling;

using three cascaded fully connected layers as the nonlinear function F to obtain the adjustment coefficient e of each channel:

e = F(a)

U′ = e ⊙ U

where ⊙ denotes the element-wise product and U′ is the second feature map.

In the above object detection system, the candidate box selection module sorts the anchor boxes according to the probability values, filters out duplicate anchor boxes with non-maximum suppression, and selects the N target candidate boxes with the highest probabilities, N being a preset positive integer.
Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The target candidate generation method and apparatus provided by the present invention do not depend on a specific backbone network. Any existing neural network with its last fully connected layers removed can serve as the backbone; the method and apparatus can be directly connected to the last convolutional layer of the backbone;

2. The target candidate boxes (proposals) generated by the provided method and apparatus have higher quality, i.e., the proposals rarely contain background information and can locate objects accurately;

3. The provided method and apparatus have a faster processing speed than the prior art;

4. In a two-stage detection pipeline, using the generated target candidate boxes (proposals) yields higher detection accuracy.
Brief description of the drawings

Fig. 1 is a flowchart of an enhanced target candidate generation method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of extracting features with convolution kernels of multiple scales according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of learning an adjustment coefficient for each spatial position of the feature map produced by the convolution kernel of each scale according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of learning an adjustment coefficient for each channel of the feature map, used to adaptively adjust the information of each channel, according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of extracting the features inside the anchor boxes at each spatial position according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of an enhanced object candidate region generating apparatus according to an embodiment of the present invention;

Fig. 7 is a flowchart of an object detection method based on the enhanced target candidate generation method provided by the present invention, according to an embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical solutions, design approach, and advantages of the present invention clearer, the present invention is described in more detail below through specific embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein only explain the present invention and are not intended to limit it.
Embodiment 1
Fig. 1 shows a target candidate box generation method provided by the present invention, whose steps are:

S11: extracting features with convolution kernels of multiple scales.

In a preferred embodiment, as shown in Fig. 2, convolution kernels of k = 3 scales, 1 × 1, 3 × 3, and 5 × 5, are used to extract features in a multi-scale layer. In a concrete implementation, to reduce the number of parameters and increase nonlinear expressiveness, a shared 1 × 1 convolutional layer is added before the 3 × 3 and 5 × 5 convolutions for dimensionality reduction, and the 5 × 5 convolutional layer is further split into two cascaded 3 × 3 convolutional layers. Fig. 2 also gives the number of output channels of each convolution operation in a preferred embodiment.
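The multi-scale branch layout of S11 can be sketched for a single-channel map. The identity-style kernels in the test are purely illustrative; the point is that two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as one 5 × 5 kernel with fewer parameters (18 vs. 25 weights per channel):

```python
import numpy as np

def conv2d(x, k):
    """'Same'-padded 2-D convolution of a single-channel map x with kernel k."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    H, W = x.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def conv_pyramid(x, k1, k3a, k3b):
    """Parallel multi-scale branches (S11 sketch): 1x1, 3x3, and a 5x5
    receptive field realized as two cascaded 3x3 convolutions."""
    b1 = conv2d(x, k1)                 # 1x1 branch
    b3 = conv2d(x, k3a)                # 3x3 branch
    b5 = conv2d(conv2d(x, k3a), k3b)   # two 3x3 = 5x5 receptive field
    return b1, b3, b5
```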
S12: for the feature map produced by the convolution kernel of each scale, learning an adjustment coefficient for each spatial position, used to adaptively enhance the useful feature information at each spatial position while suppressing the useless information.

A specific implementation of this step is, as shown in Fig. 3: for a feature map M of height H, width W, and channel number C obtained by convolving the input with a kernel of a given scale, the feature vector at spatial position (i, j) is d_ij (height × width × channels = 1 × 1 × C), i.e., the values of all channels at position (i, j) of the H × W feature map taken together. The nonlinear function F, composed of three cascaded fully connected layers, is applied to d_ij to obtain the adjustment coefficient ω_ij of that spatial position:

ω_ij = F(d_ij)

where F denotes the nonlinear function constituted by the three fully connected layers.

The adjusted feature vector at that position is then

o_ij = ω_ij ⊙ d_ij

where ⊙ denotes the element-wise product and o_ij is the feature vector at spatial position (i, j) of the adjusted feature map M_out.

In an actual implementation, to reduce the number of parameters and the complexity of the network, the parameters of the three fully connected layers are shared across spatial positions; the fully connected layers can then be replaced by convolutional layers with 1 × 1 kernels.
S13: concatenating the adjusted convolutional features of all scales;

S14: for the concatenated feature map, learning an adjustment coefficient for each channel according to the feature distribution of that channel, used to adaptively adjust the information of each channel. Note that channel information differs from a feature vector: the learned adjustment coefficient here scales the features expressed by a channel, i.e., the H × W values of that channel, whereas a feature vector is a single column of data.

A specific implementation of this step, as shown in Fig. 4: for the input feature map U, first obtain the feature descriptor of the channels with global average pooling:

a = global_pooling(U)

where global_pooling denotes global average pooling.

Then use three cascaded fully connected layers as the nonlinear function F to obtain the adjustment coefficient e of each channel:

e = F(a)

where F denotes the nonlinear function constituted by the three fully connected layers.

The feature map after adjusting each channel is therefore:

U′ = e ⊙ U

where ⊙ denotes the element-wise product.
S15: setting anchor boxes of different scales and aspect ratios at each spatial position of the feature map whose channel information has been adjusted. The coordinates and sizes of the anchor boxes are given in the coordinate system of the original input image: each position of the feature map serves as the center of anchor boxes, and its coordinate on the input image is obtained by multiplying the position by the down-sampling stride. This is done mainly because the annotated boxes exist on the original image, so projecting the anchor boxes onto the input image makes it convenient to compute the training targets.
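The mapping from feature-map cells to image-coordinate anchor boxes in S15 can be sketched as below. Taking the cell center as index × stride and the ratio convention r = height / width are illustrative assumptions; the patent only states that the center coordinate is obtained by multiplying by the down-sampling stride:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Anchor boxes of every scale/ratio centred at each feature-map cell,
    expressed in input-image coordinates (x1, y1, x2, y2).
    Centre of cell (i, j) = (j * stride, i * stride)."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cy, cx = i * stride, j * stride
            for s in scales:
                for r in ratios:               # r = height / width (assumed)
                    h = s * np.sqrt(r)         # preserves area s^2 per scale
                    w = s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)
```

With 3 scales and 3 ratios this yields the 9 anchors per position used in the preferred embodiment.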
S16: projecting each anchor box onto the feature map whose channel information has been adjusted, and extracting the features inside each projected anchor box with a region feature extraction operation.

Fig. 5 illustrates the concrete operation of this step. In a preferred embodiment, anchor boxes of 3 scales and 3 aspect ratios are set in advance at each spatial position, and the anchor boxes are then projected onto the output feature map U′ of the channel adjustment module. A region feature extraction method φ extracts the features contained in each projected anchor box; a simple and effective choice is RoI pooling (candidate region pooling). In this embodiment, to reduce the number of parameters, the anchor boxes are grouped by aspect ratio, and boxes of the same aspect ratio extract features of the same size. For example, with 3 scales of 128², 256², and 512² pixels and aspect ratios of 1:2, 1:1, and 2:1, nine kinds of anchor boxes in total, features of sizes 5 × 11, 7 × 7, and 11 × 5 can be extracted for the anchor boxes at each spatial position. After these features are obtained, two fully connected layers can process them further, and the processed feature maps are then concatenated into one feature map. In an actual implementation, the parameters of the fully connected layers for positions with the same aspect ratio can be shared, so the fully connected layers can be converted into convolution operations.
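RoI pooling, named in S16 as the simple and effective choice for φ, divides each projected box into a fixed grid and max-pools each cell. A single-channel sketch follows; the bin-partition arithmetic is one common convention, not the patent's specification:

```python
import numpy as np

def roi_pool(feat, box, out_h, out_w):
    """Max-pool the region of `feat` covered by `box` (x1, y1, x2, y2, in
    feature-map coordinates) into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    region = feat[y1:y2, x1:x2]
    H, W = region.shape
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # integer bin edges; each bin covers at least one cell
            r0, r1 = i * H // out_h, max((i + 1) * H // out_h, i * H // out_h + 1)
            c0, c1 = j * W // out_w, max((j + 1) * W // out_w, j * W // out_w + 1)
            out[i, j] = region[r0:r1, c0:c1].max()
    return out
```

Grouping anchors by aspect ratio then simply means calling this with (out_h, out_w) = (5, 11), (7, 7), or (11, 5) depending on the group.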
S17: after extracting the features inside each anchor box, connecting two parallel network layers, used respectively to regress the position of the candidate box and to judge whether the candidate box contains an object.

Among the target candidate boxes generated by the proposed target candidate box generation method, the boxes can be sorted by the output probability of containing an object; after duplicate target candidate boxes are filtered out with non-maximum suppression, the N target candidate boxes (proposals) with the highest probabilities are selected.
Embodiment 2
The embodiment of the present invention also provides an enhanced object candidate region generating apparatus. As shown in Fig. 6, the apparatus comprises a convolution pyramid module 21, a spatial adjustment module 22, a spatial-adjustment feature concatenation module 23, a channel adjustment module 24, a feature adaptation module 25, and a classification and regression module 26.

The convolution pyramid module 21 extracts features with convolution kernels of multiple scales. The spatial adjustment module 22 learns, for the feature map produced by the convolution kernel of each scale, an adjustment coefficient for each spatial position, used to adaptively enhance the useful feature information at each spatial position while suppressing the useless information. The spatial-adjustment feature concatenation module 23 concatenates the adjusted convolutional features produced by the kernels of all scales. The channel adjustment module 24 learns, for the concatenated feature map, an adjustment coefficient for each channel according to that channel's feature distribution, used to adaptively adjust the information of each channel. The feature adaptation module 25 sets anchor boxes of different scales for each spatial position, projects each anchor box onto the feature map whose channel information has been adjusted, and extracts the features inside each projected anchor box with a region feature extraction operation. The classification and regression module 26 connects, after the extracted features of each anchor box, two parallel network layers used respectively to regress the position of the candidate box and to judge whether the candidate box contains an object.

In the object candidate region generating apparatus provided by this embodiment of the present invention, the working process of each module shares the technical features of the aforementioned target candidate region generation method and can therefore likewise realize the foregoing functions; details are not repeated here.
Embodiment 3
The embodiment of the present invention provides an object detection method based on the target candidate boxes generated in Embodiment 1, comprising the following steps:

S31: obtaining the image to be detected;

S32: inputting the image into an object detection network, where the object detection network comprises the enhanced target candidate box generation network described in Embodiment 1 and a target recognition network;

S321: the target candidate box generation network generates target candidate boxes (proposals) that may contain objects;

S322: the target recognition network classifies the proposals that may contain objects to obtain the specific category of the object in each proposal;

S323: the target recognition network performs regression on the proposals that may contain objects to obtain the estimated bounding box size of the object in each proposal.
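The S31–S323 flow can be summarized as a thin orchestration sketch; `backbone`, `proposal_net`, and `recognition_net` are assumed callables standing in for the candidate box generation network of Embodiment 1 and the target recognition network, not an API defined by the patent:

```python
def detect(image, backbone, proposal_net, recognition_net, n_proposals=300):
    """Two-stage detection (sketch): generate proposals that may contain
    objects (S321), then classify each proposal (S322) and regress its
    bounding box (S323)."""
    feats = backbone(image)                        # shared convolutional features
    proposals = proposal_net(feats)[:n_proposals]  # S321: candidate boxes
    results = []
    for box in proposals:
        label, refined_box = recognition_net(feats, box)  # S322 + S323
        results.append((label, refined_box))
    return results
```

Note that both stages consume the same backbone features, which is what lets the candidate generation network attach directly to the backbone's last convolutional layer, as stated in the advantages above.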
A system embodiment corresponding to the above method embodiment is also disclosed; the two embodiments can be implemented in cooperation with each other. The relevant technical details mentioned in the above embodiments remain valid for the system embodiment and, to reduce repetition, are not repeated here; correspondingly, the relevant technical details of the system embodiment also apply to the above embodiments.
The invention further discloses an object detection system based on a convolutional neural network, comprising:
an extraction module, which extracts convolution feature maps of an image to be detected using convolution kernels of multiple scales;
a first adjustment module, which adjusts the feature vector at each spatial position of the convolution feature map using fully connected layers to obtain a first feature map;
a second adjustment module, which concatenates the first feature maps to obtain a concatenated feature map, and adjusts the feature information of each channel of the concatenated feature map using the fully connected layers to obtain a second feature map;
an anchor box setting module, which sets anchor boxes of different scales and aspect ratios at each spatial position of the second feature map, the coordinates and sizes of the anchor boxes being given in the coordinate system of the image to be detected;
a candidate box selection module, which projects each anchor box onto the second feature map, extracts the features inside each projected anchor box using a region feature extraction operation, obtains from those features the probability that the anchor box contains an object, and selects target candidate boxes from all anchor boxes according to the probability values;
a target detection module, which uses a target recognition network to classify the objects in the target candidate boxes and to regress the precise positions and sizes of the candidate boxes, determines the bounding box of each object from the precise position and size, and outputs the classification results and the bounding boxes as the object detection results.
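To illustrate the anchor box setting described above, the following is a minimal plain-Python sketch that generates anchor boxes of different scales and aspect ratios at every spatial position of a feature map, with coordinates mapped back to the coordinate system of the input image. The concrete scale and ratio values and the stride are illustrative assumptions; the text does not fix them.

```python
import math

def generate_anchors(feat_h, feat_w, stride, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchor boxes for every spatial position of a
    feat_h x feat_w feature map.  `stride` maps feature-map cells back to
    image coordinates; the scales/ratios are illustrative placeholders."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # centre of cell (i, j) expressed in image coordinates
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * math.sqrt(r)   # width grows with the aspect ratio
                    h = s / math.sqrt(r)   # height shrinks correspondingly
                    anchors.append((cx, cy, w, h))
    return anchors

boxes = generate_anchors(2, 3, stride=16)
print(len(boxes))  # 2*3 positions x 3 scales x 3 ratios = 54
```

Every anchor's area stays close to scale squared while the aspect ratio varies, which is the usual convention for anchor boxes of "different scales and aspect ratios".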
In the object detection system based on a convolutional neural network, the extraction module specifically extracts features in parallel using convolution operations with k different convolution kernels.
In the object detection system based on a convolutional neural network, the first adjustment module adjusts the feature vector at each spatial position of the convolution feature map by the following formulas:
ω_ij = F(d_ij)
o_ij = ω_ij ⊙ d_ij
where d_ij is the feature vector at spatial position (i, j) of the convolution feature map, the nonlinear function F is composed of the three cascaded fully connected layers, ω_ij is the first adjustment coefficient, o_ij is the feature vector at spatial position (i, j) of the first feature map, and ⊙ denotes the element-wise (dot) product.
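The per-position adjustment above can be sketched in plain Python. The three cascaded fully connected layers F are stood in for by a toy sigmoid nonlinearity, since their weights are learned and not specified in the text; everything else follows ω_ij = F(d_ij), o_ij = ω_ij ⊙ d_ij directly.

```python
import math

def F(vec):
    """Toy stand-in for the three cascaded fully connected layers: any nonlinear
    map from a feature vector to an adjustment vector of the same length
    (here, a per-component sigmoid)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in vec]

def adjust_spatial(feature_map):
    """feature_map[i][j] is the feature vector d_ij; returns the first feature
    map whose entries are o_ij = w_ij (element-wise *) d_ij with w_ij = F(d_ij)."""
    out = []
    for row in feature_map:
        out_row = []
        for d in row:
            w = F(d)                                           # w_ij = F(d_ij)
            out_row.append([wi * di for wi, di in zip(w, d)])  # o_ij = w_ij ⊙ d_ij
        out.append(out_row)
    return out

fmap = [[[0.0, 2.0]]]          # 1x1 spatial grid with a 2-dimensional feature vector
first = adjust_spatial(fmap)
print(first[0][0][0])  # sigmoid(0) * 0.0 = 0.0
```

The adjustment is a gating operation: each component of d_ij is rescaled by a data-dependent coefficient, so positions whose features F judges uninformative are suppressed.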
In the object detection system based on a convolutional neural network, the second adjustment module adjusts the feature information of each channel of the concatenated feature map specifically as follows:
the feature descriptor a of each channel is obtained using global average pooling:
a = global_pooling(U)
where U is the concatenated feature map and global_pooling denotes global average pooling;
three cascaded fully connected layers are used as the nonlinear function F to obtain the adjustment coefficient e of each channel:
e = F(a), U′ = e ⊙ U
where ⊙ denotes the element-wise (dot) product and U′ is the second feature map.
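A minimal sketch of this channel-wise adjustment follows, again with a toy stand-in for the three cascaded fully connected layers F (their weights are learned, not specified here). The pooling and rescaling steps mirror a = global_pooling(U), e = F(a), U′ = e ⊙ U.

```python
def global_pooling(U):
    """Global average pooling: one scalar descriptor per channel.
    U[c] is an HxW channel stored as a list of rows."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in U]

def F(a):
    """Toy stand-in for three cascaded fully connected layers: maps the channel
    descriptors to per-channel adjustment coefficients in (0, 1)."""
    return [x / (1.0 + abs(x)) for x in a]

def adjust_channels(U):
    a = global_pooling(U)          # a = global_pooling(U)
    e = F(a)                       # e = F(a)
    # U' = e ⊙ U: scale every value of channel c by its coefficient e[c]
    return [[[e[c] * v for v in row] for row in U[c]] for c in range(len(U))]

U = [[[1.0, 1.0], [1.0, 1.0]],    # channel 0: constant 1 -> a[0] = 1, e[0] = 0.5
     [[2.0, 2.0], [2.0, 2.0]]]    # channel 1: constant 2 -> a[1] = 2, e[1] = 2/3
U2 = adjust_channels(U)
print(U2[0][0][0])  # 0.5 * 1.0 = 0.5
```

The effect is a channel-attention reweighting: each channel of the concatenated feature map is scaled as a whole, so channels carrying useful multi-scale information can be emphasized relative to the others.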
In the object detection system based on a convolutional neural network, the candidate box selection module: sorts the anchor boxes according to the probability values, filters out duplicate anchor boxes using non-maximum suppression, and then selects the N target candidate boxes with the highest probabilities, where N is a preset positive integer.
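The selection step above can be sketched as follows in plain Python. The IoU threshold of 0.5 used to decide when two anchor boxes are "duplicates" is an illustrative choice not fixed by the text.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def select_candidates(boxes, probs, N, iou_thresh=0.5):
    """Sort anchor boxes by objectness probability, suppress duplicates with
    non-maximum suppression, then keep the N highest-probability survivors."""
    order = sorted(range(len(boxes)), key=lambda i: probs[i], reverse=True)
    keep = []
    for i in order:
        # keep box i only if it does not heavily overlap an already-kept box
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
        if len(keep) == N:
            break
    return [boxes[i] for i in keep]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
probs = [0.9, 0.8, 0.7]
cands = select_candidates(boxes, probs, N=2)
print(cands)  # the second box overlaps the first heavily and is suppressed
```

Because suppression happens inside the ranked sweep, the N surviving boxes are both high-probability and mutually non-overlapping, which is exactly what the downstream target recognition network expects as candidates.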
Claims (10)
1. An object detection method based on a convolutional neural network, characterized by comprising:
step 1, extracting convolution feature maps of an image to be detected using convolution kernels of multiple scales;
step 2, adjusting the feature vector at each spatial position of the convolution feature map using fully connected layers to obtain a first feature map;
step 3, concatenating the first feature maps to obtain a concatenated feature map, and adjusting the feature information of each channel of the concatenated feature map using the fully connected layers to obtain a second feature map;
step 4, setting anchor boxes of different scales and aspect ratios at each spatial position of the second feature map;
step 5, projecting each anchor box onto the second feature map, extracting the features inside each projected anchor box using a region feature extraction operation, obtaining from those features the probability that the anchor box contains an object, and selecting target candidate boxes from all anchor boxes according to the probability values;
step 6, using a target recognition network to classify the objects in the target candidate boxes and to regress the precise positions and sizes of the candidate boxes, determining the bounding box of each object from the precise position and size, and outputting the classification results and the bounding boxes as the object detection results.
2. The object detection method based on a convolutional neural network according to claim 1, characterized in that step 1 specifically extracts features in parallel using convolution operations with k different convolution kernels.
3. The object detection method based on a convolutional neural network according to claim 1, characterized in that step 2 adjusts the feature vector at each spatial position of the convolution feature map by the following formulas:
ω_ij = F(d_ij)
o_ij = ω_ij ⊙ d_ij
where d_ij is the feature vector at spatial position (i, j) of the convolution feature map, the nonlinear function F is composed of the three cascaded fully connected layers, ω_ij is the first adjustment coefficient, o_ij is the feature vector at spatial position (i, j) of the first feature map, and ⊙ denotes the element-wise (dot) product.
4. The object detection method based on a convolutional neural network according to claim 1 or 3, characterized in that the adjustment in step 3 of the feature information of each channel of the concatenated feature map specifically comprises:
obtaining the feature descriptor a of each channel using global average pooling:
a = global_pooling(U)
where U is the concatenated feature map and global_pooling denotes global average pooling;
using three cascaded fully connected layers as the nonlinear function F to obtain the adjustment coefficient e of each channel:
e = F(a), U′ = e ⊙ U
where ⊙ denotes the element-wise (dot) product and U′ is the second feature map.
5. The object detection method based on a convolutional neural network according to claim 4, characterized in that step 5 comprises: sorting the anchor boxes according to the probability values, filtering out duplicate anchor boxes using non-maximum suppression, and then selecting the N target candidate boxes with the highest probabilities, where N is a preset positive integer.
6. An object detection system based on a convolutional neural network, characterized by comprising:
an extraction module, which extracts convolution feature maps of an image to be detected using convolution kernels of multiple scales;
a first adjustment module, which adjusts the feature vector at each spatial position of the convolution feature map using fully connected layers to obtain a first feature map;
a second adjustment module, which concatenates the first feature maps to obtain a concatenated feature map, and adjusts the feature information of each channel of the concatenated feature map using the fully connected layers to obtain a second feature map;
an anchor box setting module, which sets anchor boxes of different scales and aspect ratios at each spatial position of the second feature map;
a candidate box selection module, which projects each anchor box onto the second feature map, extracts the features inside each projected anchor box using a region feature extraction operation, obtains from those features the probability that the anchor box contains an object, and selects target candidate boxes from all anchor boxes according to the probability values;
a target detection module, which uses a target recognition network to classify the objects in the target candidate boxes and to regress the precise positions and sizes of the candidate boxes, determines the bounding box of each object from the precise position and size, and outputs the classification results and the bounding boxes as the object detection results.
7. The object detection system based on a convolutional neural network according to claim 6, characterized in that the extraction module specifically extracts features in parallel using convolution operations with k different convolution kernels.
8. The object detection system based on a convolutional neural network according to claim 6, characterized in that the first adjustment module adjusts the feature vector at each spatial position of the convolution feature map by the following formulas:
ω_ij = F(d_ij)
o_ij = ω_ij ⊙ d_ij
where d_ij is the feature vector at spatial position (i, j) of the convolution feature map, the nonlinear function F is composed of the three cascaded fully connected layers, ω_ij is the first adjustment coefficient, o_ij is the feature vector at spatial position (i, j) of the first feature map, and ⊙ denotes the element-wise (dot) product.
9. The object detection system based on a convolutional neural network according to claim 6 or 8, characterized in that the adjustment by the second adjustment module of the feature information of each channel of the concatenated feature map specifically comprises:
obtaining the feature descriptor a of each channel using global average pooling:
a = global_pooling(U)
where U is the concatenated feature map and global_pooling denotes global average pooling;
using three cascaded fully connected layers as the nonlinear function F to obtain the adjustment coefficient e of each channel:
e = F(a), U′ = e ⊙ U
where ⊙ denotes the element-wise (dot) product and U′ is the second feature map.
10. The object detection system based on a convolutional neural network according to claim 9, characterized in that the candidate box selection module comprises: sorting the anchor boxes according to the probability values, filtering out duplicate anchor boxes using non-maximum suppression, and then selecting the N target candidate boxes with the highest probabilities, where N is a preset positive integer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811347546.9A CN109583483B (en) | 2018-11-13 | 2018-11-13 | Target detection method and system based on convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811347546.9A CN109583483B (en) | 2018-11-13 | 2018-11-13 | Target detection method and system based on convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109583483A true CN109583483A (en) | 2019-04-05 |
CN109583483B CN109583483B (en) | 2020-12-11 |
Family
ID=65922216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811347546.9A Active CN109583483B (en) | 2018-11-13 | 2018-11-13 | Target detection method and system based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109583483B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110070122A (en) * | 2019-04-15 | 2019-07-30 | 沈阳理工大学 | A kind of convolutional neural networks blurred picture classification method based on image enhancement |
CN110276345A (en) * | 2019-06-05 | 2019-09-24 | 北京字节跳动网络技术有限公司 | Convolutional neural networks model training method, device and computer readable storage medium |
CN110427940A (en) * | 2019-08-05 | 2019-11-08 | 山东浪潮人工智能研究院有限公司 | A method of pre-selection frame is generated for object detection model |
CN111382695A (en) * | 2020-03-06 | 2020-07-07 | 北京百度网讯科技有限公司 | Method and apparatus for detecting boundary points of object |
CN111401215A (en) * | 2020-03-12 | 2020-07-10 | 杭州涂鸦信息技术有限公司 | Method and system for detecting multi-class targets |
CN111461145A (en) * | 2020-03-31 | 2020-07-28 | 中国科学院计算技术研究所 | Method for detecting target based on convolutional neural network |
CN111563441A (en) * | 2020-04-29 | 2020-08-21 | 上海富瀚微电子股份有限公司 | Anchor point generation matching method for target detection |
CN111709377A (en) * | 2020-06-18 | 2020-09-25 | 苏州科达科技股份有限公司 | Feature extraction method, target re-identification method and device and electronic equipment |
CN111723632A (en) * | 2019-11-08 | 2020-09-29 | 珠海达伽马科技有限公司 | Ship tracking method and system based on twin network |
CN111832328A (en) * | 2019-04-15 | 2020-10-27 | 北京京东尚科信息技术有限公司 | Bar code detection method, bar code detection device, electronic equipment and medium |
CN111931877A (en) * | 2020-10-12 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Target detection method, device, equipment and storage medium |
CN111951268A (en) * | 2020-08-11 | 2020-11-17 | 长沙大端信息科技有限公司 | Parallel segmentation method and device for brain ultrasonic images |
CN112926595A (en) * | 2021-02-04 | 2021-06-08 | 深圳市豪恩汽车电子装备股份有限公司 | Training device for deep learning neural network model, target detection system and method |
CN113780355A (en) * | 2021-08-12 | 2021-12-10 | 上海理工大学 | Deep convolutional neural network learning method for deep sea submersible propeller fault identification |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9646243B1 (en) * | 2016-09-12 | 2017-05-09 | International Business Machines Corporation | Convolutional neural networks using resistive processing unit array |
CN107316058A (en) * | 2017-06-15 | 2017-11-03 | 国家新闻出版广电总局广播科学研究院 | Improve the method for target detection performance by improving target classification and positional accuracy |
CN107680678A (en) * | 2017-10-18 | 2018-02-09 | 北京航空航天大学 | Based on multiple dimensioned convolutional neural networks Thyroid ultrasound image tubercle auto-check system |
CN108564097A (en) * | 2017-12-05 | 2018-09-21 | 华南理工大学 | A kind of multiscale target detection method based on depth convolutional neural networks |
CN108647585A (en) * | 2018-04-20 | 2018-10-12 | 浙江工商大学 | A kind of traffic mark symbol detection method based on multiple dimensioned cycle attention network |
CN108765387A (en) * | 2018-05-17 | 2018-11-06 | 杭州电子科技大学 | Based on Faster RCNN mammary gland DBT image lump automatic testing methods |
- 2018-11-13: CN CN201811347546.9A patent/CN109583483B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9646243B1 (en) * | 2016-09-12 | 2017-05-09 | International Business Machines Corporation | Convolutional neural networks using resistive processing unit array |
US20180075338A1 (en) * | 2016-09-12 | 2018-03-15 | International Business Machines Corporation | Convolutional neural networks using resistive processing unit array |
CN107316058A (en) * | 2017-06-15 | 2017-11-03 | 国家新闻出版广电总局广播科学研究院 | Improve the method for target detection performance by improving target classification and positional accuracy |
CN107680678A (en) * | 2017-10-18 | 2018-02-09 | 北京航空航天大学 | Based on multiple dimensioned convolutional neural networks Thyroid ultrasound image tubercle auto-check system |
CN108564097A (en) * | 2017-12-05 | 2018-09-21 | 华南理工大学 | A kind of multiscale target detection method based on depth convolutional neural networks |
CN108647585A (en) * | 2018-04-20 | 2018-10-12 | 浙江工商大学 | A kind of traffic mark symbol detection method based on multiple dimensioned cycle attention network |
CN108765387A (en) * | 2018-05-17 | 2018-11-06 | 杭州电子科技大学 | Based on Faster RCNN mammary gland DBT image lump automatic testing methods |
Non-Patent Citations (1)
Title |
---|
SHAOQING REN ET AL: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 《ARXIV:1506.01497V3 [CS.CV]》 * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111832328A (en) * | 2019-04-15 | 2020-10-27 | 北京京东尚科信息技术有限公司 | Bar code detection method, bar code detection device, electronic equipment and medium |
CN111832328B (en) * | 2019-04-15 | 2024-07-16 | 北京京东乾石科技有限公司 | Bar code detection method, device, electronic equipment and medium |
CN110070122A (en) * | 2019-04-15 | 2019-07-30 | 沈阳理工大学 | A kind of convolutional neural networks blurred picture classification method based on image enhancement |
CN110070122B (en) * | 2019-04-15 | 2022-05-06 | 沈阳理工大学 | Convolutional neural network fuzzy image classification method based on image enhancement |
CN110276345A (en) * | 2019-06-05 | 2019-09-24 | 北京字节跳动网络技术有限公司 | Convolutional neural networks model training method, device and computer readable storage medium |
CN110276345B (en) * | 2019-06-05 | 2021-09-17 | 北京字节跳动网络技术有限公司 | Convolutional neural network model training method and device and computer readable storage medium |
CN110427940A (en) * | 2019-08-05 | 2019-11-08 | 山东浪潮人工智能研究院有限公司 | A method of pre-selection frame is generated for object detection model |
CN111723632A (en) * | 2019-11-08 | 2020-09-29 | 珠海达伽马科技有限公司 | Ship tracking method and system based on twin network |
CN111723632B (en) * | 2019-11-08 | 2023-09-15 | 珠海达伽马科技有限公司 | Ship tracking method and system based on twin network |
CN111382695A (en) * | 2020-03-06 | 2020-07-07 | 北京百度网讯科技有限公司 | Method and apparatus for detecting boundary points of object |
CN111401215A (en) * | 2020-03-12 | 2020-07-10 | 杭州涂鸦信息技术有限公司 | Method and system for detecting multi-class targets |
CN111401215B (en) * | 2020-03-12 | 2023-10-31 | 杭州涂鸦信息技术有限公司 | Multi-class target detection method and system |
CN111461145A (en) * | 2020-03-31 | 2020-07-28 | 中国科学院计算技术研究所 | Method for detecting target based on convolutional neural network |
CN111563441B (en) * | 2020-04-29 | 2023-03-24 | 上海富瀚微电子股份有限公司 | Anchor point generation matching method for target detection |
CN111563441A (en) * | 2020-04-29 | 2020-08-21 | 上海富瀚微电子股份有限公司 | Anchor point generation matching method for target detection |
CN111709377A (en) * | 2020-06-18 | 2020-09-25 | 苏州科达科技股份有限公司 | Feature extraction method, target re-identification method and device and electronic equipment |
CN111951268A (en) * | 2020-08-11 | 2020-11-17 | 长沙大端信息科技有限公司 | Parallel segmentation method and device for brain ultrasonic images |
CN111951268B (en) * | 2020-08-11 | 2024-06-07 | 深圳蓝湘智影科技有限公司 | Brain ultrasound image parallel segmentation method and device |
CN111931877B (en) * | 2020-10-12 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Target detection method, device, equipment and storage medium |
CN111931877A (en) * | 2020-10-12 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Target detection method, device, equipment and storage medium |
CN112926595A (en) * | 2021-02-04 | 2021-06-08 | 深圳市豪恩汽车电子装备股份有限公司 | Training device for deep learning neural network model, target detection system and method |
CN112926595B (en) * | 2021-02-04 | 2022-12-02 | 深圳市豪恩汽车电子装备股份有限公司 | Training device of deep learning neural network model, target detection system and method |
CN113780355A (en) * | 2021-08-12 | 2021-12-10 | 上海理工大学 | Deep convolutional neural network learning method for deep sea submersible propeller fault identification |
CN113780355B (en) * | 2021-08-12 | 2024-02-09 | 上海理工大学 | Deep convolution neural network learning method for fault identification of deep sea submersible propeller |
Also Published As
Publication number | Publication date |
---|---|
CN109583483B (en) | 2020-12-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109583483A (en) | A kind of object detection method and system based on convolutional neural networks | |
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
CN109859190B (en) | Target area detection method based on deep learning | |
CN113160062B (en) | Infrared image target detection method, device, equipment and storage medium | |
CN114202672A (en) | Small target detection method based on attention mechanism | |
CN111160249A (en) | Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion | |
CN108960404B (en) | Image-based crowd counting method and device | |
CN109492596B (en) | Pedestrian detection method and system based on K-means clustering and regional recommendation network | |
CN111652273B (en) | Deep learning-based RGB-D image classification method | |
CN112861970B (en) | Fine-grained image classification method based on feature fusion | |
CN114758288A (en) | Power distribution network engineering safety control detection method and device | |
CN111768415A (en) | Image instance segmentation method without quantization pooling | |
CN111126459A (en) | Method and device for identifying fine granularity of vehicle | |
CN111461213A (en) | Training method of target detection model and target rapid detection method | |
CN107944403A (en) | Pedestrian's attribute detection method and device in a kind of image | |
CN107025444A (en) | Piecemeal collaboration represents that embedded nuclear sparse expression blocks face identification method and device | |
CN113496480A (en) | Method for detecting weld image defects | |
CN110751195A (en) | Fine-grained image classification method based on improved YOLOv3 | |
CN115631344A (en) | Target detection method based on feature adaptive aggregation | |
CN114332921A (en) | Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network | |
CN116645592A (en) | Crack detection method based on image processing and storage medium | |
CN117333845A (en) | Real-time detection method for small target traffic sign based on improved YOLOv5s | |
CN113486879B (en) | Image area suggestion frame detection method, device, equipment and storage medium | |
CN114926498A (en) | Rapid target tracking method based on space-time constraint and learnable feature matching | |
CN111104539A (en) | Fine-grained vehicle image retrieval method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||