CN107423760A - Object detection method based on pre-segmentation and regression-based deep learning - Google Patents
- Publication number: CN107423760A
- Application number: CN201710598875.XA
- Authority
- CN
- China
- Prior art keywords
- frame
- default box
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2163—Partitioning the feature space
- G06N3/04—Neural networks; Architecture, e.g. interconnection topology
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/267—Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Abstract
The invention discloses an object detection method based on pre-segmentation and regression-based deep learning, mainly addressing the low precision on small targets and the long detection times of existing object detection methods. It is implemented as follows: 1) extract the region of interest of the image under test with a quadtree partitioning algorithm; 2) extract features from the region of interest with base convolutional layers and auxiliary convolutional layers, obtaining feature maps at multiple scales; 3) compute the positions of the default boxes on the multi-scale feature maps and detect on those feature maps with convolutional filters, obtaining multiple predicted boxes and multiple class scores; 4) apply non-maximum suppression to the predicted boxes and class scores to obtain the box positions and classes of the final targets. The invention detects small targets in images quickly and accurately and can be used for real-time target detection on unmanned aerial vehicles.
Description
Technical field
The invention belongs to the field of image information processing, and specifically relates to a deep-learning object detection method that can be used for accurate real-time localization and classification of targets.
Background art
Target detection is a challenging problem in computer vision. Its core task is to apply target recognition algorithms and search strategies to static images or video in order to obtain the position and class of specific targets. Current detection methods fall broadly into two families: methods based on hand-crafted features and machine learning, and methods based on deep learning. The former realize detection through region selection, feature extraction and classifier-based classification. Region selection traverses the whole image with a sliding window to propose boxes that may contain a target, but its time complexity is too high and it produces many redundant windows, which directly hurts the speed and performance of the subsequent feature extraction and classification. Commonly used features include Haar wavelets, HOG features, SIFT features and composite features; because of variations in illumination, background and target appearance, the robustness requirements on these features are high, and the quality of the extracted features directly determines the precision of target classification. Traditional classifiers mainly include support vector machines and Adaboost. Since such models are built around a specific feature for a specific recognition task, they generalize poorly and have difficulty locating targets accurately in practical applications. Since 2014, deep-learning object detection algorithms have made great breakthroughs and overcome the shortcomings of traditional detection algorithms. The mainstream deep-learning detectors divide into two classes: region-proposal-based detectors and regression-based detectors. The representative region-proposal detector is the R-CNN algorithm proposed by R. Girshick, whose detection framework combines candidate regions with convolutional-neural-network (CNN) classification. Successive refinements of R-CNN produced SPP-NET, Fast R-CNN and Faster R-CNN, greatly improving both detection precision and speed; however, because such methods split detection into separate localization and classification steps and localization is too time-consuming, they cannot detect targets in real time. The representative regression-based detectors are YOLO and SSD, which predict target positions and classes directly from the image under test by regression. This greatly accelerates detection, reaching real-time requirements, but these methods impose strict constraints on the input image size, localize targets poorly, and cannot detect small targets in an image: YOLO and SSD300 require input sizes of 448×448 and 300×300 respectively, so shrinking the image under test to these sizes loses image detail and makes small targets undetectable.
The content of the invention
It is an object of the invention to for above-mentioned existing technical problem, propose a kind of based on pre-segmentation and the depth returned
The object detection method of study, to preserve image detail, improve the Real time detection performance to Small object.
The technical thought of the present invention is to obtain area-of-interest by carrying out quaternary tree pre-segmentation in the input image;Pass through
The Analysis On Multi-scale Features figure of multiple dimensioned convolutional layer extraction area-of-interest;Target classification and prediction frame are predicted by convolution filter
Position;Final target classification and target location coordinate is obtained by non-maxima suppression.
According to this idea, the implementation of the invention includes:
(1) building a deep-learning network model based on pre-segmentation and regression from the quadtree algorithm and the convolutional neural network VGG-16;
(2) training the built network model on an image training set:
(2a) using the training sets of the image sets PASCAL VOC2007 and PASCAL VOC2012 as the training set and the test set of PASCAL VOC2007 as the test set;
(2b) matching the ground-truth boxes of the annotated images in the training set against the default boxes generated on the feature maps of the network model;
(2c) constructing the target loss function L(x, l, c, g) of the network model:

L(x, l, c, g) = (1/N)·(L_conf(x, c) + α·L_loc(x, l, g))

where x denotes the default boxes on the feature maps, l the predicted boxes, g the ground-truth boxes, c the set of class scores of the default boxes over all classes, L_conf(x, c) the softmax classification loss of the default boxes over the class-score set c, L_loc(x, l, g) the localization loss, N the number of default boxes matched to ground-truth boxes, and α a parameter set to 1 by cross-validation;
(2d) minimizing the loss function with gradient descent while adjusting the weight parameters in the network backwards layer by layer, obtaining the trained network model;
(3) inputting the original image under test into the trained network model to obtain the classes and position coordinates of the targets in the image.
The invention has the following advantages:
1) Because the image under test is pre-segmented into a region of interest, the invention avoids the failure to localize small targets that overly large pictures cause;
2) because features are extracted only from the region of interest rather than from the whole image, the amount of computation and the computation time of feature extraction are reduced;
3) because the features of the region of interest are extracted by convolutional layers, they are invariant to translation, rotation and scaling, which avoids the poor robustness of hand-crafted features and suits target detection better;
4) because the invention predicts on the feature maps with convolutional filters, obtaining a series of target class confidence scores together with target position coordinates, computational efficiency improves.
Brief description of the drawings
Fig. 1 is the implementation flowchart of the invention;
Fig. 2 is the network architecture built in the invention;
Fig. 3 shows the images under test used in the experiments of the invention;
Fig. 4 shows the regions of interest extracted in the invention with the quadtree algorithm;
Fig. 5 shows the simulation results of detecting targets in the images under test with the invention.
Detailed description of embodiments
With reference to Fig. 1, the invention is realized through the following steps:
Step 1: build the deep-learning network model based on pre-segmentation and regression.
Current deep-learning detection networks fall into two major classes: region-proposal-based networks such as R-CNN, Fast R-CNN and Faster R-CNN, and regression-based networks such as YOLO and SSD; the invention proposes an object detection method based on pre-segmentation and regression. Current region-of-interest extraction methods include threshold-based extraction, edge-based extraction, quadtree-partition-based extraction and region-growing-based extraction; the invention builds the pre-segmentation network layer with the quadtree-partition method.
With reference to Fig. 2, this step is implemented as follows:
(1a) build the region-of-interest pre-segmentation network layer with the quadtree algorithm:
(1a1) set the segmentation threshold of the quadtree algorithm to M (0 < M < 255) and the maximum number of splits to Q = 1024, and divide the image under test into four sub-regions along the horizontal and vertical directions;
(1a2) compute the average gray value of each sub-region after splitting; continue splitting into four sub-regions any sub-region whose average gray value exceeds M, stop splitting when a sub-region's average gray value falls below M or the number of splits reaches Q, and record its position;
(1a3) from the recorded sub-region positions, find the coordinates of the smallest sub-regions at the upper-left and lower-right corners of the image under test; these give the position of the region of interest in the image under test;
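Steps (1a1)-(1a3) can be sketched as follows. This is only a minimal illustration, not the patented implementation: the function names are invented, the image is a NumPy gray-value array, and the reading that sub-regions falling below the threshold M are background (dropped rather than kept) is an assumption.

```python
import numpy as np

def quadtree_regions(img, M=60, Q=1024):
    """Quadtree pre-segmentation (steps 1a1-1a2): split the image into
    quadrants; a block whose mean gray value falls below the threshold M
    stops splitting (assumed background here); blocks still above M when
    the split budget Q or a 2-pixel minimum size is reached are kept."""
    keep, stack, splits = [], [(0, 0, img.shape[0], img.shape[1])], 0
    while stack:
        y0, x0, y1, x1 = stack.pop()
        if img[y0:y1, x0:x1].mean() < M:
            continue  # homogeneous/dark sub-region: not of interest
        if splits >= Q or min(y1 - y0, x1 - x0) < 2:
            keep.append((y0, x0, y1, x1))
            continue
        splits += 1
        ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
        stack += [(y0, x0, ym, xm), (y0, xm, ym, x1),
                  (ym, x0, y1, xm), (ym, xm, y1, x1)]
    return keep

def roi(regions):
    """Step 1a3: the region of interest spans from the upper-left-most
    to the lower-right-most retained sub-region."""
    return (min(r[0] for r in regions), min(r[1] for r in regions),
            max(r[2] for r in regions), max(r[3] for r in regions))
```

On a synthetic image whose bright content sits in the lower-right quadrant, the returned ROI collapses onto that quadrant, which is the behaviour the pre-segmentation layer relies on to discard background before feature extraction.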
(1b) build the object detection network layers from the convolutional neural network VGG-16.
Convolutional neural networks currently used for target recognition include AlexNet, VGG-16, GoogLeNet and ResNet; the invention builds the detection network with VGG-16, as follows:
(1b1) use stages stage1-stage5 of the convolutional neural network VGG-16 as the base convolutional layers of the detection network, replace its fully connected layers fc6 and fc7 with two convolutional layers conv6 and conv7, and add four new convolutional layers conv8, conv9, conv10 and conv11 as the auxiliary convolutional layers of the detection network, the four new layers having sizes 10×10, 5×5, 3×3 and 1×1 respectively;
(1b2) form the detection layers of the detection network from a series of convolutional filters;
(1b3) form the output layer of the detection network from the non-maximum-suppression layer of the region-proposal-and-CNN-based detection network R-CNN.
Step 2: train the built network model on the image training set.
Current methods for training deep-learning networks fall into two classes: bottom-up unsupervised learning and top-down supervised learning; the invention trains with top-down supervised learning, realized as follows:
(2a) select the image training set.
Image sets commonly used for training detection networks include the ImageNet, PASCAL VOC and COCO image sets; the invention uses the training sets of PASCAL VOC2007 and PASCAL VOC2012 as the training set and the test set of PASCAL VOC2007 as the test set;
(2b) match the ground-truth boxes of the annotated images in the training set against the default boxes generated on the feature maps of the network model:
(2b1) compute the sizes and positions of the default boxes on the feature maps:
the default boxes take 5 different aspect ratios, a = {1, 2, 3, 1/2, 1/3}; the width w_k^{aτ} and height h_k^{aτ} of the default box with aspect ratio aτ on the k-th feature map are

w_k^{aτ} = (s_min + ((s_max − s_min)/(m − 1))·(k − 1))·√aτ,  h_k^{aτ} = (s_min + ((s_max − s_min)/(m − 1))·(k − 1))/√aτ,

where aτ is the τ-th aspect ratio, 1 ≤ τ ≤ 5, s_min is the minimum ratio of the default-box side length to the input image, s_max the maximum such ratio, k ∈ [1, E], E is the number of feature maps in the network model, and m denotes that same number of feature maps;
the center coordinates (xcen, ycen) of the default box at cell (u, v) of the k-th feature map are ((u + 0.5)/|f_k|, (v + 0.5)/|f_k|), where |f_k| is the size of the k-th feature map and u, v ∈ [0, |f_k|];
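The default-box geometry of step (2b1) can be sketched as follows. The values s_min = 0.2 and s_max = 0.9 are the ones the SSD paper uses and are only assumptions here, as is taking m to be the number of feature maps; the function names are illustrative.

```python
import math

ASPECT_RATIOS = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)  # a = {1, 2, 3, 1/2, 1/3}

def default_box_shapes(k, m=6, s_min=0.2, s_max=0.9):
    """Width and height (relative to the input image) of the default
    boxes on the k-th feature map (1-based), per the formula of (2b1):
    w = s_k * sqrt(a), h = s_k / sqrt(a), with
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)."""
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    return [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ASPECT_RATIOS]

def default_box_centers(fk):
    """Centers (xcen, ycen) of the default boxes on a feature map of
    size fk x fk: one box per cell, at ((u + 0.5)/fk, (v + 0.5)/fk)."""
    return [((u + 0.5) / fk, (v + 0.5) / fk)
            for u in range(fk) for v in range(fk)]
```

With these assumed parameters, the first feature map (k = 1) gets boxes scaled to 0.2 of the input side length and the last (k = m) to 0.9, so deeper, coarser feature maps are responsible for larger targets.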
(2b2) from the sizes and center coordinates of the default boxes, compute the Jaccard overlap between each ground-truth box and each default box; default boxes whose Jaccard overlap exceeds 0.5 are positive samples Pos, the rest negative samples Neg:
(2b21) compute the coordinates (xleft, yleft) of the upper-left corner and (xrigh, yrigh) of the lower-right corner of default box x:
xleft = xcen − w_k^{aτ}/2, yleft = ycen − h_k^{aτ}/2, xrigh = xcen + w_k^{aτ}/2, yrigh = ycen + h_k^{aτ}/2;
(2b22) compute the coordinates (xmin, ymin) of the upper-left corner and (xmax, ymax) of the lower-right corner of the intersection of the default box with a ground-truth box:
xmin = max(xleft, xgleft),
ymin = max(yleft, ygleft),
xmax = min(xrigh, xgrigh),
ymax = min(yrigh, ygrigh),
where (xgleft, ygleft) and (xgrigh, ygrigh) are the coordinates of the upper-left and lower-right corners of ground-truth box g;
(2b23) compute the area of the intersection of default box x with ground-truth box g:
inter(x, g) = (max(ymax − ymin, 0)) · (max(xmax − xmin, 0));
(2b24) compute the Jaccard overlap between default box x and ground-truth box g:
J(x, g) = inter(x, g) / (area(x) + area(g) − inter(x, g));
default boxes with J(x, g) > 0.5 are positive samples Pos, the rest negative samples Neg, which completes the matching of ground-truth boxes and default boxes.
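Steps (2b21)-(2b24) amount to the standard intersection-over-union computation; a minimal sketch with corner-format boxes (the lower-right corner of the intersection is taken as a minimum, and the function names are illustrative):

```python
def jaccard(a, b):
    """Jaccard overlap of boxes (xleft, yleft, xrigh, yrigh), per
    steps (2b21)-(2b24): intersection area over union area."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])  # min, not max
    inter = max(xmax - xmin, 0) * max(ymax - ymin, 0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match(default_boxes, gt_boxes, thresh=0.5):
    """Step (2b2): default boxes whose Jaccard overlap with any
    ground-truth box exceeds the threshold become positive samples
    Pos; all remaining default boxes are negative samples Neg."""
    pos = [d for d in default_boxes
           if any(jaccard(d, g) > thresh for g in gt_boxes)]
    neg = [d for d in default_boxes if d not in pos]
    return pos, neg
```

For example, two unit-offset 2x2 boxes overlap by 1/(4 + 4 − 1) = 1/7, well under the 0.5 threshold, so such a default box would be a negative sample.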
(2c) construct the target loss function L(x, l, c, g) of the network model:
(2c1) predict on the feature maps with convolutional filters, obtaining for each default box its class-score set c over all classes and the offsets (Δx, Δy, Δw, Δh) of the predicted box relative to the default box, where (Δx, Δy) are the offsets of the predicted box's center coordinates relative to the default box's center coordinates, Δw the offset of the predicted box's width relative to the default box's width, and Δh the offset of the predicted box's height relative to the default box's height;
(2c2) from the class-score set c of the default boxes over all classes, compute the softmax classification loss L_conf(x, c):

L_conf(x, c) = − Σ_{i ∈ Pos} x_ij^p · log(ĉ_i^p) − Σ_{i2 ∈ Neg} log(ĉ_{i2}^0),  with ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p),

where x_ij^p = 1 indicates that the i-th default box matches the j-th ground-truth box of class p and x_ij^p = 0 that it does not, 0 ≤ i ≤ N, N is the number of default boxes matched to ground-truth boxes, 1 ≤ p ≤ H with H the total number of classes, 0 ≤ j ≤ T with T the number of ground-truth boxes, ĉ_i^p is the softmax score over all classes of the i-th default box in the positive samples, ĉ_{i2}^0 the softmax score over all classes of the i2-th default box in the negative samples, 0 ≤ i2 ≤ N2, and N2 is the number of default boxes not matched to any ground-truth box;
(2c3) compute the localization loss L_loc(x, l, g):

L_loc(x, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} x_ij^p · smooth_L1(l_i^m − ĝ_j^m),

where (cx, cy) are the center coordinates of default box x after compensation by (Δx, Δy), w and h the width and height of the default box after compensation by (Δw, Δh), l_i^m the coordinate m of the i-th predicted box, and ĝ_j^m the encoded coordinate m of the j-th ground-truth box;
(2c4) from the classification loss L_conf(x, c) and the localization loss L_loc(x, l, g), obtain the target loss function L(x, l, c, g):

L(x, l, c, g) = (1/N)·(L_conf(x, c) + α·L_loc(x, l, g)),

where x denotes the default boxes on the feature maps, l the predicted boxes, g the ground-truth boxes, c the class-score set of the default boxes over all classes, L_conf(x, c) the softmax classification loss of the default boxes over the class-score set c, L_loc(x, l, g) the localization loss, N the number of default boxes matched to ground-truth boxes, and α a parameter set to 1 by cross-validation;
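A numerical sketch of the target loss of steps (2c2)-(2c4). The smooth-L1 form of the localization loss is SSD's choice and an assumption here (the patent only names a positioning loss), hard-negative selection is omitted for brevity, and the function names are illustrative.

```python
import numpy as np

def conf_loss(scores, labels):
    """Softmax classification loss L_conf over matched default boxes:
    scores is an (N, H) array of raw class scores, labels the index of
    the matched ground-truth class for each box."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(labels)), labels]).sum()

def loc_loss(pred_offsets, gt_offsets):
    """Localization loss L_loc over the (dx, dy, dw, dh) offsets,
    using smooth L1 (the SSD choice, assumed here)."""
    d = np.abs(pred_offsets - gt_offsets)
    return np.where(d < 1, 0.5 * d * d, d - 0.5).sum()

def target_loss(scores, labels, pred_offsets, gt_offsets, alpha=1.0):
    """L(x, l, c, g) = (1/N) * (L_conf(x, c) + alpha * L_loc(x, l, g)),
    with N the number of matched default boxes and alpha set to 1."""
    N = len(labels)
    if N == 0:
        return 0.0
    return (conf_loss(scores, labels)
            + alpha * loc_loss(pred_offsets, gt_offsets)) / N
```

For a single matched box with uniform class scores and perfect offsets, the loss reduces to −log(1/H), which is the expected value of a softmax cross-entropy at chance level.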
(2d) minimize the loss function with gradient descent while adjusting the weight parameters in the network backwards layer by layer, obtaining the trained network model.
Step 3: input the original image under test into the trained network model to obtain the target classes and position coordinates in the image.
(3a) Quadtree-partition the input image in the pre-segmentation layer and extract the region of interest;
(3b) extract features from the region of interest with the base convolutional layers and the auxiliary convolutional layers, obtaining feature maps at multiple scales;
(3c) compute the position coordinates of the default boxes on the multi-scale feature maps;
(3d) predict on the multi-scale feature maps with convolutional filters, obtaining the target class scores of multiple predicted boxes and the offsets of the predicted boxes relative to the default boxes;
(3e) apply non-maximum suppression to the target classes and relative offsets of the multiple predicted boxes, obtaining the target classes of the final predicted boxes and their offsets (Δxfinal, Δyfinal, Δwfinal, Δhfinal) relative to the default boxes, and from these offsets and the position coordinates of the default boxes obtain the position coordinates of the predicted boxes.
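The suppression of step (3e) can be sketched as greedy non-maximum suppression; the 0.45 overlap threshold is an illustrative assumption, not a value taken from the patent.

```python
def iou(a, b):
    """Jaccard overlap of two corner-format boxes (x0, y0, x1, y1)."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    inter = max(xmax - xmin, 0) * max(ymax - ymin, 0)
    if inter == 0:
        return 0.0
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, overlap=0.45):
    """Step (3e): repeatedly keep the highest-scoring predicted box and
    discard all remaining boxes overlapping it by more than `overlap`.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= overlap]
    return keep
```

Duplicate detections of the same target are thereby collapsed onto the single highest-scoring predicted box, which is then decoded with its offsets into the final position coordinates.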
The effect of the invention is further illustrated by the following experiments.
1. Experimental subjects
The experimental subjects are the PASCAL VOC2007 test set and the four images under test a, b, c and d shown in Fig. 3.
2. Experimental procedure
(2.1) Train the Fast R-CNN, Faster R-CNN, YOLO and SSD300 network models and the network model of the invention on the training sets of the image sets PASCAL VOC2007 and PASCAL VOC2012;
(2.2) test each model trained in (2.1) on the PASCAL VOC2007 test set; the resulting detection precision and detection speed of each network model are shown in Table 1;
(2.3) run the model trained with the invention on the four pictures of Fig. 3 in turn; the regions of interest extracted from the images under test are shown in Fig. 4, and the final detection results in Fig. 5, where:
Fig. 5a is the simulation result of detecting targets in Fig. 3a with the invention; the target class is cat and the target location is the box;
Fig. 5b is the simulation result for Fig. 3b; the target class is ship and the target location is the box;
Fig. 5c is the simulation result for Fig. 3c; the target class is aircraft and the target location is the box;
Fig. 5d is the simulation result for Fig. 3d; the target class is cat and the target location is the box.
The detection results of Fig. 5b and Fig. 5c show that the network model of the invention localizes small targets accurately and classifies them correctly.
3. Experimental statistics
The trained Fast R-CNN, Faster R-CNN, YOLO and SSD300 network models and the network model of the invention were each tested on the PASCAL VOC2007 test set; the resulting detection precision and detection speed are shown in Table 1:
Table 1
Algorithm model | Training dataset | Detection precision (%) | Detection speed (frames/s) |
Fast R-CNN | 07++12 | 68.4 | 3 |
Faster R-CNN | 07++12 | 70.4 | 5 |
YOLO | 07++12 | 57.9 | 47 |
SSD300 | 07++12 | 72.4 | 59 |
Method of the invention | 07++12 | 74.9 | 45 |
As can be seen from Table 1, the detection precision and detection speed of the invention's network model on the test set are both markedly better than those of the Fast R-CNN and Faster R-CNN network models, and compared with the SSD300 and YOLO network models the invention improves detection precision while keeping the detection speed adequate. Real-time detection requires more than 25 frames per second; the invention's detection speed reaches 45 frames per second, meeting the real-time detection requirement.
Claims (5)
1. An object detection method based on pre-segmentation and regression-based deep learning, comprising:
(1) building a deep-learning network model based on pre-segmentation and regression from the quadtree algorithm and the convolutional neural network VGG-16;
(2) training the built network model on an image training set:
(2a) using the training sets of the image sets PASCAL VOC2007 and PASCAL VOC2012 as the training set and the test set of PASCAL VOC2007 as the test set;
(2b) matching the ground-truth boxes of the annotated images in the training set against the default boxes generated on the feature maps of the network model;
(2c) constructing the target loss function L(x, l, c, g) of the network model:

L(x, l, c, g) = (1/N)·(L_conf(x, c) + α·L_loc(x, l, g))

where x denotes the default boxes on the feature maps, l the predicted boxes, g the ground-truth boxes, c the set of class scores of the default boxes over all classes, L_conf(x, c) the softmax classification loss of the default boxes over the class-score set c, L_loc(x, l, g) the localization loss, N the number of default boxes matched to ground-truth boxes, and α a parameter set to 1 by cross-validation;
(2d) minimizing the loss function with gradient descent while adjusting the weight parameters in the network backwards layer by layer, obtaining the trained network model;
(3) inputting the original image under test into the trained network model to obtain the target classes and position coordinates in the image.
2. The method according to claim 1, wherein in step (1) the deep-learning network model based on pre-segmentation and regression is built from the quadtree algorithm and the convolutional neural network VGG-16 according to the following steps:
(1a) building the region-of-interest pre-segmentation network layer with the quadtree algorithm:
(1a1) setting the segmentation threshold of the quadtree algorithm to M and the maximum number of splits to Q, and dividing the image under test into four sub-regions along the horizontal and vertical directions;
(1a2) computing the average gray value of each sub-region after splitting, continuing to split into four sub-regions any sub-region whose average gray value exceeds M, stopping when a sub-region's average gray value falls below M or the number of splits reaches Q, and recording its position;
(1a3) finding, from the recorded sub-region positions, the coordinates of the smallest sub-regions at the upper-left and lower-right corners of the image under test, which give the position of the region of interest in the image under test;
(1b) building the object detection network layers from the convolutional neural network VGG-16:
(1b1) using stages stage1-stage5 of the convolutional neural network VGG-16 as the base convolutional layers of the detection network, replacing its fully connected layers fc6 and fc7 with two convolutional layers, and adding four new convolutional layers as the auxiliary convolutional layers of the detection network;
(1b2) forming the detection layers of the detection network from a series of convolutional filters;
(1b3) forming the output layer of the detection network from the non-maximum-suppression layer of the region-proposal-and-CNN-based detection network R-CNN.
3. according to the method for claim 1, to the mark frame and net of mark image in training set wherein in step (2b)
Acquiescence frame on the characteristic pattern generated in network model is matched, and is carried out in accordance with the following steps:
(2b1) calculates size and the position of the acquiescence frame on characteristic pattern:
Setting the ratio of width to height for giving tacit consent to frame on each characteristic pattern has 5 kinds of different ratios, respectively a={ 1,2,3,1/2,1/
3 }, it is a to calculate the ratio of width to height in k-th of characteristic patternτAcquiescence frame widthAnd height
$$w_k^{a_\tau} = \left(s_{\min} + \frac{s_{\max} - s_{\min}}{m-1}(k-1)\right)\sqrt{a_\tau},\qquad h_k^{a_\tau} = \frac{s_{\min} + \dfrac{s_{\max} - s_{\min}}{m-1}(k-1)}{\sqrt{a_\tau}}$$
where a_τ is the τ-th aspect ratio, 1 ≤ τ ≤ 5; s_min denotes the minimum ratio of the side length of a default box of aspect ratio 1 to the side length of the input image, and s_max the corresponding maximum ratio; k ∈ [1, m], where m is the number of feature maps in the network model;
The center coordinate (xcen, ycen) of a default box on the k-th feature map is computed as ((u + 0.5)/|f_k|, (v + 0.5)/|f_k|), where (u, v) denotes a point on the feature map, u, v ∈ [0, |f_k|], and |f_k| is the size of the k-th feature map;
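A minimal sketch of the box geometry in step (2b1); the values s_min = 0.2, s_max = 0.9, m = 6 and the feature-map size used below are illustrative assumptions (the claim fixes none of them), with box sizes expressed as fractions of the input image side and the half-cell center offset following the SSD convention:

```python
import math

def default_box_size(k, a_tau, s_min=0.2, s_max=0.9, m=6):
    """Width and height of a default box with aspect ratio a_tau
    on the k-th feature map (k = 1..m), per the formula in (2b1)."""
    s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1)
    return s_k * math.sqrt(a_tau), s_k / math.sqrt(a_tau)

def default_box_center(u, v, fk):
    """Center of the default box anchored at cell (u, v) of a
    feature map of size fk, in normalized image coordinates."""
    return (u + 0.5) / fk, (v + 0.5) / fk

# On the first feature map with aspect ratio 1, the box side equals s_min.
w, h = default_box_size(k=1, a_tau=1)
```

Note that boxes with aspect ratio a_τ > 1 come out wider than tall, and a_τ < 1 taller than wide, as the √a_τ factor intends.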
(2b2) calculates the Jaccard overlap coefficient between a default box x and a ground-truth box g from the size and center coordinate of the default box and the position coordinates of the ground-truth box:
(2b21) calculates the upper-left corner coordinate (xleft, yleft) and the lower-right corner coordinate (xrigh, yrigh) of the default box x:
$$xleft = xcen - w_k^{a_\tau}/2,\qquad yleft = ycen - h_k^{a_\tau}/2,$$
$$xrigh = xcen + w_k^{a_\tau}/2,\qquad yrigh = ycen + h_k^{a_\tau}/2;$$
(2b22) calculates the upper-left corner coordinate (x_min, y_min) and the lower-right corner coordinate (x_max, y_max) of the intersection of the default box and the ground-truth box:
x_min = max(xleft, xgleft),
y_min = max(yleft, ygleft),
x_max = min(xrigh, xgrigh),
y_max = min(yrigh, ygrigh);
where (xgleft, ygleft) and (xgrigh, ygrigh) denote the coordinates of the upper-left and lower-right corners of the ground-truth box g, respectively;
(2b23) calculates the area inter(x, g) of the intersection of the default box x and the ground-truth box g:
inter(x, g) = max(y_max − y_min, 0) · max(x_max − x_min, 0)
(2b24) calculates the Jaccard overlap coefficient between the default box x and the ground-truth box g:
$$J(x,g) = \frac{inter(x,g)}{(yrigh - yleft)(xrigh - xleft) + (ygrigh - ygleft)(xgrigh - xgleft) - inter(x,g)}$$
Default boxes whose coefficient J(x, g) exceeds 0.5 are selected as positive samples Pos, and the others as negative samples Neg, completing the matching of ground-truth boxes and default boxes.
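Steps (2b21) through (2b24) together amount to the standard Jaccard (IoU) computation; a minimal sketch over corner-format boxes, using the 0.5 positive-sample threshold stated above:

```python
def jaccard(box, gt):
    """Jaccard overlap of two boxes given as (xleft, yleft, xrigh, yrigh)."""
    x_min = max(box[0], gt[0])          # (2b22): intersection corners
    y_min = max(box[1], gt[1])
    x_max = min(box[2], gt[2])
    y_max = min(box[3], gt[3])
    inter = max(x_max - x_min, 0) * max(y_max - y_min, 0)   # (2b23)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_box + area_gt - inter)             # (2b24)

def match(default_boxes, gt, threshold=0.5):
    """Split default boxes into positive and negative samples
    against one ground-truth box, per the 0.5 rule."""
    pos = [b for b in default_boxes if jaccard(b, gt) > threshold]
    neg = [b for b in default_boxes if jaccard(b, gt) <= threshold]
    return pos, neg
```

Taking the max of the left/top corners and the min of the right/bottom corners is what guarantees inter(x, g) is the true overlap area (clamped to 0 when the boxes are disjoint).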
4. The method according to claim 1, wherein the target loss function L(x, l, c, g) of the network model in step (2c) is constructed as follows:
(2c1) performs prediction on the feature maps using convolution filters, obtaining for each default box the set c of classification scores over all classes and the position offset (Δx, Δy, Δw, Δh) of the predicted box relative to the default box, where (Δx, Δy) is the offset of the predicted box center relative to the default box center, Δw the offset of the predicted box width relative to the default box width, and Δh the offset of the predicted box height relative to the default box height;
(2c2) calculates the softmax classification loss function L_conf(x, c) from the classification score set c of the default boxes over all classes:
$$L_{conf}(x,c) = \sum_{i \in Pos}^{N} x_{ij}^{p}\,\log\!\left(\hat{c}_{i}^{Pos}\right) - \sum_{i_2 \in Neg} \log\!\left(\hat{c}_{i_2}^{Neg}\right),$$
where x_ij^p = 1 indicates that the i-th default box matches the j-th ground-truth box of class p, and x_ij^p = 0 that it does not; 0 ≤ i ≤ N, where N is the number of default boxes matched to ground-truth boxes; 1 ≤ p ≤ H, where H is the total number of classes; 0 ≤ j ≤ T, where T is the number of ground-truth boxes; ĉ_i^Pos denotes the mean over all classes of the i-th default box in the positive samples, and ĉ_{i₂}^Neg the mean over all classes of the i₂-th default box in the negative samples; 0 ≤ i₂ ≤ N₂, where N₂ is the number of default boxes not matched to any ground-truth box;
(2c3) calculates the localization loss function L_loc(x, l, g):
$$L_{loc}(x,l,g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx,\,cy,\,w,\,h\}} x_{ij}^{p}\,\mathrm{smooth}_{L1}\!\left(l_{i}^{m} - \hat{g}_{j}^{m}\right)$$
where (cx, cy) is the center coordinate of the default box x after compensation by (Δx, Δy), and w, h are the width and height of the default box after compensation by (Δw, Δh); l_i^m denotes the m-th offset component of the i-th predicted box, and ĝ_j^m the m-th offset component of the j-th ground-truth box;
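The smooth_L1 penalty in (2c3) is not expanded in the claim; the sketch below uses its customary piecewise definition (0.5t² for |t| < 1, |t| − 0.5 otherwise) and evaluates the inner sum for one matched box over the offset index set {cx, cy, w, h}:

```python
def smooth_l1(t):
    """smooth_L1 penalty: quadratic near zero, linear in the tails."""
    return 0.5 * t * t if abs(t) < 1 else abs(t) - 0.5

def loc_loss_one_box(l, g_hat):
    """Localization loss contribution of one positive default box.
    l and g_hat are dicts of offsets keyed by m in {cx, cy, w, h}."""
    return sum(smooth_l1(l[m] - g_hat[m]) for m in ("cx", "cy", "w", "h"))
```

The quadratic region keeps gradients small for near-correct boxes while the linear tails limit the influence of badly mislocalized ones.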
(2c4) obtains the target loss function L(x, l, c, g) from the classification loss function L_conf(x, c) and the localization loss function L_loc(x, l, g):
$$L(x,l,c,g) = \frac{1}{N}\left(L_{conf}(x,c) + \alpha L_{loc}(x,l,g)\right).$$
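A numeric sketch of how (2c4) combines the two losses; the default weight α = 1 mirrors common single-shot-detector practice and is an assumption, as is returning 0 when no default box matches any ground-truth box:

```python
def target_loss(l_conf, l_loc, n_pos, alpha=1.0):
    """Target loss L(x, l, c, g) = (1/N) * (L_conf + alpha * L_loc),
    where N is the number of matched (positive) default boxes."""
    if n_pos == 0:  # no matched boxes: conventionally define the loss as 0
        return 0.0
    return (l_conf + alpha * l_loc) / n_pos
```

Normalizing by N keeps the loss scale independent of how many default boxes happen to match in a given image.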
5. The method according to claim 1, wherein in step (3) the original image to be detected is input into the trained network model to obtain the target classes and position coordinates in the image to be detected, carried out as follows:
(3a) performs quadtree partitioning of the input image in the pre-segmentation layer and extracts regions of interest;
(3b) performs feature extraction on the regions of interest using the basic convolutional layers and the auxiliary convolutional layers, obtaining feature maps at multiple scales;
(3c) calculates the position coordinates of the default boxes on the multi-scale feature maps;
(3d) performs prediction on the multi-scale feature maps using convolution filters, obtaining the target class scores in multiple predicted boxes and the position offsets of the predicted boxes relative to the default boxes;
(3e) applies non-maximum suppression to the target classes in the predicted boxes and to the position offsets of the predicted boxes relative to the default boxes, obtaining the target classes in the final predicted boxes and their position offsets relative to the default boxes, and obtains the position coordinates of the predicted boxes from these offsets and the position coordinates of the default boxes.
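The suppression in step (3e) can be sketched as plain greedy non-maximum suppression over scored boxes; the (xleft, yleft, xrigh, yrigh, score) tuple format and the 0.45 overlap threshold are illustrative assumptions, not values fixed by the claim:

```python
def iou(a, b):
    """Jaccard overlap of two corner-format boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop rivals that
    overlap any kept box by more than iou_thresh."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) <= iou_thresh for k in kept):
            kept.append(b)
    return kept
```

Run per class, this leaves one predicted box per detected object, whose coordinates are then decoded from the offsets and the default box positions as step (3e) describes.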
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710598875.XA CN107423760A (en) | 2017-07-21 | 2017-07-21 | Based on pre-segmentation and the deep learning object detection method returned |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107423760A true CN107423760A (en) | 2017-12-01 |
Family
ID=60430914
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710598875.XA Pending CN107423760A (en) | 2017-07-21 | 2017-07-21 | Based on pre-segmentation and the deep learning object detection method returned |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107423760A (en) |
Cited By (61)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108009525A (en) * | 2017-12-25 | 2018-05-08 | 北京航空航天大学 | A kind of specific objective recognition methods over the ground of the unmanned plane based on convolutional neural networks |
CN108171752A (en) * | 2017-12-28 | 2018-06-15 | 成都阿普奇科技股份有限公司 | A kind of sea ship video detection and tracking based on deep learning |
CN108256464A (en) * | 2018-01-12 | 2018-07-06 | 适普远景遥感信息技术(北京)有限公司 | High-resolution remote sensing image urban road extracting method based on deep learning |
CN108257114A (en) * | 2017-12-29 | 2018-07-06 | 天津市万贸科技有限公司 | A kind of transmission facility defect inspection method based on deep learning |
CN108288270A (en) * | 2018-02-05 | 2018-07-17 | 南京邮电大学 | A kind of object detection method based on channel trimming and full convolution deep learning |
CN108304787A (en) * | 2018-01-17 | 2018-07-20 | 河南工业大学 | Road target detection method based on convolutional neural networks |
CN108334878A (en) * | 2018-02-07 | 2018-07-27 | 北京影谱科技股份有限公司 | Video images detection method and apparatus |
CN108337416A (en) * | 2018-03-09 | 2018-07-27 | 天津港(集团)有限公司 | It is a kind of for the automatic lifting identification device of perimeter protection and recognition methods |
CN108376235A (en) * | 2018-01-15 | 2018-08-07 | 深圳市易成自动驾驶技术有限公司 | Image detecting method, device and computer readable storage medium |
CN108460382A (en) * | 2018-03-26 | 2018-08-28 | 西安电子科技大学 | Remote sensing image Ship Detection based on deep learning single step detector |
CN108510012A (en) * | 2018-05-04 | 2018-09-07 | 四川大学 | A kind of target rapid detection method based on Analysis On Multi-scale Features figure |
CN108498089A (en) * | 2018-05-08 | 2018-09-07 | 北京邮电大学 | A kind of noninvasive continuous BP measurement method based on deep neural network |
CN108573238A (en) * | 2018-04-23 | 2018-09-25 | 济南浪潮高新科技投资发展有限公司 | A kind of vehicle checking method based on dual network structure |
CN108595544A (en) * | 2018-04-09 | 2018-09-28 | 深源恒际科技有限公司 | A kind of document picture classification method |
CN108629767A (en) * | 2018-04-28 | 2018-10-09 | Oppo广东移动通信有限公司 | A kind of method, device and mobile terminal of scene detection |
CN108647655A (en) * | 2018-05-16 | 2018-10-12 | 北京工业大学 | Low latitude aerial images power line foreign matter detecting method based on light-duty convolutional neural networks |
CN108765392A (en) * | 2018-05-20 | 2018-11-06 | 复旦大学 | A kind of digestive endoscope lesion detection and recognition methods based on sliding window |
CN108846826A (en) * | 2018-04-24 | 2018-11-20 | 深圳大学 | Object detecting method, device, image processing equipment and storage medium |
CN108898628A (en) * | 2018-06-21 | 2018-11-27 | 北京纵目安驰智能科技有限公司 | Three-dimensional vehicle object's pose estimation method, system, terminal and storage medium based on monocular |
CN108960175A (en) * | 2018-07-12 | 2018-12-07 | 天津艾思科尔科技有限公司 | A kind of licence plate recognition method based on deep learning |
CN109002841A (en) * | 2018-06-27 | 2018-12-14 | 淮阴工学院 | A kind of building element extracting method based on Faster-RCNN model |
CN109117794A (en) * | 2018-08-16 | 2019-01-01 | 广东工业大学 | A kind of moving target behavior tracking method, apparatus, equipment and readable storage medium storing program for executing |
CN109147254A (en) * | 2018-07-18 | 2019-01-04 | 武汉大学 | A kind of video outdoor fire disaster smog real-time detection method based on convolutional neural networks |
CN109344821A (en) * | 2018-08-30 | 2019-02-15 | 西安电子科技大学 | Small target detecting method based on Fusion Features and deep learning |
CN109409365A (en) * | 2018-10-25 | 2019-03-01 | 江苏德劭信息科技有限公司 | It is a kind of that method is identified and positioned to fruit-picking based on depth targets detection |
CN109446888A (en) * | 2018-09-10 | 2019-03-08 | 唯思科技(北京)有限公司 | A kind of elongated class article detection method based on convolutional neural networks |
CN109597087A (en) * | 2018-11-15 | 2019-04-09 | 天津大学 | A kind of 3D object detection method based on point cloud data |
CN109635666A (en) * | 2018-11-16 | 2019-04-16 | 南京航空航天大学 | A kind of image object rapid detection method based on deep learning |
CN109685008A (en) * | 2018-12-25 | 2019-04-26 | 云南大学 | A kind of real-time video object detection method |
CN109685528A (en) * | 2017-12-18 | 2019-04-26 | 北京京东尚科信息技术有限公司 | System and method based on deep learning detection counterfeit product |
CN109684803A (en) * | 2018-12-19 | 2019-04-26 | 西安电子科技大学 | Man-machine verification method based on gesture sliding |
CN109711326A (en) * | 2018-12-25 | 2019-05-03 | 云南大学 | A kind of video object detection method based on shallow-layer residual error network |
CN109977783A (en) * | 2019-02-28 | 2019-07-05 | 浙江新再灵科技股份有限公司 | Method based on the independent boarding detection of vertical ladder scene perambulator |
CN109977943A (en) * | 2019-02-14 | 2019-07-05 | 平安科技(深圳)有限公司 | A kind of images steganalysis method, system and storage medium based on YOLO |
CN110110586A (en) * | 2019-03-18 | 2019-08-09 | 北京理工雷科电子信息技术有限公司 | The method and device of remote sensing airport Airplane detection based on deep learning |
CN110210472A (en) * | 2018-02-28 | 2019-09-06 | 佛山科学技术学院 | A kind of method for checking object based on depth network |
CN110363122A (en) * | 2019-07-03 | 2019-10-22 | 昆明理工大学 | A kind of cross-domain object detection method based on multilayer feature alignment |
CN110543801A (en) * | 2018-05-29 | 2019-12-06 | 北京林业大学 | Pine pest detection method, system and device based on neural network and unmanned aerial vehicle aerial image |
CN110688925A (en) * | 2019-09-19 | 2020-01-14 | 国网山东省电力公司电力科学研究院 | Cascade target identification method and system based on deep learning |
CN110749324A (en) * | 2019-10-28 | 2020-02-04 | 深圳市赛为智能股份有限公司 | Unmanned aerial vehicle rescue positioning method and device, computer equipment and storage medium |
CN110781793A (en) * | 2019-10-21 | 2020-02-11 | 合肥成方信息技术有限公司 | Artificial intelligence real-time image recognition method based on quadtree algorithm |
CN110796127A (en) * | 2020-01-06 | 2020-02-14 | 四川通信科研规划设计有限责任公司 | Embryo prokaryotic detection system based on occlusion sensing, storage medium and terminal |
CN110826575A (en) * | 2019-12-13 | 2020-02-21 | 哈尔滨工程大学 | Underwater target identification method based on machine learning |
WO2020038205A1 (en) * | 2018-08-24 | 2020-02-27 | 腾讯科技(深圳)有限公司 | Target detection method and apparatus, computer-readable storage medium, and computer device |
CN110929774A (en) * | 2019-11-18 | 2020-03-27 | 腾讯科技(深圳)有限公司 | Method for classifying target objects in image, method and device for training model |
CN110942140A (en) * | 2019-11-29 | 2020-03-31 | 任科扬 | Artificial neural network difference and iteration data processing method and device |
WO2020093624A1 (en) * | 2018-11-07 | 2020-05-14 | 五邑大学 | Antenna downward inclination angle measurement method based on multi-scale detection algorithm |
WO2020107886A1 (en) * | 2018-11-29 | 2020-06-04 | Huawei Technologies Co., Ltd. | Loading apparatus and method for convolution with stride or dilation of 2 |
CN111340768A (en) * | 2020-02-21 | 2020-06-26 | 之江实验室 | Multi-center effect compensation method based on PET/CT intelligent diagnosis system |
CN111382787A (en) * | 2020-03-06 | 2020-07-07 | 芯薇(上海)智能科技有限公司 | Target detection method based on deep learning |
CN111476167A (en) * | 2020-04-09 | 2020-07-31 | 北京中科千寻科技有限公司 | student-T distribution assistance-based one-stage direction remote sensing image target detection method |
CN111611918A (en) * | 2020-05-20 | 2020-09-01 | 重庆大学 | Traffic flow data set acquisition and construction method based on aerial photography data and deep learning |
CN111626419A (en) * | 2020-07-20 | 2020-09-04 | 成都安智杰科技有限公司 | Convolutional neural network structure, target detection method and device |
CN111723737A (en) * | 2020-06-19 | 2020-09-29 | 河南科技大学 | Target detection method based on multi-scale matching strategy deep feature learning |
CN111738070A (en) * | 2020-05-14 | 2020-10-02 | 华南理工大学 | Automatic accurate detection method for multiple small targets |
CN111753682A (en) * | 2020-06-11 | 2020-10-09 | 中建地下空间有限公司 | Hoisting area dynamic monitoring method based on target detection algorithm |
CN111798416A (en) * | 2019-06-20 | 2020-10-20 | 太原理工大学 | Intelligent glomerulus detection method and system based on pathological image and deep learning |
CN112363844A (en) * | 2021-01-12 | 2021-02-12 | 之江实验室 | Convolutional neural network vertical segmentation method for image processing |
CN112926681A (en) * | 2021-03-29 | 2021-06-08 | 复旦大学 | Target detection method and device based on deep convolutional neural network |
CN113269725A (en) * | 2021-04-28 | 2021-08-17 | 安徽理工大学 | Coal gangue rapid detection method based on imaging technology and deep learning |
US20220207825A1 (en) * | 2019-09-17 | 2022-06-30 | SZ DJI Technology Co., Ltd. | Machine vision-based tree recognition method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101937566A (en) * | 2010-09-20 | 2011-01-05 | 西安电子科技大学 | SAR image segmentation method combining background information and maximum posterior marginal probability standard |
CN103077536A (en) * | 2012-12-31 | 2013-05-01 | 华中科技大学 | Space-time mutative scale moving target detection method |
CN103218819A (en) * | 2013-04-20 | 2013-07-24 | 复旦大学 | Automatic selection method for optimal homogenous region of ultrasound image and based on quad-tree decomposition |
CN106447658A (en) * | 2016-09-26 | 2017-02-22 | 西北工业大学 | Significant target detection method based on FCN (fully convolutional network) and CNN (convolutional neural network) |
CN106600560A (en) * | 2016-12-22 | 2017-04-26 | 福州大学 | Image defogging method for automobile data recorder |
US20170169315A1 (en) * | 2015-12-15 | 2017-06-15 | Sighthound, Inc. | Deeply learned convolutional neural networks (cnns) for object localization and classification |
Non-Patent Citations (1)
Title |
---|
WEI LIU et al.: "SSD: Single Shot MultiBox Detector", European Conference on Computer Vision *
Cited By (89)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685528A (en) * | 2017-12-18 | 2019-04-26 | 北京京东尚科信息技术有限公司 | System and method based on deep learning detection counterfeit product |
CN108009525A (en) * | 2017-12-25 | 2018-05-08 | 北京航空航天大学 | A kind of specific objective recognition methods over the ground of the unmanned plane based on convolutional neural networks |
CN108009525B (en) * | 2017-12-25 | 2018-10-12 | 北京航空航天大学 | A kind of specific objective recognition methods over the ground of the unmanned plane based on convolutional neural networks |
CN108171752A (en) * | 2017-12-28 | 2018-06-15 | 成都阿普奇科技股份有限公司 | A kind of sea ship video detection and tracking based on deep learning |
CN108257114A (en) * | 2017-12-29 | 2018-07-06 | 天津市万贸科技有限公司 | A kind of transmission facility defect inspection method based on deep learning |
CN108256464A (en) * | 2018-01-12 | 2018-07-06 | 适普远景遥感信息技术(北京)有限公司 | High-resolution remote sensing image urban road extracting method based on deep learning |
CN108256464B (en) * | 2018-01-12 | 2020-08-11 | 适普远景遥感信息技术(北京)有限公司 | High-resolution remote sensing image urban road extraction method based on deep learning |
CN108376235A (en) * | 2018-01-15 | 2018-08-07 | 深圳市易成自动驾驶技术有限公司 | Image detecting method, device and computer readable storage medium |
CN108304787A (en) * | 2018-01-17 | 2018-07-20 | 河南工业大学 | Road target detection method based on convolutional neural networks |
CN108288270A (en) * | 2018-02-05 | 2018-07-17 | 南京邮电大学 | A kind of object detection method based on channel trimming and full convolution deep learning |
CN108288270B (en) * | 2018-02-05 | 2022-06-03 | 南京邮电大学 | Target detection method based on channel pruning and full convolution deep learning |
CN108334878A (en) * | 2018-02-07 | 2018-07-27 | 北京影谱科技股份有限公司 | Video images detection method and apparatus |
CN108334878B (en) * | 2018-02-07 | 2021-01-05 | 北京影谱科技股份有限公司 | Video image detection method, device and equipment and readable storage medium |
CN110210472A (en) * | 2018-02-28 | 2019-09-06 | 佛山科学技术学院 | A kind of method for checking object based on depth network |
CN108337416A (en) * | 2018-03-09 | 2018-07-27 | 天津港(集团)有限公司 | It is a kind of for the automatic lifting identification device of perimeter protection and recognition methods |
CN108460382A (en) * | 2018-03-26 | 2018-08-28 | 西安电子科技大学 | Remote sensing image Ship Detection based on deep learning single step detector |
CN108460382B (en) * | 2018-03-26 | 2021-04-06 | 西安电子科技大学 | Optical remote sensing image ship detection method based on deep learning single-step detector |
CN108595544A (en) * | 2018-04-09 | 2018-09-28 | 深源恒际科技有限公司 | A kind of document picture classification method |
CN108573238A (en) * | 2018-04-23 | 2018-09-25 | 济南浪潮高新科技投资发展有限公司 | A kind of vehicle checking method based on dual network structure |
CN108846826A (en) * | 2018-04-24 | 2018-11-20 | 深圳大学 | Object detecting method, device, image processing equipment and storage medium |
CN108629767A (en) * | 2018-04-28 | 2018-10-09 | Oppo广东移动通信有限公司 | A kind of method, device and mobile terminal of scene detection |
CN108510012B (en) * | 2018-05-04 | 2022-04-01 | 四川大学 | Target rapid detection method based on multi-scale feature map |
CN108510012A (en) * | 2018-05-04 | 2018-09-07 | 四川大学 | A kind of target rapid detection method based on Analysis On Multi-scale Features figure |
CN108498089A (en) * | 2018-05-08 | 2018-09-07 | 北京邮电大学 | A kind of noninvasive continuous BP measurement method based on deep neural network |
CN108498089B (en) * | 2018-05-08 | 2022-03-25 | 北京邮电大学 | Noninvasive continuous blood pressure measuring method based on deep neural network |
CN108647655A (en) * | 2018-05-16 | 2018-10-12 | 北京工业大学 | Low latitude aerial images power line foreign matter detecting method based on light-duty convolutional neural networks |
CN108647655B (en) * | 2018-05-16 | 2022-07-12 | 北京工业大学 | Low-altitude aerial image power line foreign matter detection method based on light convolutional neural network |
CN108765392A (en) * | 2018-05-20 | 2018-11-06 | 复旦大学 | A kind of digestive endoscope lesion detection and recognition methods based on sliding window |
CN108765392B (en) * | 2018-05-20 | 2022-03-18 | 复旦大学 | Digestive tract endoscope lesion detection and identification method based on sliding window |
CN110543801A (en) * | 2018-05-29 | 2019-12-06 | 北京林业大学 | Pine pest detection method, system and device based on neural network and unmanned aerial vehicle aerial image |
CN108898628A (en) * | 2018-06-21 | 2018-11-27 | 北京纵目安驰智能科技有限公司 | Three-dimensional vehicle object's pose estimation method, system, terminal and storage medium based on monocular |
CN109002841A (en) * | 2018-06-27 | 2018-12-14 | 淮阴工学院 | A kind of building element extracting method based on Faster-RCNN model |
CN109002841B (en) * | 2018-06-27 | 2021-11-12 | 淮阴工学院 | Building component extraction method based on fast-RCNN model |
CN108960175A (en) * | 2018-07-12 | 2018-12-07 | 天津艾思科尔科技有限公司 | A kind of licence plate recognition method based on deep learning |
CN109147254B (en) * | 2018-07-18 | 2021-05-18 | 武汉大学 | Video field fire smoke real-time detection method based on convolutional neural network |
CN109147254A (en) * | 2018-07-18 | 2019-01-04 | 武汉大学 | A kind of video outdoor fire disaster smog real-time detection method based on convolutional neural networks |
CN109117794A (en) * | 2018-08-16 | 2019-01-01 | 广东工业大学 | A kind of moving target behavior tracking method, apparatus, equipment and readable storage medium storing program for executing |
US11710293B2 (en) | 2018-08-24 | 2023-07-25 | Tencent Technology (Shenzhen) Company Limited | Target detection method and apparatus, computer-readable storage medium, and computer device |
WO2020038205A1 (en) * | 2018-08-24 | 2020-02-27 | 腾讯科技(深圳)有限公司 | Target detection method and apparatus, computer-readable storage medium, and computer device |
CN109344821A (en) * | 2018-08-30 | 2019-02-15 | 西安电子科技大学 | Small target detecting method based on Fusion Features and deep learning |
CN109446888A (en) * | 2018-09-10 | 2019-03-08 | 唯思科技(北京)有限公司 | A kind of elongated class article detection method based on convolutional neural networks |
CN109409365A (en) * | 2018-10-25 | 2019-03-01 | 江苏德劭信息科技有限公司 | It is a kind of that method is identified and positioned to fruit-picking based on depth targets detection |
US11145089B2 (en) | 2018-11-07 | 2021-10-12 | Wuyi University | Method for measuring antenna downtilt based on multi-scale detection algorithm |
WO2020093624A1 (en) * | 2018-11-07 | 2020-05-14 | 五邑大学 | Antenna downward inclination angle measurement method based on multi-scale detection algorithm |
CN109597087B (en) * | 2018-11-15 | 2022-07-01 | 天津大学 | Point cloud data-based 3D target detection method |
CN109597087A (en) * | 2018-11-15 | 2019-04-09 | 天津大学 | A kind of 3D object detection method based on point cloud data |
CN109635666A (en) * | 2018-11-16 | 2019-04-16 | 南京航空航天大学 | A kind of image object rapid detection method based on deep learning |
CN109635666B (en) * | 2018-11-16 | 2023-04-18 | 南京航空航天大学 | Image target rapid detection method based on deep learning |
WO2020107886A1 (en) * | 2018-11-29 | 2020-06-04 | Huawei Technologies Co., Ltd. | Loading apparatus and method for convolution with stride or dilation of 2 |
CN109684803B (en) * | 2018-12-19 | 2021-04-20 | 西安电子科技大学 | Man-machine verification method based on gesture sliding |
CN109684803A (en) * | 2018-12-19 | 2019-04-26 | 西安电子科技大学 | Man-machine verification method based on gesture sliding |
CN109685008A (en) * | 2018-12-25 | 2019-04-26 | 云南大学 | A kind of real-time video object detection method |
CN109711326A (en) * | 2018-12-25 | 2019-05-03 | 云南大学 | A kind of video object detection method based on shallow-layer residual error network |
CN109977943B (en) * | 2019-02-14 | 2024-05-07 | 平安科技(深圳)有限公司 | Image target recognition method, system and storage medium based on YOLO |
CN109977943A (en) * | 2019-02-14 | 2019-07-05 | 平安科技(深圳)有限公司 | A kind of images steganalysis method, system and storage medium based on YOLO |
CN109977783A (en) * | 2019-02-28 | 2019-07-05 | 浙江新再灵科技股份有限公司 | Method based on the independent boarding detection of vertical ladder scene perambulator |
CN110110586A (en) * | 2019-03-18 | 2019-08-09 | 北京理工雷科电子信息技术有限公司 | The method and device of remote sensing airport Airplane detection based on deep learning |
CN111798416B (en) * | 2019-06-20 | 2023-04-18 | 太原理工大学 | Intelligent glomerulus detection method and system based on pathological image and deep learning |
CN111798416A (en) * | 2019-06-20 | 2020-10-20 | 太原理工大学 | Intelligent glomerulus detection method and system based on pathological image and deep learning |
CN110363122A (en) * | 2019-07-03 | 2019-10-22 | 昆明理工大学 | A kind of cross-domain object detection method based on multilayer feature alignment |
CN110363122B (en) * | 2019-07-03 | 2022-10-11 | 昆明理工大学 | Cross-domain target detection method based on multi-layer feature alignment |
US20220207825A1 (en) * | 2019-09-17 | 2022-06-30 | SZ DJI Technology Co., Ltd. | Machine vision-based tree recognition method and device |
CN110688925A (en) * | 2019-09-19 | 2020-01-14 | 国网山东省电力公司电力科学研究院 | Cascade target identification method and system based on deep learning |
CN110688925B (en) * | 2019-09-19 | 2022-11-15 | 国网智能科技股份有限公司 | Cascade target identification method and system based on deep learning |
CN110781793A (en) * | 2019-10-21 | 2020-02-11 | 合肥成方信息技术有限公司 | Artificial intelligence real-time image recognition method based on quadtree algorithm |
CN110749324A (en) * | 2019-10-28 | 2020-02-04 | 深圳市赛为智能股份有限公司 | Unmanned aerial vehicle rescue positioning method and device, computer equipment and storage medium |
CN110929774A (en) * | 2019-11-18 | 2020-03-27 | 腾讯科技(深圳)有限公司 | Method for classifying target objects in image, method and device for training model |
CN110929774B (en) * | 2019-11-18 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Classification method, model training method and device for target objects in image |
CN110942140A (en) * | 2019-11-29 | 2020-03-31 | 任科扬 | Artificial neural network difference and iteration data processing method and device |
CN110942140B (en) * | 2019-11-29 | 2022-11-08 | 任科扬 | Artificial neural network difference and iteration data processing method and device |
CN110826575A (en) * | 2019-12-13 | 2020-02-21 | 哈尔滨工程大学 | Underwater target identification method based on machine learning |
CN110796127A (en) * | 2020-01-06 | 2020-02-14 | 四川通信科研规划设计有限责任公司 | Embryo pronucleus detection system based on occlusion sensing, storage medium and terminal |
CN111340768B (en) * | 2020-02-21 | 2021-03-09 | 之江实验室 | Multi-center effect compensation method based on PET/CT intelligent diagnosis system |
US11715562B2 (en) | 2020-02-21 | 2023-08-01 | Zhejiang Lab | Method for multi-center effect compensation based on PET/CT intelligent diagnosis system |
CN111340768A (en) * | 2020-02-21 | 2020-06-26 | 之江实验室 | Multi-center effect compensation method based on PET/CT intelligent diagnosis system |
CN111382787A (en) * | 2020-03-06 | 2020-07-07 | 芯薇(上海)智能科技有限公司 | Target detection method based on deep learning |
CN111476167B (en) * | 2020-04-09 | 2024-03-22 | 北京中科千寻科技有限公司 | One-stage direction remote sensing image target detection method based on student-T distribution assistance |
CN111476167A (en) * | 2020-04-09 | 2020-07-31 | 北京中科千寻科技有限公司 | One-stage direction remote sensing image target detection method based on student-T distribution assistance |
CN111738070A (en) * | 2020-05-14 | 2020-10-02 | 华南理工大学 | Automatic accurate detection method for multiple small targets |
CN111611918B (en) * | 2020-05-20 | 2023-07-21 | 重庆大学 | Traffic flow data set acquisition and construction method based on aerial data and deep learning |
CN111611918A (en) * | 2020-05-20 | 2020-09-01 | 重庆大学 | Traffic flow data set acquisition and construction method based on aerial photography data and deep learning |
CN111753682A (en) * | 2020-06-11 | 2020-10-09 | 中建地下空间有限公司 | Hoisting area dynamic monitoring method based on target detection algorithm |
CN111723737A (en) * | 2020-06-19 | 2020-09-29 | 河南科技大学 | Target detection method based on multi-scale matching strategy deep feature learning |
CN111723737B (en) * | 2020-06-19 | 2023-11-17 | 河南科技大学 | Target detection method based on multi-scale matching strategy deep feature learning |
CN111626419A (en) * | 2020-07-20 | 2020-09-04 | 成都安智杰科技有限公司 | Convolutional neural network structure, target detection method and device |
CN112363844A (en) * | 2021-01-12 | 2021-02-12 | 之江实验室 | Convolutional neural network vertical segmentation method for image processing |
CN112363844B (en) * | 2021-01-12 | 2021-04-09 | 之江实验室 | Convolutional neural network vertical segmentation method for image processing |
CN112926681A (en) * | 2021-03-29 | 2021-06-08 | 复旦大学 | Target detection method and device based on deep convolutional neural network |
CN113269725A (en) * | 2021-04-28 | 2021-08-17 | 安徽理工大学 | Coal gangue rapid detection method based on imaging technology and deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107423760A (en) | Deep learning object detection method based on pre-segmentation and regression | |
CN105608456B (en) | Multi-directional text detection method based on fully convolutional networks | |
CN109948425B (en) | Pedestrian searching method and device for structure-aware self-attention and online instance aggregation matching | |
CN112861720B (en) | Remote sensing image small sample target detection method based on prototype convolutional neural network | |
CN110276269B (en) | Remote sensing image target detection method based on attention mechanism | |
CN107609525B (en) | Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy | |
CN106778835B (en) | Remote sensing image airport target identification method fusing scene information and deep features | |
CN106875381B (en) | Mobile phone shell defect detection method based on deep learning | |
CN108427912B (en) | Optical remote sensing image target detection method based on dense target feature learning | |
CN110929607B (en) | Remote sensing identification method and system for urban building construction progress | |
CN108596055B (en) | Airport target detection method of high-resolution remote sensing image under complex background | |
CN106228125B (en) | Lane line detection method based on ensemble learning cascade classifier | |
CN111753828B (en) | Natural scene horizontal character detection method based on deep convolutional neural network | |
CN103049763B (en) | Context-constraint-based target identification method | |
CN110569738B (en) | Natural scene text detection method, equipment and medium based on densely connected network | |
CN110175613A (en) | Street view image semantic segmentation method based on multi-scale features and encoder-decoder models | |
CN111445488B (en) | Method for automatic salt body identification and segmentation by weakly supervised learning | |
CN106682569A (en) | Fast traffic sign recognition method based on convolutional neural networks | |
CN107437100A (en) | Image position prediction method based on cross-modal association learning | |
CN107784288A (en) | Iterative localization face detection method based on deep neural networks | |
CN109858327B (en) | Character segmentation method based on deep learning | |
CN109377511B (en) | Moving target tracking method based on sample combination and depth detection network | |
CN111709397A (en) | Unmanned aerial vehicle variable-size target detection method based on multi-head self-attention mechanism | |
CN110472572A (en) | Rapid identification and classification method for naval targets in complex environments | |
CN113221956B (en) | Target identification method and device based on improved multi-scale depth model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20171201 |