CN111259973A - Method for improving mean average precision in a real-time object detection system - Google Patents

Method for improving mean average precision in a real-time object detection system

Info

Publication number
CN111259973A
CN111259973A (application CN202010066060.9A)
Authority
CN
China
Prior art keywords
network
frame
layer
prediction
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010066060.9A
Other languages
Chinese (zh)
Inventor
陈德鹏
贾华宇
李战峰
马珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN202010066060.9A priority Critical patent/CN111259973A/en
Publication of CN111259973A publication Critical patent/CN111259973A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for improving mean average precision (mAP) in a real-time object detection system, belonging to the fields of object detection and image processing. First, batch normalization replaces dropout; a pre-trained model classifier is then used to extract features; the fully-connected layers are removed, turning the whole network into a fully convolutional network; and hand-picked anchor boxes are used for prediction. The object label boxes are clustered with a K-means method so that better box width and height dimensions are found automatically, and direct location prediction is performed by predicting coordinates relative to the grid cell. Once the location predictions are normalized, the parameters are easier to learn and the model is more stable. Using the two anchor-box improvements, dimension clustering and direct location prediction, the mean average precision is significantly improved.

Description

Method for improving mean average precision in a real-time object detection system
Technical Field
The invention belongs to the technical fields of object detection and digital image processing, and relates to a method for improving mean average precision (mAP) in a real-time object detection system.
Background
Deep learning is developing rapidly, and object detection has become a popular research direction with broad application prospects. Real-time object detection is often used in critical areas of production and daily life, which places higher demands on its precision.
Typically, object detection is given an image and must find the objects in it, locate them, and classify them. An object detection model is usually trained on a fixed set of classes, so the model can only locate and classify those classes in an image. Furthermore, the location of a target is typically given as a bounding box. Object detection therefore requires both location information for the objects in the image and their classification.
Two problems are encountered when anchor boxes are used. The first is that the widths and heights of the anchor boxes are usually hand-picked priors. Although the network can learn to adjust the box widths and heights during training and eventually produce accurate object label boxes, if better, more representative prior box dimensions are chosen from the outset, the network can learn accurate predicted locations more easily.
The second problem found when using anchor boxes is that the model is unstable, especially in early iterations. Most of the instability comes from predicting the coordinates of the box: without constraints, any anchor box can end up at any point in the image, regardless of which cell made the prediction. After random initialization, the model takes a long time before it stably predicts sensible offsets.
Therefore, in traditional real-time object detection systems, the prior box dimensions are poorly chosen and the prediction boxes are unstable. As a result, the mean average precision is low, the output of the real-time object detection system is inaccurate, and the recognition results suffer.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a method for improving mean average precision in a real-time object detection system. A regression model is adopted, and the targets in the whole image are obtained with a single network pass. The aim is to improve the mean average precision, thereby improving both detection accuracy and speed.
The invention is realized by the following technical scheme.
A method for improving mean average precision in a real-time object detection system, characterized by comprising the following steps:
1) Batch normalization is added after every convolutional layer in the network; it helps regularize the model, and using batch normalization instead of dropout prevents overfitting.
2) The classification network is trained on a visual database, so that the trained classifier adapts to high-resolution input.
3) The fully-connected layers are removed and the whole network is turned into a fully convolutional network, which can handle inputs of any size. To let the network accept input images of various sizes, the fully-connected layers of the conventional architecture are eliminated, since a fully-connected layer requires fixed-length input and output feature vectors. As a fully convolutional network, inputs of various sizes can be processed; compared with fully-connected layers, the fully convolutional network also better preserves the spatial location information of targets.
The backbone is based on Darknet-improvement. Although traditional Darknet is accurate enough, the model is large and network passes are time-consuming, so the invention proposes an improved Darknet model. The detection network is then formally trained with Darknet-improvement as the pre-trained model.
4) Object label boxes are predicted with anchor boxes: one pooling layer is removed to increase the output resolution of the convolutional layers, and the network input size is then modified so that the feature map has a single center cell.
5) The object label boxes are clustered with a K-means method.
6) Direct location prediction is performed by predicting coordinates relative to the grid cell; the ground-truth values are constrained to between 0 and 1 using a logistic regression function.
7) A passthrough (transfer) layer is added, connecting its shallow feature map to the deep feature map to form fine-grained features. The passthrough layer concatenates the high- and low-resolution feature maps, stacking features into different channels rather than spatial locations.
8) The network is trained at multiple scales so that it can predict on images of different sizes. The network should be robust to images of different sizes, and this is taken into account during training: instead of fixing the input image size, the network is fine-tuned to a new size every few iterations.
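The multi-scale training of step 8 can be sketched as a size schedule; this is a minimal illustration, and the size range 320..608, the 10-batch resize interval, and the random seed are assumptions not fixed by the description above.

```python
import random

def multiscale_size_schedule(num_batches, step=10, seed=0):
    """Return the network input size used for each training batch.

    Every `step` batches a new size is drawn from multiples of 32
    (the network's total downsampling factor).  The range 320..608,
    the 10-batch interval, and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    sizes = [320 + 32 * i for i in range(10)]  # 320, 352, ..., 608
    current = 416                              # default input size
    schedule = []
    for batch in range(num_batches):
        if batch > 0 and batch % step == 0:
            current = rng.choice(sizes)        # network is resized here
        schedule.append(current)
    return schedule

sched = multiscale_size_schedule(30)
# every chosen size maps to a whole-number feature-map size (size // 32)
assert all(s % 32 == 0 for s in sched)
```

Because the network contains only convolutional and pooling layers, each chosen size yields a valid feature-map size without any architectural change.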
Further, in step 2, an ImageNet pre-trained model classifier is used to extract features: the classification network (a custom Darknet) is fine-tuned at 448 × 448, and after 10 epochs of training on the ImageNet dataset the trained network can handle high-resolution input. The detection part of the network (i.e., the latter half) is then also fine-tuned.
Further, in step 4, after removing one pooling layer, the network input size is modified from 448 × 448 to 416 × 416 so that the feature map has a single center cell, since objects (particularly large ones) tend to appear at the center of the image.
Further, the criterion used in step 5 is the IOU score, i.e. the intersection of the boxes divided by their union, and the final distance function is:
d(box, centroid) = 1 − IOU(box, centroid)
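A minimal sketch of this IOU criterion: since only width and height are clustered, both boxes can be assumed aligned at a common corner (a standard convention, not stated explicitly above).

```python
def iou_wh(box, centroid):
    """IOU of two boxes given as (width, height) pairs, assuming both
    are aligned at a common corner (the usual convention when only
    label-box dimensions are clustered)."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)          # overlap area
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

def cluster_distance(box, centroid):
    # d(box, centroid) = 1 - IOU(box, centroid)
    return 1.0 - iou_wh(box, centroid)

assert iou_wh((4, 4), (4, 4)) == 1.0           # identical boxes
assert abs(cluster_distance((2, 2), (4, 4)) - 0.75) < 1e-9
```

With this distance, a large box and a small box at the same aspect ratio are penalized equally, which is the point of replacing the Euclidean distance.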
Further, in step 6, (x, y) are the coordinates of the prediction box; in the region proposal network, (x, y) are obtained from the predictions t_x and t_y by the formulas:
x = (t_x · w_a) − x_a
y = (t_y · h_a) − y_a
Predicting t_x = 1 shifts the box to the right by the width of the anchor box, and predicting t_x = −1 shifts it to the left by the same distance.
Compared with the prior art, the invention has the beneficial effects that.
The invention adopts a regression model: the targets in the whole image are obtained with a single network pass, so the speed is significantly increased, and the improved mean average precision greatly improves detection accuracy. Direct location prediction is performed by predicting coordinates relative to the grid cell; once the location predictions are normalized, the parameters are easier to learn and the model is more stable. Using the two anchor-box improvements, dimension clustering and direct location prediction, the mean average precision is significantly improved.
Drawings
FIG. 1 is a flow chart of the method for improving mean average precision in a real-time object detection system according to the present invention.
FIG. 2 is a schematic diagram of direct location prediction according to the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects addressed by the present invention clearer, the present invention is described in further detail with reference to the embodiments and the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. The technical solutions of the present invention are described in detail below with reference to the embodiments and the drawings, but the scope of protection is not limited thereto.
As shown in fig. 1, the flowchart of the method for improving mean average precision in a real-time object detection system specifically includes the following steps:
1) Batch normalization is added after every convolutional layer in the network; it helps regularize the model, so that overfitting is still prevented even after dropout is removed.
2) The classification network (a custom Darknet) is fine-tuned at 448 × 448; after 10 epochs of training on the ImageNet dataset, the trained network can handle high-resolution input. The detection part of the network (i.e., the latter half) is then also fine-tuned.
3) To let the network accept input images of various sizes, the fully-connected layers of the conventional architecture are eliminated, since a fully-connected layer requires fixed-length input and output feature vectors. The whole network becomes a fully convolutional network that can process inputs of various sizes; compared with fully-connected layers, it also better preserves the spatial location information of targets.
4) The backbone is based on Darknet-improvement. Although traditional Darknet is accurate enough, the model is large and network passes are time-consuming, so the invention proposes an improved Darknet model. The detection network is then formally trained with Darknet-improvement as the pre-trained model.
5) The fully-connected layers of the conventional network are removed, and bounding boxes are predicted with anchor boxes. First, one pooling layer is removed to increase the output resolution of the convolutional layers. Then the network input size is modified from 448 × 448 to 416 × 416 so that the feature map has a single center cell, since objects (particularly large ones) tend to appear at the center of the image. The total downsampling rate of the convolutional layers is 32, so an input size of 416 gives an output feature map of 13 × 13. Anchor boxes are then adopted.
With anchor boxes added, the expected result is an increase in recall and a small decrease in precision. If each grid cell predicts 9 proposal boxes, a total of 13 × 13 × 9 = 1521 boxes are predicted, whereas the previous network predicted only 7 × 7 × 2 = 98 boxes. The specific figures: without anchor boxes, the model recall is 81% with a mean average precision of 69.5%; with anchor boxes, the recall is 88% with a mean average precision of 69.2%. The precision thus drops only slightly while recall improves by 7%, showing that the precision can be recovered with further work and that there is room for improvement.
6) A K-means clustering method is applied to the bounding boxes so that better box width and height dimensions are found automatically. Traditional K-means uses the Euclidean distance, which means larger boxes generate larger errors than smaller boxes and the clustering result may be biased. The criterion used in the invention is therefore the IOU score (the intersection of the boxes divided by their union), which makes the error independent of box size, with the final distance function:
d(box, centroid) = 1 − IOU(box, centroid)
7) The problem found when using anchor boxes is that the model is unstable, especially in early iterations, and most of the instability comes from predicting the (x, y) coordinates of the box. In a region proposal network, (x, y) are obtained from the predictions t_x and t_y by the formulas:
x = (t_x · w_a) − x_a
y = (t_y · h_a) − y_a
The interpretation of these formulas: predicting t_x = 1 shifts the box to the right by the width of the anchor box, and predicting t_x = −1 shifts it to the left by the same distance.
This formulation is unconstrained, so any anchor box can end up at any point in the image, regardless of which cell made the prediction; after random initialization, the model takes a long time before it stably predicts sensible offsets. Here, instead of predicting direct offsets, coordinates are predicted relative to the grid cell, and the ground-truth values are constrained to between 0 and 1 with a logistic regression function.
8) A passthrough (transfer) layer is added that links the shallow feature map (resolution 26 × 26, four times the area of the deep 13 × 13 map) to the deep feature map. The passthrough layer concatenates the high- and low-resolution feature maps, stacking features into different channels rather than spatial locations, which gives the network better fine-grained features.
9) To make the network robust to images of different sizes, this is also taken into account during training: instead of fixing the input image size, the network is fine-tuned to a new size every few iterations, and training then continues at the adjusted input size.
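The channel-stacking connection of the passthrough layer in step 8 can be sketched as a space-to-depth rearrangement; this pure-Python version over nested lists is an illustration, and the 2 × 2 block size follows from halving a 26 × 26 map to 13 × 13.

```python
def space_to_depth(fmap, block=2):
    """Rearrange an H x W x C feature map into
    (H/block) x (W/block) x (C*block*block), stacking each spatial
    block into channels: the 'superpose features onto different
    channels instead of spatial locations' connection."""
    H, W, C = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0] * (C * block * block) for _ in range(W // block)]
           for _ in range(H // block)]
    for i in range(H):
        for j in range(W):
            oi, oj = i // block, j // block
            # channel offset determined by position inside the block
            off = ((i % block) * block + (j % block)) * C
            for c in range(C):
                out[oi][oj][off + c] = fmap[i][j][c]
    return out

fm = [[[i * 10 + j] for j in range(4)] for i in range(4)]  # 4x4x1 map
out = space_to_depth(fm)
assert len(out) == 2 and len(out[0]) == 2 and len(out[0][0]) == 4
assert out[0][0] == [0, 1, 10, 11]  # the 2x2 block stacked as channels
```

Applied to a 26 × 26 × C shallow map, this yields a 13 × 13 × 4C tensor that can be concatenated channel-wise with the deep 13 × 13 map.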
In step 1, batch normalization is used instead of dropout to prevent overfitting; with this method the mean average precision is noticeably improved.
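What the batch normalization of step 1 computes per channel can be sketched as follows; this is a minimal pure-Python illustration, with gamma and beta standing in for the learned scale and shift.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations for one channel to zero mean
    and unit variance, then apply the learned scale (gamma) and
    shift (beta).  A sketch of the normalization inserted after each
    convolutional layer."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta
            for v in values]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
assert abs(sum(out)) < 1e-6   # normalized batch has (near) zero mean
```

During training, mean and variance are taken over the mini-batch as here; at inference a running estimate would be used instead.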
Step 2: features are extracted with an ImageNet pre-trained model classifier, so that raising the input resolution noticeably improves the mean average precision.
Step 3: the fully-connected layers are removed, and the whole network becomes a fully convolutional network that can handle inputs of various sizes; compared with fully-connected layers, it also better preserves the spatial location information of targets.
Step 4 proposes the new base convolutional network model (Darknet-improvement).
Step 5 removes the fully-connected layers of the network and predicts bounding boxes with anchor boxes, improving recall.
Step 6 clusters the bounding boxes with a K-means method, so that better box width and height dimensions are found automatically.
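The clustering of step 6 can be sketched as a minimal pure-Python K-means with d = 1 − IOU as the distance; the deterministic initialization and the mean-of-dimensions centroid update are illustrative assumptions, not choices fixed by the description above.

```python
def iou_wh(a, b):
    """IOU of two (width, height) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_boxes(boxes, k, iters=20):
    """K-means over label-box (width, height) pairs with the
    d = 1 - IOU distance.  Centroids start from the first k boxes
    and are updated as per-cluster means (assumed for illustration)."""
    centroids = boxes[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # assign each box to the centroid with the smallest 1 - IOU
            j = min(range(k), key=lambda i: 1 - iou_wh(b, centroids[i]))
            clusters[j].append(b)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

boxes = [(1, 1), (1.2, 0.9), (4, 4), (4.5, 3.8)]
cents = kmeans_boxes(boxes, 2)
assert len(cents) == 2  # one small-box prior and one large-box prior
```

The resulting centroids serve as the anchor-box width and height priors of step 5.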
Step 7 predicts coordinates relative to the grid cell, constraining the ground-truth values to between 0 and 1 with a logistic regression function.
Step 8 adds a passthrough (transfer) layer that links the shallow feature map (resolution 26 × 26, four times the area of the deep map) to the deep feature map.
Finally, since the network uses only convolutional and pooling layers, it can be resized dynamically. This mechanism lets the network predict well on images of different sizes, so the same network can run detection tasks at different resolutions; on small images the network runs faster, balancing speed and precision.
As shown in fig. 1, the method for improving mean average precision in a real-time object detection system mainly comprises the following modules: batch normalization, a high-resolution classifier, a fully convolutional network, a new base convolutional network, anchor boxes, dimension clustering, direct location prediction, fine-grained features, and multi-scale training.
As shown in fig. 2, the specific process of direct location prediction is as follows: the neural network predicts 5 object label boxes (from the clustered values) on each cell of the 13 × 13 feature map, and each box predicts 5 values: t_x, t_y, t_w, t_h, t_0. If the cell is offset from the top-left corner of the image by (c_x, c_y), and the prior (anchor) box for that cell has width and height (p_w, p_h), then the predicted values are:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
Pr(object) · IOU(b, object) = σ(t_0)
Once the location predictions are normalized, the parameters are easier to learn and the model is more stable. Using the two anchor-box improvements, dimension clustering and direct location prediction, the mean average precision improves by 5%.
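Decoding the five raw predictions under the formulas above can be sketched as follows; the cell offsets and prior sizes passed in are illustrative values.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, t0, cx, cy, pw, ph):
    """Decode raw predictions into a box, following the formulas above:
    the sigmoid keeps the centre inside the grid cell at offset
    (cx, cy), and the exponential scales the prior (anchor) box
    dimensions (pw, ph)."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    confidence = sigmoid(t0)   # Pr(object) * IOU(b, object)
    return bx, by, bw, bh, confidence

bx, by, bw, bh, conf = decode_box(0.0, 0.0, 0.0, 0.0, 0.0,
                                  cx=6, cy=6, pw=2.0, ph=3.0)
assert (bx, by) == (6.5, 6.5)  # zero offsets put the centre mid-cell
assert (bw, bh) == (2.0, 3.0)  # zero log-scales keep the prior size
```

Because the sigmoid is bounded, the box centre can never leave its grid cell, which is exactly the stabilization this step provides.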
Unlike the prior art, the invention provides a method for improving mean average precision in a real-time object detection system: a regression model is adopted, and the targets in the whole image are obtained with a single network pass, so the method is fast. With these additional techniques the mean average precision is improved, so the detection accuracy is greatly improved.
While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A method for improving mean average precision in a real-time object detection system, characterized by comprising the following steps:
1) adding batch normalization after every convolutional layer in the network, using batch normalization instead of dropout to prevent overfitting;
2) training the classification network on a visual database, so that the trained classifier adapts to high-resolution input;
3) removing the fully-connected layers and turning the whole network into a fully convolutional network, which can handle inputs of various sizes;
4) predicting object label boxes with anchor boxes: removing one pooling layer to increase the output resolution of the convolutional layers, then modifying the network input size so that the feature map has a single center cell;
5) clustering the object label boxes with a K-means method;
6) performing direct location prediction by predicting coordinates relative to the grid cell;
7) adding a passthrough (transfer) layer, connecting its shallow feature map to the deep feature map to form fine-grained features;
8) training the network at multiple scales, so that it can predict on images of different sizes.
2. The method of claim 1, wherein in step 2 an ImageNet pre-trained model classifier is used to extract features: the resolution is 448 × 448, 10 epochs of training are performed on the ImageNet dataset, and the trained network then handles high-resolution input.
3. The method of claim 2, wherein in step 4, after removing one pooling layer, the network input size is modified from 448 × 448 to 416 × 416 so that the feature map has a single center cell.
4. The method of claim 1, wherein the criterion used in step 5 is the IOU score, i.e. the intersection of the boxes divided by their union, and the final distance function is:
d(box, centroid) = 1 − IOU(box, centroid)
5. The method of claim 1, wherein (x, y) in step 6 are the coordinates of a prediction box, and in the region proposal network (x, y) are obtained from the predictions t_x and t_y by the formulas:
x = (t_x · w_a) − x_a
y = (t_y · h_a) − y_a
wherein predicting t_x = 1 shifts the box to the right by the width of the anchor box, and predicting t_x = −1 shifts it to the left by the same distance.
CN202010066060.9A 2020-01-20 2020-01-20 Method for improving average value average precision in real-time target detection system Pending CN111259973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010066060.9A CN111259973A (en) 2020-01-20 2020-01-20 Method for improving average value average precision in real-time target detection system


Publications (1)

Publication Number Publication Date
CN111259973A true CN111259973A (en) 2020-06-09

Family

ID=70952459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010066060.9A Pending CN111259973A (en) 2020-01-20 2020-01-20 Method for improving average value average precision in real-time target detection system

Country Status (1)

Country Link
CN (1) CN111259973A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950527A (en) * 2020-08-31 2020-11-17 珠海大横琴科技发展有限公司 Target detection method and device based on YOLO V2 neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255375A (en) * 2018-08-29 2019-01-22 长春博立电子科技有限公司 Panoramic picture method for checking object based on deep learning
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110660052A (en) * 2019-09-23 2020-01-07 武汉科技大学 Hot-rolled strip steel surface defect detection method based on deep learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NODYOUNG: "Summary of learning on deep-learning-based object detection" (in Chinese), 《HTTPS://BLOG.CSDN.NET/NNNNNNNNNNNNY/ARTICLE/DETAILS/68483053》 *
ZHOU ZHIGANG: ""Vehicle target detection based on R-FCN"", 《2018 CHINESE CONTROL AND DECISION CONFERENCE (CCDC)》 *



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200609