CN111259973A - Method for improving mean average precision in a real-time object detection system - Google Patents

Method for improving mean average precision in a real-time object detection system

Info

Publication number
CN111259973A
CN111259973A (application CN202010066060.9A)
Authority
CN
China
Prior art keywords
network
frame
layer
prediction
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010066060.9A
Other languages
Chinese (zh)
Inventor
陈德鹏
贾华宇
李战峰
马珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN202010066060.9A priority Critical patent/CN111259973A/en
Publication of CN111259973A publication Critical patent/CN111259973A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for improving mean average precision (mAP) in a real-time object detection system, belonging to the fields of object detection and image processing. First, batch normalization replaces dropout; a pre-trained model classifier is then used to extract features; the fully-connected layers are removed, turning the whole network into a fully convolutional network; and hand-picked anchor boxes are used for prediction. The object label boxes are clustered with a K-means method so that better box width and height dimensions are found automatically, and direct location prediction is performed by predicting coordinates relative to the grid cell. Once the location predictions are normalized, the parameters are easier to learn and the model is more stable. Using the two anchor-box improvements, dimension clustering and direct location prediction, the mean average precision is significantly improved.

Description

Method for improving mean average precision in a real-time object detection system
Technical Field
The invention belongs to the technical fields of object detection and digital image processing, and relates to a method for improving mean average precision (mAP) in a real-time object detection system.
Background
Deep learning is developing rapidly, and object detection has become a popular research direction with broad application prospects. Real-time object detection is often used in critical areas of production and daily life, which places higher demands on its precision.
Typically, object detection is given an image and must find the objects in it, locate them, and classify them. An object detection model is usually trained on a fixed set of classes, so the model can only locate and classify those classes in an image. Furthermore, the location of a target is typically given as a bounding box. Object detection therefore requires both location information for the objects in the image and their classification.
Two problems are encountered when anchor boxes are used. The first is that the widths and heights of the anchor boxes are usually hand-picked priors. Although the network can learn to adjust the box widths and heights during training and eventually produce accurate object label boxes, if better, more representative prior box dimensions are chosen from the outset, the network can learn accurate predicted locations more easily.
The second problem found when using anchor boxes is that the model is unstable, especially in early iterations. Most of the instability comes from predicting the coordinates of the box: without constraints, any anchor box can end up at any point in the image, regardless of which cell made the prediction. After random initialization, the model takes a long time before it stably predicts sensible offsets.
Therefore, in traditional real-time object detection systems, the prior box dimensions are poorly chosen and the prediction boxes are unstable. As a result, the mean average precision is low, the output of the real-time object detection system is inaccurate, and the recognition results suffer.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a method for improving mean average precision in a real-time object detection system. A regression model is adopted, and the targets in the whole image are obtained with a single network pass. The aim is to improve the mean average precision, thereby improving both detection accuracy and speed.
The invention is realized by the following technical scheme.
A method for improving mean average precision in a real-time object detection system, characterized by comprising the following steps:
1) Batch normalization is added after every convolutional layer in the network; it helps regularize the model, and using batch normalization instead of dropout prevents overfitting.
2) The classification network is trained on a visual database, so that the trained classifier adapts to high-resolution input.
3) The fully-connected layers are removed and the whole network is turned into a fully convolutional network, which can handle inputs of any size. To let the network accept input images of various sizes, the fully-connected layers of the conventional architecture are eliminated, since a fully-connected layer requires fixed-length input and output feature vectors. As a fully convolutional network, inputs of various sizes can be processed; compared with fully-connected layers, the fully convolutional network also better preserves the spatial location information of targets.
The backbone is based on Darknet-improvement. Although traditional Darknet is accurate enough, the model is large and network passes are time-consuming, so the invention proposes an improved Darknet model. The detection network is then formally trained with Darknet-improvement as the pre-trained model.
4) Object label boxes are predicted with anchor boxes: one pooling layer is removed to increase the output resolution of the convolutional layers, and the network input size is then modified so that the feature map has a single center cell.
5) The object label boxes are clustered with a K-means method.
6) Direct location prediction is performed by predicting coordinates relative to the grid cell; the ground-truth values are constrained to between 0 and 1 using a logistic regression function.
7) A passthrough (transfer) layer is added, connecting its shallow feature map to the deep feature map to form fine-grained features. The passthrough layer concatenates the high- and low-resolution feature maps, stacking features into different channels rather than spatial locations.
8) The network is trained at multiple scales so that it can predict on images of different sizes. The network should be robust to images of different sizes, and this is taken into account during training: instead of fixing the input image size, the network is fine-tuned to a new size every few iterations.
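The multi-scale training of step 8 can be sketched as a size schedule; this is a minimal illustration, and the size range 320..608, the 10-batch resize interval, and the random seed are assumptions not fixed by the description above.

```python
import random

def multiscale_size_schedule(num_batches, step=10, seed=0):
    """Return the network input size used for each training batch.

    Every `step` batches a new size is drawn from multiples of 32
    (the network's total downsampling factor).  The range 320..608,
    the 10-batch interval, and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    sizes = [320 + 32 * i for i in range(10)]  # 320, 352, ..., 608
    current = 416                              # default input size
    schedule = []
    for batch in range(num_batches):
        if batch > 0 and batch % step == 0:
            current = rng.choice(sizes)        # network is resized here
        schedule.append(current)
    return schedule

sched = multiscale_size_schedule(30)
# every chosen size maps to a whole-number feature-map size (size // 32)
assert all(s % 32 == 0 for s in sched)
```

Because the network contains only convolutional and pooling layers, each chosen size yields a valid feature-map size without any architectural change.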
Further, in step 2, an ImageNet pre-trained model classifier is used to extract features: the classification network (a custom Darknet) is fine-tuned at 448 × 448, and after 10 epochs of training on the ImageNet dataset the trained network can handle high-resolution input. The detection part of the network (i.e., the latter half) is then also fine-tuned.
Further, in step 4, after removing one pooling layer, the network input size is modified from 448 × 448 to 416 × 416 so that the feature map has a single center cell, since objects (particularly large ones) tend to appear at the center of the image.
Further, the criterion used in step 5 is the IOU score, i.e. the intersection of the boxes divided by their union, and the final distance function is:
d(box, centroid) = 1 − IOU(box, centroid)
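A minimal sketch of this IOU criterion: since only width and height are clustered, both boxes can be assumed aligned at a common corner (a standard convention, not stated explicitly above).

```python
def iou_wh(box, centroid):
    """IOU of two boxes given as (width, height) pairs, assuming both
    are aligned at a common corner (the usual convention when only
    label-box dimensions are clustered)."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)          # overlap area
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

def cluster_distance(box, centroid):
    # d(box, centroid) = 1 - IOU(box, centroid)
    return 1.0 - iou_wh(box, centroid)

assert iou_wh((4, 4), (4, 4)) == 1.0           # identical boxes
assert abs(cluster_distance((2, 2), (4, 4)) - 0.75) < 1e-9
```

With this distance, a large box and a small box at the same aspect ratio are penalized equally, which is the point of replacing the Euclidean distance.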
Further, in step 6, (x, y) are the coordinates of the prediction box; in the region proposal network, (x, y) are obtained from the predictions t_x and t_y by the formulas:
x = (t_x · w_a) − x_a
y = (t_y · h_a) − y_a
Predicting t_x = 1 shifts the box to the right by the width of the anchor box, and predicting t_x = −1 shifts it to the left by the same distance.
Compared with the prior art, the invention has the beneficial effects that.
The invention adopts a regression model: the targets in the whole image are obtained with a single network pass, so the speed is significantly increased, and the improved mean average precision greatly improves detection accuracy. Direct location prediction is performed by predicting coordinates relative to the grid cell; once the location predictions are normalized, the parameters are easier to learn and the model is more stable. Using the two anchor-box improvements, dimension clustering and direct location prediction, the mean average precision is significantly improved.
Drawings
FIG. 1 is a flow chart of the method for improving mean average precision in a real-time object detection system according to the present invention.
FIG. 2 is a schematic diagram of direct location prediction according to the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects addressed by the present invention clearer, the present invention is described in further detail with reference to the embodiments and the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. The technical solutions of the present invention are described in detail below with reference to the embodiments and the drawings, but the scope of protection is not limited thereto.
As shown in fig. 1, the flowchart of the method for improving mean average precision in a real-time object detection system specifically includes the following steps:
1) Batch normalization is added after every convolutional layer in the network; it helps regularize the model, so that overfitting is still prevented even after dropout is removed.
2) The classification network (a custom Darknet) is fine-tuned at 448 × 448; after 10 epochs of training on the ImageNet dataset, the trained network can handle high-resolution input. The detection part of the network (i.e., the latter half) is then also fine-tuned.
3) To let the network accept input images of various sizes, the fully-connected layers of the conventional architecture are eliminated, since a fully-connected layer requires fixed-length input and output feature vectors. The whole network becomes a fully convolutional network that can process inputs of various sizes; compared with fully-connected layers, it also better preserves the spatial location information of targets.
4) The backbone is based on Darknet-improvement. Although traditional Darknet is accurate enough, the model is large and network passes are time-consuming, so the invention proposes an improved Darknet model. The detection network is then formally trained with Darknet-improvement as the pre-trained model.
5) The fully-connected layers of the conventional network are removed, and bounding boxes are predicted with anchor boxes. First, one pooling layer is removed to increase the output resolution of the convolutional layers. Then the network input size is modified from 448 × 448 to 416 × 416 so that the feature map has a single center cell, since objects (particularly large ones) tend to appear at the center of the image. The total downsampling rate of the convolutional layers is 32, so an input size of 416 gives an output feature map of 13 × 13. Anchor boxes are then adopted.
With anchor boxes added, the expected result is an increase in recall and a small decrease in precision. If each grid cell predicts 9 proposal boxes, a total of 13 × 13 × 9 = 1521 boxes are predicted, whereas the previous network predicted only 7 × 7 × 2 = 98 boxes. The specific figures: without anchor boxes, the model recall is 81% with a mean average precision of 69.5%; with anchor boxes, the recall is 88% with a mean average precision of 69.2%. The precision thus drops only slightly while recall improves by 7%, showing that the precision can be recovered with further work and that there is room for improvement.
6) A K-means clustering method is applied to the bounding boxes so that better box width and height dimensions are found automatically. Traditional K-means uses the Euclidean distance, which means larger boxes generate larger errors than smaller boxes and the clustering result may be biased. The criterion used in the invention is therefore the IOU score (the intersection of the boxes divided by their union), which makes the error independent of box size, with the final distance function:
d(box, centroid) = 1 − IOU(box, centroid)
7) The problem found when using anchor boxes is that the model is unstable, especially in early iterations, and most of the instability comes from predicting the (x, y) coordinates of the box. In a region proposal network, (x, y) are obtained from the predictions t_x and t_y by the formulas:
x = (t_x · w_a) − x_a
y = (t_y · h_a) − y_a
The interpretation of these formulas: predicting t_x = 1 shifts the box to the right by the width of the anchor box, and predicting t_x = −1 shifts it to the left by the same distance.
This formulation is unconstrained, so any anchor box can end up at any point in the image, regardless of which cell made the prediction; after random initialization, the model takes a long time before it stably predicts sensible offsets. Here, instead of predicting direct offsets, coordinates are predicted relative to the grid cell, and the ground-truth values are constrained to between 0 and 1 with a logistic regression function.
8) A passthrough (transfer) layer is added that links the shallow feature map (resolution 26 × 26, four times the area of the deep 13 × 13 map) to the deep feature map. The passthrough layer concatenates the high- and low-resolution feature maps, stacking features into different channels rather than spatial locations, which gives the network better fine-grained features.
9) To make the network robust to images of different sizes, this is also taken into account during training: instead of fixing the input image size, the network is fine-tuned to a new size every few iterations, and training then continues at the adjusted input size.
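The channel-stacking connection of the passthrough layer in step 8 can be sketched as a space-to-depth rearrangement; this pure-Python version over nested lists is an illustration, and the 2 × 2 block size follows from halving a 26 × 26 map to 13 × 13.

```python
def space_to_depth(fmap, block=2):
    """Rearrange an H x W x C feature map into
    (H/block) x (W/block) x (C*block*block), stacking each spatial
    block into channels: the 'superpose features onto different
    channels instead of spatial locations' connection."""
    H, W, C = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0] * (C * block * block) for _ in range(W // block)]
           for _ in range(H // block)]
    for i in range(H):
        for j in range(W):
            oi, oj = i // block, j // block
            # channel offset determined by position inside the block
            off = ((i % block) * block + (j % block)) * C
            for c in range(C):
                out[oi][oj][off + c] = fmap[i][j][c]
    return out

fm = [[[i * 10 + j] for j in range(4)] for i in range(4)]  # 4x4x1 map
out = space_to_depth(fm)
assert len(out) == 2 and len(out[0]) == 2 and len(out[0][0]) == 4
assert out[0][0] == [0, 1, 10, 11]  # the 2x2 block stacked as channels
```

Applied to a 26 × 26 × C shallow map, this yields a 13 × 13 × 4C tensor that can be concatenated channel-wise with the deep 13 × 13 map.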
In step 1, batch normalization is used instead of dropout to prevent overfitting; with this method the mean average precision is noticeably improved.
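What the batch normalization of step 1 computes per channel can be sketched as follows; this is a minimal pure-Python illustration, with gamma and beta standing in for the learned scale and shift.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations for one channel to zero mean
    and unit variance, then apply the learned scale (gamma) and
    shift (beta).  A sketch of the normalization inserted after each
    convolutional layer."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta
            for v in values]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
assert abs(sum(out)) < 1e-6   # normalized batch has (near) zero mean
```

During training, mean and variance are taken over the mini-batch as here; at inference a running estimate would be used instead.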
Step 2: features are extracted with an ImageNet pre-trained model classifier, so that raising the input resolution noticeably improves the mean average precision.
Step 3: the fully-connected layers are removed, and the whole network becomes a fully convolutional network that can handle inputs of various sizes; compared with fully-connected layers, it also better preserves the spatial location information of targets.
Step 4 proposes the new base convolutional network model (Darknet-improvement).
Step 5 removes the fully-connected layers of the network and predicts bounding boxes with anchor boxes, improving recall.
Step 6 clusters the bounding boxes with a K-means method, so that better box width and height dimensions are found automatically.
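The clustering of step 6 can be sketched as a minimal pure-Python K-means with d = 1 − IOU as the distance; the deterministic initialization and the mean-of-dimensions centroid update are illustrative assumptions, not choices fixed by the description above.

```python
def iou_wh(a, b):
    """IOU of two (width, height) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_boxes(boxes, k, iters=20):
    """K-means over label-box (width, height) pairs with the
    d = 1 - IOU distance.  Centroids start from the first k boxes
    and are updated as per-cluster means (assumed for illustration)."""
    centroids = boxes[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # assign each box to the centroid with the smallest 1 - IOU
            j = min(range(k), key=lambda i: 1 - iou_wh(b, centroids[i]))
            clusters[j].append(b)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

boxes = [(1, 1), (1.2, 0.9), (4, 4), (4.5, 3.8)]
cents = kmeans_boxes(boxes, 2)
assert len(cents) == 2  # one small-box prior and one large-box prior
```

The resulting centroids serve as the anchor-box width and height priors of step 5.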
Step 7 predicts coordinates relative to the grid cell, constraining the ground-truth values to between 0 and 1 with a logistic regression function.
Step 8 adds a passthrough (transfer) layer that links the shallow feature map (resolution 26 × 26, four times the area of the deep map) to the deep feature map.
Finally, since the network uses only convolutional and pooling layers, it can be resized dynamically. This mechanism lets the network predict well on images of different sizes, so the same network can run detection tasks at different resolutions; on small images the network runs faster, balancing speed and precision.
As shown in fig. 1, the method for improving mean average precision in a real-time object detection system mainly comprises the following modules: batch normalization, a high-resolution classifier, a fully convolutional network, a new base convolutional network, anchor boxes, dimension clustering, direct location prediction, fine-grained features, and multi-scale training.
As shown in fig. 2, the specific process of direct location prediction is as follows: the neural network predicts 5 object label boxes (from the clustered values) on each cell of the 13 × 13 feature map, and each box predicts 5 values: t_x, t_y, t_w, t_h, t_0. If the cell is offset from the top-left corner of the image by (c_x, c_y), and the prior (anchor) box for that cell has width and height (p_w, p_h), then the predicted values are:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
Pr(object) · IOU(b, object) = σ(t_0)
Once the location predictions are normalized, the parameters are easier to learn and the model is more stable. Using the two anchor-box improvements, dimension clustering and direct location prediction, the mean average precision improves by 5%.
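Decoding the five raw predictions under the formulas above can be sketched as follows; the cell offsets and prior sizes passed in are illustrative values.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, t0, cx, cy, pw, ph):
    """Decode raw predictions into a box, following the formulas above:
    the sigmoid keeps the centre inside the grid cell at offset
    (cx, cy), and the exponential scales the prior (anchor) box
    dimensions (pw, ph)."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    confidence = sigmoid(t0)   # Pr(object) * IOU(b, object)
    return bx, by, bw, bh, confidence

bx, by, bw, bh, conf = decode_box(0.0, 0.0, 0.0, 0.0, 0.0,
                                  cx=6, cy=6, pw=2.0, ph=3.0)
assert (bx, by) == (6.5, 6.5)  # zero offsets put the centre mid-cell
assert (bw, bh) == (2.0, 3.0)  # zero log-scales keep the prior size
```

Because the sigmoid is bounded, the box centre can never leave its grid cell, which is exactly the stabilization this step provides.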
Unlike the prior art, the invention provides a method for improving mean average precision in a real-time object detection system: a regression model is adopted, and the targets in the whole image are obtained with a single network pass, so the method is fast. With these additional techniques the mean average precision is improved, so the detection accuracy is greatly improved.
While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A method for improving mean average precision in a real-time object detection system, characterized by comprising the following steps:
1) adding batch normalization after every convolutional layer in the network, using batch normalization instead of dropout to prevent overfitting;
2) training the classification network on a visual database, so that the trained classifier adapts to high-resolution input;
3) removing the fully-connected layers and turning the whole network into a fully convolutional network, which can handle inputs of various sizes;
4) predicting object label boxes with anchor boxes: removing one pooling layer to increase the output resolution of the convolutional layers, then modifying the network input size so that the feature map has a single center cell;
5) clustering the object label boxes with a K-means method;
6) performing direct location prediction by predicting coordinates relative to the grid cell;
7) adding a passthrough (transfer) layer, connecting its shallow feature map to the deep feature map to form fine-grained features;
8) training the network at multiple scales, so that it can predict on images of different sizes.
2. The method of claim 1, wherein in step 2 an ImageNet pre-trained model classifier is used to extract features: the resolution is 448 × 448, 10 epochs of training are performed on the ImageNet dataset, and the trained network then handles high-resolution input.
3. The method of claim 2, wherein in step 4, after removing one pooling layer, the network input size is modified from 448 × 448 to 416 × 416 so that the feature map has a single center cell.
4. The method of claim 1, wherein the criterion used in step 5 is the IOU score, i.e. the intersection of the boxes divided by their union, and the final distance function is:
d(box, centroid) = 1 − IOU(box, centroid)
5. The method of claim 1, wherein (x, y) in step 6 are the coordinates of a prediction box, and in the region proposal network (x, y) are obtained from the predictions t_x and t_y by the formulas:
x = (t_x · w_a) − x_a
y = (t_y · h_a) − y_a
wherein predicting t_x = 1 shifts the box to the right by the width of the anchor box, and predicting t_x = −1 shifts it to the left by the same distance.
CN202010066060.9A 2020-01-20 2020-01-20 Method for improving average value average precision in real-time target detection system Pending CN111259973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010066060.9A CN111259973A (en) 2020-01-20 2020-01-20 Method for improving average value average precision in real-time target detection system


Publications (1)

Publication Number Publication Date
CN111259973A true CN111259973A (en) 2020-06-09

Family

ID=70952459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010066060.9A Pending CN111259973A (en) 2020-01-20 2020-01-20 Method for improving average value average precision in real-time target detection system

Country Status (1)

Country Link
CN (1) CN111259973A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950527A (en) * 2020-08-31 2020-11-17 珠海大横琴科技发展有限公司 Target detection method and device based on YOLO V2 neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255375A (en) * 2018-08-29 2019-01-22 长春博立电子科技有限公司 Panoramic picture method for checking object based on deep learning
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110660052A (en) * 2019-09-23 2020-01-07 武汉科技大学 Hot-rolled strip steel surface defect detection method based on deep learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NODYOUNG: "Summary of learning on deep-learning-based object detection" (in Chinese), 《HTTPS://BLOG.CSDN.NET/NNNNNNNNNNNNY/ARTICLE/DETAILS/68483053》 *
ZHOU ZHIGANG: ""Vehicle target detection based on R-FCN"", 《2018 CHINESE CONTROL AND DECISION CONFERENCE (CCDC)》 *



Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200609