CN111353440A - Target detection method - Google Patents

Target detection method

Info

Publication number
CN111353440A
Authority
CN
China
Prior art keywords
training
target
model
data set
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010137294.8A
Other languages
Chinese (zh)
Inventor
刘建闽
向钰
彭小华
阎晶亮
黄嵩衍
胡波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Humanities Science and Technology
Guangxi University of Finance and Economics
Original Assignee
Hunan University of Humanities Science and Technology
Guangxi University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Humanities Science and Technology, Guangxi University of Finance and Economics
Publication of CN111353440A
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method comprising the following steps: constructing a video target detection and recognition model; inputting the labeled data of a training data set into a fully convolutional network; optimizing a composite loss value generated as a weighted linear combination of independent loss functions, and obtaining the overall loss function by minimization during training. At the last layer of the network model, the video target detection and recognition model uses a logistic activation function to compute the classification confidence value and the rectangular selection box. The invention improves both the accuracy and the real-time performance of target detection and recognition in high-frame-rate high-definition video.

Description

Target detection method
Technical Field
The invention belongs to the technical field of video identification, and particularly relates to a target detection method.
Background
With the rapid development of fields such as intelligent monitoring and intelligent transportation, the growth of high-frame-rate high-definition data sources, and the high requirements and complex, variable nature of practical application scenes, classical methods can no longer meet the latest demands: they cannot deliver the accuracy and real-time performance required for target detection and recognition in high-frame-rate high-definition video.
Therefore, how to provide a target detection method with high accuracy and real-time performance is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a target detection method that solves the accuracy and real-time problems of target detection and recognition in high-frame-rate high-definition video under demanding, complex and variable practical application scenes.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method of target detection, comprising:
constructing a video target detection and recognition model, inputting the labeled data of a training data set into a fully convolutional network through a high-speed storage medium, optimizing a composite loss value generated as a weighted linear combination of independent loss functions, and obtaining the overall loss function by minimization during training;
all layers of the video target detection and recognition model except the last use a leaky rectified linear unit (Leaky ReLU) as the activation function, and the last layer of the network model uses a logistic (sigmoid) activation function to compute the classification confidence value and the rectangular selection box, thereby realizing real-time detection and recognition of vehicle and pedestrian targets in video.
Preferably, the method further comprises testing the video target detection and recognition model: the verification data of a verification data set are fed to the optimized fully convolutional network to obtain predicted rectangular selection boxes, which are compared against the true rectangular selection boxes of the verification data set to compute the mean of the area under the function curve.
Preferably, pre-training weights and thresholds are initialized and input to the fully convolutional network, with a learning rate decay strategy applied during learning.
Preferably, the pre-training weights and thresholds are initialized as follows: before training, the weights and thresholds are initialized with the parameters of a pre-trained model obtained by training on a standard data set.
Preferably, the learning rate decay strategy comprises the following steps: model parameters are initialized before training with the parameters obtained for the video target detection and recognition model, training uses mini-batch gradient descent, and the learning rate lr of each iteration is initially set to 0.001; the model can then output the confidence value and selection box coordinates of each relevant target class in each partition.
Preferably, the composite loss value generated as a weighted linear combination of independent loss functions is optimized, and the overall loss function is obtained by minimization during training, as follows:
setting Si(xg,ygg,hg) True rectangular selection box, x, for the targetg,ygIs the central point;
setting S (x, y, omega, h) as a prediction rectangle selection frame of a target, wherein x and y are central points;
wherein, x and y are coordinates of the central point of the prediction rectangle selection frame, and omega and h are the width and height of the prediction rectangle selection frame; x is the number ofg,ygSelecting the coordinates of the center point of the frame, omega, for the true rectanglegAnd hgSelecting the width and height of the frame for the true rectangle;
the error D, namely the Euclidean distance between the predicted rectangular selection box of the target detection algorithm and the ground-truth selection box of the target's real label, is weighted by the regularization hyperparameter λ to obtain the loss function L1 = λ·D, which generalizes the detection and recognition capability; at the same time the Intersection over Union loss value L2 = λ·IOU is computed, where IOU is the standard for measuring the accuracy of detecting the corresponding object in a specific data set;
after each input image passes through the network, the 81 partitions produce 486 probabilities: each spatial point carries a probability Pr that quantitatively measures whether a target is present, and the 81 partitions give the probabilities Pc of the 6 classes; these must be composited with the partition's quantitative target probability Po, so that once it is determined that a target exists in a partition, the conditional probability that it belongs to a given class is composited to yield the unconditional probability of each single independent class, giving the corresponding classification loss value Lc = Pc·Po; the overall loss function is L = L1 + L2 + Lc.
Preferably, the convolutional and pooling layers used by the video target detection and recognition model support multi-size image input; a single-size data set is augmented by preprocessing, and during training images of a different size are randomly selected after every M batches.
Preferably, in the detection of multiple target classes, a recall-precision curve can be drawn for each target class, and AveP(q) is the area under that curve, as shown in Equation 8; mAP is the mean of the AP values over the Q classes, where the class index q is an integer from 1 to Q;
mAP = (1/Q) · Σ_{q=1}^{Q} AveP(q)    (8)
Preferably, the training data set consists of various types of video at various resolutions, including the D5 format; the validation data set consists of 20% of the data reserved from the training data set.
The invention has the beneficial effects that:
the method is based on separation confidence calculation and regular term hyper-parameters, designs a composite loss function based on classification confidence values, can be used for detecting and identifying high-frame-rate high-definition video targets under high-requirement and complex and variable practical application scenes, and obviously improves the accuracy and the real-time performance of measurement based on mAP and camera frame rate fps.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram illustrating a video object detection and recognition model according to the present invention.
FIG. 2 is a diagram illustrating a test chart of a video object detection recognition model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-2, the present invention provides a target detection method, including:
A video target detection and recognition model is constructed, the labeled data of the training data set are input into the fully convolutional network through a high-speed storage medium, and the pre-training weights and thresholds are initialized and input to the fully convolutional network, with a learning rate decay strategy applied during learning. A weighted linear combination of independent loss functions is used to optimize the generated composite loss value, and the overall loss function is obtained by minimization during training.
The video target detection and recognition model computes the classification confidence value and the rectangular selection box with a logistic (sigmoid) activation function at the last layer of the network model, thereby realizing real-time detection and recognition of vehicle and pedestrian targets in video.
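As an illustrative sketch only (the patent contains no code), the activation scheme described in this document, a Leaky ReLU on all layers except the last and a logistic function on the last layer, can be written as follows; the leak slope of 0.1 and the raw head values are assumptions, not values from the patent:

```python
import math

LEAKY_SLOPE = 0.1  # assumed slope for the leaky part; not specified in the text

def leaky_relu(x: float, slope: float = LEAKY_SLOPE) -> float:
    """Activation on all layers except the last: identity for positive
    inputs, a small linear leak for negative inputs."""
    return x if x > 0.0 else slope * x

def logistic(x: float) -> float:
    """Logistic (sigmoid) activation on the last layer, squashing raw
    outputs into (0, 1) confidence values and normalized box coordinates."""
    return 1.0 / (1.0 + math.exp(-x))

# Example: decode one partition's raw head outputs into a confidence
# value and a normalized rectangular selection box (x, y, w, h).
raw_head = [2.0, -1.0, 0.5, 0.0, 1.5]          # [confidence, x, y, w, h]
confidence = logistic(raw_head[0])
box = [logistic(v) for v in raw_head[1:]]
```

Because the logistic function is bounded in (0, 1), the decoded confidence and box coordinates need no extra clamping.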
The invention also includes testing of the video target detection and recognition model: the verification data of the verification data set are fed to the optimized fully convolutional network to obtain predicted rectangular selection boxes, which are compared against the true rectangular selection boxes of the verification data set to compute the mean of the area under the function curve, mAP.
The training data set consists of various types of video at various resolutions, including the D5 format; the validation data set consists of 20% of the data reserved from the training data set.
The pre-training weights and thresholds are initialized as follows: before training, the weights and thresholds are initialized with the parameters of a pre-trained model obtained by training on a standard data set.
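A minimal sketch of this initialization step, assuming a simple name-and-shape matching rule (the patent does not specify the matching logic, and the parameter names below are hypothetical):

```python
def init_from_pretrained(model_params: dict, pretrained_params: dict) -> dict:
    """Copy parameters from a model pre-trained on a standard data set
    into the new model where names and shapes match; keep the fresh
    initialization otherwise."""
    out = dict(model_params)
    for name, value in pretrained_params.items():
        if name in out and len(out[name]) == len(value):
            out[name] = list(value)
    return out

# Hypothetical parameter dictionaries for illustration only.
fresh = {"conv1.w": [0.0, 0.0], "head.w": [0.0]}
pretrained = {"conv1.w": [0.3, -0.2], "fc.w": [1.0]}
params = init_from_pretrained(fresh, pretrained)
```

Parameters present only in the pre-trained model (here `fc.w`) are ignored, so the detection head keeps its fresh initialization.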
The learning rate decay strategy is as follows: model parameters are initialized before training with the parameters obtained for the video target detection and recognition model, and training uses mini-batch gradient descent with the learning rate lr of each iteration initially set to 0.001. The purpose is to retain a certain learning ability and memory ability at the same time, so that new knowledge can be learned without completely forgetting old knowledge. Because a large batch occupies much memory and easily overflows, the batch size generally does not exceed 128. The model can then output the confidence value and selection box coordinates of each relevant target class in each partition.
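The text fixes only the initial learning rate (0.001) and the batch-size ceiling (128); the step-decay factor and interval in the sketch below are illustrative assumptions:

```python
INITIAL_LR = 0.001   # initial per-iteration learning rate from the text
MAX_BATCH = 128      # batch size ceiling from the text

def decayed_lr(iteration: int, initial_lr: float = INITIAL_LR,
               decay: float = 0.1, step: int = 10000) -> float:
    """Step decay: shrink the learning rate by `decay` every `step`
    iterations, retaining learning ability while avoiding complete
    forgetting of earlier training. `decay` and `step` are assumptions."""
    return initial_lr * (decay ** (iteration // step))

def clamp_batch(requested: int) -> int:
    """Keep the mini-batch size at or below the memory-safe ceiling."""
    return min(requested, MAX_BATCH)
```

With these assumed settings, the learning rate stays at 0.001 for the first 10,000 iterations and drops by a factor of 10 at each subsequent step boundary.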
The composite loss value generated as a weighted linear combination of independent loss functions is optimized, and the overall loss function is obtained by minimization during training, as follows:
let S_i(x_g, y_g, ω_g, h_g) be the true rectangular selection box of the target, with (x_g, y_g) as its center point;
let S(x, y, ω, h) be the predicted rectangular selection box of the target, with (x, y) as its center point;
where x and y are the center-point coordinates of the predicted rectangular selection box, and ω and h are its width and height; x_g and y_g are the center-point coordinates of the true rectangular selection box, and ω_g and h_g are its width and height;
the error D, namely the Euclidean distance between the predicted rectangular selection box of the target detection algorithm and the ground-truth selection box of the target's real label, is weighted by the regularization hyperparameter λ to obtain the loss function L1 = λ·D, which generalizes the detection and recognition capability; at the same time the Intersection over Union loss value L2 = λ·IOU is computed, where IOU is the standard for measuring the accuracy of detecting the corresponding object in a specific data set, and the regularization hyperparameter λ ∈ [0, ∞);
after each input image passes through the network, the 81 partitions produce 486 probabilities: each spatial point carries a probability Pr that quantitatively measures whether a target is present, and the 81 partitions give the probabilities Pc of the 6 classes; these must be composited with the partition's quantitative target probability Po, so that once it is determined that a target exists in a partition, the conditional probability that it belongs to a given class is composited to yield the unconditional probability of each single independent class, giving the corresponding classification loss value Lc = Pc·Po; the overall loss function is L = L1 + L2 + Lc.
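Under stated assumptions, the composite loss L = L1 + L2 + Lc can be sketched numerically. Two points are assumptions of this sketch, not of the patent: D is taken as the Euclidean distance between box centers, and L2 is implemented as λ·IOU exactly as written in the text (many detectors instead use λ·(1 − IoU)):

```python
import math

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def composite_loss(pred, truth, pc, po, lam=0.5):
    """L = L1 + L2 + Lc per the text: L1 = lam * D (here: Euclidean
    center distance, an assumption), L2 = lam * IOU, Lc = Pc * Po.
    lam is the regularization hyperparameter, lam in [0, inf)."""
    d = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
    l1 = lam * d
    l2 = lam * iou(pred, truth)
    lc = pc * po
    return l1 + l2 + lc
```

The value of λ (here 0.5) is an illustrative choice within the stated range [0, ∞).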
In another embodiment, the convolutional and pooling layers used by the video target detection and recognition model support multi-size image input; a single-size data set is augmented by preprocessing, and during training images of a different size are randomly selected after every M batches (M = 20 batches). The input sizes of the network model are multiples of 64, i.e. {320, 384, …, 1216}, the smallest being 320×320 and the largest 1216×1216. The model only needs to be fine-tuned to the corresponding size before training of the next batch continues.
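The multi-scale schedule above (a new size drawn from the multiples of 64 between 320 and 1216, every M = 20 batches) can be sketched as follows; the random seed and window bookkeeping are implementation details assumed for illustration:

```python
import random

M = 20                                    # batches per size window, from the text
SIZES = list(range(320, 1216 + 1, 64))    # 320, 384, ..., 1216 (multiples of 64)

def size_for_batch(batch_index: int, rng: random.Random) -> int:
    """Return the square input size for this batch; a new random size is
    drawn only at the start of each M-batch window."""
    if batch_index % M == 0:
        size_for_batch.current = rng.choice(SIZES)
    return size_for_batch.current

rng = random.Random(0)
schedule = [size_for_batch(i, rng) for i in range(60)]   # three 20-batch windows
```

Within each 20-batch window the size is constant, so the model is fine-tuned to a new input resolution only at window boundaries.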
In the detection of multiple target classes, a recall-precision curve can be drawn for each target class, and AveP(q) is the area under that curve, as shown in Equation 8; mAP is the mean of the AP values over the Q classes, where the class index q is an integer from 1 to Q;
mAP = (1/Q) · Σ_{q=1}^{Q} AveP(q)    (8)
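Equation 8 can be sketched numerically as follows, approximating AveP(q) by the trapezoidal area under each recall-precision curve; the two curves are hypothetical illustration data, not results from the patent:

```python
def avep(recall_precision):
    """Area under a recall-precision curve given as sorted (recall,
    precision) points, via the trapezoidal rule."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(recall_precision, recall_precision[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

def mean_average_precision(curves):
    """mAP = (1/Q) * sum over classes q = 1..Q of AveP(q), per Equation 8."""
    return sum(avep(c) for c in curves) / len(curves)

# Hypothetical two-class example.
curves = [
    [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)],   # class 1: AveP = 0.875
    [(0.0, 1.0), (1.0, 0.0)],               # class 2: AveP = 0.5
]
map_value = mean_average_precision(curves)   # (0.875 + 0.5) / 2 = 0.6875
```

In practice the recall-precision points would come from sweeping the confidence threshold over the predicted selection boxes of the verification data set.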
With the model designed by the method, the classification confidence value and the rectangular selection box are computed at the last layer of the network model, realizing a real-time detection and recognition model for vehicle and pedestrian targets in video. The method can be used for target detection and recognition in high-frame-rate high-definition video under demanding, complex and variable practical application scenes. Compared with classical methods, it markedly improves accuracy as measured by mAP and real-time performance as measured by frames per second (up to 70 fps is supported), meeting the performance requirements of high-frame-rate high-definition video target detection and recognition.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A target detection method, comprising:
constructing a video target detection and recognition model, inputting the labeled data of a training data set into a fully convolutional network through a high-speed storage medium, optimizing a composite loss value generated as a weighted linear combination of independent loss functions, and obtaining the overall loss function by minimization during training;
the video target detection and recognition model computes a classification confidence value and a rectangular selection box using a logistic activation function at the last layer of the network model, thereby realizing real-time detection and recognition of vehicle and pedestrian targets in video.
2. The method of claim 1, further comprising testing the video target detection and recognition model: the verification data of a verification data set are fed to the optimized fully convolutional network to obtain predicted rectangular selection boxes, which are compared against the true rectangular selection boxes of the verification data set to compute the mean of the area under the function curve.
3. The method of claim 2, wherein pre-training weights and thresholds are initialized and input to the fully convolutional network, with a learning rate decay strategy applied during learning.
4. The method of claim 3, wherein the pre-training weights and thresholds are initialized as follows: before training, the weights and thresholds are initialized with the parameters of a pre-trained model obtained by training on a standard data set.
5. The method of claim 4, wherein the learning rate decay strategy comprises: model parameters are initialized before training with the parameters obtained for the video target detection and recognition model, training uses mini-batch gradient descent, and the learning rate lr of each iteration is initially set to 0.001; the model can then output the confidence value and selection box coordinates of each relevant target class in each partition.
6. The method of claim 1, wherein the composite loss value generated as a weighted linear combination of independent loss functions is optimized, and the overall loss function is obtained by minimization during training, as follows:
let S_i(x_g, y_g, ω_g, h_g) be the true rectangular selection box of the target, with (x_g, y_g) as its center point;
let S(x, y, ω, h) be the predicted rectangular selection box of the target, with (x, y) as its center point;
where x and y are the center-point coordinates of the predicted rectangular selection box, and ω and h are its width and height; x_g and y_g are the center-point coordinates of the true rectangular selection box, and ω_g and h_g are its width and height;
the error D, namely the Euclidean distance between the predicted rectangular selection box of the target detection algorithm and the ground-truth selection box of the target's real label, is weighted by the regularization hyperparameter λ to obtain the loss function L1 = λ·D, which generalizes the detection and recognition capability; at the same time the Intersection over Union loss value L2 = λ·IOU is computed, where IOU is the standard for measuring the accuracy of detecting the corresponding object in a specific data set;
after each input image passes through the network, the 81 partitions produce 486 probabilities: each spatial point carries a probability Pr that quantitatively measures whether a target is present, and the 81 partitions give the probabilities Pc of the 6 classes; these must be composited with the partition's quantitative target probability Po, so that once it is determined that a target exists in a partition, the conditional probability that it belongs to a given class is composited to yield the unconditional probability of each single independent class, giving the corresponding classification loss value Lc = Pc·Po; the overall loss function is L = L1 + L2 + Lc.
7. The method of claim 1, wherein the convolutional and pooling layers of the video target detection and recognition model support multi-size image input, a single-size data set is augmented by preprocessing, and images of a different size are randomly selected after every M batches of training.
8. The method of claim 2, wherein in the detection of multiple target classes, a recall-precision curve can be drawn for each target class, and AveP(q) is the area under that curve, as shown in Equation 8; mAP is the mean of the AP values over the Q classes, where the class index q is an integer from 1 to Q;
mAP = (1/Q) · Σ_{q=1}^{Q} AveP(q)    (8)
9. The method of claim 2, wherein the training data set consists of various types of video at various resolutions, including the D5 format; and the validation data set consists of 20% of the data reserved from the training data set.
CN202010137294.8A 2019-12-30 2020-03-03 Target detection method Pending CN111353440A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019113972729 2019-12-30
CN201911397272 2019-12-30

Publications (1)

Publication Number Publication Date
CN111353440A true CN111353440A (en) 2020-06-30

Family

ID=71197229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010137294.8A Pending CN111353440A (en) 2019-12-30 2020-03-03 Target detection method

Country Status (1)

Country Link
CN (1) CN111353440A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860510A (en) * 2020-07-29 2020-10-30 浙江大华技术股份有限公司 X-ray image target detection method and device
CN112149501A (en) * 2020-08-19 2020-12-29 北京豆牛网络科技有限公司 Method and device for identifying packaged fruits and vegetables, electronic equipment and computer readable medium
CN112613462A (en) * 2020-12-29 2021-04-06 安徽大学 Weighted intersection ratio method
CN112613462B (en) * 2020-12-29 2022-09-23 安徽大学 Weighted intersection ratio method
CN113537242A (en) * 2021-07-19 2021-10-22 安徽炬视科技有限公司 Small target detection algorithm based on dense deconvolution and specific loss function

Similar Documents

Publication Publication Date Title
CN110135267B (en) Large-scene SAR image fine target detection method
CN111652321B (en) Marine ship detection method based on improved YOLOV3 algorithm
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN111353440A (en) Target detection method
CN114202672A (en) Small target detection method based on attention mechanism
CN112101430B (en) Anchor frame generation method for image target detection processing and lightweight target detection method
CN111126472A (en) Improved target detection method based on SSD
CN111079739B (en) Multi-scale attention feature detection method
Zhou et al. Octr: Octree-based transformer for 3d object detection
CN112927279A (en) Image depth information generation method, device and storage medium
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN112801183A (en) Multi-scale target detection method based on YOLO v3
CN111310821A (en) Multi-view feature fusion method, system, computer device and storage medium
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
CN116266387A (en) YOLOV4 image recognition algorithm and system based on re-parameterized residual error structure and coordinate attention mechanism
CN114565842A (en) Unmanned aerial vehicle real-time target detection method and system based on Nvidia Jetson embedded hardware
KR20210093875A (en) Video analysis methods and associated model training methods, devices, and devices
CN115661767A (en) Image front vehicle target identification method based on convolutional neural network
CN117853955A (en) Unmanned aerial vehicle small target detection method based on improved YOLOv5
CN114494823A (en) Commodity identification, detection and counting method and system in retail scene
CN113435324B (en) Vehicle target detection method and device and computer readable storage medium
CN113139540B (en) Backboard detection method and equipment
CN118279320A (en) Target instance segmentation model building method based on automatic prompt learning and application thereof
US20230343082A1 (en) Encoding of training data for training of a neural network
CN114998672B (en) Small sample target detection method and device based on meta learning

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200630