Target detection method, apparatus, device and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a method, an apparatus, a device, and a storage medium for target detection.
Background
As is known, vision is the most direct and effective means of obtaining information, yet most monitoring systems operate in a "record only, no judgment" mode: the video signals captured by the cameras are transmitted to a control center, where an operator analyzes them and makes the corresponding judgments. This is a great waste of human resources. With the advent of computer-vision intelligent video processing systems, video analysis tasks such as target detection and tracking can be performed using image processing techniques and machine learning methods.
The task of target detection is to find all objects of interest in an image and determine their positions and sizes. Because objects vary in appearance, shape and posture, and are subject to interference from factors such as illumination and occlusion during imaging, target detection has always been one of the most challenging problems in the field of machine vision.
Existing target detection methods readily produce false detections in static pictures with complex backgrounds, so the target detection accuracy needs to be improved. In addition, existing methods have limited generalization to complex monitoring scenes; improving the scene migration performance of a target detection algorithm requires training on a large number of data sets, resulting in a strong dependency on data.
Disclosure of Invention
The present application aims to provide a target detection method, apparatus, device and storage medium, so as to improve target detection accuracy and scene migration performance.
In a first aspect, an embodiment of the present application provides a target detection method, including:
acquiring video data;
preprocessing a first image sequence of the video data to obtain a second image sequence with background images removed;
and inputting the second image sequence into a trained detection model for target detection to obtain a target detection result.
In a possible implementation manner, in the foregoing method provided in this embodiment of the present application, the preprocessing a first image sequence of the video data to obtain a second image sequence with background images removed includes:
detecting a moving object of a first image sequence of the video data by using a background subtraction method;
and reserving pixels of the area where the moving target is located, and segmenting the pixels of the area where the moving target is located into independent moving target units by using a morphological method to obtain a second image sequence with the background image removed.
In a possible implementation manner, in the foregoing method provided in this embodiment of the present application, the detection model adopts an SSD framework, where the SSD framework includes: a feature extraction network and a target detection network.
In one possible implementation manner, in the foregoing method provided in an embodiment of the present application, the method further includes training an SSD framework, which includes:
preprocessing an image sequence of sample video data to obtain a sample image sequence with a background image removed;
carrying out artificial target labeling on the sample image sequence to obtain a training data set;
training an SSD framework based on the training data set: first, initializing the parameters to be trained and the hyper-parameters in the network; inputting the training data into the initialized network for forward propagation to obtain an actual output result; adjusting the network parameters by combining a loss function with a back propagation (BP) algorithm; and performing iterative training, ending the training when the loss value of the loss function is smaller than a set threshold or the maximum number of iterations is reached, to obtain a trained SSD framework.
In a possible implementation manner, in the foregoing method provided in this embodiment of the present application, the loss function is a weighted sum of the position error and the confidence error.
In a possible implementation manner, in the foregoing method provided in this embodiment of the present application, the confidence error is calculated as follows:
$L_{conf}(x,c) = -\sum_{i\in Pos}^{N} x_{ij}^{p}\log(\hat{c}_{i}^{p}) - \sum_{i\in Neg}\log(\hat{c}_{i}^{0})$, where $\hat{c}_{i}^{p} = \exp(c_{i}^{p})/\sum_{p}\exp(c_{i}^{p})$,
wherein $x_{ij}^{p}\in\{0,1\}$ indicates that the prediction box i matches the real box j with respect to the category p.
In a second aspect, an embodiment of the present application provides an object detection apparatus, including:
the acquisition module is used for acquiring video data;
the preprocessing module is used for preprocessing the first image sequence of the video data to obtain a second image sequence with background images removed;
and the target detection module is used for inputting the second image sequence into a trained detection model for target detection to obtain a target detection result.
In a possible implementation manner, in the apparatus provided in this embodiment of the present application, the preprocessing module is specifically configured to:
detecting a moving object of a first image sequence of the video data by using a background subtraction method;
reserving pixels of the area where the moving target is located, and segmenting the pixels of the area where the moving target is located into independent moving target units by using a morphological method, to obtain a second image sequence with background images removed.
In a possible implementation manner, in the foregoing apparatus provided in this embodiment of the present application, the detection model employs an SSD framework, where the SSD framework includes: a feature extraction network and a target detection network.
In a possible implementation manner, in the apparatus provided in this embodiment of the present application, a training module is further included, configured to:
preprocessing an image sequence of sample video data to obtain a sample image sequence with a background image removed;
carrying out artificial target labeling on the sample image sequence to obtain a training data set;
training an SSD framework based on the training data set: first, initializing the parameters to be trained and the hyper-parameters in the network; inputting the training data into the initialized network for forward propagation to obtain an actual output result; adjusting the network parameters by combining a loss function with a back propagation (BP) algorithm; and performing iterative training, ending the training when the loss value of the loss function is smaller than a set threshold or the maximum number of iterations is reached, to obtain a trained SSD framework.
In a possible implementation manner, in the foregoing apparatus provided in this embodiment of the present application, the loss function is a weighted sum of the position error and the confidence error.
In a possible implementation manner, in the foregoing apparatus provided in this embodiment of the present application, the confidence error is calculated as follows:
$L_{conf}(x,c) = -\sum_{i\in Pos}^{N} x_{ij}^{p}\log(\hat{c}_{i}^{p}) - \sum_{i\in Neg}\log(\hat{c}_{i}^{0})$, where $\hat{c}_{i}^{p} = \exp(c_{i}^{p})/\sum_{p}\exp(c_{i}^{p})$,
wherein $x_{ij}^{p}\in\{0,1\}$ indicates that the prediction box i matches the real box j with respect to the category p.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor;
the memory for storing a computer program;
wherein the processor executes the computer program in the memory to implement the method described in the first aspect and the various embodiments of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program is used for implementing the method described in the first aspect and the implementation manners of the first aspect when executed by a processor.
Compared with the prior art, the target detection method, apparatus, device and storage medium provided by the present application acquire video data, preprocess a first image sequence of the video data to obtain a second image sequence with the background images removed, and input the second image sequence into a trained detection model for target detection to obtain a target detection result. On one hand, because only the foreground target is retained in the background-removed images, interference from background content is avoided and the detection model focuses more on the foreground target during learning and inference, so the target detection accuracy can be improved. On the other hand, because the background pixels of the input images are removed, the detection model sees only foreground pixels and is not influenced by the scene of the video or picture sequence, which improves the scene migration performance of target detection.
Drawings
Fig. 1 is a schematic flowchart of a target detection method according to an embodiment of the present application;
fig. 2 is a flowchart of a background removal method provided in an embodiment of the present application;
FIG. 3 is an overall structure of a target detection system based on an SSD framework according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a target detection apparatus according to a second embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
The following detailed description of embodiments of the present application is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present application is not limited to the specific embodiments.
Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.
The problem to be solved by target detection is as follows: find objects of certain classes in an image or video via target boxes, and give the probability that each object belongs to a certain class; that is, it is a task combining position-coordinate regression and class prediction.
SSD: Single Shot MultiBox Detector. The SSD framework comprises a feature extraction network and a target detection network; the feature extraction network extracts features of an image, and the target detection network performs position regression and target category prediction based on the extracted features, so as to identify the object categories in the image.
Fig. 1 is a schematic flowchart of a target detection method according to an embodiment of the present application. In practical applications, the execution subject of this embodiment may be a target detection apparatus, which may be implemented as a virtual device such as software code, as a physical device storing the relevant execution code such as a USB drive, or as a physical device integrating the relevant execution code, such as a chip, a computer or a robot.
As shown in fig. 1, the method includes the following steps S101 to S103:
and S101, acquiring video data.
S102, preprocessing the first image sequence of the video data to obtain a second image sequence with background images removed.
S103, inputting the second image sequence into a trained detection model for target detection to obtain a target detection result.
In this embodiment, the video data may be collected by a camera in real time or stored in advance. It can be understood that the video data is composed of multiple frames of images and contains the objects to be identified, such as people and vehicles. After the video images to be detected are obtained, the background is removed from the image sequence and only the foreground target is retained; that is, only the pixels of the region where the target is located are kept, and the background pixel region is set to zero, to obtain an image sequence with the background images removed.
Specifically, step S102 may be implemented as follows: detecting moving targets in the first image sequence of the video data by using a background subtraction method; reserving the pixels of the areas where the moving targets are located, and segmenting those pixels into independent moving target units by using a morphological method, to obtain the second image sequence with the background images removed. Fig. 2 is a flowchart of the background removal method. Its core is calculating pixel stability: during operation, the algorithm records, for each pixel, the gray value that has remained stable for the longest time from the start of operation to the current moment. When a new frame arrives, the stability of each pixel is judged through a series of threshold comparisons, using the stability of adjacent frames and the historical pixel values as the basis, to decide whether the pixel is a background point; background pixels are thereby removed and foreground pixels retained.
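The pixel-stability scheme of fig. 2 is specific to this application, but its overall effect — zeroing stable (background) pixels and keeping moving ones — can be sketched with a simple running-average background model. This is an illustrative stand-in, not the patented algorithm; `diff_thresh` and the update rate are assumed values, and the morphological split into independent target units (e.g. opening plus connected-component labeling) is omitted here:

```python
import numpy as np

def remove_background(frames, diff_thresh=25):
    """Crude background-subtraction sketch: a pixel counts as background
    if it changes little relative to the running background model;
    moving pixels are kept and background pixels are set to zero."""
    # Use the first frame as the initial background model.
    background = frames[0].astype(np.float32)
    out = []
    for frame in frames[1:]:
        diff = np.abs(frame.astype(np.float32) - background)
        mask = diff > diff_thresh          # foreground where change is large
        out.append(np.where(mask, frame, 0))  # zero out background pixels
        # Slowly update the background toward the stable pixel values.
        background = 0.95 * background + 0.05 * frame
    return out
```

In a real system the foreground mask would additionally be cleaned by morphological operations and split into per-target units before being fed to the detector.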
And then inputting the image sequence without the background image into a trained detection model for target detection to obtain a target detection result. The detection model is also trained by using a sample with a background removed during training.
The present application is described below in a specific embodiment.
In this embodiment, the detection model adopts an SSD framework, and the SSD framework includes: a feature extraction network and a target detection network. The SSD framework is trained as follows:
s201, preprocessing an image sequence of the sample video data to obtain a sample image sequence with a background image removed.
S202, carrying out artificial target labeling on the sample image sequence to obtain a training data set.
S203, training the SSD framework based on the training data set: first, initializing the parameters to be trained and the hyper-parameters in the network; inputting the training data into the initialized network for forward propagation to obtain an actual output result; adjusting the network parameters by combining a loss function with a back propagation (BP) algorithm; and performing iterative training, ending the training when the loss value of the loss function is smaller than a set threshold or the maximum number of iterations is reached, to obtain a trained SSD framework.
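The control flow of S203 — forward pass, loss, gradient update, and the two stopping criteria (loss below a set threshold, or the maximum number of iterations reached) — can be sketched on a toy one-parameter model. The model `y = w * x` and the hyper-parameter values here are illustrative stand-ins for the actual SSD network:

```python
def train(data, lr=0.01, loss_threshold=1e-4, max_iters=1000):
    """Skeleton of the iterative training procedure in S203."""
    w = 0.0                                # initialize parameter to be trained
    loss = float("inf")
    for _ in range(max_iters):
        # Forward propagation: compare actual output with labels.
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        if loss < loss_threshold:          # stop: loss below set threshold
            break
        # Back propagation (analytic gradient for this toy model).
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad                     # adjust network parameters
    return w, loss
```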
Specifically, a training data set is first prepared: using a traditional image processing algorithm, moving targets are detected by the background subtraction method; the pixel mask of the region where each moving target is located is retained, and a morphological method is used to preserve the pixels of the target regions as completely as possible, yielding a background-removed image sequence. The data set is then manually labeled with a labeling tool to obtain the training data set.
Designing the detection model: the design is based on the existing SSD target detection network structure and mainly modifies the loss function by removing the background loss term from the class loss. In this embodiment, the loss function is a weighted sum of the position error and the confidence error: Softmax Loss is used for the confidence error, and Smooth L1 Loss for the position error.
The loss function is as follows:
$L(x,c,l,g) = \frac{1}{N}\left(L_{conf}(x,c) + \alpha L_{loc}(x,l,g)\right)$
wherein the first term $L_{conf}$ is the confidence error and the second term $L_{loc}$ is the position error; N is the number of matched default boxes; $\alpha$ is a balance factor (weight coefficient), set to 1 by cross validation; c is the category confidence prediction value; l is the predicted position offset of the bounding box corresponding to the prior (default) box; and g is the position parameter of the real target.
Wherein,
$L_{conf}(x,c) = -\sum_{i\in Pos}^{N} x_{ij}^{p}\log(\hat{c}_{i}^{p}) - \sum_{i\in Neg}\log(\hat{c}_{i}^{0})$, with $\hat{c}_{i}^{p} = \exp(c_{i}^{p})/\sum_{p}\exp(c_{i}^{p})$.
Because of the uniformity of background samples, feature learning for the background need not be considered, so the network can pay more attention to learning the foreground samples. $x_{ij}^{p}\in\{0,1\}$ indicates that the prediction box i matches the real box j with respect to the category p; the higher the predicted probability $\hat{c}_{i}^{p}$, the lower the loss. $\hat{c}_{i}^{p}$ is obtained by Softmax; if a prediction box contains no target, the higher the predicted background probability $\hat{c}_{i}^{0}$, the lower the loss.
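The confidence error can be computed directly from raw class scores. The sketch below implements the standard SSD form with both the foreground and background terms; per the text, this application's variant would drop the background (negative) term:

```python
import math

def softmax(scores):
    """Normalize raw class scores into probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conf_loss(pos_preds, neg_preds):
    """Softmax cross-entropy confidence error.
    pos_preds: (class-score list, matched class index p) per matched box.
    neg_preds: class-score lists for unmatched boxes, scored against
    the background class (index 0)."""
    loss = 0.0
    for scores, p in pos_preds:
        loss -= math.log(softmax(scores)[p])   # -x_ij^p * log(c_hat_i^p)
    for scores in neg_preds:
        loss -= math.log(softmax(scores)[0])   # -log(c_hat_i^0)
    return loss
```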
Wherein,
$L_{loc}(x,l,g) = \sum_{i\in Pos}^{N}\sum_{m\in\{cx,cy,w,h\}} x_{ij}^{k}\,\mathrm{smooth}_{L1}\!\left(l_{i}^{m} - \hat{g}_{j}^{m}\right)$
$\hat{g}_{j}^{cx} = (g_{j}^{cx} - d_{i}^{cx})/d_{i}^{w}$, $\hat{g}_{j}^{cy} = (g_{j}^{cy} - d_{i}^{cy})/d_{i}^{h}$, $\hat{g}_{j}^{w} = \log(g_{j}^{w}/d_{i}^{w})$, $\hat{g}_{j}^{h} = \log(g_{j}^{h}/d_{i}^{h})$
wherein $\mathrm{smooth}_{L1}$ is the position regression function; $x_{ij}^{k}$ indicates whether the ith prediction box and the jth real box match with respect to the category k; l and g represent the prediction box and the real box respectively; $g_{j}^{cx}$ represents the midpoint of the jth real box, $d_{i}^{cx}$ the midpoint of the ith default box, and $d_{i}^{w}$ the width of the ith default box.
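A minimal sketch of the position error: Smooth L1 applied to the offsets between the predicted box parameters and the ground truth encoded against the default box. The `(cx, cy, w, h)` box layout is the standard SSD convention, assumed here:

```python
import math

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def encode(gt, default):
    """Encode a ground-truth box (cx, cy, w, h) relative to a default box,
    per the g_hat formulas above."""
    gcx, gcy, gw, gh = gt
    dcx, dcy, dw, dh = default
    return [(gcx - dcx) / dw, (gcy - dcy) / dh,
            math.log(gw / dw), math.log(gh / dh)]

def loc_loss(pred_offsets, gt_box, default_box):
    """Position error: Smooth L1 between predicted offsets l and the
    encoded ground truth g_hat, summed over cx, cy, w, h."""
    g_hat = encode(gt_box, default_box)
    return sum(smooth_l1(l - g) for l, g in zip(pred_offsets, g_hat))
```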
Training the detection model: using the labeled background-removed image data set as the training data set, the detection model is trained under the SSD network framework. Fig. 3 shows the overall structure of the SSD-based target detection system. Specifically, the input image is first resized to the input size required by the network (for example, 300 × 300), and the background-removed pictures obtained by preprocessing are used as the input data of the training model, as shown in part B of fig. 3. Multi-layer image features are extracted by forward propagation through the backbone network, and features from different layers are fused. An error value is obtained by comparing IoU (Intersection over Union) with the ground-truth data, and the modified objective function is used to calculate the loss value, so that the network learns more foreground information and ignores the background. The network parameters are then adjusted by error back propagation using stochastic gradient descent, with the learning rate lr set to 0.001 and the gradient momentum to 0.9, completing one iteration. The training ends when the loss value of the loss function is smaller than the set threshold or the maximum number of iterations is reached, yielding the trained SSD framework.
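The parameter update in the back-propagation step, with the stated settings (lr = 0.001, gradient momentum = 0.9), can be sketched as follows. This is the classic momentum form; the exact variant used by the application is not specified, so this is an assumption:

```python
def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9):
    """One stochastic-gradient-descent update with momentum:
    v <- momentum * v - lr * grad;  w <- w + v (in place)."""
    for i in range(len(params)):
        velocity[i] = momentum * velocity[i] - lr * grads[i]
        params[i] += velocity[i]
    return params, velocity
```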
As shown in fig. 3, a data preprocessing layer is added on top of the SSD network structure: this layer obtains the video data, removes the background with a background modeling algorithm, and sends the retained foreground images into the SSD framework for detection. Feature extraction is implemented as shown in part C of fig. 3; feature maps of different scales are then extracted through further convolution and pooling operations, and candidate target boxes are proposed on each scale. For example, for an 8 × 8 feature map, nine types of candidate boxes (3 aspect ratios × 3 areas) are generated at each feature point position, giving 8 × 8 × 9 candidate boxes. When the SSD framework performs inference on an image, it generates a series of fixed-size candidate boxes together with the likelihood that each box contains an object instance. A single forward pass produces a large number of target boxes, most of which must be filtered out by Non-Maximum Suppression (NMS): boxes whose confidence is below a threshold ct (e.g. 0.01) are discarded, boxes whose IoU with a higher-scoring box exceeds lt (e.g. 0.45) are suppressed, and only the top N predictions are retained. The fused features are matched with the ground-truth features to constrain the loss function, so that it focuses more on the foreground target features, thereby realizing target detection.
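The NMS filtering step described above can be sketched directly from the stated thresholds (ct for confidence, lt for IoU); the default values mirror the examples in the text:

```python
def iou(a, b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, ct=0.01, lt=0.45, top_n=200):
    """Drop boxes below the confidence threshold ct, then greedily keep
    the highest-scoring box and suppress boxes overlapping it with
    IoU > lt; return the indices of at most top_n kept boxes."""
    order = sorted((i for i, s in enumerate(scores) if s >= ct),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order and len(keep) < top_n:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= lt]
    return keep
```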
The target detection method provided in this embodiment preprocesses the first image sequence of the video data to obtain a second image sequence with the background images removed, and inputs the second image sequence into a trained detection model for target detection, obtaining a target detection result. On one hand, because only the foreground target is retained in the background-removed images, interference from background content is avoided and the detection model focuses more on the foreground target during learning and inference, so the target detection accuracy can be improved. On the other hand, because the background pixels are removed, the detection model sees only foreground pixels and is not influenced by the scene of the video or picture sequence, which improves the scene migration performance of target detection.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 4 is a schematic structural diagram of an object detection apparatus according to a second embodiment of the present application, and as shown in fig. 4, the apparatus may include:
an obtaining module 410, configured to obtain video data;
a preprocessing module 420, configured to preprocess the first image sequence of the video data to obtain a second image sequence with background images removed;
and the target detection module 430 is configured to input the second image sequence into a trained detection model for target detection, so as to obtain a target detection result.
The target detection apparatus provided in this embodiment preprocesses the first image sequence of the video data to obtain a second image sequence with the background images removed, and inputs the second image sequence into a trained detection model for target detection, obtaining a target detection result. On one hand, because only the foreground target is retained in the background-removed images, interference from background content is avoided and the detection model focuses more on the foreground target during learning and inference, so the target detection accuracy can be improved. On the other hand, because the background pixels are removed, the detection model sees only foreground pixels and is not influenced by the scene of the video or picture sequence, which improves the scene migration performance of target detection.
In a possible implementation manner, in the apparatus provided in this embodiment of the present application, the preprocessing module 420 is specifically configured to:
detecting a moving object of a first image sequence of the video data by using a background subtraction method;
and reserving pixels of the area where the moving target is located, and segmenting the pixels of the area where the moving target is located into independent moving target units by using a morphological method to obtain a second image sequence with the background image removed.
In a possible implementation manner, in the foregoing apparatus provided in this embodiment of the present application, the detection model employs an SSD framework, where the SSD framework includes: a feature extraction network and a target detection network.
In a possible implementation manner, in the apparatus provided in this embodiment of the present application, a training module is further included, configured to:
preprocessing an image sequence of sample video data to obtain a sample image sequence with a background image removed;
carrying out artificial target labeling on the sample image sequence to obtain a training data set;
training an SSD framework based on the training data set: first, initializing the parameters to be trained and the hyper-parameters in the network; inputting the training data into the initialized network for forward propagation to obtain an actual output result; adjusting the network parameters by combining a loss function with a back propagation (BP) algorithm; and performing iterative training, ending the training when the loss value of the loss function is smaller than a set threshold or the maximum number of iterations is reached, to obtain a trained SSD framework.
In a possible implementation manner, in the foregoing apparatus provided in this embodiment of the present application, the loss function is a weighted sum of the position error and the confidence error.
In a possible implementation manner, in the foregoing apparatus provided in this embodiment of the present application, the confidence error is calculated as follows:
$L_{conf}(x,c) = -\sum_{i\in Pos}^{N} x_{ij}^{p}\log(\hat{c}_{i}^{p}) - \sum_{i\in Neg}\log(\hat{c}_{i}^{0})$, where $\hat{c}_{i}^{p} = \exp(c_{i}^{p})/\sum_{p}\exp(c_{i}^{p})$,
wherein $x_{ij}^{p}\in\{0,1\}$ indicates that the prediction box i matches the real box j with respect to the category p.
Fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present application, and as shown in fig. 5, the electronic device includes: a memory 501 and a processor 502;
a memory 501 for storing a computer program;
wherein the processor 502 executes the computer program in the memory 501 to implement the methods provided by the method embodiments as described above.
In this embodiment, the target detection device provided by the present application is exemplified by an electronic device. The processor may be a Central Processing Unit (CPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
The memory may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor to implement the methods of the embodiments of the present application described above and/or other desired functions. Various contents such as an input signal, a signal component, and a noise component may also be stored in the computer-readable storage medium.
An embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program is used for implementing the methods provided by the method embodiments described above when being executed by a processor.
In practice, the computer program in this embodiment may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages, to perform the operations of the embodiments of the present application. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
In practice, the computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing descriptions of specific exemplary embodiments of the present application have been presented for purposes of illustration and description. It is not intended to limit the application to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the present application and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the present application and various alternatives and modifications thereof. It is intended that the scope of the application be defined by the claims and their equivalents.