CN114120220A - Target detection method and device based on computer vision - Google Patents

Target detection method and device based on computer vision

Info

Publication number
CN114120220A
CN114120220A CN202111291144.3A
Authority
CN
China
Prior art keywords
target
detection
picture
data set
missed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111291144.3A
Other languages
Chinese (zh)
Inventor
王硕
郑智辉
闫威
唐波
郭宸瑞
董昊天
闫涛
李钊
张伯川
张海荣
赵玲
朱泽林
亓欣媛
常城
朱敏
许敏
张艺佳
武鹏
彭皓
任子建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aerospace Automatic Control Research Institute
Original Assignee
Beijing Aerospace Automatic Control Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aerospace Automatic Control Research Institute filed Critical Beijing Aerospace Automatic Control Research Institute
Priority to CN202111291144.3A priority Critical patent/CN114120220A/en
Priority to PCT/CN2022/072005 priority patent/WO2023070955A1/en
Publication of CN114120220A publication Critical patent/CN114120220A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a target detection method and device based on computer vision, belongs to the technical field of security monitoring of port operation areas, and solves the problems that existing training data are relatively limited and the detection accuracy for tiny objects is low. The method comprises the following steps: detecting motion information in historical monitoring video of a port operation area, taking screenshots according to the motion information and labeling targets to produce a data set; establishing a neural network Yolov5x and performing preliminary training on it with one part of the data set to obtain a target detection model; performing target detection on the other part of the data set with the target detection model to analyze false detection and missed detection targets; updating the data set through data augmentation based on the false detection and/or missed detection targets, and performing reinforcement training on the target detection model; and performing target detection on pictures to be detected with the reinforced target detection model. Performing target detection on the picture to be detected with the reinforced target detection model improves the detection accuracy for small targets.

Description

Target detection method and device based on computer vision
Technical Field
The application relates to the technical field of security monitoring of port operation areas, in particular to a target detection method and device based on computer vision.
Background
The port operation area is a well-defined place that can accommodate the complete container loading and unloading process. It comprises water areas such as harbor basins, anchorages, approach channels and berths, and land areas such as freight stations, storage yards, quay frontages and office and living areas. The port operation area is the terminus of land-water intermodal transport, serves as a buffer for container goods while the transport mode is converted, is also a transfer junction for the goods, and occupies an important position in the whole container transport process.
Because of the importance of the port operation area, the requirements for safety and order are high, so a strict security system is needed there. The most important part of such a security system is the video monitoring system. A conventional video monitoring system can monitor every position of the port operation area around the clock. However, because a conventional monitoring system has no complex video analysis capability, and the monitoring points are numerous while the attention of security personnel is limited, the existing video monitoring system cannot react to violations in time. Therefore, a computer-vision-based target detection algorithm is introduced into the video monitoring system, so that objects such as personnel, vehicles and work machinery in the port operation area are detected in real time and violations in the port operation area are monitored.
The traditional computer vision target detection algorithm detects on the basis of traditional digital image features such as object edges. Because the scenes of port operation areas are complex, detection with traditional digital image features suffers from serious false detection. Moreover, because the port operation area is extremely wide, with transverse distances of more than 300 meters, the relative size of objects in the area is very small, so detection with traditional digital image features also has the problem that tiny objects are difficult to detect.
In recent years, the application of deep learning to computer vision has demonstrated its power. Deep learning gradually extracts high-level semantic features from the low-level feature representation of a picture through multi-level processing. With these high-level semantic features, deep learning can accurately identify the targets in a digital image. In an existing method, cameras are set up to acquire images, from which a training set for training, a verification set for verification and a test set for testing are obtained, and the data of each picture set are labeled. Meanwhile, k-means clustering is applied to the labeling results of the data set to obtain the sizes of the pre-selected (anchor) boxes. Model training and target detection are then performed with a Yolo-tiny framework combined with a genetic algorithm.
The existing deep learning detection method has the following problems:
1. Cameras are specially set up to acquire images. However, in a port operation area the security monitoring cameras are already fully installed, the camera points are numerous, and ideal image information cannot be acquired at every monitoring point.
2. The backbone network depth and network width of Yolo-tiny are small, so the algorithm cannot acquire enough semantic information, and the detection accuracy for tiny objects is low.
Disclosure of Invention
In view of the foregoing analysis, embodiments of the present application are directed to providing a method and an apparatus for detecting a target based on computer vision, so as to solve the problems of relatively limited training data and low detection accuracy of a tiny object in the prior art.
In one aspect, an embodiment of the present application provides a target detection method based on computer vision, including: detecting motion information of a historical monitoring video of a port operation area, carrying out screenshot according to the motion information and marking a target to make a data set; establishing a neural network Yolov5x, and performing preliminary training on the neural network Yolov5x by using a part of the data set to obtain a target detection model; performing target detection on another part of the data set by using the target detection model to analyze a false detection target and a missed detection target; updating the data set through data augmentation based on the false detection target and/or the missed detection target, and performing reinforced training on the target detection model by using the updated data set; and capturing the current monitoring video of the port operation area at preset time intervals to obtain a picture to be detected, and performing target detection on the picture to be detected by using an enhanced target detection model.
The beneficial effects of the above technical scheme are as follows: based on the false detection target and/or the missed detection target, the data set is updated through data augmentation to obtain enough training data of the port operation area, and the updated data set is used for carrying out strengthening training on the target detection model to improve the robustness of the target detection model. And carrying out target detection on the picture to be detected by utilizing the strengthened target detection model, designing a small target detection network in the port operation area, and improving the detection accuracy of small targets.
Based on the further improvement of the method, the steps of detecting the motion information of the historical monitoring video of the port operation area, capturing the picture according to the motion information and marking the target to produce a data set further comprise: acquiring a historical monitoring video of the port operation area from a database; classifying the historical monitoring video into static pixels and moving pixels by using a Gaussian mixture model according to picture frame information of the historical monitoring video so as to judge a moving pixel area in the historical monitoring video; and capturing a picture of a moving pixel area in the historical monitoring video and labeling a target in the moving pixel area to generate the data set.
Based on a further improvement of the above method, the targets include a large target with a relatively large size and a small target with a relatively small size to be detected, and before training the neural network Yolov5x with the data set to obtain a target detection model, the method further includes: and determining the input size of each picture frame in the data set according to the sizes of the small targets in the historical monitoring video and the current monitoring video of the port operation area.
Based on a further improvement of the above method, establishing the neural network Yolov5x further comprises: the backbone of the neural network Yolov5x uses a CSP network architecture; and on the basis of the pyramid feature maps with strides 8, 16 and 32 of the neural network Yolov5x, a pyramid feature map with stride 4 is added to detect the small target.
Based on a further improvement of the above method, the step of analyzing the false detection target and the missed detection target further comprises: comparing the target detected by the target detection model with the labeled target in the corresponding picture of the data set to determine the false detection target and the missed detection target, wherein when the correspondingly labeled first picture has no target but a target is detected by the target detection model, that target is determined as the false detection target; and when the correspondingly labeled second picture has a target but that target is not detected by the target detection model, that target is determined as the missed detection target.
Based on a further improvement of the above method, based on the false detection target, updating the data set by data augmentation, wherein the using the updated data set to improve the robustness of the target detection model further includes: when the false detection target is a stable false detection target, correctly marking the first picture of which the false detection target is detected and adding the first picture to the data set; randomly cutting out a picture with the false detection target as a negative sample from the first picture, and performing data augmentation to update the data set; and inputting a picture of an object having different relative sizes and a picture of the negative sample into the object detection model, so that the object detection model identifies a difference between the object and the false detection object.
Based on a further improvement of the above method, updating the data set by data augmentation based on the missed detection target, wherein using the updated data set to improve the robustness of the target detection model further comprises: updating the data set through data augmentation by changing the brightness and contrast of the second picture and/or by randomly matting the missed targets out of the second picture and pasting them to other positions of the second picture, wherein the second picture contains the missed targets; and inputting the second picture with the pasted missed target into the target detection model, so that the missed target can be detected by the target detection model.
Based on a further improvement of the above method, the large target comprises a port work machine; the small target comprises port personnel; the false detection target comprises a camera; and the missed detection target comprises a squatting person.
In another aspect, an embodiment of the present application provides a computer-vision-based target detection apparatus, including: a data set generation module, used for detecting motion information in the historical monitoring video of a port operation area, taking screenshots according to the motion information and labeling targets to produce a data set; a model building module, used for establishing a neural network Yolov5x and performing preliminary training on the neural network Yolov5x with one part of the data set to obtain a target detection model; a target analysis module, used for performing target detection on the other part of the data set with the target detection model so as to analyze the false detection target and the missed detection target; a data set update module, used for updating the data set through data augmentation based on the false detection target and/or the missed detection target; a reinforcement training module, used for performing reinforcement training on the target detection model with the updated data set; and a target detection module, used for capturing the current monitoring video of the port operation area at predetermined time intervals to obtain the picture to be detected and performing target detection on the picture to be detected with the reinforced target detection model.
In a further refinement of the apparatus described above, the data set generation module is configured to: acquiring a historical monitoring video of the port operation area from a database; classifying the historical monitoring video into static pixels and moving pixels by using a Gaussian mixture model according to picture frame information of the historical monitoring video so as to judge a moving pixel area in the historical monitoring video; and capturing a picture of a moving pixel area in the historical monitoring video and labeling a target in the moving pixel area to generate the data set.
Based on a further improvement of the above device, the computer-vision-based target detection device further comprises an input size determination module, which is used for determining the input size of each picture frame in the data set according to the sizes of the small targets in the historical monitoring video and the current monitoring video of the port operation area.
Based on a further improvement of the above device, the model building module is further configured such that: the backbone of the neural network Yolov5x uses a CSP network architecture; and on the basis of the pyramid feature maps with strides 8, 16 and 32 of the neural network Yolov5x, a pyramid feature map with stride 4 is added to detect the small target.
In a further improvement of the above apparatus, the target analysis module is configured to compare the target detected by the target detection model with the labeled target in the corresponding picture of the data set to determine the false detection target and the missed detection target. The target analysis module comprises a false detection target analysis sub-module and a missed detection target analysis sub-module. The false detection target analysis sub-module is configured to determine a target detected by the target detection model as the false detection target when the correspondingly labeled first picture has no target; and the missed detection target analysis sub-module is configured to determine a target that is labeled in the second picture but not detected by the target detection model as the missed detection target.
In a further improvement of the above apparatus, the data set updating module is configured to: when the false detection target is a stable false detection target, correctly marking the first picture of which the false detection target is detected and adding the first picture to the data set; randomly cutting out a picture with the false detection target as a negative sample from the first picture, and performing data augmentation to update the data set; and inputting a picture of an object having different relative sizes and a picture of the negative sample into the object detection model, so that the object detection model identifies a difference between the object and the false detection object.
In a further improvement of the above apparatus, the data set updating module is configured to: update the data set through data augmentation by changing the brightness and contrast of the second picture and/or by randomly matting the missed targets out of the second picture and pasting them to other positions of the second picture, wherein the second picture contains the missed targets; and input the second picture with the pasted missed target into the target detection model, so that the missed target can be detected by the target detection model.
Based on a further improvement of the above apparatus, the large target comprises a port work machine; the small target comprises port personnel; the false detection target comprises a camera; and the missed detection target comprises a squatting person.
Compared with the prior art, the application can realize at least one of the following beneficial effects:
1. based on the false detection target and/or the missed detection target, the data set is updated through data augmentation to obtain enough training data of the port operation area, and the updated data set is used for carrying out strengthening training on the target detection model to improve the robustness of the target detection model. And carrying out target detection on the picture to be detected by utilizing the strengthened target detection model, designing a small target detection network in the port operation area, and improving the detection accuracy of small targets.
2. By redesigning the network structure of Yolov5x, the detection capability for small target objects in the port operation area is improved. By systematically designing the input size of the detection model and adding the pyramid feature map with stride 4, the accuracy of the detection model for port personnel is improved from the 90% of Yolo-tiny to 92%.
3. A motion detection algorithm is used to effectively acquire a real-time monitoring data set of the port operation area. The method and device can accurately acquire picture data of moving objects in the real-time monitoring video of the port operation area and filter out static frames (frames unchanged from the preceding and following frames) containing no object, thereby improving the efficiency of building the monitoring data set.
4. Data augmentation is performed in a targeted manner according to the specific detection situation, which improves the detection capability for small target objects in the port operation area. Performing targeted data augmentation reduces useless training as much as possible and raises the recognition accuracy from 92% to the present 95%.
In the present application, the above technical solutions may be combined with each other to realize more preferable combination solutions. Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the application, wherein like reference numerals are used to designate like parts throughout.
Fig. 1 is a flowchart of a method for computer vision based object detection according to an embodiment of the present application.
FIG. 2 is a diagram of a pyramid feature map for predicting objects of different scales according to an embodiment of the present application.
Fig. 3 is a diagram illustrating detection results of a moving object of a video detected using a gaussian mixture model according to an embodiment of the present application.
Fig. 4 is a pictorial view of a video captured in a port operation area according to an embodiment of the present application.
Fig. 5 is a diagram of the presence of a false detection target when target detection is performed using a target detection model according to an embodiment of the present application.
Fig. 6 is a diagram of the presence of a missed detection target when target detection is performed using a target detection model according to an embodiment of the present application.
Fig. 7A and 7B are, respectively, an original picture of a missed detection target and its randomly matted patch according to an embodiment of the present application.
Fig. 8A and 8B are diagrams of an original and a contrast adjusted diagram according to an embodiment of the present application, respectively.
Fig. 9A and 9B are, respectively, diagrams of an original image and the image with a missed person pasted in, according to an embodiment of the present application.
FIG. 10 is a block diagram of a computer vision based object detection apparatus according to an embodiment of the present application.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the application and together with the description, serve to explain the principles of the application and not to limit the scope of the application.
The application discloses a target detection method based on computer vision. Referring to fig. 1, a computer vision-based object detection method includes: in step S102, motion information detection is carried out on the historical monitoring video of the port operation area, screenshot is carried out according to the motion information, and a target is marked to produce a data set; in step S104, a neural network Yolov5x is established, and a part of the data set is used to perform preliminary training on the neural network Yolov5x to obtain a target detection model; in step S106, performing target detection on another part of the data set using a target detection model to analyze a false detection target and a missed detection target; in step S108, based on the false detection target and/or the missed detection target, updating the data set by data augmentation, and performing intensive training on the target detection model by using the updated data set; and in step S110, capturing a current monitoring video of the port operation area at predetermined time intervals to obtain a picture to be detected, and performing target detection on the picture to be detected by using the enhanced target detection model.
Compared with the prior art, in the target detection method based on computer vision provided by the embodiment, based on the false detection target and/or the missed detection target, the data set is updated through data augmentation to obtain enough training data of the port operation area, and the updated data set is used for performing enhanced training on the target detection model to improve the robustness of the target detection model. And carrying out target detection on the picture to be detected by utilizing the strengthened target detection model, designing a small target detection network in the port operation area, and improving the detection accuracy of small targets.
Hereinafter, referring to fig. 1, steps S102 to S110 of the computer vision-based object detection method according to an embodiment of the present application will be described in detail.
In step S102, motion information detection is performed on the historical monitoring video of the port operation area, and a screenshot is performed according to the motion information and a target is labeled to produce a data set. For example, the object includes a person, a vehicle, a work machine, and the like. Specifically, the motion information detection of the historical monitoring video of the port operation area, the screenshot according to the motion information and the labeling of the target to produce the data set further comprises: acquiring historical monitoring videos of a port operation area from a database; classifying the historical monitoring video into static pixels and moving pixels by using a Gaussian mixture model according to picture frame information of the historical monitoring video so as to judge a moving pixel area in the historical monitoring video; and capturing a picture of a moving pixel area in the historical monitoring video, and labeling a target in the moving pixel area to generate a data set.
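The motion segmentation described above can be illustrated with a much-simplified sketch: a single running Gaussian per pixel instead of the full Gaussian mixture model the patent uses, in pure Python. The class name, learning rate and threshold below are illustrative assumptions, not the patent's implementation:

```python
# Simplified sketch of the motion-segmentation step (step S102): one running
# Gaussian per pixel instead of a full mixture. Thresholds are illustrative.
class RunningGaussianBackground:
    """Classify each pixel as static (background) or moving (foreground)."""

    def __init__(self, width, height, alpha=0.05, k=2.5):
        self.alpha = alpha  # learning rate of the background model
        self.k = k          # threshold in standard deviations
        self.mean = [[0.0] * width for _ in range(height)]
        self.var = [[225.0] * width for _ in range(height)]  # initial variance

    def apply(self, frame):
        """frame: 2-D list of grayscale values; returns a 0/1 motion mask."""
        mask = []
        for y, row in enumerate(frame):
            out = []
            for x, p in enumerate(row):
                m, v = self.mean[y][x], self.var[y][x]
                d = p - m
                moving = d * d > (self.k ** 2) * v
                out.append(1 if moving else 0)
                # update the model only for background pixels, so a moving
                # object is not absorbed into the background
                if not moving:
                    self.mean[y][x] = m + self.alpha * d
                    self.var[y][x] = (1 - self.alpha) * v + self.alpha * d * d
            mask.append(out)
        return mask
```

In practice the full mixture model would typically be realised with an off-the-shelf implementation such as OpenCV's `BackgroundSubtractorMOG2`; a screenshot is taken only when the returned mask contains moving pixels.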
Then, the input size of each picture frame in the data set is determined according to the sizes of the small targets in the historical monitoring video and the current monitoring video of the port operation area. Specifically, in order to stably detect a tiny object, we require the object to be at least 1 × 1 pixel on the feature map with a stride of 16.
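The sizing rule above (the smallest target must still cover at least a 1 × 1 cell on the stride-16 feature map) pins down a minimum network input size. The sketch below is an assumption for illustration; the rounding to a multiple of 32 is a common YOLO input constraint rather than something the patent states:

```python
import math

def min_input_size(frame_dim, smallest_target_px, stride=16, multiple=32):
    """Smallest network input dimension such that the smallest target still
    covers at least one cell on the stride-`stride` feature map.

    The target is smallest_target_px wide in a frame of frame_dim pixels;
    after resizing to `input`, it spans smallest_target_px * input / frame_dim
    pixels, which must be >= stride.
    """
    needed = stride * frame_dim / smallest_target_px
    return int(math.ceil(needed / multiple) * multiple)
```

For example, a 24-pixel person in a 1920-pixel-wide frame would require an input width of at least 1280 under this rule.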
In step S104, a neural network Yolov5x is established, and one part of the data set is used to perform preliminary training on the neural network Yolov5x to obtain a target detection model. Specifically, establishing the neural network Yolov5x further includes: the backbone of the neural network Yolov5x uses a CSP network architecture; and on the basis of the pyramid feature maps with strides 8, 16 and 32 of the neural network Yolov5x, a pyramid feature map with stride 4 is added to detect the small target. A feature pyramid network is used to efficiently extract features of all scales: the high-semantic, low-spatial-scale features from the top layer of the feature pyramid are upsampled and fused with the low-semantic, high-spatial-scale features in the feature pyramid, so as to enhance the target information. For example, the large target is a port work machine; the small target includes port personnel or other personnel present in the surveillance video.
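The effect of adding the stride-4 level can be seen by tabulating, for each pyramid level, its feature-map size and the smallest object (in input pixels) that still fills one cell. This is a purely illustrative calculation, not the network code itself:

```python
def pyramid_levels(input_size, strides=(4, 8, 16, 32)):
    """For each pyramid level, return its feature-map size and the smallest
    object (in input pixels) that still occupies a full 1x1 cell there."""
    return [
        {"stride": s,
         "map_size": input_size // s,
         "min_object_px": s}  # a 1x1 cell on this map spans s input pixels
        for s in strides
    ]
```

The stride-4 level thus lets objects as small as 4 input pixels occupy a full cell, which is why it helps with the tiny personnel targets of a wide port scene.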
In step S106, target detection is performed on the other part of the data set using the target detection model to analyze the false detection target and the missed detection target. For example, the false detection target includes a camera; the missed detection target includes a squatting person. Specifically, analyzing the false detection target and the missed detection target further includes: comparing the target detected by the target detection model with the labeled target in the corresponding picture of the data set to determine the false detection target and the missed detection target, wherein a target detected by the target detection model although the correspondingly labeled first picture has no target is determined as the false detection target; and a target that is labeled in the second picture but not detected by the target detection model is determined as the missed detection target.
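The comparison of detections against labels in step S106 can be sketched as IoU-based matching of bounding boxes; the 0.5 threshold and the function names are assumptions for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def analyse(detections, labels, thr=0.5):
    """Split results into false detections (detection matching no label)
    and missed detections (label matching no detection)."""
    false_det = [d for d in detections
                 if all(iou(d, g) < thr for g in labels)]
    missed = [g for g in labels
              if all(iou(d, g) < thr for d in detections)]
    return false_det, missed
```

The pictures containing the returned boxes then feed the targeted augmentation of step S108.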
In step S108, based on the false detection target and/or the missed detection target, the data set is updated through data augmentation to improve the robustness of the target detection model, and the updated data set is used to perform reinforcement training on the target detection model. Specifically, updating the data set through data augmentation based on the false detection target further includes: when the false detection target is a stable false detection target, correctly labeling the first picture in which the false detection target was detected and adding it to the data set; randomly cropping a picture containing the false detection target out of the first picture as a negative sample, and performing data augmentation to update the data set; and inputting pictures of targets with different relative sizes together with the pictures of the negative samples into the target detection model, so that the target detection model learns the difference between the target and the false detection target. Updating the data set through data augmentation based on the missed detection target further includes: performing data augmentation by changing the brightness and contrast of the second picture and/or by randomly matting the missed target out of the second picture and pasting it to other positions of the second picture, so as to update the data set, wherein the second picture contains the missed target; and inputting the second picture with the pasted missed target into the target detection model, so that the target detection model can detect the missed target.
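The two augmentations of step S108, brightness/contrast adjustment and cut-and-paste of a missed target, can be sketched on a grayscale picture held as nested lists; the parameter names and values are illustrative, not the patent's implementation:

```python
import copy

def adjust(img, alpha=1.0, beta=0):
    """Contrast (alpha) and brightness (beta) augmentation, clamped to 0-255."""
    return [[max(0, min(255, int(alpha * p + beta))) for p in row]
            for row in img]

def cut_and_paste(img, box, dst):
    """Copy the patch inside `box` (x, y, w, h) to position `dst` (x, y),
    producing a new training picture with the missed target duplicated."""
    x, y, w, h = box
    out = copy.deepcopy(img)
    for dy in range(h):
        for dx in range(w):
            out[dst[1] + dy][dst[0] + dx] = img[y + dy][x + dx]
    return out
```

In a real pipeline the same operations would be applied to color images with a library such as OpenCV or Albumentations, and the pasted target's label box would be duplicated alongside it.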
In step S110, the current monitoring video of the port operation area is captured at predetermined time intervals to obtain pictures to be detected, and target detection is performed on the pictures to be detected using the reinforcement-trained target detection model. For example, a video shot by a surveillance camera is typically 25 to 30 frames per second, and one picture is taken every 2 to 3 frames, i.e., every other frame or every two frames are skipped. For example, when the video shot by the monitoring camera is 25 frames per second, the predetermined time interval is 0.08 second or 0.12 second.
In another embodiment of the present application, an apparatus for computer vision based target detection is disclosed. Referring to fig. 10, the computer vision based target detection apparatus includes: a data set generation module 1002, an input size determination module, a model establishing module 1004, a target analysis module 1006, a data set updating module 1008, a reinforcement training module 1010, and a target detection module 1012.
The data set generation module 1002 is configured to perform motion information detection on the historical monitoring video of the port operation area, take screenshots according to the motion information, and label the targets to produce a data set. The data set generation module 1002 acquires the historical monitoring video of the port operation area from the database; specifically, the monitoring video is shot by a camera and stored in the database. According to the picture frame information of the historical monitoring video, a Gaussian mixture model is used to classify pixels as static or moving, so as to determine the moving pixel areas in the historical monitoring video. Pictures of the moving pixel areas are then captured, the targets in the moving pixel areas are labeled to generate the data set, and the data set is stored in the database of a storage server.
The input size determining module determines the input size of each picture frame in the data set according to the sizes of small targets in the historical monitoring video and the current monitoring video of the port operation area.
The model establishing module 1004 is configured to establish a neural network Yolov5x and perform preliminary training on the neural network Yolov5x using a part of the data set to obtain a target detection model. The model establishing module is further configured such that: the base network of the neural network Yolov5x uses a CSP network architecture; and on the basis of the pyramid feature maps with moving steps of 8, 16 and 32 in the neural network Yolov5x, a pyramid feature map with a moving step of 4 is added to detect small targets. A feature pyramid network is used to efficiently extract features at all scales: high-semantic, low-spatial-scale features are acquired from the top layer of the feature pyramid network, up-sampled, and fused with the low-semantic, high-spatial-scale features in the feature pyramid network, so that the fused features both retain resolution and highlight target information. Large targets include port work machines. Small targets include port personnel.
The target analysis module 1006 is configured to perform target detection on another portion of the data set using the target detection model so as to analyze false detection targets and missed detection targets. The target analysis module 1006 compares the targets detected using the target detection model with the labeled targets in the corresponding pictures of the data set to determine the false detection targets and the missed detection targets. For example, a false detection target may be a camera, and a missed detection target may be a squatting person. The target analysis module comprises a false detection target analysis submodule and a missed detection target analysis submodule. The false detection target analysis submodule determines a target detected by the target detection model in a corresponding labeled first picture that contains no target as a false detection target. The missed detection target analysis submodule determines a labeled target in a corresponding second picture that is not detected by the target detection model as a missed detection target.
The data set updating module 1008 is configured to update the data set through data augmentation based on the false detection target and/or the missed detection target. For false detection, the data set updating module: when the false detection target is a stable false detection target, correctly labels the first picture in which the false detection target was detected and adds it to the data set; and randomly crops pictures containing the false detection target from the first picture as negative samples and performs data augmentation to update the data set. For missed detection, the data set updating module performs data augmentation by changing the brightness and contrast of the second picture, and/or by randomly cutting out the missed detection target from the second picture and pasting it to other positions of the second picture, so as to update the data set, wherein the second picture contains the missed detection target.
The reinforcement training module 1010 is configured to perform reinforcement training on the target detection model using the updated data set. Pictures of targets at different relative sizes and negative-sample pictures are input into the target detection model so that the target detection model learns the difference between the targets and the false detection targets. The second picture with the pasted missed detection target is input into the target detection model so that the target detection model can detect the missed detection target.
The target detection module 1012 is configured to capture the current monitoring video of the port operation area at predetermined time intervals to obtain pictures to be detected, and to perform target detection on the pictures to be detected using the reinforcement-trained target detection model. Specifically, the targets detected in real time by the target detection model, including persons, vehicles, work machines and other objects, are displayed on a display, and illegal behaviors in the port operation area are shown in real time. In addition, security personnel are notified of target violations through a loudspeaker. The illegal behaviors include a person not wearing a safety helmet and/or work clothes, a person squatting, and the like, as well as vehicle overspeed, reverse travel, and the like.
Hereinafter, a computer vision-based object detection method is described in detail by way of specific examples.
The small target detection algorithm for the port operation area provided by the application is an improved algorithm based on the recent single-stage detection neural network YOLOv5. First, by increasing the network input size and correcting the input size ratio, detailed information of the input image is retained while the working efficiency of the detection network is improved. Then, on the basis of the original pyramid-shaped image features of YOLOv5, a shallower feature map is added as a detection output, improving the detection performance for tiny objects. Finally, after an initial version of the network is trained, several data augmentation methods are applied in a targeted manner according to the detection results on new data; under the premise of limited picture data, the data are augmented and training is iterated, improving the robustness of the detection network. Data augmentation is a common technique in deep learning, mainly used to enlarge the training data set and make it as diverse as possible, so that the trained model has stronger generalization capability.
The method is realized by the following technical scheme:
Step one: based on the small target detection requirement, the Yolov5x network is preliminarily selected as the small target detection network for the port operation area.
The reason why Yolov5x was used as the detection network is as follows:
1. Compared with the ResNet structure used by the base network of Yolo-tiny, the base network of Yolov5x uses a CSP structure, which reduces the amount of computation while slightly improving accuracy.
Table 1: performance ratio of existing method to Yolov5x
Method Backbone Size FPS #Parameter AP50 AP75 APS
YoloV3 Darknet53 608 30 62.3M 57.9 34.4 18.3
YoloV3(SPP) Darknet53 608 30 62.9M 60.6 38.2 20.6
PANet(SSP) CSPResNeXt50 608 35 56.9M 60.6 41.6 22.1
2. Compared with the 1024 channels of Yolo-tiny, the maximum channel number of Yolov5x is 1280. More channels mean more complex semantic expression capability, that is, more accurate detection can be obtained on shallow feature maps, improving detection accuracy.
3. Compared with the two-layer pyramid features of Yolo-tiny, Yolov5 uses three-layer pyramid features, and can therefore detect objects across more scales.
Step two: and determining the input size of the small target detection network according to the actual requirement of the field small target detection.
The method for selecting the size comprises the following steps:
1. Determine the resolution w_origin × w_origin of the video (if the resolution is rectangular, w_origin is the longer of the video's width and height) and the minimum size w_min × w_min required for an object to be detected.
2. The detection coordinates of Yolov5 are determined by the pyramid feature maps (the pyramid feature map principle is explained in detail in step three). By scale invariance, if an object is to be detected, it must produce a significant feature at the corresponding position of the feature map of the corresponding scale, so the object must cover at least 1 × 1 pixel on that feature map.
3. In order to stably detect a tiny object, we require that the object cover at least 1 × 1 pixel on the feature map with a step size of 16. If the input size of the model is set to w_input × w_input, the input size must satisfy:

    (w_input / w_origin) × w_min / 16 ≥ 1

which, after rearranging, gives:

    w_input ≥ 16 × w_origin / w_min
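The bound above can be sketched as a small helper (a minimal illustration; the function name is ours, not from the patent):

```python
def min_input_width(w_origin: int, w_min: int, stride: int = 16) -> float:
    """Smallest model input width such that an object of w_min x w_min pixels
    in the original w_origin-wide video still covers at least 1 x 1 pixel on
    the pyramid feature map with the given stride:
        (w_input / w_origin) * (w_min / stride) >= 1
    """
    return stride * w_origin / w_min

# The embodiment's numbers: 1920-wide video, 32-pixel tiny object.
print(min_input_width(1920, 32))  # -> 960.0
```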
Since a typical video frame is not square, if the detection model is given a square input size w_input × w_input, the regions above and below the frame cannot be used effectively, which wastes computation. Therefore, we make the input size ratio as close as possible to the ratio of the video frame.
Step three: on the basis of the original pyramid feature maps with moving step lengths of 8, 16 and 32, a pyramid feature map with a moving step length of 4 is added.
The definition of the feature pyramid network, and the reason for introducing the feature map with a step size of 4, are as follows. The feature pyramid network is a method for efficiently extracting features of each scale in a picture using a conventional convolutional neural network model. Its working principle is: first, the forward pass of the neural network is carried out, and high-semantic, low-spatial-scale features are obtained at the top layer of the network; these features are then up-sampled and fused with shallow features of the network (low-semantic, high-spatial-scale features), so that the fused features have both high semantics and high resolution; finally, the feature maps of different scales obtained in the first two steps are used to predict objects of different scales, respectively (refer to fig. 2).
For the network designed according to step two, a tiny object occupies only about 2 × 2 pixels on the feature map with the largest spatial size (the feature map with a step size of 8), and it is still difficult to recognize an object from features of this size. Therefore, a pyramid feature map with a larger spatial scale, i.e., a feature map with a step size of 4, needs to be added.
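As a rough illustration of why the stride-4 level helps, the footprint of a tiny object on each pyramid level can be estimated (the 1024-pixel input width is taken from the embodiment later in the text; the function itself is our sketch):

```python
def footprint_cells(obj_px: float, w_origin: int, w_input: int, stride: int) -> float:
    """Approximate side length, in feature-map cells, occupied by an object
    of obj_px pixels in the source video after resizing to w_input width."""
    return obj_px * w_input / w_origin / stride

# A 32-pixel person in a 1920-wide video, resized to a 1024-wide input:
print(round(footprint_cells(32, 1920, 1024, stride=8), 2))  # ~2x2 cells
print(round(footprint_cells(32, 1920, 1024, stride=4), 2))  # ~4x4 cells
```

Doubling the spatial resolution of the finest level roughly doubles the number of cells the object covers, giving the detection head more evidence to work with.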
Step four: and using part of the monitoring video, carrying out screenshot annotation according to the motion information, and making a data set.
Reasons and methods for using the motion information screenshot:
Because the obtained monitoring video of the port operation area is actual on-site operation video, and the site is characterized by few personnel who appear only briefly, the effective time of the monitoring video obtained from the on-site hard disk video recorder is very short (about 5%). If frames were captured at fixed time intervals, not only would a large number of invalid pictures (i.e., pictures without objects to be detected) be captured, but important moments when personnel appear could also be missed. Meanwhile, when no object is present, the scene background does not change obviously over time; when an object is present, the object is generally moving. Therefore, screenshots need to be taken according to motion information: when some region of the video is moving, the current video frame is captured.
The motion detection of the video is performed using a Gaussian mixture model (GMM), and whether a specific video frame is captured is determined according to the detection result (refer to fig. 3). A Gaussian mixture model is a model formed by combining a plurality of Gaussian distribution (normal distribution) models, with the formula:

    P(X | θ) = ∏_{j=1..N} P(x_j | θ)

where θ denotes the parameters to be estimated and X = {x_1, x_2, …, x_N} is the sample set, x_j being a sample in X. P(X | θ) is the probability of drawing the N samples from a population with distribution P(x | θ), i.e., the joint probability of the samples x_j in the sample set X.
According to the inter-frame information characteristics of the video, Gaussian mixture classification is used to judge which pixels in the video are moving and which are static, thereby determining the positions of moving pixels in the video.
The parameters of the Gaussian mixture model need to be updated in real time according to the motion in the video; an EM (expectation maximization) algorithm is used for the parameter updates. The steps of the EM algorithm are as follows:
1. Initialize the parameters θ.
2. E-step: according to the current parameters, calculate the probability r_jk that each data point x_j comes from sub-model k:

    r_jk = α_k φ(x_j | θ_k) / Σ_{k=1..K} α_k φ(x_j | θ_k)

where α_k is the weight of each Gaussian distribution and φ is the probability density function of each Gaussian sub-model, with model parameters θ_k.
3. M-step: calculate the model parameters for the next iteration:

    μ_k = Σ_{j=1..N} r_jk x_j / Σ_{j=1..N} r_jk

    σ_k² = Σ_{j=1..N} r_jk (x_j − μ_k)² / Σ_{j=1..N} r_jk

    α_k = (1/N) Σ_{j=1..N} r_jk
4. Repeat the E-step and M-step until convergence, i.e., ||θ^{i+1} − θ^{i}|| < ε, where ε is a small positive number indicating that the parameters change very little after one iteration.
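The E/M updates above can be sketched as a minimal, self-contained EM loop for a one-dimensional mixture (illustrative only; OpenCV's background subtractor implements its own per-pixel variant of this, and all names here are ours):

```python
import math
import random

def gauss_pdf(x, mu, var):
    # Density of a 1-D Gaussian with mean mu and variance var.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture with the E/M updates above."""
    lo, hi = min(data), max(data)
    mus = [lo + (hi - lo) * (i + 1) / (k + 1) for i in range(k)]  # crude init
    variances = [1.0] * k
    alphas = [1.0 / k] * k
    n = len(data)
    for _ in range(iters):
        # E-step: responsibility r[j][i] of component i for sample j.
        r = []
        for x in data:
            w = [alphas[i] * gauss_pdf(x, mus[i], variances[i]) for i in range(k)]
            s = sum(w) or 1e-300
            r.append([wi / s for wi in w])
        # M-step: re-estimate means, variances and mixture weights.
        for i in range(k):
            nk = sum(r[j][i] for j in range(n))
            mus[i] = sum(r[j][i] * data[j] for j in range(n)) / nk
            variances[i] = max(
                sum(r[j][i] * (data[j] - mus[i]) ** 2 for j in range(n)) / nk, 1e-6)
            alphas[i] = nk / n
    return mus, variances, alphas

random.seed(0)
samples = [random.gauss(0, 1) for _ in range(200)] + \
          [random.gauss(10, 1) for _ in range(200)]
mus, _, alphas = em_gmm_1d(samples)
print(sorted(round(m, 1) for m in mus))  # means near 0 and 10
```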
Step five: and (4) preliminarily training an improved version of the small target detection model by using the data set produced in the step four.
Step six: and detecting the untrained port operation area video by using the small target detection model trained in the step five, and analyzing the false detection and missed detection conditions.
False detection: an area that contains no object is judged by the detection network to be some object.
Missed detection: some objects in the original picture are not detected by the neural network.

Step seven: based on the false detection and missed detection conditions, apply data updating, data augmentation and other methods, perform reinforcement training, and improve the robustness of the detection model.
And updating data:
1. For the false detection cases in step six, analyze whether the false detection is stable. If it is, scene pictures from multiple time periods can be added to the data set as negative samples. If it is only sporadic, it need not be added to the training set as a negative sample.
2. For the missed detection cases in step six, if the scale of the missed person does not meet the previously set tiny-object size standard, or the person's surroundings are too complex, the picture is not added as training data; otherwise, the picture containing the missed person is labeled and put into the detection data set.
Because the amount of collected sample video data is too small, it is difficult for the existing detection network training set to cover all the varied port environments and support stable detection, so robustness is limited and the training data need to be augmented. Given the actual conditions of the monitoring video in the port operation area, the following data augmentation methods were chosen:
random cutting: that is, some parts of the picture are randomly deducted for detection, the method can change the relative size of the object, and the model can extract the characteristics of the object in the objects with different relative sizes, so that the detection accuracy of the object of the type is improved.
Random horizontal flipping: the picture is flipped in the horizontal direction. Since objects never appear upside down in the monitoring footage, vertical flipping is not appropriate, but horizontal flipping is a valid way to add training data.
Randomly adjusting brightness, contrast and saturation: the monitoring video acquired from the port operation area cannot cover all natural conditions on site, such as cloudy weather, backlight, and shadow occlusion. Under such conditions the trained model's detection may be unstable. Therefore, the robustness of the trained model needs to be improved by randomly adjusting the brightness, contrast and saturation of the pictures.
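The brightness/contrast jitter described here can be sketched as a linear pixel transform (a minimal grayscale illustration; the jitter ranges are our illustrative choices, not values from the patent):

```python
import random

def adjust_brightness_contrast(image, alpha, beta):
    """pixel' = alpha * pixel + beta, clamped to [0, 255].
    `image` is a grayscale picture as nested lists of ints."""
    return [[max(0, min(255, round(alpha * p + beta))) for p in row]
            for row in image]

def random_photometric_augment(image, rng=random):
    # alpha jitters contrast, beta jitters brightness (illustrative ranges).
    alpha = rng.uniform(0.7, 1.3)
    beta = rng.uniform(-30, 30)
    return adjust_brightness_contrast(image, alpha, beta)

img = [[100, 150], [200, 250]]
print(adjust_brightness_contrast(img, 1.2, 10))  # -> [[130, 190], [250, 255]]
```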
Embodiment: video monitoring of a port operation area
Step one: the Yolov5x detection network is selected as the small target detection network for the port operation area.
Step two: and determining the input size of the small target detection network according to the actual requirement of the field small target detection.
Based on the input size determination method in step two of the scheme, the final input size of the detection model can be calculated. Here, we take the minimum size w_min of a tiny object in the surveillance video to be 32, and the monitored video size w_origin to be 1920 (the original resolution is 1920 × 1080). According to the model input size formula of step two:

    w_input ≥ 16 × 1920 / 32 = 960

We take the power of 2 greater than 960, i.e., w_input = 1024. In addition, according to the requirement that the input size ratio fit the actual video frame, the input size of the model is determined to be 1024 × 576.
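The two rounding steps (a power of two above the bound, and an aspect-fitted height) can be sketched as follows; the stride-alignment rationale for 576 is our assumption, since the text only states the final 1024 × 576:

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two strictly greater than n."""
    m = 1
    while m <= n:
        m *= 2
    return m

def fitted_height(w_input: int, origin_w: int = 1920, origin_h: int = 1080,
                  stride: int = 32) -> int:
    """Height matching the source aspect ratio, rounded to a multiple of the
    coarsest stride so every pyramid level divides evenly (an assumption)."""
    return round(w_input * origin_h / origin_w / stride) * stride

print(next_power_of_two(960))  # -> 1024
print(fitted_height(1024))     # -> 576
```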
Step three: by modifying the model configuration file of Yolov5x, a pyramid feature map (with the size of 256 × 144) with the moving step size of 4 is added on the basis of the original pyramid feature maps with the moving step sizes of 8,16 and 32.
Step four: and using part of the monitoring video, carrying out screenshot annotation according to the motion information, and making a data set.
Taking a video example, data set production is performed (refer to fig. 4).
The video lasts 1 minute, of which only 20 seconds contain people. If frames were captured every fixed number of frames as usual, 67% of the captured frames would contain no people or only repeated content. Such frames are invalid and cannot be put into the training data set. Therefore, a Gaussian mixture model from OpenCV (the OpenCV function is createBackgroundSubtractorMOG2) is used to judge the motion in the video, and whether a video frame is captured is decided according to that motion. With this method, the captured frames are guaranteed both to contain an object and to capture the object while it is moving.
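The frame-gating logic can be sketched independently of OpenCV (the patent uses cv2.createBackgroundSubtractorMOG2; here the subtractor is abstracted as a callable returning a binary foreground mask, and the names and threshold are ours):

```python
def moving_fraction(fg_mask):
    """Fraction of foreground (moving) pixels in a binary mask given as
    nested lists of 0/1 values."""
    total = sum(len(row) for row in fg_mask)
    return sum(sum(row) for row in fg_mask) / total if total else 0.0

def select_frames(frames, subtract_background, threshold=0.01):
    """Return indices of frames whose moving area exceeds the threshold;
    `subtract_background` stands in for a fitted MOG2 subtractor's apply()."""
    return [i for i, frame in enumerate(frames)
            if moving_fraction(subtract_background(frame)) > threshold]

# Toy demo: only frame 1 contains motion.
masks = [
    [[0, 0], [0, 0]],
    [[1, 1], [0, 0]],
    [[0, 0], [0, 0]],
]
print(select_frames(range(3), lambda i: masks[i], threshold=0.1))  # -> [1]
```

In a real pipeline the selected frames would be written to disk and passed on for labeling, replacing fixed-interval sampling.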
The method intercepts the required video frame and produces the training, verifying and testing data set required by training the detection network.
Step five: and (4) preliminarily training an improved version of the small target detection model by using the data set produced in the step four.
Step six: and detecting the untrained port operation area video by using the small target detection model trained in the step five, and analyzing the false detection and missed detection conditions.
Example of false detection situation:
As shown in fig. 5, the trained detection network accurately detects a person and a port machine in the center of the video, but it misdetects a surveillance camera dome mounted on a track crane as a person. Because this video was not in the data set during model training, it can be inferred that the detection network has not fully adapted to the complex environment, so the video should be added to the training set for adaptive training. In addition, since at small relative scales the camera looks very similar to a person, the camera should be used as a negative sample so that training extracts features that distinguish the camera from a person.
Example of a missed detection situation:
As shown in fig. 6, there are four people in the lower right corner of the video, one of whom is not detected by the trained network. The reasons for the missed detection can be deduced by analysis: the person is backlit, making the person's features less obvious; and the person is squatting, which differs from the features of a standing person. For the backlight problem, data can be augmented during training by changing the brightness, contrast and similar properties of the picture. The unstable detection of squatting persons can be addressed by randomly cutting out the squatting person and pasting them into other pictures of the training set, so that the network can learn the features of a squatting person.
Step seven: based on the false detection and missed detection conditions, apply data updating, data augmentation and other methods, perform reinforcement training, and improve the robustness of the detection model.
For the false detection situation:
Taking the false detection example in step six as an example, according to the previous analysis, the video frame needs to be correctly labeled and then put into the data set. At the same time, the camera on the track crane is randomly cropped out of the picture as a negative sample. By inputting samples and negative samples at different relative scales, the detection network can learn the difference between the person and the camera (fig. 7A is the original image; fig. 7B is the randomly cropped image).
For the case of missed detection:
Taking the missed detection example in step six as an example, two data augmentation ideas follow from the previous analysis: augment by changing the brightness and contrast of the picture; and augment by randomly cutting out the squatting person and pasting them into other pictures of the training set.
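The cut-and-paste augmentation for the squatting person can be sketched as follows (grayscale nested lists; the function names and random placement policy are ours):

```python
import random

def paste_patch(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at (top, left)."""
    out = [row[:] for row in image]
    for i, prow in enumerate(patch):
        out[top + i][left:left + len(prow)] = prow
    return out

def copy_paste_augment(image, box, rng=random):
    """Cut the region box = (top, left, height, width) -- e.g. the bounding
    box of a missed squatting person -- and paste it at a random position,
    yielding an extra training picture."""
    t, l, h, w = box
    patch = [row[l:l + w] for row in image[t:t + h]]
    new_t = rng.randrange(len(image) - h + 1)
    new_l = rng.randrange(len(image[0]) - w + 1)
    return paste_patch(image, patch, new_t, new_l), (new_t, new_l, h, w)

img = [[0] * 4 for _ in range(4)]
img[0][0] = 9  # stand-in for the person's pixels
augmented, new_box = copy_paste_augment(img, (0, 0, 1, 1))
print(any(v == 9 for row in augmented for v in row))  # -> True
```

In a real pipeline the returned box would be appended to the picture's label file so the network sees the squatting person at a new position.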
Fig. 8A and 8B show the first effect of the augmentation method (taking contrast change as an example, fig. 8A is an original image, and fig. 8B is an adjusted contrast image).
Fig. 9A and 9B show the second effect of the augmentation method (fig. 9A shows the original image, and fig. 9B shows the missing person added to another position of the image).
Compared with the prior art, the application can realize at least one of the following beneficial effects:
1. By redesigning the network structure of Yolov5x, the detection capability for small target objects in the port operation area is improved. Through the systematic design of the input size of the detection model and the addition of the pyramid feature map with a step size of 4, the accuracy of the detection model for port personnel is improved from 90% (Yolo-tiny) to 95%. Meanwhile, the introduced Yolov5x network runs on the Huawei Atlas-300 accelerator card, and the running time for the same input size is reduced from 120 ms to 50 ms.
2. The method and device effectively use a motion detection algorithm when acquiring the real-time monitoring data set of the port operation area. They can accurately capture picture data in which objects are moving in the real-time monitoring video, and filter out static frames (where the preceding and following frames are unchanged) and frames without objects, thereby improving the efficiency of building the monitoring data set.
3. Data augmentation is performed in a targeted manner according to the specific detection results, improving the detection capability for small target objects in the port operation area. Targeted data augmentation reduces useless training as much as possible while improving recognition accuracy from 92% to 95%.
The term "object detection apparatus" includes various devices, apparatuses and machines for processing data, e.g., an object detection apparatus includes a programmable processor, a computer, multiple processors or multiple computers, etc. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a runtime environment, or a combination of one or more of them.
The methods and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform these functions by operating on surveillance video and generating target detection results.
Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing the historical video data and the data set (e.g., magnetic, magneto-optical disks, or optical disks). Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and storage devices, including by way of example: semiconductor memory devices such as EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM disks and DVD-ROM disks.
Those skilled in the art will appreciate that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium, to instruct related hardware. The computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application.

Claims (16)

1. A method for detecting an object based on computer vision, comprising:
detecting motion information of a historical monitoring video of a port operation area, carrying out screenshot according to the motion information and marking a target to make a data set;
establishing a neural network Yolov5x, and performing preliminary training on the neural network Yolov5x by using a part of the data set to obtain a target detection model;
performing target detection on another part of the data set by using the target detection model to analyze a false detection target and a missed detection target;
updating the data set through data augmentation based on the false detection target and/or the missed detection target, and performing reinforced training on the target detection model by using the updated data set; and
and capturing the current monitoring video of the port operation area at preset time intervals to obtain a picture to be detected, and performing target detection on the picture to be detected by using an enhanced target detection model.
2. The computer vision-based target detection method of claim 1, wherein the steps of detecting motion information of historical monitoring videos of a port operation area, capturing a picture according to the motion information and labeling a target to produce a data set further comprise:
acquiring a historical monitoring video of the port operation area from a database;
classifying the historical monitoring video into static pixels and moving pixels by using a Gaussian mixture model according to picture frame information of the historical monitoring video so as to judge a moving pixel area in the historical monitoring video; and
capturing a picture of a moving pixel area in the historical monitoring video and labeling a target in the moving pixel area to generate the data set.
3. The computer vision based object detection method of claim 2, wherein the objects include large objects of relatively large size and small objects of relatively small size to be detected, and further comprising, before training the neural network Yolov5x with the data set to obtain an object detection model:
and determining the input size of each picture frame in the data set according to the sizes of the small targets in the historical monitoring video and the current monitoring video of the port operation area.
4. The computer vision based target detection method of claim 3, wherein establishing a neural network Yolov5x further comprises:
the base net of the neural network Yolov5x uses a CSP network architecture; and
on the basis of the pyramid feature maps with moving steps of 8, 16 and 32 of the neural network Yolov5x, adding a pyramid feature map with a moving step of 4 to detect the small target.
5. The computer vision based object detection method of claim 3, wherein analyzing the false detection object and the missed detection object further comprises:
comparing the target detected using the target detection model with the annotated targets in the corresponding pictures of the dataset to determine the false positive target and the false negative target, wherein,
determining a target that is detected by using the target detection model in a corresponding labeled first picture that originally contains no target as the false detection target; and
determining a labeled target in a corresponding second picture that is not detected by using the target detection model as the missed detection target.
6. The computer vision based target detection method of claim 5, wherein updating the data set by data augmentation based on the false detection target and using the updated data set to improve the robustness of the target detection model further comprises:
when the false detection target is a stable false detection target, correctly labeling the first picture in which the false detection target was detected and adding the first picture to the data set;
randomly cropping, from the first picture, a picture containing the false detection target as a negative sample, and performing data augmentation to update the data set; and
inputting pictures of targets having different relative sizes and pictures of the negative sample into the target detection model, so that the target detection model learns to distinguish the targets from the false detection target.
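The negative-sample cropping of claim 6 can be sketched as random crops that all contain the box of a stable false detection (e.g. a mounted camera repeatedly detected as a person). Function names, crop size, and the synthetic picture are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def crop_negative_samples(picture, false_box, n=4, size=64):
    """Randomly crop patches that each contain the false-detection box,
    to be added to the data set as background-only negative samples.
    false_box = (x1, y1, x2, y2) in pixel coordinates; requires the box
    to be no larger than the crop size."""
    h, w = picture.shape[:2]
    x1, y1, x2, y2 = false_box
    crops = []
    for _ in range(n):
        # Choose a crop origin so the crop still covers the whole box.
        cx = int(rng.integers(max(0, x2 - size), min(x1, w - size) + 1))
        cy = int(rng.integers(max(0, y2 - size), min(y1, h - size) + 1))
        crops.append(picture[cy:cy + size, cx:cx + size].copy())
    return crops

picture = np.zeros((256, 256, 3), dtype=np.uint8)
picture[100:120, 100:130] = 255          # a bright camera-like blob
negatives = crop_negative_samples(picture, (100, 100, 130, 120))
```

Because the crop origin varies, the false-detection object appears at different positions and relative scales across the negatives, which is what lets the model separate it from true targets.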
7. The computer vision based target detection method of claim 5, wherein updating the data set by data augmentation based on the missed detection target and using the updated data set to improve the robustness of the target detection model further comprises:
updating the data set by data augmentation that changes the brightness and contrast of the second picture and/or data augmentation that randomly cuts the missed detection target out of the second picture and pastes it to other positions of the second picture, wherein the second picture contains the missed detection target; and
inputting the second picture with the pasted missed detection target into the target detection model, so that the missed detection target can be detected by the target detection model.
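Both augmentations of claim 7 are simple pixel operations. The sketch below shows a copy-paste step (cut the missed target's box and paste it at a random position, producing a new box label) and a linear brightness/contrast jitter; all names and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def paste_missed_target(picture, box):
    """Copy-paste augmentation: copy the missed target's patch and paste
    it at a random position, returning the augmented picture and the new
    bounding box. box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    patch = picture[y1:y2, x1:x2].copy()
    ph, pw = patch.shape[:2]
    h, w = picture.shape[:2]
    nx = int(rng.integers(0, w - pw + 1))
    ny = int(rng.integers(0, h - ph + 1))
    out = picture.copy()
    out[ny:ny + ph, nx:nx + pw] = patch
    return out, (nx, ny, nx + pw, ny + ph)

def adjust_brightness_contrast(picture, alpha=1.2, beta=10.0):
    """Brightness/contrast augmentation: out = alpha * picture + beta."""
    return np.clip(alpha * picture.astype(np.float32) + beta,
                   0, 255).astype(np.uint8)

picture = np.zeros((128, 128), dtype=np.uint8)
picture[40:60, 40:50] = 200                  # the missed squatting person
augmented, new_box = paste_missed_target(picture, (40, 40, 50, 60))
jittered = adjust_brightness_contrast(picture)
```

Each augmented picture keeps (or gains) a correct box label, so retraining on it teaches the model to find the target in new positions and lighting.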
8. The computer vision based target detection method of claim 3, wherein:
the large target comprises a port work machine;
the small target comprises port personnel;
the false detection target comprises a camera; and
the missed detection target comprises a squatting person.
9. An object detection device based on computer vision, comprising:
the data set generating module is used for detecting motion information in a historical monitoring video of a port operation area, capturing screenshots according to the motion information, and labeling targets to produce a data set;
the model establishing module is used for establishing a neural network Yolov5x, and performing preliminary training on the neural network Yolov5x by using a part of the data set to obtain a target detection model;
the target analysis module is used for carrying out target detection on another part of the data set by using the target detection model, so as to analyze a false detection target and a missed detection target;
the data set updating module is used for updating the data set through data augmentation based on the false detection target and/or the missed detection target;
the reinforced training module is used for carrying out reinforced training on the target detection model by utilizing the updated data set; and
and the target detection module is used for capturing a frame from the current monitoring video of the port operation area at preset time intervals to obtain a picture to be detected, and performing target detection on the picture to be detected by using the reinforcement-trained target detection model.
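Capturing one picture every preset interval, as the target detection module of claim 9 does, reduces to grabbing every Nth frame given the stream's frame rate. The sketch below computes the frame indices to grab; the frame rate and interval are illustrative assumptions:

```python
def frames_to_capture(fps, interval_s, duration_s):
    """Indices of the frames to grab when one picture is taken every
    interval_s seconds from a stream running at fps frames per second."""
    step = max(1, int(round(fps * interval_s)))
    total = int(fps * duration_s)
    return list(range(0, total, step))

# e.g. a 25 fps camera sampled every 2 seconds over one minute:
indices = frames_to_capture(fps=25, interval_s=2.0, duration_s=60)
# each grabbed frame becomes a picture to be detected by the model
```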
10. The computer vision based object detecting apparatus of claim 9, wherein the data set generating module is configured to:
acquiring a historical monitoring video of the port operation area from a database;
classifying pixels of the historical monitoring video into static pixels and moving pixels by using a Gaussian mixture model according to picture frame information of the historical monitoring video, so as to determine a moving pixel area in the historical monitoring video; and
capturing a picture of a moving pixel area in the historical monitoring video and labeling a target in the moving pixel area to generate the data set.
11. The computer vision based object detection device of claim 10, further comprising an input size determination module for determining an input size of each picture frame in the data set according to sizes of small objects in the historical surveillance video and the current surveillance video of the port operation area.
12. The computer vision based target detection device of claim 11, wherein the model establishing module is further configured such that:
the backbone network of the neural network Yolov5x uses a CSP (Cross Stage Partial) architecture; and
in addition to the pyramid feature maps with strides of 8, 16 and 32 in the neural network Yolov5x, a pyramid feature map with a stride of 4 is added to detect the small target.
13. The computer vision based target detection device of claim 11, wherein the target analysis module is configured to compare targets detected using the target detection model with the labeled targets in the corresponding pictures of the data set to determine the false detection target and the missed detection target, wherein the target analysis module comprises a false detection target analysis submodule and a missed detection target analysis submodule,
the false detection target analysis submodule is used for determining, as the false detection target, a target that is detected by the target detection model but is not labeled in the corresponding first picture; and
the missed detection target analysis submodule is used for determining, as the missed detection target, a target that is labeled in the corresponding second picture but is not detected by the target detection model.
14. The computer vision based target detection device of claim 13, wherein the data set updating module is configured to:
when the false detection target is a stable false detection target, correctly label the first picture in which the false detection target was detected and add the first picture to the data set;
randomly crop, from the first picture, a picture containing the false detection target as a negative sample, and perform data augmentation to update the data set; and
input pictures of targets having different relative sizes and pictures of the negative sample into the target detection model, so that the target detection model learns to distinguish the targets from the false detection target.
15. The computer vision based target detection device of claim 13, wherein the data set updating module is configured to:
update the data set by data augmentation that changes the brightness and contrast of the second picture and/or data augmentation that randomly cuts the missed detection target out of the second picture and pastes it to other positions of the second picture, wherein the second picture contains the missed detection target; and
input the second picture with the pasted missed detection target into the target detection model, so that the missed detection target can be detected by the target detection model.
16. The computer vision based target detection device of claim 11, wherein:
the large target comprises a port work machine;
the small target comprises port personnel;
the false detection target comprises a camera; and
the missed detection target comprises a squatting person.
CN202111291144.3A 2021-10-29 2021-10-29 Target detection method and device based on computer vision Pending CN114120220A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111291144.3A CN114120220A (en) 2021-10-29 2021-10-29 Target detection method and device based on computer vision
PCT/CN2022/072005 WO2023070955A1 (en) 2021-10-29 2022-01-14 Method and apparatus for detecting tiny target in port operation area on basis of computer vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111291144.3A CN114120220A (en) 2021-10-29 2021-10-29 Target detection method and device based on computer vision

Publications (1)

Publication Number Publication Date
CN114120220A true CN114120220A (en) 2022-03-01

Family

ID=80380337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111291144.3A Pending CN114120220A (en) 2021-10-29 2021-10-29 Target detection method and device based on computer vision

Country Status (2)

Country Link
CN (1) CN114120220A (en)
WO (1) WO2023070955A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998749A (en) * 2022-07-28 2022-09-02 北京卫星信息工程研究所 SAR data amplification method for target detection

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886359B (en) * 2019-03-25 2021-03-16 西安电子科技大学 Small target detection method and detection system based on convolutional neural network
CN110060274A (en) * 2019-04-12 2019-07-26 北京影谱科技股份有限公司 The visual target tracking method and device of neural network based on the dense connection of depth
CN112633308A (en) * 2020-09-15 2021-04-09 北京华电天仁电力控制技术有限公司 Detection method and detection system for whether power plant operating personnel wear safety belts
CN112395957B (en) * 2020-10-28 2024-06-04 连云港杰瑞电子有限公司 Online learning method for video target detection
CN112949508A (en) * 2021-03-08 2021-06-11 咪咕文化科技有限公司 Model training method, pedestrian detection method, electronic device and readable storage medium

Also Published As

Publication number Publication date
WO2023070955A1 (en) 2023-05-04

Similar Documents

Publication Publication Date Title
CN107742099A A crowd density estimation and people counting method based on a fully convolutional network
CN108961235A A broken insulator recognition method based on the YOLOv3 network and a particle filter algorithm
CN108875600A A YOLO-based vehicle information detection and tracking method and apparatus, and computer storage medium
CN104978567B Vehicle detection method based on scene classification
CN105260749B Real-time target detection method based on oriented gradient binary patterns and soft-cascade SVM
CN104166841A Rapid detection and identification method for specified pedestrians or vehicles in a video surveillance network
Ahmad et al. Overhead view person detection using YOLO
US11194997B1 (en) Method and system for thermal infrared facial recognition
CN112560675B (en) Bird visual target detection method combining YOLO and rotation-fusion strategy
CN110689000B (en) Vehicle license plate recognition method based on license plate sample generated in complex environment
CN111079518B (en) Ground-falling abnormal behavior identification method based on law enforcement and case handling area scene
CN110490902A (en) Method for tracking target, device, computer equipment applied to smart city
CN110781882A (en) License plate positioning and identifying method based on YOLO model
CN111563449A (en) Real-time classroom attention detection method and system
CN110852164A (en) YOLOv 3-based method and system for automatically detecting illegal building
CN111046746A (en) License plate detection method and device
CN112381043A (en) Flag detection method
CN114187664B (en) Rope skipping counting system based on artificial intelligence
CN116740758A (en) Bird image recognition method and system for preventing misjudgment
CN111507416A (en) Smoking behavior real-time detection method based on deep learning
CN111723656A (en) Smoke detection method and device based on YOLO v3 and self-optimization
CN114120220A (en) Target detection method and device based on computer vision
CN103295238B (en) Video real-time location method based on ROI motion detection on Android platform
CN113505808B (en) Deep learning-based switch detection and identification algorithm for power distribution facilities
CN107169412A (en) Remote sensing image harbor-berthing ship detection method based on mixed model decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination