CN110689021A - Real-time target detection method in low-visibility environment based on deep learning


Info

Publication number
CN110689021A
Authority
CN
China
Prior art keywords
network
target
model
low
target detection
Prior art date
Legal status
Pending
Application number
CN201910985552.5A
Other languages
Chinese (zh)
Inventor
李成严 (Li Chengyan)
马金涛 (Ma Jintao)
赵帅 (Zhao Shuai)
Current Assignee
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201910985552.5A
Publication of CN110689021A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering
    • G06V10/34 Smoothing or thinning of the pattern; Morphological operations; Skeletonisation


Abstract

The invention provides a real-time target detection method for low-visibility environments based on deep learning, which uses guided filtering to solve the problem of target detection in environments heavily affected by smoke dust, water mist, and light and shadow. Frame pictures are processed with an SSD target detection model to find the target region coordinates, making full use of the accuracy advantage of the SSD model. Guided filtering is introduced and integrated with the SSD target detection model to counter the influencing factors of low-visibility environments: in scenes degraded by environmental factors, guided filtering performs image enhancement, defogging, and similar operations, so the processed images are clearer and of higher resolution. After the images are processed, target position coordinates are generated and the processed images are passed to the lower-layer GoogLeNet network for accuracy verification; exploiting the efficiency of the GoogLeNet network improves detection precision without reducing speed. The method can accurately identify targets in a low-visibility environment and has good reliability and high recognition precision.

Description

Real-time target detection method in low-visibility environment based on deep learning
Technical Field
The invention relates to the field of image processing and target detection, in particular to a real-time target detection method in a low-visibility environment based on deep learning. A low-visibility environment is defined as an environment heavily influenced by external factors such as smoke, water mist, and insufficient light.
Background
Image processing refers to techniques that analyze an image with a computer to achieve a desired result, and generally means digital image processing. A digital image is a large two-dimensional array captured by industrial cameras, video cameras, scanners, and similar devices; its elements are called pixels, and their values are called gray-scale values. Image processing technology generally comprises three parts: image compression; enhancement and restoration; and matching, description, and recognition.
The image processing method used by the invention is guided filtering. A guided filter requires a guidance image, which may be a separate image or the input image itself; when the guidance image is the input image, guided filtering becomes an edge-preserving filtering operation that can be used for image reconstruction. It is an adaptive-weight filter that supports image smoothing, enhancement, matting, feathering, defogging, joint upsampling, and similar operations. An intuitive approach is to apply guided filtering directly to the three color channels (RGB). Guided filtering assumes a local linear relationship within the window centered at pixel k, which can be verified by differentiating that relation; the two parameters a and b are obtained as mean values over the windows around pixel k, which smooths the image while maintaining boundaries.
Guided filtering keeps linear complexity because each pixel is covered by multiple windows: the output value at a point is obtained simply by averaging all the linear function values covering that point. Since guided filtering requires only a linear amount of computation, processing efficiency is significantly improved. Using these two methods makes target detection more effective: the target can be detected accurately even in a low-visibility environment (e.g., affected by smoke, water mist, and illumination), the detection accuracy is improved, and the false alarm rate and missed detection rate are reduced.
Target detection, also called target extraction, is image segmentation based on target geometry and statistical features; it combines target segmentation and recognition, and its accuracy and real-time performance are key capabilities of the whole system. Especially in complex scenes where multiple targets must be processed in real time, automatic target extraction and recognition are particularly important.
With the development of computer technology and the wide application of computer vision principles, real-time target tracking research using computer image processing technology has become increasingly popular. Dynamic real-time tracking and positioning of targets has wide application value in intelligent traffic systems, intelligent monitoring systems, military target detection, and surgical instrument positioning in medically navigated operations.
The invention relates to the field of object detection, and provides a method for detecting an object in a low-visibility environment by using an object detection technology.
Disclosure of Invention
To solve the problem of target detection under low visibility, the present invention uses a target detection method that can smooth, enhance, and defog an image.
Therefore, the invention provides the following technical scheme:
a real-time target detection method based on deep learning in a low-visibility environment is characterized in that an improved VGG16 network, a GoogleLeNet target detection algorithm and guiding filtering are combined, so that target detection can be accurately identified in environments with insufficient smoke, water mist and light. The specific process comprises the following steps:
step 1: real-time video stream acquisition
Step 2: generating a data set in target detection;
and step 3: setting parameters of the improved VGG16 network;
and 4, step 4: introducing an improved VGG16 network to process and classify target data;
and 5: testing the performance of the trained network model to find a current performance optimal model;
step 6: introducing guiding filtering by combining the characteristics of the target in a low-visibility environment, so that the model can accurately find the target and determine the target coordinate;
and 7: introducing a GoogLeNet network model, performing multi-scale training, extracting deep-level features of data, and then detecting a target in a coordinate region;
and 8: and constructing an integral target detection framework, namely a VGFG (VGG16 Guided Filter GoogleLeNet) model, and testing by using a model obtained by combining an improved VGG16 network, a GoogleLeNet target detection algorithm and guide filtering.
Further, to obtain the real-time video stream, the computer IP (the IP of the added network card) is modified according to the IP of the hard disk video recorder, so that the computer is on the same network segment as the hard disk video recorder.
Further, a VOC-format data set is produced: the picture data in the data set is labeled with the labelImg tool to generate XML files, a data set path document is generated, and path files for the training, test, and validation sets are produced.
Further, prior boxes of appropriate scale, the network learning rate, and the number of training layers are selected. The scale of the prior boxes obeys a linear increase rule: as the feature map size decreases, the prior box scale increases linearly. The calculation formula is

$$S_k = S_{min} + \frac{S_{max} - S_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where m is the number of feature maps, $S_k$ is the ratio of the prior box size to the picture size, and $S_{min}$ and $S_{max}$ are the minimum and maximum values of this ratio, respectively.
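As a minimal Python sketch of this linear scale rule (the defaults $S_{min}=0.2$ and $S_{max}=0.9$ are assumptions taken from the original SSD paper, not values stated in this patent):

```python
def prior_box_scales(m: int, s_min: float = 0.2, s_max: float = 0.9) -> list:
    """Linearly increasing prior-box scales S_k for feature maps k = 1..m."""
    if m == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

# For the 6 detection feature maps of a standard SSD300:
print(prior_box_scales(6))  # approx. [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```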
The choice of learning rate is a hyperparameter problem requiring repeated testing over the basic range 0.1, 0.01, 0.001, 0.0001, increasing by orders of magnitude until an optimum is found; it can also be adjusted through the gradient of the loss function, which is

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the number of positive samples, and $x_{ij}^{p} \in \{0, 1\}$ indicates whether the i-th prior box is matched to the j-th ground truth of category p. c is the predicted class confidence, l is the predicted location of the bounding box corresponding to the prior box, and g is the location parameter of the ground truth.
Further, for training the improved VGG16 network, the convolutional layer weights are pre-trained by downloading the file darknet53.conv.74 together with the target detection data set generated in step 1.
Further, training is repeated and parameters are adjusted according to step 2; the improved VGG16 network outputs one model every 10 iterations, training runs 240000 iterations in total, and test results show that the model obtained at iteration 180000 performs best.
Further, targets in low-visibility environments are strongly affected by smoke, water mist, and light and shadow, making them difficult to identify. For these characteristics, guided filtering is introduced: it is adaptive-weight filtering that preserves boundaries while smoothing the image and supports smoothing, enhancement, matting, feathering, defogging, joint upsampling, and similar operations. In guided filtering, the output at a pixel is:

$$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k$$

where q is the output image, I is the guidance image, and a and b are the coefficients of the linear function, constant when the window center is located at k. The assumption of the method is that q and I have a local linear relationship in the window centered on pixel k. Taking the derivative of this equation ($\nabla q = a \nabla I$) shows that an edge appears in the output only where there is an edge in the guidance image. To solve for the coefficients a and b, let p be the image before filtering, with q required to differ from p as little as possible; by the degradation model of unconstrained image restoration,

$$q_i = p_i - n_i$$

where n is noise and p is the degraded image, i.e. q contaminated by noise n. This is converted into an optimization problem: restricting i to a window $\omega_k$ and penalizing large values of a, the cost function is

$$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - p_i)^2 + \epsilon a_k^2 \right)$$

Solving by least squares gives

$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \epsilon}, \qquad b_k = \bar{p}_k - a_k \mu_k$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of I in the local window $\omega_k$, and $|\omega|$ is the number of pixels in the window. The window operation is then applied over the whole image, and finally the average is taken, so the output at a pixel is:

$$q_i = \frac{1}{|\omega|} \sum_{k: i \in \omega_k} (a_k I_i + b_k) = \bar{a}_i I_i + \bar{b}_i$$

where

$$\bar{a}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} a_k, \qquad \bar{b}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} b_k$$
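To make the procedure concrete, here is a minimal NumPy/OpenCV sketch of a gray-scale guided filter following the equations above; the window radius r and regularization eps are free parameters, and the box-filter implementation is an assumption for illustration:

```python
import cv2
import numpy as np

def guided_filter(I: np.ndarray, p: np.ndarray, r: int = 8, eps: float = 1e-2) -> np.ndarray:
    """Gray-scale guided filter: q_i = mean(a)_i * I_i + mean(b)_i.

    I: guidance image, p: input image, both float32/float64 in [0, 1].
    r: box-window radius; eps: regularization preventing large a_k.
    """
    ksize = (2 * r + 1, 2 * r + 1)
    mean = lambda x: cv2.boxFilter(x, -1, ksize)   # normalized box filter = window mean

    mu_I, mu_p = mean(I), mean(p)
    var_I = mean(I * I) - mu_I * mu_I              # sigma_k^2
    cov_Ip = mean(I * p) - mu_I * mu_p             # (1/|w|) sum I_i p_i - mu_k p_bar_k

    a = cov_Ip / (var_I + eps)                     # a_k
    b = mu_p - a * mu_I                            # b_k = p_bar_k - a_k mu_k
    return mean(a) * I + mean(b)                   # a_bar_i * I_i + b_bar_i

# Self-guided use (I = p) gives edge-preserving smoothing of a frame:
# frame = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
# out = guided_filter(frame, frame, r=8, eps=1e-2)
```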
Furthermore, the GoogLeNet network model is a 22-layer neural network that finds an optimal local network structure within the existing deep network structure. To avoid the gradient-vanishing problem caused by increased depth, two auxiliary loss branches are added at different depths of the network to counteract vanishing gradients during back-propagation. To avoid the overfitting caused by the extra parameters that come with increased width, and the application difficulties caused by excessive computational complexity, GoogLeNet adopts the Inception structure, a Network-in-Network design in which each original node is itself a network. The Network-in-Network model replaces the traditional convolution with a fully connected multilayer perceptron to obtain a more comprehensive representation of features; since the feature representation has already been improved at this stage, the last fully connected layer of a traditional CNN is replaced by a global average pooling layer, at which point the feature maps are sufficiently reliable for classification and the loss can be computed directly through softmax.
Further, the overall target detection model is built: the improved VGG16 network serves as the first-layer network of the model and outputs the coordinates of the target region; the target region image is processed by guided filtering; finally, target recognition is performed by the GoogLeNet network model, and the overall framework is packaged to form a new target detection model.
Compared with the prior art, the technical scheme of the invention has the following effects:
When solving target detection in low-visibility environments, guided filtering makes the solution of the problem closer to reality. Even when target detection is affected by low-visibility factors such as dust, smoke, water mist, and illumination, the target can be detected accurately, the detection accuracy of the network model is improved, and both the false alarm rate and the missed detection rate of the network model are reduced. The target coordinates are found with the improved VGG16 network, making full use of its recognition accuracy; guided filtering is applied to the three color channels (RGB) to enhance the images and address low-visibility target detection; finally, the GoogLeNet network model is introduced to detect, recognize, and classify the target region found by the improved VGG16 network while preserving detection accuracy, reducing the false alarms and missed detections of the VGG16 network. Compared with other inventions, this one achieves higher detection precision, faster detection, and more accurate recognition results.
Drawings
FIG. 1 is a flow chart of the present invention
FIG. 2 is a diagram of an improved VGG16 network architecture
FIG. 3 is a diagram of the Inception structure adopted by the GoogLeNet network model
FIG. 4 is the dimension-reduced and improved Inception structure
FIG. 5 is a graph showing the training effect of the VGFG model
FIG. 6 is a comparison graph of the effect of each target detection model
Detailed Description
The technical scheme of the invention is explained in further detail below with reference to FIGS. 1-6:
FIG. 1 shows the flow chart of the present invention; each step is explained in detail on the basis of the flow chart.
Step 1, acquiring a real-time video stream;
in step 1, acquiring the real-time video stream requires adding one more network card to the host for accessing the video recorder, downloading Haokangwei video network video monitoring 4200 (selected according to the models of the camera and the silver disc recorder), modifying the computer IP (modifying the IP added with the network card) according to the IP of the hard disk video recorder to ensure that the network segment of the computer is consistent with the hard disk video recorder, accessing the IP of the hard disk video recorder by using a browser, inputting a user name and a password on a pop-up interface for login, and if the video stream is displayed, proving that the acquisition is successful. The hard disk recorder can be connected with four cameras, so that the lower channel numbers of 33,34,35 and 36 respectively need to be modified.
Step 2: generating a data set in deep learning;
In step 2, the deep learning data set is generated in VOC format: labelImg software is opened for image annotation and generates XML files; the VOC-format XML files are converted into txt files; and a folder VOC is created, consisting of 4 sub-folders: Annotations (storing all XML files), ImageSets (containing two sub-folders, Main and Layout; in Main, test.txt is the test set, train.txt the training set, val.txt the validation set, and trainval.txt the combined training-plus-validation set), JPEGImages (storing all image files), and labels (storing the txt files). With the VOC data set produced and each folder storing its files, training the network model is standardized and convenient.
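A small sketch of the split-file generation step (the 8:1:1 ratio and paths are assumptions for illustration, not values from the patent):

```python
import random
from pathlib import Path

xml_dir = Path("VOC/Annotations")
ids = [f.stem for f in xml_dir.glob("*.xml")]   # one id per annotated image
random.shuffle(ids)

n = len(ids)
splits = {
    "train": ids[: int(0.8 * n)],
    "val":   ids[int(0.8 * n): int(0.9 * n)],
    "test":  ids[int(0.9 * n):],
}
splits["trainval"] = splits["train"] + splits["val"]

out = Path("VOC/ImageSets/Main")
out.mkdir(parents=True, exist_ok=True)
for name, subset in splits.items():
    (out / f"{name}.txt").write_text("\n".join(subset))
```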
Step 3: setting parameters of the improved VGG16 network;
table 1 improved VGG16 network parameter settings
Step 4: processing and classifying the target data using the improved VGG16 network;
the VGG16 network is specifically improved by adding new convolutional layers in the VGG16 network to obtain more feature maps for detection, the VGG16 network feature map has high resolution, the information in the original image is more complete, and the receptive field is smaller, when the VGG16 network is improved, the full connection layers fc6 and fc7 of the VGG16 are respectively converted into 3 × 3 convolutional layers Conv6 and 1 × 1 convolutional layers Conv7, and simultaneously the pooling layer pool5 is changed from 2 × 2, which is the original stride 2, to 3 × 3, which is the stride 1, in order to match the change, an AtrousAlgorithm is adopted, and the convolutional layers Conv6 adopt extended convolution or perforated convolution (translation Conv), which exponentially expands the visual field of convolution under the condition of not increasing parameters and model complexity. The dropout layer and fc8 layers are then removed and a series of convolution layers are added and fine-tuned on the test data set.
The data set is processed with the improved VGG16 network, which is trained so that it can identify the target and output the coordinate information of the target region for the next step. The improved VGG16 network structure is shown in FIG. 2; its core is to use convolution kernels on feature maps to predict the class scores and offsets of a series of default bounding boxes. To improve detection accuracy, predictions are made on feature maps of different scales, and results with different aspect ratios are obtained as well. The improved VGG16 network is trained end to end and maintains detection accuracy even at low image resolution. It extracts feature maps of different scales for detection: large-scale feature maps (earlier in the network) can detect small objects, while small-scale feature maps (later in the network) detect large objects. Prior boxes (default boxes) of different scales and aspect ratios are adopted, each unit having prior boxes of different scales or aspect ratios, and the predicted bounding boxes are based on these prior boxes, which reduces training difficulty to a certain extent.
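A sketch of this fc-to-convolution conversion in PyTorch; the layer sizes follow the standard SSD adaptation of VGG16 and should be read as an illustration under that assumption, not as the patent's exact configuration:

```python
import torch
import torch.nn as nn

# pool5: stride-2 2x2 -> stride-1 3x3, so the spatial size is preserved
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# fc6 -> Conv6: 3x3 dilated (atrous) convolution; dilation=6 widens the
# receptive field without adding parameters relative to a plain 3x3 conv
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)

# fc7 -> Conv7: 1x1 convolution
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)

x = torch.randn(1, 512, 19, 19)   # conv5_3 output for a 300x300 input
y = torch.relu(conv7(torch.relu(conv6(pool5(x)))))
print(y.shape)                    # torch.Size([1, 1024, 19, 19])
```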
Step 5: testing the performance of the trained network models to find the current best-performing model;
the trained VGG16 network is tested until an optimal model is found, the training time of the improved VGG16 network is 8 hours, 240000 iterations are performed totally, 1 model is output every 10 iterations, models with the iteration times of 5000, 6000, 10000, 120000, 18000 and 240000 are taken for testing, the test result shows that the model with the iteration time of 180000 has the best effect, the best learning rate is 0.01, the best training time is 8 pieces/time and is more than 8 pieces/time, the video memory explosion can be caused, the training is stopped, and the precision is reduced when the number of training is less than 8 pieces/time.
Step 6: introducing guided filtering based on the characteristics of the target in a low-visibility environment, so that the model can accurately find the target and determine the target coordinates;
In step 6, real-time target detection in low-visibility environments is characterized by heavy influence from smoke, water mist, and light and shadow, which ordinary target detection models cannot handle. Accordingly, guided filtering is introduced to process the image: it is placed after the improved VGG16 network, and the target region detected by that network is smoothed, enhanced, feathered, and defogged while boundaries are preserved. Guided filtering requires a guidance image during filtering, which can be a separate image or the input image itself; using the input image saves time and makes the filtering operation edge-preserving. The complexity of guided filtering is independent of the window size: to obtain the output value at a point, only the linear function values covering that point need to be averaged, so efficiency improves markedly when processing large pictures. Guided filtering also avoids the gradient-reversal artifacts that appear in bilateral filtering.
In a general linear translation-variant filtering process, the output at a pixel is:

$$q_i = \sum_j W_{ij}(I)\, p_j$$

where $W_{ij}$ is the filter kernel weight. In bilateral filtering, the weight function is expressed as:

$$W_{ij} = \frac{1}{K_i} \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma_s^2}\right) \exp\!\left(-\frac{\|I_i - I_j\|^2}{\sigma_r^2}\right)$$

In guided filtering, the output at a pixel is:

$$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k$$

There is a linear relationship between the guidance image I and the output q, so the information provided by the guidance image mainly indicates where the edges are: if the guidance map indicates an edge, the final result tries to preserve that edge information. The precondition for guided filtering is therefore that I and q satisfy a linear relationship.
Guided filtering is introduced to process the target region, and the processed target region coordinates are passed to the next-level network for target recognition.
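Connecting this to the earlier guided-filter sketch, a hypothetical helper for processing one detected region (the box format and the guided_filter function are assumptions carried over from the sketch above):

```python
import numpy as np

def enhance_region(frame_gray: np.ndarray, box) -> np.ndarray:
    """Apply self-guided filtering to one detected region before verification."""
    x, y, w, h = box                                     # coordinates from the VGG16 stage
    roi = frame_gray[y:y + h, x:x + w].astype(np.float32) / 255.0
    smoothed = guided_filter(roi, roi, r=8, eps=1e-2)    # edge-preserving cleanup
    return (smoothed * 255.0).clip(0, 255).astype(np.uint8)
```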
Step 7: introducing a GoogLeNet network model, performing multi-scale training to extract deep-level features of the data, and then detecting the target in the coordinate region;
GoogLeNet observes that the most straightforward way to improve a deep neural network is to increase its size, including width and depth: depth is the number of layers in the network, and width is the number of neurons used in each layer. This straightforward solution, however, has two significant drawbacks: the increase in network size increases the number of parameters, which makes the network more susceptible to overfitting, and the demand for computational resources grows with it.
Therefore, full connectivity is changed to sparse connectivity to solve both problems. When the probability distribution of a data set is represented by a large, sparse deep neural network, the network topology can be optimized by analyzing, layer by layer, the activations of the previous layer that are highly correlated with the output and the statistics of clustered neurons. But this approach has many limitations, so the Hebbian principle is applied, which makes the above idea practically feasible under a few restrictions.
Generally, full connections better exploit parallel computation, while sparse connections break symmetry to improve learning. Traditionally, convolution exploits sparsity in the spatial domain, yet the connections between convolutions and patches in the early layers of the network are dense; sparsity is therefore applied at the filter level rather than over individual neurons. However, on non-uniform sparse data structures, numerical computation is inefficient, the cost of lookups and cache misses is high, and the demands on computing infrastructure are substantial, so clustering the sparse matrix into relatively dense subspaces tends to yield a practical computational optimization of sparse matrices. The Inception structure was therefore proposed (see FIG. 3).
The main idea of the Inception structure is to approximate and cover the optimal local sparse structure of a convolutional vision network with a series of readily available dense substructures. The network topology is formed by analyzing the correlation statistics of the previous layer and aggregating them into highly correlated unit groups; these clusters (unit groups) form the units of the next layer and connect to the previous units. Correlated units close to the input image are concentrated in local regions, so many clusters end up concentrated in a single region and can be covered by a 1x1 convolutional layer in the next layer; clusters spread over larger spatial extents can be covered by convolutions over larger patches, and the number of patches over larger regions decreases.
To avoid the patch-alignment problem, the filter sizes in the Inception structure are limited to 1x1, 3x3, and 5x5. Because Inception modules are stacked, their output correlation statistics differ: as higher layers extract more abstract features, their spatial concentration decreases, so the proportion of 3x3 and 5x5 convolutions is increased in higher-layer Inception modules to capture features over larger areas.
In the above Inception structure, the computational overhead of 5x5 filters becomes very large as the number of filters increases; in addition, after the pooling operation, merging the pooling layer output with the convolutional layer outputs increases the number of output values, which may cover the optimized sparse structure but is very inefficient and causes a computational explosion. This leads to the dimension-reduced Inception structure (see FIG. 4).
The dimension-reduced Inception structure rests on embeddings: a low-dimensional embedding can contain a large amount of information about an image patch, but an embedding expresses information in a dense, compressed form, whereas the representation should remain sparse in most places, with signals compressed only where large amounts of aggregation occur. Therefore, 1x1 convolutions are applied before the 3x3 and 5x5 convolution operations for dimensionality reduction; the 1x1 convolutions not only reduce dimensionality but also introduce ReLU nonlinear activation. It has been found more advantageous to use the Inception structure only in the upper layers of the overall network.
The advantage of the dimension-reduced Inception structure is that the number of units at each stage, i.e. the width and depth of the network, can be increased without an uncontrolled explosion of computational complexity; meanwhile, the structure resembles multi-scale processing of images, with the processing results gathered together so that the next stage can extract features at different scales simultaneously.
Due to the heavy computational load of the sparse structure, 1x1 convolutions are adopted to reduce the parameter computation. The 1x1 convolution is interpreted as follows:
Before the 3x3 and 5x5 layers, a 1x1 convolution operation is added to each path. The 1x1 convolution (or network-in-network layer) provides a method of dimensionality reduction. Suppose an input layer has a volume of 100x100x60 (the input of some layer in the network): adding 20 1x1 convolution filters reduces the volume to 100x100x20, so the 3x3 and 5x5 layers no longer need to process as large a volume as the input layer. This can be regarded as "pooling of features", because the depth of the volume is being reduced, analogous to reducing width and height with the commonly used max pooling layers. These 1x1 convolutional layers are followed by ReLU units. The dimension-reduced Inception module thus consists of a network-in-network (1x1) layer, a medium-size filter convolution (3x3), a large-size filter convolution (5x5), and a pooling operation. The network-in-network convolutional layer extracts information from every detail of the input volume, the 5x5 filter covers most of its receiving layer's input and extracts information at that scale, and the pooling operation reduces spatial size and combats overfitting. Each convolutional layer is followed by a ReLU, which improves the nonlinearity of the network.
Table 2 GoogLeNet network structure details
Table 2 gives the network structure details of GoogLeNet, where "#3x3 reduce" and "#5x5 reduce" denote the number of 1x1 convolutions applied before the 3x3 and 5x5 convolution operations. The input image is 224x224x3; all dimension-reduction layers undergo a zero-mean preprocessing operation and use the ReLU nonlinear activation function. The GoogLeNet network model uses the dimension-reduced Inception structure, which is equivalent to model fusion, adds back-propagated gradient signals to the network, and provides extra regularization, thereby accelerating training and improving precision. Deep-level features of the data are extracted, and the target is then detected in the coordinate region.
Step 8: constructing the overall target detection model, the VGFG (VGG16 Guided Filter GoogLeNet) model, and testing it with the model obtained by combining the improved VGG16 network, the GoogLeNet target detection algorithm, and guided filtering.
The pseudo code for constructing the overall target detection model is as follows:
TABLE 3 Algorithm one
This step mainly loads the configuration: the Caffe framework, the trained improved VGG16 network and its prototxt file, and the GoogLeNet target detection model with its configuration file are loaded.
TABLE 4 Algorithm two
The video stream is read from the camera frame by frame; each frame picture is compressed, the color image channels are sliced, bounding box coordinates are computed, and the image is forward-propagated through the network; the frame picture is converted into binary data and passed to the target detection model for detection.
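A sketch of this frame-preprocessing step with OpenCV's dnn module; the model file names, input size, and mean values are assumptions typical for VGG-based SSD Caffe models, not files named by the patent:

```python
import cv2

# Assumed file names for the deployed first-stage network
net = cv2.dnn.readNetFromCaffe("vgfg_deploy.prototxt", "vgfg.caffemodel")

def detect(frame):
    # Resize/compress the frame, split and mean-subtract the color channels,
    # and pack it as a binary blob for forward propagation.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 scalefactor=1.0, size=(300, 300),
                                 mean=(104.0, 117.0, 123.0))
    net.setInput(blob)
    return net.forward()   # detections shaped [1, 1, N, 7] for SSD-style Caffe models
```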
TABLE 5 Algorithm III
This step mainly detects the correctness of the target region processed by guided filtering: the position coordinates are passed into the GoogLeNet network for verification; if the target is detected, the target region is drawn with a wire frame, and if not, detection returns to the next frame.
Thus the overall target detection model VGFG is built; the VGFG target detection model can accurately detect targets under the influence of a low-visibility environment, and all its performance indicators are good.
According to the invention, the frame picture is processed with guided filtering in low-visibility target detection, so that the target can be detected accurately under the influence of a low-visibility environment; the detection accuracy of the network model is improved, and both the false alarm rate and the missed detection rate of the network model are reduced.
Detailed description of the invention
The model training diagram of this embodiment is shown in FIG. 5: the loss of the model converges essentially to zero and its accuracy reaches 92.7%, so the model has high accuracy. The embodiment was tested in a practical application scene, with results shown in Table 6. The test scene is real-time video in a belt corridor of a certain company, heavily affected by smoke dust, dust, and water mist, with low visibility. Cameras at different positions were tested scene by scene for 24 hours, covering 1728000 frame pictures (2 classes: person and background). The data statistics were obtained by counting the frame pictures in the database.
TABLE 6 VGFG model test Effect statistics
Analysis of Table 6 shows that the accuracy of the VGFG model is about 92.7% and the recall rate is 80.6%; the false alarm rate of the model is low, but the missed detection rate is relatively high, and the VGFG model already has a certain target detection capability in low-visibility environments.
In this embodiment, the detection model of the invention is compared for accuracy against other detection models: Fast R-CNN, SSD, YOLOv3, and GoogLeNet. The other four models are pre-trained, and the public data set VOC2012 is used to verify the correctness of the VGFG model and the other four models; the result is shown in FIG. 6, from which it can be seen that the problem solving of the proposed VGFG model is closer to reality and achieves high accuracy.
The foregoing detailed description, presented in conjunction with the drawings, illustrates embodiments of the invention and is provided to facilitate understanding of its methods. Those skilled in the art can modify and adapt the invention within the scope of the embodiments and applications according to its spirit, and the invention should therefore not be construed as limited to them.

Claims (9)

1. A real-time target detection method based on deep learning in a low-visibility environment, characterized in that an improved VGG16 network, a GoogLeNet target detection algorithm, and guided filtering are combined, so that targets can be accurately identified in low-visibility environments affected by smoke, water mist, and insufficient light. The specific process comprises the following steps:
step 1: real-time video stream acquisition
Step 2: generating a data set in target detection;
and step 3: setting parameters of the VGG16 network;
and 4, step 4: introducing a VGG16 network to process and classify target data;
and 5: testing the performance of the trained network model to find a current performance optimal model;
step 6: introducing guiding filtering by combining the characteristics of the target in a low-visibility environment, so that the model can accurately find the target and determine the target coordinate;
and 7: introducing a GoogLeNet network model, performing multi-scale training, extracting deep-level features of data, and then detecting a target in a coordinate region;
and 8: and constructing an integral target detection framework, namely a VGFG (VGG16 Guided Filter GoogleLeNet) model, and testing by using a model obtained by combining an improved VGG16 network, a GoogleLeNet target detection algorithm and guide filtering.
2. The method according to claim 1, characterized in that the hard disk video recorder IP is matched with the host IP, and the channel number in the RTSP protocol is determined.
3. The method for real-time target detection in a low-visibility environment based on deep learning according to claim 1, characterized in that a VOC-format data set is produced, and picture data in the data set is labeled with the labelImg tool to generate XML files.
4. The method for real-time target detection in a low-visibility environment based on deep learning according to claim 1, characterized in that prior boxes of appropriate scale, the network learning rate, and the number of training layers are selected. The scale of the prior boxes obeys a linear increase rule: as the feature map size decreases, the prior box scale increases linearly. The calculation formula is

$$S_k = S_{min} + \frac{S_{max} - S_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

where m is the number of feature maps, $S_k$ is the ratio of the prior box size to the picture size, and $S_{min}$ and $S_{max}$ are the minimum and maximum values of this ratio, respectively.
The selection of the learning rate is a hyperparameter problem requiring continuous testing over the basic range 0.1, 0.01, 0.001, 0.0001, increasing by orders of magnitude until an optimum is found; adjustment can also be made through the gradient of the loss function, which is

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the number of positive samples, $x_{ij}^{p} \in \{0, 1\}$ indicates whether the i-th prior box is matched to the j-th ground truth of category p, c is the predicted class confidence, l is the predicted location of the bounding box corresponding to the prior box, and g is the location parameter of the ground truth.
5. The method for real-time target detection in a low-visibility environment based on deep learning according to claim 1, characterized in that the convolutional layer weights are pre-trained, and the file darknet53.conv.74 and the deep learning data set generated in step 1 are downloaded for training the improved VGG16 network.
6. The method for real-time target detection in a low-visibility environment based on deep learning according to claim 1, characterized in that training is repeated and parameters are adjusted according to step 2; the improved VGG16 network model outputs one model every 100 iterations, training runs 240000 iterations, and test results show that the model at iteration 180000 is optimal.
7. The method for real-time target detection in a low-visibility environment based on deep learning according to claim 1, characterized in that the target in a low-visibility environment is strongly affected by smoke, water mist, and light and shadow and is difficult to identify; guided filtering is therefore introduced for such targets. Guided filtering is adaptive-weight filtering that preserves boundaries while smoothing the image and supports smoothing, enhancement, matting, feathering, defogging, joint upsampling, and similar operations on the image. In guided filtering, the output at a pixel is:

$$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k$$

where q is the output image, I is the guidance image, and a and b are the coefficients of the linear function, constant when the window center is located at k. The assumption of the method is that q and I have a local linear relationship in the window centered on pixel k; taking the derivative of this equation ($\nabla q = a \nabla I$) shows that an edge appears in the output only where there is an edge in the guidance image. To solve for the coefficients a and b, let p be the image before filtering, with q required to differ from p as little as possible; by the degradation model of unconstrained image restoration,

$$q_i = p_i - n_i$$

where n is noise and p is the degraded image, i.e. q contaminated by noise n. This is converted into an optimization problem: restricting i to a window $\omega_k$ and penalizing large values of a, the cost function is

$$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - p_i)^2 + \epsilon a_k^2 \right)$$

Solving by least squares gives

$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \epsilon}, \qquad b_k = \bar{p}_k - a_k \mu_k$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of I in the local window $\omega_k$, and $|\omega|$ is the number of pixels in the window. The window operation is then applied over the whole image, and finally the average is taken, so the output at a pixel is:

$$q_i = \frac{1}{|\omega|} \sum_{k: i \in \omega_k} (a_k I_i + b_k) = \bar{a}_i I_i + \bar{b}_i$$

where

$$\bar{a}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} a_k, \qquad \bar{b}_i = \frac{1}{|\omega|} \sum_{k \in \omega_i} b_k$$
8. The method as claimed in claim 1, characterized in that the GoogLeNet network model is a 22-layer neural network that finds an optimal local network structure within the existing deep network structure. To avoid the gradient-vanishing problem caused by increased depth, two auxiliary loss branches are added at different depths of the network to counteract vanishing gradients during back-propagation. To avoid the overfitting caused by the extra parameters that come with increased width, and the application difficulties caused by excessive computational complexity, GoogLeNet adopts the Inception structure, a Network-in-Network design in which each original node is itself a network. The Network-in-Network model replaces the traditional convolution with a fully connected multilayer perceptron to obtain a more comprehensive representation of features; since the feature representation has already been improved at this stage, the last fully connected layer of a traditional CNN is replaced by a global average pooling layer, at which point the feature maps are sufficiently reliable for classification and the loss can be computed directly through softmax.
9. The method for real-time target detection in a low-visibility environment based on deep learning according to claim 1, characterized in that an overall target detection model is built: the improved VGG16 network serves as the first-layer network of the model and outputs the coordinates of the target region; the target region image is processed by guided filtering; finally, target recognition is performed through a GoogLeNet network model, and the overall framework is packaged to form a new target detection model (the VGFG model).
CN201910985552.5A 2019-10-17 2019-10-17 Real-time target detection method in low-visibility environment based on deep learning Pending CN110689021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910985552.5A CN110689021A (en) 2019-10-17 2019-10-17 Real-time target detection method in low-visibility environment based on deep learning


Publications (1)

Publication Number Publication Date
CN110689021A true CN110689021A (en) 2020-01-14

Family

ID=69112986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910985552.5A Pending CN110689021A (en) 2019-10-17 2019-10-17 Real-time target detection method in low-visibility environment based on deep learning

Country Status (1)

Country Link
CN (1) CN110689021A (en)



Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140282938A1 (en) * 2013-03-15 2014-09-18 Adam Moisa Method and system for integrated cloud storage management
CN107316001A (en) * 2017-05-31 2017-11-03 天津大学 Small and intensive method for traffic sign detection in a kind of automatic Pilot scene
CN109902697A (en) * 2017-12-07 2019-06-18 展讯通信(天津)有限公司 Multi-target detection method, device and mobile terminal
CN108319949A (en) * 2018-01-26 2018-07-24 中国电子科技集团公司第十五研究所 Mostly towards Ship Target Detection and recognition methods in a kind of high-resolution remote sensing image
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN109117876A (en) * 2018-07-26 2019-01-01 成都快眼科技有限公司 A kind of dense small target deteection model building method, model and detection method
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN109740588A (en) * 2018-12-24 2019-05-10 中国科学院大学 The X-ray picture contraband localization method reassigned based on the response of Weakly supervised and depth
CN109753903A (en) * 2019-02-27 2019-05-14 北航(四川)西部国际创新港科技有限公司 A kind of unmanned plane detection method based on deep learning
CN109977812A (en) * 2019-03-12 2019-07-05 南京邮电大学 A kind of Vehicular video object detection method based on deep learning
CN110222787A (en) * 2019-06-14 2019-09-10 合肥工业大学 Multiscale target detection method, device, computer equipment and storage medium
CN110298321A (en) * 2019-07-02 2019-10-01 中国科学院遥感与数字地球研究所 Route denial information extraction based on deep learning image classification
CN110298410A (en) * 2019-07-04 2019-10-01 北京维联众诚科技有限公司 Weak target detection method and device in soft image based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GEUMYOUNG SON et al.: "Video Based Smoke and Flame Detection Using Convolutional Neural Network", IEEE *
CHENG Xianyi et al.: "Research on a multi-scale target detection algorithm in surveillance scenes based on deep learning", Journal of Nanjing Normal University (Engineering and Technology Edition) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291647A (en) * 2020-01-21 2020-06-16 陕西师范大学 Single-stage action positioning method based on multi-scale convolution kernel and superevent module
CN111340771A (en) * 2020-02-23 2020-06-26 北京工业大学 Fine particle real-time monitoring method integrating visual information richness and wide-depth combined learning
CN111340771B (en) * 2020-02-23 2024-04-09 北京工业大学 Fine particulate matter real-time monitoring method integrating visual information richness and wide-depth joint learning
CN111210474B (en) * 2020-02-26 2023-05-23 上海麦图信息科技有限公司 Method for acquiring real-time ground position of airport plane
CN111210474A (en) * 2020-02-26 2020-05-29 上海麦图信息科技有限公司 Method for acquiring real-time ground position of airplane in airport
CN111723656A (en) * 2020-05-12 2020-09-29 中国电子系统技术有限公司 Smoke detection method and device based on YOLO v3 and self-optimization
CN111723656B (en) * 2020-05-12 2023-08-22 中国电子系统技术有限公司 Smog detection method and device based on YOLO v3 and self-optimization
CN111931857A (en) * 2020-08-14 2020-11-13 桂林电子科技大学 MSCFF-based low-illumination target detection method
CN112016558A (en) * 2020-08-26 2020-12-01 大连信维科技有限公司 Medium visibility identification method based on image quality
CN112016558B (en) * 2020-08-26 2024-05-31 大连信维科技有限公司 Medium visibility recognition method based on image quality
CN112214369A (en) * 2020-10-23 2021-01-12 华中科技大学 Hard disk fault prediction model establishing method based on model fusion and application thereof
CN112862715A (en) * 2021-02-08 2021-05-28 天津大学 Real-time and controllable scale space filtering method
CN113553937A (en) * 2021-07-19 2021-10-26 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN114445408A (en) * 2022-04-11 2022-05-06 山东仕达思生物产业有限公司 Improved circulation-oriented filtering algorithm-based pathogen detection promoting method, equipment and storage medium
CN114445408B (en) * 2022-04-11 2022-06-24 山东仕达思生物产业有限公司 Improved circulation-oriented filtering algorithm-based pathogen detection promoting method, equipment and storage medium
CN115931359A (en) * 2023-03-03 2023-04-07 西安航天动力研究所 Turbine pump bearing fault diagnosis method and device
CN115931359B (en) * 2023-03-03 2023-07-14 西安航天动力研究所 Turbine pump bearing fault diagnosis method and device
CN116977336A (en) * 2023-09-22 2023-10-31 苏州思谋智能科技有限公司 Camera defect detection method, device, computer equipment and storage medium


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200114)