CN111341059A - Early wildfire smoke detection method and device based on depth separation and target perception - Google Patents


Info

Publication number
CN111341059A
Authority
CN
China
Prior art keywords
depth
matrix
smoke
separation
network
Prior art date
Legal status: Pending
Application number
CN202010081696.0A
Other languages
Chinese (zh)
Inventor
赵运基
张海波
张楠楠
周梦林
魏胜强
刘晓光
Current Assignee
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology

Classifications

    • G08B 17/125 — Fire alarms; actuation by using a video camera to detect fire or smoke
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/08 — Neural networks; learning methods
    • G06T 7/55 — Image analysis; depth or shape recovery from multiple images
    • G06T 7/73 — Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/30188 — Indexing scheme for image analysis; vegetation, agriculture


Abstract

The invention discloses an early wildfire smoke detection method based on depth separation and target perception, comprising the following steps: inputting a sample image and a target image respectively into a target perception depth network to obtain a first depth feature matrix and a second depth feature matrix; separating the first depth feature matrix into a first separated feature matrix and a second separated feature matrix through a depth separable network; performing a convolution operation on the first separated feature matrix and the second depth feature matrix to obtain a region response matrix; performing a convolution operation on the second separated feature matrix and the region response matrix to obtain a smoke response matrix; and obtaining the maximum value of the smoke response matrix and determining the smoke position according to that maximum. The invention also discloses an early wildfire smoke detection device based on depth separation and target perception. The invention applies a target perception network framework with an embedded depth separable network and improves the real-time detection speed of wildfire smoke.

Description

Early wildfire smoke detection method and device based on depth separation and target perception
Technical Field
The invention relates to the technical field of image processing, in particular to an early wildfire smoke detection method and device based on depth separation and target perception.
Background
In forest fire detection, smoke appears before open flame, so smoke detection is of great significance for early fire detection; yet real-time smoke detection is often difficult to achieve because forest fires are constrained by many factors such as complex backgrounds, climate, environment and altitude. Deep learning pre-trained models require large data sets, whereas smoke data sets are small and smoke itself has distinctive characteristics; moreover, smoke targets in real scenes may take arbitrary forms in unrestricted scenes, so deep learning networks are difficult to apply directly to detecting forest fire smoke objects. A task-specific data set is usually built by data expansion under limited scenes: an acquisition device collects video for some scenes, and the data set is augmented with methods such as adding Gaussian noise, horizontal or vertical image flipping, image rotation, scaling, cropping and translation. This means that depth features trained on such a specific image data set cannot perfectly represent smoke objects in real-time, real scenes.
Due to the particularity of the small-sample smoke target, traditional image processing and pattern recognition methods show high false alarm rates and extremely low detection precision in deep-forest smoke detection. In recent years researchers have proposed a variety of smoke detection methods, among which detection methods based on deep learning are widely used. Smoke features are affected by environmental factors such as climate, making feature extraction unstable: the color of smoke gradually deepens as the forest fire spreads, the smoke moves continuously with the wind, and the depth features extracted by a deep learning model are further affected by illumination changes and the resolution of the image acquisition equipment, so more representative smoke features are needed.
Smoke detection methods based on deep learning differ from traditional image processing methods: a deep learning algorithm extracts many kinds of depth features rather than one or two typical hand-crafted image features. Fully Convolutional Networks (FCN) have been applied to early fire detection and to smoke semantic segmentation, where a smoke segmentation network trained on a large number of smoke pictures constructs a smoke segmentation mask in order to segment blurry smoke images.
In traditional vision-based smoke video detection, a smoke video frame is divided into small blocks of fixed size, and stable block-based image features are then obtained to classify smoke and non-smoke. However, the significant performance achieved by these methods often depends on robust visual targets that can easily be distinguished from the background, which raises hardware cost: economic cost is sacrificed to meet the high performance requirement, and economy is hard to guarantee. Meanwhile, forest wildfires are often accompanied by complex backgrounds and blurry real-time video frames, and high-resolution, high-definition smoke and fire video frames are difficult to ensure in a field environment; it also remains difficult to collect smoke data at scale, leaving only small-sample data sets.
Therefore, a depth feature learning model pre-trained incompletely on a limited set of images has difficulty distinguishing an arbitrary object of unpredictable form from a complex environment.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an early wildfire smoke detection method and device based on depth separation and target perception, which embed a depth separable network into a target perception network framework and improve the real-time detection speed of wildfire smoke while meeting the requirements of high accuracy, low false alarm rate and high speed for smoke objects that change in real time across various complex scenes in smoke video frame detection.
In a first aspect, an embodiment of the invention discloses an early wildfire smoke detection method based on depth separation and target perception, which includes the following steps:
respectively inputting the sample image and the target image into a target perception depth network to obtain a first depth feature matrix and a second depth feature matrix;
separating the first depth feature matrix into a first separated feature matrix and a second separated feature matrix through a depth separable network;
performing a convolution operation on the first separated feature matrix and the second depth feature matrix to obtain a region response matrix, and performing a convolution operation on the second separated feature matrix and the region response matrix to obtain a smoke response matrix;
acquiring the maximum value of the smoke response matrix; if the maximum value is smaller than a preset threshold value, no wildfire smoke is detected in the target image; if the maximum value is larger than or equal to the preset threshold value, taking the target image position corresponding to the maximum value as the smoke central point to obtain the smoke position.
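The four steps above can be sketched end-to-end. The following is a minimal illustrative sketch, not the patented implementation: the function name, feature sizes and threshold are hypothetical stand-ins, with the first separated feature matrix reduced to a single spatial kernel and the second to a per-channel vector, as in the claims.

```python
import numpy as np

def detect_smoke(search_feat, spatial_kernel, pointwise_kernel, threshold):
    """Illustrative response computation: correlate the spatial kernel over
    every channel of the search features (region response matrix), combine
    the channels with a pointwise kernel (smoke response matrix), then
    threshold the peak to decide whether smoke is present."""
    M2, N2, C = search_feat.shape
    kh, kw = spatial_kernel.shape
    oh, ow = M2 - kh + 1, N2 - kw + 1
    region = np.empty((oh, ow, C))            # region response matrix
    for c in range(C):
        for i in range(oh):
            for j in range(ow):
                region[i, j, c] = np.sum(
                    search_feat[i:i + kh, j:j + kw, c] * spatial_kernel)
    smoke = region @ pointwise_kernel         # smoke response matrix
    peak = smoke.max()
    if peak < threshold:
        return None                           # no wildfire smoke detected
    i, j = np.unravel_index(smoke.argmax(), smoke.shape)
    return (int(i) + kh // 2, int(j) + kw // 2)  # smoke centre point
```

When the peak response exceeds the threshold, the returned coordinates locate the smoke centre in the search image; otherwise the frame is reported as smoke-free.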
As an embodiment, inputting the sample image into the target perception depth network to obtain the first depth feature matrix comprises the following steps:
inputting the sample image into a pre-trained VGG-16 network to output a first initial feature matrix, and passing the first initial feature matrix through the trained target perception network to output the first depth feature matrix, which is M1 × N1 × (A1 + A2) × B, where M1 and N1 are the resulting length and width of the sample image after convolution and pooling, A1 is the number of channel-optimal depth features, A2 is the number of scale-sensitive depth features, and B is the number of constructed scales.
As an embodiment, inputting the target image into the target perception depth network to obtain the second depth feature matrix comprises the following steps:
inputting the target image into the pre-trained VGG-16 network to output a second initial feature matrix, and passing the second initial feature matrix through the trained target perception network to output the second depth feature matrix, which is M2 × N2 × (A1 + A2), where M2 and N2 are the resulting length and width of the target image after convolution and pooling.
As an embodiment, the separating the first depth feature matrix into a first separated feature matrix and a second separated feature matrix through a depth separable network includes:
separating the convolution kernel into a depthwise (depth-sparse) convolution kernel and a pointwise convolution kernel through the depth separation network, and forming the first separated feature matrix and the second separated feature matrix from the depthwise and pointwise kernels respectively, where the first separated feature matrix is M1 × N1 × 1 × B and the second separated feature matrix is 1 × 1 × (A1 + A2) × B.
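One plausible way to realise such a separation — purely an illustrative assumption, since the patent does not spell out the factorisation — is a rank-1 decomposition of the template feature into an M1 × N1 spatial kernel and a per-channel pointwise kernel:

```python
import numpy as np

def separate_template(template):
    """Rank-1 SVD factorisation of an (M1, N1, C) template feature into a
    spatial kernel (M1, N1) and a pointwise channel kernel (C,)."""
    M1, N1, C = template.shape
    flat = template.reshape(M1 * N1, C)
    U, S, Vt = np.linalg.svd(flat, full_matrices=False)
    spatial = (U[:, 0] * S[0]).reshape(M1, N1)   # first separated matrix
    pointwise = Vt[0]                            # second separated matrix
    return spatial, pointwise
```

The outer product of the two factors approximates the original template; for a template that is exactly rank-1 across channels the reconstruction is exact.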
As an embodiment, obtaining the smoke position by taking the target image position corresponding to the maximum value as a smoke central point includes:
setting a pre-selection box and making the midpoint of the pre-selection box coincide with the smoke central point; the area of the target image selected by the pre-selection box is the smoke position.
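Placing the fixed-size pre-selection box over the detected centre can be sketched as follows (function name and the clipping-to-image-bounds behaviour are illustrative assumptions):

```python
def box_from_center(center, box_size, image_size):
    """Place a fixed-size pre-selection box so that its midpoint coincides
    with the detected smoke centre point, clipped to the image bounds.
    Returns (top, left, bottom, right)."""
    cy, cx = center
    h, w = box_size
    H, W = image_size
    top = min(max(cy - h // 2, 0), H - h)
    left = min(max(cx - w // 2, 0), W - w)
    return (top, left, top + h, left + w)
```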
In a second aspect, an embodiment of the present invention discloses an early wildfire smoke detection device based on depth separation and target perception, which includes:
the acquisition module is used for respectively inputting the sample image and the target image into the target perception depth network to obtain a first depth feature matrix and a second depth feature matrix;
the separation module is used for separating the first depth feature matrix into a first separated feature matrix and a second separated feature matrix through a depth separable network;
the convolution module is used for performing a convolution operation on the first separated feature matrix and the second depth feature matrix to obtain a region response matrix, and performing a convolution operation on the second separated feature matrix and the region response matrix to obtain a smoke response matrix;
the detection module is used for obtaining the maximum value of the smoke response matrix; if the maximum value is smaller than a preset threshold value, no wildfire smoke is detected in the target image; if the maximum value is larger than or equal to the preset threshold value, the target image position corresponding to the maximum value is taken as the smoke central point to obtain the smoke position.
The acquisition module comprises a first acquisition unit, used for inputting the sample image into the pre-trained VGG-16 network to output the first initial feature matrix and passing the first initial feature matrix through the trained target perception network to output the first depth feature matrix, which is M1 × N1 × (A1 + A2) × B, where M1 and N1 are the resulting length and width of the sample image after convolution and pooling, A1 is the number of channel-optimal depth features, A2 is the number of scale-sensitive depth features, and B is the number of constructed scales.
The acquisition module further comprises a second acquisition unit, used for inputting the target image into the pre-trained VGG-16 network to output the second initial feature matrix and passing the second initial feature matrix through the trained target perception network to output the second depth feature matrix, which is M2 × N2 × (A1 + A2), where M2 and N2 are the resulting length and width of the target image after convolution and pooling.
As an embodiment, the separation module comprises:
and the depth separation unit is used for separating the convolution kernels into a depth sparse convolution kernel and a point-by-point convolution kernel through a depth separation network, and respectively forming a first separation feature matrix and a second separation feature matrix on the basis of the depth sparse convolution kernel and the point-by-point convolution kernel, wherein the first separation feature matrix is M1 × N1 × 1 × B, and the second separation feature matrix is 1 × 1 × (A1+ A2) × B.
As an embodiment, obtaining the smoke position by taking the target image position corresponding to the maximum value as a smoke central point includes:
setting a pre-selection box and making the midpoint of the pre-selection box coincide with the smoke central point; the area of the target image selected by the pre-selection box is the smoke position.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention embeds a depth separable network in a target perception network framework and improves the real-time detection speed of wildfire smoke while meeting the requirements of high accuracy, low false alarm rate and high speed for smoke objects that change in real time across various complex scenes in smoke video frame detection.
2. When the adaptive Target-Aware Deep Tracking (TADT) network is used for smoke target detection, no elaborate pre-training process is needed: for small-sample data sets such as smoke, the target detection task can be performed on top of an existing pre-trained deep learning model without complex additional pre-training. The TADT network compensates for the inability of pre-trained depth models to fully account for arbitrary forms in visual inspection.
3. The depth separable network reduces the per-frame computation of correlation filtering and greatly improves real-time speed.
4. The template target perception depth features are processed by mean pooling, so the template features — and hence the correlation filtering convolution kernels — are fixed; this removes the time cost of error back-propagation through the kernels and effectively avoids risks such as gradient vanishing or explosion that small-scale network training easily incurs.
Drawings
Fig. 1 is a flowchart of an early wildfire smoke detection method based on depth separation and target perception according to a first embodiment of the present invention;
FIG. 2 shows images of 8 categories from a wildfire smoke dataset;
FIG. 3 shows the 8 category images of FIG. 2 with bounding boxes visualized by the TADT network;
FIG. 4 shows the 8 category images of FIG. 2 with bounding boxes visualized by the DSATA network;
FIG. 5 is a comparison graph of the operating curves of the TADT and DSATA algorithms;
fig. 6 is a block diagram of an early wildfire smoke detection system based on depth separation and target perception according to a second embodiment of the present invention.
Detailed Description
The present invention will now be described in more detail with reference to the accompanying drawings, in which the description of the invention is given by way of illustration and not of limitation. The various embodiments may be combined with each other to form other embodiments not shown in the following description.
Example one
The embodiment of the invention discloses an early wildfire smoke detection method based on depth separation and target perception, and the method is shown in figure 1 and comprises the following steps:
and S110, respectively inputting the sample image and the target image into a target perception depth network to obtain a first depth feature matrix and a second depth feature matrix.
The TADT network introduces a target perception algorithm: target perception is applied to the tracking network, the gradients of the CNN depth features are extracted, a mean pooling layer computes and sorts weight values, and the proportion of each weight expresses the importance of the corresponding depth feature for characterizing the target. The TADT network trains a regression loss layer: a regression loss function is constructed, the contrast of different channel features is computed to obtain proportions for distinguishing depth features, and regression depth features are extracted through the regression loss layer in preparation for the subsequent ranking of feature channel weights. A ranking loss layer is trained at the same time: a ranking loss function is constructed, depth features of different scales are built, scale information is screened, and scale-sensitive depth features are extracted; the algorithm constructs 3 scales to reflect the scale changes of the target. Whereas traditional complex pre-training algorithms pre-train on large training sets for a specific scene and object, the TADT network extracts depth features for different purposes by constructing two kinds of loss functions and training two different convolutional layers. The TADT network structure mainly comprises 4 parts: a pre-trained CNN framework, target perception, correlation filtering, and a Siamese matching network.
The pre-trained CNN framework adopts VGG-16, whose 16 weight layers comprise 13 convolutional layers and 3 fully-connected layers, together with 5 max-pooling layers, an input layer and an output layer. In the VGG-16 model, a smoke video frame is used as input and 512 depth feature maps are obtained as input to the target perception model.
In the preferred embodiment of the invention, initial depth features of the sample image and the target image are extracted through the Conv4-1 and Conv4-3 layers of Block 4 in the VGG-16 network: the first initial feature matrix obtained is M1 × N1 × 512 × 2 and the second initial feature matrix is M2 × N2 × 512 × 2, where M1 and N1 are the resulting length and width of the sample image after convolution and pooling, and M2 and N2 those of the target image.
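The "result values after convolution pooling" follow directly from VGG-16's architecture: its 3 × 3 convolutions (padding 1) preserve spatial size, while each of the three 2 × 2 max-pooling stages before Block 4 halves it. A small helper (illustrative only) computes M and N:

```python
def vgg16_conv4_spatial(height, width):
    """Spatial size of VGG-16 conv4-x feature maps: the 3x3 convolutions
    (padding 1) preserve size, while each of the three 2x2 max-pooling
    stages before Block 4 halves it (integer division)."""
    for _ in range(3):
        height, width = height // 2, width // 2
    return height, width
```

For the canonical 224 × 224 input this gives 28 × 28 conv4 feature maps.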
The first and second initial feature matrices are each passed through the trained target perception network, which outputs the first depth feature matrix and the second depth feature matrix respectively; the first depth feature matrix is M1 × N1 × (A1 + A2) × B, where A1 is the number of channel-optimal depth features, A2 the number of scale-sensitive depth features, and B the number of constructed scales; the second depth feature matrix is M2 × N2 × (A1 + A2) × B, where M2 and N2 are the resulting length and width of the target image after convolution and pooling.
The target-aware network in the TADT network uses a regression loss to distinguish the importance of the first and second initial feature matrices and selects A1 depth features from them; in the preferred embodiment of the invention, A1 is 300. The target perception network uses the regression loss to find kernel convolutions for different objects and extract specific feature information. In the target perception model, a regression convolutional layer is trained by minimizing the regression loss, and targeted feature extraction is then performed on the 512-dimensional VGG-16 depth features to obtain gradient information; the gradient of the discrete feature matrix is extracted as in formulas 1-4. Weights corresponding to the gradient features are obtained with mean pooling and ranked, and the channel features corresponding to a fixed batch of weights are selected — the TADT network selects 300 channel features, which experiments show can effectively reflect the target object. Ranking and screening the weights thus explores the importance of the 512 feature maps captured by the pre-trained VGG-16.
This means that features of arbitrary objects in unknown scenes can be extracted effectively without fully pre-training the VGG-16 network, avoiding unnecessary large-scale smoke video acquisition and complicated network training. The regression loss is shown in formula 5.
G_h(x, y) = H(x+1, y) − H(x−1, y) (formula 1)
where H(x, y) denotes the pixel value at point (x, y) and G_h(x, y) is the horizontal gradient at (x, y).
G_v(x, y) = H(x, y+1) − H(x, y−1) (formula 2)
where G_v(x, y) is the vertical gradient at (x, y).
G(x, y) = √(G_h(x, y)² + G_v(x, y)²) (formula 3)
where G(x, y) is the gradient magnitude at (x, y).
α(x, y) = arctan(G_v(x, y) / G_h(x, y)) (formula 4)
where α(x, y) is the gradient angle at (x, y).
L_reg = ||Y(i, j) − W ∗ X_{i,j}||² + λ||W||² (formula 5)
where Y(i, j) is the Gaussian model corresponding to the image X_{i,j}, shown in formula 6; W is the parameter updated by the loss function, i.e. the filter weight; ∗ is the convolution operator; and λ is a penalty factor. The weight W is updated by minimizing the loss function L_reg, constructing the loss-function convolutional layer and updating the filter weights in preparation for the subsequent modelling of the ranking weights. The model weight W is updated by error back-propagation, and the back-propagated weight values express the importance of the feature maps; the chain rule is used to compute the derivative of the loss value L_reg with respect to X_{i,j}, as shown in formula 7.
Y(i, j) = α · exp(−((i − C_i)² / (2σ₁²) + (j − C_j)² / (2σ₂²))) (formula 6)
where α is a constant, σ₁ and σ₂ are the width and height of the filter W (left as parameters for later targeted tuning to meet model requirements), and (C_i, C_j) are the coordinates of the centre of the cropped target position.
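Minimising a ridge loss of the form of formula 5 with a Gaussian label in the style of formula 6 has a well-known closed-form solution in the Fourier domain when the convolution is circular — the standard correlation-filter trick, and the text later notes that the optimum is obtained by ridge regression. The sketch below is a hedged illustration under that circular-convolution assumption, not the patent's training procedure:

```python
import numpy as np

def gaussian_label(h, w, cy, cx, sigma=2.0):
    """Gaussian response model Y centred on the target (formula 6 style)."""
    i = np.arange(h)[:, None]
    j = np.arange(w)[None, :]
    return np.exp(-(((i - cy) ** 2 + (j - cx) ** 2) / (2.0 * sigma ** 2)))

def train_cf(x, y, lam=1e-2):
    """Closed-form ridge-regression filter in the Fourier domain:
    minimises ||y - w * x||^2 + lam * ||w||^2 under circular
    convolution, the classic correlation-filter solution."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    W = np.conj(X) * Y / (np.conj(X) * X + lam)
    return np.real(np.fft.ifft2(W))
```

With λ small, convolving the trained filter with the training feature map reproduces the Gaussian label almost exactly.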
∂L_reg / ∂X_{i,j} = (∂L_reg / ∂X_o(i, j)) · (∂X_o(i, j) / ∂X_{i,j}) (formula 7)
where ∂ is the partial-derivative symbol and X_o(i, j) = W ∗ X_{i,j}, i.e. the depth feature map obtained by passing X_{i,j} through the regression convolutional layer. The 512-dimensional depth features extracted by the pre-trained model are fed into the regression convolutional layer to obtain regression depth feature maps — specific depth features extracted by a specific convolutional layer for a specific task. These features have characterization capability and so reflect the importance of each of the 512 dimensions for characterizing the target object. The regression feature maps obtained by training on different objects in the TADT network feed the subsequent computation of descending-order weight constants, and a fixed batch of depth features is selected as the characteristic features of the target object; TADT found experimentally that selecting 300 of the 512 depth feature maps suffices to characterize a specific target object in an arbitrary scene.
In the TADT network a gradient feature map is obtained by computing the gradient of the regression depth features; the gradient feature matrix is then passed through global mean pooling to obtain weight constants, the 512 weight constants are arranged in a descending-order column vector, a fixed number of weight constants are selected, and the channel filters corresponding to them are kept. Depth feature extraction for later video frames therefore needs not 512 filters but only the 300 selected ones, whose convolution with subsequent frames yields a 300-dimensional depth feature map that can fully characterize the target object. Global gradient mean pooling is shown in formula 8.
W_i = GAP(G_{z_i}) (formula 8)
where W_i is the weight constant corresponding to the i-th of the 512 depth features, GAP is global mean pooling, z_i is the output depth feature obtained by convolving the i-th of the 512 feature maps with the regression convolutional layer (corresponding to X_o(i, j) in formula 7), and G_{z_i} is the gradient feature of z_i, computed as in formulas 1-4.
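The channel-selection step — global mean pooling of per-channel gradient maps (formula 8), descending sort, keep the top A1 = 300 — can be sketched as follows (function name and shapes are illustrative):

```python
import numpy as np

def select_channels(grads, k):
    """Rank feature channels by global mean pooling of their gradient maps
    and keep the indices of the top-k channels (descending weight)."""
    # grads: (H, W, C) gradient feature maps, one per channel
    weights = grads.mean(axis=(0, 1))      # GAP: one weight per channel
    order = np.argsort(weights)[::-1]      # descending weight order
    return order[:k], weights
```

Only the filters at the returned indices are applied to subsequent frames, which is where the per-frame saving comes from.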
The occurrence of fire is hard to anticipate, and a smoke target is easily influenced by the natural environment, showing movement and irregular deformation under wind and other weather changes; the algorithm therefore needs strong sensitivity to scale change, so a scale-sensitivity factor is added and a kernel filter is trained to adapt to scale change and remain sensitive to irregular deformation. A scale-sensitive loss function — a ranking loss — is proposed to account for the motion and irregular deformation of the target object: it supplies different scale information according to the target's motion and irregular changes and effectively enhances the robustness of the algorithm. Because the target object is arbitrary under arbitrary scene conditions, targeted network training is carried out in the TADT network to construct a scale convolutional layer: the scale-sensitive loss function trains the convolutional layer, a scale filter is updated adaptively by back-propagation with iteratively updated kernel weights, and the scale information extracted by training better adapts to the scale changes caused by the movement and irregular deformation of the smoke object. An A2-dimensional scale-sensitive feature image is extracted in the TADT network; in the preferred embodiment of the invention, A2 is 80, and the 80-dimensional scale-sensitive depth features effectively adapt to the varied scale requirements of the smoke object. The scale-sensitive loss is constructed as in formula 9.
(Formula 9, the ranking loss L_rank, appears only as an image in the original document.)
where L_rank is the ranking loss whose function is constructed to train the scale-sensitive convolutional layer and extract scale-sensitive feature information; (x_i, x_j) ∈ Ω is a scale-sensitive pair — TADT uses 23 image pairs, produced by an image-scale operator that shifts the target image with step length 1; and f is the prediction model, realised by operating the sample with W. The operation requires the ranking-loss value to be minimal, i.e. the weights minimising formula 9 are solved for; the process finally obtains the optimal solution through ridge-regression processing, which determines the final W.
In TADT, a training model is built to train the scaling filter to reduce the complexity of the scaling. Training rank loss by adopting a random gradient descent method (SGD), and selecting 80 scale-sensitive depth features according to a rank loss model. Gradient update filter weight, L, is calculated using a chain rulerankThe derivation calculation is shown in equation 10.
∂L_rank/∂W = -Σ_{(x_i, x_j) ∈ Ω} (x_i - x_j) / (1 + exp(f(x_i; W) - f(x_j; W))) (formula 10)
Wherein W_rank is the weight in the ranking loss; the ranking-loss model is realized by correlating the sample with W_rank, which is in fact a matrix of the same dimensions as the original feature. In the training samples, if the original feature size is 23 × 11 × 512, then the dimension of W_rank is also 23 × 11 × 512. Since the correlation is performed with the original feature, the goal of the operation is to find a W_rank (23 × 11 × 512) that minimizes the final result of formula 9. This is realized by continuous iteration: after inputting images of different scales, the final iterative result of the model is the final loss, corresponding respectively to the losses of the 23 paired samples. Each loss satisfies the gradient-derivative process, so sample loss x_{i,j} has a corresponding gradient. The errors of each channel are summed and globally mean-pooled, finally yielding gradient responses for the corresponding 512 channels; the larger the gradient, the more sensitive the channel is to the same scale transformation. The model arranges the loss values in descending order, 80 channels are selected to represent the scale-sensitive features, and, with the regression-loss features added, the feature channels with target-awareness are finally constructed.
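The channel-selection procedure above can be sketched as follows. This is a minimal NumPy illustration, not the patent's actual implementation: the pairwise loss is assumed to be log(1 + exp(-(f(x_i) - f(x_j)))) with a linear prediction model f, channel sensitivity is scored by the global-mean-pooled gradient magnitude, and all names and data are illustrative.

```python
import numpy as np

def rank_loss(W, x_i, x_j):
    # Pairwise ranking loss: penalized when the reference-scale
    # sample x_i does not score higher than the shifted sample x_j.
    z = np.sum(W * x_i) - np.sum(W * x_j)
    return np.log1p(np.exp(-z))

def channel_gradients(W, pairs):
    # Gradient of the summed ranking loss w.r.t. W, then global
    # mean pooling per channel -> one sensitivity score per channel.
    grad = np.zeros_like(W)
    for x_i, x_j in pairs:
        z = np.sum(W * x_i) - np.sum(W * x_j)
        grad += -(x_i - x_j) / (1.0 + np.exp(z))
    # W has shape (H, Wd, C); pool the spatial dimensions per channel.
    return np.abs(grad).mean(axis=(0, 1))

rng = np.random.default_rng(0)
H, Wd, C = 23, 11, 512           # feature size used in the text
W = rng.standard_normal((H, Wd, C)) * 0.01
pairs = [(rng.standard_normal((H, Wd, C)),
          rng.standard_normal((H, Wd, C))) for _ in range(23)]  # 23 pairs
scores = channel_gradients(W, pairs)
top80 = np.argsort(scores)[::-1][:80]  # keep the 80 most scale-sensitive channels
print(len(top80), scores.shape)
```

The descending sort over per-channel gradient responses mirrors the selection of 80 scale-sensitive channels described above.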
And S120, separating the first depth feature matrix into a first separation feature matrix and a second separation feature matrix through a depth separable network.
The depth separable network has a streamlined structure whose sparse depth structure is built from factorized convolutions. A standard convolution is divided into two steps: a depthwise sparse convolution and a 1 × 1 pointwise convolution. The depthwise part performs a spatial convolution on each input channel of the input separately, and the pointwise part integrates the per-channel maps produced by the depthwise step. In this way one standard convolution kernel is decomposed into a depthwise convolution kernel and a pointwise convolution kernel that together produce an output of the same size as the standard convolution, while the region-related and channel-related parts of the operation are separated.
The factorization method effectively alleviates the time-delay problem, and the idea of factorizing the convolution kernel can reduce the depth of the network. A typical convolutional neural network obtains depth features by one sliding operation of a convolution kernel, and new weight parameters follow from error back-propagation through the network. This convolution operation is convenient and easy to implement, but the parameters to be updated are often on the order of hundreds of millions, and the computational complexity requires correspondingly powerful hardware. For example, for an input of size D_F × D_F × M, with an output feature image of size Fw × Fh × C obtained by a typical convolution with a W × H kernel, the computational cost of the typical convolution operation is W × H × M × C × Fw × Fh.
MobileNet differs from a common convolutional neural network in that it adopts a depth separable mechanism and a streamlined factorization structure, so the reduction in convolution parameters is considerable. In the MobileNet network structure, the depth separation algorithm decomposes a convolution kernel into a depthwise sparse convolution kernel and a 1 × 1 pointwise convolution kernel.
The method first sets a depthwise sparse convolution kernel to extract region features, and then, considering the channels, sets a pointwise convolution kernel to integrate the region features, realizing the separation of the two. For the region convolution, the kernel is a depthwise sparse convolution kernel of size W × H × 1 × M, and the region convolution yields a region-related depth feature map. Taking this map as input, a pointwise convolution kernel of size 1 × 1 × M × C is applied, and the point convolution extracts a depth feature map based on channel correlation. Applying the pointwise convolution integrates the region-related depth features with the channel-related feature map, finally producing the depth features extracted by the convolution kernels; the output feature map has the same size as with a typical convolution, but the parameter computation is markedly reduced.
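The two-step operation described above can be sketched in plain NumPy (illustrative shapes, "valid" correlation without padding or stride; this is a didactic sketch, not the MobileNet or DSATA implementation):

```python
import numpy as np

def depthwise_conv(x, dk):
    # x: (H, W, M) input; dk: (kh, kw, M), one spatial filter per channel.
    H, W, M = x.shape
    kh, kw, _ = dk.shape
    out = np.zeros((H - kh + 1, W - kw + 1, M))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]        # (kh, kw, M)
            out[i, j] = np.sum(patch * dk, axis=(0, 1))  # per-channel response
    return out

def pointwise_conv(x, pk):
    # pk: (M, C) matrix of 1x1 kernels mixing the M per-channel maps
    # into C output channels.
    return x @ pk

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 4))     # small input feature map
dk = rng.standard_normal((3, 3, 4))    # depthwise kernels
pk = rng.standard_normal((4, 6))       # pointwise kernels
y = pointwise_conv(depthwise_conv(x, dk), pk)
print(y.shape)                          # (6, 6, 6)
```

The output has the same spatial size as a standard 3 × 3 convolution with 6 output channels would give, while the kernels hold 3·3·4 + 4·6 = 60 weights instead of 3·3·4·6 = 216.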
W × H × M × 1 × Fw × Fh (formula 11)
Considering the channel features, a pointwise convolution kernel is set, and the point convolution integrates the region-related depth feature map obtained by the depthwise sparse convolution, realizing the separation of region and channel. Depth separation accomplishes the same function as the typical convolution operation, but the parameter reduction is obvious; over multiple convolution layers it greatly relieves the time-delay problem caused by the huge and complicated channel-depth computation. The computational cost of the point convolution is shown in formula 12.
1 × 1 × M × C × Fw × Fh (formula 12)
The total computation cost of the depth separable two-step computation is shown in equation 13.
W × H × M × 1 × Fw × Fh + 1 × 1 × M × C × Fw × Fh (formula 13)
The ratio of the computational cost of convolution under the depth separable scheme to that of a typical convolution operation is shown in formula 14.
(W × H × M × 1 × Fw × Fh + 1 × 1 × M × C × Fw × Fh) / (W × H × M × C × Fw × Fh) = 1/C + 1/(W × H) (formula 14)
From formula 14 the computation saving of the depth separable mechanism is very obvious: one convolution operation saves a factor of K, where K grows with the kernel size and the number of kernels. This means that, without affecting network precision or speed, the larger each layer's convolution kernels are and the more kernels are applied, the greater the speed improvement. Each convolution layer saves considerable computation by applying the depth separable mechanism, which greatly reduces the time delay caused by wasted computation; with a suitable change in the number of layers and optimization of the loss function and back-propagated error parameters, a lightweight network can be effectively realized.
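Formulas 11 to 14 can be checked numerically; the ratio of the two costs reduces algebraically to 1/C + 1/(W · H), independent of the input-channel count M and the output size Fw × Fh. A short sketch (illustrative parameter values):

```python
def standard_cost(W, H, M, C, Fw, Fh):
    # Typical convolution: W*H*M*C*Fw*Fh multiply-accumulates.
    return W * H * M * C * Fw * Fh

def separable_cost(W, H, M, C, Fw, Fh):
    # Depthwise (formula 11) + pointwise (formula 12) = formula 13.
    return W * H * M * 1 * Fw * Fh + 1 * 1 * M * C * Fw * Fh

# Example: 3x3 kernel, 512 input channels, 512 outputs, 23x11 output map.
W, H, M, C, Fw, Fh = 3, 3, 512, 512, 23, 11
ratio = separable_cost(W, H, M, C, Fw, Fh) / standard_cost(W, H, M, C, Fw, Fh)
print(round(ratio, 4), round(1 / C + 1 / (W * H), 4))  # the two values agree
```

For this example the separable scheme costs about 11% of the typical convolution, i.e., the saving factor K in the discussion above is roughly 1 / (1/C + 1/(W · H)) ≈ 8.8.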
In the TADT network, correlation-filtering methods are referenced to speed up the matrix computation using the Fast Fourier Transform (FFT). The FFT converts the convolution kernel matrix and the input matrix from the discrete spatial domain to the frequency domain, so that frequency-domain convolution replaces matrix operations with element-wise multiplication; computing in the discrete frequency domain thus accelerates the mathematical transformation. However, the FFT must be applied to the high-dimensional feature map of every frame, and a mathematical transformation that does not change the dimensionality of the matrix still incurs a huge computational cost. The depth separable algorithm instead reduces convolution computation through the depthwise and pointwise operations, the streamlined factorization structure, and the two-step simplification of kernel dimensionality.
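The frequency-domain shortcut mentioned above, where correlation becomes element-wise multiplication after an FFT, can be verified on a small single-channel example (circular correlation; the function names are illustrative):

```python
import numpy as np

def corr_spatial(x, k):
    # Direct circular cross-correlation of two 2-D arrays.
    H, W = x.shape
    out = np.zeros_like(x)
    for u in range(H):
        for v in range(W):
            # roll(k, -u, -v)[m, n] == k[m + u, n + v] (mod size)
            out[u, v] = np.sum(x * np.roll(np.roll(k, -u, 0), -v, 1))
    return out

def corr_fft(x, k):
    # Same result in the frequency domain: conj(F(x)) * F(k), inverted.
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(k)))

rng = np.random.default_rng(2)
x = rng.standard_normal((16, 16))
k = rng.standard_normal((16, 16))
print(np.allclose(corr_spatial(x, k), corr_fft(x, k)))  # True
```

The direct loop costs O(N^4) for an N × N map, while the FFT route costs O(N^2 log N), which is the speed-up the correlation-filter literature exploits; as the text notes, however, the transform itself must still touch every element of every high-dimensional feature map.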
We use a deep separable convolution operation to reduce the dimensionality of the target-aware network feature matrix to improve the architecture, i.e., apply a deep separable network to a TADT network to construct a deep separable target-aware network, denoted as a DSATA network.
In summary, the DSATA network divides the template depth perception feature map extracted by the target-aware network into two parts, extracting by mean pooling a depthwise sparse convolution kernel and a pointwise convolution kernel. The depthwise sparse convolution kernel extracts region-related depth features, the pointwise convolution operation integrates them, and a depth feature with separated channels and regions is finally obtained. This depth feature is the same as the one obtained by conventional convolution, so the subsequent fully-connected layer and error back-propagation operations are unaffected, but the amount of computation is greatly reduced.
The extracted first depth feature matrix M1 × N1 × (A1+A2) × B is sent into the subsequently trained depth separable network and separated into a depthwise sparse convolution kernel of M1 × N1 × 1 × B and a pointwise convolution kernel of 1 × 1 × (A1+A2) × B for the template image.
The two types of separation convolution kernels are extracted from the template image by global mean pooling. Because the target template image is fixed, the extracted convolution kernels are fixed, which avoids a complex network iteration process and minimizes computational cost. Experimental verification shows that the fixed convolution kernels effectively reduce computational cost without reducing accuracy on the data set; on smoke data sets of specific types the accuracy even improves to different degrees, and the speed improves remarkably. On smoke image data sets of various practical scenes, experimental results show a speed-up of more than two times.
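One plausible reading of the mean-pooling separation above is sketched below; the exact pooling axes are an assumption made for illustration (the patent does not spell them out), and the shapes follow the notation M1 × N1 × (A1+A2) × B used in the text.

```python
import numpy as np

def split_template(feat):
    # feat: template depth feature of shape (M1, N1, A, B), A = A1 + A2.
    # Depthwise kernel (M1, N1, 1, B): mean over the A channels
    # (assumption: channel pooling yields the spatial template).
    depthwise = feat.mean(axis=2, keepdims=True)
    # Pointwise kernel (1, 1, A, B): global spatial mean per channel
    # (assumption: spatial pooling yields the channel weights).
    pointwise = feat.mean(axis=(0, 1), keepdims=True)
    return depthwise, pointwise

rng = np.random.default_rng(3)
M1, N1, A, B = 5, 5, 96, 3        # illustrative sizes, B scales
feat = rng.standard_normal((M1, N1, A, B))
dk, pk = split_template(feat)
print(dk.shape, pk.shape)          # (5, 5, 1, 3) (1, 1, 96, 3)
```

Because the template is fixed, both kernels are computed once, which matches the text's point that no iterative kernel update is needed at detection time.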
S130, performing a convolution operation on the depthwise sparse convolution kernel, namely the first separation feature matrix M1 × N1 × 1 × B, and the second depth feature matrix to obtain a region response matrix; then applying the pointwise convolution kernel, namely the second separation feature matrix 1 × 1 × (A1+A2) × B, to the region response matrix, the convolution operation extracting the final smoke response matrix.
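Step S130 can be sketched for a single scale as the two-stage correlation below. This is an illustrative simplification under stated assumptions: the depthwise kernel is taken as one shared spatial filter, the pointwise kernel as a per-channel weight vector, and "valid" correlation is used; shapes and names are not the patent's actual ones.

```python
import numpy as np

def response(dk, pk, target):
    # dk: (m, n) spatial kernel; pk: (A,) per-channel weights;
    # target: (H, W, A) depth feature of the target image.
    H, W, A = target.shape
    m, n = dk.shape
    region = np.zeros((H - m + 1, W - n + 1, A))
    for i in range(region.shape[0]):        # stage 1: depthwise correlation
        for j in range(region.shape[1]):
            region[i, j] = np.sum(
                target[i:i + m, j:j + n, :] * dk[:, :, None], axis=(0, 1))
    return region @ pk                       # stage 2: pointwise integration

rng = np.random.default_rng(4)
target = rng.standard_normal((12, 12, 96))   # second depth feature matrix
dk = rng.standard_normal((5, 5))             # from the first separation matrix
pk = rng.standard_normal((96,))              # from the second separation matrix
R = response(dk, pk, target)
print(R.shape)                                # (8, 8) smoke response matrix
```

The resulting 2-D response matrix is what step S140 searches for its maximum.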
And S140, acquiring the smoke position.
Calculating the maximum value of the smoke response matrix: if the maximum value is less than a preset threshold (for example, 0.8), no wildfire smoke is detected in the target image; if the maximum value is greater than or equal to the preset threshold, the position of the maximum value in the target image is taken as the smoke center point to obtain the smoke position.
Presetting a preselection frame, wherein the shape and the size of the preselection frame are set according to needs, for example, the preselection frame adopts a rectangular frame, and the middle point of the preselection frame is overlapped with the smoke center point; and the area of the target image selected by the pre-selection frame is the smoke position.
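Step S140 and the pre-selection frame can be sketched as follows (threshold 0.8 from the example above; the box size and function name are illustrative assumptions):

```python
import numpy as np

def locate_smoke(R, threshold=0.8, box=(32, 32)):
    # Return the pre-selection frame centred on the response peak,
    # or None when the peak falls below the alarm threshold.
    peak = R.max()
    if peak < threshold:
        return None
    cy, cx = np.unravel_index(R.argmax(), R.shape)  # smoke center point
    h, w = box
    return (cx - w // 2, cy - h // 2, w, h)          # (x, y, width, height)

R = np.zeros((100, 100))
R[40, 60] = 0.95                                # synthetic response peak
assert locate_smoke(np.zeros((100, 100))) is None   # below threshold: no smoke
print(locate_smoke(R))                          # (44, 24, 32, 32)
```

The midpoint of the returned rectangle coincides with the smoke center point, as the description requires.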
By way of example:
8 types of wildfire smoke video sequences are selected as the test data set, i.e., as target images, to verify the feasibility of the DSATA network. The DSATA network borrows the improved network layers of the target-aware deep tracking network and adds the depth separable algorithm in a global mean pooling layer to obtain the DSATA network.
These smoke videos come from standard data set resources, with different environmental conditions chosen to verify the performance of the algorithm. The selection of the smoke data set takes a number of factors into account, including climate conditions, camera resolution, and similar interference. Combining these conditions, 8 fire smoke video sequences are selected; the smoke video information is shown in table 1.
TABLE 1 Smoke video information
(Table 1 is provided as an image in the original publication.)
These smoke video collections take similar background interference into account. As shown in fig. 2, the blurred frames collected in video a simulate the effect of low-cost capture equipment, and video b adds similar-object effects such as white clouds. In videos c and d, remote image acquisition aggravates image blur; the smoke position must still be detected accurately at very low pixel resolution, which increases the experimental difficulty for the algorithm. In video e, under the influence of weather such as wind, the smoke moves and deforms violently, reducing recognition accuracy. The video f data set is collected under normal conditions, video g is affected by the similar white color of a wall, and video h is collected on a cloudy day at a lower resolution.
The experimental work used GTX 1080 GPU acceleration under the Ubuntu 16.04 operating system, with Windows 10 and Matlab 2018a used for smoke video capture and pre-processing. The accuracy of TADT and DSATA is calculated on the collected smoke videos. The TADT experimental results are visualized in fig. 3, with the frames selected from the smoke videos during algorithm operation; the DSATA visualization is shown in fig. 4.
Among the TADT target frames in fig. 3, the frame information of videos d, e, and g shows inaccuracy and frame drift, while the frame information shown in fig. 4 shows good performance. Because of the poor resolution of video d, the maximum of the correlation response map deviates to different degrees during detection, and the frame information drifts. Video e is strongly affected by wind: the smoke moves violently and its shape changes greatly, so the maximum point of the correlation response map changes violently between adjacent frames; this aggravates calculation errors, the complex correlation operation introduces further errors, and finally the frame is depicted inaccurately or even lost. In video g the white area of the wall is too large; the large white pixel values take too great a proportion in the correlation calculation, causing errors. Since the maximum points of corresponding maps in adjacent frames are closely related, an inaccurate calculation in one frame slightly shifts the maximum point in subsequent frames, and the frame is finally lost. The DSATA network in fig. 4 shows good performance against the interference in videos d, e, and g, because reducing the number of parameters in the accelerated calculation reduces errors and the severe frame drift caused by error accumulation. However, because the depth separable operation only reduces the number of parameters and does not substantially change the values of the correlation kernel parameters, the robustness against frame-information drift still needs further study; nevertheless the frame performance of the network retains high reference value. It should be noted that the frame information in figs. 3 and 4 was randomly sampled during code execution, and its performance explains the network performance in this document to a certain extent. To further illustrate the effectiveness of the DSATA network, table 2 compares the accuracy of the TADT and DSATA networks.
TABLE 2TADT to DSATA network accuracy comparison
Video Video a Video b Video c Video d Video e Video f Video g Video h Overall
Accuracy TADT (%) 99.17 87.40 98.94 96.48 79.57 98.27 95.8 98.8 94.30
Accuracy DSATA (%) 99.97 90.45 99.65 96.13 98.76 95.72 93.95 97.2 96.48
Table 2 compares the accuracy of the target-aware algorithm TADT on the smoke-detection target with the smoke-detection accuracy of the depth separable method. From table 2 the average accuracy of the TADT network is 94.30% and that of the DSATA algorithm is 96.48%, i.e., 2.18 percentage points higher than the TADT average. The test performance on video e shows that DSATA clearly performs excellently. The experimental performance of DSATA on video g is slightly worse than TADT, probably because the calibrated frame was selected to contain the true smoke area, lowering the calculated IOU score. The accuracy calculation of table 2 is shown in formula 16.
Accuracy = Tp / (Tp + Fp) × 100% (formula 16)
Wherein Tp denotes the frames in which the smoke sample is detected correctly, and Fp denotes the frames in which the frame drifts and a wrong smoke image is detected. Fig. 5 shows a comparison of the operating curves of the TADT network and the DSATA network.
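Formula 16 is a simple frame-level precision; a one-line sketch (the frame counts are illustrative, chosen to reproduce the video e figure from table 2):

```python
def accuracy(tp, fp):
    # Formula 16: correctly detected frames over all detected frames, percent.
    return 100.0 * tp / (tp + fp)

# Illustrative counts reproducing the DSATA accuracy on video e (98.76%).
print(round(accuracy(9876, 124), 2))
```
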
The average accuracy of the DSATA network in table 2 is only 2.18 percentage points higher than that of the TADT network; in terms of the accuracy index the two algorithms are comparable, and from this angle alone both achieve accurate positioning and recognition of the smoke target. From fig. 5 it can be found that DSATA operates more smoothly than TADT, reaches the accuracy peak faster under a stricter threshold, and loses fewer frames (measured by the Euclidean distance between the predicted maximum point of the response map and the center point of the labeled frame), which further illustrates the excellent running-speed performance of the algorithm. To further demonstrate the excellent speed performance of the DSATA network, table 3 gives a running-speed comparison of the DSATA and TADT networks on the 8 classes of smoke data.
TABLE 3TADT to DSATA network speed comparison
(Table 3 is provided as an image in the original publication.)
Table 3 is a speed comparison of the TADT and DSATA networks. The speed difference is most obvious on video d, where DSATA runs 2.67 times faster than the TADT network; the smallest gain is on video c, at 1.84 times; the average speed-up is 2.06 times. The data show that with the DSATA network the speed index is significantly better than that of the TADT network while the accuracy slightly increases. To further illustrate the experimental performance of the DSATA network, comparisons with other algorithms are added; table 4 compares the experimental accuracy of the DSATA network with other algorithms.
TABLE 4 comparison of accuracy rates for various algorithms
Algorithm HSV+KSVM DBN 3DCNN Faster-RCNN Saliency Detection TADT DSATA
Accuracy (%) 64.8 93.4 93.74 91.88 93.72 94.30 96.48
Table 4 compares the accuracy of seven algorithms, including the DSATA network. From the data it can be seen that the accuracy of the DSATA network has a strong competitive advantage over the other algorithms; therefore the DSATA network can meet the requirement of real-time smoke detection. Comparison of the experiments in tables 2-4 leads to the following conclusions:
Whether compared with traditional image-processing methods or with deep-learning algorithms, the DSATA network provided by the invention has stronger performance advantages.
Taken as a whole, DSATA adds the depth separable algorithm on the basis of TADT; combining the depth separable mechanism with the target-aware features improves the speed substantially while improving the algorithm precision by a small margin.
Example two
The second embodiment of the present invention discloses an early wildfire smoke detection device based on depth separation and target perception, which is a virtual device of the above embodiments, and as shown in fig. 6, the early wildfire smoke detection device includes:
the obtaining module 210 is configured to input the sample image and the target image into a target perceptual depth network respectively to obtain a first depth feature matrix and a second depth feature matrix;
a separation module 220 for separating the first depth feature matrix into a first separated feature matrix and a second separated feature matrix through a depth separable network;
a convolution module 230, configured to perform convolution operation on the first separation feature matrix and the second depth feature matrix to obtain a region response matrix; performing convolution operation on the second separation characteristic matrix and the region response matrix to obtain a smoke response matrix;
the detection module 240 is configured to obtain a maximum value of the smoke response matrix, and if the maximum value is smaller than a preset threshold, the target image does not detect wildfire smoke; and if the maximum value is larger than or equal to a preset threshold value, taking the target image position corresponding to the maximum value as a smoke central point to obtain a smoke position.
Preferably, the acquisition module comprises a first acquisition unit, which is used for inputting the sample image into a pre-trained VGG-16 network to output a first initial feature matrix; the first initial feature matrix outputs a first depth feature matrix through a trained target perception network, the first depth feature matrix being M1 × N1 × (A1+A2) × B, wherein M1 and N1 are result values of the length and the width of the sample image after convolution pooling respectively, A1 is the channel-optimal depth feature, A2 is the scale-sensitive depth feature, and B is the number of constructed scales.
Preferably, the obtaining module includes a second obtaining unit, configured to input the target image into a pre-trained VGG-16 network to output a second initial feature matrix, where the second initial feature matrix outputs a second depth feature matrix through a trained target sensing network, and the second depth feature matrix is M2 × N2 × (A1+A2), where M2 and N2 are result values of the length and the width of the target image after convolution pooling respectively.
Preferably, the separation module comprises a depth separation unit, which is used for separating the convolution kernel into a depth sparse convolution kernel and a point-by-point convolution kernel through a depth separation network, and forming a first separation feature matrix and a second separation feature matrix respectively based on the depth sparse convolution kernel and the point-by-point convolution kernel, wherein the first separation feature matrix is M1 × N1 × 1 × B, and the second separation feature matrix is 1 × 1 × (A1+ A2) × B.
Preferably, the obtaining of the smoke position by taking the target image position corresponding to the maximum value as the smoke central point includes: setting a preselection frame, and overlapping the middle point of the preselection frame with the smoke center point; and the area of the target image selected by the pre-selection frame is the smoke position.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the early wildfire smoke detection device based on depth separation and target perception, the included modules are only divided according to the functional logic, but not limited to the above division, as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (10)

1. An early wildfire smoke detection method based on depth separation and target perception is characterized by comprising the following steps:
respectively inputting the sample image and the target image into a target perception depth network to obtain a first depth characteristic matrix and a second depth characteristic matrix;
separating the first depth feature matrix into a first separated feature matrix and a second separated feature matrix by a depth separable network;
performing convolution operation on the first separation characteristic matrix and the second depth characteristic matrix to obtain a region response matrix; performing convolution operation on the second separation characteristic matrix and the region response matrix to obtain a smoke response matrix;
acquiring the maximum value of a smoke response matrix, and if the maximum value is smaller than a preset threshold value, detecting no wildfire smoke in the target image; and if the maximum value is larger than or equal to a preset threshold value, taking the target image position corresponding to the maximum value as a smoke central point to obtain a smoke position.
2. The early wildfire smoke detection method based on depth separation and target perception as claimed in claim 1, wherein the sample image is input into a target perception depth network to obtain a first depth feature matrix; the method comprises the following steps:
the method comprises the steps of inputting a sample image into a pre-trained VGG-16 network to output a first initial feature matrix, outputting a first depth feature matrix through a trained target perception network by the first initial feature matrix, wherein the first depth feature matrix is M1 × N1 × (A1+ A2) × B, M1 and N1 are result values of the sample image after length and width are subjected to convolution pooling respectively, A1 is a depth feature with an optimal channel, A2 is a depth feature sensitive to scale, and B is a plurality of constructed scales.
3. The early wildfire smoke detection method based on depth separation and target perception as claimed in claim 2, wherein the target image is input into a target perception depth network to obtain a second depth feature matrix; the method comprises the following steps:
inputting the target image into a pre-trained VGG-16 network to output a second initial feature matrix; the second initial feature matrix outputs a second depth feature matrix through the trained target perception network, wherein the second depth feature matrix is M2 × N2 × (A1+A2), and M2 and N2 are result values of the length and the width of the target image after convolution pooling respectively.
4. The early wildfire smoke detection method based on depth separation and target perception as claimed in claim 2, wherein separating the first depth signature matrix into a first separated signature matrix and a second separated signature matrix by a depth separable network comprises:
separating the convolution kernel into a depth sparse convolution kernel and a point-by-point convolution kernel through a depth separation network, and respectively forming a first separation feature matrix and a second separation feature matrix on the basis of the depth sparse convolution kernel and the point-by-point convolution kernel, wherein the first separation feature matrix is M1 × N1 × 1 × B, and the second separation feature matrix is 1 × 1 × (A1+ A2) × B.
5. The early wildfire smoke detection method based on depth separation and target perception as claimed in any one of claims 1-4, wherein the obtaining of the smoke position with the target image position corresponding to the maximum value as the smoke center point comprises:
setting a preselection frame, and overlapping the middle point of the preselection frame with the smoke center point; and the area of the target image selected by the pre-selection frame is the smoke position.
6. An early wildfire smoke detection device based on depth separation and target perception, comprising:
the acquisition module is used for respectively inputting the sample image and the target image into a target perception depth network to obtain a first depth characteristic matrix and a second depth characteristic matrix;
a separation module to separate the first depth feature matrix into a first separated feature matrix and a second separated feature matrix by a depth separable network;
the convolution module is used for performing convolution operation on the first separation characteristic matrix and the second depth characteristic matrix to obtain a region response matrix; performing convolution operation on the second separation characteristic matrix and the region response matrix to obtain a smoke response matrix;
the detection module is used for obtaining the maximum value of the smoke response matrix, and if the maximum value is smaller than a preset threshold value, the target image does not detect wildfire smoke; and if the maximum value is larger than or equal to a preset threshold value, taking the target image position corresponding to the maximum value as a smoke central point to obtain a smoke position.
7. The early wildfire smoke detection device based on depth separation and target perception as claimed in claim 6, wherein the obtaining module comprises a first obtaining unit for inputting the sample image into a pre-trained VGG-16 network to output a first initial feature matrix, the first initial feature matrix outputs a first depth feature matrix through a trained target perception network, the first depth feature matrix is M1 × N1 × (A1+ A2) × B, wherein M1 and N1 are the result values of the sample image after convolution pooling respectively for length and width, A1 is the channel-optimized depth feature, A2 is the scale-sensitive depth feature, and B is a plurality of scales constructed.
8. The early wildfire smoke detection device based on depth separation and target perception as claimed in claim 7, wherein the acquisition module comprises a second acquisition unit for inputting the target image into a pre-trained VGG-16 network to output a second initial feature matrix, the second initial feature matrix being passed through a trained target-aware network to output the second depth feature matrix, the second depth feature matrix being of size M2 × N2 × (A1+A2), wherein M2 and N2 are respectively the length and width of the target image after convolution and pooling.
9. The early wildfire smoke detection device based on depth separation and target perception as claimed in claim 7, wherein the separation module comprises:
a depth separation unit, configured to separate the convolution kernels, through the depth separation network, into a depthwise convolution kernel and a pointwise convolution kernel, and to form the first separation feature matrix and the second separation feature matrix on the basis of the depthwise convolution kernel and the pointwise convolution kernel respectively, wherein the first separation feature matrix is of size M1 × N1 × 1 × B and the second separation feature matrix is of size 1 × 1 × (A1+A2) × B.
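The parameter saving that motivates factoring a standard convolution kernel into a depthwise kernel and a pointwise kernel, as in the depth separation unit above, can be illustrated with a simple count; the function name and the example dimensions (a 3 × 3 kernel over 64 input and 128 output channels) are hypothetical and not taken from the claims.

```python
def depthwise_separable_params(k, c_in, c_out):
    """Compare parameter counts of a standard k x k convolution with its
    depthwise-separable factorization: one k x k depthwise filter per input
    channel, followed by a 1 x 1 pointwise convolution across channels."""
    standard = k * k * c_in * c_out          # one k x k filter per (input, output) channel pair
    separable = k * k * c_in + c_in * c_out  # depthwise part + pointwise part
    return standard, separable

std, sep = depthwise_separable_params(3, 64, 128)
# The separable factorization uses roughly k*k-fold fewer parameters.
```

The same factorization also reduces multiply-accumulate operations by about the same factor, which is why depthwise separable networks are attractive for lightweight detection.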
10. The early wildfire smoke detection device based on depth separation and target perception as claimed in any one of claims 6-9, wherein taking the position in the target image corresponding to the maximum value as the smoke center point to obtain the smoke position comprises:
setting a preselection box and aligning the center point of the preselection box with the smoke center point; the region of the target image enclosed by the preselection box is the smoke position.
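Centering a preselection box on the smoke center point, as in claims 5 and 10, can be sketched as below; the function name `place_box`, the clamping-to-image-bounds behavior, and the coordinate convention (top, left, bottom, right) are illustrative assumptions rather than details stated in the claims.

```python
def place_box(center, box_size, image_size):
    """Center a preselection box of size (h, w) on the smoke center point
    (cy, cx), shifting it as needed to stay within the (H, W) image bounds.
    Returns (top, left, bottom, right) of the smoke position."""
    cy, cx = center
    h, w = box_size
    H, W = image_size
    top = min(max(cy - h // 2, 0), H - h)    # clamp vertically
    left = min(max(cx - w // 2, 0), W - w)   # clamp horizontally
    return (top, left, top + h, left + w)
```

For a center well inside the image the box is symmetric about it; for a center near the border the box is shifted inward so the reported smoke region never leaves the image.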
CN202010081696.0A 2020-02-06 2020-02-06 Early wildfire smoke detection method and device based on depth separation and target perception Pending CN111341059A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010081696.0A CN111341059A (en) 2020-02-06 2020-02-06 Early wildfire smoke detection method and device based on depth separation and target perception

Publications (1)

Publication Number Publication Date
CN111341059A true CN111341059A (en) 2020-06-26

Family

ID=71187782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010081696.0A Pending CN111341059A (en) 2020-02-06 2020-02-06 Early wildfire smoke detection method and device based on depth separation and target perception

Country Status (1)

Country Link
CN (1) CN111341059A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967401A (en) * 2020-08-19 2020-11-20 上海眼控科技股份有限公司 Target detection method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815904A * 2019-01-25 2019-05-28 Jilin University Fire identification method based on convolutional neural network
EP3561788A1 * 2016-12-21 2019-10-30 Hochiki Corporation Fire monitoring system
CN110443827A * 2019-07-22 2019-11-12 Zhejiang University UAV video single-target long-term tracking method based on an improved Siamese network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIN LI et al.: "Target-Aware Deep Tracking", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) *
YANG Jinsheng et al.: "Traffic sign recognition algorithm based on depthwise separable convolution", Chinese Journal of Liquid Crystals and Displays *

Similar Documents

Publication Publication Date Title
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN108062531B (en) Video target detection method based on cascade regression convolutional neural network
CN110008867B (en) Early warning method and device based on person abnormal behavior and storage medium
CN111898504B Target tracking method and system based on a Siamese recurrent neural network
CN108121931B (en) Two-dimensional code data processing method and device and mobile terminal
CN110309747B Fast deep pedestrian detection model supporting multiple scales
WO2017027321A1 (en) Business discovery from imagery
WO2023193401A1 (en) Point cloud detection model training method and apparatus, electronic device, and storage medium
CN111461113B (en) Large-angle license plate detection method based on deformed plane object detection network
CN111368634B (en) Human head detection method, system and storage medium based on neural network
Jiang et al. A self-attention network for smoke detection
Chen et al. Dr-tanet: Dynamic receptive temporal attention network for street scene change detection
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
CN114140623A (en) Image feature point extraction method and system
CN111640138A (en) Target tracking method, device, equipment and storage medium
CN111523586B (en) Noise-aware-based full-network supervision target detection method
CN115797929A (en) Small farmland image segmentation method and device based on double-attention machine system
CN111415370A (en) Embedded infrared complex scene target real-time tracking method and system
CN110706256A (en) Detection tracking algorithm optimization method based on multi-core heterogeneous platform
CN111341059A (en) Early wildfire smoke detection method and device based on depth separation and target perception
CN114492755A (en) Target detection model compression method based on knowledge distillation
CN111062388B (en) Advertisement character recognition method, system, medium and equipment based on deep learning
CN112989932A Few-shot forest fire smoke identification method and device based on an improved prototype network
CN110580712A (en) Improved CFNet video target tracking method using motion information and time sequence information
CN115761888A (en) Tower crane operator abnormal behavior detection method based on NL-C3D model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200626