CN111986240A

CN111986240A - Drowning person detection method and system based on visible light and thermal imaging data fusion

Info

Publication number: CN111986240A
Application number: CN202010904133.7A
Authority: CN
Inventors: 文捷; 祝闯; 李春旭; 贾昕宇; 姚治萱; 刘军; 耿雄飞; 乔媛媛
Original assignee: Beijing University of Posts and Telecommunications; China Waterborne Transport Research Institute
Current assignee: Beijing University of Posts and Telecommunications; China Waterborne Transport Research Institute
Priority date: 2020-09-01
Filing date: 2020-09-01
Publication date: 2020-11-24

Abstract

The invention discloses a method and a system for detecting a person who has fallen into water based on the fusion of visible light and thermal imaging data. The method comprises the following steps: simultaneously acquiring a visible light image and an infrared image with a dual-light camera, the dual-light camera comprising an optical camera and an infrared thermal imaging camera; carrying out image registration on the infrared image and the visible light image; inputting the registered infrared image and visible light image into a pre-trained fusion network, which outputs a fused image; and inputting the fused image into a pre-trained detection network, which outputs a detection result indicating whether a person has fallen into the water. Because the method fuses the visible light image with the infrared image, the fused image both highlights the human body and retains texture features, so that detection accuracy and recall can be greatly improved.

Description

Drowning person detection method and system based on visible light and thermal imaging data fusion

Technical Field

The invention relates to the field of search and rescue, in particular to a drowning person detection method and system based on visible light and thermal imaging data fusion.

Background

Every year, accidents such as crew members or tourists accidentally falling into the water and ships capsizing or sinking cause tens of thousands of drowning deaths. The main reasons are that the water flow is turbulent and the water area is large, so a person in the water is difficult to find and locate. With upgrades in computing hardware and the optimization of artificial intelligence algorithms, image processing and detection have been applied to many problems, but the detection of people who have fallen into water remains an urgent unsolved problem.

Image fusion is an enhancement technique that combines images acquired by different types of sensors to generate an image that is more robust or richer in information, so as to facilitate subsequent processing or support decision making.

Fusing infrared and visible light images is attractive for three reasons. First, the two signals come from different modalities and therefore describe different aspects of the scene: visible light images capture reflected light, while infrared images capture thermal radiation, so the combination is more informative than either single-modality signal. Second, infrared and visible light images capture characteristics inherent to almost all objects, and they can be obtained with relatively simple equipment. Finally, the two kinds of image have complementary properties, so their fusion yields an image that is robust and rich in information. Visible light images generally have high spatial resolution and considerable detail and contrast, so they match human visual perception; however, they are susceptible to adverse conditions such as low light, fog and bad weather. Infrared images, which record the thermal radiation of objects, resist these disturbances but generally have lower resolution and poorer texture. Because the images involved are ubiquitous and complementary, visible and infrared image fusion has a wider range of applications than other fusion techniques.

Fusing visible light and infrared images is of great significance for person detection, and especially for detecting people who have fallen into water. If only the visible light image is used, detection is difficult: in a turbulent, murky river only a small part of the person is exposed above the surface, and the person almost blends into the water, so that neither the naked eye nor a camera can easily distinguish them. Even a very good detection algorithm can hardly detect them accurately and without omissions, and that is under good lighting; at night or in fog, detection fails completely. The infrared image, by contrast, separates the human body from the background well: the body is warmer than the river water, so it appears brighter than the water in the infrared image and stands out. However, the infrared image has low resolution and lacks texture features, so only rough contour information can be obtained; if a warm object with a shape similar to a person in the water appears in the scene, false detections and missed detections occur easily.

Disclosure of Invention

The present invention is directed to overcoming the technical defects, and embodiment 1 of the present invention provides a method for detecting a man falling into water based on visible light and thermal imaging data fusion, where the method includes:

simultaneously acquiring a visible light image and an infrared image with a dual-light camera, the dual-light camera comprising an optical camera and an infrared thermal imaging camera;

carrying out image registration on the infrared image and the visible light image;

inputting the registered infrared image and visible light image into a pre-trained fusion network, and outputting a fusion image;

and inputting the fused image into a pre-trained detection network, and outputting a detection result of whether the person falls into water or not.

As an improvement of the above method, the image registration of the infrared image and the visible light image specifically includes:

respectively extracting an edge map of the infrared image and an edge map of the visible light image;

aligning the edge map of the infrared image with the edge map of the visible light image to obtain aligned edge maps;

and transforming the infrared image and the visible light image, respectively, according to the aligned edge maps to obtain the aligned infrared image and visible light image.

As an improvement of the above method, the fusion network comprises a first convolutional layer, a dense block, a fusion layer and a plurality of cascaded convolutional layers which are connected in sequence;

the first convolutional layer is used for extracting the deep features of the aligned visible light image and of the aligned infrared image, respectively, and outputting the deep features of the visible light image and of the infrared image;

the dense block comprises a visible light branch and an infrared branch, each of which comprises three convolutional layers connected in sequence; the deep features of the visible light image are fed to the three convolutional layers of the visible light branch, and within that branch the output of each convolutional layer is concatenated into the input of all subsequent convolutional layers; the deep features of the infrared image are fed to the three convolutional layers of the infrared branch, and within that branch the output of each convolutional layer is likewise concatenated into the input of all subsequent convolutional layers;

the fusion layer is used for fusing the visible light feature map output by the visible light branch and the infrared feature map output by the infrared branch by applying the L1 norm and a softmax operation, and outputting a fused feature map;

the plurality of cascaded convolutional layers are used for forming a decoder and converting the fused feature map into a fused picture.

As an improvement of the above method, the loss function L_fus of the fusion network is a weighted sum of the pixel loss function L_p and the structural similarity loss function L_ssim:

L_fus＝λL_ssim+L_p

L_p＝‖O-I‖²

L_ssim＝1-SSIM(O,I)

wherein L_p represents the Euclidean distance between the output image O and the input image I, and SSIM(O, I) represents the structural similarity between the output image O and the input image I, the structural similarity comprising three components: correlation, luminance loss and contrast distortion; λ = 1000.

As an improvement of the above method, the detection network is a convolutional neural network (CNN) whose backbone network is a modified Darknet-53: the last fully connected layer is removed and down-sampling is realized by convolution instead of pooling layers, forming a fully convolutional network that makes extensive use of residual skip connections;

the input of the detection network is the fused image; the processing is as follows: the fused image is divided into S × S grid cells, and if the center of an object falls in a given cell, that cell is responsible for predicting the object; several bounding boxes are predicted for each cell, a confidence is predicted for each bounding box, and prediction analysis is performed cell by cell;

the output of the detection network is three feature maps with different scales, so that targets with different sizes are detected by adopting multiple scales, and finally, predicted bounding boxes, classification and confidence coefficients are output to identify people falling into water.

As an improvement of the above method, the loss function L_dec of the detection network is the sum of the error L_box introduced by the bounding box, the error L_cls introduced by the class, and the error L_obj introduced by the confidence:

L_dec＝L_box+L_cls+L_obj

wherein S is the number of grid cells in the horizontal direction, which is the same as the number in the vertical direction; B is the number of bounding boxes per grid cell; an indicator term denotes whether the j-th anchor box of the i-th grid cell is responsible for the object; w_i and h_i are the predicted width and height for the i-th grid cell, and their counterparts are the true width and height for the i-th grid cell; x_i and y_i are the predicted center coordinates for the i-th grid cell, and their counterparts are the true center coordinates for the i-th grid cell; λ_coord, λ_class, λ_nobj and λ_obj are all parameters; p_i(c) is the predicted probability of class c, and its counterpart is the true probability of class c, where classes denotes the set of classes; c_i is the predicted confidence, and the true confidence takes a value determined by whether the cell is responsible for predicting the object.
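The component formulas for L_box, L_cls and L_obj are not present in this text. For orientation only, the block below writes out one commonly used YOLO-style form that is consistent with the symbols defined above. It is an assumption, not the formulas of the publication: the hat notation for true values, the indicator symbol, the squared-error box and confidence terms and the cross-entropy class term are all illustrative choices.

```latex
% Illustrative YOLO-style loss; hats mark ground-truth values (assumed notation).
% \mathbb{1}_{ij}^{obj} = 1 if the j-th anchor box of the i-th cell is responsible
% for an object, and 0 otherwise.
\begin{aligned}
L_{box} &= \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2
       + (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] \\
L_{cls} &= -\lambda_{class} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj}
  \sum_{c \in classes} \hat{p}_i(c) \log p_i(c) \\
L_{obj} &= \lambda_{obj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} (c_i - \hat{c}_i)^2
  + \lambda_{nobj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \left(1 - \mathbb{1}_{ij}^{obj}\right) (c_i - \hat{c}_i)^2 \\
L_{dec} &= L_{box} + L_{cls} + L_{obj}
\end{aligned}
```

In this form λ_obj and λ_nobj balance cells that contain an object against the far more numerous empty cells, which matches the roles suggested by their names.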

As an improvement of the above method, the method further comprises: the step of training the fusion network and the detection network specifically comprises the following steps:

establishing a training set, capturing visible light and infrared images by using a dual-light camera, obtaining a fused image through the registration and fusion processes, and marking the image containing the person falling into the water;

the joint loss function L for both networks is:

L＝L_fus+L_dec

and training by using a training set and using the loss function and a gradient descent method to obtain the parameters of the network.

The embodiment 2 of the invention provides a drowning person detection system based on visible light and thermal imaging data fusion, which comprises: the system comprises an infrared thermal imaging camera, an optical camera, a trained fusion network, a trained detection network, an image registration module, a fusion module and a detection module;

the image registration module is used for simultaneously acquiring an infrared image acquired by the infrared thermal imaging camera and a visible light image acquired by the optical camera and performing image registration on the infrared image and the visible light image;

the fusion module is used for inputting the registered infrared image and visible light image into a trained fusion network and outputting a fusion image;

and the detection module is used for inputting the fusion image into a trained detection network and outputting a detection result of whether the person falls into water or not.

The invention has the advantages that:

the method and the system fuse the visible light image and the infrared image, so that the image not only highlights the human body, but also contains certain texture characteristics, and the detection accuracy and the recall rate are greatly improved.

Drawings

FIG. 1 is a flow chart of image registration of the present invention;

FIG. 2 is a schematic diagram of the fusion network of the present invention;

FIG. 3 is a schematic diagram of a fusion layer of the fusion network of the present invention;

FIG. 4 is a schematic diagram of a detection network of the present invention;

FIG. 5 is a schematic illustration of the use of multiple scales to detect targets of different sizes;

FIG. 6 is a schematic diagram of the detection-fusion reverse training of the present invention;

FIG. 7 is a flowchart of the method for detecting a person falling into water based on fusion of visible light and thermal imaging data according to the present invention.

Detailed Description

The technical solution of the present invention will be described in detail below with reference to the accompanying drawings.

The embodiment 1 of the invention provides a drowning person detection method based on visible light and thermal imaging data fusion, which comprises the following steps:

Step 1) image acquisition and registration

Step 1-1) image acquisition

A dual-light camera is used to acquire the visible light image and the infrared image simultaneously; the dual-light camera comprises an optical camera and an infrared thermal imaging camera.

Step 1-2) image registration

Since the infrared and visible light images are acquired by different sensors, they usually differ in size, perspective and field of view, and the dual-light camera also introduces a difference in viewing angle. Successful image fusion, however, requires the source images to be strictly geometrically aligned, so the visible light and infrared images must be registered before fusion. Registering infrared images with visible light images is a multimodal registration problem.

For the registration problem here, a feature-based registration method is used, which first extracts two groups of salient structures, then determines the correct correspondence between them, and estimates the spatial transformation accordingly, which is then used to align the given image pair.

The first step of the feature-based approach is to extract robust common features that can represent the original images. Edge information is one of the most common choices in infrared and visible image registration, as shown in FIG. 1, because the magnitude and orientation of edges are well preserved across the two modalities. An edge map can be discretized into a set of points, and one popular strategy for solving the resulting point matching problem involves two steps: first a set of putative correspondences is computed, and then outliers are removed by geometric constraints. In the first step, feature descriptors are computed at the points, and matches between points whose descriptors differ too much are eliminated. In the second step, false matches are removed from the putative set using random sample consensus (RANSAC), a hypothesize-and-verify method that resamples the matches in an attempt to find the smallest possible outlier-free subset from which the given parametric model is estimated.
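The outlier-removal step can be sketched as follows. This is a minimal illustration, not the registration code of the invention: it assumes the parametric model is a 2-D similarity transform (scale, rotation, translation), represents points as complex numbers, and uses hypothetical names (`fit_similarity`, `ransac_similarity`) and synthetic matches.

```python
import cmath
import random


def fit_similarity(p1, p2, q1, q2):
    """Solve q = a*p + b (complex a, b) exactly from two correspondences."""
    a = (q1 - q2) / (p1 - p2)
    b = q1 - a * p1
    return a, b


def ransac_similarity(src, dst, iters=200, tol=1.5, seed=0):
    """Hypothesize-and-verify estimation of a similarity transform.

    src[k] is putatively matched to dst[k]; points are complex (x + 1j*y).
    Returns (a, b, inlier_indices) for the hypothesis with the most inliers.
    """
    rng = random.Random(seed)
    best = (1 + 0j, 0j, [])
    n = len(src)
    for _ in range(iters):
        i, j = rng.sample(range(n), 2)      # minimal sample: two matches
        if src[i] == src[j]:
            continue                        # degenerate sample, skip
        a, b = fit_similarity(src[i], src[j], dst[i], dst[j])
        inliers = [k for k in range(n) if abs(a * src[k] + b - dst[k]) <= tol]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best


# Synthetic edge points: 25 putative matches, 4 of them deliberately wrong.
true_a = 1.2 * cmath.exp(1j * 0.1)          # scale 1.2, rotation 0.1 rad
true_b = complex(5, -3)                     # translation
src = [complex(x, y) for x in range(0, 50, 10) for y in range(0, 50, 10)]
dst = [true_a * p + true_b for p in src]
for k in (3, 7, 11, 19):
    dst[k] += complex(40, -25)              # corrupt these matches

a, b, inliers = ransac_similarity(src, dst)
```

The recovered `a` and `b` can then be used to warp one image onto the other. A real infrared-visible pair would usually need a richer model (affine or projective), which changes only the minimal sample size and the fitting function.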

Step 2) establishing a fusion network for image fusion

A deep learning architecture is employed to fuse the infrared and visible light images. Unlike a traditional convolutional network, the network combines convolutional layers, a fusion layer and a dense block in which the output of each layer is connected to every subsequent layer. This architecture extracts more useful features from the source images during encoding; a suitable fusion strategy then fuses the features, and finally a decoder reconstructs the fused image.

As shown in FIG. 2, the deep features of the visible light image and the infrared image are extracted before fusion: the first convolutional layer extracts coarse features, and three further convolutional layers, each of whose outputs is concatenated into the input of the subsequent layers, constitute a dense block. Such an architecture has two advantages. First, the filter size and the stride of the convolution operation are 3 × 3 and 1, respectively, so the input image can be of any size. Second, the dense block preserves deep features as much as possible in the encoding network, which ensures that all salient features are available to the fusion strategy.
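Both properties can be checked with simple shape arithmetic. The sketch below is illustrative only: the channel count of 16 per layer and the padding of 1 are assumptions, since the text fixes only the 3 × 3 filter, the stride of 1 and the three densely connected layers.

```python
def same_conv_size(n, kernel=3, stride=1, pad=1):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1


def dense_block_channels(c_in, growth, num_layers=3):
    """Input channel count seen by each dense layer, and the block output.

    Each layer's output (growth channels) is concatenated onto the running
    feature stack, so later layers see all earlier outputs.
    """
    inputs = []
    c = c_in
    for _ in range(num_layers):
        inputs.append(c)
        c += growth
    return inputs, c


# Assumed values: 16 channels from the first convolution, 16 per dense layer.
layer_inputs, block_out = dense_block_channels(c_in=16, growth=16)
sizes = [same_conv_size(n) for n in (64, 237, 1080)]
```

With a stride of 1 and a padding of 1, the 3 × 3 convolution leaves the spatial size unchanged for any input, which is why no fixed input resolution is required.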

As shown in FIG. 3, the L1 norm and softmax operations are applied in the fusion layer.
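One possible reading of this fusion rule is sketched below. The text names only the L1 norm and softmax, so the details are assumptions: the activity level of each modality at a pixel is taken as the L1 norm of its feature vector across channels, and the two activity levels are turned into weights by a softmax over the two modalities. The function name `l1_softmax_fuse` is hypothetical.

```python
import math


def l1_softmax_fuse(feat_vis, feat_ir):
    """Fuse two feature maps of shape [C][H][W] pixel by pixel."""
    C, H, W = len(feat_vis), len(feat_vis[0]), len(feat_vis[0][0])
    fused = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            # Activity level: L1 norm of the channel vector at this pixel.
            a_vis = sum(abs(feat_vis[c][y][x]) for c in range(C))
            a_ir = sum(abs(feat_ir[c][y][x]) for c in range(C))
            # Softmax over the two modalities (max subtracted for stability).
            m = max(a_vis, a_ir)
            e_vis, e_ir = math.exp(a_vis - m), math.exp(a_ir - m)
            w_vis = e_vis / (e_vis + e_ir)
            w_ir = 1.0 - w_vis
            for c in range(C):
                fused[c][y][x] = (w_vis * feat_vis[c][y][x]
                                  + w_ir * feat_ir[c][y][x])
    return fused


# Two channels, one row, two pixels. At pixel 0 both modalities are equally
# active; at pixel 1 only the infrared features respond (a warm body).
feat_vis = [[[1.0, 0.0]], [[1.0, 0.0]]]
feat_ir = [[[1.0, 4.0]], [[1.0, 4.0]]]
fused = l1_softmax_fuse(feat_vis, feat_ir)
```

Where one modality responds much more strongly, as the infrared features do at the second pixel, its weight approaches 1 and the fused features follow that modality almost entirely.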

The fusion layer is followed by a plurality of convolutional layers (3 × 3 convolutions): the output of the fusion layer is the input of these convolutional layers, which together constitute a decoder that converts the fused feature map into the fused image. This simple and efficient architecture is used to reconstruct the final fused image.

The loss function of the fusion network is a weighted sum of the pixel loss function L_p and the structural similarity loss function L_ssim:

L_p＝‖O-I‖²

L_ssim＝1-SSIM(O,I)

L_fus＝λL_ssim+L_p

where O and I denote the output image and the input image, respectively. L_p is the Euclidean distance between the output O and the input I. SSIM denotes the structural similarity of two images; the index consists mainly of three components, namely correlation, luminance loss and contrast distortion, and the product of the three components is the evaluation result for the fused image. Since the pixel loss and the SSIM loss differ by three orders of magnitude, λ is set to 1000 during the training phase.
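The weighted loss can be sketched as below. This is a simplified illustration under stated assumptions: images are flat lists of intensities in [0, 1], SSIM is computed once over the whole image rather than over sliding windows, and the stabilizing constants (0.01 and 0.03 times the data range, squared) are the values commonly used for SSIM, not values given in the text.

```python
def pixel_loss(out, ref):
    """L_p: squared Euclidean distance between two flat pixel lists."""
    return sum((o - i) ** 2 for o, i in zip(out, ref))


def global_ssim(out, ref, data_range=1.0):
    """SSIM over a single window covering the whole image (simplification)."""
    n = len(out)
    mu_o = sum(out) / n
    mu_i = sum(ref) / n
    var_o = sum((o - mu_o) ** 2 for o in out) / n
    var_i = sum((i - mu_i) ** 2 for i in ref) / n
    cov = sum((o - mu_o) * (i - mu_i) for o, i in zip(out, ref)) / n
    c1 = (0.01 * data_range) ** 2           # assumed standard constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_o * mu_i + c1) * (2 * cov + c2)) / (
        (mu_o ** 2 + mu_i ** 2 + c1) * (var_o + var_i + c2))


def fusion_loss(out, ref, lam=1000.0):
    """L_fus = lam * L_ssim + L_p, with L_ssim = 1 - SSIM(O, I)."""
    return lam * (1.0 - global_ssim(out, ref)) + pixel_loss(out, ref)


reference = [0.1, 0.4, 0.8, 0.3, 0.6, 0.9]
degraded = [0.1, 0.5, 0.7, 0.3, 0.5, 0.9]
loss_same = fusion_loss(reference, reference)
loss_degraded = fusion_loss(degraded, reference)
```

Since 1 - SSIM lies in a small range near zero for similar images, the factor of 1000 keeps the structural term from being swamped by the pixel term.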

Step 3) establishing a detection network for detecting the person falling into water

A convolutional neural network (CNN) is used to recognize people in the water as targets. The central idea of the detection network is to divide the image into S × S grid cells; if the center of an object falls in a given cell, that cell is responsible for predicting the object. Several bounding boxes are predicted for each cell, a confidence is predicted for each bounding box, and prediction analysis is performed cell by cell.
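The assignment of an object to its responsible cell is a small calculation, sketched here. The function name `responsible_cell` and the example image size of 416 × 416 with S = 13 are assumptions chosen for illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell containing the box center.

    cx, cy are the center coordinates in pixels; a center lying exactly on
    the right or bottom image border is clamped into the last cell.
    """
    col = min(int(cx * S / img_w), S - 1)
    row = min(int(cy * S / img_h), S - 1)
    return row, col


# A person whose box center is at (200, 100) in a 416 x 416 fused image.
cell = responsible_cell(200, 100, 416, 416, S=13)
corner = responsible_cell(416, 416, 416, 416, S=13)
```

Only the cell returned here contributes box and class errors for this object during training; all other cells are treated as containing no object for it.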

The backbone network is a modified Darknet-53, as shown in FIG. 4. The network has high classification accuracy, high computation speed and relatively few layers. The fully connected layer is removed, making it a fully convolutional network that makes extensive use of residual skip connections. To reduce the negative effect of pooling on gradients, pooling layers are abandoned and down-sampling is realized through the stride of the convolutional layers: in this network structure, down-sampling is performed by convolutions with a stride of 2.

Following the feature pyramid network (FPN), the network outputs three feature maps of different scales and uses the multiple scales to detect targets of different sizes, so that finer grid cells can detect finer objects, as shown in FIG. 5.
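The grid sizes of the three scales follow from the stride-2 down-sampling. The sketch below assumes a 416 × 416 input, 3 × 3 stride-2 convolutions with a padding of 1, five down-sampling stages and the use of the last three stages as outputs; these are typical values for Darknet-53 based detectors and are not stated in the text.

```python
def conv_out(n, kernel=3, stride=2, pad=1):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - kernel) // stride + 1


def feature_map_sizes(input_size, num_downsamples=5):
    """Sizes after each successive stride-2 convolution."""
    sizes = []
    n = input_size
    for _ in range(num_downsamples):
        n = conv_out(n)
        sizes.append(n)
    return sizes


sizes = feature_map_sizes(416)      # size after each down-sampling stage
scales = sizes[-3:]                 # the three detection scales (assumed)
```

Under these assumptions the finest output grid has 52 × 52 cells and the coarsest 13 × 13, so a small target such as a head above the water surface is handled by the finest grid.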

Before model training, a data set of fused images must first be prepared: visible light and infrared images are captured with the dual-light camera, fused images are obtained through the registration and fusion processes described above, and the people in the water are labeled to produce a data set in the format required for training. A pre-trained model is then selected for training, yielding an algorithm model that can recognize people in the water in the fused visible-infrared images. Indexes such as the accuracy of the model are then evaluated, and the data set, the algorithm and other aspects are optimized to achieve a better recognition result.

The loss function of the detection network is divided into three parts: the error L_box introduced by the bounding box, the error L_obj introduced by the confidence, and the error L_cls introduced by the class.

The loss function is the sum of the above three errors:

L_dec＝L_box+L_cls+L_obj

L＝L_fus+L_dec

Step 4) detection-fusion reverse training

The usual goal of visible-infrared image fusion is for the fused image to contain as much information from both source images as possible, losing neither the contrast information of the infrared image nor the texture information of the visible light image; in other words, the fused image should suit the human visual system. The loss function of the initial fusion process is therefore defined as a weighted sum of the pixel loss function and the structural similarity loss function.

The primary goal of the method of the invention is to detect people in the water accurately. The image fusion result is only an intermediate product, and both the image fusion process and the detection process are optimized toward the final goal of accurate detection. To reach this goal, the training of image fusion should be corrected so that the loss function of the detection process also guides fusion, and the final detection result is already optimized for in the fusion stage.

As shown in FIG. 6, the person in the water is first labeled on the registered visible light or infrared image. Because the images have been registered and aligned, the position of the target does not change after fusion, so the label can be copied onto the fused image as the ground truth. The fused image passes through the detection network to obtain the predicted bounding box, classification and confidence, which are compared with the label to obtain the detection error, i.e., L_dec. This loss function is used to evaluate and optimize not only the detection network but also the fusion network, which is equivalent to correcting the loss function of the fusion network to:

L＝L_fus+L_dec

in this way, the achievement of the final objective is facilitated.
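The effect of the joint loss on the fusion parameters can be illustrated with a deliberately tiny toy model. Everything in it is a stand-in chosen for illustration: the fusion network is reduced to one mixing weight `a`, the detection network to one gain `b`, both losses are squared errors, and the pixel values, label and learning rate are arbitrary.

```python
def joint_losses(a, b, vis, ir, label):
    """Toy forward pass: returns (fused, score, l_fus, l_dec)."""
    fused = a * vis + (1.0 - a) * ir        # toy fusion network
    score = b * fused                       # toy detection network
    l_fus = (fused - vis) ** 2              # stand-in for L_fus
    l_dec = (score - label) ** 2            # stand-in for L_dec
    return fused, score, l_fus, l_dec


def gradients(a, b, vis, ir, label, use_detection_loss=True):
    """Analytic gradients of L = L_fus + L_dec with respect to a and b."""
    fused, score, _, _ = joint_losses(a, b, vis, ir, label)
    dfused_da = vis - ir
    grad_a = 2.0 * (fused - vis) * dfused_da
    if use_detection_loss:
        # Detection error flows back through the fused image into a.
        grad_a += 2.0 * (score - label) * b * dfused_da
    grad_b = 2.0 * (score - label) * fused
    return grad_a, grad_b


vis, ir, label = 0.2, 0.9, 1.0
a, b = 0.5, 1.0
_, _, l_fus, l_dec = joint_losses(a, b, vis, ir, label)
initial_loss = l_fus + l_dec

grad_a_fusion_only, _ = gradients(a, b, vis, ir, label, use_detection_loss=False)
grad_a_joint, _ = gradients(a, b, vis, ir, label)

for _ in range(500):                        # plain gradient descent on L
    grad_a, grad_b = gradients(a, b, vis, ir, label)
    a -= 0.02 * grad_a
    b -= 0.02 * grad_b

_, _, l_fus, l_dec = joint_losses(a, b, vis, ir, label)
final_loss = l_fus + l_dec
```

At the starting point the gradient for the fusion weight differs between the fusion-only loss and the joint loss, which is the sense in which the detection error guides the fusion network.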

Step 5) Image acquisition, image registration, image fusion and target detection are used to detect the person in the water so as to facilitate subsequent positioning and rescue, as shown in FIG. 7.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method of drowning person detection based on visible light and thermal imaging data fusion, the method comprising:

2. The method for detecting man overboard based on fusion of visible light and thermal imaging data as claimed in claim 1, wherein the image registration of the infrared image and the visible light image specifically comprises:

3. The method for detecting man-in-the-water based on fusion of visible light and thermal imaging data according to claim 2, wherein the fusion network comprises a first convolutional layer, a dense block, a fusion layer and a plurality of cascaded convolutional layers connected in sequence;

4. The method for drowning person detection based on fusion of visible light and thermal imaging data according to claim 3, characterized in that the loss function L_fus of the fusion network is a weighted sum of the pixel loss function L_p and the structural similarity loss function L_ssim:

L_fus＝λL_ssim+L_p

L_p＝‖O-I‖²

L_ssim＝1-SSIM(O,I)

5. The method for detecting man-in-water based on fusion of visible light and thermal imaging data as claimed in claim 4, wherein the detection network is a convolutional neural network (CNN) whose backbone network is a modified Darknet-53, in which the last fully connected layer is removed and down-sampling is realized by convolution instead of pooling layers, so as to form a fully convolutional network that makes extensive use of residual skip connections;

6. The drowning person detection method based on visible light and thermal imaging data fusion of claim 5, characterized in that the loss function L_dec of the detection network is the sum of the error L_box introduced by the bounding box, the error L_cls introduced by the class, and the error L_obj introduced by the confidence:

L_dec＝L_box+L_cls+L_obj

7. The method for drowning person detection based on fusion of visible light and thermal imaging data according to claim 6, characterized in that the method further comprises: the step of training the fusion network and the detection network specifically comprises the following steps:

the joint loss function L for both networks is:

L＝L_fus+L_dec

8. A drowning person detection system based on visible light and thermal imaging data fusion, the system comprising: the system comprises an infrared thermal imaging camera, an optical camera, a trained fusion network, a trained detection network, an image registration module, a fusion module and a detection module;