CN111339892A - Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Info

Publication number
CN111339892A
CN111339892A
Authority
CN
China
Prior art keywords
feature
layer
video
behavior
input
Prior art date
Legal status
Granted
Application number
CN202010106457.6A
Other languages
Chinese (zh)
Other versions
CN111339892B (en)
Inventor
纪刚
商胜楠
周萌萌
周亚敏
周粉粉
Current Assignee
Qingdao Lianhe Chuangzhi Technology Co ltd
Original Assignee
Qingdao Lianhe Chuangzhi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Lianhe Chuangzhi Technology Co ltd
Priority to CN202010106457.6A
Publication of CN111339892A
Application granted
Publication of CN111339892B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of video monitoring and relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, comprising: S1, performing pixel-level binary labeling on the original surveillance video; S2, feeding each video clip into the 3D convolutional neural network at the encoder end to obtain the feature cube F_i of the input video clip v_i; S3, using a segmentation branch to make pixel-level background / behavior-foreground predictions for each frame of the video clip v_i; S4, cropping the features at the corresponding positions of the feature cube F_i, feeding them into a recognition branch, and performing ToI pooling to obtain a predicted behavior label; S5, reading the real-time video stream of the swimming pool area, localizing the behavior positions of swimmers, predicting behavior labels, and judging whether abnormal behavior such as drowning occurs. The pixel-level binary labeling scheme of the method saves the time consumed in labeling samples, and the pixel-level behavior localization it provides is more accurate than bounding-box localization and avoids the problem of difficult bounding-box regression convergence.

Description

Swimming pool drowning detection method based on end-to-end 3D convolutional neural network
Technical field:
The invention belongs to the technical field of video monitoring, relates to a method for detecting drowning behavior from surveillance video, and particularly relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network.
Background art:
A convolutional neural network is a feedforward neural network that involves convolutional computation and has a deep structure, and it is one of the representative algorithms of deep learning. Convolutional neural networks have feature-learning ability and can perform shift-invariant classification of input information according to their hierarchical structure; a convolutional neural network comprises an input layer, hidden layers and an output layer.
The input layer of a convolutional neural network can process multidimensional data. The input layer of a one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array, where a one-dimensional array is usually a time-domain or spectrum sample and a two-dimensional array may contain multiple channels; the input layer of a two-dimensional convolutional neural network receives a two-dimensional or three-dimensional array; the input layer of a three-dimensional convolutional neural network receives a four-dimensional array. The hidden layers of a convolutional neural network comprise three common structures: convolutional layers, pooling layers and fully connected layers; some more modern architectures add complex structures such as Inception modules and residual blocks. A convolutional layer extracts features from the input data; it contains multiple convolution kernels, and each element of a kernel corresponds to a weight coefficient and a bias. After feature extraction in a convolutional layer, the output feature map is passed to a pooling layer for feature selection and information filtering. A pooling layer contains a preset pooling function whose role is to replace the result at a single point of the feature map with a statistic of its neighboring region. The fully connected layers of a convolutional neural network are equivalent to the hidden layers of a traditional feedforward neural network; they are located in the last part of the hidden layers and only pass signals to other fully connected layers. In the fully connected layers the feature map loses its spatial topology, is flattened into a vector and passed through an activation function. Since the fully connected layers usually sit immediately upstream of the output layer, their structure and working principle are the same as those of the output layer of a traditional feedforward neural network. For image classification problems, the output layer outputs classification labels using a logistic function or a normalized exponential (softmax) function.
In the prior art, Chinese patent publication No. CN108304806A discloses a gesture recognition method based on log-path-integral features and a convolutional neural network, comprising the steps of: labeling video data and training a hand detector based on Fast-RCNN; detecting the video samples frame by frame with the hand detector to obtain the hand position in each frame; constructing two-dimensional, three-dimensional and four-dimensional hand trajectories from the per-frame hand positions combined with time and depth; performing data enhancement on the hand trajectories; extracting the corresponding log-path-integral features from the enhanced trajectory samples; arranging the log-path-integral features according to spatial position information to construct a corresponding feature cube; and taking the feature cube as the input of a convolutional neural network that finally outputs the recognition result. Chinese patent publication No. CN108830157A discloses a human behavior recognition method based on an attention mechanism and a 3D convolutional neural network; the method constructs a 3D convolutional neural network whose input layer comprises two channels, namely the original grayscale map and an attention matrix. The method builds a 3D CNN model for recognizing human behaviors in video and introduces an attention mechanism: the distance between two frames is computed as an attention matrix, the attention matrix and the original human behavior video sequence form a two-channel input to the constructed 3D CNN, and convolution operations extract the key features of visually salient regions. The 3D CNN structure is also optimized: a Dropout layer is added to randomly freeze part of the network connection weights, and the ReLU activation function is used to improve network sparsity. At present, processing and analyzing surveillance video with convolutional neural networks requires a large number of training samples and prior anchor boxes, and, relying on the selective search method, suffers from difficult sample labeling, inaccurate localization and difficult bounding-box regression convergence.
Summary of the invention:
The invention aims to overcome the defects of the prior art: to address the large number of training samples required for processing and analyzing surveillance video, the need for prior anchor boxes, and the problems of difficult sample labeling, inaccurate localization and difficult bounding-box regression convergence, a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network is designed.
To achieve this purpose, the swimming pool drowning detection method based on an end-to-end 3D convolutional neural network according to the invention comprises the following specific process steps:
S1, pixel-level labeling of training samples
A camera in the swimming pool area is turned on to capture surveillance video, and the original surveillance video is given pixel-level binary labels: foreground pixels are marked positive and background pixels negative. The labeled surveillance video is used as input for subsequent network training;
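As an illustration only, a pixel-level binary label for one frame can be stored as a 0/1 mask of the same size as the frame; the sketch below is a minimal example (the frame size and file name are hypothetical, not from the patent):

```python
import numpy as np

# Pixel-level binary label for one 240x320 frame: behavior-foreground pixels
# (the swimmer) are marked 1 (positive), background pixels 0 (negative).
mask = np.zeros((240, 320), dtype=np.uint8)

# Suppose the annotated swimmer region for this frame is the block below;
# in practice it would come from a polygon or brush annotation tool.
mask[100:140, 150:200] = 1

np.save("frame_0001_label.npy", mask)  # stored alongside the video frame
```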
S2, feature cube extraction
Given a video dataset, a video V_i in the dataset is divided into n fixed-length video clips, i.e. V_i = {v_1, v_2, ..., v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between successive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max pooling layers, yielding the encoder-side spatio-temporal feature f_i of the clip v_i. To generate a pixel-level segmentation map at the original size for each frame, 3D upsampling layers are used at the decoder end to increase the resolution of the spatio-temporal feature f_i. After each 3D upsampling layer, a cascade connection (concatenation) with the corresponding encoder-side spatio-temporal feature f_i is applied in order to obtain spatial and temporal information of the input video clip at different scales. After 4 upsampling and 4 cascade operations, the feature cube F_i of the input video clip v_i is obtained;
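A minimal sketch of the clip construction, assuming a sliding window with a stride of one frame (the patent states only that the 8-frame clips overlap and that the time span between frames is 1):

```python
import numpy as np

def split_into_clips(frames: np.ndarray, clip_len: int = 8, stride: int = 1):
    """Slice a video (T, H, W, C) into overlapping fixed-length clips v_1..v_n."""
    t = frames.shape[0]
    return [frames[s:s + clip_len] for s in range(0, t - clip_len + 1, stride)]

video = np.random.rand(32, 240, 320, 3).astype(np.float32)  # dummy video V_i
clips = split_into_clips(video)
print(len(clips), clips[0].shape)  # 25 (8, 240, 320, 3)
```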
S3, behavior target localization
After the feature cube F_i of the input video clip v_i is obtained, a segmentation branch is used to make pixel-level background / behavior-foreground predictions for each frame of the video clip v_i. The segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input video clip v_i, i.e. G_i = {g_1, g_2, ..., g_8}, a binary segmentation map of behavior foreground versus background. From the foreground pixels of the binary foreground segmentation map, a bounding box is inferred in each video frame, i.e. the behavior target in the video is localized;
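One straightforward way to infer the bounding box from the binary map, sketched under the assumption that the tightest box around all positive pixels is taken (the patent does not spell the rule out):

```python
import numpy as np

def box_from_mask(mask: np.ndarray):
    """Tightest bounding box around the behavior-foreground pixels of a frame."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None  # no behavior foreground in this frame
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())

mask = np.zeros((240, 320), dtype=np.uint8)
mask[100:140, 150:200] = 1
print(box_from_mask(mask))  # (150, 100, 199, 139) as (x1, y1, x2, y2)
```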
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of the feature cube F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch performs ToI (tube of interest) pooling on the input feature tube, followed by 3 fully connected operations, finally yielding a predicted behavior label, which is used to judge whether drowning behavior occurs in the swimming pool area;
S5, abnormal behavior detection in swimming pool video
Based on the method of steps S1-S4, by extracting the feature cube F_i of each video clip, various videos of different swimming pool areas are labeled with different behaviors, including drowning, standing and swimming, using the pixel-level labeling scheme: the pixels corresponding to a behavior are marked positive and the remaining background negative. An end-to-end 3D convolutional neural network is then trained to obtain the swimming pool abnormal behavior detection model. The real-time video stream of the swimming pool area is read; the swimming pool abnormal behavior detection model produces a segmentation map for each frame, so that the behavior positions of swimmers are localized and behavior labels are predicted; whether abnormal behavior such as drowning occurs in the swimming pool area is judged, and an alarm is raised if it does.
The 3D upsampling layers in step S2 of the invention recover the low-dimensional features obtained after max pooling at the encoder end, so as to obtain spatio-temporal features f_i of higher resolution. The 3D upsampling layers process the spatio-temporal feature f_i by sub-pixel convolution, specifically: the low-resolution spatio-temporal feature f_i is upsampled by factors p_d, p_h and p_w in depth, height and width respectively. Let P^{LR} denote the low-resolution feature map and P^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to:

$$P^{HR}(c, d, h, w) = \tilde{P}^{LR}(c', d', h', w')$$

where c ∈ {0, ..., C_H − 1}, d ∈ {0, ..., D_H − 1}, h ∈ {0, ..., H_H − 1}, w ∈ {0, ..., W_H − 1}; these variables denote the channel index, depth, height and width of the high-resolution feature respectively; \tilde{P}^{LR} denotes the low-resolution feature map after channel expansion; the indices c', d', h' and w' denote the feature channel, feature depth, feature height and feature width after expansion respectively, and are defined as:

$$c' = c \cdot p_d p_h p_w + (d \bmod p_d)\, p_h p_w + (h \bmod p_h)\, p_w + (w \bmod p_w), \qquad d' = \lfloor d / p_d \rfloor, \quad h' = \lfloor h / p_h \rfloor, \quad w' = \lfloor w / p_w \rfloor$$
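A minimal PyTorch sketch of this 3D sub-pixel convolution (pixel shuffle) rearrangement, assuming the standard channel ordering (the patent gives the exact index formulas only as equation images):

```python
import torch

def pixel_shuffle_3d(x: torch.Tensor, pd: int, ph: int, pw: int) -> torch.Tensor:
    """Rearrange an expanded-channel low-resolution map into a high-resolution map.

    Input : (N, C*pd*ph*pw, D, H, W)  -- the channel-expanded low-resolution map
    Output: (N, C, D*pd, H*ph, W*pw)  -- the high-resolution map P^HR
    """
    n, c_exp, d, h, w = x.shape
    c = c_exp // (pd * ph * pw)
    x = x.view(n, c, pd, ph, pw, d, h, w)
    # interleave each upsampling factor with its spatial axis: (n, c, d, pd, h, ph, w, pw)
    x = x.permute(0, 1, 5, 2, 6, 3, 7, 4).contiguous()
    return x.view(n, c, d * pd, h * ph, w * pw)

lr = torch.randn(1, 64 * 2 * 2 * 2, 4, 30, 40)  # low-resolution feature, channels expanded
hr = pixel_shuffle_3d(lr, 2, 2, 2)
print(hr.shape)  # torch.Size([1, 64, 8, 60, 80])
```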
Step S4 of the invention applies the ToI pooling operation to feature tubes of different sizes to obtain fixed-length feature vectors. The ToI pooling operation is as follows: let I_i be the i-th input activation of the ToI pooling layer and O_j the j-th output; the partial derivative of the loss function L with respect to each input variable I_i can be expressed as:

$$\frac{\partial L}{\partial I_i} = \sum_j \left[\, i = f(j) \,\right] \frac{\partial L}{\partial O_j}$$

Each pooled output O_j has a corresponding input position i, and the function f(j) selects the position of the maximum from the ToI.
Compared with the prior art, the invention designs a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network that detects abnormal behaviors in video with a bottom-up, end-to-end pixel-level segmentation method, needing no prior anchor boxes to search for candidate regions. It adopts pixel-level binary labeling, requires only a small number of samples, and saves the time consumed in labeling samples. The pixel-level behavior localization it provides is more accurate than bounding-box behavior localization and solves the problem of difficult regression convergence in bounding-box localization. The model has high detection performance, generalization capability and good market prospects.
Description of the drawings:
fig. 1 is a schematic diagram of a 3D convolutional neural network structure according to the present invention.
Fig. 2 is a schematic diagram of the segmentation branch structure of the present invention.
Fig. 3 is a schematic diagram of the recognition branch structure of the present invention.
Fig. 4 is a schematic block diagram of a process flow of the method for detecting drowning in a swimming pool based on an end-to-end 3D convolutional neural network according to the present invention.
Detailed description:
the invention is further illustrated by the following examples in conjunction with the accompanying drawings.
Example 1:
the embodiment relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, which comprises the following specific process steps:
S1, pixel-level labeling of training samples
The original surveillance video is given pixel-level binary labels: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input for subsequent network training. Compared with bounding-box labeling, the labeling method of this step is more accurate, fewer training samples need to be labeled, no prior anchor boxes are needed, and the selective search method is replaced, solving the problems of difficult sample labeling, inaccurate localization and difficult bounding-box regression convergence;
S2, feature cube extraction
Given a video dataset, a video V_i in the dataset is divided into n fixed-length video clips, i.e. V_i = {v_1, v_2, ..., v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between successive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max pooling layers, yielding the encoder-side spatio-temporal feature f_i of the clip v_i. Since 3D max pooling reduces the resolution of the spatio-temporal feature f_i, 3D upsampling layers are used at the decoder end to increase the resolution of f_i so that a pixel-level segmentation map of the original size can be generated for each frame. After each 3D upsampling layer, a cascade connection with the corresponding encoder-side spatio-temporal feature f_i is applied in order to obtain spatial and temporal information of the input video clip at different scales. After 4 upsampling and 4 cascade operations, the feature cube F_i of the input video clip v_i is obtained;
The specific structure of the 3D convolutional neural network of this embodiment is formed by interleaving the input video clip with 12 convolutional layers, 4 max pooling layers and 4 upsampling layers, finally yielding the feature cube of the input video clip for subsequent behavior target localization and behavior recognition; wherein:
conv1 to conv12 are convolutional layers, each with 3 × 3 × 3 kernels; the number of filters increases from 64 up to 512 and then decreases back to 64. The input video clip passes through 4 max pooling operations and 8 convolution operations, interleaved, to obtain the conv8-layer feature; the conv8-layer feature is then upsampled by the upsample1 layer; the upsample1-layer feature is concatenated with the conv9-layer feature and fed into the upsample2 upsampling layer; the upsample2-layer feature is concatenated with the conv10-layer feature and fed into the upsample3 upsampling layer; the upsample3-layer feature is concatenated with the conv11-layer feature and fed into the upsample4 upsampling layer; finally the upsample4-layer feature is concatenated with the conv1-layer feature as the finally extracted feature for target localization and classification. The role of the cascade connections is to fuse features carrying different temporal and spatial information, yielding features more useful for target localization and classification;
max-pool1 to max-pool4 are max pooling layers; max-pool1 has a 1 × 2 × 2 kernel and the other max pooling layers have 2 × 2 × 2 kernels; the number of channels increases from 64 to 512. The max pooling layers reduce the resolution of the feature maps, reduce the number of parameters, and yield feature maps with stronger semantic information;
upsample1 to upsample4 are upsampling layers, each with 3 × 3 × 3 kernels; the first 3 layers have 64 filters and the last has 48. Since max pooling reduces the resolution of the feature maps, upsampling is used to restore the resolution so that a pixel-level segmentation map of the original size can be generated for each frame of image;
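A runnable PyTorch sketch of one plausible wiring of this encoder-decoder backbone. The text does not fully specify what conv9 to conv11 take as input, so the sketch follows the common U-Net-style reading: trilinear upsampling with cascade concatenations to encoder features at matching scales, and conv9 to conv12 shrinking the channel count back toward 64. Kernel sizes and the pooling schedule follow the embodiment; everything else is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(cin, cout):
    # 3x3x3 convolution + ReLU; padding 1 keeps the spatio-temporal size
    return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class DrownNet(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: 8 convolutions interleaved with 4 max-pools (conv1..conv8)
        self.conv1 = block(3, 64)
        self.pool1 = nn.MaxPool3d((1, 2, 2))   # 1x2x2: keeps the temporal length
        self.conv2 = block(64, 128)
        self.pool2 = nn.MaxPool3d(2)           # 2x2x2
        self.conv3 = block(128, 256)
        self.conv4 = block(256, 256)
        self.pool3 = nn.MaxPool3d(2)
        self.conv5 = block(256, 512)
        self.conv6 = block(512, 512)
        self.pool4 = nn.MaxPool3d(2)
        self.conv7 = block(512, 512)
        self.conv8 = block(512, 512)
        # decoder: conv9..conv12 bring the channel count back down to 64
        self.conv9 = block(512 + 512, 256)
        self.conv10 = block(256 + 256, 128)
        self.conv11 = block(128 + 128, 64)
        self.conv12 = block(64 + 64, 64)

    @staticmethod
    def up(x, like):
        # trilinear 3D upsampling to the size of the skip feature
        return F.interpolate(x, size=like.shape[2:], mode="trilinear", align_corners=False)

    def forward(self, clip):                   # clip: (N, 3, 8, H, W)
        s1 = self.conv1(clip)
        s2 = self.conv2(self.pool1(s1))
        s3 = self.conv4(self.conv3(self.pool2(s2)))
        s4 = self.conv6(self.conv5(self.pool3(s3)))
        x = self.conv8(self.conv7(self.pool4(s4)))
        x = self.conv9(torch.cat([self.up(x, s4), s4], dim=1))   # cascade 1
        x = self.conv10(torch.cat([self.up(x, s3), s3], dim=1))  # cascade 2
        x = self.conv11(torch.cat([self.up(x, s2), s2], dim=1))  # cascade 3
        x = self.conv12(torch.cat([self.up(x, s1), s1], dim=1))  # cascade 4
        return x                               # feature cube F_i

clip = torch.randn(1, 3, 8, 112, 160)  # reduced size for the demo; the embodiment uses 240x320
print(DrownNet()(clip).shape)          # torch.Size([1, 64, 8, 112, 160])
```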
S3, behavior target localization
After the feature cube F_i of the input video clip v_i is obtained, a segmentation branch is used to make pixel-level background / behavior-foreground predictions for each frame of the video clip v_i. The segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input video clip v_i, i.e. G_i = {g_1, g_2, ..., g_8}, a binary segmentation map of behavior foreground versus background. From the foreground pixels of the binary foreground segmentation map, a bounding box is inferred in each video frame, i.e. the behavior target in the video is localized;
The segmentation branch of this embodiment specifically comprises the input feature cube and 2 convolutional layers; the kernels of both convolutional layers are 1 × 1, with 4096 and 2 filters respectively. The finally obtained feature map has size 2 × 8 × 240 × 320, where 8 denotes the 8 frames of input images, 240 × 320 the size of the input images, and 2 the per-pixel classification result, namely whether each pixel belongs to the foreground or the background. This feature map is the segmentation map of the original image and is used to localize the target and obtain its position information;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of the feature cube F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch performs ToI (tube of interest) pooling on the input feature tube, followed by 3 fully connected operations, finally yielding a predicted behavior label, which is used to judge whether drowning behavior occurs in the swimming pool area;
The recognition branch of this embodiment specifically comprises the input feature cube, a ToI pooling layer and 3 fully connected layers. The input feature cube is cropped from the feature cube obtained in step S2 according to the position information obtained in step S3. Since the localized targets differ in size, ToI pooling is applied to the input feature cube in order to obtain feature vectors of fixed length; 3 fully connected layers are then attached to obtain a one-dimensional feature vector for behavior label recognition, giving the final behavior target classification result, which is used to judge whether abnormal behaviors such as drowning occur in the swimming pool area;
S5, abnormal behavior detection in swimming pool video
Based on the method of steps S1-S4, by extracting the feature cubes of the video clips, various videos of different swimming pool areas are labeled with different behaviors, including drowning, standing and swimming, using the pixel-level labeling scheme: the pixels corresponding to a behavior are marked positive and the remaining background negative. An end-to-end 3D convolutional neural network is then trained to obtain the swimming pool abnormal behavior detection model. The real-time video stream of the swimming pool area is read; the swimming pool abnormal behavior detection model produces a segmentation map for each frame, so that the behavior positions of swimmers are localized and behavior labels are predicted; whether abnormal behavior such as drowning occurs in the swimming pool area is judged, and an alarm is raised if it does.
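A minimal sketch of this online loop, assuming a `model` callable that wraps steps S2-S4 and returns per-frame masks plus a behavior label for each 8-frame clip; the stream URL, label set and alarm hook are placeholders, not from the patent:

```python
import cv2  # OpenCV, for reading the camera stream

LABELS = {0: "swimming", 1: "standing", 2: "drowning"}  # hypothetical label ids

def raise_alarm(masks) -> None:
    print("ALARM: possible drowning detected")  # stand-in for the real alert path

def monitor(stream_url: str, model) -> None:
    cap = cv2.VideoCapture(stream_url)
    frames = []
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        if len(frames) == 8:                    # one clip v_i
            masks, label_id = model(frames)     # segmentation maps + behavior label
            if LABELS[label_id] == "drowning":
                raise_alarm(masks)
            frames = frames[1:]                 # slide the clip window by one frame
```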
The role of the 3D upsampling layers in step S2 of this embodiment is to recover the low-dimensional features obtained after max pooling at the encoder end, so as to obtain spatio-temporal features f_i of higher resolution. This step processes the spatio-temporal feature f_i by sub-pixel convolution: the low-resolution spatio-temporal feature f_i is upsampled by factors p_d, p_h and p_w in depth, height and width respectively. Let P^{LR} denote the low-resolution feature map and P^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to:

$$P^{HR}(c, d, h, w) = \tilde{P}^{LR}(c', d', h', w')$$

where c ∈ {0, ..., C_H − 1}, d ∈ {0, ..., D_H − 1}, h ∈ {0, ..., H_H − 1}, w ∈ {0, ..., W_H − 1}; these variables denote the channel index, depth, height and width of the high-resolution feature respectively; \tilde{P}^{LR} denotes the low-resolution feature map after channel expansion; the indices c', d', h' and w' denote the feature channel, feature depth, feature height and feature width after expansion respectively, and are defined as:

$$c' = c \cdot p_d p_h p_w + (d \bmod p_d)\, p_h p_w + (h \bmod p_h)\, p_w + (w \bmod p_w), \qquad d' = \lfloor d / p_d \rfloor, \quad h' = \lfloor h / p_h \rfloor, \quad w' = \lfloor w / p_w \rfloor$$
since the video is processed in segments, feature cubes F of various space-time sizes are generated for different segmentsi(ii) a In order to process the fixed-length feature vector, step S4 of this embodiment employs ToI pooling operations on different-sized feature tubes to obtain fixed-length feature vectors; since the size, aspect ratio and location of the bounding box (i.e., the targeted behavioral object) may be different, in order to use spatio-temporal pooling, in the spatial and temporal domainsRespectively realize pooling, and is provided with IiIs the ith activation of the TOI pooling layer, OjIs the jth output, each input variable IiThe partial derivative of the loss function L of (a) can be expressed as:
Figure BDA0002388624010000081
each pooled output OjThere is a corresponding input location i and the function f (j) represents the maximum selection from TOI.
The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of this embodiment detects abnormal behaviors in video with a bottom-up, end-to-end pixel-level segmentation method, needing no prior anchor boxes to search for candidate regions. It adopts pixel-level binary labeling, requires only a small number of samples, and saves the time consumed in labeling samples. The pixel-level behavior localization it provides is more accurate than bounding-box behavior localization and solves the problem of difficult regression convergence in bounding-box localization. The model has high detection performance and generalization capability.

Claims (6)

1. A swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, characterized in that the specific process steps are as follows:
S1, pixel-level labeling of training samples
A camera in the swimming pool area is turned on to capture surveillance video, and the original surveillance video is given pixel-level binary labels: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input for subsequent network training;
S2, feature cube extraction
Given a video dataset, a video V_i in the dataset is divided into n fixed-length video clips, i.e. V_i = {v_1, v_2, ..., v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between successive frames; each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max pooling layers, yielding the encoder-side spatio-temporal feature f_i of the clip v_i; to generate a pixel-level segmentation map at the original size for each frame, 3D upsampling layers are used at the decoder end to increase the resolution of the spatio-temporal feature f_i; after each 3D upsampling layer, a cascade connection with the corresponding encoder-side spatio-temporal feature f_i is applied in order to obtain spatial and temporal information of the input video clip at different scales; after 4 upsampling and 4 cascade operations, the feature cube F_i of the input video clip v_i is obtained;
S3, behavior target localization
After the feature cube F_i of the input video clip v_i is obtained, a segmentation branch is used to make pixel-level background / behavior-foreground predictions for each frame of the video clip v_i; the segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input video clip v_i, i.e. G_i = {g_1, g_2, ..., g_8}, a binary segmentation map of behavior foreground versus background; from the foreground pixels of the binary foreground segmentation map, a bounding box is inferred in each video frame, i.e. the behavior target in the video is localized;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of the feature cube F_i are cropped out as a feature tube and fed into the recognition branch; the recognition branch performs ToI (tube of interest) pooling on the input feature tube, followed by 3 fully connected operations, finally yielding a predicted behavior label, which is used to judge whether drowning behavior occurs in the swimming pool area;
S5, abnormal behavior detection in swimming pool video
Based on the method of steps S1-S4, by extracting the feature cube F_i of each video clip, various videos of different swimming pool areas are labeled with different behaviors using the pixel-level labeling scheme, the behaviors comprising drowning, standing and swimming, the pixels corresponding to a behavior being marked positive and the remaining background negative; an end-to-end 3D convolutional neural network is trained to obtain the swimming pool abnormal behavior detection model; the real-time video stream of the swimming pool area is read, the swimming pool abnormal behavior detection model produces a segmentation map for each frame, the behavior positions of swimmers are localized and behavior labels predicted, whether abnormal drowning behavior occurs in the swimming pool area is judged, and an alarm is raised if abnormal behavior such as drowning occurs.
2. The method of claim 1, characterized in that: the 3D upsampling layers in step S2 recover the low-dimensional features obtained after max pooling at the encoder end, so as to obtain spatio-temporal features f_i of higher resolution; the 3D upsampling layers process the spatio-temporal feature f_i by sub-pixel convolution, specifically: the low-resolution spatio-temporal feature f_i is upsampled by factors p_d, p_h and p_w in depth, height and width respectively; let P^{LR} denote the low-resolution feature map and P^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to:

$$P^{HR}(c, d, h, w) = \tilde{P}^{LR}(c', d', h', w')$$

where c ∈ {0, ..., C_H − 1}, d ∈ {0, ..., D_H − 1}, h ∈ {0, ..., H_H − 1}, w ∈ {0, ..., W_H − 1}; these variables denote the channel index, depth, height and width of the high-resolution feature respectively; \tilde{P}^{LR} denotes the low-resolution feature map after channel expansion; the indices c', d', h' and w' denote the feature channel, feature depth, feature height and feature width after expansion respectively, and are defined as:

$$c' = c \cdot p_d p_h p_w + (d \bmod p_d)\, p_h p_w + (h \bmod p_h)\, p_w + (w \bmod p_w), \qquad d' = \lfloor d / p_d \rfloor, \quad h' = \lfloor h / p_h \rfloor, \quad w' = \lfloor w / p_w \rfloor$$
3. The method of claim 2, characterized in that: step S4 applies the ToI pooling operation to feature tubes of different sizes to obtain fixed-length feature vectors; the ToI pooling operation is as follows: let I_i be the i-th input activation of the ToI pooling layer and O_j the j-th output; the partial derivative of the loss function L with respect to each input variable I_i can be expressed as:

$$\frac{\partial L}{\partial I_i} = \sum_j \left[\, i = f(j) \,\right] \frac{\partial L}{\partial O_j}$$

Each pooled output O_j has a corresponding input position i, and the function f(j) selects the position of the maximum from the ToI.
4. The method of claim 3, characterized in that: the specific structure of the 3D convolutional neural network is formed by interleaving the input video clip with 12 convolutional layers, 4 max pooling layers and 4 upsampling layers, finally yielding the feature cube of the input video clip for subsequent behavior target localization and behavior recognition; wherein:
conv1 to conv12 are convolutional layers, each with 3 × 3 × 3 kernels; the number of filters increases from 64 up to 512 and then decreases back to 64; the input video clip passes through 4 max pooling operations and 8 convolution operations, interleaved, to obtain the conv8-layer feature; the conv8-layer feature is then upsampled by the upsample1 layer; the upsample1-layer feature is concatenated with the conv9-layer feature and fed into the upsample2 upsampling layer; the upsample2-layer feature is concatenated with the conv10-layer feature and fed into the upsample3 upsampling layer; the upsample3-layer feature is concatenated with the conv11-layer feature and fed into the upsample4 upsampling layer; finally the upsample4-layer feature is concatenated with the conv1-layer feature as the finally extracted feature for target localization and classification; the role of the cascade connections is to fuse features carrying different temporal and spatial information, yielding features more useful for target localization and classification;
max-pool1 to max-pool4 are max pooling layers; max-pool1 has a 1 × 2 × 2 kernel and the other max pooling layers have 2 × 2 × 2 kernels; the number of channels increases from 64 to 512; the max pooling layers reduce the resolution of the feature maps, reduce the number of parameters, and yield feature maps with stronger semantic information;
upsample1 to upsample4 are upsampling layers, each with 3 × 3 × 3 kernels; the first 3 layers have 64 filters and the last has 48; since max pooling reduces the resolution of the feature maps, upsampling is used to restore the resolution so that a pixel-level segmentation map of the original size can be generated for each frame of image.
5. The method of claim 4, characterized in that: the segmentation branch specifically comprises the input feature cube and 2 convolutional layers; the kernels of both convolutional layers are 1 × 1, with 4096 and 2 filters respectively; the finally obtained feature map has size 2 × 8 × 240 × 320, where 8 denotes the 8 frames of input images, 240 × 320 the size of the input images, and 2 the per-pixel classification result, namely whether each pixel belongs to the foreground or the background; this feature map is the segmentation map of the original image and is used to localize the target and obtain its position information.
6. The method of claim 5, characterized in that: the recognition branch specifically comprises the input feature cube, a ToI pooling layer and 3 fully connected layers; the input feature cube is cropped from the feature cube obtained in step S2 according to the position information obtained in step S3; since the localized targets differ in size, ToI pooling is applied to the input feature cube in order to obtain feature vectors of fixed length, and 3 fully connected layers are then attached to obtain a one-dimensional feature vector for behavior label recognition, giving the final behavior target classification result, which is used to judge whether abnormal behaviors such as drowning occur in the swimming pool area.
CN202010106457.6A 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network Active CN111339892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010106457.6A 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010106457.6A 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Publications (2)

Publication Number Publication Date
CN111339892A (en) 2020-06-26
CN111339892B CN111339892B (en) 2023-04-18

Family

ID=71181844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010106457.6A Active CN111339892B (en) 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Country Status (1)

Country Link
CN (1) CN111339892B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180144477A1 (en) * 2016-06-15 2018-05-24 Beijing Sensetime Technology Development Co.,Ltd Methods and apparatuses, and computing devices for segmenting object
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
CN108009629A (en) * 2017-11-20 2018-05-08 天津大学 A kind of station symbol dividing method based on full convolution station symbol segmentation network
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613359A (en) * 2020-12-09 2021-04-06 苏州玖合智能科技有限公司 Method for constructing neural network for detecting abnormal behaviors of people
CN112613359B (en) * 2020-12-09 2024-02-02 苏州玖合智能科技有限公司 Construction method of neural network for detecting abnormal behaviors of personnel
CN113205008A (en) * 2021-04-16 2021-08-03 深圳供电局有限公司 Alarm control method of dynamic alarm window
CN113205008B (en) * 2021-04-16 2023-11-17 深圳供电局有限公司 Alarm control method for dynamic alarm window
CN114022910A (en) * 2022-01-10 2022-02-08 杭州巨岩欣成科技有限公司 Swimming pool drowning prevention supervision method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111339892B (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant