CN111339892A - Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Info

Publication number
CN111339892A
CN111339892A
Authority
CN
China
Prior art keywords
feature
layer
video
behavior
input
Prior art date
Legal status
Granted
Application number
CN202010106457.6A
Other languages
Chinese (zh)
Other versions
CN111339892B (en)
Inventor
纪刚
商胜楠
周萌萌
周亚敏
周粉粉
Current Assignee
Qingdao Lianhe Chuangzhi Technology Co ltd
Original Assignee
Qingdao Lianhe Chuangzhi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Lianhe Chuangzhi Technology Co ltd
Priority to CN202010106457.6A
Publication of CN111339892A
Application granted
Publication of CN111339892B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of video monitoring and relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, comprising: S1, performing pixel-level binary labeling on the original surveillance video; S2, feeding each video clip into the 3D convolutional neural network at the encoder end to obtain the feature cube F_i of the input video clip v_i; S3, using a segmentation branch to make pixel-level background / behavior-foreground predictions for each frame of the video clip v_i; S4, cropping the features at the corresponding positions of the feature cube F_i, feeding them into a recognition branch, and performing ToI pooling to obtain a predicted behavior label; S5, reading the real-time video stream of the swimming pool area, localizing the behavior positions of swimmers, predicting behavior labels, and judging whether abnormal behavior such as drowning occurs. The pixel-level binary labeling scheme of the method saves the time consumed in labeling samples, and the pixel-level behavior localization it provides is more accurate than bounding-box localization and avoids the problem of difficult bounding-box regression convergence.

Description

Swimming pool drowning detection method based on end-to-end 3D convolutional neural network
Technical field:
The invention belongs to the technical field of video monitoring, relates to a method for detecting drowning behavior from surveillance video, and particularly relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network.
Background art:
A convolutional neural network is a feedforward neural network that involves convolutional computation and has a deep structure, and it is one of the representative algorithms of deep learning. Convolutional neural networks have feature-learning ability and can perform shift-invariant classification of input information according to their hierarchical structure; a convolutional neural network comprises an input layer, hidden layers and an output layer.
The input layer of a convolutional neural network can process multidimensional data. The input layer of a one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array, where a one-dimensional array is usually a time-domain or spectrum sample and a two-dimensional array may contain multiple channels; the input layer of a two-dimensional convolutional neural network receives a two-dimensional or three-dimensional array; the input layer of a three-dimensional convolutional neural network receives a four-dimensional array. The hidden layers of a convolutional neural network comprise three common structures: convolutional layers, pooling layers and fully connected layers; some more modern architectures add complex structures such as Inception modules and residual blocks. A convolutional layer extracts features from the input data; it contains multiple convolution kernels, and each element of a kernel corresponds to a weight coefficient and a bias. After feature extraction in a convolutional layer, the output feature map is passed to a pooling layer for feature selection and information filtering. A pooling layer contains a preset pooling function whose role is to replace the result at a single point of the feature map with a statistic of its neighboring region. The fully connected layers of a convolutional neural network are equivalent to the hidden layers of a traditional feedforward neural network; they are located in the last part of the hidden layers and only pass signals to other fully connected layers. In the fully connected layers the feature map loses its spatial topology, is flattened into a vector and passed through an activation function. Since the fully connected layers usually sit immediately upstream of the output layer, their structure and working principle are the same as those of the output layer of a traditional feedforward neural network. For image classification problems, the output layer outputs classification labels using a logistic function or a normalized exponential (softmax) function.
In the prior art, Chinese patent publication No. CN108304806A discloses a gesture recognition method based on log-path-integral features and a convolutional neural network, comprising the steps of: labeling video data and training a hand detector based on Fast-RCNN; detecting the video samples frame by frame with the hand detector to obtain the hand position in each frame; constructing two-dimensional, three-dimensional and four-dimensional hand trajectories from the per-frame hand positions combined with time and depth; performing data enhancement on the hand trajectories; extracting the corresponding log-path-integral features from the enhanced trajectory samples; arranging the log-path-integral features according to spatial position information to construct a corresponding feature cube; and taking the feature cube as the input of a convolutional neural network that finally outputs the recognition result. Chinese patent publication No. CN108830157A discloses a human behavior recognition method based on an attention mechanism and a 3D convolutional neural network; the method constructs a 3D convolutional neural network whose input layer comprises two channels, namely the original grayscale map and an attention matrix. The method builds a 3D CNN model for recognizing human behaviors in video and introduces an attention mechanism: the distance between two frames is computed as an attention matrix, the attention matrix and the original human behavior video sequence form a two-channel input to the constructed 3D CNN, and convolution operations extract the key features of visually salient regions. The 3D CNN structure is also optimized: a Dropout layer is added to randomly freeze part of the network connection weights, and the ReLU activation function is used to improve network sparsity. At present, processing and analyzing surveillance video with convolutional neural networks requires a large number of training samples and prior anchor boxes, and, relying on the selective search method, suffers from difficult sample labeling, inaccurate localization and difficult bounding-box regression convergence.
Summary of the invention:
The invention aims to overcome the defects of the prior art: to address the large number of training samples required for processing and analyzing surveillance video, the need for prior anchor boxes, and the problems of difficult sample labeling, inaccurate localization and difficult bounding-box regression convergence, a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network is designed.
To achieve this purpose, the swimming pool drowning detection method based on an end-to-end 3D convolutional neural network according to the invention comprises the following specific process steps:
S1, pixel-level labeling of training samples
A camera in the swimming pool area is turned on to capture surveillance video, and the original surveillance video is given pixel-level binary labels: foreground pixels are marked positive and background pixels negative. The labeled surveillance video is used as input for subsequent network training;
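As an illustration only, a pixel-level binary label for one frame can be stored as a 0/1 mask of the same size as the frame; the sketch below is a minimal example (the frame size and file name are hypothetical, not from the patent):

```python
import numpy as np

# Pixel-level binary label for one 240x320 frame: behavior-foreground pixels
# (the swimmer) are marked 1 (positive), background pixels 0 (negative).
mask = np.zeros((240, 320), dtype=np.uint8)

# Suppose the annotated swimmer region for this frame is the block below;
# in practice it would come from a polygon or brush annotation tool.
mask[100:140, 150:200] = 1

np.save("frame_0001_label.npy", mask)  # stored alongside the video frame
```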
S2, feature cube extraction
Given a video dataset, a video V_i in the dataset is divided into n fixed-length video clips, i.e. V_i = {v_1, v_2, ..., v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between successive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max pooling layers, yielding the encoder-side spatio-temporal feature f_i of the clip v_i. To generate a pixel-level segmentation map at the original size for each frame, 3D upsampling layers are used at the decoder end to increase the resolution of the spatio-temporal feature f_i. After each 3D upsampling layer, a cascade connection (concatenation) with the corresponding encoder-side spatio-temporal feature f_i is applied in order to obtain spatial and temporal information of the input video clip at different scales. After 4 upsampling and 4 cascade operations, the feature cube F_i of the input video clip v_i is obtained;
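A minimal sketch of the clip construction, assuming a sliding window with a stride of one frame (the patent states only that the 8-frame clips overlap and that the time span between frames is 1):

```python
import numpy as np

def split_into_clips(frames: np.ndarray, clip_len: int = 8, stride: int = 1):
    """Slice a video (T, H, W, C) into overlapping fixed-length clips v_1..v_n."""
    t = frames.shape[0]
    return [frames[s:s + clip_len] for s in range(0, t - clip_len + 1, stride)]

video = np.random.rand(32, 240, 320, 3).astype(np.float32)  # dummy video V_i
clips = split_into_clips(video)
print(len(clips), clips[0].shape)  # 25 (8, 240, 320, 3)
```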
S3, behavior target localization
After the feature cube F_i of the input video clip v_i is obtained, a segmentation branch is used to make pixel-level background / behavior-foreground predictions for each frame of the video clip v_i. The segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input video clip v_i, i.e. G_i = {g_1, g_2, ..., g_8}, a binary segmentation map of behavior foreground versus background. From the foreground pixels of the binary foreground segmentation map, a bounding box is inferred in each video frame, i.e. the behavior target in the video is localized;
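One straightforward way to infer the bounding box from the binary map, sketched under the assumption that the tightest box around all positive pixels is taken (the patent does not spell the rule out):

```python
import numpy as np

def box_from_mask(mask: np.ndarray):
    """Tightest bounding box around the behavior-foreground pixels of a frame."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None  # no behavior foreground in this frame
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())

mask = np.zeros((240, 320), dtype=np.uint8)
mask[100:140, 150:200] = 1
print(box_from_mask(mask))  # (150, 100, 199, 139) as (x1, y1, x2, y2)
```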
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of the feature cube F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch performs ToI (tube of interest) pooling on the input feature tube, followed by 3 fully connected operations, finally yielding a predicted behavior label, which is used to judge whether drowning behavior occurs in the swimming pool area;
S5, abnormal behavior detection in swimming pool video
Based on the method of steps S1-S4, by extracting the feature cube F_i of each video clip, various videos of different swimming pool areas are labeled with different behaviors, including drowning, standing and swimming, using the pixel-level labeling scheme: the pixels corresponding to a behavior are marked positive and the remaining background negative. An end-to-end 3D convolutional neural network is then trained to obtain the swimming pool abnormal behavior detection model. The real-time video stream of the swimming pool area is read; the swimming pool abnormal behavior detection model produces a segmentation map for each frame, so that the behavior positions of swimmers are localized and behavior labels are predicted; whether abnormal behavior such as drowning occurs in the swimming pool area is judged, and an alarm is raised if it does.
The 3D upsampling layers in step S2 of the invention recover the low-dimensional features obtained after max pooling at the encoder end, so as to obtain spatio-temporal features f_i of higher resolution. The 3D upsampling layers process the spatio-temporal feature f_i by sub-pixel convolution, specifically: the low-resolution spatio-temporal feature f_i is upsampled by factors p_d, p_h and p_w in depth, height and width respectively. Let P^{LR} denote the low-resolution feature map and P^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to:

$$P^{HR}(c, d, h, w) = \tilde{P}^{LR}(c', d', h', w')$$

where c ∈ {0, ..., C_H − 1}, d ∈ {0, ..., D_H − 1}, h ∈ {0, ..., H_H − 1}, w ∈ {0, ..., W_H − 1}; these variables denote the channel index, depth, height and width of the high-resolution feature respectively; \tilde{P}^{LR} denotes the low-resolution feature map after channel expansion; the indices c', d', h' and w' denote the feature channel, feature depth, feature height and feature width after expansion respectively, and are defined as:

$$c' = c \cdot p_d p_h p_w + (d \bmod p_d)\, p_h p_w + (h \bmod p_h)\, p_w + (w \bmod p_w), \qquad d' = \lfloor d / p_d \rfloor, \quad h' = \lfloor h / p_h \rfloor, \quad w' = \lfloor w / p_w \rfloor$$
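A minimal PyTorch sketch of this 3D sub-pixel convolution (pixel shuffle) rearrangement, assuming the standard channel ordering (the patent gives the exact index formulas only as equation images):

```python
import torch

def pixel_shuffle_3d(x: torch.Tensor, pd: int, ph: int, pw: int) -> torch.Tensor:
    """Rearrange an expanded-channel low-resolution map into a high-resolution map.

    Input : (N, C*pd*ph*pw, D, H, W)  -- the channel-expanded low-resolution map
    Output: (N, C, D*pd, H*ph, W*pw)  -- the high-resolution map P^HR
    """
    n, c_exp, d, h, w = x.shape
    c = c_exp // (pd * ph * pw)
    x = x.view(n, c, pd, ph, pw, d, h, w)
    # interleave each upsampling factor with its spatial axis: (n, c, d, pd, h, ph, w, pw)
    x = x.permute(0, 1, 5, 2, 6, 3, 7, 4).contiguous()
    return x.view(n, c, d * pd, h * ph, w * pw)

lr = torch.randn(1, 64 * 2 * 2 * 2, 4, 30, 40)  # low-resolution feature, channels expanded
hr = pixel_shuffle_3d(lr, 2, 2, 2)
print(hr.shape)  # torch.Size([1, 64, 8, 60, 80])
```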
Step S4 of the invention applies the ToI pooling operation to feature tubes of different sizes to obtain fixed-length feature vectors. The ToI pooling operation is as follows: let I_i be the i-th input activation of the ToI pooling layer and O_j the j-th output; the partial derivative of the loss function L with respect to each input variable I_i can be expressed as:

$$\frac{\partial L}{\partial I_i} = \sum_j \left[\, i = f(j) \,\right] \frac{\partial L}{\partial O_j}$$

Each pooled output O_j has a corresponding input position i, and the function f(j) selects the position of the maximum from the ToI.
Compared with the prior art, the invention designs a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network that detects abnormal behaviors in video with a bottom-up, end-to-end pixel-level segmentation method, needing no prior anchor boxes to search for candidate regions. It adopts pixel-level binary labeling, requires only a small number of samples, and saves the time consumed in labeling samples. The pixel-level behavior localization it provides is more accurate than bounding-box behavior localization and solves the problem of difficult regression convergence in bounding-box localization. The model has high detection performance, generalization capability and good market prospects.
Description of the drawings:
fig. 1 is a schematic diagram of a 3D convolutional neural network structure according to the present invention.
Fig. 2 is a schematic diagram of the segmentation branch structure of the present invention.
Fig. 3 is a schematic diagram of the recognition branch structure of the present invention.
Fig. 4 is a schematic block diagram of a process flow of the method for detecting drowning in a swimming pool based on an end-to-end 3D convolutional neural network according to the present invention.
Detailed description:
the invention is further illustrated by the following examples in conjunction with the accompanying drawings.
Example 1:
the embodiment relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, which comprises the following specific process steps:
S1, pixel-level labeling of training samples
The original surveillance video is given pixel-level binary labels: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input for subsequent network training. Compared with bounding-box labeling, the labeling method of this step is more accurate, fewer training samples need to be labeled, no prior anchor boxes are needed, and the selective search method is replaced, solving the problems of difficult sample labeling, inaccurate localization and difficult bounding-box regression convergence;
S2, feature cube extraction
Given a video dataset, a video V_i in the dataset is divided into n fixed-length video clips, i.e. V_i = {v_1, v_2, ..., v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between successive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max pooling layers, yielding the encoder-side spatio-temporal feature f_i of the clip v_i. Since 3D max pooling reduces the resolution of the spatio-temporal feature f_i, 3D upsampling layers are used at the decoder end to increase the resolution of f_i so that a pixel-level segmentation map of the original size can be generated for each frame. After each 3D upsampling layer, a cascade connection with the corresponding encoder-side spatio-temporal feature f_i is applied in order to obtain spatial and temporal information of the input video clip at different scales. After 4 upsampling and 4 cascade operations, the feature cube F_i of the input video clip v_i is obtained;
The specific structure of the 3D convolutional neural network of this embodiment is formed by interleaving the input video clip with 12 convolutional layers, 4 max pooling layers and 4 upsampling layers, finally yielding the feature cube of the input video clip for subsequent behavior target localization and behavior recognition; wherein:
conv1 to conv12 are convolutional layers, each with 3 × 3 × 3 kernels; the number of filters increases from 64 up to 512 and then decreases back to 64. The input video clip passes through 4 max pooling operations and 8 convolution operations, interleaved, to obtain the conv8-layer feature; the conv8-layer feature is then upsampled by the upsample1 layer; the upsample1-layer feature is concatenated with the conv9-layer feature and fed into the upsample2 upsampling layer; the upsample2-layer feature is concatenated with the conv10-layer feature and fed into the upsample3 upsampling layer; the upsample3-layer feature is concatenated with the conv11-layer feature and fed into the upsample4 upsampling layer; finally the upsample4-layer feature is concatenated with the conv1-layer feature as the finally extracted feature for target localization and classification. The role of the cascade connections is to fuse features carrying different temporal and spatial information, yielding features more useful for target localization and classification;
max-pool1 to max-pool4 are max pooling layers; max-pool1 has a 1 × 2 × 2 kernel and the other max pooling layers have 2 × 2 × 2 kernels; the number of channels increases from 64 to 512. The max pooling layers reduce the resolution of the feature maps, reduce the number of parameters, and yield feature maps with stronger semantic information;
upsample1 to upsample4 are upsampling layers, each with 3 × 3 × 3 kernels; the first 3 layers have 64 filters and the last has 48. Since max pooling reduces the resolution of the feature maps, upsampling is used to restore the resolution so that a pixel-level segmentation map of the original size can be generated for each frame of image;
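A runnable PyTorch sketch of one plausible wiring of this encoder-decoder backbone. The text does not fully specify what conv9 to conv11 take as input, so the sketch follows the common U-Net-style reading: trilinear upsampling with cascade concatenations to encoder features at matching scales, and conv9 to conv12 shrinking the channel count back toward 64. Kernel sizes and the pooling schedule follow the embodiment; everything else is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(cin, cout):
    # 3x3x3 convolution + ReLU; padding 1 keeps the spatio-temporal size
    return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class DrownNet(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: 8 convolutions interleaved with 4 max-pools (conv1..conv8)
        self.conv1 = block(3, 64)
        self.pool1 = nn.MaxPool3d((1, 2, 2))   # 1x2x2: keeps the temporal length
        self.conv2 = block(64, 128)
        self.pool2 = nn.MaxPool3d(2)           # 2x2x2
        self.conv3 = block(128, 256)
        self.conv4 = block(256, 256)
        self.pool3 = nn.MaxPool3d(2)
        self.conv5 = block(256, 512)
        self.conv6 = block(512, 512)
        self.pool4 = nn.MaxPool3d(2)
        self.conv7 = block(512, 512)
        self.conv8 = block(512, 512)
        # decoder: conv9..conv12 bring the channel count back down to 64
        self.conv9 = block(512 + 512, 256)
        self.conv10 = block(256 + 256, 128)
        self.conv11 = block(128 + 128, 64)
        self.conv12 = block(64 + 64, 64)

    @staticmethod
    def up(x, like):
        # trilinear 3D upsampling to the size of the skip feature
        return F.interpolate(x, size=like.shape[2:], mode="trilinear", align_corners=False)

    def forward(self, clip):                   # clip: (N, 3, 8, H, W)
        s1 = self.conv1(clip)
        s2 = self.conv2(self.pool1(s1))
        s3 = self.conv4(self.conv3(self.pool2(s2)))
        s4 = self.conv6(self.conv5(self.pool3(s3)))
        x = self.conv8(self.conv7(self.pool4(s4)))
        x = self.conv9(torch.cat([self.up(x, s4), s4], dim=1))   # cascade 1
        x = self.conv10(torch.cat([self.up(x, s3), s3], dim=1))  # cascade 2
        x = self.conv11(torch.cat([self.up(x, s2), s2], dim=1))  # cascade 3
        x = self.conv12(torch.cat([self.up(x, s1), s1], dim=1))  # cascade 4
        return x                               # feature cube F_i

clip = torch.randn(1, 3, 8, 112, 160)  # reduced size for the demo; the embodiment uses 240x320
print(DrownNet()(clip).shape)          # torch.Size([1, 64, 8, 112, 160])
```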
S3, behavior target localization
After the feature cube F_i of the input video clip v_i is obtained, a segmentation branch is used to make pixel-level background / behavior-foreground predictions for each frame of the video clip v_i. The segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input video clip v_i, i.e. G_i = {g_1, g_2, ..., g_8}, a binary segmentation map of behavior foreground versus background. From the foreground pixels of the binary foreground segmentation map, a bounding box is inferred in each video frame, i.e. the behavior target in the video is localized;
The segmentation branch of this embodiment specifically comprises the input feature cube and 2 convolutional layers; the kernels of both convolutional layers are 1 × 1, with 4096 and 2 filters respectively. The finally obtained feature map has size 2 × 8 × 240 × 320, where 8 denotes the 8 frames of input images, 240 × 320 the size of the input images, and 2 the per-pixel classification result, namely whether each pixel belongs to the foreground or the background. This feature map is the segmentation map of the original image and is used to localize the target and obtain its position information;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of the feature cube F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch performs ToI (tube of interest) pooling on the input feature tube, followed by 3 fully connected operations, finally yielding a predicted behavior label, which is used to judge whether drowning behavior occurs in the swimming pool area;
The recognition branch of this embodiment specifically comprises the input feature cube, a ToI pooling layer and 3 fully connected layers. The input feature cube is cropped from the feature cube obtained in step S2 according to the position information obtained in step S3. Since the localized targets differ in size, ToI pooling is applied to the input feature cube in order to obtain feature vectors of fixed length; 3 fully connected layers are then attached to obtain a one-dimensional feature vector for behavior label recognition, giving the final behavior target classification result, which is used to judge whether abnormal behaviors such as drowning occur in the swimming pool area;
S5, abnormal behavior detection in swimming pool video
Based on the method of steps S1-S4, by extracting the feature cubes of the video clips, various videos of different swimming pool areas are labeled with different behaviors, including drowning, standing and swimming, using the pixel-level labeling scheme: the pixels corresponding to a behavior are marked positive and the remaining background negative. An end-to-end 3D convolutional neural network is then trained to obtain the swimming pool abnormal behavior detection model. The real-time video stream of the swimming pool area is read; the swimming pool abnormal behavior detection model produces a segmentation map for each frame, so that the behavior positions of swimmers are localized and behavior labels are predicted; whether abnormal behavior such as drowning occurs in the swimming pool area is judged, and an alarm is raised if it does.
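A minimal sketch of this online loop, assuming a `model` callable that wraps steps S2-S4 and returns per-frame masks plus a behavior label for each 8-frame clip; the stream URL, label set and alarm hook are placeholders, not from the patent:

```python
import cv2  # OpenCV, for reading the camera stream

LABELS = {0: "swimming", 1: "standing", 2: "drowning"}  # hypothetical label ids

def raise_alarm(masks) -> None:
    print("ALARM: possible drowning detected")  # stand-in for the real alert path

def monitor(stream_url: str, model) -> None:
    cap = cv2.VideoCapture(stream_url)
    frames = []
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        if len(frames) == 8:                    # one clip v_i
            masks, label_id = model(frames)     # segmentation maps + behavior label
            if LABELS[label_id] == "drowning":
                raise_alarm(masks)
            frames = frames[1:]                 # slide the clip window by one frame
```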
The role of the 3D upsampling layers in step S2 of this embodiment is to recover the low-dimensional features obtained after max pooling at the encoder end, so as to obtain spatio-temporal features f_i of higher resolution. This step processes the spatio-temporal feature f_i by sub-pixel convolution: the low-resolution spatio-temporal feature f_i is upsampled by factors p_d, p_h and p_w in depth, height and width respectively. Let P^{LR} denote the low-resolution feature map and P^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to:

$$P^{HR}(c, d, h, w) = \tilde{P}^{LR}(c', d', h', w')$$

where c ∈ {0, ..., C_H − 1}, d ∈ {0, ..., D_H − 1}, h ∈ {0, ..., H_H − 1}, w ∈ {0, ..., W_H − 1}; these variables denote the channel index, depth, height and width of the high-resolution feature respectively; \tilde{P}^{LR} denotes the low-resolution feature map after channel expansion; the indices c', d', h' and w' denote the feature channel, feature depth, feature height and feature width after expansion respectively, and are defined as:

$$c' = c \cdot p_d p_h p_w + (d \bmod p_d)\, p_h p_w + (h \bmod p_h)\, p_w + (w \bmod p_w), \qquad d' = \lfloor d / p_d \rfloor, \quad h' = \lfloor h / p_h \rfloor, \quad w' = \lfloor w / p_w \rfloor$$
since the video is processed in segments, feature cubes F of various space-time sizes are generated for different segmentsi(ii) a In order to process the fixed-length feature vector, step S4 of this embodiment employs ToI pooling operations on different-sized feature tubes to obtain fixed-length feature vectors; since the size, aspect ratio and location of the bounding box (i.e., the targeted behavioral object) may be different, in order to use spatio-temporal pooling, in the spatial and temporal domainsRespectively realize pooling, and is provided with IiIs the ith activation of the TOI pooling layer, OjIs the jth output, each input variable IiThe partial derivative of the loss function L of (a) can be expressed as:
Figure BDA0002388624010000081
each pooled output OjThere is a corresponding input location i and the function f (j) represents the maximum selection from TOI.
The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of this embodiment detects abnormal behaviors in video with a bottom-up, end-to-end pixel-level segmentation method, needing no prior anchor boxes to search for candidate regions. It adopts pixel-level binary labeling, requires only a small number of samples, and saves the time consumed in labeling samples. The pixel-level behavior localization it provides is more accurate than bounding-box behavior localization and solves the problem of difficult regression convergence in bounding-box localization. The model has high detection performance and generalization capability.

Claims (6)

1. A swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, characterized in that the specific process steps are as follows:
S1, pixel-level labeling of training samples
A camera in the swimming pool area is turned on to capture surveillance video, and the original surveillance video is given pixel-level binary labels: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input for subsequent network training;
S2, feature cube extraction
Given a video dataset, a video V_i in the dataset is divided into n fixed-length video clips, i.e. V_i = {v_1, v_2, ..., v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between successive frames; each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max pooling layers, yielding the encoder-side spatio-temporal feature f_i of the clip v_i; to generate a pixel-level segmentation map at the original size for each frame, 3D upsampling layers are used at the decoder end to increase the resolution of the spatio-temporal feature f_i; after each 3D upsampling layer, a cascade connection with the corresponding encoder-side spatio-temporal feature f_i is applied in order to obtain spatial and temporal information of the input video clip at different scales; after 4 upsampling and 4 cascade operations, the feature cube F_i of the input video clip v_i is obtained;
S3, behavior target localization
After the feature cube F_i of the input video clip v_i is obtained, a segmentation branch is used to make pixel-level background / behavior-foreground predictions for each frame of the video clip v_i; the segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input video clip v_i, i.e. G_i = {g_1, g_2, ..., g_8}, a binary segmentation map of behavior foreground versus background; from the foreground pixels of the binary foreground segmentation map, a bounding box is inferred in each video frame, i.e. the behavior target in the video is localized;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of the feature cube F_i are cropped out as a feature tube and fed into the recognition branch; the recognition branch performs ToI (tube of interest) pooling on the input feature tube, followed by 3 fully connected operations, finally yielding a predicted behavior label, which is used to judge whether drowning behavior occurs in the swimming pool area;
S5, abnormal behavior detection in swimming pool video
Based on the method of steps S1-S4, by extracting the feature cube F_i of each video clip, various videos of different swimming pool areas are labeled with different behaviors using the pixel-level labeling scheme, the behaviors comprising drowning, standing and swimming, the pixels corresponding to a behavior being marked positive and the remaining background negative; an end-to-end 3D convolutional neural network is trained to obtain the swimming pool abnormal behavior detection model; the real-time video stream of the swimming pool area is read, the swimming pool abnormal behavior detection model produces a segmentation map for each frame, the behavior positions of swimmers are localized and behavior labels predicted, whether abnormal drowning behavior occurs in the swimming pool area is judged, and an alarm is raised if abnormal behavior such as drowning occurs.
2. The method of claim 1, characterized in that: the 3D upsampling layers in step S2 recover the low-dimensional features obtained after max pooling at the encoder end, so as to obtain spatio-temporal features f_i of higher resolution; the 3D upsampling layers process the spatio-temporal feature f_i by sub-pixel convolution, specifically: the low-resolution spatio-temporal feature f_i is upsampled by factors p_d, p_h and p_w in depth, height and width respectively; let P^{LR} denote the low-resolution feature map and P^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to:

$$P^{HR}(c, d, h, w) = \tilde{P}^{LR}(c', d', h', w')$$

where c ∈ {0, ..., C_H − 1}, d ∈ {0, ..., D_H − 1}, h ∈ {0, ..., H_H − 1}, w ∈ {0, ..., W_H − 1}; these variables denote the channel index, depth, height and width of the high-resolution feature respectively; \tilde{P}^{LR} denotes the low-resolution feature map after channel expansion; the indices c', d', h' and w' denote the feature channel, feature depth, feature height and feature width after expansion respectively, and are defined as:

$$c' = c \cdot p_d p_h p_w + (d \bmod p_d)\, p_h p_w + (h \bmod p_h)\, p_w + (w \bmod p_w), \qquad d' = \lfloor d / p_d \rfloor, \quad h' = \lfloor h / p_h \rfloor, \quad w' = \lfloor w / p_w \rfloor$$
3. The method of claim 2, characterized in that: step S4 applies the ToI pooling operation to feature tubes of different sizes to obtain fixed-length feature vectors; the ToI pooling operation is as follows: let I_i be the i-th input activation of the ToI pooling layer and O_j the j-th output; the partial derivative of the loss function L with respect to each input variable I_i can be expressed as:

$$\frac{\partial L}{\partial I_i} = \sum_j \left[\, i = f(j) \,\right] \frac{\partial L}{\partial O_j}$$

Each pooled output O_j has a corresponding input position i, and the function f(j) selects the position of the maximum from the ToI.
4. The method of claim 3, characterized in that: the specific structure of the 3D convolutional neural network is formed by interleaving the input video clip with 12 convolutional layers, 4 max pooling layers and 4 upsampling layers, finally yielding the feature cube of the input video clip for subsequent behavior target localization and behavior recognition; wherein:
conv1 to conv12 are convolutional layers, each with 3 × 3 × 3 kernels; the number of filters increases from 64 up to 512 and then decreases back to 64; the input video clip passes through 4 max pooling operations and 8 convolution operations, interleaved, to obtain the conv8-layer feature; the conv8-layer feature is then upsampled by the upsample1 layer; the upsample1-layer feature is concatenated with the conv9-layer feature and fed into the upsample2 upsampling layer; the upsample2-layer feature is concatenated with the conv10-layer feature and fed into the upsample3 upsampling layer; the upsample3-layer feature is concatenated with the conv11-layer feature and fed into the upsample4 upsampling layer; finally the upsample4-layer feature is concatenated with the conv1-layer feature as the finally extracted feature for target localization and classification; the role of the cascade connections is to fuse features carrying different temporal and spatial information, yielding features more useful for target localization and classification;
max-pool1 to max-pool4 are max pooling layers; max-pool1 has a 1 × 2 × 2 kernel and the other max pooling layers have 2 × 2 × 2 kernels; the number of channels increases from 64 to 512; the max pooling layers reduce the resolution of the feature maps, reduce the number of parameters, and yield feature maps with stronger semantic information;
upsample1 to upsample4 are upsampling layers, each with 3 × 3 × 3 kernels; the first 3 layers have 64 filters and the last has 48; since max pooling reduces the resolution of the feature maps, upsampling is used to restore the resolution so that a pixel-level segmentation map of the original size can be generated for each frame of image.
5. The method of claim 4, characterized in that: the segmentation branch specifically comprises the input feature cube and 2 convolutional layers; the kernels of both convolutional layers are 1 × 1, with 4096 and 2 filters respectively; the finally obtained feature map has size 2 × 8 × 240 × 320, where 8 denotes the 8 frames of input images, 240 × 320 the size of the input images, and 2 the per-pixel classification result, namely whether each pixel belongs to the foreground or the background; this feature map is the segmentation map of the original image and is used to localize the target and obtain its position information.
6. The method of claim 5, characterized in that: the recognition branch specifically comprises the input feature cube, a ToI pooling layer and 3 fully connected layers; the input feature cube is cropped from the feature cube obtained in step S2 according to the position information obtained in step S3; since the localized targets differ in size, ToI pooling is applied to the input feature cube in order to obtain feature vectors of fixed length, and 3 fully connected layers are then attached to obtain a one-dimensional feature vector for behavior label recognition, giving the final behavior target classification result, which is used to judge whether abnormal behaviors such as drowning occur in the swimming pool area.
CN202010106457.6A 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network Active CN111339892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010106457.6A 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010106457.6A 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Publications (2)

Publication Number Publication Date
CN111339892A (en) 2020-06-26
CN111339892B CN111339892B (en) 2023-04-18

Family

ID=71181844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010106457.6A Active CN111339892B (en) 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Country Status (1)

Country Link
CN (1) CN111339892B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180144477A1 (en) * 2016-06-15 2018-05-24 Beijing Sensetime Technology Development Co.,Ltd Methods and apparatuses, and computing devices for segmenting object
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
CN108009629A (en) * 2017-11-20 2018-05-08 天津大学 A kind of station symbol dividing method based on full convolution station symbol segmentation network
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613359A (en) * 2020-12-09 2021-04-06 苏州玖合智能科技有限公司 Method for constructing neural network for detecting abnormal behaviors of people
CN112613359B (en) * 2020-12-09 2024-02-02 苏州玖合智能科技有限公司 Construction method of neural network for detecting abnormal behaviors of personnel
CN113205008A (en) * 2021-04-16 2021-08-03 深圳供电局有限公司 Alarm control method of dynamic alarm window
CN113205008B (en) * 2021-04-16 2023-11-17 深圳供电局有限公司 Alarm control method for dynamic alarm window
CN114022910A (en) * 2022-01-10 2022-02-08 杭州巨岩欣成科技有限公司 Swimming pool drowning prevention supervision method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111339892B (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant