CN111611847B - Video motion detection method based on scale attention hole convolution network - Google Patents
- Publication number
- CN111611847B (application CN202010252104.7A)
- Authority
- CN
- China
- Prior art keywords
- video
- frame
- action
- motion
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Abstract
The invention discloses a video motion detection method based on a scale attention hole (dilated) convolution network. The method first samples a video to obtain a frame image sequence and obtains video segments according to segment position marks; it then constructs a layer scale attention action segment model and a frame position attention action recognition model, and, combined with a watershed algorithm, sequentially obtains the weighted feature representation of the frame images and the action category of each video segment, completing the video motion detection task. The method uses the hole convolution network to extract spatio-temporal motion information that better reflects the intrinsic temporal and spatial structure of the video data; the layer scale attention describes more appropriately, as the scale changes, the internal association of the temporal context of each video frame; and the designed frame position attention mechanism assigns the video frames of an action segment weights that more accurately represent its key content, thereby improving both the precision and the efficiency of video motion detection.
Description
Technical Field
The invention belongs to the technical field of video analysis, in particular to the technical field of temporal action detection, and relates to a video motion detection method based on a scale attention hole convolution network.
Background
Understanding of human action videos plays an important role in many fields such as security monitoring and behavior analysis, and has become a leading research subject in computer vision. However, untrimmed real-world video often contains background segments unrelated to human actions, which interferes with correct understanding of the video content. To address this problem, video motion detection methods not only classify the actions within a video but also locate the start and end times of each action instance occurring in it. A video motion detection task generally takes a video frame sequence as input and outputs detection results for multiple segments in the form "action category - start frame - end frame"; its processing can be divided into two stages: action segment generation and action segment recognition. The former outputs the start frame and end frame of each segment, and the latter outputs the action category of each segment. Video motion detection helps to better understand video content and supports tasks such as video summarization, action recognition, content annotation, and event capture. For example, in a video summarization task, key segments obtained by video motion detection allow the key frames or segments that best reflect the video content to be accurately located, improving the quality of the summary.
Video motion detection processes video frame images and must describe the temporal relation among frames, which involves high-dimensional tensor computation. Traditional machine learning methods adopt manually extracted features, such as trajectory features, whose extraction efficiency cannot meet real-time requirements; moreover, the feature extraction process is separated from model training, so the generalization performance of the model is weak. In recent years, convolutional neural networks (CNN) trained end to end have developed rapidly and can compensate for the shortcomings of the traditional methods. For example: feature extraction of temporal action information based on three-dimensional convolutional neural networks or optical flow field information is more efficient; candidate segment generation schemes based on deep reinforcement learning can adaptively complete the action segment generation task end to end; and temporal action localization networks propose a multi-scale parallel action segment generation structure to handle action segments of different lengths, substantially advancing the state of the art in this field.
Existing video motion detection methods mainly have the following shortcomings. First, in the feature extraction stage, the network models built with three-dimensional convolution operations for extracting temporal action features reduce the temporal dimension of the input video layer by layer in a fixed manner, constraining the temporal size of the extracted features: too small a scale splits the context semantics apart, while too large a scale introduces interference from different semantics. Second, in the action segment generation stage, for actions of different durations the key evidence of whether an action occurs and of its category — namely the positions and duration of the key frames (e.g., runs of consecutive key frames) — often differs, yet the conventional average pooling operation ignores the weighting of these key positions. Third, existing methods extract feature representations of action segments with different network structures for segments of different sizes, which greatly increases the time and space costs of network construction and training. It is therefore desirable to design a method that improves video motion detection performance while saving computation and storage overhead.
Disclosure of Invention
The invention aims to provide a video motion detection method based on a scale attention hole convolutional network, aiming at the defects of the prior art, and the method can be used for capturing the space-time motion information of video data by combining the hole convolutional network and accurately depicting the time sequence context relationship of video frames through scale attention, thereby effectively detecting motion segments in a video and accurately judging the category of the motion segments.
The method firstly acquires a video data set, and then performs the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark;
step (2), constructing a layer scale attention action fragment model, inputting a frame image sequence of a complete video, and outputting a weighted feature representation of the complete video frame image and the probability of whether each frame is an action frame;
step (3), constructing a frame position attention action recognition model, inputting the weighted feature representation of a video clip frame image, and outputting the probability of the action category to which the video clip belongs;
and (4) generating a video segment for the new video according to the layer scale attention motion segment model and the watershed algorithm, and judging the segment motion type by the frame position attention motion recognition model to obtain a motion detection result.
Further, the step (1) is specifically:
(1-1) processing a single video into a sequence of frame images {f_n | n = 1, 2, …, N} at a sampling rate of i frames per second, where N is the total number of frame images and f_n is the n-th RGB three-channel frame image in the sequence, of width w and height h, n = 1, 2, …, N, i = 20~40;
(1-2) according to the video segment position marks {(s_m, e_m, c_m) | m = 1, 2, …, M}, acquiring the video segments, including action segments and background segments; the category of a video segment is c_m ∈ {0, 1, 2, …, J}, where J is the number of action categories, c_m = 1, 2, …, J numbers the action categories, and c_m = 0 numbers the background category; M is the total number of action and background segments; for the m-th video segment, s_m is the start frame number of the segment, e_m is the end frame number of the segment, and c_m is the category of the segment, m = 1, 2, …, M.
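Step (1-1) resamples each video to a fixed rate of i frames per second. Below is a minimal pure-Python sketch of the index arithmetic only (the actual frame decoding would use a video library, which is not shown); the function name and signature are illustrative, not from the patent:

```python
def sample_indices(native_fps: float, duration_s: float, target_fps: int = 30):
    """Return the native-frame indices to keep so that the sampled sequence
    has target_fps frames per second (step (1-1); the patent uses i = 20~40)."""
    total_native = int(native_fps * duration_s)
    n_out = int(target_fps * duration_s)          # N, total sampled frames
    step = native_fps / target_fps                # native frames per sample
    return [min(int(round(k * step)), total_native - 1) for k in range(n_out)]

# A 2-second clip recorded at 60 fps, resampled to i = 30 fps:
idx = sample_indices(native_fps=60.0, duration_s=2.0, target_fps=30)
```

Every second native frame is kept here, yielding N = 60 sampled frames for the 2-second clip.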
Still further, the step (2) is specifically:
(2-1) processing the frame image sequence of a complete video frame by frame: using the video segment position marks, obtain the start frame number and end frame number of each action segment and each background segment, mark the video frames inside action segments as action frames, and mark the video frames inside background segments as background frames;
(2-2) the layer scale attention action segment model takes a multilayer hole (dilated) convolutional neural network that models the temporal relation as its backbone. First, processing frame by frame from the lower layers to the higher layers, it obtains in order the context feature representations of each frame of the sequence at different scales; the feature representation of the t-th frame image at the k-th layer is F_t^k, where c_k is the number of channels and w_k and h_k are the width and height of the k-th layer feature representation. A weighted feature representation of the complete video is then obtained through the layer scale attention mechanism, the weighted feature of the t-th frame image being S_t = Σ_{k=1}^{K} a_k F_t^k, where a_k is the scale attention weight of the k-th layer, Σ_{k=1}^{K} a_k = 1, K is the total number of layers of the multilayer hole convolution network, and K ≥ 2;
(2-3) the weighted feature S_t of the t-th frame image is passed through a fully connected layer to give the output vector Z, which forms the last layer of the layer scale attention action segment model; this layer outputs the probability of a video frame belonging to an action frame using the Softmax function, y_h = e^{Z_h} / Σ_q e^{Z_q}, h = 0, 1, where e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and Z_q is the q-th element of the vector Z; denote by y_{n,h} the probability of whether the n-th video frame belongs to an action frame. The cross-entropy loss function of the model is then L = −Σ_n Σ_h ŷ_{n,h} log y_{n,h}, where ŷ_{n,h} is the true label, ŷ_{n,1} = 1 indicating that the frame is an action frame and ŷ_{n,0} = 1 indicating that the frame is a background frame. The layer scale attention action segment model is trained with a stochastic gradient descent algorithm, and the model parameters are updated by backpropagation of gradients.
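The layer scale attention of step (2-2) can be sketched numerically: the weights a_k are a softmax over K per-layer scores, and the frame feature is the weighted sum S_t = Σ_k a_k F_t^k. The pure-Python toy below assumes the K layer features have been flattened to a common length; in the patent's model the scores are learned and the features are convolutional feature maps:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def layer_scale_attention(feats_per_layer, scale_logits):
    """Weighted frame feature S_t = sum_k a_k * F_t^k (step (2-2)).
    feats_per_layer: K same-length feature vectors for one frame;
    scale_logits: K raw scores turned into attention weights a_k by softmax."""
    a = softmax(scale_logits)                      # a_k, sum to 1
    dim = len(feats_per_layer[0])
    return [sum(a[k] * feats_per_layer[k][d] for k in range(len(a)))
            for d in range(dim)]

# K = 3 layers, 4-dim features for one frame (toy values):
F_t = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
S_t = layer_scale_attention(F_t, scale_logits=[0.0, 0.0, 0.0])
# equal logits -> a_k = 1/3 for every layer
```

With equal logits each scale contributes equally; a trained model would instead push a_k toward the scale whose context best matches the current frame.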
Further, the step (3) is specifically:
(3-1) from the weighted feature representation {S_t} of the complete video, obtain in order the weighted feature representation of each video segment, {S_t | t = s_m, …, e_m}, using the start frame number and end frame number in the video segment position marks, m = 1, …, M;
(3-2) the frame position attention action recognition model takes a multilayer neural network with a frame position attention mechanism as its backbone; its input is the weighted feature representation of each frame of a video segment, and the model obtains the weighted feature representation of the video segment by computing frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t S_t, where b_t is the position attention weight of the t-th frame and Σ_t b_t = 1;
(3-3) the weighted feature X_m of the video segment is passed through a fully connected layer to give the output vector Z′, which forms the last layer of the frame position attention action recognition model; this layer outputs with the Softmax function the probability y′_j that the video segment belongs to action category j, j = 1, 2, …, J, and the probability y′_0 that it belongs to the background category. The cross-entropy loss of the model is then L′ = −Σ_j ŷ′_j log y′_j, where the true label ŷ′_j is 1 if the video segment belongs to category j and 0 otherwise. The frame position attention action recognition model is trained with a stochastic gradient descent algorithm, and the model parameters are updated by backpropagation of gradients.
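Step (3-2) pools the per-frame weighted features of a segment into one segment representation X_m = Σ_t b_t S_t, with b_t a softmax over learned position scores, replacing plain average pooling. A minimal sketch under those assumptions (the scores would be produced by the model, not fixed as below):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def segment_representation(frame_feats, pos_logits):
    """X_m = sum_t b_t * S_t over the frames of one segment (step (3-2));
    b_t are frame position attention weights (softmax of position scores)."""
    b = softmax(pos_logits)
    dim = len(frame_feats[0])
    return [sum(b[t] * frame_feats[t][d] for t in range(len(b)))
            for d in range(dim)]

def cross_entropy(probs, true_class):
    """L' = -log y'_j for a one-hot true label (step (3-3))."""
    return -math.log(probs[true_class])

# toy segment of 3 frames with 2-dim weighted features:
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_m = segment_representation(S, pos_logits=[10.0, -10.0, -10.0])
# the first frame dominates the attention, so X_m is close to [1, 0]
```

Note the contrast with average pooling, which would give [2/3, 2/3] here: the position attention lets a key frame dominate the segment representation.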
Still further, the step (4) is specifically:
(4-1) for a new video, obtain its frame image sequence by (1-1), input the sequence into the layer scale attention action segment model of step (2), and compute through (2-3) the probability sequence of the video frame images belonging to action frames; then apply a watershed algorithm based on multi-level immersion to the probability sequence: video frames whose probability values are higher than a set threshold τ (τ = 0~1) and that are consecutive in time are aggregated into video segments; using several different thresholds in the range 0~1 simultaneously generates M′ video segments of different lengths, with start frame numbers s′ and end frame numbers e′;
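The grouping in (4-1) is described only at the level of "threshold, then merge consecutive frames"; a pure-Python sketch of exactly that thresholding step, run at several immersion levels, is below (function names are illustrative; a full watershed implementation would also merge candidates across levels):

```python
def segments_above(probs, tau):
    """Group consecutive frames whose action probability exceeds tau into
    (start, end) frame-number pairs (one immersion level of step (4-1)).
    Frame numbers are 0-based here."""
    segs, start = [], None
    for t, p in enumerate(probs):
        if p > tau and start is None:
            start = t                      # segment opens
        elif p <= tau and start is not None:
            segs.append((start, t - 1))    # segment closes
            start = None
    if start is not None:                  # segment runs to the last frame
        segs.append((start, len(probs) - 1))
    return segs

def multi_threshold_segments(probs, taus):
    """Run several immersion levels and pool the distinct candidates,
    yielding segments of different lengths as in (4-1)."""
    out = set()
    for tau in taus:
        out.update(segments_above(probs, tau))
    return sorted(out)

# per-frame action probabilities from the segment model (toy values):
p = [0.1, 0.8, 0.9, 0.2, 0.6, 0.7, 0.7, 0.1]
cands = multi_threshold_segments(p, taus=[0.5, 0.7])
```

A lower level (τ = 0.5) recovers both the strong and the weaker action run, while a higher level (τ = 0.7) keeps only the strong one; pooling the levels yields candidates of different lengths for the recognition model.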
(4-2) input the video segment frame image sequences of (4-1) into the frame position attention action recognition model of step (3) to obtain the probability of the frame images in each video segment belonging to each category, and take the category with the maximum probability value as the category c′ of the video segment; output the start frame number and end frame number of every video segment judged to be a specific action;
(4-3) obtain video segments from the new video through (4-1), then obtain the video action detection result {(s′_{m′}, e′_{m′}, c′_{m′}) | m′ = 1, …, M′} through (4-2), where m′ is the sequence number of a video segment, M′ is the total number of detected action segments, s′_{m′} is the start frame number of the segment, e′_{m′} is the end frame number of the segment, and c′_{m′} is the action category of the segment.
The method of the invention uses a scale attention hole convolution network for video action detection and differs from existing methods in the following respects: 1) compared with temporal action localization networks that use a multi-scale parallel structure, the method uses hole convolution layers in a multilayer serial structure, extracting multi-scale context features while reducing the redundancy of the network structure; 2) whereas methods with a three-dimensional convolutional neural network backbone typically extract temporally downsampled information, the method uses a hole convolutional neural network to extract context features at the fine granularity of the original video frames; 3) the method combines an attention mechanism from the two angles of scale and position to better extract the temporal feature information corresponding to video frames and video segments; 4) in the action segment generation stage, action segments can be generated in parallel with a watershed algorithm based on multi-level immersion, with higher execution efficiency than many existing methods.
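Point 1) relies on the fact that serially stacked dilated ("hole") convolution layers grow the temporal receptive field quickly without downsampling, which is why one serial stack can replace a multi-scale parallel structure. The arithmetic below is the standard receptive-field formula for stride-1 dilated convolutions, not a formula stated in the patent:

```python
def receptive_field(kernel_sizes, dilations):
    """Temporal receptive field of serially stacked stride-1 dilated
    convolution layers: each layer adds (k - 1) * d frames of context."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# four stacked 3-tap layers with dilations 1, 2, 4, 8:
rf = receptive_field([3, 3, 3, 3], [1, 2, 4, 8])
# rf = 1 + 2*(1 + 2 + 4 + 8) = 31 frames of temporal context
```

Four layers thus see 31 frames while keeping the per-frame resolution intact, which matches the patent's goal of frame-level granularity with multi-scale context; each intermediate layer's output is one of the K scales weighted by the layer scale attention.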
The invention is suitable for video motion detection tasks based on deep learning, and its main advantages are: 1) combining the hole convolution network extracts spatio-temporal motion information that better reflects the intrinsic temporal and spatial structure of the video data while preserving frame-level granularity of the features; 2) the layer scale attention mechanism selects an appropriate feature representation for each frame by varying the scale that characterizes the temporal context of the current frame; 3) the frame position attention mechanism weights the video frames within each action segment so that their features accurately characterize the segment. The method provides a scientific and reasonable scheme for improving the performance of video motion detection from multiple angles, and can be widely applied in practical scenarios such as security monitoring, behavior analysis, video summarization, and event detection.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A video motion detection method based on a scale attention hole convolution network first samples a video to obtain a frame image sequence and obtains video segments according to the action segment marks; it then constructs a layer scale attention action segment model and a frame position attention action recognition model, and finally judges the action category of each video segment in combination with a watershed algorithm. The method captures the spatio-temporal motion information of the video data more accurately with the hole convolution network, uses a layer scale attention mechanism to describe the temporal context relation of the video frames, and learns, through the frame position attention mechanism, suitable weights for the video frames of an action segment so as to better reflect its content. A video motion detection system constructed in this way can effectively extract the temporal features of video frame images and video segments, and can effectively detect the action categories in the video.
As shown in fig. 1, the method first obtains a video data set, and then performs the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark; the method comprises the following steps:
(1-1) processing a single video into a sequence of frame images {f_n | n = 1, 2, …, N} at a sampling rate of i frames per second, where N is the total number of frame images and f_n is the n-th RGB three-channel frame image of width w and height h, n = 1, 2, …, N, i = 20~40; in this example, i = 30;
(1-2) according to the video segment position marks {(s_m, e_m, c_m) | m = 1, 2, …, M}, acquiring the video segments, including action segments and background segments; the category of a video segment is c_m ∈ {0, 1, 2, …, J}, where J is the number of action categories, c_m = 1, 2, …, J numbers the action categories, and c_m = 0 numbers the background category; M is the total number of action and background segments; for the m-th video segment, s_m is the start frame number of the segment, e_m is the end frame number of the segment, and c_m is the category of the segment, m = 1, 2, …, M.
Step (2), constructing a layer scale attention action fragment model, inputting a frame image sequence of a complete video, and outputting a weighted feature representation of the complete video frame image and the probability of whether each frame is an action frame; the method comprises the following steps:
(2-1) processing the frame image sequence of a complete video frame by frame: using the video segment position marks, obtain the start frame number and end frame number of each action segment and each background segment, mark the video frames inside action segments as action frames, and mark the video frames inside background segments as background frames;
(2-2) the layer scale attention action segment model takes a multilayer hole convolutional neural network that models the temporal relation as its backbone. First, processing frame by frame from the lower layers to the higher layers, it obtains in order the context feature representations of each frame of the sequence at different scales; the feature representation of the t-th frame image at the k-th layer is F_t^k, where c_k is the number of channels and w_k and h_k are the width and height of the k-th layer feature representation. A weighted feature representation of the complete video is then obtained through the layer scale attention mechanism, the weighted feature of the t-th frame image being S_t = Σ_{k=1}^{K} a_k F_t^k, where a_k is the scale attention weight of the k-th layer, Σ_{k=1}^{K} a_k = 1, K is the total number of layers of the multilayer hole convolution network, and K ≥ 2;
(2-3) the weighted feature S_t of the t-th frame image is passed through a fully connected layer to give the output vector Z, which forms the last layer of the layer scale attention action segment model; this layer outputs the probability of a video frame belonging to an action frame using the Softmax function, y_h = e^{Z_h} / Σ_q e^{Z_q}, h = 0, 1, where e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and Z_q is the q-th element of the vector Z; denote by y_{n,h} the probability of whether the n-th video frame belongs to an action frame. The cross-entropy loss function of the model is then L = −Σ_n Σ_h ŷ_{n,h} log y_{n,h}, where ŷ_{n,h} is the true label, ŷ_{n,1} = 1 indicating that the frame is an action frame and ŷ_{n,0} = 1 indicating that the frame is a background frame. The layer scale attention action segment model is trained with a stochastic gradient descent algorithm, and the model parameters are updated by backpropagation of gradients.
Step (3), constructing a frame position attention action recognition model, inputting the weighted feature representation of a video clip frame image, and outputting the probability of the action category to which the video clip belongs; the method comprises the following steps:
(3-1) from the weighted feature representation {S_t} of the complete video, obtain in order the weighted feature representation of each video segment, {S_t | t = s_m, …, e_m}, using the start frame number and end frame number in the video segment position marks, m = 1, …, M;
(3-2) the frame position attention action recognition model takes a multilayer neural network with a frame position attention mechanism as its backbone; its input is the weighted feature representation of each frame of a video segment, and the model obtains the weighted feature representation of the video segment by computing frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t S_t, where b_t is the position attention weight of the t-th frame and Σ_t b_t = 1;
(3-3) the weighted feature X_m of the video segment is passed through a fully connected layer to give the output vector Z′, which forms the last layer of the frame position attention action recognition model; this layer outputs with the Softmax function the probability y′_j that the video segment belongs to action category j, j = 1, 2, …, J, and the probability y′_0 that it belongs to the background category. The cross-entropy loss of the model is then L′ = −Σ_j ŷ′_j log y′_j, where the true label ŷ′_j is 1 if the video segment belongs to category j and 0 otherwise. The frame position attention action recognition model is trained with a stochastic gradient descent algorithm, and the model parameters are updated by backpropagation of gradients.
Step (4), generating a video clip for the new video according to the layer scale attention motion clip model and the watershed algorithm, and judging the clip motion type by the frame position attention motion recognition model to obtain a motion detection result; the method comprises the following steps:
(4-1) for a new video, obtain its frame image sequence by (1-1), input the sequence into the layer scale attention action segment model of step (2), and compute through (2-3) the probability sequence of the video frame images belonging to action frames; then apply a watershed algorithm based on multi-level immersion to the probability sequence: video frames whose probability values are higher than a set threshold τ (τ = 0~1; in this embodiment, τ = 0.7) and that are consecutive in time are aggregated into video segments; using several different thresholds in the range 0~1 simultaneously generates M′ video segments of different lengths, with start frame numbers s′ and end frame numbers e′;
(4-2) input the video segment frame image sequences of (4-1) into the frame position attention action recognition model of step (3) to obtain the probability of the frame images in each video segment belonging to each category, and take the category with the maximum probability value as the category c′ of the video segment; output the start frame number and end frame number of every video segment judged to be a specific action;
(4-3) obtain video segments from the new video through (4-1), then obtain the video action detection result {(s′_{m′}, e′_{m′}, c′_{m′}) | m′ = 1, …, M′} through (4-2), where m′ is the sequence number of a video segment, M′ is the total number of detected action segments, s′_{m′} is the start frame number of the segment, e′_{m′} is the end frame number of the segment, and c′_{m′} is the action category of the segment.
The embodiment described above is only an example of an implementation of the inventive concept; the protection scope of the invention should not be regarded as limited to the specific form set forth in the embodiment, and also covers equivalent technical means that can be conceived by those skilled in the art according to the inventive concept.
Claims (6)
1. The video motion detection method based on the scale attention hole convolutional network is characterized by firstly acquiring a video data set and then performing the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark; the method comprises the following steps:
(1-1) processing a single video into a sequence of frame images {f_n | n = 1, 2, …, N} at a sampling rate of i frames per second, where N is the total number of frame images and f_n is the n-th RGB three-channel frame image in the sequence, of width w and height h, n = 1, 2, …, N;
(1-2) according to the video segment position marks {(s_m, e_m, c_m) | m = 1, 2, …, M}, acquiring the video segments, including action segments and background segments; the category of a video segment is c_m ∈ {0, 1, 2, …, J}, where J is the number of action categories, c_m = 1, 2, …, J numbers the action categories, and c_m = 0 numbers the background category; M is the total number of action and background segments; for the m-th video segment, s_m is the start frame number of the segment, e_m is the end frame number of the segment, and c_m is the category of the segment, m = 1, 2, …, M;
step (2), constructing a layer scale attention action segment model, inputting the frame image sequence of a complete video, and outputting the weighted feature representation of the complete video frame images and the probability of whether each frame is an action frame; the layer scale attention action segment model takes a multilayer hole convolutional neural network that models the temporal relation as its backbone and obtains the weighted feature representation of the complete video through a layer scale attention mechanism; the weighted feature of the t-th frame image is S_t = Σ_{k=1}^{K} a_k F_t^k, where a_k is the scale attention weight of the k-th layer, F_t^k is the feature representation of the t-th frame image at the k-th layer, K is the total number of layers of the multilayer hole convolution network, and K ≥ 2;
step (3), constructing a frame position attention action recognition model: its input is the weighted feature representations of the frame images of a video clip, and its output is the probability of the action category to which the video clip belongs; the frame position attention action recognition model takes a multi-layer neural network with a frame position attention mechanism as its main body, and obtains the weighted feature representation of the video clip by computing the frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t S_t, wherein b_t is the position attention weight of the t-th frame;
step (4), for a new video, generating video clips according to the layer scale attention action clip model and a watershed algorithm, and judging the action category of each clip with the frame position attention action recognition model to obtain the action detection result.
2. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 1, wherein step (2) specifically comprises:
(2-1) processing the frame image sequence of the complete video frame by frame; using the video clip position marks, obtaining the start and end frame numbers of the action clips and background clips, marking the video frames inside action clips as action frames and the video frames inside background clips as background frames;
(2-2) first, processing the frame image sequence from the lower layers to the higher layers, frame by frame, to obtain context feature representations of each frame at different scales, the feature of the t-th frame image at the k-th layer being F_t^k ∈ R^{c_k × w_k × h_k}, wherein c_k is the number of channels of the k-th layer and w_k, h_k are the width and height of the k-th layer feature representation; then obtaining the weighted feature representation S_t of the complete video through the layer scale attention mechanism;
(2-3) the weighted feature S_t of the t-th frame image is passed through a fully connected layer to obtain an output vector Z, which serves as the last layer of the layer scale attention action clip model; the Softmax(·) function outputs the probability that a video frame belongs to an action frame, y_q = e^{z_q} / Σ_{q'} e^{z_{q'}}, wherein e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and z_q is the q-th element of the vector Z; denoting by y_n^q the probability of whether the n-th video frame belongs to an action frame, the cross-entropy loss function of the model is Loss = −Σ_n Σ_{q∈{0,1}} ŷ_n^q log y_n^q, wherein ŷ_n^q is the true label: ŷ_n^1 = 1 indicates that the frame is an action frame and ŷ_n^0 = 1 indicates that the frame is a background frame; the layer scale attention action clip model is trained with the stochastic gradient descent algorithm, updating the model parameters by back-propagating gradients.
3. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 2, wherein step (3) specifically comprises:
(3-1) from the weighted feature representations S = {S_1, …, S_N} of the complete video, obtaining the weighted feature representations of the frames of each video clip using the start and end frame numbers in the video clip position marks L;
(3-2) the input of the frame position attention action recognition model is the weighted feature representations of the frames of a video clip; the model obtains the weighted feature representation X_m of the video clip by computing the frame position attention;
(3-3) the weighted feature X_m of a video clip is passed through a fully connected layer to obtain an output vector, which serves as the last layer of the frame position attention action recognition model; the Softmax(·) function outputs the probability p_j that the video clip belongs to action category j and the probability p_0 that it belongs to the background category; the cross-entropy loss of the model is then Loss = −Σ_m Σ_{j=0}^{J} ĉ_m^j log p_m^j, wherein ĉ_m^j is the true label, equal to 1 if the video clip belongs to category j and 0 otherwise; the frame position attention action recognition model is trained with the stochastic gradient descent algorithm, updating the model parameters by back-propagating gradients.
4. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 3, wherein step (4) specifically comprises:
(4-1) for a new video, obtaining its frame image sequence by (1-1) and inputting the sequence into the layer scale attention action clip model of step (2); computing by (2-3) the probability sequence of whether each frame of the video belongs to an action frame; then applying a multi-level immersion watershed algorithm to the probability sequence, i.e. aggregating temporally contiguous video frames whose probability values are higher than a set threshold τ into one video clip; using multiple different thresholds in the range 0–1, generating M′ video clips of different lengths with start frame numbers s′ and end frame numbers e′;
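The multi-threshold grouping of (4-1) can be sketched as follows: for each threshold τ, temporally contiguous frames whose action probability exceeds τ are merged into one candidate clip, and lowering τ "floods" more frames into clips, which is the watershed-style immersion idea. This is an illustrative simplification of the claimed algorithm, not the patented implementation.

```python
def generate_clips(probs, thresholds):
    """Candidate clip proposal by multi-level thresholding.

    probs: per-frame action probabilities; thresholds: iterable of tau values.
    Returns deduplicated, sorted (start_frame, end_frame) candidate clips.
    """
    clips = set()
    for tau in thresholds:
        start = None
        for n, p in enumerate(probs):
            if p > tau:
                if start is None:
                    start = n                  # a clip opens at this frame
            elif start is not None:
                clips.add((start, n - 1))      # clip closes at the previous frame
                start = None
        if start is not None:                  # a clip runs to the last frame
            clips.add((start, len(probs) - 1))
    return sorted(clips)
```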
(4-2) inputting the video clip frame sequences of (4-1) into the frame position attention action recognition model of step (3) to obtain the probability that each video clip belongs to each category, and taking the category corresponding to the maximum probability value as the category c′ of the video clip; outputting the start and end frame numbers of the video clips judged to be a specific action;
(4-3) obtaining video clips from the new video through (4-1), and then obtaining the video action detection result through (4-2): R = {(s′_{m′}, e′_{m′}, c′_{m′})}, m′ = 1, 2, …, M′, wherein m′ is the sequence number of a video clip, M′ is the total number of detected action clips, s′_{m′} represents the start frame number of the clip, e′_{m′} represents the end frame number of the clip, and c′_{m′} represents the action category of the clip.
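Steps (4-2) and (4-3) together amount to scoring each candidate clip and keeping those whose best category is an action. A hedged sketch, in which `classify` stands in for the trained frame position attention recognition model and is an assumption:

```python
def detect_actions(candidate_clips, classify):
    """Assemble the detection result R = {(s', e', c')}.

    candidate_clips: (start, end) pairs from the proposal stage;
    classify(start, end): returns a length-(J+1) probability list,
    index 0 = background, indices 1..J = action categories.
    """
    results = []
    for start, end in candidate_clips:
        probs = classify(start, end)
        c = max(range(len(probs)), key=probs.__getitem__)  # argmax category
        if c != 0:                     # keep only clips judged to be an action
            results.append((start, end, c))
    return results
```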
5. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 1, wherein i = 20–40.
6. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 4, wherein τ is between 0 and 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010252104.7A CN111611847B (en) | 2020-04-01 | 2020-04-01 | Video motion detection method based on scale attention hole convolution network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111611847A CN111611847A (en) | 2020-09-01 |
CN111611847B true CN111611847B (en) | 2021-04-30 |
Family
ID=72200342
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010252104.7A Active CN111611847B (en) | 2020-04-01 | 2020-04-01 | Video motion detection method based on scale attention hole convolution network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111611847B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112418012B (en) * | 2020-11-09 | 2022-06-07 | 武汉大学 | Video abstract generation method based on space-time attention model |
CN112580557A (en) * | 2020-12-25 | 2021-03-30 | 深圳市优必选科技股份有限公司 | Behavior recognition method and device, terminal equipment and readable storage medium |
CN112580577B (en) * | 2020-12-28 | 2023-06-30 | 出门问问(苏州)信息科技有限公司 | Training method and device for generating speaker image based on facial key points |
CN113111842B (en) * | 2021-04-26 | 2023-06-27 | 浙江商汤科技开发有限公司 | Action recognition method, device, equipment and computer readable storage medium |
CN113408343B (en) * | 2021-05-12 | 2022-05-13 | 杭州电子科技大学 | Classroom action recognition method based on double-scale space-time block mutual attention |
CN113204674B (en) * | 2021-07-05 | 2021-09-17 | 杭州一知智能科技有限公司 | Video-paragraph retrieval method and system based on local-overall graph inference network |
CN113822172B (en) * | 2021-08-30 | 2024-06-14 | 中国科学院上海微系统与信息技术研究所 | Video space-time behavior detection method |
CN114926900B (en) * | 2022-05-10 | 2023-06-16 | 电子科技大学 | Human body action on-line detection method with separated front and back |
CN114842559B (en) * | 2022-06-29 | 2022-10-14 | 山东省人工智能研究院 | Video interaction action detection method based on multi-mode time perception and attention |
CN115834977B (en) * | 2022-11-18 | 2023-09-08 | 贝壳找房(北京)科技有限公司 | Video processing method, electronic device, storage medium and computer program product |
CN115763167B (en) * | 2022-11-22 | 2023-09-22 | 黄华集团有限公司 | Solid cabinet circuit breaker and control method thereof |
CN117630344B (en) * | 2024-01-25 | 2024-04-05 | 西南科技大学 | Method for detecting slump range of concrete on line in real time |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108664931A (en) * | 2018-05-11 | 2018-10-16 | 中国科学技术大学 | A kind of multistage video actions detection method |
CN108830212A (en) * | 2018-06-12 | 2018-11-16 | 北京大学深圳研究生院 | A kind of video behavior time shaft detection method |
CN109101896A (en) * | 2018-07-19 | 2018-12-28 | 电子科技大学 | A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism |
CN109522966A (en) * | 2018-11-28 | 2019-03-26 | 中山大学 | A kind of object detection method based on intensive connection convolutional neural networks |
CN110738129A (en) * | 2019-09-20 | 2020-01-31 | 华中科技大学 | end-to-end video time sequence behavior detection method based on R-C3D network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108288015B (en) * | 2017-01-10 | 2021-10-22 | 武汉大学 | Human body action recognition method and system in video based on time scale invariance |
US11640710B2 (en) * | 2017-11-14 | 2023-05-02 | Google Llc | Weakly-supervised action localization by sparse temporal pooling network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||