CN111611847B - Video motion detection method based on scale attention hole convolution network - Google Patents

Video motion detection method based on scale attention hole convolution network

Info

Publication number
CN111611847B
CN111611847B (application CN202010252104.7A)
Authority
CN
China
Prior art keywords
video
frame
action
motion
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010252104.7A
Other languages
Chinese (zh)
Other versions
CN111611847A (en)
Inventor
李平
曹佳晨
陈乐聪
徐向华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010252104.7A priority Critical patent/CN111611847B/en
Publication of CN111611847A publication Critical patent/CN111611847A/en
Application granted granted Critical
Publication of CN111611847B publication Critical patent/CN111611847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video motion detection method based on a scale attention hole (dilated) convolution network. The method first samples a video to obtain a frame image sequence and obtains video segments according to segment position marks; it then constructs a layer scale attention action segment model and a frame position attention action recognition model and, using these models together with a watershed algorithm, successively obtains the weighted feature representation of the frame images and the motion category of each video segment, thereby completing the video motion detection task. The hole convolution network extracts spatio-temporal motion information that better reflects the intrinsic structure of the temporal and spatial dimensions of the video data; the layer scale attention more appropriately describes how the internal association of the temporal context of a video frame changes with scale; and the designed frame position attention mechanism assigns the video frames of an action segment weights that more accurately represent the key content of the segment, thereby improving both the precision and the efficiency of video motion detection.

Description

Video motion detection method based on scale attention hole convolution network
Technical Field
The invention belongs to the technical field of video analysis, in particular to the technical field of temporal action detection, and relates to a video action detection method based on a scale attention hole convolution network.
Background
Understanding human action videos plays an important role in many fields such as security monitoring and behavior analysis, and has become a frontier research topic in computer vision. However, unclipped real-world videos often contain background segments unrelated to human actions, which hinders a correct understanding of the video content. To address this problem, video action detection methods not only classify the actions within a video but also locate the start and end times of each action instance occurring in the video. A video action detection task generally takes a video frame sequence as input and outputs detection results for multiple segments in the form "action category - start frame - end frame"; its processing can be divided into two stages: action segment generation and action segment recognition. The former generally outputs the start frame and end frame of a segment, and the latter outputs the action category of the segment. In general, video action detection helps to better understand video content and supports tasks such as video summarization, action recognition, content annotation and event capture. For example, in a video summarization task, key segments can be obtained through video action detection, so that the key frames or segments that best reflect the video content are accurately located, which improves the quality of the summary.
Video action detection processes video frame images and needs to describe the temporal relation between frames, which involves high-dimensional tensor computation. Traditional machine learning methods rely on hand-crafted features, such as trajectory features; their extraction efficiency cannot meet real-time requirements, and since feature extraction is separated from model training, the generalization ability of the model is weak. In recent years, convolutional neural networks (CNNs) trained end to end have developed rapidly and can compensate for these shortcomings. For example: temporal action features based on three-dimensional convolutional neural networks or optical-flow information can be extracted more efficiently; candidate segment generation based on deep reinforcement learning can adaptively complete the action segment generation task end to end; and temporal action localization networks introduce a multi-scale parallel action segment generation structure to handle action segments of different lengths, substantially advancing the state of the art in this field.
The existing video action detection methods mainly have the following shortcomings. First, in the feature extraction stage, the three-dimensional convolution operations used to extract temporal action features reduce the temporal dimension of the input video layer by layer in a fixed manner, which constrains the temporal size of the extracted features: a scale that is too small splits the context semantics, while a scale that is too large introduces interference from different semantics. Second, in the action segment generation stage, for actions of different durations, the key cues for deciding whether an action occurs and which type it is, namely the positions and durations of the key frames (for example, runs of consecutive key frames), usually differ, yet the conventional average pooling operation ignores the weighting of these key cues. Third, existing methods extract feature representations of action segments of different sizes with different network structures (such as separate hole convolution networks), which greatly increases the time and space cost of network construction and training. It is therefore desirable to design a method that improves video action detection performance while saving computation and storage overhead.
Disclosure of Invention
The invention aims to provide, against the defects of the prior art, a video action detection method based on a scale attention hole convolutional network. The method combines a hole convolutional network to capture the spatio-temporal action information of the video data and accurately describes the temporal context relation of video frames through scale attention, so that action segments in a video can be detected effectively and their categories judged accurately.
The method firstly acquires a video data set, and then performs the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark;
step (2), constructing a layer scale attention action fragment model, inputting a frame image sequence of a complete video, and outputting a weighted feature representation of the complete video frame image and the probability of whether each frame is an action frame;
step (3), constructing a frame position attention action recognition model, inputting the weighted feature representation of a video clip frame image, and outputting the probability of the action category to which the video clip belongs;
and (4) generating a video segment for the new video according to the layer scale attention motion segment model and the watershed algorithm, and judging the segment motion type by the frame position attention motion recognition model to obtain a motion detection result.
Further, the step (1) is specifically:
(1-1) Process a single video into a frame image sequence F = {f_1, f_2, ..., f_N} at a sampling rate of i frames per second, where N is the total number of frame images, f_n is the n-th RGB three-channel frame image in the sequence, of width w and height h, n = 1, 2, ..., N, and i = 20~40;
(1-2) According to the video clip position marks L = {(s_m, e_m, c_m)}_{m=1}^{M}, obtain the video clips, which comprise action clips and background clips; here the category of a video clip is c_m ∈ {0, 1, 2, ..., J}, where J is the number of action categories, the labels 1, 2, ..., J are action category labels and the label 0 is the background category label; M is the total number of action and background clips; for the m-th video clip, s_m is the start frame number of the clip, e_m is the end frame number of the clip, and c_m is the category corresponding to the clip, m = 1, 2, ..., M.
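As an illustration of step (1), the following Python sketch decodes a video at roughly i frames per second and slices it into labelled clips. It is a minimal sketch, not part of the patent: OpenCV is assumed for decoding, and the helper names (sample_frames, cut_clips) and the example position marks are purely illustrative.

```python
import cv2  # OpenCV, assumed available for video decoding

def sample_frames(video_path, i=30):
    """Decode a video into an RGB frame sequence F = {f_1, ..., f_N} at roughly i frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or i
    step = max(int(round(native_fps / i)), 1)    # keep every `step`-th decoded frame
    frames, idx = [], 0
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))  # f_n: h x w x 3 RGB image
        idx += 1
    cap.release()
    return frames  # length N

# Illustrative position marks L = {(s_m, e_m, c_m)}: start frame, end frame, category (0 = background)
position_marks = [(0, 45, 0), (46, 130, 3), (131, 170, 0)]

def cut_clips(frames, marks):
    """Slice the frame sequence into action/background clips according to the position marks."""
    return [(frames[s:e + 1], c) for (s, e, c) in marks]
```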
Still further, the step (2) is specifically:
(2-1) Process the frame image sequence of the complete video frame by frame, use the video clip position marks to obtain the start and end frame numbers of the action clips and the background clips respectively, mark the video frames inside action clips as action frames, and mark the video frames inside background clips as background frames;
(2-2) The layer scale attention action segment model takes a multi-layer hole convolutional neural network that models the temporal relation as its backbone. First, processing frame by frame from the lower layers to the higher layers, it sequentially obtains context feature representations at different scales for each frame of the sequence F = {f_1, f_2, ..., f_N}; the feature representation of the t-th frame image at the k-th layer is V_t^k ∈ ℝ^{c_k×w_k×h_k}, where c_k is the number of channels of the k-th layer and w_k, h_k are respectively the width and height of the k-th layer feature representation. A weighted feature representation of the complete video S = {S_1, S_2, ..., S_N} is then obtained through the layer scale attention mechanism, where the weighted feature of the t-th frame image is S_t = Σ_{k=1}^{K} a_t^k · V_t^k; a_t^k is the scale attention weight of the k-th layer, with Σ_{k=1}^{K} a_t^k = 1; K is the total number of layers of the multi-layer hole convolution network, K ≥ 2;
(2-3) Pass the weighted feature representation S_t of the t-th frame image through a fully connected layer to obtain an output vector Z ∈ ℝ², which serves as the last layer of the layer scale attention action segment model; the Softmax(·) function then outputs the probability of whether the video frame belongs to an action frame, y_h = e^{Z_h} / Σ_q e^{Z_q}, h = 0, 1, where e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and Z_q denotes the q-th element of the vector Z; the probability that the n-th video frame belongs to an action frame is recorded as p_n. The cross-entropy loss function of the model is then computed as -Σ_{n=1}^{N} [ŷ^{(n)} log y_1^{(n)} + (1 - ŷ^{(n)}) log y_0^{(n)}], where ŷ^{(n)} is the true label, ŷ^{(n)} = 1 indicating that the frame is an action frame and ŷ^{(n)} = 0 indicating that the frame is a background frame. The layer scale attention action segment model is optimized and trained with a stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation.
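A hedged sketch of the training objective in (2-3): a fully connected layer maps each weighted frame feature S_t to Z ∈ ℝ², and Softmax plus cross-entropy is optimized with stochastic gradient descent. The feature dimension (256), learning rate and momentum are placeholders; in the patent this head would be trained jointly with the model of (2-2).

```python
import torch
import torch.nn as nn

frame_head = nn.Linear(256, 2)                 # last layer: Z in R^2 per frame (256 = assumed C)
criterion  = nn.CrossEntropyLoss()             # combines Softmax and the cross-entropy loss
optimizer  = torch.optim.SGD(frame_head.parameters(), lr=0.01, momentum=0.9)

def frame_train_step(S, labels):
    """S: weighted frame features (N, 256); labels: (N,) with 1 = action frame, 0 = background frame."""
    Z = frame_head(S)                          # (N, 2) logits
    loss = criterion(Z, labels)                # cross-entropy over all N frames
    optimizer.zero_grad()
    loss.backward()                            # back-propagate gradients
    optimizer.step()                           # stochastic gradient descent update
    return loss.item()

def action_frame_probability(S):
    """p_n = Softmax(Z)[:, 1], the probability that each frame is an action frame."""
    with torch.no_grad():
        return torch.softmax(frame_head(S), dim=1)[:, 1]
```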
Further, the step (3) is specifically:
(3-1) From the weighted feature representation S = {S_1, ..., S_N} of the complete video, use the start frame numbers and end frame numbers in the video clip position marks L to obtain the weighted feature representation of each video clip, S^m = {S_{s_m}, S_{s_m+1}, ..., S_{e_m}}, m = 1, ..., M;
(3-2) The frame position attention action recognition model takes a multi-layer neural network with a frame position attention mechanism as its backbone; its input is the weighted feature representation S^m of each frame of a video clip. The model obtains the weighted feature representation of the video clip by computing the frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t · S_t, where b_t is the position attention weight of the t-th frame and Σ_{t=s_m}^{e_m} b_t = 1;
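A possible reading of the frame position attention of (3-2) in PyTorch. The scoring function that produces the position weights b_t (a linear layer followed by a softmax over the frames of the clip) is an assumption; the patent only requires that the b_t sum to 1 and weight the frame features into a single clip feature X_m.

```python
import torch
import torch.nn as nn

class FramePositionAttention(nn.Module):
    """Pools the per-frame features of one clip into a single clip feature X_m = sum_t b_t * S_t."""
    def __init__(self, channels=256):
        super().__init__()
        self.score = nn.Linear(channels, 1)        # assumed scorer for the position weights b_t

    def forward(self, clip_feats):                  # clip_feats: (T, C), frames s_m .. e_m
        b = torch.softmax(self.score(clip_feats).squeeze(-1), dim=0)  # (T,), sums to 1
        return (b.unsqueeze(-1) * clip_feats).sum(dim=0)              # X_m: (C,)
```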
(3-3) Pass the weighted feature representation X_m of the video clip through a fully connected layer to obtain an output vector Z' ∈ ℝ^{J+1}, which serves as the last layer of the frame position attention action recognition model; the Softmax(·) function then outputs the probability y_j = e^{Z'_j} / Σ_q e^{Z'_q} that the video clip belongs to action category j, j = 1, 2, ..., J, as well as the probability y_0 that it belongs to the background category. The cross-entropy loss of the model is then computed as -Σ_{j=0}^{J} ŷ_j log y_j, where ŷ_j is the true label, equal to 1 if the video clip belongs to category j and 0 otherwise. The frame position attention action recognition model is optimized and trained with a stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation.
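The clip-level head of (3-3) mirrors the frame-level head sketched after (2-3), now with J+1 outputs (index 0 for the background category). The value of J, the feature dimension and the optimizer settings below are placeholders.

```python
import torch
import torch.nn as nn

J = 20                                          # assumed number of action categories
clip_head = nn.Linear(256, J + 1)               # last layer: Z' in R^(J+1); index 0 = background
criterion = nn.CrossEntropyLoss()               # Softmax plus cross-entropy loss
optimizer = torch.optim.SGD(clip_head.parameters(), lr=0.01, momentum=0.9)

def clip_train_step(X, category):
    """X: clip features (B, 256) from the frame position attention; category: (B,) in {0, ..., J}."""
    loss = criterion(clip_head(X), category)
    optimizer.zero_grad()
    loss.backward()                             # back-propagate gradients
    optimizer.step()                            # stochastic gradient descent update
    return loss.item()
```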
Still further, the step (4) is specifically:
(4-1) For a new video, obtain its frame image sequence F = {f_1, f_2, ..., f_N} by (1-1) and input the sequence into the layer scale attention action segment model of step (2); compute through (2-3) the probability sequence P = {p_1, ..., p_N} of whether each video frame belongs to an action frame; then apply a watershed algorithm based on multi-level immersion to the probability sequence, that is, aggregate temporally consecutive video frames whose probability values are higher than a set threshold τ (τ = 0~1) into video clips; by using several different thresholds in the range 0~1 at the same time, generate M' video clips of different lengths together with their start frame numbers s' and end frame numbers e';
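The multi-level immersion watershed of (4-1) can be approximated, for illustration, by thresholding the frame probability sequence at several levels and merging runs of consecutive above-threshold frames into candidate clips. The function name and the de-duplication via a set are assumptions; a full watershed implementation would grow segments level by level rather than independently per threshold.

```python
def generate_segments(probs, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """probs: list of per-frame action probabilities p_1..p_N.
    Returns candidate clips as (start_frame, end_frame) pairs, one pass per threshold tau."""
    segments = set()
    for tau in thresholds:                        # several immersion levels in (0, 1)
        start = None
        for n, p in enumerate(probs):
            if p > tau and start is None:         # a run of consecutive action frames begins
                start = n
            elif p <= tau and start is not None:  # the run ends: emit one candidate clip
                segments.add((start, n - 1))
                start = None
        if start is not None:                     # run reaches the last frame
            segments.add((start, len(probs) - 1))
    return sorted(segments)                       # M' clips with start s' and end e'
```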
(4-2) Input the video clip frame image sequences of (4-1) into the frame position attention action recognition model of step (3) to obtain the probability that the frame images in each video clip belong to each category, and take the category corresponding to the maximum probability value as the category c' of the video clip; output the start frame number and the end frame number of every video clip judged to be a specific action;
(4-3) Obtain the video clips of the new video through (4-1), and then obtain the video action detection result {(s'_{m'}, e'_{m'}, c'_{m'})}_{m'=1}^{M'} through (4-2), where m' is the index of the video clip, M' is the total number of detected action clips, s'_{m'} denotes the start frame number of the clip, e'_{m'} denotes the end frame number of the clip, and c'_{m'} denotes the action category of the clip.
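Putting (4-1) and (4-2) together, a minimal inference driver could look as follows. classify_clip stands in for the frame position attention action recognition model and clip_features for the per-clip feature extraction; both are hypothetical names, and generate_segments refers to the sketch given after (4-1).

```python
def detect_actions(probs, clip_features, classify_clip):
    """probs: per-frame action probabilities; clip_features(s, e) -> clip feature;
    classify_clip(feature) -> vector of J+1 class probabilities (index 0 = background)."""
    results = []
    for s, e in generate_segments(probs):            # candidate clips from (4-1)
        class_probs = classify_clip(clip_features(s, e))
        c = int(max(range(len(class_probs)), key=lambda j: class_probs[j]))
        if c != 0:                                   # discard clips recognised as background
            results.append((s, e, c))                # (s'_{m'}, e'_{m'}, c'_{m'})
    return results
```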
The method of the invention uses a scale attention hole convolution network for video action detection and differs from existing methods in the following aspects: 1) compared with temporal action localization networks that use a multi-scale parallel structure, the method uses hole convolution layers in a multi-layer serial structure, which extracts multi-scale context features while reducing the redundancy of the network structure; 2) methods built on a three-dimensional convolutional neural network backbone usually extract temporally down-sampled information, whereas the proposed method uses a hole convolutional neural network to extract context features at the fine granularity of the original video frames; 3) the method combines an attention mechanism from the two angles of scale and position to better extract the temporal feature information corresponding to video frames and video clips; 4) in the action segment generation stage, action segments can be generated in parallel with a watershed algorithm based on multi-level immersion, whose execution efficiency is higher than that of many existing methods.
The invention is suitable for video action detection tasks based on deep learning methods, and its main advantages are: 1) by combining the hole convolution network, it extracts spatio-temporal action information that better reflects the intrinsic structure of the temporal and spatial dimensions of the video data while preserving frame-level fine granularity in the features; 2) by using the layer scale attention mechanism to vary the scale characterizing the temporal context of the current frame, it selects an appropriate feature representation for each frame; 3) by using the frame position attention mechanism, it assigns weights to the video frames within each action segment so that their features accurately characterize the segment content. The method provides a scientific and reasonable scheme for improving the performance of video action detection from multiple angles, and can be widely applied in practical scenarios such as security monitoring, behavior analysis, video summarization and event detection.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A video motion detection method based on a scale attention hole convolutional network first samples a video to obtain a frame image sequence and obtains video segments according to the action segment marks; it then constructs a layer scale attention action segment model and a frame position attention action recognition model, and finally judges the action category of each video segment in combination with a watershed algorithm. The method uses the hole convolution network to capture the spatio-temporal action information of the video data more accurately, uses the layer scale attention mechanism to describe the temporal context relation of video frames, and learns appropriate weights for the video frames of an action segment through the frame position attention mechanism so as to better reflect the content of the action segment. A video action detection system constructed in this way can effectively extract the temporal features of video frame images and video clips and effectively detect the action categories in a video.
As shown in fig. 1, the method first obtains a video data set, and then performs the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark; the method comprises the following steps:
(1-1) Process a single video into a frame image sequence F = {f_1, f_2, ..., f_N} at a sampling rate of i frames per second, where N is the total number of frame images, f_n is the n-th RGB three-channel frame image in the sequence, of width w and height h, n = 1, 2, ..., N, and i = 20~40; in this embodiment, i = 30;
(1-2) According to the video clip position marks L = {(s_m, e_m, c_m)}_{m=1}^{M}, obtain the video clips, which comprise action clips and background clips; here the category of a video clip is c_m ∈ {0, 1, 2, ..., J}, where J is the number of action categories, the labels 1, 2, ..., J are action category labels and the label 0 is the background category label; M is the total number of action and background clips; for the m-th video clip, s_m is the start frame number of the clip, e_m is the end frame number of the clip, and c_m is the category corresponding to the clip, m = 1, 2, ..., M.
Step (2), constructing a layer scale attention action fragment model, inputting a frame image sequence of a complete video, and outputting a weighted feature representation of the complete video frame image and the probability of whether each frame is an action frame; the method comprises the following steps:
(2-1) Process the frame image sequence of the complete video frame by frame, use the video clip position marks to obtain the start and end frame numbers of the action clips and the background clips respectively, mark the video frames inside action clips as action frames, and mark the video frames inside background clips as background frames;
(2-2) The layer scale attention action segment model takes a multi-layer hole convolutional neural network that models the temporal relation as its backbone. First, processing frame by frame from the lower layers to the higher layers, it sequentially obtains context feature representations at different scales for each frame of the sequence F = {f_1, f_2, ..., f_N}; the feature representation of the t-th frame image at the k-th layer is V_t^k ∈ ℝ^{c_k×w_k×h_k}, where c_k is the number of channels of the k-th layer and w_k, h_k are respectively the width and height of the k-th layer feature representation. A weighted feature representation of the complete video S = {S_1, S_2, ..., S_N} is then obtained through the layer scale attention mechanism, where the weighted feature of the t-th frame image is S_t = Σ_{k=1}^{K} a_t^k · V_t^k; a_t^k is the scale attention weight of the k-th layer, with Σ_{k=1}^{K} a_t^k = 1; K is the total number of layers of the multi-layer hole convolution network, K ≥ 2;
(2-3) Pass the weighted feature representation S_t of the t-th frame image through a fully connected layer to obtain an output vector Z ∈ ℝ², which serves as the last layer of the layer scale attention action segment model; the Softmax(·) function then outputs the probability of whether the video frame belongs to an action frame, y_h = e^{Z_h} / Σ_q e^{Z_q}, h = 0, 1, where e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and Z_q denotes the q-th element of the vector Z; the probability that the n-th video frame belongs to an action frame is recorded as p_n. The cross-entropy loss function of the model is then computed as -Σ_{n=1}^{N} [ŷ^{(n)} log y_1^{(n)} + (1 - ŷ^{(n)}) log y_0^{(n)}], where ŷ^{(n)} is the true label, ŷ^{(n)} = 1 indicating that the frame is an action frame and ŷ^{(n)} = 0 indicating that the frame is a background frame. The layer scale attention action segment model is optimized and trained with a stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation.
Step (3), constructing a frame position attention action recognition model, inputting the weighted feature representation of a video clip frame image, and outputting the probability of the action category to which the video clip belongs; the method comprises the following steps:
(3-1) From the weighted feature representation S = {S_1, ..., S_N} of the complete video, use the start frame numbers and end frame numbers in the video clip position marks L to obtain the weighted feature representation of each video clip, S^m = {S_{s_m}, S_{s_m+1}, ..., S_{e_m}}, m = 1, ..., M;
(3-2) The frame position attention action recognition model takes a multi-layer neural network with a frame position attention mechanism as its backbone; its input is the weighted feature representation S^m of each frame of a video clip. The model obtains the weighted feature representation of the video clip by computing the frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t · S_t, where b_t is the position attention weight of the t-th frame and Σ_{t=s_m}^{e_m} b_t = 1;
(3-3) Pass the weighted feature representation X_m of the video clip through a fully connected layer to obtain an output vector Z' ∈ ℝ^{J+1}, which serves as the last layer of the frame position attention action recognition model; the Softmax(·) function then outputs the probability y_j = e^{Z'_j} / Σ_q e^{Z'_q} that the video clip belongs to action category j, j = 1, 2, ..., J, as well as the probability y_0 that it belongs to the background category. The cross-entropy loss of the model is then computed as -Σ_{j=0}^{J} ŷ_j log y_j, where ŷ_j is the true label, equal to 1 if the video clip belongs to category j and 0 otherwise. The frame position attention action recognition model is optimized and trained with a stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation.
Step (4), generating a video clip for the new video according to the layer scale attention motion clip model and the watershed algorithm, and judging the clip motion type by the frame position attention motion recognition model to obtain a motion detection result; the method comprises the following steps:
(4-1) For a new video, obtain its frame image sequence F = {f_1, f_2, ..., f_N} by (1-1) and input the sequence into the layer scale attention action segment model of step (2); compute through (2-3) the probability sequence P = {p_1, ..., p_N} of whether each video frame belongs to an action frame; then apply a watershed algorithm based on multi-level immersion to the probability sequence, that is, aggregate temporally consecutive video frames whose probability values are higher than a set threshold τ (τ = 0~1; in this embodiment, τ = 0.7) into video clips; by using several different thresholds in the range 0~1 at the same time, generate M' video clips of different lengths together with their start frame numbers s' and end frame numbers e';
(4-2) Input the video clip frame image sequences of (4-1) into the frame position attention action recognition model of step (3) to obtain the probability that the frame images in each video clip belong to each category, and take the category corresponding to the maximum probability value as the category c' of the video clip; output the start frame number and the end frame number of every video clip judged to be a specific action;
(4-3) Obtain the video clips of the new video through (4-1), and then obtain the video action detection result {(s'_{m'}, e'_{m'}, c'_{m'})}_{m'=1}^{M'} through (4-2), where m' is the index of the video clip, M' is the total number of detected action clips, s'_{m'} denotes the start frame number of the clip, e'_{m'} denotes the end frame number of the clip, and c'_{m'} denotes the action category of the clip.
The embodiment described above is only an example of an implementation of the inventive concept. The protection scope of the present invention should not be regarded as limited to the specific forms set forth in the embodiment; equivalent technical means that can be conceived by those skilled in the art according to the inventive concept also fall within the protection scope of the present invention.

Claims (6)

1. The video motion detection method based on the scale attention hole convolutional network is characterized by firstly acquiring a video data set and then performing the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark; the method comprises the following steps:
(1-1) processing a single video into a frame image sequence F = {f_1, f_2, ..., f_N} at a sampling rate of i frames per second, wherein N represents the total number of frame images, f_n represents the n-th RGB three-channel frame image in the sequence, of width w and height h, and n = 1, 2, ..., N;
(1-2) acquiring, according to the video clip position marks L = {(s_m, e_m, c_m)}_{m=1}^{M}, the video clips, which comprise action clips and background clips; wherein the category of a video clip is c_m ∈ {0, 1, 2, ..., J}, J is the number of action categories, the labels 1, 2, ..., J are action category labels and the label 0 is the background category label; M is the total number of action and background clips; for the m-th video clip, s_m is the start frame number of the clip, e_m is the end frame number of the clip, and c_m is the category corresponding to the clip, m = 1, 2, ..., M;
step (2), constructing a layer scale attention action segment model, inputting the frame image sequence of a complete video, and outputting the weighted feature representation of the complete video frame images and the probability of whether each frame is an action frame; the layer scale attention action segment model takes a multi-layer hole convolutional neural network that models the temporal relation as its backbone and obtains the weighted feature representation of the complete video through a layer scale attention mechanism, wherein the weighted feature of the t-th frame image is S_t = Σ_{k=1}^{K} a_t^k · V_t^k, a_t^k is the scale attention weight of the k-th layer, Σ_{k=1}^{K} a_t^k = 1, V_t^k represents the feature of the t-th frame image at the k-th layer, K is the total number of layers of the multi-layer hole convolution network, and K ≥ 2;
step (3), constructing a frame position attention action recognition model, inputting the weighted feature representation of the frame images of a video clip, and outputting the probability of the action category to which the video clip belongs; the frame position attention action recognition model takes a multi-layer neural network with a frame position attention mechanism as its backbone and obtains the weighted feature representation of the video clip by computing the frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t · S_t, wherein b_t is the position attention weight of the t-th frame and Σ_{t=s_m}^{e_m} b_t = 1;
and (4) generating a video segment for the new video according to the layer scale attention motion segment model and the watershed algorithm, and judging the segment motion type by the frame position attention motion recognition model to obtain a motion detection result.
2. The method for detecting video motion based on the scale attention hole convolutional network as claimed in claim 1, wherein the step (2) is specifically:
(2-1) processing the frame image sequence of the complete video frame by frame, obtaining the start and end frame numbers of the action clips and the background clips respectively by using the video clip position marks, marking the video frames inside action clips as action frames, and marking the video frames inside background clips as background frames;
(2-2) first, processing frame by frame from the lower layers to the higher layers, sequentially acquiring context feature representations at different scales for each frame of the frame image sequence F = {f_1, f_2, ..., f_N}, the feature representation of the t-th frame image at the k-th layer being V_t^k ∈ ℝ^{c_k×w_k×h_k}, wherein c_k is the number of channels of the k-th layer and w_k, h_k are respectively the width and height of the k-th layer feature representation; then obtaining the weighted feature representation S = {S_1, S_2, ..., S_N} of the complete video through the layer scale attention mechanism;
(2-3) passing the weighted feature representation S_t of the t-th frame image through a fully connected layer to obtain an output vector Z ∈ ℝ², which serves as the last layer of the layer scale attention action segment model; outputting through the Softmax(·) function the probability y_h = e^{Z_h} / Σ_q e^{Z_q}, h = 0, 1, of whether the video frame belongs to an action frame, wherein e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and Z_q denotes the q-th element of the vector Z; recording the probability that the n-th video frame belongs to an action frame as p_n; then computing the cross-entropy loss function of the model, -Σ_{n=1}^{N} [ŷ^{(n)} log y_1^{(n)} + (1 - ŷ^{(n)}) log y_0^{(n)}], wherein ŷ^{(n)} is the true label, ŷ^{(n)} = 1 indicating that the frame is an action frame and ŷ^{(n)} = 0 indicating that the frame is a background frame; and optimizing and training the layer scale attention action segment model with a stochastic gradient descent algorithm, the model parameters being updated through backward gradient propagation.
3. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 2, wherein the step (3) is specifically:
(3-1) from the weighted feature representation S = {S_1, ..., S_N} of the complete video, obtaining the weighted feature representation S^m = {S_{s_m}, S_{s_m+1}, ..., S_{e_m}} of each video clip by using the start frame numbers and the end frame numbers in the video clip position marks L;
(3-2) the input of the frame position attention action recognition model being the weighted feature representation S^m of each frame of a video clip, the model obtaining the weighted feature representation X_m = Σ_{t=s_m}^{e_m} b_t · S_t of the video clip by computing the frame position attention;
(3-3) passing the weighted feature representation X_m of the video clip through a fully connected layer to obtain an output vector Z' ∈ ℝ^{J+1}, which serves as the last layer of the frame position attention action recognition model; outputting through the Softmax(·) function the probability y_j = e^{Z'_j} / Σ_q e^{Z'_q}, j = 1, 2, ..., J, that the video clip belongs to action category j, as well as the probability y_0 that it belongs to the background category; then computing the cross-entropy loss of the model, -Σ_{j=0}^{J} ŷ_j log y_j, wherein ŷ_j is the true label, equal to 1 if the video clip belongs to category j and 0 otherwise; and optimizing and training the frame position attention action recognition model with a stochastic gradient descent algorithm, the model parameters being updated through backward gradient propagation.
4. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 3, wherein the step (4) is specifically:
(4-1) for a new video, obtaining its frame image sequence F = {f_1, f_2, ..., f_N} by (1-1), inputting the sequence into the layer scale attention action segment model of step (2), and computing through (2-3) the probability sequence P = {p_1, ..., p_N} of whether each video frame belongs to an action frame; then applying a watershed algorithm based on multi-level immersion to the probability sequence, namely aggregating temporally consecutive video frames whose probability values are higher than a set threshold τ into video clips; and simultaneously generating, by using a plurality of different threshold values in the range 0~1, M' video clips of different lengths together with their start frame numbers s' and end frame numbers e';
(4-2) inputting the video clip frame image sequences of (4-1) into the frame position attention action recognition model of step (3), obtaining the probability that the frame images in each video clip belong to each category, taking the category corresponding to the maximum probability value as the category c' of the video clip, and outputting the start frame number and the end frame number of every video clip judged to be a specific action;
(4-3) obtaining the video clips of the new video through (4-1), and then obtaining the video action detection result {(s'_{m'}, e'_{m'}, c'_{m'})}_{m'=1}^{M'} through (4-2), wherein m' is the index of the video clip, M' is the total number of detected action clips, s'_{m'} denotes the start frame number of the clip, e'_{m'} denotes the end frame number of the clip, and c'_{m'} denotes the action category of the clip.
5. The video motion detection method based on the scale attention hole convolutional network according to claim 1, wherein i = 20 to 40.
6. The video motion detection method based on the scale attention hole convolutional network according to claim 4, wherein τ is between 0 and 1.
CN202010252104.7A 2020-04-01 2020-04-01 Video motion detection method based on scale attention hole convolution network Active CN111611847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010252104.7A CN111611847B (en) 2020-04-01 2020-04-01 Video motion detection method based on scale attention hole convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010252104.7A CN111611847B (en) 2020-04-01 2020-04-01 Video motion detection method based on scale attention hole convolution network

Publications (2)

Publication Number Publication Date
CN111611847A CN111611847A (en) 2020-09-01
CN111611847B (en) 2021-04-30

Family

ID=72200342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010252104.7A Active CN111611847B (en) 2020-04-01 2020-04-01 Video motion detection method based on scale attention hole convolution network

Country Status (1)

Country Link
CN (1) CN111611847B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418012B (en) * 2020-11-09 2022-06-07 武汉大学 Video abstract generation method based on space-time attention model
CN112580557A (en) * 2020-12-25 2021-03-30 深圳市优必选科技股份有限公司 Behavior recognition method and device, terminal equipment and readable storage medium
CN112580577B (en) * 2020-12-28 2023-06-30 出门问问(苏州)信息科技有限公司 Training method and device for generating speaker image based on facial key points
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113408343B (en) * 2021-05-12 2022-05-13 杭州电子科技大学 Classroom action recognition method based on double-scale space-time block mutual attention
CN113204674B (en) * 2021-07-05 2021-09-17 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113822172B (en) * 2021-08-30 2024-06-14 中国科学院上海微系统与信息技术研究所 Video space-time behavior detection method
CN114926900B (en) * 2022-05-10 2023-06-16 电子科技大学 Human body action on-line detection method with separated front and back
CN114842559B (en) * 2022-06-29 2022-10-14 山东省人工智能研究院 Video interaction action detection method based on multi-mode time perception and attention
CN115834977B (en) * 2022-11-18 2023-09-08 贝壳找房(北京)科技有限公司 Video processing method, electronic device, storage medium and computer program product
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof
CN117630344B (en) * 2024-01-25 2024-04-05 西南科技大学 Method for detecting slump range of concrete on line in real time

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664931A (en) * 2018-05-11 2018-10-16 中国科学技术大学 A kind of multistage video actions detection method
CN108830212A (en) * 2018-06-12 2018-11-16 北京大学深圳研究生院 A kind of video behavior time shaft detection method
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN110738129A (en) * 2019-09-20 2020-01-31 华中科技大学 end-to-end video time sequence behavior detection method based on R-C3D network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288015B (en) * 2017-01-10 2021-10-22 武汉大学 Human body action recognition method and system in video based on time scale invariance
US11640710B2 (en) * 2017-11-14 2023-05-02 Google Llc Weakly-supervised action localization by sparse temporal pooling network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664931A (en) * 2018-05-11 2018-10-16 中国科学技术大学 A kind of multistage video actions detection method
CN108830212A (en) * 2018-06-12 2018-11-16 北京大学深圳研究生院 A kind of video behavior time shaft detection method
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN110738129A (en) * 2019-09-20 2020-01-31 华中科技大学 end-to-end video time sequence behavior detection method based on R-C3D network

Also Published As

Publication number Publication date
CN111611847A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN111611847B (en) Video motion detection method based on scale attention hole convolution network
CN109543667B (en) Text recognition method based on attention mechanism
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN107330362B (en) Video classification method based on space-time attention
CN108228915B (en) Video retrieval method based on deep learning
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
JP7097641B2 (en) Loop detection method based on convolution perception hash algorithm
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
US11640714B2 (en) Video panoptic segmentation
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN108399435B (en) Video classification method based on dynamic and static characteristics
CN110853074B (en) Video target detection network system for enhancing targets by utilizing optical flow
CN110516536A (en) A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN110163117B (en) Pedestrian re-identification method based on self-excitation discriminant feature learning
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
CN110827265B (en) Image anomaly detection method based on deep learning
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN111027377A (en) Double-flow neural network time sequence action positioning method
CN116311483B (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN111027555A (en) License plate recognition method and device and electronic equipment
CN116778346B (en) Pipeline identification method and system based on improved self-attention mechanism
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant