CN111611847B - Video motion detection method based on scale attention hole convolution network - Google Patents

Video motion detection method based on scale attention hole convolution network

Info

Publication number
CN111611847B
CN111611847B (application CN202010252104.7A)
Authority
CN
China
Prior art keywords
video
frame
action
motion
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010252104.7A
Other languages
Chinese (zh)
Other versions
CN111611847A (en)
Inventor
李平
曹佳晨
陈乐聪
徐向华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010252104.7A priority Critical patent/CN111611847B/en
Publication of CN111611847A publication Critical patent/CN111611847A/en
Application granted granted Critical
Publication of CN111611847B publication Critical patent/CN111611847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video motion detection method based on a scale attention hole (dilated) convolution network. The method first samples a video to obtain a frame image sequence and obtains video segments according to segment position marks; it then constructs a layer scale attention action segment model and a frame position attention action recognition model and, using these models together with a watershed algorithm, successively obtains the weighted feature representation of the frame images and the motion category of each video segment, thereby completing the video motion detection task. The hole convolution network extracts spatio-temporal motion information that better reflects the intrinsic structure of the temporal and spatial dimensions of the video data; the layer scale attention more appropriately describes how the internal association of the temporal context of a video frame changes with scale; and the designed frame position attention mechanism assigns the video frames of an action segment weights that more accurately represent the key content of the segment, thereby improving both the precision and the efficiency of video motion detection.

Description

Video motion detection method based on scale attention hole convolution network
Technical Field
The invention belongs to the technical field of video analysis, in particular to the technical field of temporal action detection, and relates to a video action detection method based on a scale attention hole convolution network.
Background
Understanding human action videos plays an important role in many fields such as security monitoring and behavior analysis, and has become a frontier research topic in computer vision. However, unclipped real-world videos often contain background segments unrelated to human actions, which hinders a correct understanding of the video content. To address this problem, video action detection methods not only classify the actions within a video but also locate the start and end times of each action instance occurring in the video. A video action detection task generally takes a video frame sequence as input and outputs detection results for multiple segments in the form "action category - start frame - end frame"; its processing can be divided into two stages: action segment generation and action segment recognition. The former generally outputs the start frame and end frame of a segment, and the latter outputs the action category of the segment. In general, video action detection helps to better understand video content and supports tasks such as video summarization, action recognition, content annotation and event capture. For example, in a video summarization task, key segments can be obtained through video action detection, so that the key frames or segments that best reflect the video content are accurately located, which improves the quality of the summary.
Video action detection processes video frame images and needs to describe the temporal relation between frames, which involves high-dimensional tensor computation. Traditional machine learning methods rely on hand-crafted features, such as trajectory features; their extraction efficiency cannot meet real-time requirements, and since feature extraction is separated from model training, the generalization ability of the model is weak. In recent years, convolutional neural networks (CNNs) trained end to end have developed rapidly and can compensate for these shortcomings. For example: temporal action features based on three-dimensional convolutional neural networks or optical-flow information can be extracted more efficiently; candidate segment generation based on deep reinforcement learning can adaptively complete the action segment generation task end to end; and temporal action localization networks introduce a multi-scale parallel action segment generation structure to handle action segments of different lengths, substantially advancing the state of the art in this field.
The existing video action detection methods mainly have the following shortcomings. First, in the feature extraction stage, the three-dimensional convolution operations used to extract temporal action features reduce the temporal dimension of the input video layer by layer in a fixed manner, which constrains the temporal size of the extracted features: a scale that is too small splits the context semantics, while a scale that is too large introduces interference from different semantics. Second, in the action segment generation stage, for actions of different durations, the key cues for deciding whether an action occurs and which type it is, namely the positions and durations of the key frames (for example, runs of consecutive key frames), usually differ, yet the conventional average pooling operation ignores the weighting of these key cues. Third, existing methods extract feature representations of action segments of different sizes with different network structures (such as separate hole convolution networks), which greatly increases the time and space cost of network construction and training. It is therefore desirable to design a method that improves video action detection performance while saving computation and storage overhead.
Disclosure of Invention
The invention aims to provide, against the defects of the prior art, a video action detection method based on a scale attention hole convolutional network. The method combines a hole convolutional network to capture the spatio-temporal action information of the video data and accurately describes the temporal context relation of video frames through scale attention, so that action segments in a video can be detected effectively and their categories judged accurately.
The method firstly acquires a video data set, and then performs the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark;
step (2), constructing a layer scale attention action fragment model, inputting a frame image sequence of a complete video, and outputting a weighted feature representation of the complete video frame image and the probability of whether each frame is an action frame;
step (3), constructing a frame position attention action recognition model, inputting the weighted feature representation of a video clip frame image, and outputting the probability of the action category to which the video clip belongs;
and (4) generating a video segment for the new video according to the layer scale attention motion segment model and the watershed algorithm, and judging the segment motion type by the frame position attention motion recognition model to obtain a motion detection result.
Further, the step (1) is specifically:
(1-1) Process a single video into a frame image sequence F = {f_1, f_2, ..., f_N} at a sampling rate of i frames per second, where N is the total number of frame images, f_n is the n-th RGB three-channel frame image in the sequence, of width w and height h, n = 1, 2, ..., N, and i = 20~40;
(1-2) According to the video clip position marks L = {(s_m, e_m, c_m)}_{m=1}^{M}, obtain the video clips, which comprise action clips and background clips; here the category of a video clip is c_m ∈ {0, 1, 2, ..., J}, where J is the number of action categories, the labels 1, 2, ..., J are action category labels and the label 0 is the background category label; M is the total number of action and background clips; for the m-th video clip, s_m is the start frame number of the clip, e_m is the end frame number of the clip, and c_m is the category corresponding to the clip, m = 1, 2, ..., M.
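As an illustration of step (1), the following Python sketch decodes a video at roughly i frames per second and slices it into labelled clips. It is a minimal sketch, not part of the patent: OpenCV is assumed for decoding, and the helper names (sample_frames, cut_clips) and the example position marks are purely illustrative.

```python
import cv2  # OpenCV, assumed available for video decoding

def sample_frames(video_path, i=30):
    """Decode a video into an RGB frame sequence F = {f_1, ..., f_N} at roughly i frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or i
    step = max(int(round(native_fps / i)), 1)    # keep every `step`-th decoded frame
    frames, idx = [], 0
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))  # f_n: h x w x 3 RGB image
        idx += 1
    cap.release()
    return frames  # length N

# Illustrative position marks L = {(s_m, e_m, c_m)}: start frame, end frame, category (0 = background)
position_marks = [(0, 45, 0), (46, 130, 3), (131, 170, 0)]

def cut_clips(frames, marks):
    """Slice the frame sequence into action/background clips according to the position marks."""
    return [(frames[s:e + 1], c) for (s, e, c) in marks]
```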
Still further, the step (2) is specifically:
(2-1) Process the frame image sequence of the complete video frame by frame, use the video clip position marks to obtain the start and end frame numbers of the action clips and the background clips respectively, mark the video frames inside action clips as action frames, and mark the video frames inside background clips as background frames;
(2-2) The layer scale attention action segment model takes a multi-layer hole convolutional neural network that models the temporal relation as its backbone. First, processing frame by frame from the lower layers to the higher layers, it sequentially obtains context feature representations at different scales for each frame of the sequence F = {f_1, f_2, ..., f_N}; the feature representation of the t-th frame image at the k-th layer is V_t^k ∈ ℝ^{c_k×w_k×h_k}, where c_k is the number of channels of the k-th layer and w_k, h_k are respectively the width and height of the k-th layer feature representation. A weighted feature representation of the complete video S = {S_1, S_2, ..., S_N} is then obtained through the layer scale attention mechanism, where the weighted feature of the t-th frame image is S_t = Σ_{k=1}^{K} a_t^k · V_t^k; a_t^k is the scale attention weight of the k-th layer, with Σ_{k=1}^{K} a_t^k = 1; K is the total number of layers of the multi-layer hole convolution network, K ≥ 2;
(2-3) Pass the weighted feature representation S_t of the t-th frame image through a fully connected layer to obtain an output vector Z ∈ ℝ², which serves as the last layer of the layer scale attention action segment model; the Softmax(·) function then outputs the probability of whether the video frame belongs to an action frame, y_h = e^{Z_h} / Σ_q e^{Z_q}, h = 0, 1, where e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and Z_q denotes the q-th element of the vector Z; the probability that the n-th video frame belongs to an action frame is recorded as p_n. The cross-entropy loss function of the model is then computed as -Σ_{n=1}^{N} [ŷ^{(n)} log y_1^{(n)} + (1 - ŷ^{(n)}) log y_0^{(n)}], where ŷ^{(n)} is the true label, ŷ^{(n)} = 1 indicating that the frame is an action frame and ŷ^{(n)} = 0 indicating that the frame is a background frame. The layer scale attention action segment model is optimized and trained with a stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation.
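A hedged sketch of the training objective in (2-3): a fully connected layer maps each weighted frame feature S_t to Z ∈ ℝ², and Softmax plus cross-entropy is optimized with stochastic gradient descent. The feature dimension (256), learning rate and momentum are placeholders; in the patent this head would be trained jointly with the model of (2-2).

```python
import torch
import torch.nn as nn

frame_head = nn.Linear(256, 2)                 # last layer: Z in R^2 per frame (256 = assumed C)
criterion  = nn.CrossEntropyLoss()             # combines Softmax and the cross-entropy loss
optimizer  = torch.optim.SGD(frame_head.parameters(), lr=0.01, momentum=0.9)

def frame_train_step(S, labels):
    """S: weighted frame features (N, 256); labels: (N,) with 1 = action frame, 0 = background frame."""
    Z = frame_head(S)                          # (N, 2) logits
    loss = criterion(Z, labels)                # cross-entropy over all N frames
    optimizer.zero_grad()
    loss.backward()                            # back-propagate gradients
    optimizer.step()                           # stochastic gradient descent update
    return loss.item()

def action_frame_probability(S):
    """p_n = Softmax(Z)[:, 1], the probability that each frame is an action frame."""
    with torch.no_grad():
        return torch.softmax(frame_head(S), dim=1)[:, 1]
```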
Further, the step (3) is specifically:
(3-1) From the weighted feature representation S = {S_1, ..., S_N} of the complete video, use the start frame numbers and end frame numbers in the video clip position marks L to obtain the weighted feature representation of each video clip, S^m = {S_{s_m}, S_{s_m+1}, ..., S_{e_m}}, m = 1, ..., M;
(3-2) The frame position attention action recognition model takes a multi-layer neural network with a frame position attention mechanism as its backbone; its input is the weighted feature representation S^m of each frame of a video clip. The model obtains the weighted feature representation of the video clip by computing the frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t · S_t, where b_t is the position attention weight of the t-th frame and Σ_{t=s_m}^{e_m} b_t = 1;
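A possible reading of the frame position attention of (3-2) in PyTorch. The scoring function that produces the position weights b_t (a linear layer followed by a softmax over the frames of the clip) is an assumption; the patent only requires that the b_t sum to 1 and weight the frame features into a single clip feature X_m.

```python
import torch
import torch.nn as nn

class FramePositionAttention(nn.Module):
    """Pools the per-frame features of one clip into a single clip feature X_m = sum_t b_t * S_t."""
    def __init__(self, channels=256):
        super().__init__()
        self.score = nn.Linear(channels, 1)        # assumed scorer for the position weights b_t

    def forward(self, clip_feats):                  # clip_feats: (T, C), frames s_m .. e_m
        b = torch.softmax(self.score(clip_feats).squeeze(-1), dim=0)  # (T,), sums to 1
        return (b.unsqueeze(-1) * clip_feats).sum(dim=0)              # X_m: (C,)
```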
(3-3) Pass the weighted feature representation X_m of the video clip through a fully connected layer to obtain an output vector Z' ∈ ℝ^{J+1}, which serves as the last layer of the frame position attention action recognition model; the Softmax(·) function then outputs the probability y_j = e^{Z'_j} / Σ_q e^{Z'_q} that the video clip belongs to action category j, j = 1, 2, ..., J, as well as the probability y_0 that it belongs to the background category. The cross-entropy loss of the model is then computed as -Σ_{j=0}^{J} ŷ_j log y_j, where ŷ_j is the true label, equal to 1 if the video clip belongs to category j and 0 otherwise. The frame position attention action recognition model is optimized and trained with a stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation.
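The clip-level head of (3-3) mirrors the frame-level head sketched after (2-3), now with J+1 outputs (index 0 for the background category). The value of J, the feature dimension and the optimizer settings below are placeholders.

```python
import torch
import torch.nn as nn

J = 20                                          # assumed number of action categories
clip_head = nn.Linear(256, J + 1)               # last layer: Z' in R^(J+1); index 0 = background
criterion = nn.CrossEntropyLoss()               # Softmax plus cross-entropy loss
optimizer = torch.optim.SGD(clip_head.parameters(), lr=0.01, momentum=0.9)

def clip_train_step(X, category):
    """X: clip features (B, 256) from the frame position attention; category: (B,) in {0, ..., J}."""
    loss = criterion(clip_head(X), category)
    optimizer.zero_grad()
    loss.backward()                             # back-propagate gradients
    optimizer.step()                            # stochastic gradient descent update
    return loss.item()
```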
Still further, the step (4) is specifically:
(4-1) For a new video, obtain its frame image sequence F = {f_1, f_2, ..., f_N} by (1-1) and input the sequence into the layer scale attention action segment model of step (2); compute through (2-3) the probability sequence P = {p_1, ..., p_N} of whether each video frame belongs to an action frame; then apply a watershed algorithm based on multi-level immersion to the probability sequence, that is, aggregate temporally consecutive video frames whose probability values are higher than a set threshold τ (τ = 0~1) into video clips; by using several different thresholds in the range 0~1 at the same time, generate M' video clips of different lengths together with their start frame numbers s' and end frame numbers e';
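The multi-level immersion watershed of (4-1) can be approximated, for illustration, by thresholding the frame probability sequence at several levels and merging runs of consecutive above-threshold frames into candidate clips. The function name and the de-duplication via a set are assumptions; a full watershed implementation would grow segments level by level rather than independently per threshold.

```python
def generate_segments(probs, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """probs: list of per-frame action probabilities p_1..p_N.
    Returns candidate clips as (start_frame, end_frame) pairs, one pass per threshold tau."""
    segments = set()
    for tau in thresholds:                        # several immersion levels in (0, 1)
        start = None
        for n, p in enumerate(probs):
            if p > tau and start is None:         # a run of consecutive action frames begins
                start = n
            elif p <= tau and start is not None:  # the run ends: emit one candidate clip
                segments.add((start, n - 1))
                start = None
        if start is not None:                     # run reaches the last frame
            segments.add((start, len(probs) - 1))
    return sorted(segments)                       # M' clips with start s' and end e'
```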
(4-2) Input the video clip frame image sequences of (4-1) into the frame position attention action recognition model of step (3) to obtain the probability that the frame images in each video clip belong to each category, and take the category corresponding to the maximum probability value as the category c' of the video clip; output the start frame number and the end frame number of every video clip judged to be a specific action;
(4-3) Obtain the video clips of the new video through (4-1), and then obtain the video action detection result {(s'_{m'}, e'_{m'}, c'_{m'})}_{m'=1}^{M'} through (4-2), where m' is the index of the video clip, M' is the total number of detected action clips, s'_{m'} denotes the start frame number of the clip, e'_{m'} denotes the end frame number of the clip, and c'_{m'} denotes the action category of the clip.
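Putting (4-1) and (4-2) together, a minimal inference driver could look as follows. classify_clip stands in for the frame position attention action recognition model and clip_features for the per-clip feature extraction; both are hypothetical names, and generate_segments refers to the sketch given after (4-1).

```python
def detect_actions(probs, clip_features, classify_clip):
    """probs: per-frame action probabilities; clip_features(s, e) -> clip feature;
    classify_clip(feature) -> vector of J+1 class probabilities (index 0 = background)."""
    results = []
    for s, e in generate_segments(probs):            # candidate clips from (4-1)
        class_probs = classify_clip(clip_features(s, e))
        c = int(max(range(len(class_probs)), key=lambda j: class_probs[j]))
        if c != 0:                                   # discard clips recognised as background
            results.append((s, e, c))                # (s'_{m'}, e'_{m'}, c'_{m'})
    return results
```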
The method of the invention uses a scale attention hole convolution network for video action detection and differs from existing methods in the following aspects: 1) compared with temporal action localization networks that use a multi-scale parallel structure, the method uses hole convolution layers in a multi-layer serial structure, which extracts multi-scale context features while reducing the redundancy of the network structure; 2) methods built on a three-dimensional convolutional neural network backbone usually extract temporally down-sampled information, whereas the proposed method uses a hole convolutional neural network to extract context features at the fine granularity of the original video frames; 3) the method combines an attention mechanism from the two angles of scale and position to better extract the temporal feature information corresponding to video frames and video clips; 4) in the action segment generation stage, action segments can be generated in parallel with a watershed algorithm based on multi-level immersion, whose execution efficiency is higher than that of many existing methods.
The invention is suitable for video action detection tasks based on deep learning methods, and its main advantages are: 1) by combining the hole convolution network, it extracts spatio-temporal action information that better reflects the intrinsic structure of the temporal and spatial dimensions of the video data while preserving frame-level fine granularity in the features; 2) by using the layer scale attention mechanism to vary the scale characterizing the temporal context of the current frame, it selects an appropriate feature representation for each frame; 3) by using the frame position attention mechanism, it assigns weights to the video frames within each action segment so that their features accurately characterize the segment content. The method provides a scientific and reasonable scheme for improving the performance of video action detection from multiple angles, and can be widely applied in practical scenarios such as security monitoring, behavior analysis, video summarization and event detection.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A video motion detection method based on a scale attention hole convolutional network first samples a video to obtain a frame image sequence and obtains video segments according to the action segment marks; it then constructs a layer scale attention action segment model and a frame position attention action recognition model, and finally judges the action category of each video segment in combination with a watershed algorithm. The method uses the hole convolution network to capture the spatio-temporal action information of the video data more accurately, uses the layer scale attention mechanism to describe the temporal context relation of video frames, and learns appropriate weights for the video frames of an action segment through the frame position attention mechanism so as to better reflect the content of the action segment. A video action detection system constructed in this way can effectively extract the temporal features of video frame images and video clips and effectively detect the action categories in a video.
As shown in fig. 1, the method first obtains a video data set, and then performs the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark; the method comprises the following steps:
(1-1) Process a single video into a frame image sequence F = {f_1, f_2, ..., f_N} at a sampling rate of i frames per second, where N is the total number of frame images, f_n is the n-th RGB three-channel frame image in the sequence, of width w and height h, n = 1, 2, ..., N, and i = 20~40; in this embodiment, i = 30;
(1-2) According to the video clip position marks L = {(s_m, e_m, c_m)}_{m=1}^{M}, obtain the video clips, which comprise action clips and background clips; here the category of a video clip is c_m ∈ {0, 1, 2, ..., J}, where J is the number of action categories, the labels 1, 2, ..., J are action category labels and the label 0 is the background category label; M is the total number of action and background clips; for the m-th video clip, s_m is the start frame number of the clip, e_m is the end frame number of the clip, and c_m is the category corresponding to the clip, m = 1, 2, ..., M.
Step (2), constructing a layer scale attention action fragment model, inputting a frame image sequence of a complete video, and outputting a weighted feature representation of the complete video frame image and the probability of whether each frame is an action frame; the method comprises the following steps:
(2-1) Process the frame image sequence of the complete video frame by frame, use the video clip position marks to obtain the start and end frame numbers of the action clips and the background clips respectively, mark the video frames inside action clips as action frames, and mark the video frames inside background clips as background frames;
(2-2) The layer scale attention action segment model takes a multi-layer hole convolutional neural network that models the temporal relation as its backbone. First, processing frame by frame from the lower layers to the higher layers, it sequentially obtains context feature representations at different scales for each frame of the sequence F = {f_1, f_2, ..., f_N}; the feature representation of the t-th frame image at the k-th layer is V_t^k ∈ ℝ^{c_k×w_k×h_k}, where c_k is the number of channels of the k-th layer and w_k, h_k are respectively the width and height of the k-th layer feature representation. A weighted feature representation of the complete video S = {S_1, S_2, ..., S_N} is then obtained through the layer scale attention mechanism, where the weighted feature of the t-th frame image is S_t = Σ_{k=1}^{K} a_t^k · V_t^k; a_t^k is the scale attention weight of the k-th layer, with Σ_{k=1}^{K} a_t^k = 1; K is the total number of layers of the multi-layer hole convolution network, K ≥ 2;
(2-3) Pass the weighted feature representation S_t of the t-th frame image through a fully connected layer to obtain an output vector Z ∈ ℝ², which serves as the last layer of the layer scale attention action segment model; the Softmax(·) function then outputs the probability of whether the video frame belongs to an action frame, y_h = e^{Z_h} / Σ_q e^{Z_q}, h = 0, 1, where e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and Z_q denotes the q-th element of the vector Z; the probability that the n-th video frame belongs to an action frame is recorded as p_n. The cross-entropy loss function of the model is then computed as -Σ_{n=1}^{N} [ŷ^{(n)} log y_1^{(n)} + (1 - ŷ^{(n)}) log y_0^{(n)}], where ŷ^{(n)} is the true label, ŷ^{(n)} = 1 indicating that the frame is an action frame and ŷ^{(n)} = 0 indicating that the frame is a background frame. The layer scale attention action segment model is optimized and trained with a stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation.
Step (3), constructing a frame position attention action recognition model, inputting the weighted feature representation of a video clip frame image, and outputting the probability of the action category to which the video clip belongs; the method comprises the following steps:
(3-1) From the weighted feature representation S = {S_1, ..., S_N} of the complete video, use the start frame numbers and end frame numbers in the video clip position marks L to obtain the weighted feature representation of each video clip, S^m = {S_{s_m}, S_{s_m+1}, ..., S_{e_m}}, m = 1, ..., M;
(3-2) The frame position attention action recognition model takes a multi-layer neural network with a frame position attention mechanism as its backbone; its input is the weighted feature representation S^m of each frame of a video clip. The model obtains the weighted feature representation of the video clip by computing the frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t · S_t, where b_t is the position attention weight of the t-th frame and Σ_{t=s_m}^{e_m} b_t = 1;
(3-3) Pass the weighted feature representation X_m of the video clip through a fully connected layer to obtain an output vector Z' ∈ ℝ^{J+1}, which serves as the last layer of the frame position attention action recognition model; the Softmax(·) function then outputs the probability y_j = e^{Z'_j} / Σ_q e^{Z'_q} that the video clip belongs to action category j, j = 1, 2, ..., J, as well as the probability y_0 that it belongs to the background category. The cross-entropy loss of the model is then computed as -Σ_{j=0}^{J} ŷ_j log y_j, where ŷ_j is the true label, equal to 1 if the video clip belongs to category j and 0 otherwise. The frame position attention action recognition model is optimized and trained with a stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation.
Step (4), generating a video clip for the new video according to the layer scale attention motion clip model and the watershed algorithm, and judging the clip motion type by the frame position attention motion recognition model to obtain a motion detection result; the method comprises the following steps:
(4-1) For a new video, obtain its frame image sequence F = {f_1, f_2, ..., f_N} by (1-1) and input the sequence into the layer scale attention action segment model of step (2); compute through (2-3) the probability sequence P = {p_1, ..., p_N} of whether each video frame belongs to an action frame; then apply a watershed algorithm based on multi-level immersion to the probability sequence, that is, aggregate temporally consecutive video frames whose probability values are higher than a set threshold τ (τ = 0~1; in this embodiment, τ = 0.7) into video clips; by using several different thresholds in the range 0~1 at the same time, generate M' video clips of different lengths together with their start frame numbers s' and end frame numbers e';
(4-2) Input the video clip frame image sequences of (4-1) into the frame position attention action recognition model of step (3) to obtain the probability that the frame images in each video clip belong to each category, and take the category corresponding to the maximum probability value as the category c' of the video clip; output the start frame number and the end frame number of every video clip judged to be a specific action;
(4-3) Obtain the video clips of the new video through (4-1), and then obtain the video action detection result {(s'_{m'}, e'_{m'}, c'_{m'})}_{m'=1}^{M'} through (4-2), where m' is the index of the video clip, M' is the total number of detected action clips, s'_{m'} denotes the start frame number of the clip, e'_{m'} denotes the end frame number of the clip, and c'_{m'} denotes the action category of the clip.
The embodiment described above is only an example of an implementation of the inventive concept. The protection scope of the present invention should not be regarded as limited to the specific forms set forth in the embodiment; equivalent technical means that can be conceived by those skilled in the art according to the inventive concept also fall within the protection scope of the present invention.

Claims (6)

1. The video motion detection method based on the scale attention hole convolutional network is characterized by firstly acquiring a video data set and then performing the following operations:
step (1), video sampling is carried out, a frame image sequence is obtained, and a video segment is obtained according to a segment position mark; the method comprises the following steps:
(1-1) processing a single video into a frame image sequence F = {f_1, f_2, ..., f_N} at a sampling rate of i frames per second, wherein N represents the total number of frame images, f_n represents the n-th RGB three-channel frame image in the sequence, of width w and height h, and n = 1, 2, ..., N;
(1-2) acquiring, according to the video clip position marks L = {(s_m, e_m, c_m)}_{m=1}^{M}, the video clips, which comprise action clips and background clips; wherein the category of a video clip is c_m ∈ {0, 1, 2, ..., J}, J is the number of action categories, the labels 1, 2, ..., J are action category labels and the label 0 is the background category label; M is the total number of action and background clips; for the m-th video clip, s_m is the start frame number of the clip, e_m is the end frame number of the clip, and c_m is the category corresponding to the clip, m = 1, 2, ..., M;
step (2), constructing a layer scale attention action segment model, inputting the frame image sequence of a complete video, and outputting the weighted feature representation of the complete video frame images and the probability of whether each frame is an action frame; the layer scale attention action segment model takes a multi-layer hole convolutional neural network that models the temporal relation as its backbone and obtains the weighted feature representation of the complete video through a layer scale attention mechanism, wherein the weighted feature of the t-th frame image is S_t = Σ_{k=1}^{K} a_t^k · V_t^k, a_t^k is the scale attention weight of the k-th layer, Σ_{k=1}^{K} a_t^k = 1, V_t^k represents the feature of the t-th frame image at the k-th layer, K is the total number of layers of the multi-layer hole convolution network, and K ≥ 2;
step (3), constructing a frame position attention action recognition model, inputting the weighted feature representation of the frame images of a video clip, and outputting the probability of the action category to which the video clip belongs; the frame position attention action recognition model takes a multi-layer neural network with a frame position attention mechanism as its backbone and obtains the weighted feature representation of the video clip by computing the frame position attention, X_m = Σ_{t=s_m}^{e_m} b_t · S_t, wherein b_t is the position attention weight of the t-th frame and Σ_{t=s_m}^{e_m} b_t = 1;
and (4) generating a video segment for the new video according to the layer scale attention motion segment model and the watershed algorithm, and judging the segment motion type by the frame position attention motion recognition model to obtain a motion detection result.
2. The method for detecting video motion based on the scale attention hole convolutional network as claimed in claim 1, wherein the step (2) is specifically:
(2-1) processing the frame image sequence of the complete video frame by frame, obtaining the start and end frame numbers of the action clips and the background clips respectively by using the video clip position marks, marking the video frames inside action clips as action frames, and marking the video frames inside background clips as background frames;
(2-2) first, processing frame by frame from the lower layers to the higher layers, sequentially acquiring context feature representations at different scales for each frame of the frame image sequence F = {f_1, f_2, ..., f_N}, the feature representation of the t-th frame image at the k-th layer being V_t^k ∈ ℝ^{c_k×w_k×h_k}, wherein c_k is the number of channels of the k-th layer and w_k, h_k are respectively the width and height of the k-th layer feature representation; then obtaining the weighted feature representation S = {S_1, S_2, ..., S_N} of the complete video through the layer scale attention mechanism;
(2-3) passing the weighted feature representation S_t of the t-th frame image through a fully connected layer to obtain an output vector Z ∈ ℝ², which serves as the last layer of the layer scale attention action segment model; outputting through the Softmax(·) function the probability y_h = e^{Z_h} / Σ_q e^{Z_q}, h = 0, 1, of whether the video frame belongs to an action frame, wherein e denotes the natural base, y_0 is the probability of a background frame, y_1 is the probability of an action frame, and Z_q denotes the q-th element of the vector Z; recording the probability that the n-th video frame belongs to an action frame as p_n; then computing the cross-entropy loss function of the model, -Σ_{n=1}^{N} [ŷ^{(n)} log y_1^{(n)} + (1 - ŷ^{(n)}) log y_0^{(n)}], wherein ŷ^{(n)} is the true label, ŷ^{(n)} = 1 indicating that the frame is an action frame and ŷ^{(n)} = 0 indicating that the frame is a background frame; and optimizing and training the layer scale attention action segment model with a stochastic gradient descent algorithm, the model parameters being updated through backward gradient propagation.
3. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 2, wherein the step (3) is specifically:
(3-1) from the weighted feature representation S = {S_1, ..., S_N} of the complete video, obtaining the weighted feature representation S^m = {S_{s_m}, S_{s_m+1}, ..., S_{e_m}} of each video clip by using the start frame numbers and the end frame numbers in the video clip position marks L;
(3-2) the input of the frame position attention action recognition model being the weighted feature representation S^m of each frame of a video clip, the model obtaining the weighted feature representation X_m = Σ_{t=s_m}^{e_m} b_t · S_t of the video clip by computing the frame position attention;
(3-3) passing the weighted feature representation X_m of the video clip through a fully connected layer to obtain an output vector Z' ∈ ℝ^{J+1}, which serves as the last layer of the frame position attention action recognition model; outputting through the Softmax(·) function the probability y_j = e^{Z'_j} / Σ_q e^{Z'_q}, j = 1, 2, ..., J, that the video clip belongs to action category j, as well as the probability y_0 that it belongs to the background category; then computing the cross-entropy loss of the model, -Σ_{j=0}^{J} ŷ_j log y_j, wherein ŷ_j is the true label, equal to 1 if the video clip belongs to category j and 0 otherwise; and optimizing and training the frame position attention action recognition model with a stochastic gradient descent algorithm, the model parameters being updated through backward gradient propagation.
4. The video motion detection method based on the scale attention hole convolutional network as claimed in claim 3, wherein the step (4) is specifically:
(4-1) for a new video, obtaining its frame image sequence F = {f_1, f_2, ..., f_N} by (1-1), inputting the sequence into the layer scale attention action segment model of step (2), and computing through (2-3) the probability sequence P = {p_1, ..., p_N} of whether each video frame belongs to an action frame; then applying a watershed algorithm based on multi-level immersion to the probability sequence, namely aggregating temporally consecutive video frames whose probability values are higher than a set threshold τ into video clips; and simultaneously generating, by using a plurality of different threshold values in the range 0~1, M' video clips of different lengths together with their start frame numbers s' and end frame numbers e';
(4-2) inputting the video clip frame image sequences of (4-1) into the frame position attention action recognition model of step (3), obtaining the probability that the frame images in each video clip belong to each category, taking the category corresponding to the maximum probability value as the category c' of the video clip, and outputting the start frame number and the end frame number of every video clip judged to be a specific action;
(4-3) obtaining the video clips of the new video through (4-1), and then obtaining the video action detection result {(s'_{m'}, e'_{m'}, c'_{m'})}_{m'=1}^{M'} through (4-2), wherein m' is the index of the video clip, M' is the total number of detected action clips, s'_{m'} denotes the start frame number of the clip, e'_{m'} denotes the end frame number of the clip, and c'_{m'} denotes the action category of the clip.
5. The video motion detection method based on the scale attention hole convolutional network according to claim 1, wherein i = 20 to 40.
6. The video motion detection method based on the scale attention hole convolutional network according to claim 4, wherein τ is between 0 and 1.
CN202010252104.7A 2020-04-01 2020-04-01 Video motion detection method based on scale attention hole convolution network Active CN111611847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010252104.7A CN111611847B (en) 2020-04-01 2020-04-01 Video motion detection method based on scale attention hole convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010252104.7A CN111611847B (en) 2020-04-01 2020-04-01 Video motion detection method based on scale attention hole convolution network

Publications (2)

Publication Number Publication Date
CN111611847A CN111611847A (en) 2020-09-01
CN111611847B (en) 2021-04-30

Family

ID=72200342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010252104.7A Active CN111611847B (en) 2020-04-01 2020-04-01 Video motion detection method based on scale attention hole convolution network

Country Status (1)

Country Link
CN (1) CN111611847B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418012B (en) * 2020-11-09 2022-06-07 武汉大学 Video abstract generation method based on space-time attention model
CN112580557A (en) * 2020-12-25 2021-03-30 深圳市优必选科技股份有限公司 Behavior recognition method and device, terminal equipment and readable storage medium
CN112580577B (en) * 2020-12-28 2023-06-30 出门问问(苏州)信息科技有限公司 Training method and device for generating speaker image based on facial key points
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113408343B (en) * 2021-05-12 2022-05-13 杭州电子科技大学 Classroom action recognition method based on double-scale space-time block mutual attention
CN113204674B (en) * 2021-07-05 2021-09-17 杭州一知智能科技有限公司 Video-paragraph retrieval method and system based on local-overall graph inference network
CN113822172B (en) * 2021-08-30 2024-06-14 中国科学院上海微系统与信息技术研究所 Video space-time behavior detection method
CN114926900B (en) * 2022-05-10 2023-06-16 电子科技大学 Human body action on-line detection method with separated front and back
CN114842559B (en) * 2022-06-29 2022-10-14 山东省人工智能研究院 Video interaction action detection method based on multi-mode time perception and attention
CN115834977B (en) * 2022-11-18 2023-09-08 贝壳找房(北京)科技有限公司 Video processing method, electronic device, storage medium and computer program product
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof
CN117630344B (en) * 2024-01-25 2024-04-05 西南科技大学 Method for detecting slump range of concrete on line in real time

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664931A (en) * 2018-05-11 2018-10-16 中国科学技术大学 A kind of multistage video actions detection method
CN108830212A (en) * 2018-06-12 2018-11-16 北京大学深圳研究生院 A kind of video behavior time shaft detection method
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN110738129A (en) * 2019-09-20 2020-01-31 华中科技大学 end-to-end video time sequence behavior detection method based on R-C3D network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288015B (en) * 2017-01-10 2021-10-22 武汉大学 Human body action recognition method and system in video based on time scale invariance
US11640710B2 (en) * 2017-11-14 2023-05-02 Google Llc Weakly-supervised action localization by sparse temporal pooling network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664931A (en) * 2018-05-11 2018-10-16 中国科学技术大学 A kind of multistage video actions detection method
CN108830212A (en) * 2018-06-12 2018-11-16 北京大学深圳研究生院 A kind of video behavior time shaft detection method
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN110738129A (en) * 2019-09-20 2020-01-31 华中科技大学 end-to-end video time sequence behavior detection method based on R-C3D network

Also Published As

Publication number Publication date
CN111611847A (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN111611847B (en) Video motion detection method based on scale attention hole convolution network
CN109543667B (en) Text recognition method based on attention mechanism
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN107330362B (en) Video classification method based on space-time attention
CN108228915B (en) Video retrieval method based on deep learning
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
JP7097641B2 (en) Loop detection method based on convolution perception hash algorithm
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
US11640714B2 (en) Video panoptic segmentation
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN108399435B (en) Video classification method based on dynamic and static characteristics
CN110853074B (en) Video target detection network system for enhancing targets by utilizing optical flow
CN110516536A (en) A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN110163117B (en) Pedestrian re-identification method based on self-excitation discriminant feature learning
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
CN110827265B (en) Image anomaly detection method based on deep learning
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN111027377A (en) Double-flow neural network time sequence action positioning method
CN116311483B (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN111027555A (en) License plate recognition method and device and electronic equipment
CN116778346B (en) Pipeline identification method and system based on improved self-attention mechanism
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant