CN112861758A - Behavior identification method based on weak supervised learning video segmentation - Google Patents

Behavior identification method based on weak supervised learning video segmentation

Info

Publication number
CN112861758A
CN112861758A
Authority
CN
China
Prior art keywords
video
frame
length
segment
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110207458.4A
Other languages
Chinese (zh)
Other versions
CN112861758B (en)
Inventor
李策
盛龙帅
姜中博
李欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology Beijing CUMTB
Original Assignee
China University of Mining and Technology Beijing CUMTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology Beijing CUMTB filed Critical China University of Mining and Technology Beijing CUMTB
Priority to CN202110207458.4A priority Critical patent/CN112861758B/en
Publication of CN112861758A publication Critical patent/CN112861758A/en
Priority to NL2029182A priority patent/NL2029182B1/en
Application granted granted Critical
Publication of CN112861758B publication Critical patent/CN112861758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior identification method based on weak supervised learning video segmentation. The method comprises the following steps: dividing the whole video into N segments, where N is initially unknown, assigning a class label and a length label to each segment, and using the Viterbi algorithm to generate frame labels for the video segments, which are used to calculate a frame-by-frame cross-entropy loss; finding the best action split points in the initial video segmentation obtained by the Viterbi algorithm, and decomposing the initial segmentation into a visual model, a length model and a context model; connecting the input data sequence in forward propagation using a single-layer GRU network with 256 gated recurrent units and a softmax output to obtain the posterior probability and the length model; defining an auxiliary function and finding the optimal split points; and finally obtaining the most likely segmentation of the complete video from the length model and the auxiliary function. The weakly supervised annotation of the video is thereby fully utilized to segment the actions in the complete video.

Description

Behavior identification method based on weak supervised learning video segmentation
Technical Field
The invention relates to the field of video behavior identification, in particular to a behavior identification method based on weak supervised learning video segmentation.
Background
In recent years, the generation of large amounts of video data has stimulated research on video behavior identification. The overall trend in the field of behavior recognition has shifted from static scenes to dynamic scenes, from the detection and recognition of a single moving target to the detection and analysis of multiple moving targets, and from simple individual behaviors to complex actions and even group behavior recognition and detection. Video behavior datasets such as Breakfast and Salad are discussed continuously in top computer vision conference papers, and the classification and temporal segmentation of the activities in these datasets have become popular topics in video behavior recognition research.
Video behavior recognition mainly relies on two annotation modes of a dataset: full supervision and weak supervision. Full supervision consumes a large amount of labor to delimit and label the action frames and classes in a video, whereas weak supervision only provides the sequence of action classes in a video, without the specific start and end frame of each action, so that temporal action segmentation and labeling must be learned from the action transcripts formed by the action labels. Traditional video behavior recognition algorithms include dynamic time warping, the CDP algorithm, HMMs and the Viterbi algorithm; basic deep learning methods for video behavior recognition include the two-stream method, LSTM, GRU, C3D and I3D. These methods work well for detecting action categories in videos but are not efficient for weakly supervised video segmentation.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a behavior recognition method based on weakly supervised learning video segmentation, which uses frame labels generated by the Viterbi algorithm to calculate a frame-by-frame cross-entropy loss L, updates the network parameters by stochastic gradient descent on the gradient ∇L of the cross-entropy loss, connects the input data sequence in forward propagation using a single-layer GRU network with 256 gated recurrent units and a softmax output, and calculates the posterior probability to obtain the maximum segmentation number of the complete video and accurately segment the different actions in a video.
The technical scheme adopted by the invention is as follows:
the method comprises the following steps that (1) the whole video is divided into N sections, a class label c and a length label L are distributed to each section, and a frame label is generated for the video sections obtained through division by using a Viterbi algorithm and is used for calculating the cross entropy loss L frame by frame; based on the cross entropy loss L of all video frames, updating GRU network parameters by using the random gradient descent of the gradient delta L;
step (2), finding the optimal action split points in the initial video segmentation (ĉ_i, l̂_i) obtained by the Viterbi algorithm in step (1), where i is the video segment number and i ∈ {1, ..., N}, and decomposing (ĉ_i, l̂_i) into a visual model, a length model and a context model;
step (3), connecting the input data sequence in forward propagation using a single-layer GRU network with 256 gated recurrent units and a softmax output, obtaining the visual model of step (2) by dividing the posterior probability by the class prior probability, and obtaining the length model of step (2) using a class-dependent Poisson distribution;
step (4), defining an auxiliary function and finding the optimal split points;
and step (5), obtaining the maximum possible segmentation number of the complete video from the length model obtained in step (3) and the auxiliary function defined in step (4).
The method has the advantage that, unlike common video behavior identification methods that only identify the action categories in a video, the behavior identification method based on weak supervised learning video segmentation uses only weakly supervised labeling of the action categories in the video, segments the actions in the complete video, and can identify the actions in the video accurately.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a flow chart of the behavior identification method based on weak supervised learning video segmentation according to an embodiment of the present invention;
FIG. 2 shows frames 81, 171, 366, 480 and 703 of a tea-making action video according to an embodiment of the present invention;
FIG. 3 is the overall network structure according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described with reference to the drawings are illustrative and are intended to be illustrative of the invention and should not be construed as limiting the invention.
First, the dataset used by the behavior recognition method based on weak supervised learning video segmentation is introduced. The Breakfast dataset is a large-scale dataset for action segmentation, comprising approximately 1712 videos of breakfast preparation, equivalent to about 3.6 million frames or 67 hours of video. It contains 10 coarse breakfast activities, such as making pancakes or making an omelet, each with detailed annotations of steps such as stirring or pouring. The dataset has 48 action classes, with an average of 6.9 action instances per video. The 48 action classes have finer labels giving the start and end frames of each action in the video in text form, typically used for action detection and segmentation. The video length varies from a few seconds to several minutes, and the actions are densely labeled, with only 7% of the frames belonging to the background. Fig. 2 shows example frames of specific actions during a tea-making activity in the dataset; weakly supervised learning only provides the class labels and no specific frame boundaries.
Fig. 1 is a flow chart of the method according to one embodiment of the invention.
Let a video containing T frames be X = {x_1, ..., x_t, ..., x_T}. The whole video is divided into N segments, a class label is assigned to each segment, and the video segment class labels c^N = {c_1, ..., c_i, ..., c_N} and segment length labels l_i ∈ {l_1, ..., l_N} are output, where c_i is the class of the i-th video segment, l_i is the length of the i-th video segment, and i ∈ {1, ..., N}. The class label assigned to frame x_t is defined as c_{n(t)}, where n(t) is the segment number of the t-th frame and t ∈ {1, ..., T}. The most likely segmentation (ĉ_i, l̂_i) of the video is inferred, where (ĉ_i, l̂_i) can be calculated by the following formula:

(ĉ_i, l̂_i) = argmax_{c_i, l_i} p(c_i, l_i | X)   (1)

where p(c_i, l_i | X) denotes the probability of the action class and action length of the i-th segment of video X, ĉ_i denotes the predicted class of the i-th video segment, and l̂_i denotes the predicted length of the i-th video segment; the class label ĉ_i and length l̂_i of each video segment are obtained from formula (1).
The video frame sequence X and its class labels c^N are forwarded through the neural network; c^N is provided as the ground-truth class labels, so only the length label of each video segment needs to be inferred during training. The class label assigned to frame x_t is called c_{n(t)}. Using the Viterbi algorithm, the action classes and lengths (ĉ_i, l̂_i) in the video are written as frame-by-frame class labels c_{n(1)}, ..., c_{n(t)}, ..., c_{n(T)}, where c_{n(t)} ∈ {c_1, ..., c_i, ..., c_N}, and the cross-entropy loss over all video frames is calculated:

L = -Σ_{t=1}^{T} log p(c_{n(t)} | x_t)   (2)

where p(c_{n(t)} | x_t) denotes the probability of the action class c_{n(t)} corresponding to video frame x_t, and -log p(c_{n(t)} | x_t) is the cross-entropy loss of frame x_t. Based on the cross-entropy loss L of all video frames, the GRU network parameters are updated by stochastic gradient descent on its gradient ∇L, and the updated network is used in formula (6).

A buffer is used to store recently processed video frame sequences and their inferred frame labels; K frames are sampled from the buffer and added to the loss function:

L = -Σ_{k=1}^{K} log p(c_k | x_k)   (3)

where x_k denotes the k-th frame of the buffered video frame sequences, K is the total number of buffered video frames, and c_k denotes the class label corresponding to frame x_k.
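As an illustration of formulas (2) and (3), the following is a minimal PyTorch-style sketch of the frame-wise cross-entropy loss over the Viterbi-generated frame labels and of the K-frame buffer term; it is an assumption of how such losses could be coded, not the patent's own implementation, and the names frame_wise_loss, buffer_loss and classifier are hypothetical.

    import random

    import torch
    import torch.nn.functional as F

    def frame_wise_loss(frame_log_probs, frame_labels):
        # Formula (2): L = -sum_t log p(c_{n(t)} | x_t).
        # frame_log_probs: (T, C) log class posteriors of one video (GRU output).
        # frame_labels:    (T,)   Viterbi-generated pseudo-labels c_{n(t)}.
        return F.nll_loss(frame_log_probs, frame_labels, reduction="sum")

    def buffer_loss(buffer, k, classifier):
        # Formula (3): cross-entropy over K frames sampled from a buffer of
        # recently processed frames and their inferred labels.
        # buffer:     list of (feature_vector, label) pairs from earlier videos.
        # classifier: any callable mapping a (K, D) feature batch to (K, C) logits.
        feats, labels = zip(*random.sample(buffer, k))
        logits = classifier(torch.stack(feats))
        return F.cross_entropy(logits, torch.tensor(labels), reduction="sum")

In training, the total loss would be the sum of the two terms, and the GRU parameters would then be updated by stochastic gradient descent on its gradient ∇L as described above.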
FIG. 1 shows the overall system architecture according to the present invention: a video x_1, ..., x_T with T frames is input into the GRU network, the GRU network connects the input data sequence in forward propagation and is followed by Viterbi decoding, and the frame-by-frame class labels generated by the Viterbi algorithm are used to calculate the frame-by-frame cross-entropy loss.
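The single-layer GRU with 256 gated recurrent units and softmax output described above could be sketched as follows in PyTorch; the feature dimension of 64 is a placeholder assumption, while the 48 classes match the Breakfast dataset mentioned earlier.

    import torch
    import torch.nn as nn

    class FrameClassifierGRU(nn.Module):
        # Single-layer GRU with 256 gated recurrent units and a softmax output that
        # maps a sequence of frame features x_1..x_T to per-frame log posteriors
        # log p(c | x_t).
        def __init__(self, feature_dim=64, num_classes=48, hidden_size=256):
            super().__init__()
            self.gru = nn.GRU(feature_dim, hidden_size, num_layers=1, batch_first=True)
            self.out = nn.Linear(hidden_size, num_classes)

        def forward(self, x):
            # x: (1, T, feature_dim) -- one video with T frames.
            h, _ = self.gru(x)                              # (1, T, 256)
            return torch.log_softmax(self.out(h), dim=-1)   # (1, T, num_classes)

    # Example: log posteriors for a 700-frame video with 64-dimensional frame features.
    log_posteriors = FrameClassifierGRU()(torch.randn(1, 700, 64))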
The function in formula (1) is decomposed:

argmax_{N, c_1^N, l_1^N} p(c_1^N, l_1^N | X)   (4)

Assuming that the video frames are independent of each other, the argmax in formula (4) can be converted into formula (5), as follows:

argmax_{N, c_1^N, l_1^N} { Π_{t=1}^{T} p(x_t | c_{n(t)}) · Π_{n=1}^{N} p(l_n | c_n) · p(c_n | c_{n-1}) }   (5)

where n(t) is the number of the segment containing frame t. p(x_t | c_{n(t)}) denotes the probability of video frame x_t given the class label c_{n(t)} and is taken as the visual model; p(l_n | c_n) denotes the probability that the n-th video segment has action length l_n given its action class c_n and is taken as the length model; p(c_n | c_{n-1}) denotes the probability that the action class of a video segment is c_n given that the preceding segment has action class c_{n-1} and is the context model.
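To make the factorization of formula (5) concrete, the following sketch scores one candidate segmentation in log space by combining the visual, length and context models; all names are illustrative placeholders rather than the patent's code.

    def segmentation_log_score(visual_log, segments, length_log, context_log):
        # Log of the factorized score in formula (5) for one candidate segmentation.
        # visual_log:  T x C matrix, visual_log[t][c] = log p(x_t | c)
        # segments:    list of (class c_n, length l_n) pairs with lengths summing to T
        # length_log:  callable (l, c) -> log p(l | c)
        # context_log: callable (c, c_prev) -> log p(c_n | c_{n-1}); c_prev is None for n = 1
        score, t, prev = 0.0, 0, None
        for c, l in segments:
            score += sum(visual_log[t + j][c] for j in range(l))   # visual model term
            score += length_log(l, c)                              # length model term
            score += context_log(c, prev)                          # context model term
            t, prev = t + l, c
        return score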
A single-layer GRU network with 256 gated recurrent units and a softmax output is used to connect the input sequence of T video frames X = {x_1, ..., x_t, ..., x_T} in forward propagation. p(c | x_t) is the softmax score of the GRU network for the action class of the t-th frame x_t; the visual model p(x_t | c) can then be represented by the posterior probability p(c | x_t) divided by p(c), as follows:

p(x_t | c) ∝ p(c | x_t) / p(c)   (6)

where p(c) is the prior distribution, taken as the normalized frame frequency of the actions in the training set: during training, the number of frames labelled with class label c over all video frame sequences is counted and then normalized to obtain the estimate of p(c). If a class label sequence c^N contains a class that has never been seen, the unseen class is represented by p(c) = 1/#classes, where #classes denotes the total number of classes.
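A small sketch of formula (6) and the frame-frequency prior, including the 1/#classes fallback for unseen classes, might look like this (assumed NumPy notation, not the patent's implementation):

    import numpy as np

    def class_prior(frame_label_counts):
        # p(c): normalized frame frequency of each action class in the training set.
        # Classes never seen in training fall back to the uniform value 1 / #classes.
        counts = np.asarray(frame_label_counts, dtype=float)
        prior = counts / counts.sum()
        prior[counts == 0] = 1.0 / len(counts)
        return prior

    def visual_log_model(log_posteriors, log_prior):
        # Formula (6): log p(x_t | c) ∝ log p(c | x_t) - log p(c).
        # log_posteriors: (T, C) array of log p(c | x_t) from the GRU.
        return log_posteriors - log_prior[None, :]

    # Example with hypothetical frame counts for five classes, one of them unseen.
    log_prior = np.log(class_prior([1200, 300, 0, 800, 700]))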
The length model is implemented using a class-dependent Poisson distribution:

p(l | c) = (λ_c^l / l!) · e^{-λ_c}   (7)

where λ_c denotes the average length of action class c, λ_c^l denotes λ_c raised to the power l, λ_c is updated in each iteration, and l! is the factorial of l. When a training sample (X, c^N) contains a class that has never been seen, λ_c is defined as N/T, where N denotes the number of video segments and T denotes the number of frames of video X.
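The class-dependent Poisson length model of formula (7) and the re-estimation of λ_c from the current pseudo-labels could be sketched as follows; the N/T fallback for unseen classes follows the text above, and the function names are placeholders.

    import math

    def poisson_log_length(length, lam):
        # Formula (7) in log space: log p(l | c) = l*log(λ_c) - λ_c - log(l!).
        return length * math.log(lam) - lam - math.lgamma(length + 1)

    def mean_lengths(segments, num_classes, num_segments, num_frames):
        # λ_c: average length of the segments currently labelled c; classes without
        # any segment fall back to N/T as stated in the text above.
        sums = [0.0] * num_classes
        counts = [0] * num_classes
        for c, l in segments:              # segments: list of (class_index, length)
            sums[c] += l
            counts[c] += 1
        fallback = num_segments / num_frames
        return [s / n if n > 0 else fallback for s, n in zip(sums, counts)]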
Defining a helper function Q (t, l, c, g), where t denotes the video frame number, l denotes the length of the last segment, c denotes the class label of the last segment, and g denotes the context of the random syntax of the non-terminator, such that the best split point between actions in the video can be found by equation (5), the helper function generating the best probability score for the video segment before t frames that satisfies the following condition, assuming no new segment when l > 1:
Q(t,l,c,g)=Q(t-1,l-1,c,g)·p(xt|c) (8)
assume a new video segment at the tth frame when l is 1:
Figure BDA0002949823010000045
Figure BDA0002949823010000046
a context of a random syntax representing possible non-terminators,
Figure BDA0002949823010000047
the possible class labels are represented by a list of possible classes,
Figure BDA0002949823010000048
it is shown that the possible lengths are,
Figure BDA0002949823010000049
indicates the restriction condition is composed of
Figure BDA00029498230100000410
The context g of the random syntax of the class label c and the non-terminator can be obtained, and simultaneously the existence g' is satisfied, the possible class can be obtained by g
Figure BDA00029498230100000411
And possibly the context of a random syntax of a non-terminator
Figure BDA00029498230100000412
Figure BDA00029498230100000413
Representing the context of a random syntax in a possible non-terminator
Figure BDA00029498230100000414
The category label in the case is c. At all possible lengths
Figure BDA00029498230100000415
And all of
Figure BDA00029498230100000416
Go on maximize operation, let go through assume class c from
Figure BDA00029498230100000417
Transition to g.
From formula (8) and formula (9), covering the cases l > 1 and l = 1, the maximum possible segmentation of the complete video is obtained as:

max_{l, c, g} { Q(T, l, c, g) · p(l | c) }   (10)

which determines the maximum possible segmentation number N of the complete video. By tracking the maximizing arguments ĉ and l̂ of formula (9), the best class labels ĉ_1, ..., ĉ_N and lengths l̂_1, ..., l̂_N can be obtained.
The results on the Breakfast dataset show that the final motion segmentation frame accuracy in the weakly supervised case is 41.5%.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments, or alternatives may be employed, by those skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (2)

1. A behavior identification method based on weak supervised learning video segmentation is characterized by comprising the following steps:
step (1), initially dividing the whole video into N segments and assigning class labels c^N = {c_1, ..., c_i, ..., c_N} and length labels l_i ∈ {l_1, ..., l_N} to the segments, where c_i is the class of the i-th video segment, l_i is the length of the i-th video segment, and i ∈ {1, ..., N}; generating frame labels for the segmented video segments using the Viterbi algorithm, the frame labels being used to calculate the frame-by-frame cross-entropy loss L; and, based on the cross-entropy loss L of all video frames, updating the GRU network parameters by stochastic gradient descent on its gradient ∇L;
step (2), finding the optimal action split points in the video segmentation (ĉ_i, l̂_i) obtained by the Viterbi algorithm in step (1), and decomposing (ĉ_i, l̂_i) to obtain a visual model, a length model and a context model, where ĉ_i denotes the predicted class of the i-th video segment and l̂_i denotes the predicted length of the i-th video segment;
step (3), connecting the input video frame data sequence in forward propagation using a single-layer GRU network with 256 gated recurrent units and a softmax output to obtain the visual model and the length model p(l_n | c_n) for the input video frame data sequence, where c_n is the class label of the n-th video segment, l_n is the length of the n-th video segment, and p(l_n | c_n) denotes the probability that the action length is l_n given that the action class is c_n;
step (4), defining an auxiliary function and finding the optimal split points between video actions;
and step (5), obtaining the maximum possible segmentation number of the complete video from the length model of step (3) and the auxiliary function of step (4).
2. The behavior identification method based on weak supervised learning video segmentation according to claim 1, characterized in that the step (1) specifically comprises:
setting X as X for video containing T frames1,..,xt,...,xTDividing the whole video into N segments, assigning class labels to each segment, and outputting video segment class labels cN={c1,...ci,...,cNAnd length label of video segment li∈{l1,...,lNIn which c isiIs the category of the ith video, liFor the length of the ith video, i belongs to { 1., N }; will be allocated to frame xtIs defined as cn(t)Where n (T) is the segment number of the tth frame, and T e { 1., T }. Inferring most likely segmentations in video
Figure FDA0002949823000000015
Wherein
Figure FDA0002949823000000016
Can be calculated by the following formula:
Figure FDA0002949823000000017
wherein, p (c)i,li| X) represents the probability of the motion category and the motion length of the ith video in video X,
Figure FDA0002949823000000018
represents a category of the predicted ith segment of video,
Figure FDA0002949823000000019
indicating the length of the predicted i-th segment of video.
the video frame sequence X and its class labels c^N are forwarded through a neural network; c^N is provided as the ground-truth class labels, so that only the length label of each video segment needs to be inferred during training; the class label assigned to frame x_t is called c_{n(t)}; using the Viterbi algorithm, the action classes and lengths (c_i, l_i) in the video are written as frame-by-frame class labels c_{n(1)}, ..., c_{n(t)}, ..., c_{n(T)}, where c_{n(t)} ∈ {c_1, ..., c_i, ..., c_N}, and the cross-entropy loss of all video frames is calculated:

L = -Σ_{t=1}^{T} log p(c_{n(t)} | x_t)   (2)

where p(c_{n(t)} | x_t) denotes the probability of the action class c_{n(t)} corresponding to video frame x_t, and -log p(c_{n(t)} | x_t) denotes the cross-entropy loss of frame x_t; based on the cross-entropy loss L of all video frames, the GRU network parameters are updated by stochastic gradient descent on its gradient ∇L, and the updated network is used in formula (6);

a buffer is used to store recently processed video frame sequences and their inferred frame labels, and K frames sampled from the buffer are added to the loss function:

L = -Σ_{k=1}^{K} log p(c_k | x_k)   (3)

where x_k denotes the k-th frame of the buffered video frame sequences, K is the total number of buffered video frames, and c_k denotes the class label corresponding to frame x_k;
the step (2) specifically comprises:

the function in formula (1) is decomposed:

argmax_{N, c_1^N, l_1^N} p(c_1^N, l_1^N | X)   (4)

assuming that the video frames are independent of each other, the argmax in formula (4) can be converted into formula (5), as follows:

argmax_{N, c_1^N, l_1^N} { Π_{t=1}^{T} p(x_t | c_{n(t)}) · Π_{n=1}^{N} p(l_n | c_n) · p(c_n | c_{n-1}) }   (5)

where n(t) is the number of the segment containing frame t; p(x_t | c_{n(t)}) denotes the probability of video frame x_t given the class label c_{n(t)} and is taken as the visual model, p(l_n | c_n) denotes the probability that the n-th video segment has action length l_n given its action class c_n and is taken as the length model, and p(c_n | c_{n-1}) denotes the probability that the action class of a video segment is c_n given that the preceding segment has action class c_{n-1} and is the context model;
the step (3) specifically comprises:

a single-layer GRU network with 256 gated recurrent units and a softmax output is used to connect the input sequence of T video frames X = {x_1, ..., x_t, ..., x_T} in forward propagation; p(c | x_t) is the softmax score of the GRU network for the action class of the t-th frame x_t, and the visual model p(x_t | c) can be represented by the posterior probability p(c | x_t) divided by p(c), as follows:

p(x_t | c) ∝ p(c | x_t) / p(c)   (6)

where p(c) is the prior distribution, taken as the normalized frame frequency of the actions in the training set; during training, the number of frames labelled with class label c over all video frame sequences is counted and then normalized to obtain the estimate of p(c); if a class label sequence c^N contains a class that has never been seen, the unseen class is represented by p(c) = 1/#classes;
the length model is implemented using a class-dependent Poisson distribution:

p(l | c) = (λ_c^l / l!) · e^{-λ_c}   (7)

where λ_c denotes the average length of action class c, λ_c^l denotes λ_c raised to the power l, λ_c is updated in each iteration, and l! is the factorial of l; when a training sample (X, c^N) contains a class that has never been seen, λ_c is defined as N/T, where N denotes the number of video segments and T denotes the number of frames of video X;
the step (4) specifically comprises:

an auxiliary function Q(t, l, c, g) is defined, where t is the video frame number, l denotes the length of the last segment, c denotes the class label of the last segment, and g denotes the non-terminal context of a stochastic grammar; the best split points between the actions in the video are found via formula (5); the auxiliary function yields the best probability score for a segmentation of the video up to frame t satisfying these conditions; when l > 1, it is assumed that no new segment starts:

Q(t, l, c, g) = Q(t-1, l-1, c, g) · p(x_t | c)   (8)

when l = 1, a new video segment is assumed to start at the t-th frame:

Q(t, 1, c, g) = max_{l̂, ĉ, ĝ} { Q(t-1, l̂, ĉ, ĝ) · p(l̂ | ĉ) · p(c | ĝ) } · p(x_t | c)   (9)

where ĝ denotes a possible stochastic grammar non-terminal context, ĉ denotes a possible class label, and l̂ denotes a possible length; the maximization is restricted to those ĝ from which the class label c and the non-terminal context g can be reached, while there exists a g′ from which the possible class ĉ and the possible non-terminal context ĝ were reached; p(c | ĝ) denotes the probability that the class label is c given the possible non-terminal context ĝ; the maximization is performed over all possible lengths l̂ and all (ĉ, ĝ), so that a transition from ĝ to g is made by hypothesizing class c;
from formula (8) and formula (9), covering the cases l > 1 and l = 1, the maximum possible segmentation of the complete video is obtained as:

max_{l, c, g} { Q(T, l, c, g) · p(l | c) }   (10)

which determines the maximum possible segmentation number N of the complete video; by tracking the maximizing arguments ĉ and l̂ of formula (9), the best class labels ĉ_1, ..., ĉ_N and lengths l̂_1, ..., l̂_N can be obtained.
CN202110207458.4A 2021-02-24 2021-02-24 Behavior identification method based on weak supervised learning video segmentation Active CN112861758B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110207458.4A CN112861758B (en) 2021-02-24 2021-02-24 Behavior identification method based on weak supervised learning video segmentation
NL2029182A NL2029182B1 (en) 2021-02-24 2021-09-14 Weakly supervised learning based method for recognizing behavior through video segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110207458.4A CN112861758B (en) 2021-02-24 2021-02-24 Behavior identification method based on weak supervised learning video segmentation

Publications (2)

Publication Number Publication Date
CN112861758A true CN112861758A (en) 2021-05-28
CN112861758B CN112861758B (en) 2021-12-31

Family

ID=75991121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110207458.4A Active CN112861758B (en) 2021-02-24 2021-02-24 Behavior identification method based on weak supervised learning video segmentation

Country Status (2)

Country Link
CN (1) CN112861758B (en)
NL (1) NL2029182B1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113813609A (en) * 2021-06-02 2021-12-21 腾讯科技(深圳)有限公司 Game music style classification method and device, readable medium and electronic equipment
CN114118167A (en) * 2021-12-04 2022-03-01 河南大学 Action sequence segmentation method based on self-supervision less-sample learning and aiming at behavior recognition
CN114697763A (en) * 2022-04-07 2022-07-01 脸萌有限公司 Video processing method, device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543911A (en) * 2019-08-31 2019-12-06 华南理工大学 weak supervision target segmentation method combined with classification task
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
US10824903B2 (en) * 2016-11-16 2020-11-03 Facebook, Inc. Deep multi-scale video prediction
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10824903B2 (en) * 2016-11-16 2020-11-03 Facebook, Inc. Deep multi-scale video prediction
CN110543911A (en) * 2019-08-31 2019-12-06 华南理工大学 weak supervision target segmentation method combined with classification task
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MUHAMMAD USMAN RAFIQUE; NATHAN JACOBS: "Weakly Supervised Building Segmentation from Aerial Images", IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium *
陈华锋: "《视频人体行为识别关键技术研究》", 《中国博士学位论文全文数据库 (信息科技辑)》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113813609A (en) * 2021-06-02 2021-12-21 腾讯科技(深圳)有限公司 Game music style classification method and device, readable medium and electronic equipment
CN113813609B (en) * 2021-06-02 2023-10-31 腾讯科技(深圳)有限公司 Game music style classification method and device, readable medium and electronic equipment
CN114118167A (en) * 2021-12-04 2022-03-01 河南大学 Action sequence segmentation method based on self-supervision less-sample learning and aiming at behavior recognition
CN114118167B (en) * 2021-12-04 2024-02-27 河南大学 Action sequence segmentation method aiming at behavior recognition and based on self-supervision less sample learning
CN114697763A (en) * 2022-04-07 2022-07-01 脸萌有限公司 Video processing method, device, electronic equipment and medium
US11699463B1 (en) 2022-04-07 2023-07-11 Lemon Inc. Video processing method, electronic device, and non-transitory computer-readable storage medium
CN114697763B (en) * 2022-04-07 2023-11-21 脸萌有限公司 Video processing method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN112861758B (en) 2021-12-31
NL2029182B1 (en) 2023-02-15
NL2029182A (en) 2022-09-19

Similar Documents

Publication Publication Date Title
CN112861758B (en) Behavior identification method based on weak supervised learning video segmentation
CN111814854B (en) Target re-identification method without supervision domain adaptation
Behrmann et al. Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation
Fan et al. Watching a small portion could be as good as watching all: Towards efficient video classification
Zhang et al. Category anchor-guided unsupervised domain adaptation for semantic segmentation
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
EP3832534A1 (en) Video action segmentation by mixed temporal domain adaptation
CN112560432B (en) Text emotion analysis method based on graph attention network
US20210326638A1 (en) Video panoptic segmentation
US20030004902A1 (en) Outlier determination rule generation device and outlier detection device, and outlier determination rule generation method and outlier detection method thereof
US20220172456A1 (en) Noise Tolerant Ensemble RCNN for Semi-Supervised Object Detection
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
WO2023109208A1 (en) Few-shot object detection method and apparatus
CN116644755B (en) Multi-task learning-based few-sample named entity recognition method, device and medium
CN110458022B (en) Autonomous learning target detection method based on domain adaptation
CN113283368B (en) Model training method, face attribute analysis method, device and medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN116363374B (en) Image semantic segmentation network continuous learning method, system, equipment and storage medium
Xiong et al. Contrastive learning for automotive mmWave radar detection points based instance segmentation
CN114821022A (en) Credible target detection method integrating subjective logic and uncertainty distribution modeling
Viet‐Uyen Ha et al. High variation removal for background subtraction in traffic surveillance systems
CN114118207B (en) Incremental learning image identification method based on network expansion and memory recall mechanism
Raychaudhuri et al. Exploiting temporal coherence for self-supervised one-shot video re-identification
Fonseca et al. Model-agnostic approaches to handling noisy labels when training sound event classifiers
Long et al. Diverse target and contribution scheduling for domain generalization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant