CN111898461B - Time sequence behavior segment generation method - Google Patents

Time sequence behavior segment generation method

Info

Publication number
CN111898461B
CN111898461B
Authority
CN
China
Prior art keywords
video unit
behavior
time sequence
sequence
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010651476.7A
Other languages
Chinese (zh)
Other versions
CN111898461A (en)
Inventor
宋井宽
李涛
高联丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Guizhou University
Original Assignee
University of Electronic Science and Technology of China
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, Guizhou University filed Critical University of Electronic Science and Technology of China
Priority to CN202010651476.7A priority Critical patent/CN111898461B/en
Publication of CN111898461A publication Critical patent/CN111898461A/en
Application granted granted Critical
Publication of CN111898461B publication Critical patent/CN111898461B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a time sequence behavior segment generation method, which relates to the technical field of video processing and comprises the following steps: coding the video unit feature sequence with a pyramid context sensing mechanism to obtain multi-scale information features; extracting a fixed feature representation of the time sequence behavior segments from the video unit feature sequence with a learnable boundary matching network; and generating time sequence behavior segments based on the multi-scale information features and the fixed feature representation. The pyramid context sensing mechanism effectively encodes the multi-scale information of the video and has a large receptive field, which solves the problem that time sequence behavior segments in a video have different lengths; the learnable boundary matching network extracts a fixed feature representation of time sequence behavior segments of different lengths through a learned coding scheme, can be added to the network for end-to-end training, and overcomes the defect that the time sequence behavior segment features extracted by existing methods contain a large amount of noise.

Description

Time sequence behavior segment generation method
Technical Field
The invention relates to the technical field of video processing, in particular to a time sequence behavior segment generation method.
Background
Time sequence behavior segment generation means that, given an unsegmented long video, the algorithm must detect the behavior segments in the video, including the start time and end time of each behavior segment, so as to accurately locate the time periods in which behaviors occur in the long video and filter out irrelevant information.
Existing methods can be divided into two categories: anchor-based methods and boundary-based methods.
Anchor-based methods pre-define time sequence windows of different sizes to represent time sequence behavior segments of different lengths, and then classify the predefined time sequence behavior segments with a classifier to distinguish behavior from background. In practical applications, however, data sets contain rich categories and time sequence behavior segments of widely varying lengths, so predefined time sequence windows have the following disadvantages: 1) they cannot effectively cover time sequence behavior segments of different lengths in a long video; 2) setting the window scales requires a large amount of manual intervention; 3) the predefined time sequence windows do not have precise boundaries.
Boundary-based methods classify the time points in a video, estimating the probability of each time point being the start or end of a behavior, in order to determine whether it is a boundary point of a time sequence behavior segment. Existing methods acquire context information by stacking time sequence convolution layers or by using a long short-term memory network (LSTM). However, these methods have the following problems: 1) stacking time sequence convolution layers can only increase the receptive field of the network model to a limited extent; 2) a long short-term memory network models the long video to obtain global information, but time sequence behavior segments differ in length, and the existing methods cannot obtain information of different scales.
In addition, most existing methods model the time sequence behavior segments with simple operations such as average pooling, maximum pooling or random sampling. Such modeling introduces a large amount of noise into the extracted time sequence behavior segment features: some video unit features within a time sequence behavior segment may contain behavior information while other video units contain background information, so directly applying such simple modeling makes the extracted time sequence behavior segment features insufficiently discriminative.
Disclosure of Invention
The invention provides a time sequence behavior segment generation method based on a pyramid context sensing mechanism and a learnable boundary matching network, which can alleviate the problems.
In order to alleviate the above problems, the technical scheme adopted by the invention is as follows:
a time series behavior segment generation method comprises the following steps:
s1, coding the input video by using a double-current (two-stream) convolutional neural network to extract a video unit feature sequence;
s2, coding the video unit feature sequence by adopting a pyramid context sensing mechanism to obtain multi-scale information features;
s3, extracting fixed characteristic representation of the time sequence behavior segment from the video unit characteristic sequence by adopting a learnable boundary matching network;
and S4, generating time sequence behavior segments based on the multi-scale information characteristics and the fixed characteristic representation.
The technical effect of the technical scheme is as follows: the multi-scale information of the video is effectively coded through a pyramid context sensing mechanism, the video has a great receptive field, and the problem that time sequence behavior segments in the video are different in size is solved; the learnable boundary matching network can extract fixed characteristic representation of time sequence behavior segments with different lengths through a learnt coding mode, can be added into the network for end-to-end training, and overcomes the defect that the time sequence behavior segment characteristics extracted by the existing method have a lot of noises.
Further, the step S1 is specifically:
dividing a video picture sequence into a video unit sequence;
and for each video unit in the video unit sequence, coding the video unit by using a double-current convolutional neural network to extract the initial video unit characteristics to obtain an initial video unit characteristic sequence, and then performing dimension reduction processing on the initial video unit characteristic sequence by using a layer of time sequence convolutional layer to obtain a final video unit characteristic sequence.
The technical effect of the technical scheme is as follows: the video picture sequence is divided into the video unit sequence, so that high calculation efficiency is guaranteed, and data are not redundant. Further, in step S2, the pyramid context sensing mechanism includes a plurality of continuous layers of time sequence hole convolution modules, each layer of time sequence hole convolution module is configured to acquire an information feature of one scale in the video unit feature sequence, and the information features acquired by each layer of time sequence hole convolution module are subjected to a splicing operation to obtain the multi-scale information feature.
The technical effect of the technical scheme is as follows: under the condition of ensuring that the time sequence resolution is not changed, the pyramid context sensing mechanism not only has a very large receptive field, but also can fuse information of different scales.
Further, the time sequence hole convolution module comprises a time sequence hole convolution layer, a linear rectification function, a time sequence hole convolution layer and a random deactivation layer which are connected in sequence.
Further, the step S3 specifically includes:
s31, obtaining a clustering center vector through normal distribution initialization;
s32, calculating a video unit distribution graph of which the video unit features belong to the clustering center vector through the full connection layer and softmax;
s33, performing element product operation by using the time sequence behavior mask matrix and the video unit distribution map to obtain a time sequence behavior fragment distribution map;
s34, performing residual error operation on the time sequence behavior fragment distribution graph and the clustering center vector to obtain the characteristics of all possible time sequence behavior fragments;
s35, carrying out normalization processing of L2 on the characteristics of all possible time sequence behavior segments to obtain fixed characteristic representation of the time sequence behavior segments.
The technical effect of the technical scheme is as follows: the learnable boundary matching network can extract fixed characteristic representation of time sequence behavior segments with different lengths in a learnable coding mode, and can be added into the network to carry out end-to-end training.
Further, the step S4 specifically includes:
s41, inputting the multi-scale information characteristics into a prediction layer, and optimizing the output result of the prediction layer to obtain a probability sequence of a video unit as a behavior start and a behavior end;
s42, inputting the fixed characteristic representation of the time sequence behavior segment into a prediction layer, and optimizing the output result of the prediction layer to obtain two time sequence behavior segment confidence score graphs;
s43, setting boundary point judgment conditions, selecting a plurality of candidate behavior starting boundary points and a plurality of candidate behavior ending boundary points from the probability sequence according to the boundary point judgment conditions, and combining the candidate behavior starting boundary points and the candidate behavior ending boundary points to generate a plurality of initial time sequence behavior segments;
and S44, retrieving the confidence score of the initial time sequence behavior segment from the confidence score maps of the two time sequence behavior segments according to the sequence numbers of the boundary points, removing the initial time sequence behavior segment with low confidence score, and keeping the initial time sequence behavior segment with high confidence score as the finally generated time sequence behavior segment.
Furthermore, there are two boundary point determination conditions, namely condition 1 and condition 2; if the probability of a video unit satisfies either of the two conditions, that video unit is taken as a candidate boundary point;
the condition 1 is that the probability of the video unit is higher than 0.5 times of the maximum value in the probability sequence;
the condition 2 is that the probability of the video unit is higher than the probabilities of the previous video unit and the next video unit.
Further, the output result of the prediction layer in step S41 is optimized by a cross entropy loss function, and the output result of the prediction layer in step S42 is optimized by a mean square error and a cross entropy loss function.
Further, the prediction layers used in steps S41 and S42 each include a time-series convolutional layer having a convolutional kernel size of 1 and a Sigmoid function.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic diagram of a network structure for generating time-series behavior segments according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a pyramid context-aware mechanism according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a learnable boundary matching network according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
Referring to fig. 1 to fig. 3, the present embodiment discloses a method for generating time-series behavior segments.
Given a video picture sequence S = {s_t}_{t=1}^{l_s}, where s_t denotes the t-th frame picture and l_s denotes the total number of pictures in the video picture sequence.
In order to ensure high calculation efficiency without data redundancy, the invention first divides the video picture sequence into a video unit sequence U = {u_j}_{j=1}^{T}, where T = l_s / n_u denotes the total number of video units. Each video unit u = {s_f, ..., s_{f+n_u}} is the smallest unit processed by the video unit coding network, where s_f denotes the starting picture frame of the video unit, s_{f+n_u} denotes its ending picture frame, and the unit contains n_u consecutive pictures in total.
Then, each video unit is coded with a double-current convolutional neural network to extract an initial video unit feature f_u ∈ R^{C}, where C denotes the feature dimension. All initial video unit features are then passed through one time sequence convolution layer for dimension reduction to obtain the final video unit feature sequence F = {f_t}_{t=1}^{T}.
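Purely as an illustration of this preprocessing step, the following Python (PyTorch) sketch divides a frame sequence into video units and reduces the feature dimension with one temporal convolution layer; the two-stream backbone is abstracted as an opaque callable, and the helper names, feature dimensions and kernel size are assumptions rather than values given in the patent.

```python
import torch
import torch.nn as nn

def extract_unit_features(frames, two_stream_encoder, n_u, reduce_conv):
    """frames: tensor of shape (l_s, 3, H, W), the video picture sequence.
    two_stream_encoder: assumed callable mapping one video unit (n_u frames) to a C-dim feature.
    reduce_conv: one time sequence convolution layer used for dimension reduction."""
    T = frames.shape[0] // n_u                                    # T = l_s / n_u video units
    units = frames[: T * n_u].reshape(T, n_u, *frames.shape[1:])  # split into video units
    feats = torch.stack([two_stream_encoder(u) for u in units])   # (T, C) initial unit features
    # one temporal convolution layer reduces the feature dimension: (1, C, T) -> (1, C_red, T)
    reduced = reduce_conv(feats.t().unsqueeze(0))
    return reduced.squeeze(0).t()                                 # (T, C_red) final feature sequence

# Example dimension-reduction layer; input/output channels and kernel size are assumed values.
reduce_conv = nn.Conv1d(in_channels=4096, out_channels=256, kernel_size=3, padding=1)
```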
1. Pyramid context-aware mechanism processing
The video unit feature sequence F = {f_t}_{t=1}^{T} is encoded with the pyramid context-aware mechanism to obtain the multi-scale information features.
In this embodiment, the pyramid context-aware mechanism includes a plurality of consecutive time sequence hole convolution modules. As shown in fig. 2, each time sequence hole convolution module consists of a time sequence hole convolution layer, a linear rectification function (ReLU), a second time sequence hole convolution layer and a random deactivation layer (Dropout) connected in sequence; the convolution kernel size of both time sequence hole convolution layers is set to 3, and the random deactivation layer is introduced to prevent overfitting.
In order to acquire multi-scale information, the present invention uses a densely connected structure to splice outputs of different time sequence hole convolution layers, and finally obtains multi-scale information characteristics.
With the densely connected structure, the l-th time sequence hole convolution module takes the outputs of all preceding modules as input, and its output x_l is:
x_l = H_l([x_1, x_2, ..., x_{l-1}], r_l)
where [·] denotes the splicing (concatenation) operation and r_l denotes the hole rate of the time sequence hole convolution in the l-th time sequence hole convolution module, which is set to 2^{l-1}. The output of the last time sequence hole convolution module finally yields the multi-scale information features.
The invention finds that the bottom-layer time sequence hole convolution modules have a smaller receptive field and can be responsible for shorter time sequence behavior segments, while the top-layer time sequence hole convolution modules have a larger receptive field and can be responsible for longer time sequence behavior segments. Therefore, while keeping the time sequence resolution unchanged, the pyramid context-aware mechanism not only has a very large receptive field but also fuses information of different scales.
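As a concrete reading of the above description, the following is a minimal PyTorch sketch of a pyramid context-aware stack of densely connected time sequence hole (dilated) convolution modules; the number of layers, channel widths, dropout rate and the way the per-layer outputs are spliced are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class TemporalHoleConvModule(nn.Module):
    """One time sequence hole convolution module:
    dilated conv -> ReLU -> dilated conv -> dropout, kernel size 3."""
    def __init__(self, in_channels, out_channels, dilation, dropout=0.1):
        super().__init__()
        pad = dilation  # keeps the temporal resolution unchanged for kernel size 3
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, dilation=dilation, padding=pad),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, dilation=dilation, padding=pad),
            nn.Dropout(dropout),
        )

    def forward(self, x):          # x: (batch, channels, T)
        return self.block(x)

class PyramidContextAware(nn.Module):
    """Densely connected stack: module l sees the concatenation of all earlier outputs
    and uses hole rate 2**(l-1); num_layers and channel sizes are assumptions."""
    def __init__(self, feat_dim=256, channels=128, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = feat_dim
        for l in range(1, num_layers + 1):
            self.layers.append(TemporalHoleConvModule(in_ch, channels, dilation=2 ** (l - 1)))
            in_ch += channels      # dense connection grows the input of the next module

    def forward(self, f):          # f: (batch, feat_dim, T) video unit feature sequence
        feats = [f]
        for layer in self.layers:
            x_l = layer(torch.cat(feats, dim=1))   # x_l = H_l([x_1, ..., x_{l-1}], r_l)
            feats.append(x_l)
        return torch.cat(feats[1:], dim=1)         # spliced multi-scale information features
```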
2. Learnable boundary matching network processing
The learnable boundary matching network extracts a fixed feature representation of the time sequence behavior segments from the video unit feature sequence F = {f_t}_{t=1}^{T}, as follows:
first, the learnable boundary matching network passes through positiveInitializing state distribution to obtain K clustering center vectors { c k }. The learnable boundary matching network then computes each video unit feature f t Belong to a cluster center { c k Profile a of } ═ a k }. The calculation formula is as follows:
Figure BDA0002575131710000061
wherein a is k (f t ) Representing the t-th video unit feature f t To the kth cluster center c k Probability of (c) { w } k },{b k And { c }and k Are all learnable parameters in the network model.
Next, in order to obtain a distribution map describing how time sequence behavior segments are assigned to the cluster centers, a time sequence behavior segment mask matrix is introduced to expand the video unit distribution map a into a time sequence behavior segment distribution map A.
The time sequence behavior segment mask matrix gives a binary representation of all possible time sequence behavior segments. Specifically, it is a binary matrix M ∈ {0, 1}^{D×T×1×T} containing D×T time sequence behavior segment mask vectors, where D and T denote the longest length of a time sequence behavior segment in the data set and the length of the video unit feature sequence, respectively. For example, the point M(m, n) in the time sequence behavior segment mask matrix corresponds to the time sequence behavior segment that starts at the n-th video unit, lasts m video units and ends at the (n+m)-th video unit; its mask vector is computed as:
M(m, n, 1, t) = 1 if n ≤ t ≤ n + m, and 0 otherwise.
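A minimal sketch of how such a binary mask matrix could be precomputed (0-based indexing and the helper name build_mask_matrix are assumptions introduced only for illustration):

```python
import torch

def build_mask_matrix(D, T):
    """Binary time sequence behavior segment mask matrix M of shape (D, T, 1, T):
    M[m, n, 0, t] = 1 if the t-th video unit lies inside the segment that starts at
    video unit n and ends at video unit n + m (0-based indices), else 0."""
    M = torch.zeros(D, T, 1, T)
    for m in range(D):
        for n in range(T):
            end = min(n + m, T - 1)          # clip segments that would run past the sequence
            M[m, n, 0, n:end + 1] = 1.0
    return M
```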
next, the present invention first maps the video unit distribution
Figure BDA0002575131710000065
Dimension expansion is carried out to obtain
Figure BDA0002575131710000066
Then, the invention maps the video unit distribution diagram
Figure BDA0002575131710000067
And a time-sequential behavior mask matrix
Figure BDA0002575131710000068
The element product operation is carried out to obtain a behavior segment distribution diagram
Figure BDA0002575131710000069
The calculation formula of the a (m, n, k, t) element is as follows:
A(m,n,k,t)=a k (f t )·M(m,n,1,t)
where A(m, n, k, t) denotes the probability that the t-th video unit feature within the time sequence behavior segment indexed by (m, n) belongs to the k-th cluster center c_k. Finally, the learnable boundary matching network obtains the features V ∈ R^{D×T×K×C} of all possible time sequence behavior segments through a residual operation; V contains the features of D×T time sequence behavior segments, each with feature dimension K×C. The element V(m, n, k, j) is computed as:
V(m, n, k, j) = Σ_{t=1}^{T} A(m, n, k, t) · (f_t(j) - c_k(j))
where f_t(j) denotes the j-th element of the t-th video unit feature and c_k(j) denotes the j-th element of the k-th cluster center.
Finally, L2 normalization is applied to the features V(m, n, k, j) of all possible time sequence behavior segments to obtain the fixed feature representation of the time sequence behavior segments.
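Putting the above steps together, a minimal PyTorch sketch of the learnable boundary matching network might look as follows; the cluster count K, the einsum-based residual aggregation and the reuse of the hypothetical build_mask_matrix helper from the earlier sketch are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableBoundaryMatching(nn.Module):
    def __init__(self, feat_dim, num_clusters, D, T):
        super().__init__()
        # Fully connected layer + softmax produces the video unit distribution map a_k(f_t)
        self.assign = nn.Linear(feat_dim, num_clusters)                    # parameters {w_k}, {b_k}
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))   # {c_k}, normal init
        self.register_buffer("mask", build_mask_matrix(D, T))              # (D, T, 1, T)

    def forward(self, f):                       # f: (T, C) video unit feature sequence
        a = F.softmax(self.assign(f), dim=-1)   # (T, K) video unit distribution map
        a = a.t().unsqueeze(0).unsqueeze(0)     # dimension expansion to (1, 1, K, T)
        A = a * self.mask                       # (D, T, K, T) segment distribution map
        # Residual operation: V(m, n, k, :) = sum_t A(m, n, k, t) * (f_t - c_k)
        resid = f.unsqueeze(0) - self.centers.unsqueeze(1)                 # (K, T, C)
        V = torch.einsum("dnkt,ktc->dnkc", A, resid)                       # (D, T, K, C)
        V = F.normalize(V.flatten(2), p=2, dim=-1)                         # L2-normalized features
        return V                                # (D, T, K*C) fixed feature representation
```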
3. Generation process of time sequence behavior segment
The process is based on multi-scale information characteristics and fixed characteristic representation, and time sequence behavior segments are generated, and the process specifically comprises the following steps:
1) The multi-scale information features are input into one time sequence convolution layer (convolution kernel size 1) followed by a Sigmoid function to obtain the probability sequences of each video unit being a behavior start and a behavior end, denoted here as P^s = {p_t^s}_{t=1}^{T} and P^e = {p_t^e}_{t=1}^{T}, respectively. This prediction layer is optimized with a cross entropy loss function.
A video unit in the behavior start probability sequence or the behavior end probability sequence is selected as a candidate boundary point if it meets either of the following two conditions: condition 1, the probability of the video unit is higher than 0.5 times the maximum value in that probability sequence; condition 2, the probability of the video unit is higher than both the probability of the previous video unit and that of the following video unit, i.e. its probability value is a local peak. The invention then combines the candidate behavior start boundary points and the candidate behavior end boundary points pairwise to generate the initial time sequence behavior segments.
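For illustration, the two boundary point conditions and the pairwise combination could be implemented as in the following sketch; the optional max_duration bound is an assumption carried over from the longest segment length D and is not stated in this paragraph.

```python
import torch

def candidate_boundaries(prob):
    """Select video units whose probability exceeds 0.5 * max of the sequence (condition 1)
    or is a local peak, higher than both neighbours (condition 2). prob: 1-D tensor."""
    keep = prob > 0.5 * prob.max()                                    # condition 1
    peak = torch.zeros_like(keep)
    peak[1:-1] = (prob[1:-1] > prob[:-2]) & (prob[1:-1] > prob[2:])   # condition 2
    return torch.nonzero(keep | peak).flatten().tolist()

def initial_segments(p_start, p_end, max_duration=None):
    """Combine candidate start and end boundary points pairwise into initial
    time sequence behavior segments (t_s, t_e); max_duration is an assumed optional bound."""
    segments = []
    for t_s in candidate_boundaries(p_start):
        for t_e in candidate_boundaries(p_end):
            if t_s < t_e and (max_duration is None or t_e - t_s <= max_duration):
                segments.append((t_s, t_e))
    return segments
```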
2) The fixed feature representation of the time sequence behavior segments is input into one time sequence convolution layer (convolution kernel size 1) followed by a Sigmoid function to obtain two time sequence behavior segment confidence score maps, denoted here as M^cls and M^reg, produced in a classification manner and a regression manner, respectively. This prediction layer is optimized with mean square error and cross entropy loss functions.
According to the boundary point sequence numbers of each generated initial time sequence behavior segment, its confidence scores are retrieved from the two time sequence behavior segment confidence score maps M^cls and M^reg; the initial time sequence behavior segments with low confidence scores are removed, and those with high confidence scores are kept as the finally generated time sequence behavior segments.
For example, the above processing yields time sequence behavior segments of the form (t_s, t_e, p^s_{t_s}, p^e_{t_e}, p_cls, p_reg), where t_s and t_e are the video unit sequence numbers of the start and end of the time sequence behavior segment, p^s_{t_s} and p^e_{t_e} are the probability values of these video units being the start and end of a time sequence behavior segment, and p_cls and p_reg are the values at coordinate [t_e - t_s, t_s] in the confidence score maps M^cls and M^reg, respectively.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for generating time-series behavior segments is characterized by comprising the following steps:
s1, coding the input video by using a double-current convolutional neural network to extract a video unit feature sequence;
s2, coding the video unit feature sequence by adopting a pyramid context sensing mechanism to obtain multi-scale information features;
s3, extracting fixed feature representation of the time sequence behavior segment from the video unit feature sequence by adopting a learnable boundary matching network;
s4, generating time sequence behavior segments based on the multi-scale information features and the fixed feature representation;
the step S3 specifically includes:
s31, obtaining a clustering center vector through normal distribution initialization;
s32, calculating a video unit distribution graph of which the video unit features belong to the clustering center vector through the full connection layer and softmax;
s33, performing element product operation by using the time sequence behavior mask matrix and the video unit distribution map to obtain a time sequence behavior fragment distribution map;
s34, performing residual error operation on the time sequence behavior fragment distribution graph and the clustering center vector to obtain the characteristics of all possible time sequence behavior fragments;
s35, carrying out normalization processing of L2 on the characteristics of all possible time sequence behavior segments to obtain fixed characteristic representation of the time sequence behavior segments.
2. The method for generating time-series behavior segments according to claim 1, wherein the step S1 specifically comprises:
dividing a video picture sequence into a video unit sequence;
and for each video unit in the video unit sequence, coding each video unit by using a double-current convolutional neural network to extract initial video unit characteristics to obtain an initial video unit characteristic sequence, and reducing the dimension of the initial video unit characteristic sequence to obtain a final video unit characteristic sequence.
3. The method of claim 2, wherein the initial video unit feature sequence is dimension-reduced using a layer of time-sequential convolutional layer.
4. The method according to claim 1, wherein the pyramid context awareness mechanism includes a plurality of layers of consecutive time-series hole convolution modules, each layer of time-series hole convolution module is configured to obtain an information feature of one scale in the video unit feature sequence, and the information features obtained by each layer of time-series hole convolution modules are subjected to a splicing operation to obtain the multi-scale information feature.
5. The method of claim 4, wherein each time sequence hole convolution module comprises a time sequence hole convolution layer, a linear rectification function, a time sequence hole convolution layer, and a random deactivation layer connected in sequence.
6. The method for generating time-series behavior segments according to claim 1, wherein the step S4 specifically includes:
s41, inputting the multi-scale information characteristics into a prediction layer, and optimizing the output result of the prediction layer to obtain a probability sequence of a video unit as a behavior start and a behavior end;
s42, inputting the fixed characteristic representation of the time sequence behavior fragment into a prediction layer, and optimizing the output result of the prediction layer to obtain two time sequence behavior fragment confidence score maps;
s43, setting boundary point judgment conditions, selecting a plurality of candidate behavior starting boundary points and a plurality of candidate behavior ending boundary points from the probability sequence according to the boundary point judgment conditions, and combining the candidate behavior starting boundary points and the candidate behavior ending boundary points to generate a plurality of initial time sequence behavior segments;
and S44, retrieving the confidence score of the initial time sequence behavior segment from the confidence score maps of the two time sequence behavior segments according to the sequence numbers of the boundary points, removing the initial time sequence behavior segment with low confidence score, and keeping the initial time sequence behavior segment with high confidence score as the finally generated time sequence behavior segment.
7. The method according to claim 6, wherein there are two boundary point determination conditions, namely condition 1 and condition 2, and if a certain video unit satisfies any one of the two conditions, the certain video unit is taken as a candidate boundary point;
the condition 1 is that the probability of the video unit is higher than 0.5 times of the maximum value in the probability sequence;
the condition 2 is that the probability of the video unit is higher than the probabilities of the previous video unit and the next video unit.
8. The method as claimed in claim 6, wherein the output of the prediction layer in step S41 is optimized by a cross entropy loss function, and the output of the prediction layer in step S42 is optimized by a mean square error and a cross entropy loss function.
9. The method of claim 6, wherein the prediction layers used in steps S41 and S42 each include a time series convolutional layer having a convolutional kernel size of 1 and a Sigmoid function.
CN202010651476.7A 2020-07-08 2020-07-08 Time sequence behavior segment generation method Active CN111898461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010651476.7A CN111898461B (en) 2020-07-08 2020-07-08 Time sequence behavior segment generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010651476.7A CN111898461B (en) 2020-07-08 2020-07-08 Time sequence behavior segment generation method

Publications (2)

Publication Number Publication Date
CN111898461A CN111898461A (en) 2020-11-06
CN111898461B true CN111898461B (en) 2022-08-30

Family

ID=73191933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010651476.7A Active CN111898461B (en) 2020-07-08 2020-07-08 Time sequence behavior segment generation method

Country Status (1)

Country Link
CN (1) CN111898461B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364852B (en) * 2021-01-13 2021-04-20 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information
CN112990013B (en) * 2021-03-15 2024-01-12 西安邮电大学 Time sequence behavior detection method based on dense boundary space-time network
CN113486754B (en) * 2021-06-29 2024-01-09 中国科学院自动化研究所 Event evolution prediction method and system based on video
CN117152692B (en) * 2023-10-30 2024-02-23 中国市政工程西南设计研究总院有限公司 Traffic target detection method and system based on video monitoring

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805083B (en) * 2018-06-13 2022-03-01 中国科学技术大学 Single-stage video behavior detection method
CN109671125B (en) * 2018-12-17 2023-04-07 电子科技大学 Highly-integrated GAN network device and method for realizing text image generation
CN110222592B (en) * 2019-05-16 2023-01-17 西安特种设备检验检测院 Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2608107A2 (en) * 2011-12-22 2013-06-26 Broadcom Corporation System and method for fingerprinting video
WO2013104205A1 (en) * 2012-01-09 2013-07-18 中兴通讯股份有限公司 Method for encoding and decoding image layer and slice layer, codec and electronic device
CN107430687A (en) * 2015-05-14 2017-12-01 谷歌公司 The segmentation of the time based on entity of video flowing
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN108573246A (en) * 2018-05-08 2018-09-25 北京工业大学 A kind of sequential action identification method based on deep learning
CN109711380A (en) * 2019-01-03 2019-05-03 电子科技大学 A kind of timing behavior segment generation system and method based on global context information
CN109726696A (en) * 2019-01-03 2019-05-07 电子科技大学 System and method is generated based on the iamge description for weighing attention mechanism
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network
CN110602526A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN111368786A (en) * 2020-03-16 2020-07-03 平安科技(深圳)有限公司 Action region extraction method, device, equipment and computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation;Tianwei Lin 等;《Computer Vision and Pattern Recognition》;20180926;1-17 *
Play and rewind: Context-aware video temporal action proposals;Lianli Gao 等;《Pattern Recognition》;20200606;1-9 *
Research on Temporal Action Detection and Video Description Algorithms Based on Deep Learning; Liu Xiaoning; China Master's Theses Full-text Database, Information Science and Technology; 20190915 (No. 09, 2019); I138-1104 *

Also Published As

Publication number Publication date
CN111898461A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111898461B (en) Time sequence behavior segment generation method
CN110059772B (en) Remote sensing image semantic segmentation method based on multi-scale decoding network
KR101880901B1 (en) Method and apparatus for machine learning
CN109840556B (en) Image classification and identification method based on twin network
US9400918B2 (en) Compact face representation
CN111723645B (en) Multi-camera high-precision pedestrian re-identification method for in-phase built-in supervised scene
CN110929029A (en) Text classification method and system based on graph convolution neural network
CN109063666A (en) The lightweight face identification method and system of convolution are separated based on depth
CN109543727B (en) Semi-supervised anomaly detection method based on competitive reconstruction learning
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN112580523A (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN111310852B (en) Image classification method and system
CN111625675A (en) Depth hash image retrieval method based on feature pyramid under attention mechanism
CN110188827B (en) Scene recognition method based on convolutional neural network and recursive automatic encoder model
CN113673346A (en) Motor vibration data processing and state recognition method based on multi-scale SE-Resnet
CN113269054A (en) Aerial video analysis method based on space-time 2D convolutional neural network
CN108875532A (en) A kind of video actions detection method based on sparse coding and length posterior probability
CN110942057A (en) Container number identification method and device and computer equipment
CN111371611B (en) Weighted network community discovery method and device based on deep learning
CN108805280B (en) Image retrieval method and device
CN115393289A (en) Tumor image semi-supervised segmentation method based on integrated cross pseudo label
CN114565789B (en) Text detection method, system, device and medium based on set prediction
CN116844041A (en) Cultivated land extraction method based on bidirectional convolution time self-attention mechanism
CN112016590A (en) Prediction method combining sequence local feature extraction and depth convolution prediction model
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant