CN113255570B - Sequential action detection method for sensing video clip relation - Google Patents

Sequential action detection method for sensing video clip relation

Info

Publication number
CN113255570B
CN113255570B (application CN202110659154.1A)
Authority
CN
China
Prior art keywords
video
global
video segments
candidate
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110659154.1A
Other languages
Chinese (zh)
Other versions
CN113255570A (en)
Inventor
徐行
任燚梵
沈复民
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN202110659154.1A priority Critical patent/CN113255570B/en
Publication of CN113255570A publication Critical patent/CN113255570A/en
Application granted granted Critical
Publication of CN113255570B publication Critical patent/CN113255570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

The invention relates to the field of video understanding, and in particular to a time sequence action detection method for sensing video clip relations, which comprises the following steps: step S1: sampling a video; step S2: performing preliminary feature extraction on the video; step S3: performing feature enhancement on the extracted features, generating boundary predictions for the time sequence nodes, and extracting the features of all candidate video segments; step S4: capturing the relations between the candidate video segment features; step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score; step S6: removing repeated candidate video segments; step S7: classifying the candidate video segments to obtain their category information. By capturing the global and local relations between candidate video segments, more effective video segment features are generated, thereby producing more accurate prediction results.

Description

Sequential action detection method for sensing video clip relation
Technical Field
The invention relates to the field of video understanding, in particular to a time sequence action detection method for sensing video clip relation.
Background
In recent years, with the continuous development of streaming media, the number of videos on various website platforms has grown explosively. Compared with traditional image data, videos contain much richer information and have attracted increasing attention from researchers, and video understanding has gradually become a popular research field in both industry and academia. The time sequence action detection task is an important branch of video understanding: its goal is to detect the temporal boundary of each action instance in a long video and to determine the category of the action, which helps users locate the action information in videos more conveniently and quickly.
The time sequence action detection task can help people quickly identify the key content in long videos, can serve as a preprocessing step for problems such as video understanding and human-computer interaction, and is widely applied in practice: (1) video surveillance: with the rapid development of the Internet of Things, cameras have been deployed in roads, schools and many other public places; they play an important security role but also produce an extremely large amount of video data, and analysing this data purely by hand is clearly unrealistic, whereas the time sequence action detection task can quickly extract the effective information in surveillance videos and free up a large amount of human resources; (2) video retrieval: with the popularity of video on various social software, ordinary users also upload and share all kinds of video data; in order to recommend videos of interest to users, every uploaded video needs to be classified and labelled, and doing this manually is very costly, whereas time sequence action detection can detect the actions in a video, perform preliminary classification, sorting and labelling according to predefined categories, and can then be integrated into subsequent video retrieval and video recommendation algorithms.
Due to the wide application of the time sequence action detection task in industry and academia, many effective algorithms have been proposed; at present, they can be roughly divided into two types, one-stage and two-stage:
1) one-stage: video segments of different lengths are detected directly, for example with hierarchical or cascade structures, and the category information of each video segment is given at the same time during prediction;
2) two-stage: all possible candidate video segments are first extracted, and the candidate video segments are then classified with classical classifiers (such as Unet) to obtain their category information.
Although two-stage methods have achieved good results, they generate each candidate video segment independently during the candidate generation process and ignore the relationships between the video segments.
Disclosure of Invention
Based on the above problems, the present invention provides a time sequence action detection method for sensing video segment relations, which generates more effective video segment features by capturing the global and local relations between candidate video segments, thereby producing more accurate prediction results.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a time sequence action detection method for sensing video clip relation comprises the following steps:
step S1: sampling a video;
step S2: performing preliminary feature extraction on the video by using a TSN (Temporal Segment Network) model to obtain features F;
step S3: using a BaseNet model to perform feature enhancement on the extracted features F, generating boundary predictions P_S and P_E for the time sequence nodes, and in addition extracting the features of all candidate video segments, denoted X_ACFM, wherein each element of X_ACFM represents the feature of one candidate video segment;
step S4: capturing the relations between the candidate video segment features X_ACFM by using a global perception module and a feature enhancement module;
step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
step S6: removing repeated candidate video segments by using a Soft-NMS model;
step S7: classifying the candidate video segments by using a Unet classifier to obtain the class information of the candidate video segments.
Further, the step S2 specifically includes the following steps:
step S21: first, sampling the long video at a certain time interval to obtain a certain number of video segments;
step S22: inputting the video segments into the TSN model, obtaining the visual features and the motion features respectively, and concatenating the visual features and the motion features.
Further, the step S3 specifically includes the following steps:
step S31: establishing the relations among all video frames by means of graph convolution, and dynamically fusing multi-scale contextual semantic information into the video features;
step S32: predicting, by graph convolution, the possibility that each time sequence position is a start node or an end node, i.e. generating the boundary predictions P_S and P_E of the time sequence nodes; and, in addition, extracting the features of all candidate video segments and outputting X_ACFM.
Further, the global perception module is specifically as follows:
a global perception unit is designed to establish the relations between candidate video segments located in the same row and the same column; for an input X, X is first fed into two parallel paths, each path respectively comprising a vertical pooling layer or a horizontal pooling layer; for the horizontally pooled result M^h, the calculation formula is:
M^h(c, j) = (1/T) · Σ_{i=1}^{T} X(c, i, j)
wherein M^h represents the result after X is input to the horizontal pooling, T represents all possible start times in the video, c represents the c-th channel, i represents a video segment whose start time is i, j represents a video segment whose duration is j, and X(c, i, j) represents the value of the c-th channel of the video segment with i as the start time and j as the duration;
for the vertically pooled result M^v, the formula is:
M^v(c, i) = (1/D) · Σ_{j=1}^{D} X(c, i, j)
wherein M^v represents the result after X is input to the vertical pooling, D represents all possible durations of the video segments, and c, i, j and X(c, i, j) have the same meaning as above;
then, a one-dimensional convolution operation with kernel size 3 is used to aggregate the information of the current position and its neighbours, and the outputs of the two paths are fused to obtain the fusion result Y, as follows:
Y = Conv1d(M^h) + Conv1d(M^v)
the output of the global perception unit is then:
X_unit = Sig(Conv(Y)) × X × X
wherein X_unit represents the result of the global perception unit operation, X represents the input of the unit, Sig() represents the activation function, and Conv() represents a convolution operation;
the global perception unit is repeated twice to obtain the global perception module, whose calculation formula is:
X_ga = Conv(GA(GA(X)))
wherein X_ga represents the output of the global perception module, GA() represents the global perception unit operation, and Conv() represents a convolution operation.
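For illustration only, the following PyTorch sketch shows one way the global perception unit and module described above could be implemented. The tensor layout (batch, channels, duration, start time), the use of average pooling for the horizontal and vertical pooling layers, and the reading of the claimed output formula Sig(Conv(Y)) × X × X as a sigmoid-gated residual are all assumptions made for this sketch; the class and variable names are not taken from the patent.

```python
import torch
import torch.nn as nn

class GlobalPerceptionUnit(nn.Module):
    """Sketch of the GA unit: pool over start times / durations, aggregate
    neighbours with kernel-3 1-D convolutions, then gate the input (assumed
    sigmoid-gated residual reading of Sig(Conv(Y)) x X x X)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # along the duration axis
        self.conv_v = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # along the start-time axis
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, D, T) = (batch, channels, durations, start times)
        m_h = x.mean(dim=3)                     # horizontal pooling over start times -> (B, C, D)
        m_v = x.mean(dim=2)                     # vertical pooling over durations     -> (B, C, T)
        y = self.conv_h(m_h).unsqueeze(3) + self.conv_v(m_v).unsqueeze(2)  # broadcast fusion -> (B, C, D, T)
        return torch.sigmoid(self.gate(y)) * x + x                         # gated residual output X_unit

class GlobalPerceptionModule(nn.Module):
    """Two stacked GA units followed by a 1x1 convolution: X_ga = Conv(GA(GA(X)))."""
    def __init__(self, channels):
        super().__init__()
        self.ga1 = GlobalPerceptionUnit(channels)
        self.ga2 = GlobalPerceptionUnit(channels)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.out(self.ga2(self.ga1(x)))

if __name__ == "__main__":
    x = torch.randn(2, 128, 32, 100)            # assumed sizes
    print(GlobalPerceptionModule(128)(x).shape) # torch.Size([2, 128, 32, 100])
```

Swapping in max pooling, a different fusion rule, or the literal double multiplication written in the claim only requires changing the corresponding lines of forward().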
Further, the feature enhancement module is specifically as follows:
a hierarchical structure is used to capture the local information between candidate video segments; for an input X, downsampling is first used to aggregate the features between adjacent candidate boxes, with the calculation formula:
X_i = AvgPool(X_{i-1}), i ≥ 1
wherein, when i = 0, X_0 is the input X; the global perception module is embedded into the feature enhancement module, and all layers in the feature enhancement module share the same global perception module, so the output of each layer X_i^ga can be calculated according to the following formula:
X_i^ga = GA(X_i)
wherein X_i^ga represents the output result of the global perception module at each layer, and GA() represents the aforementioned global perception module operation;
then, an upsampling operation is used to aggregate the features between different layers, calculated as follows:
Z_i = X_i^ga + Up(X_{i+1}^ga)
wherein Z_i represents the output result of the i-th layer after fusing different layers, Up() represents the upsampling operation, and X_{i+1}^ga represents the output result of the (i+1)-th layer of the aforementioned global perception module.
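Likewise, a minimal sketch of the feature enhancement module is given below. The number of levels, the 2×2 average-pooling downsampling and the nearest-neighbour upsampling are assumptions chosen only to make the hierarchy concrete; the shared global perception module can be passed in (for example the GlobalPerceptionModule from the previous sketch), and an identity mapping is used by default so the sketch runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhancementModule(nn.Module):
    """Sketch of the hierarchy: X_i = AvgPool(X_{i-1}), X_i^ga = GA(X_i),
    Z_i = X_i^ga + Up(X_{i+1}^ga), with one GA module shared by every level."""
    def __init__(self, num_levels=3, ga_module=None):
        super().__init__()
        self.num_levels = num_levels
        # Pass GlobalPerceptionModule(channels) from the previous sketch here;
        # nn.Identity keeps this sketch runnable on its own.
        self.shared_ga = ga_module if ga_module is not None else nn.Identity()
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):                                   # x: (B, C, D, T)
        feats = [x]
        for _ in range(1, self.num_levels):
            feats.append(self.pool(feats[-1]))              # downsample adjacent candidates
        ga_outs = [self.shared_ga(f) for f in feats]        # shared global perception per level
        fused = [ga_outs[-1]]                               # top level has no (i+1)-th neighbour
        for i in range(self.num_levels - 2, -1, -1):        # Z_i = X_i^ga + Up(X_{i+1}^ga)
            up = F.interpolate(ga_outs[i + 1], size=ga_outs[i].shape[-2:], mode="nearest")
            fused.insert(0, ga_outs[i] + up)
        return fused                                        # one fused map per level

if __name__ == "__main__":
    fem = FeatureEnhancementModule(num_levels=3)
    outs = fem(torch.randn(2, 128, 32, 100))
    print([tuple(o.shape) for o in outs])  # [(2,128,32,100), (2,128,16,50), (2,128,8,25)]
```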
Further, the steps S4 to S5 specifically include the following steps:
step S41: using average pooling to aggregate the relations between adjacent video segments in X_ACFM;
step S42: inputting the resulting multi-level structure into the shared global perception module;
step S43: aggregating the features output by the global perception modules of adjacent levels;
step S44: preliminarily obtaining the prediction score of each candidate video segment through a shared convolution operation.
Further, the step S6 specifically includes the following steps:
step S61: generating the scores of all video segments and sorting them by value;
step S62: selecting the video segment with the highest score, finding the video segments that overlap it heavily, attenuating their scores, and repeating this process until a specified number of candidate video segments is retained.
Compared with the prior art, the invention has the following beneficial effects: the relations between video segments are captured from two different angles (local information and global information); for the global information, average pooling is used to establish relations for distant candidate video segments from a global viewpoint; for the local information, owing to the distribution characteristics of the candidate video segments, a hierarchical structure is used to aggregate the information between adjacent candidate video segments; because the local information and the global information complement each other, the whole model achieves better detection results;
for the prediction of candidate video segments, if they are processed separately, a constraint relation between them is ignored, and by exploring the relation between candidate video segments, more complete and accurate results can be generated, for different candidate video segments in the same video, they are often highly correlated, the relation of all candidate video segments is established, the characteristics of action instances can be enhanced by using background information, and for adjacent candidate video segments, there is a great amount of overlap between them, and effective information between them is aggregated by using average pooling, so as to generate more accurate results.
Drawings
Fig. 1 is a flowchart of embodiment 1.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
Example 1
In this embodiment, the system mainly includes a global perception module and a feature enhancement module, where:
in the global perception module, GA units are designed to establish the relations between candidate video segments in the same row and the same column; for an input X, X is first fed into two parallel paths, each path respectively comprising a vertical pooling layer or a horizontal pooling layer; for the horizontally pooled result M^h, the calculation formula is:
M^h(c, j) = (1/T) · Σ_{i=1}^{T} X(c, i, j)
wherein M^h represents the result after X is input to the horizontal pooling, T represents all possible start times in the video, c represents the c-th channel, i represents a video segment whose start time is i, j represents a video segment whose duration is j, and X(c, i, j) represents the value of the c-th channel of the video segment with i as the start time and j as the duration;
for the vertically pooled result M^v, the formula is:
M^v(c, i) = (1/D) · Σ_{j=1}^{D} X(c, i, j)
wherein M^v represents the result after X is input to the vertical pooling and D represents all possible durations of the video segments;
then, a one-dimensional convolution operation with kernel size 3 is used to aggregate the information of the current position and its neighbours, and the outputs of the two paths are fused to obtain the fusion result Y, as follows:
Y = Conv1d(M^h) + Conv1d(M^v)
the output of the global perception unit is then:
X_unit = Sig(Conv(Y)) × X × X
wherein X_unit represents the result of the global perception unit operation, X represents the input of the unit, Sig() represents the activation function, and Conv() represents a convolution operation;
the global perception unit is repeated twice to obtain the global perception module, whose calculation formula is:
X_ga = Conv(GA(GA(X)))
wherein X_ga represents the output of the global perception module, GA() represents the global perception unit operation, and Conv() represents a convolution operation;
in addition, in the feature enhancement module, a hierarchical structure is used to capture the local information between candidate video segments. For an input X, downsampling is first used to aggregate the features between adjacent candidate boxes, with the calculation formula:
X_i = AvgPool(X_{i-1}), i ≥ 1
where, when i = 0, X_0 is the input X. Because of the limitation of average pooling, only the relations between neighbouring candidate boxes can be aggregated and long-distance dependencies cannot be captured; therefore the global perception module is embedded into the feature enhancement module, and all layers in the feature enhancement module share the same global perception module, so that the output of each layer X_i^ga can be calculated according to the following formula:
X_i^ga = GA(X_i)
By doing so, the relations between all candidate video segments can be established from two complementary levels (local and global). Then an upsampling operation is used to aggregate the features between different layers; since each layer is supervised by label information, fusing the information of different layers reduces the generation of noise to the greatest extent. The calculation formula is:
Z_i = X_i^ga + Up(X_{i+1}^ga)
Thus, the relations between different candidate video segments can be captured from different levels and at different scales.
Based on the above, the time sequence action detection method for sensing video segment relations, as shown in Fig. 1, includes the following steps:
step S1: sampling a video;
selecting a proper training and testing data set, wherein the training and testing are mainly performed on the public data sets ActivityNet-1.3 and THUMOS-14;
the ActivityNet-1.3 dataset is an open dataset for video segment generation and detection, which contains mainly 19994 videos and contains 200 action classes, these videos are mainly crawled from youtube website, all of which are different in resolution and time, it was an ActivityNet change 2016 and 2017 game dataset, which divides all videos into training, evaluation and test sets in a 2:1:1 ratio;
the THIMOS-14 data set contains 413 videos and contains 20 categories of information, wherein the test set contains 212 videos, the validation set contains 200 videos used for the time series motion detection task, the entire model is trained on the validation set and the performance of the entire model is evaluated on the tester.
Step S2: performing preliminary feature extraction on the video by using the TSN (Temporal Segment Network) model to obtain features F.
First, for an unprocessed long video, the corresponding video frames are extracted and represented as S = {s_n}, n = 1, ..., N, where N represents the total number of video frames and s_n represents the n-th frame of the video. The label set of the video can be represented as Ψ = {(t_s,m, t_e,m, c_m)}, m = 1, ..., M, where M indicates the number of action video segments contained in the video, and t_s,m, t_e,m and c_m respectively represent the start time, end time and category information of the m-th label; the labels are only used during training. The TSN model is then used to extract the features of each video, which are expressed as F.
Step S3: using the BaseNet model to perform feature enhancement on the extracted features, generating the boundary predictions P_S and P_E of the time sequence nodes, and in addition extracting and outputting the features of all candidate video segments X_ACFM.
Graph convolution (a GCN) is used to enhance the features F and obtain features with richer semantic information, with the calculation formula:
F' = GCN(F)
wherein F' represents the features F enhanced by graph convolution. The features F' are then shared by two branch networks. One branch is used to judge whether each time sequence position is a start node or an end node, which can be denoted P_S and P_E, each of length T, where T represents the length of the video feature sequence. The other branch network outputs the features of all candidate video segments X_ACFM.
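For concreteness, the sketch below mirrors the two-branch structure of step S3: a shared backbone enhances the snippet features, one head outputs the start and end probabilities P_S and P_E, and a second path builds a (duration × start time) candidate feature map X_ACFM. Plain temporal convolutions stand in for the graph convolution used in the patent, and building each candidate feature by averaging the backbone features over the segment's span is an assumption made only for illustration.

```python
import torch
import torch.nn as nn

class BaseNetSketch(nn.Module):
    """Two-branch sketch: boundary probabilities P_S / P_E plus a candidate
    feature map X_ACFM of shape (B, C, D, T) (duration x start time)."""
    def __init__(self, in_dim, hidden=256, max_duration=32):
        super().__init__()
        self.max_duration = max_duration
        # Stand-in for the graph-convolution feature enhancement (assumption).
        self.backbone = nn.Sequential(
            nn.Conv1d(in_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.boundary_head = nn.Conv1d(hidden, 2, kernel_size=1)  # start / end logits

    def forward(self, feats):                      # feats: (B, T, in_dim)
        f = self.backbone(feats.transpose(1, 2))   # (B, hidden, T)
        p_start, p_end = torch.sigmoid(self.boundary_head(f)).unbind(dim=1)  # each (B, T)
        B, C, T = f.shape
        cum = torch.cumsum(f, dim=2)               # prefix sums for fast span averaging
        x_acfm = f.new_zeros(B, C, self.max_duration, T)
        for d in range(1, self.max_duration + 1):  # candidate (start = t, duration = d)
            for t in range(T - d + 1):
                left = cum[:, :, t - 1] if t > 0 else 0
                x_acfm[:, :, d - 1, t] = (cum[:, :, t + d - 1] - left) / d
        return p_start, p_end, x_acfm

if __name__ == "__main__":
    net = BaseNetSketch(in_dim=2048)
    p_s, p_e, x = net(torch.randn(2, 100, 2048))
    print(p_s.shape, p_e.shape, x.shape)           # (2,100) (2,100) (2,256,32,100)
```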
Step S4: capturing the relations between the candidate video segment features X_ACFM by using the global perception module and the feature enhancement module.
For the output X_ACFM of step S3, each position of which represents the feature of one candidate video segment, the features are input into the feature enhancement module, so that the relations between candidate video segments can be captured from both the local view and the global view, the features of action instances can be enhanced and the background information can be suppressed, thereby producing more accurate and more complete results; the output is denoted Z. A convolution is then used to predict the final result of each candidate video segment:
(P_reg, P_cls) = Sig(Conv(Z))
wherein P_reg and P_cls respectively represent the two kinds of prediction results obtained under regression and classification supervision, Sig() represents the activation function, and Conv() represents a convolution operation.
Step S5: combining the prediction results of step S3 and step S4 to generate the final judgment score.
In order to make full use of the output of the whole model, the output of every layer of the feature enhancement module is fused into the final score of each candidate video segment, and the boundary information of each video segment is also taken into account. For a video segment from start time t_s to end time t_e, its score is obtained by combining the boundary probabilities P_S(t_s) and P_E(t_e) with a weighted fusion, over all K layers of the hierarchical structure, of the per-layer classification and regression predictions; the fusion weights are hyper-parameters.
Step S6: removing repeated candidate video segments by using the Soft-NMS model.
After all possible candidate boxes are obtained, since most of them overlap heavily, the Soft-NMS model is used to remove redundant candidates. The scores of the candidate video segments are sorted by value, the candidate video segment with the largest score is selected, and its IoU with every other video segment is calculated; the scores of the video segments with a high degree of overlap are attenuated according to the following formula:
s_i = s_i · exp(-iou(p_max, p_i)^2 / σ), when iou(p_max, p_i) > θ
wherein σ represents the parameter of the Gaussian function, θ represents a predefined threshold, and p_max and p_i respectively represent the currently selected candidate video segment and another video segment.
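A compact sketch of the Gaussian Soft-NMS suppression used in this step is given below. The temporal IoU helper and the Gaussian score decay follow the standard Soft-NMS formulation; the default values of σ, the IoU threshold and the number of retained proposals are assumptions.

```python
import numpy as np

def temporal_iou(seg, segs):
    """IoU between one [start, end] segment and an array of segments."""
    inter = np.maximum(0.0, np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]))
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def soft_nms(segments, scores, sigma=0.5, iou_thresh=0.3, keep_top=100):
    """Gaussian Soft-NMS: repeatedly pick the best proposal and decay the
    scores of proposals that overlap it by more than iou_thresh."""
    segments, scores = segments.copy(), scores.copy()
    keep = []
    while len(keep) < keep_top and scores.size > 0:
        best = int(np.argmax(scores))
        keep.append((segments[best], scores[best]))
        seg = segments[best]
        segments = np.delete(segments, best, axis=0)
        scores = np.delete(scores, best)
        iou = temporal_iou(seg, segments) if scores.size else np.array([])
        decay = np.where(iou > iou_thresh, np.exp(-(iou ** 2) / sigma), 1.0)
        scores = scores * decay
    return keep

if __name__ == "__main__":
    segs = np.array([[1.0, 5.0], [1.2, 5.1], [8.0, 12.0]])
    print([s for s, _ in soft_nms(segs, np.array([0.9, 0.8, 0.7]), keep_top=2)])
```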
Step S7: classifying the candidate video segments by using the Unet classifier to obtain the class information of the candidate video segments.
After all possible candidate video segments are obtained, the video segments are classified with the Unet classifier to obtain their final class information, which can be expressed as {(c_n, s_n)}, n = 1, ..., N_p, where c_n is the class information, s_n is the corresponding score, and N_p is the number of predicted action instances.
Example 2
In this embodiment, the whole model of embodiment 1 needs to be trained, and the overall loss function is expressed as:
L = L_TEM + λ · L_FEM
wherein λ is a hyper-parameter, and L_TEM is used to judge whether each time node is a start node or an end node, which can be expressed as:
L_TEM = L_wce(P_S, G_S) + L_wce(P_E, G_E)
where L_wce(·,·) represents a weighted cross-entropy loss function, P_S and G_S respectively represent the prediction result for the start nodes and its corresponding label, and P_E and G_E respectively represent the prediction result for the end nodes and its corresponding label.
To train the feature enhancement module, a label G_0 is generated for the 0-th layer of the feature enhancement module. Since the feature enhancement module is a hierarchical structure and every layer is supervised, a label G_i is generated for each layer from the layer-0 label G_0. The loss function of each layer can therefore be defined as:
L_i = L_reg(P_reg,i, G_i) + L_cls(P_cls,i, G_i)
wherein L_i represents the loss function of the i-th layer, P_reg,i and G_i respectively represent the prediction obtained by the i-th layer under regression supervision and its corresponding label, P_cls,i and G_i respectively represent the prediction obtained by the i-th layer under classification supervision and its corresponding label, and L_reg and L_cls respectively represent the squared-error loss and the weighted cross-entropy loss. Then
L_FEM = Σ_{i=0}^{K} λ_i · L_i
wherein L_FEM represents the loss function of the hierarchy, i represents the i-th layer, K represents the total number of layers, and λ_i represents a hyper-parameter. Finally, the loss function L can be optimized in an end-to-end fashion.
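The loss structure of this embodiment can be summarised with the short sketch below. The weighted binary cross-entropy, the squared-error regression term and the per-layer weighted sum follow the description above, while the positive/negative re-weighting scheme and the example shapes are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_bce(pred, target):
    """Weighted binary cross-entropy: up-weight the (rare) positive positions."""
    pos_ratio = target.mean().clamp(min=1e-4, max=1 - 1e-4)
    weight = torch.where(target > 0.5, 0.5 / pos_ratio, 0.5 / (1 - pos_ratio))
    return F.binary_cross_entropy(pred, target, weight=weight)

def total_loss(p_start, p_end, g_start, g_end, layer_preds, layer_labels,
               lam=1.0, layer_weights=None):
    """L = L_TEM + lam * L_FEM, with L_FEM a weighted sum of per-layer
    classification (weighted BCE) and regression (MSE) losses."""
    l_tem = weighted_bce(p_start, g_start) + weighted_bce(p_end, g_end)
    layer_weights = layer_weights or [1.0] * len(layer_preds)
    l_fem = sum(w * (weighted_bce(p_cls, g) + F.mse_loss(p_reg, g))
                for w, (p_cls, p_reg), g in zip(layer_weights, layer_preds, layer_labels))
    return l_tem + lam * l_fem

if __name__ == "__main__":
    T, D = 100, 32
    ps, pe = torch.rand(T), torch.rand(T)
    gs, ge = (torch.rand(T) > 0.9).float(), (torch.rand(T) > 0.9).float()
    preds = [(torch.rand(D, T), torch.rand(D, T)), (torch.rand(D // 2, T // 2), torch.rand(D // 2, T // 2))]
    labels = [torch.rand(D, T), torch.rand(D // 2, T // 2)]
    print(total_loss(ps, pe, gs, ge, preds, labels).item())
```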
Example 3
In this embodiment, the validity of the method is verified by using the selected data set, which specifically includes:
the method is verified on the selected data set, in order to well evaluate the effectiveness of the embodiment, the average accuracy mAP is selected as a main evaluation index, the mAP is respectively calculated on an iou set {0.3,0.4,0.5,0.6 and 0.7} on the THUMOS-14 data set, the mAP on the iou set {0.5,0.75 and 0.95} is calculated on an activityNet1.3 data set, and in addition, the average mAP of ten different ious is calculated on the activityNet 1.3.
This embodiment is verified on the currently mainstream ActivityNet-1.3 dataset, and the final verification results are shown in the following table (comparison of model performance on the ActivityNet-1.3 dataset, %):
Table 1. Comparison of model performance on the ActivityNet-1.3 dataset
This embodiment is also verified on the currently mainstream THUMOS-14 dataset, and the final verification results are shown in the following table (comparison of model performance on the THUMOS-14 dataset, %):
Table 2. Comparison of model performance on the THUMOS-14 dataset
The above are embodiments of the present invention. The specific parameters in the above embodiments and examples are only intended to clearly illustrate the inventors' verification process and are not intended to limit the scope of protection of the invention, which is defined by the claims; all equivalent structural changes made using the contents of the specification and drawings of the present invention shall likewise fall within the scope of protection of the present invention.

Claims (5)

1. A method for detecting time sequence action of sensing video clip relation is characterized by comprising the following steps:
step S1: sampling a video;
step S2: performing preliminary feature extraction on the video by using a TSN (Temporal Segment Network) model to obtain features F;
step S3: using a BaseNet model to perform feature enhancement on the extracted features F, generating boundary predictions P_S and P_E for the time sequence nodes; in addition, extracting the features of all candidate video segments, denoted X_ACFM, wherein each element of X_ACFM represents the feature of one candidate video segment;
step S4: capturing the relations between the candidate video segment features X_ACFM by using a global perception module and a feature enhancement module;
step S5: combining the prediction results of step S3 and step S4 to generate a final judgment score;
step S6: removing repeated candidate video segments by using a Soft-NMS model;
step S7: classifying the candidate video segments by using a Unet classifier to obtain class information of the candidate video segments;
the global perception module is specifically as follows:
a global perception unit is designed to establish the relations between candidate video segments in the same row and the same column; for an input X, X is first fed into two parallel paths, each path respectively comprising a vertical pooling layer or a horizontal pooling layer; for the horizontally pooled result M^h, the calculation formula is:
M^h(c, j) = (1/T) · Σ_{i=1}^{T} X(c, i, j)
wherein M^h represents the result after X is input to the horizontal pooling, T represents all possible start times in the video, c represents the c-th channel, i represents a video segment start time of i, j represents a video segment duration of j, and X(c, i, j) represents the value of the c-th channel of the video segment with i as the start time and j as the duration;
for the vertically pooled result M^v, the formula is:
M^v(c, i) = (1/D) · Σ_{j=1}^{D} X(c, i, j)
wherein M^v represents the result after X is input to the vertical pooling, D represents all possible durations of the video segments, c represents the c-th channel, i represents a video segment start time of i, j represents a video segment duration of j, and X(c, i, j) represents the value of the c-th channel of the video segment with i as the start time and j as the duration;
then, the information of the current position and its neighbours is aggregated by a one-dimensional convolution operation with kernel size 3, and the outputs of the two paths are fused to obtain the fusion result Y, as follows:
Y = Conv1d(M^h) + Conv1d(M^v)
the output of the global perception unit is then:
X_unit = Sig(Conv(Y)) × X × X
wherein X_unit represents the result of the global perception unit operation, X represents the input of the unit, Sig() represents the activation function, and Conv() represents the convolution operation;
the global perception unit is repeated twice to obtain the global perception module, whose calculation formula is:
X_ga = Conv(GA(GA(X)))
wherein X_ga represents the output of the global perception module, GA() represents the global perception unit operation, and Conv() represents the convolution operation;
the feature enhancement module is specifically as follows:
a hierarchical structure is used to capture the local information between candidate video segments; for an input X, downsampling is first used to aggregate the features between adjacent candidate boxes, with the calculation formula:
X_i = AvgPool(X_{i-1}), i ≥ 1
wherein, when i = 0, X_0 is the input X; the global perception module is embedded into the feature enhancement module, and all layers in the feature enhancement module share the same global perception module, so the output of each layer X_i^ga can be calculated according to the following formula:
X_i^ga = GA(X_i)
wherein X_i^ga represents the output result of the global perception module at each layer, and GA() represents the aforementioned global perception module operation;
then, an upsampling operation is used to aggregate the features between different layers, calculated as follows:
Z_i = X_i^ga + Up(X_{i+1}^ga)
wherein Z_i represents the output result of the i-th layer after fusing different layers, Up() represents the upsampling operation, and X_{i+1}^ga represents the output result of the (i+1)-th layer of the global perception module.
2. The method according to claim 1, wherein said method comprises: the step S2 specifically includes the following steps:
step S21: first, sampling the long video at a certain time interval to obtain a certain number of video segments;
step S22: inputting the video segments into the TSN model, obtaining the visual features and the motion features respectively, and concatenating the visual features and the motion features.
3. The method according to claim 1, wherein said method comprises: the step S3 specifically includes the following steps:
step S31: establishing the relation among all video frames by utilizing graph convolution, and dynamically fusing multi-scale context semantic information to video characteristics;
step S32: predicting, by graph convolution, the possibility that each time sequence position is a start node or an end node, i.e. generating the boundary predictions P_S and P_E of the time sequence nodes; and, in addition, extracting the features of all candidate video segments and outputting X_ACFM.
4. The method according to claim 1, wherein said method comprises: the steps S4 to S5 specifically include the following steps:
step S41: using average pooling to aggregate the relations between adjacent video segments in X_ACFM;
step S42: inputting the multi-level structure into a shared global perception module;
step S43: aggregating features output by global perception modules in adjacent levels;
step S44: the prediction score of each candidate video segment is obtained preliminarily through a shared convolution operation.
5. The method according to claim 1, wherein said method comprises: the step S6 specifically includes the following steps:
step S61: generating the scores of all video segments and sorting them by value;
step S62: selecting the video segment with the highest score, finding the video segments that overlap it heavily, attenuating their scores, and repeating this process until a specified number of candidate video segments is retained.
CN202110659154.1A 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation Active CN113255570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110659154.1A CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110659154.1A CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Publications (2)

Publication Number Publication Date
CN113255570A CN113255570A (en) 2021-08-13
CN113255570B true CN113255570B (en) 2021-09-24

Family

ID=77187848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110659154.1A Active CN113255570B (en) 2021-06-15 2021-06-15 Sequential action detection method for sensing video clip relation

Country Status (1)

Country Link
CN (1) CN113255570B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920467B (en) * 2021-12-13 2022-03-15 成都考拉悠然科技有限公司 Tourist and commercial detection method and system combining booth detection and scene segmentation

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241829A (en) * 2018-07-25 2019-01-18 中国科学院自动化研究所 The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111079594A (en) * 2019-12-04 2020-04-28 成都考拉悠然科技有限公司 Video action classification and identification method based on double-current cooperative network
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112364852A (en) * 2021-01-13 2021-02-12 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11103773B2 (en) * 2018-07-27 2021-08-31 Yogesh Rathod Displaying virtual objects based on recognition of real world object and identification of real world object associated location or geofence
CN111372123B (en) * 2020-03-03 2022-08-09 南京信息工程大学 Video time sequence segment extraction method based on local to global

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241829A (en) * 2018-07-25 2019-01-18 中国科学院自动化研究所 The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110765854A (en) * 2019-09-12 2020-02-07 昆明理工大学 Video motion recognition method
CN111079594A (en) * 2019-12-04 2020-04-28 成都考拉悠然科技有限公司 Video action classification and identification method based on double-current cooperative network
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112364852A (en) * 2021-01-13 2021-02-12 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation; Haisheng Su et al.; arXiv; 2021-03-01; 1-9 *
Cooperative Cross-Stream Network for Discriminative Action Representation; Jingran Zhang et al.; arXiv; 2019-08-27; 1-10 *
Temporal Context Aggregation Network for Temporal Action Proposal Refinement; Zhiwu Qing et al.; arXiv; 2021-03-24; 1-10 *
Research on video-based human action recognition based on deep learning; Li Yiying; China Masters' Theses Full-text Database, Information Science and Technology; 2021-01-15 (No. 01); I138-1923 *

Also Published As

Publication number Publication date
CN113255570A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN110175580B (en) Video behavior identification method based on time sequence causal convolutional network
CN107330362B (en) Video classification method based on space-time attention
CN111008337B (en) Deep attention rumor identification method and device based on ternary characteristics
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
CN110851176B (en) Clone code detection method capable of automatically constructing and utilizing pseudo-clone corpus
Wang et al. Spatial–temporal pooling for action recognition in videos
CN107341508B (en) Fast food picture identification method and system
CN110532911B (en) Covariance measurement driven small sample GIF short video emotion recognition method and system
CN111783712A (en) Video processing method, device, equipment and medium
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN107220902A (en) The cascade scale forecast method of online community network
CN110533053B (en) Event detection method and device and electronic equipment
CN113255570B (en) Sequential action detection method for sensing video clip relation
CN112364852B (en) Action video segment extraction method fusing global information
CN115659966A (en) Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention
CN110263638B (en) Video classification method based on significant information
Zou et al. STA3D: Spatiotemporally attentive 3D network for video saliency prediction
CN114741599A (en) News recommendation method and system based on knowledge enhancement and attention mechanism
Sharma et al. Construction of large-scale misinformation labeled datasets from social media discourse using label refinement
CN113298015A (en) Video character social relationship graph generation method based on graph convolution network
CN113010705A (en) Label prediction method, device, equipment and storage medium
CN114092819B (en) Image classification method and device
CN113033500B (en) Motion segment detection method, model training method and device
Li et al. Human perception evaluation system for urban streetscapes based on computer vision algorithms with attention mechanisms
Lin et al. Temporal action localization with two-stream segment-based RNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant