CN112613349A - Time sequence action detection method and device based on deep hybrid convolutional neural network - Google Patents

Time sequence action detection method and device based on deep hybrid convolutional neural network

Info

Publication number
CN112613349A
CN112613349A (application CN202011402943.9A)
Authority
CN
China
Prior art keywords
subnet
action
neural network
features
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011402943.9A
Other languages
Chinese (zh)
Other versions
CN112613349B (en)
Inventor
甘明刚
张琰
刘洁玺
陈杰
窦丽华
陈文颉
陈晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202011402943.9A priority Critical patent/CN112613349B/en
Publication of CN112613349A publication Critical patent/CN112613349A/en
Application granted granted Critical
Publication of CN112613349B publication Critical patent/CN112613349B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention provides a time sequence action detection method and device based on a deep hybrid convolutional neural network. The method comprises: acquiring a video to be detected; and inputting the video into a trained deep hybrid convolutional neural network, wherein the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet. The feature encoding module extracts segment features from the original video data through a dual-stream network; the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the second subnet receives the proposal features, constructs a graph based on the relationships among the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features. The action category, action start time and action end time of the video to be detected are then output. According to the scheme of the invention, the relationships among proposals are effectively exploited, the accuracy of time sequence action detection is improved, and the time sequence action detection problem is effectively solved.

Description

Time sequence action detection method and device based on deep hybrid convolutional neural network
Technical Field
The invention relates to the field of video recognition, and in particular to a time sequence action detection method and device based on a deep hybrid convolutional neural network.
Background
Temporal action detection is a basic and challenging task in understanding human behavior, mainly used for segmenting and classifying long, untrimmed videos. Given a long untrimmed video, the algorithm needs to detect the action segments in the video, and the detection result includes a start time, an end time and an action category. A video may contain one or more identical or different action segments.
Temporal action detection is generally divided into two stages: temporal proposal generation and proposal classification. Temporal proposal generation is analogous to candidate-box generation in object detection, i.e., it generates a series of start and end times; proposal classification classifies the action within each generated proposal interval and determines the action class. For proposal classification, most methods treat it as an action recognition task and directly adopt action recognition methods. However, proposals have a more complex temporal structure than trimmed videos, and there is usually a semantic relationship between proposals. Existing methods ignore these differences, which limits their performance.
Directly adopting action recognition methods for proposal classification, as in the prior art, has the following problems:
(1) In action recognition there is only one action per clip, while a proposal may contain multiple actions.
(2) While action recognition involves a complete action, a proposal typically covers only a portion of an action.
(3) Clips in action recognition have no semantic relationship with one another, whereas multiple proposals generated from the same video or sharing the same action label are semantically related.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a time sequence action detection method and device based on a deep hybrid convolutional neural network, which address the following problems of the prior art: (1) the action recognition methods adopted for proposal classification cannot model the complex temporal structure between proposals; their receptive field is small, they capture only short-term temporal relationships, and they cannot obtain high-quality proposal features; (2) a proposal rarely contains an entire action and lacks sufficient information to generate accurate temporal boundaries, so information needs to be obtained from other proposals.
According to a first aspect of the present invention, there is provided a time series action detection method based on a deep hybrid convolutional neural network, the method comprising the steps of:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
step S103: outputting the action category, action start time and action end time of the video to be detected.
According to a second aspect of the present invention, there is provided a time-series action detection apparatus based on a deep hybrid convolutional neural network, the apparatus comprising:
a video acquisition module, configured to acquire a video to be detected;
a detection module, configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
an output module, configured to output the action category and the action start and end times of the video to be detected.
According to a third aspect of the present invention, there is provided a time-series action detection system based on a deep hybrid convolutional neural network, comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored in the memory and are loaded and executed by the processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having a plurality of instructions stored therein; the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
According to the scheme of the invention, the temporal structure can be modeled while the relationships among proposals are effectively utilized, thereby improving the accuracy of time sequence action detection and effectively solving the time sequence action detection problem.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of a method for detecting a time-series action based on a deep hybrid convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep hybrid convolutional neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fusion RGB and Flow model for prediction according to an embodiment of the present invention;
FIG. 4 is a graph illustrating the results of training tests on a test data set, in accordance with one embodiment of the present invention;
FIG. 5 is a graph illustrating the results of a training test on a test data set, in accordance with yet another embodiment of the present invention;
fig. 6 is a block diagram of a time sequence action detection apparatus based on a deep hybrid convolutional neural network according to an embodiment of the present invention.
Detailed Description
First, the flow of a time sequence action detection method based on a deep hybrid convolutional neural network according to an embodiment of the present invention will be described with reference to fig. 1. As shown in fig. 1, the method comprises the following steps:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
step S103: outputting the action category, action start time and action end time of the video to be detected.
In this embodiment, a dual-stream network (e.g., an I3D network) is used for feature encoding. Each input video is first divided into a set of segments, which are then sent into the dual-stream network for feature extraction.
The feature encoding module extracts segment features from the original video data through a dual-stream network as follows:
In the present embodiment, for the video data, i.e., a given uncut video V, a temporal sliding window of length n_w frames and stride σ is generated. The sliding window is used to divide the video V into a group of video segments

S = {s_h}, h = 1, ..., N

as the input of the feature encoding module, where s_h is a segmented video segment and N is the total number of segments in the video V. The feature encoding module outputs the feature sequence

F = {f_h}, h = 1, ..., N

where f_h is the feature corresponding to s_h, which serves as the input of the first subnet.
Thus, for a given uncut video, a feature sequence F = {f_h} is obtained as the input of the first subnet.
Further, a video feature sequence can also be obtained by concatenating the segment features, which is then fed into the first subnet to obtain high-quality proposal features.
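The following is a minimal Python sketch of this preprocessing step, assuming the sliding-window segmentation and a dual-stream (RGB plus optical-flow) backbone as described above. The function names, the use of PyTorch, and the assumption that the optical-flow segments are precomputed are all illustrative, not part of the patent.

```python
import torch

def segment_video(frames, n_w=64, stride=32):
    """Split a frame tensor (T, C, H, W) into overlapping segments of n_w frames
    using a sliding window with the given stride (n_w and stride are placeholders)."""
    T = frames.shape[0]
    segments = []
    for start in range(0, max(T - n_w, 0) + 1, stride):
        segments.append(frames[start:start + n_w])
    return segments

def encode_segments(segments, flow_segments, rgb_net, flow_net):
    """Encode each segment with a dual-stream backbone (RGB and optical flow)
    and concatenate the two stream features into one segment feature f_h."""
    feats = []
    for seg, flo in zip(segments, flow_segments):
        f_rgb = rgb_net(seg.unsqueeze(0))    # (1, C_rgb)
        f_flow = flow_net(flo.unsqueeze(0))  # (1, C_flow)
        feats.append(torch.cat([f_rgb, f_flow], dim=1))
    return torch.cat(feats, dim=0)           # (N, C_dim): the feature sequence F
```

In use, `rgb_net` and `flow_net` would be the two streams of a pretrained backbone (e.g. I3D) with their classification heads removed.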
The structure of the deep hybrid convolutional neural network of the present embodiment is shown in fig. 2.
The first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module. The first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence; the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers, and the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling, wherein:
The convolution layers in the first subnet use large convolution kernels, each of size at least 9 x 9; the first subnet models complex temporal structure by stacking one-dimensional temporal convolution layers with large kernels.
To model the complex temporal structure of proposals and capture long-term temporal correlations, the present embodiment employs a first subnet consisting of a one-dimensional temporal convolution block and an RoI pooling layer to obtain high-quality proposal features from the video segment features extracted by the feature encoding module.
The one-dimensional temporal convolution block comprises two one-dimensional convolution layers. To avoid losing segment information, the number of channels of the one-dimensional convolutions is set equal to the segment feature dimension C_dim, and the stride is set to 1.
The hidden features of the last layer of the first subnet are added to the incoming segment features, so that the resulting features contain multi-scale temporal information.
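A possible PyTorch sketch of this first subnet is shown below: two stacked 1-D temporal convolutions with C_dim channels and stride 1, the input features added back to the last hidden layer, followed by a simple temporal RoI pooling. Kernel size, the ReLU activations, and the adaptive-max-pool implementation of RoI pooling are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstSubnet(nn.Module):
    """Two stacked 1-D temporal convolutions (channels = C_dim, stride 1, large kernel),
    with the input segment features added to the last hidden layer (multi-scale features)."""
    def __init__(self, c_dim, kernel_size=9):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(c_dim, c_dim, kernel_size, stride=1, padding=pad)
        self.conv2 = nn.Conv1d(c_dim, c_dim, kernel_size, stride=1, padding=pad)

    def forward(self, feats):                 # feats: (B, C_dim, N) segment features
        h = F.relu(self.conv1(feats))
        h = F.relu(self.conv2(h))
        return h + feats                      # hidden features + incoming segment features

def roi_pool_1d(features, proposals, out_len=16):
    """Temporal RoI pooling: crop each (start, end) index range and pool to a fixed length."""
    pooled = []
    for start, end in proposals:
        crop = features[:, :, start:end]                       # (B, C_dim, L)
        pooled.append(F.adaptive_max_pool1d(crop, out_len))    # (B, C_dim, out_len)
    return torch.stack(pooled, dim=1)                          # (B, P, C_dim, out_len)
```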
further, the proposed features generated by the first subnet are sent to a second subnet, a graph network is constructed, and the long-term time sequence dependency relationship is captured through the graph convolution network. The second subnet is used to mine the relationships between the offers and expand the acceptance area of the offer features.
The second subnet receives the proposed features, constructs a graph based on relationships between the proposed features, inputs the constructed graph into a GCN model, expands an acceptance region of the proposed features, wherein:
the second subnet receives the offer features and constructs all offers in the video into an offer graph G<V(G),E(G)>Representing a graph comprising N nodes, where node viE.g. V (G), edge vijE (G); taking each proposal as a node, and taking the proposal characteristics obtained from the first subnet as the node characteristics of the corresponding node;
constructing edges based on the timing relationship between nodes, choosing IoU metric to measure the degree of overlap between two proposals if there is overlap between them, if IoU (p)i,pj)>θiouThen propose piAnd pjAn edge is established between the two edges,
Figure BDA0002817526760000061
if there is no overlap between the two proposals, d (p) is usedi,pj) Measure the distance between the proposals, where θiouIs a certain threshold; c. Ci、cjRespectively represent proposals pi、pjCentral coordinate of (b), U (p)i,pj) Represents the union of two proposals;
at IoU (p)i,pj) Or d (p)i,pj) Greater than a threshold value thetaiouAn edge is established between the nodes.
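A small NumPy sketch of this edge rule follows. It assumes the same threshold θ_iou is used for both the IoU test and the distance test, as the text states; the self-loops on the diagonal and the concrete threshold value are illustrative assumptions.

```python
import numpy as np

def temporal_iou(p, q):
    """IoU between two temporal proposals given as (start, end)."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def build_proposal_graph(proposals, theta=0.7):
    """Adjacency matrix over proposals: connect overlapping proposals with IoU > theta,
    and non-overlapping ones whose measure d = |c_i - c_j| / union exceeds theta."""
    n = len(proposals)
    A = np.eye(n)                              # self-loops (assumption)
    for i in range(n):
        for j in range(i + 1, n):
            p, q = proposals[i], proposals[j]
            iou = temporal_iou(p, q)
            if iou > 0:
                connected = iou > theta
            else:
                c_i, c_j = (p[0] + p[1]) / 2, (q[0] + q[1]) / 2
                union = max(p[1], q[1]) - min(p[0], q[0])
                connected = abs(c_i - c_j) / union > theta
            if connected:
                A[i, j] = A[j, i] = 1.0
    return A
```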
The constructed graph is input into the GCN model, i.e., a GCN (graph convolutional network) model is applied on the graph so that each node can aggregate information from its neighborhood.
In this embodiment, the GCN model is used for proposal classification and includes a first GCN model and a second GCN model, both operating on the graph G built over the proposal features. The input of the first GCN model is the original proposal features, used for predicting the action category. The input of the second GCN model is extended proposal features: in the first subnet, each proposal is extended at both its start and end by half of its duration, and the RoI pooling layer is then applied to obtain the extended proposal features, which are used for predicting the completeness label and the action boundaries.
The first GCN model and the second GCN model are both composed of two graph convolution layers, implemented as follows:

X^(i) = A X^(i-1) W^(i)

where A is the adjacency matrix computed from the cosine similarities between proposal features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i.
A fully-connected (FC) layer with a softmax operation is applied on top of the first GCN model to predict the action label, and two fully-connected (FC) layers are employed on top of the second GCN model to predict the completeness label and the boundaries.
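Below is a hedged PyTorch sketch of such a two-layer GCN with a cosine-similarity adjacency and the three prediction heads. The row-softmax normalization of A, the hidden sizes, the ReLU activations, and the head dimensions are all illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_adjacency(x):
    """Adjacency from cosine similarities between proposal features x: (P, C).
    Row-softmax normalization is an assumption."""
    x_norm = F.normalize(x, dim=1)
    return torch.softmax(x_norm @ x_norm.t(), dim=1)

class TwoLayerGCN(nn.Module):
    """Two graph convolution layers of the form X^(i) = A X^(i-1) W^(i)."""
    def __init__(self, c_in, c_hidden, c_out):
        super().__init__()
        self.w1 = nn.Linear(c_in, c_hidden, bias=False)
        self.w2 = nn.Linear(c_hidden, c_out, bias=False)

    def forward(self, x):                 # x: (P, c_in) proposal features
        a = cosine_adjacency(x)
        h = F.relu(a @ self.w1(x))
        return a @ self.w2(h)

# prediction heads on top of the two GCN branches (dimensions are placeholders)
num_classes = 20
gcn_cls = TwoLayerGCN(1024, 512, 256)      # on original proposal features
gcn_ext = TwoLayerGCN(1024, 512, 256)      # on extended proposal features
action_head = nn.Linear(256, num_classes + 1)    # softmax action label (incl. background)
completeness_head = nn.Linear(256, num_classes)  # per-class completeness
boundary_head = nn.Linear(256, 2 * num_classes)  # per-class (t_c, t_l) offsets
```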
Further, a completeness classifier is added to predict whether a proposal contains a complete action instance, so that proposals containing only part of an action instance are filtered out. The completeness classifier includes n binary classifiers, one per action class. For each action class k, the corresponding completeness classifier C_k generates a probability value representing the probability that the proposal captures a complete action instance of class k. Considering both the category score and the completeness score of a proposal yields strong temporal action detection performance.
To predict precise boundaries of action instances, the present embodiment further learns the center-coordinate offset and the length offset between a real action instance and a proposal. The regression offsets are computed as:

t_c = (c_{p,i} - c_{gt,i}) / l_{p,i},    t_l = log(l_{gt,i} / l_{p,i})

where t_c is the center-coordinate offset of the proposal, c_{p,i} is the proposal center coordinate, c_{gt,i} is the center coordinate of the labeled instance, l_{p,i} is the proposal length, t_l is the length offset, and l_{gt,i} is the length of the labeled instance.
Further, the training process of the deep hybrid convolutional neural network comprises:
first, as explained below with respect to a sample video, an uncut video may be represented as
Figure BDA0002817526760000071
Figure BDA0002817526760000072
Wherein T represents the number of video frames, XtRepresents the height and width of the image of the t-th frame, H represents the height of the image, and W represents the width of the image. Annotation information for video V is composed of a set of action instances
Figure BDA0002817526760000073
Figure BDA0002817526760000074
Wherein N isgNumber representing real action instance in video V, cgt,n,lgt,nRespectively represent an action instance phinCenter coordinate and length of (y)gt,nRepresentative action instance phinThe category label of (1).
Given a set of proposals, three types of training proposals are collected by evaluating their Intersection over Union (IoU) with the annotations: (1) positive proposals, whose IoU with the closest annotated instance is greater than 0.7; (2) background proposals, which satisfy the following criteria: their IoU with the closest annotated instance is below 0.01, and their span is greater than 0.01 of the video length; (3) incomplete proposals, which satisfy the following criteria: the fraction of their span contained in an annotated instance is greater than 0.01, while their IoU with that instance is less than 0.3. During training, each mini-batch is guaranteed to contain all three types of proposals: positive and background proposals are used to train the action classifier; positive and incomplete proposals are applied to the completeness classifier; only positive proposals are used to train the regression fully-connected layer.
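A minimal sketch of this sampling rule is given below, using the thresholds stated above (0.7, 0.01, 0.3). Interpreting the "span ratio" criterion as the fraction of the proposal's span that lies inside the annotated instance is an assumption, as is the function name.

```python
def categorize_proposal(proposal, gt_instances, video_len):
    """Assign a (start, end) proposal to positive / background / incomplete."""
    def iou(p, q):
        inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
        union = max(p[1], q[1]) - min(p[0], q[0])
        return inter / union if union > 0 else 0.0

    best_gt = max(gt_instances, key=lambda gt: iou(proposal, gt), default=None)
    best_iou = iou(proposal, best_gt) if best_gt is not None else 0.0
    span = proposal[1] - proposal[0]

    if best_iou > 0.7:
        return "positive"                      # trains classifier, completeness, regressor
    if best_iou < 0.01 and span > 0.01 * video_len:
        return "background"                    # trains the action classifier
    if best_gt is not None:
        overlap = max(0.0, min(proposal[1], best_gt[1]) - max(proposal[0], best_gt[0]))
        if overlap / span > 0.01 and best_iou < 0.3:
            return "incomplete"                # trains the completeness classifier
    return "ignored"
```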
In this embodiment, a multi-task loss function is adopted in the training process; it combines the action classification loss L_cls, the regression loss L_reg, the completeness loss L_com, and an L2 regularization term to train the network. The total loss is defined as

L_total = L_cls + λ1·L_reg + λ2·L_com + λ3·L2

where λ1, λ2 and λ3 are weight coefficients; L_cls is the cross-entropy loss; L_com is a hinge loss that predicts whether a proposal captures a complete action instance.
The regression loss L_reg is a smooth L1 loss between each positive proposal and its closest real action instance. It is computed as:

L_reg = (1/N) Σ_i [ SL1(t_{i,c} - t̂_{i,c}) + SL1(t_{i,l} - t̂_{i,l}) ]

where SL1 is the smooth L1 loss, N is the number of nodes, t_{i,c} is the center-coordinate offset between the proposal and the label and t̂_{i,c} is its predicted value, and t_{i,l} is the length offset between the proposal and the label and t̂_{i,l} is its predicted value.
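The following sketch shows how the regression targets and the combined multi-task loss could be assembled in PyTorch. The weight values, the use of `hinge_embedding_loss` as a stand-in for the completeness hinge loss, and the function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def regression_targets(c_p, l_p, c_gt, l_gt):
    """Offsets from the text: t_c = (c_p - c_gt) / l_p, t_l = log(l_gt / l_p)."""
    return (c_p - c_gt) / l_p, torch.log(l_gt / l_p)

def total_loss(cls_logits, labels, com_scores, com_labels,
               t_pred, t_target, model, lambdas=(1.0, 1.0, 1e-4)):
    """L_total = L_cls + λ1·L_reg + λ2·L_com + λ3·L2 (weights are placeholders)."""
    l_cls = F.cross_entropy(cls_logits, labels)
    l_reg = F.smooth_l1_loss(t_pred, t_target)              # over positive proposals only
    # completeness: hinge-style loss, targets in {+1, -1}; a placeholder for L_com
    l_com = F.hinge_embedding_loss(com_scores, com_labels)
    l_2 = sum((p ** 2).sum() for p in model.parameters())   # L2 regularization term
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_reg + lam2 * l_com + lam3 * l_2
```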
Optionally, in this embodiment, an iterative method is further adopted in the training process for boundary regression and classification: the output boundaries are fed back to the network as input for the next refinement. As shown in fig. 3, the predictions of the RGB and Flow models are fused to obtain the final prediction, which is fed back to the RGB and Flow models respectively for the next refinement. After k_i iterations, the final boundaries and classification scores are obtained.
For each iteration, the DHCNet outputs n pairs of temporal boundary offsets, n completeness scores, and n+1 category scores. The final confidence score is obtained by multiplying the category score and the completeness score for each of the n non-background categories; that is, for each proposal, the final score for the k-th category is calculated as:

s_{i,k} = s_{i,k}^{cls} · s_{i,k}^{com}

where s_{i,k}^{cls} is the category score and s_{i,k}^{com} is the completeness score.
For each proposal p_i, the category with the highest confidence is taken as the prediction, and the corresponding regression offsets are obtained.
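A short sketch of this score fusion step is shown below. The assumption that the background class occupies the first column of the classification scores, and the function name, are illustrative only.

```python
import torch

def fuse_scores(cls_scores, com_scores):
    """Final per-class confidence s_{i,k} = s^{cls}_{i,k} * s^{com}_{i,k}.
    cls_scores: (P, n+1) softmax scores with background at column 0 (assumed layout);
    com_scores: (P, n) completeness scores for the n action classes."""
    fused = cls_scores[:, 1:] * com_scores      # drop the background column
    conf, cls_idx = fused.max(dim=1)            # highest-confidence class per proposal
    return fused, conf, cls_idx
```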
Since the predictions of the RGB and Flow models have different confidence levels, a weighted-average method is used to fuse the RGB and Flow streams. Specifically, the network of this embodiment is trained separately on the two streams, and their predictions are then fused with a 1 : η weighting: the final prediction scores of the RGB and Flow models for the k-th class, the center-coordinate offsets predicted by the RGB and Flow models, and the length offsets predicted by the RGB and Flow models are each combined with weights 1 and η, respectively. For each proposal, the final confidence score and the regressed boundary are used for evaluation, and non-maximum suppression (NMS) is applied to reduce redundant results.
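A minimal sketch of the 1 : η fusion follows. Normalizing by (1 + η), the dictionary layout, and the example value of η are assumptions; the text only specifies that the class scores, center offsets and length offsets of the two streams are combined with a 1 : η weighting before NMS.

```python
def fuse_streams(rgb_pred, flow_pred, eta=1.5):
    """1 : eta weighted fusion of RGB and Flow predictions (eta is a placeholder).
    Each prediction dict holds per-proposal class scores and (t_c, t_l) offsets."""
    w = 1.0 + eta
    return {
        "scores":   (rgb_pred["scores"]   + eta * flow_pred["scores"])   / w,
        "t_center": (rgb_pred["t_center"] + eta * flow_pred["t_center"]) / w,
        "t_length": (rgb_pred["t_length"] + eta * flow_pred["t_length"]) / w,
    }
```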
The detection results of this embodiment on two public data sets are shown in figs. 4-5; the method achieves a better detection effect, which verifies its effectiveness.
An embodiment of the present invention further provides a time sequence action detection apparatus based on a deep hybrid convolutional neural network. As shown in fig. 6, the apparatus comprises:
a video acquisition module, configured to acquire a video to be detected;
a detection module, configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
an output module, configured to output the action category and the action start and end times of the video to be detected.
The embodiment of the invention further provides a time sequence action detection system based on the deep hybrid convolutional neural network, which comprises the following steps:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored in the memory and are loaded and executed by the processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
The embodiment of the invention further provides a computer readable storage medium having a plurality of instructions stored therein; the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a physical machine Server, or a network cloud Server, etc., and needs to install a Windows or Windows Server operating system) to perform some steps of the method according to various embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent change and modification made to the above embodiment according to the technical spirit of the present invention are still within the scope of the technical solution of the present invention.

Claims (9)

1. A time sequence action detection method based on a deep hybrid convolutional neural network is characterized by comprising the following steps:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
step S103: outputting the action category, action start time and action end time of the video to be detected.
2. The time sequence action detection method based on the deep hybrid convolutional neural network of claim 1, wherein the number of channels of the one-dimensional convolutions is set equal to the segment feature dimension C_dim, and the stride is set to 1.
3. The method of claim 1, wherein the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features, wherein:
the second subnet receives the proposal features and constructs all proposals in the video into a proposal graph G = (V(G), E(G)) with N nodes, where node v_i ∈ V(G) and edge e_ij ∈ E(G); each proposal is taken as a node, and the proposal feature obtained from the first subnet is taken as the node feature of the corresponding node;
edges are constructed based on the temporal relationship between nodes; if two proposals overlap, the IoU metric is chosen to measure their degree of overlap: if IoU(p_i, p_j) > θ_iou, an edge is established between proposals p_i and p_j;

d(p_i, p_j) = |c_i - c_j| / U(p_i, p_j)

if there is no overlap between the two proposals, d(p_i, p_j) is used to measure the distance between them, where θ_iou is a given threshold, c_i and c_j respectively represent the center coordinates of proposals p_i and p_j, and U(p_i, p_j) represents the temporal union of the two proposals; an edge is established between nodes when d(p_i, p_j) is greater than the threshold;
that is, an edge is established between two nodes when IoU(p_i, p_j) or d(p_i, p_j) is greater than the threshold θ_iou.
4. The method of claim 3, wherein the GCN model comprises a first GCN model and a second GCN model; the input of the first GCN model is the original proposal features, used for predicting the action category; the input of the second GCN model is extended proposal features, obtained in the first subnet by extending each proposal at both its start and end by half of its duration and then applying the RoI pooling layer, and used for predicting the completeness label and the action boundaries.
5. The method of claim 4, wherein the first GCN model and the second GCN model are each composed of two graph convolution layers, and a graph convolution layer is implemented as:

X^(i) = A X^(i-1) W^(i)

where A is the adjacency matrix computed from the cosine similarities between proposal features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i;
a fully-connected (FC) layer with a softmax operation is applied on top of the first GCN model to predict the action label, and two fully-connected (FC) layers are employed on top of the second GCN model to predict the completeness label and the boundaries.
6. The time sequence action detection method based on the deep hybrid convolutional neural network of claim 1, wherein a multi-task loss function is adopted in the training process of the deep hybrid convolutional neural network; the multi-task loss function combines the action classification loss L_cls, the regression loss L_reg, the completeness loss L_com and an L2 regularization term to train the network; the total loss is defined as

L_total = L_cls + λ1·L_reg + λ2·L_com + λ3·L2

where λ1, λ2 and λ3 are weight coefficients; L_cls is the cross-entropy loss; L_com is a hinge loss that predicts whether a proposal captures a complete action instance.
7. A time sequence action detection apparatus based on a deep hybrid convolutional neural network, the apparatus comprising:
a video acquisition module, configured to acquire a video to be detected;
a detection module, configured to input the video into a trained deep hybrid convolutional neural network;
wherein the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features; and
an output module, configured to output the action category and the action start and end times of the video to be detected.
8. A time sequence action detection system based on a deep hybrid convolutional neural network is characterized by comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored by the memory and loaded and executed by the processor to perform the method for detecting the time sequence action based on the deep hybrid convolutional neural network as claimed in any one of claims 1 to 6.
9. A computer-readable storage medium having a plurality of instructions stored therein, wherein the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network according to any one of claims 1 to 6.
CN202011402943.9A 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network Active CN112613349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011402943.9A CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011402943.9A CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Publications (2)

Publication Number Publication Date
CN112613349A true CN112613349A (en) 2021-04-06
CN112613349B CN112613349B (en) 2023-01-10

Family

ID=75228795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011402943.9A Active CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Country Status (1)

Country Link
CN (1) CN112613349B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113420598A (en) * 2021-05-25 2021-09-21 江苏大学 Time sequence action detection method based on context information and proposed classification decoupling
CN114863356A (en) * 2022-03-10 2022-08-05 西南交通大学 Group activity identification method and system based on residual aggregation graph network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3249610A1 (en) * 2016-05-26 2017-11-29 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110362715A (en) * 2019-06-28 2019-10-22 西安交通大学 A kind of non-editing video actions timing localization method based on figure convolutional network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3249610A1 (en) * 2016-05-26 2017-11-29 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110362715A (en) * 2019-06-28 2019-10-22 西安交通大学 A kind of non-editing video actions timing localization method based on figure convolutional network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIN ZHENG, ET AL: "Fall detection and recognition based on GCN and 2D Pose", 《2019 6TH INTERNATIONAL CONFERENCE ON SYSTEMS AND INFORMATICS》 *
张聪聪等: "基于关键帧的双流卷积网络的人体动作识别方法", 《南京信息工程大学学报(自然科学版)》 *
王倩等: "基于双流卷积神经网络的时序动作定位", 《软件导刊》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113128395B (en) * 2021-04-16 2022-05-20 重庆邮电大学 Video action recognition method and system based on hybrid convolution multistage feature fusion model
CN113420598A (en) * 2021-05-25 2021-09-21 江苏大学 Time sequence action detection method based on context information and proposed classification decoupling
CN114863356A (en) * 2022-03-10 2022-08-05 西南交通大学 Group activity identification method and system based on residual aggregation graph network
CN114863356B (en) * 2022-03-10 2023-02-03 西南交通大学 Group activity identification method and system based on residual aggregation graph network

Also Published As

Publication number Publication date
CN112613349B (en) 2023-01-10

Similar Documents

Publication Publication Date Title
CN109697434B (en) Behavior recognition method and device and storage medium
Kukleva et al. Unsupervised learning of action classes with continuous temporal embedding
CN112613349B (en) Time sequence action detection method and device based on deep hybrid convolutional neural network
Fong et al. Interpretable explanations of black boxes by meaningful perturbation
Du et al. Towards explanation of dnn-based prediction with guided feature inversion
Li et al. Contrast-oriented deep neural networks for salient object detection
US11640714B2 (en) Video panoptic segmentation
Esmaeili et al. Fast-at: Fast automatic thumbnail generation using deep neural networks
CN107169463A (en) Method for detecting human face, device, computer equipment and storage medium
Qi et al. Embedding deep networks into visual explanations
KR102042168B1 (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
Giraldo et al. Graph CNN for moving object detection in complex environments from unseen videos
Vu et al. Energy-based models for video anomaly detection
CN112801063B (en) Neural network system and image crowd counting method based on neural network system
CN111291817A (en) Image recognition method and device, electronic equipment and computer readable medium
Roy et al. Foreground segmentation using adaptive 3 phase background model
CN112084812A (en) Image processing method, image processing device, computer equipment and storage medium
Lu et al. Learning the relation between interested objects and aesthetic region for image cropping
CN115147890A (en) System, method and storage medium for creating image data embedding for image recognition
Mseddi et al. Real-time scene background initialization based on spatio-temporal neighborhood exploration
Xiao et al. Self-explanatory deep salient object detection
CN112597997A (en) Region-of-interest determining method, image content identifying method and device
Sellars et al. Two cycle learning: clustering based regularisation for deep semi-supervised classification
CN110956157A (en) Deep learning remote sensing image target detection method and device based on candidate frame selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant