CN112613349B - Time sequence action detection method and device based on deep hybrid convolutional neural network - Google Patents

Time sequence action detection method and device based on deep hybrid convolutional neural network

Info

Publication number
CN112613349B
CN112613349B (application CN202011402943.9A)
Authority
CN
China
Prior art keywords
subnet
action
neural network
convolutional neural
proposal
Prior art date
Legal status
Active
Application number
CN202011402943.9A
Other languages
Chinese (zh)
Other versions
CN112613349A (en)
Inventor
甘明刚
张琰
刘洁玺
陈杰
窦丽华
陈文颉
陈晨
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN202011402943.9A
Publication of CN112613349A
Application granted
Publication of CN112613349B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a time sequence action detection method and device based on a deep hybrid convolutional neural network. The method comprises: acquiring a video to be detected; and inputting the video into a trained deep hybrid convolutional neural network, where the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet. The feature coding module extracts segment features from the raw video data through a two-stream network; the first subnet obtains a set of proposed features based on the segment features extracted by the feature coding module; the second subnet receives the proposed features, constructs a graph based on the relationships among the proposals, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features. The action category, action start time and action end time of the video to be detected are then output. The scheme of the invention effectively exploits the relationships among proposals while modeling the temporal structure, improves the accuracy of time sequence action detection, and effectively addresses the time sequence action detection problem.

Description

Time sequence action detection method and device based on deep hybrid convolutional neural network
Technical Field
The invention relates to the field of video recognition, and in particular to a time sequence action detection method and device based on a deep hybrid convolutional neural network.
Background
Temporal action detection is a basic and challenging task in understanding human behavior, mainly used to segment and classify long untrimmed videos. Given a long untrimmed video, the algorithm needs to detect the action segments it contains; the detection result includes a start time, an end time and an action category. A video may contain one or more action segments of the same or different classes.
Temporal action detection is generally divided into two stages: temporal proposal generation and proposal classification. Temporal proposal generation produces candidate segments, analogous to candidate boxes in object detection, i.e. a series of start and end times; proposal classification classifies the action within each generated proposal interval and determines the action category. For proposal classification, most methods treat it as an action recognition task and directly adopt methods from action recognition. However, proposals have a more complex temporal structure than trimmed videos, and there are typically semantic relationships between proposals. Existing methods ignore these differences, which limits their performance.
Directly adopting action recognition methods for proposal classification, as in the prior art, has the following problems:
(1) In action recognition there is only one action per clip, while a proposal may contain multiple actions.
(2) In action recognition a complete action is included, whereas a proposal typically contains only part of an action.
(3) In action recognition the clips have no semantic relationship with one another, whereas multiple proposals are generated from the same video or even from the same action label and are therefore semantically related.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a time sequence action detection method and device based on a deep hybrid convolutional neural network, addressing the following problems in the prior art: (1) the action recognition methods adopted for proposal classification cannot model the complex temporal structure between proposals; their receptive field is small, so only short-term temporal relationships can be captured and high-quality proposed features cannot be obtained; (2) a proposal rarely contains the entire action and lacks sufficient information to generate accurate temporal boundaries, so information needs to be obtained from other proposals.
According to a first aspect of the present invention, there is provided a time series action detection method based on a deep hybrid convolutional neural network, the method comprising the steps of:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment characteristics and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time series convolution block, and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationships between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features;
step S103: and outputting the action type, the action starting time and the action ending time of the video to be detected.
According to a second aspect of the present invention, there is provided a time-series action detection apparatus based on a deep hybrid convolutional neural network, the apparatus comprising:
a video acquisition module configured to acquire a video to be detected;
a detection module configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time sequence convolution block and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationships between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features;
an output module configured to output the action category and the action start and end times of the video to be detected.
According to a third aspect of the present invention, there is provided a time-series action detection system based on a deep hybrid convolutional neural network, comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored in the memory and are loaded and executed by the processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having a plurality of instructions stored therein; the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
According to the scheme of the invention, the relationships among proposals can be effectively utilized while the temporal structure is modeled, the accuracy of time sequence action detection is improved, and the time sequence action detection problem is effectively addressed.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flowchart of a method for detecting a time sequence action based on a deep hybrid convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep hybrid convolutional neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fusion RGB and Flow model for prediction according to an embodiment of the present invention;
FIG. 4 is a graph illustrating the results of training tests on a test data set, in accordance with one embodiment of the present invention;
FIG. 5 is a graph illustrating the results of a training test on a test data set, in accordance with yet another embodiment of the present invention;
fig. 6 is a block diagram of a time-series operation detection apparatus based on a deep hybrid convolutional neural network according to an embodiment of the present invention.
Detailed Description
First, a flow of a method for detecting a time-series operation based on a deep hybrid convolutional neural network according to an embodiment of the present invention will be described with reference to fig. 1. As shown in fig. 1, the method comprises the steps of:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment characteristics and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time series convolution block, and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationships among the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features;
step S103: and outputting the action type, the action starting time and the action ending time of the video to be detected.
In this embodiment, a two-stream network (e.g., an I3D network) is used for feature coding. Each input video is first divided into a set of segments, which are then fed into the two-stream network for feature extraction.
The feature coding module extracts segment features from the raw video data through the two-stream network as follows:
For a given untrimmed video V, a temporal sliding window with a length of n_w frames and a stride of σ is used to divide V into a group of video segments S = {s_h}_{h=1..N}, which serve as the input to the feature coding module, where s_h is a segmented video segment and N is the total number of segments in V. The feature coding module outputs the feature sequence F = {f_h}_{h=1..N}, where f_h is the feature corresponding to s_h, which serves as the input to the first subnet.
Thus, for a given untrimmed video, a feature sequence F = {f_h}_{h=1..N} is obtained as the input to the first subnet.
Further, the segment features can also be concatenated into a video feature sequence, which is then fed into the first subnet to obtain high-quality proposed features.
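As a concrete illustration, the windowing and feature stacking described above can be sketched in Python as follows; the encode_segment callable is a hypothetical stand-in for the two-stream (RGB + optical flow) encoder and is not part of the patent text.

    import numpy as np

    def make_segments(num_frames, n_w, sigma):
        # (start, end) frame indices of a temporal sliding window of length n_w and stride sigma
        starts = range(0, max(num_frames - n_w + 1, 1), sigma)
        return [(s, min(s + n_w, num_frames)) for s in starts]

    def encode_video(frames, n_w, sigma, encode_segment):
        # Stack the per-segment features into the sequence F = {f_h}, shape (N, C_dim)
        segments = make_segments(len(frames), n_w, sigma)
        feats = [encode_segment(frames[s:e]) for s, e in segments]
        return np.stack(feats, axis=0)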
The structure of the deep hybrid convolutional neural network of the present embodiment is shown in fig. 2.
The first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module. The first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time sequence convolution block and generates a set of proposed features through pooling, wherein:
The convolution layers in the first subnet use large convolution kernels, with a kernel size of at least 9; the first subnet models complex temporal structure by stacking one-dimensional temporal convolution layers with large kernels.
To model the complex temporal structure of proposals and capture long-term temporal correlations, this embodiment employs a first subnet consisting of a one-dimensional time sequence convolution block and an RoI pooling layer to obtain high-quality proposed features based on the video segment features extracted by the feature coding module.
The one-dimensional time sequence convolution block includes two one-dimensional convolution layers. To avoid losing segment information, the number of channels of the one-dimensional convolutions is set to the same number C_dim as the segment feature dimension, and the stride is set to 1.
The hidden features of the last layer of the first subnet are added to the incoming segment features, so that the resulting features contain multi-scale temporal information.
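A minimal PyTorch sketch of this first subnet is given below for illustration; it is not the patent's exact implementation, and the kernel size of 9, the ReLU activations, and the pooled output length of 4 are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FirstSubnet(nn.Module):
        # Two 1-D temporal convolutions with a large kernel, channel count kept at C_dim,
        # stride 1, followed by a residual add of the input segment features.
        def __init__(self, c_dim, kernel_size=9):
            super().__init__()
            pad = kernel_size // 2
            self.conv1 = nn.Conv1d(c_dim, c_dim, kernel_size, stride=1, padding=pad)
            self.conv2 = nn.Conv1d(c_dim, c_dim, kernel_size, stride=1, padding=pad)

        def forward(self, feats):              # feats: (batch, C_dim, N) segment features
            x = F.relu(self.conv1(feats))
            x = F.relu(self.conv2(x))
            return x + feats                   # residual: keep multi-scale temporal information

    def roi_pool_1d(seq, proposals, out_len=4):
        # seq: (C_dim, N); proposals: list of (start, end) segment indices.
        # Each proposal span is adaptively max-pooled to a fixed length and flattened.
        pooled = []
        for s, e in proposals:
            span = seq[:, s:max(e, s + 1)]
            pooled.append(F.adaptive_max_pool1d(span.unsqueeze(0), out_len).flatten())
        return torch.stack(pooled)             # (num_proposals, C_dim * out_len)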
further, the proposed features generated by the first subnet are sent to a second subnet, a graph network is constructed, and the long-term time sequence dependency relationship is captured through the graph convolution network. The second subnet is used to mine the relationships between the offers and expand the acceptance area of the offer features.
The second subnet receives the proposed features, constructs a graph based on the relationships between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features, wherein:
The second subnet receives the proposed features and constructs all proposals in the video into a proposal graph G = <V(G), E(G)>, a graph comprising N nodes, where v_i ∈ V(G) is a node and e_ij ∈ E(G) is an edge. Each proposal is taken as a node, and the proposed feature obtained from the first subnet is taken as the node feature of the corresponding node.
Edges are constructed according to the temporal relationships between nodes. If two proposals overlap, the IoU metric is selected to measure their degree of overlap; if IoU(p_i, p_j) > θ_iou, an edge is established between proposals p_i and p_j, where
IoU(p_i, p_j) = |p_i ∩ p_j| / |p_i ∪ p_j|
If there is no overlap between the two proposals, the distance d(p_i, p_j) = |c_i - c_j| / |U(p_i, p_j)| is used to measure the distance between them, where θ_iou is a preset threshold, c_i and c_j are the center coordinates of proposals p_i and p_j respectively, and U(p_i, p_j) is the union of the two proposals (the minimal temporal span covering both).
An edge is established between two nodes when IoU(p_i, p_j) or d(p_i, p_j) is greater than the threshold θ_iou.
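A small Python sketch of this graph construction follows, using the criterion as stated in the text (an edge when IoU or the relative distance exceeds the threshold θ_iou); the added self-loops and the threshold value of 0.7 are illustrative assumptions.

    import numpy as np

    def temporal_iou(p, q):
        # IoU of two temporal intervals p = (start, end), q = (start, end)
        inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
        union = max(p[1], q[1]) - min(p[0], q[0])
        return inter / union if union > 0 else 0.0

    def build_proposal_graph(proposals, theta_iou=0.7):
        # proposals: list of (start, end) times; returns a binary adjacency matrix
        n = len(proposals)
        adj = np.eye(n)                                    # self-loops (assumption)
        for i in range(n):
            for j in range(i + 1, n):
                p, q = proposals[i], proposals[j]
                iou = temporal_iou(p, q)
                if iou > 0:                                # overlapping: compare IoU
                    connect = iou > theta_iou
                else:                                      # disjoint: compare relative center distance
                    c_i, c_j = 0.5 * (p[0] + p[1]), 0.5 * (q[0] + q[1])
                    u = max(p[1], q[1]) - min(p[0], q[0])  # span of the union U(p_i, p_j)
                    connect = abs(c_i - c_j) / u > theta_iou
                if connect:
                    adj[i, j] = adj[j, i] = 1.0
        return adj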
The constructed graph is input into the GCN model, i.e. a GCN (graph convolutional network) model is applied on the graph so that each node can aggregate information from its neighborhood.
In this embodiment, the GCN model is used to classify the proposals and comprises a first GCN model and a second GCN model; the first GCN model operates on the graph G built over the proposed features, and its input is the original proposed features, which are used to predict action categories. The input of the second GCN model is extended proposed features: in the first subnet, the start and the end of each proposal are each extended by half the proposal duration, and the RoI pooling layer is then applied to obtain the extended proposed features, which are used to predict completeness labels and action boundaries.
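For illustration, the proposal extension feeding the second GCN model can be written as the small helper below (a sketch; clamping to the video bounds is omitted).

    def extend_proposal(start, end):
        # Extend both the start and the end of a proposal by half its duration
        half = 0.5 * (end - start)
        return start - half, end + half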
Both the first GCN model and the second GCN model consist of two graph convolution layers, implemented as:
X^(i) = A X^(i-1) W^(i)
where A is the adjacency matrix computed from the cosine similarities between proposed features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i.
A fully-connected (FC) layer with a softmax operation is applied on top of the first GCN model to predict action labels, and two fully-connected (FC) layers are employed on top of the second GCN model to predict integrity labels and boundaries.
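A compact PyTorch sketch of the second subnet follows; the hidden width, the ReLU activations, the softmax row-normalisation of A, and the masking of A by the proposal graph are assumptions made for illustration rather than details taken from the patent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def cosine_adjacency(x, graph_mask=None):
        # A from cosine similarities between proposal features, optionally restricted to the
        # edges of the proposal graph (mask expected to include self-loops), then row-normalised.
        x_norm = F.normalize(x, dim=1)
        sim = x_norm @ x_norm.t()
        if graph_mask is not None:
            sim = sim.masked_fill(graph_mask == 0, float('-inf'))
        return torch.softmax(sim, dim=1)

    class TwoLayerGCN(nn.Module):
        # Two graph convolution layers: X^(i) = A X^(i-1) W^(i)
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
            self.w2 = nn.Linear(hid_dim, hid_dim, bias=False)

        def forward(self, x, adj):
            x = F.relu(adj @ self.w1(x))
            return F.relu(adj @ self.w2(x))

    class SecondSubnet(nn.Module):
        # GCN 1 (original proposal features) -> action categories;
        # GCN 2 (extended proposal features) -> completeness scores and boundary offsets.
        def __init__(self, in_dim, ext_dim, hid_dim, num_classes):
            super().__init__()
            self.gcn1 = TwoLayerGCN(in_dim, hid_dim)
            self.gcn2 = TwoLayerGCN(ext_dim, hid_dim)
            self.fc_cls = nn.Linear(hid_dim, num_classes + 1)   # n + 1 category scores (incl. background)
            self.fc_com = nn.Linear(hid_dim, num_classes)       # n completeness scores
            self.fc_reg = nn.Linear(hid_dim, 2 * num_classes)   # (t_c, t_l) offsets per class

        def forward(self, feats, ext_feats, graph_mask):
            h1 = self.gcn1(feats, cosine_adjacency(feats, graph_mask))
            h2 = self.gcn2(ext_feats, cosine_adjacency(ext_feats, graph_mask))
            return self.fc_cls(h1), self.fc_com(h2), self.fc_reg(h2)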
Further, in this embodiment, a completeness classifier is added to predict whether a proposal contains a complete action instance and to filter out proposals that contain only part of an action instance. The completeness classifier comprises n binary classifiers, each corresponding to one action class. For each action class k, the corresponding completeness classifier C_k produces a probability value representing the probability that the proposal captures a complete action instance of class k. By considering both the category score and the completeness score of a proposal, very good performance is obtained in temporal action detection.
To predict precise boundaries of action instances, this embodiment further learns the center-coordinate offset and the length offset between the real action instance and the proposal. The regression offsets are computed as:
t_c = (c_p,i - c_gt,i) / l_p,i,   t_l = log(l_gt,i / l_p,i)
where t_c is the center-coordinate offset, c_p,i is the proposal center coordinate, c_gt,i is the center coordinate of the label, and l_p,i is the proposal length; t_l is the length offset, and l_gt,i is the length of the label.
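The encoding, and the corresponding decoding used when applying predicted offsets back to a proposal, can be sketched as follows; the decode direction is inferred from the stated encoding rather than quoted from the text.

    import math

    def encode_offsets(c_p, l_p, c_gt, l_gt):
        # Regression targets between a proposal (c_p, l_p) and its ground truth (c_gt, l_gt)
        t_c = (c_p - c_gt) / l_p       # center-coordinate offset
        t_l = math.log(l_gt / l_p)     # length offset
        return t_c, t_l

    def decode_offsets(c_p, l_p, t_c, t_l):
        # Invert the encoding to recover the predicted center, length and boundaries
        c = c_p - t_c * l_p
        l = l_p * math.exp(t_l)
        return c - 0.5 * l, c + 0.5 * l   # predicted start and end times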
Further, the training process of the deep hybrid convolutional neural network comprises:
First, taking a sample video as an example, an untrimmed video can be represented as V = {X_t}_{t=1..T}, where T is the number of video frames and X_t is the t-th frame, an image of height H and width W. The annotation information of video V consists of a set of action instances Φ = {φ_n = (c_gt,n, l_gt,n, y_gt,n)}_{n=1..N_g}, where N_g is the number of real action instances in video V, c_gt,n and l_gt,n are the center coordinate and length of action instance φ_n respectively, and y_gt,n is the category label of action instance φ_n.
Given a set of proposals, three types of proposal samples are collected by evaluating their intersection-over-union (IoU) with the annotations: (1) positive proposals whose IoU with the closest annotated instance is greater than 0.7; (2) background proposals that meet the following criteria: their IoU with the closest annotated instance is below 0.01, and their span is greater than 0.01 of the video length; (3) incomplete proposals that meet the following criteria: the ratio of their span lying within the annotated instance is greater than 0.01, while their IoU with that instance is less than 0.3. During training, each mini-batch is guaranteed to contain all three types of proposals: positive and background proposals are used to train the action classifier; positive and incomplete proposals are applied to the completeness classifier; only positive proposals are used to train the regression fully-connected layer.
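A sketch of this sampling rule follows; the exact definition of the span ratio is not fully specified in the text, so the reading used here is an assumption.

    def proposal_type(iou_best, span_ratio_in_gt, span_frac_of_video):
        # iou_best           : IoU with the closest annotated instance
        # span_ratio_in_gt   : fraction of the proposal's span lying inside that instance (assumed reading)
        # span_frac_of_video : proposal length as a fraction of the whole video
        if iou_best > 0.7:
            return "positive"
        if iou_best < 0.01 and span_frac_of_video > 0.01:
            return "background"
        if span_ratio_in_gt > 0.01 and iou_best < 0.3:
            return "incomplete"
        return "ignored"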
In this embodiment, a multi-task loss function is adopted during training, which combines the action classification loss L_cls, the regression loss L_reg, the completeness loss L_com, and an L2 regularization term to train the network. The total loss is defined as
L_total = L_cls + λ_1 L_reg + λ_2 L_com + λ_3 L_2
where λ_1, λ_2 and λ_3 are weight coefficients; L_cls is a cross-entropy loss; L_com is a hinge loss that predicts whether a proposal captures a complete action instance.
The regression loss L_reg is the smooth L1 loss between each positive proposal and its closest real action instance, computed as:
L_reg = (1/N) Σ_i [ SL_1(t*_i,c - t_i,c) + SL_1(t*_i,l - t_i,l) ]
where SL_1 is the smooth L1 loss and N is the number of nodes; t_i,c is the center-coordinate offset between the proposal and the label, and t*_i,c is the predicted value of the center-coordinate offset; t_i,l is the length offset between the proposal and the label, and t*_i,l is the predicted value of the length offset.
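A PyTorch sketch of the multi-task loss is given below; the hinge formulation with ±1 completeness labels and the example λ values are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def total_loss(cls_logits, cls_labels,          # action classification (all sampled proposals)
                   com_scores, com_labels,          # completeness, labels in {+1, -1}
                   reg_pred, reg_target, pos_mask,  # (t_c, t_l) regression, positive proposals only
                   params, lambdas=(1.0, 0.1, 1e-4)):
        lam1, lam2, lam3 = lambdas
        l_cls = F.cross_entropy(cls_logits, cls_labels)
        l_com = F.relu(1.0 - com_labels * com_scores).mean()        # hinge loss
        if pos_mask.any():
            l_reg = F.smooth_l1_loss(reg_pred[pos_mask], reg_target[pos_mask])
        else:
            l_reg = reg_pred.sum() * 0.0                            # no positives in this batch
        l_2 = sum((p ** 2).sum() for p in params)                   # L2 regularization term
        return l_cls + lam1 * l_reg + lam2 * l_com + lam3 * l_2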
Optionally, in this embodiment, an iterative scheme is further adopted to perform boundary regression and classification, where the output boundaries are fed back to the network as input for the next refinement. As shown in fig. 3, the predictions of the RGB and Flow models are fused to obtain the final prediction, which is then fed back to the RGB and Flow models respectively for the next refinement. After k_i iterations, the final boundaries and classification scores are obtained.
At each iteration, the DHCNet outputs n pairs of temporal boundary offset values, n completeness scores, and n+1 category scores. The final confidence score is obtained by multiplying the category score and the completeness score for each of the n non-background categories; that is, for each proposal, the final score for the k-th category is calculated as:
score_k = s_k^cls · s_k^com
where s_k^cls is the category score and s_k^com is the completeness score.
For each proposal p_i, the category with the highest confidence is taken as the prediction, and the corresponding regression offset values are obtained.
Since the predictions of the RGB and Flow models have different confidence levels, a weighted average is used to fuse the RGB and Flow streams. Specifically, the proposed network is trained separately on the two streams, and the predictions of the RGB and Flow streams are then fused with a weight ratio of 1:η:
score_k = (score_k^RGB + η·score_k^Flow) / (1 + η)
t_c = (t_c^RGB + η·t_c^Flow) / (1 + η)
t_l = (t_l^RGB + η·t_l^Flow) / (1 + η)
where score_k^RGB and score_k^Flow are the final prediction scores of the RGB and Flow models for the k-th class, t_c^RGB and t_c^Flow are the center-coordinate offsets predicted by the RGB and Flow models respectively, and t_l^RGB and t_l^Flow are the length offsets predicted by the RGB and Flow models respectively. For each proposal, the final confidence score and the regressed boundary are used for evaluation, and non-maximum suppression (NMS) is used to reduce redundant results.
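The fusion and the temporal NMS step can be sketched as follows; the value of η and the NMS IoU threshold are illustrative assumptions.

    import numpy as np

    def fuse_two_streams(rgb, flow, eta=1.5):
        # Weighted average of RGB and Flow predictions at a 1:eta ratio
        return {k: (rgb[k] + eta * flow[k]) / (1.0 + eta) for k in rgb}

    def temporal_nms(segments, scores, iou_thr=0.5):
        # segments: (M, 2) array of (start, end); returns indices kept after greedy temporal NMS
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            rest = order[1:]
            inter = np.maximum(0.0, np.minimum(segments[i, 1], segments[rest, 1])
                                    - np.maximum(segments[i, 0], segments[rest, 0]))
            union = np.maximum((segments[i, 1] - segments[i, 0])
                               + (segments[rest, 1] - segments[rest, 0]) - inter, 1e-8)
            iou = inter / union
            order = rest[iou <= iou_thr]
        return keep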
The detection results of this embodiment on two public datasets are shown in figs. 4-5; the method of this embodiment achieves better detection performance, verifying its effectiveness.
An embodiment of the present invention further provides a time sequence action detection apparatus based on a deep hybrid convolutional neural network, as shown in fig. 6, the apparatus includes:
a video acquisition module configured to acquire a video to be detected;
a detection module configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time sequence convolution block and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationships between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features;
an output module configured to output the action category and the action start and end times of the video to be detected.
The embodiment of the invention further provides a time sequence action detection system based on the deep hybrid convolutional neural network, which comprises the following steps:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored in the memory and are loaded and executed by the processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
The embodiment of the invention further provides a computer readable storage medium having a plurality of instructions stored therein; the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions to enable a computer device (which may be a personal computer, a physical machine Server, or a network cloud Server, etc., and needs to install a Windows or Windows Server operating system) to perform some steps of the method according to various embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent change and modification made to the above embodiment according to the technical spirit of the present invention are still within the scope of the technical solution of the present invention.

Claims (6)

1. A time sequence action detection method based on a deep hybrid convolutional neural network is characterized by comprising the following steps:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment characteristics and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time series convolution block, and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationship between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features, specifically:
the second subnet receives the proposed features and constructs all proposals in the video into a proposal graph G = <V(G), E(G)>, a graph comprising N nodes, where v_i ∈ V(G) is a node and e_ij ∈ E(G) is an edge; each proposal is taken as a node, and the proposed feature obtained from the first subnet is taken as the node feature of the corresponding node;
edges are constructed according to the temporal relationships between nodes; if two proposals overlap, the IoU metric is selected to measure the degree of overlap between the proposals, and if IoU(p_i, p_j) > θ_iou, an edge is established between proposals p_i and p_j, where
IoU(p_i, p_j) = |p_i ∩ p_j| / |p_i ∪ p_j|
if there is no overlap between the two proposals, the distance d(p_i, p_j) = |c_i - c_j| / |U(p_i, p_j)| is used to measure the distance between the proposals, where θ_iou is a preset threshold, c_i and c_j are the center coordinates of proposals p_i and p_j respectively, U(p_i, p_j) is the union of the two proposals, and an edge is established between nodes for which d(p_i, p_j) is greater than the threshold;
an edge is established between nodes for which IoU(p_i, p_j) or d(p_i, p_j), as applicable, is greater than the threshold θ_iou;
the GCN model comprises a first GCN model and a second GCN model, wherein the input of the first GCN model is an original proposed characteristic used for predicting action categories; the second GCN model input is an extended proposal feature that extends both the start and end of the proposal feature by half the proposal duration in the first subnet, and then applies the ROI pooling layer to obtain an extended proposal feature for predicting integrity tags and action boundaries;
the first GCN model and the second GCN model each consist of two graph convolution layers, implemented as:
X^(i) = A X^(i-1) W^(i)
where A is the adjacency matrix computed from cosine similarities between proposed features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i;
applying a fully-connected (FC) layer with softmax operations over a first GCN model to predict action tags, and employing two fully-connected (FC) layers over a second GCN model to predict integrity tags and boundaries;
step S103: and outputting the action type, the action starting time and the action ending time of the video to be detected.
2. The time sequence action detection method based on the deep hybrid convolutional neural network of claim 1, wherein the number of channels of the one-dimensional convolutions is set to the same number C_dim as the feature dimension of the segments, and the stride is set to 1.
3. The time sequence action detection method based on the deep hybrid convolutional neural network of claim 1, wherein a multi-task loss function is adopted in training the deep hybrid convolutional neural network, the multi-task loss function combining the action classification loss L_cls, the regression loss L_reg, the completeness loss L_com, and an L2 regularization term to train the network; the total loss is defined as
L_total = L_cls + λ_1 L_reg + λ_2 L_com + λ_3 L_2
where λ_1, λ_2 and λ_3 are weight coefficients; L_cls is a cross-entropy loss; L_com is a hinge loss that predicts whether a proposal captures a complete action instance.
4. A time series motion detection apparatus based on a deep hybrid convolutional neural network, the apparatus comprising:
a video acquisition module configured to acquire a video to be detected;
a detection module configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment characteristics and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time series convolution block, and a set of proposed features are generated through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationship between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features, specifically:
the second subnet receives the proposed features and constructs all proposals in the video into a proposal graph G = <V(G), E(G)>, a graph comprising N nodes, where v_i ∈ V(G) is a node and e_ij ∈ E(G) is an edge; each proposal is taken as a node, and the proposed feature obtained from the first subnet is taken as the node feature of the corresponding node;
edges are constructed according to the temporal relationships between nodes; if two proposals overlap, the IoU metric is selected to measure the degree of overlap between the proposals, and if IoU(p_i, p_j) > θ_iou, an edge is established between proposals p_i and p_j, where
IoU(p_i, p_j) = |p_i ∩ p_j| / |p_i ∪ p_j|
if there is no overlap between the two proposals, the distance d(p_i, p_j) = |c_i - c_j| / |U(p_i, p_j)| is used to measure the distance between the proposals, where θ_iou is a preset threshold, c_i and c_j are the center coordinates of proposals p_i and p_j respectively, U(p_i, p_j) is the union of the two proposals, and an edge is established between nodes for which d(p_i, p_j) is greater than the threshold;
an edge is established between nodes for which IoU(p_i, p_j) or d(p_i, p_j), as applicable, is greater than the threshold θ_iou;
the GCN model comprises a first GCN model and a second GCN model, wherein the input of the first GCN model is an original proposed characteristic used for predicting action categories; the second GCN model input is an extended proposal feature that extends both the start and end of the proposal feature by half the proposal duration in the first subnet, and then applies the ROI pooling layer to obtain an extended proposal feature for predicting integrity tags and action boundaries;
the first GCN model and the second GCN model each consist of two graph convolution layers, implemented as:
X^(i) = A X^(i-1) W^(i)
where A is the adjacency matrix computed from cosine similarities between proposed features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i;
applying a fully-connected (FC) layer with softmax operations over a first GCN model to predict action tags, and employing two fully-connected (FC) layers over a second GCN model to predict integrity tags and boundaries;
an output module configured to output the action category and the action start and end times of the video to be detected.
5. A time sequence action detection system based on a deep hybrid convolutional neural network is characterized by comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored by the memory and loaded and executed by the processor to perform the method for detecting the time sequence action based on the deep hybrid convolutional neural network as claimed in any one of claims 1 to 3.
6. A computer-readable storage medium having a plurality of instructions stored therein, the plurality of instructions being loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network as claimed in any one of claims 1 to 3.
CN202011402943.9A 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network Active CN112613349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011402943.9A CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011402943.9A CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Publications (2)

Publication Number Publication Date
CN112613349A CN112613349A (en) 2021-04-06
CN112613349B true CN112613349B (en) 2023-01-10

Family

ID=75228795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011402943.9A Active CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Country Status (1)

Country Link
CN (1) CN112613349B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128395B (en) * 2021-04-16 2022-05-20 重庆邮电大学 Video action recognition method and system based on hybrid convolution multistage feature fusion model
CN113420598B (en) * 2021-05-25 2024-05-14 江苏大学 Time sequence action detection method based on decoupling of context information and proposal classification
CN114863356B (en) * 2022-03-10 2023-02-03 西南交通大学 Group activity identification method and system based on residual aggregation graph network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3249610A1 (en) * 2016-05-26 2017-11-29 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110362715B (en) * 2019-06-28 2021-11-19 西安交通大学 Non-clipped video action time sequence positioning method based on graph convolution network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3249610A1 (en) * 2016-05-26 2017-11-29 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fall detection and recognition based on GCN and 2D Pose; Yin Zheng, et al; 2019 6th International Conference on Systems and Informatics; 2020-02-27; full text *

Also Published As

Publication number Publication date
CN112613349A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN109697434B (en) Behavior recognition method and device and storage medium
CN112613349B (en) Time sequence action detection method and device based on deep hybrid convolutional neural network
Kukleva et al. Unsupervised learning of action classes with continuous temporal embedding
Fong et al. Interpretable explanations of black boxes by meaningful perturbation
Liang et al. Interpretable structure-evolving LSTM
Du et al. Towards explanation of dnn-based prediction with guided feature inversion
US11640714B2 (en) Video panoptic segmentation
Esmaeili et al. Fast-at: Fast automatic thumbnail generation using deep neural networks
CN107169463A (en) Method for detecting human face, device, computer equipment and storage medium
KR102042168B1 (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
CN109063626B (en) Dynamic face recognition method and device
Giraldo et al. Graph CNN for moving object detection in complex environments from unseen videos
Vu et al. Energy-based models for video anomaly detection
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
CN112801063B (en) Neural network system and image crowd counting method based on neural network system
Roy et al. Foreground segmentation using adaptive 3 phase background model
CN112131944B (en) Video behavior recognition method and system
CN112084812A (en) Image processing method, image processing device, computer equipment and storage medium
CN114782997A (en) Pedestrian re-identification method and system based on multi-loss attention adaptive network
Lu et al. Learning the relation between interested objects and aesthetic region for image cropping
Nemade et al. Image segmentation using convolutional neural network for image annotation
CN110956157A (en) Deep learning remote sensing image target detection method and device based on candidate frame selection
CN115147890A (en) System, method and storage medium for creating image data embedding for image recognition
Xiao et al. Self-explanatory deep salient object detection
Baraka et al. Weakly-supervised temporal action localization: a survey

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant