CN112613349B - Time sequence action detection method and device based on deep hybrid convolutional neural network - Google Patents

Time sequence action detection method and device based on deep hybrid convolutional neural network

Info

Publication number
CN112613349B
CN112613349B (application CN202011402943.9A)
Authority
CN
China
Prior art keywords
subnet
action
neural network
convolutional neural
proposal
Prior art date
Legal status
Active
Application number
CN202011402943.9A
Other languages
Chinese (zh)
Other versions
CN112613349A (en)
Inventor
甘明刚
张琰
刘洁玺
陈杰
窦丽华
陈文颉
陈晨
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN202011402943.9A
Publication of CN112613349A
Application granted
Publication of CN112613349B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a time sequence action detection method and device based on a deep hybrid convolutional neural network. The method comprises: acquiring a video to be detected; and inputting the video into a trained deep hybrid convolutional neural network, where the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet. The feature coding module extracts segment features from the raw video data through a two-stream network; the first subnet obtains a set of proposed features based on the segment features extracted by the feature coding module; the second subnet receives the proposed features, constructs a graph based on the relationships among the proposals, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features. The action category, action start time and action end time of the video to be detected are then output. The scheme of the invention effectively exploits the relationships among proposals while modeling the temporal structure, improves the accuracy of time sequence action detection, and effectively addresses the time sequence action detection problem.

Description

Time sequence action detection method and device based on deep hybrid convolutional neural network
Technical Field
The invention relates to the field of video recognition, and in particular to a time sequence action detection method and device based on a deep hybrid convolutional neural network.
Background
Temporal action detection is a basic and challenging task in understanding human behavior, mainly used to segment and classify long untrimmed videos. Given a long untrimmed video, the algorithm needs to detect the action segments it contains; the detection result includes a start time, an end time and an action category. A video may contain one or more action segments of the same or different classes.
Temporal action detection is generally divided into two stages: temporal proposal generation and proposal classification. Temporal proposal generation produces candidate segments, analogous to candidate boxes in object detection, i.e. a series of start and end times; proposal classification classifies the action within each generated proposal interval and determines the action category. For proposal classification, most methods treat it as an action recognition task and directly adopt methods from action recognition. However, proposals have a more complex temporal structure than trimmed videos, and there are typically semantic relationships between proposals. Existing methods ignore these differences, which limits their performance.
Directly adopting action recognition methods for proposal classification, as in the prior art, has the following problems:
(1) In action recognition there is only one action per clip, while a proposal may contain multiple actions.
(2) In action recognition a complete action is included, whereas a proposal typically contains only part of an action.
(3) In action recognition the clips have no semantic relationship with one another, whereas multiple proposals are generated from the same video or even from the same action label and are therefore semantically related.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a time sequence action detection method and device based on a deep hybrid convolutional neural network, addressing the following problems in the prior art: (1) the action recognition methods adopted for proposal classification cannot model the complex temporal structure between proposals; their receptive field is small, so only short-term temporal relationships can be captured and high-quality proposed features cannot be obtained; (2) a proposal rarely contains the entire action and lacks sufficient information to generate accurate temporal boundaries, so information needs to be obtained from other proposals.
According to a first aspect of the present invention, there is provided a time series action detection method based on a deep hybrid convolutional neural network, the method comprising the steps of:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment characteristics and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time series convolution block, and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationships between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features;
step S103: and outputting the action type, the action starting time and the action ending time of the video to be detected.
According to a second aspect of the present invention, there is provided a time-series action detection apparatus based on a deep hybrid convolutional neural network, the apparatus comprising:
a video acquisition module configured to acquire a video to be detected;
a detection module configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time sequence convolution block and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationships between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features;
an output module configured to output the action category and the action start and end times of the video to be detected.
According to a third aspect of the present invention, there is provided a time-series action detection system based on a deep hybrid convolutional neural network, comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored in the memory and are loaded and executed by the processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having a plurality of instructions stored therein; the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
According to the scheme of the invention, the relationships among proposals can be effectively utilized while the temporal structure is modeled, the accuracy of time sequence action detection is improved, and the time sequence action detection problem is effectively addressed.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flowchart of a method for detecting a time sequence action based on a deep hybrid convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep hybrid convolutional neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fusion RGB and Flow model for prediction according to an embodiment of the present invention;
FIG. 4 is a graph illustrating the results of training tests on a test data set, in accordance with one embodiment of the present invention;
FIG. 5 is a graph illustrating the results of a training test on a test data set, in accordance with yet another embodiment of the present invention;
fig. 6 is a block diagram of a time-series operation detection apparatus based on a deep hybrid convolutional neural network according to an embodiment of the present invention.
Detailed Description
First, a flow of a method for detecting a time-series operation based on a deep hybrid convolutional neural network according to an embodiment of the present invention will be described with reference to fig. 1. As shown in fig. 1, the method comprises the steps of:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment characteristics and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time series convolution block, and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationships among the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features;
step S103: and outputting the action type, the action starting time and the action ending time of the video to be detected.
In this embodiment, a two-stream network (e.g., an I3D network) is used for feature coding. Each input video is first divided into a set of segments, which are then fed into the two-stream network for feature extraction.
The feature coding module extracts segment features from the raw video data through the two-stream network as follows:
For a given untrimmed video V, a temporal sliding window with a length of n_w frames and a stride of σ is used to divide V into a group of video segments S = {s_h}_{h=1..N}, which serve as the input to the feature coding module, where s_h is a segmented video segment and N is the total number of segments in V. The feature coding module outputs the feature sequence F = {f_h}_{h=1..N}, where f_h is the feature corresponding to s_h, which serves as the input to the first subnet.
Thus, for a given untrimmed video, a feature sequence F = {f_h}_{h=1..N} is obtained as the input to the first subnet.
Further, the segment features can also be concatenated into a video feature sequence, which is then fed into the first subnet to obtain high-quality proposed features.
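As a concrete illustration, the windowing and feature stacking described above can be sketched in Python as follows; the encode_segment callable is a hypothetical stand-in for the two-stream (RGB + optical flow) encoder and is not part of the patent text.

    import numpy as np

    def make_segments(num_frames, n_w, sigma):
        # (start, end) frame indices of a temporal sliding window of length n_w and stride sigma
        starts = range(0, max(num_frames - n_w + 1, 1), sigma)
        return [(s, min(s + n_w, num_frames)) for s in starts]

    def encode_video(frames, n_w, sigma, encode_segment):
        # Stack the per-segment features into the sequence F = {f_h}, shape (N, C_dim)
        segments = make_segments(len(frames), n_w, sigma)
        feats = [encode_segment(frames[s:e]) for s, e in segments]
        return np.stack(feats, axis=0)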
The structure of the deep hybrid convolutional neural network of the present embodiment is shown in fig. 2.
The first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module. The first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time sequence convolution block and generates a set of proposed features through pooling, wherein:
The convolution layers in the first subnet use large convolution kernels, with a kernel size of at least 9; the first subnet models complex temporal structure by stacking one-dimensional temporal convolution layers with large kernels.
To model the complex temporal structure of proposals and capture long-term temporal correlations, this embodiment employs a first subnet consisting of a one-dimensional time sequence convolution block and an RoI pooling layer to obtain high-quality proposed features based on the video segment features extracted by the feature coding module.
The one-dimensional time sequence convolution block includes two one-dimensional convolution layers. To avoid losing segment information, the number of channels of the one-dimensional convolutions is set to the same number C_dim as the segment feature dimension, and the stride is set to 1.
The hidden features of the last layer of the first subnet are added to the incoming segment features, so that the resulting features contain multi-scale temporal information.
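A minimal PyTorch sketch of this first subnet is given below for illustration; it is not the patent's exact implementation, and the kernel size of 9, the ReLU activations, and the pooled output length of 4 are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FirstSubnet(nn.Module):
        # Two 1-D temporal convolutions with a large kernel, channel count kept at C_dim,
        # stride 1, followed by a residual add of the input segment features.
        def __init__(self, c_dim, kernel_size=9):
            super().__init__()
            pad = kernel_size // 2
            self.conv1 = nn.Conv1d(c_dim, c_dim, kernel_size, stride=1, padding=pad)
            self.conv2 = nn.Conv1d(c_dim, c_dim, kernel_size, stride=1, padding=pad)

        def forward(self, feats):              # feats: (batch, C_dim, N) segment features
            x = F.relu(self.conv1(feats))
            x = F.relu(self.conv2(x))
            return x + feats                   # residual: keep multi-scale temporal information

    def roi_pool_1d(seq, proposals, out_len=4):
        # seq: (C_dim, N); proposals: list of (start, end) segment indices.
        # Each proposal span is adaptively max-pooled to a fixed length and flattened.
        pooled = []
        for s, e in proposals:
            span = seq[:, s:max(e, s + 1)]
            pooled.append(F.adaptive_max_pool1d(span.unsqueeze(0), out_len).flatten())
        return torch.stack(pooled)             # (num_proposals, C_dim * out_len)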
further, the proposed features generated by the first subnet are sent to a second subnet, a graph network is constructed, and the long-term time sequence dependency relationship is captured through the graph convolution network. The second subnet is used to mine the relationships between the offers and expand the acceptance area of the offer features.
The second subnet receives the proposed features, constructs a graph based on the relationships between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features, wherein:
The second subnet receives the proposed features and constructs all proposals in the video into a proposal graph G = <V(G), E(G)>, a graph comprising N nodes, where v_i ∈ V(G) is a node and e_ij ∈ E(G) is an edge. Each proposal is taken as a node, and the proposed feature obtained from the first subnet is taken as the node feature of the corresponding node.
Edges are constructed according to the temporal relationships between nodes. If two proposals overlap, the IoU metric is selected to measure their degree of overlap; if IoU(p_i, p_j) > θ_iou, an edge is established between proposals p_i and p_j, where
IoU(p_i, p_j) = |p_i ∩ p_j| / |p_i ∪ p_j|
If there is no overlap between the two proposals, the distance d(p_i, p_j) = |c_i - c_j| / |U(p_i, p_j)| is used to measure the distance between them, where θ_iou is a preset threshold, c_i and c_j are the center coordinates of proposals p_i and p_j respectively, and U(p_i, p_j) is the union of the two proposals (the minimal temporal span covering both).
An edge is established between two nodes when IoU(p_i, p_j) or d(p_i, p_j) is greater than the threshold θ_iou.
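A small Python sketch of this graph construction follows, using the criterion as stated in the text (an edge when IoU or the relative distance exceeds the threshold θ_iou); the added self-loops and the threshold value of 0.7 are illustrative assumptions.

    import numpy as np

    def temporal_iou(p, q):
        # IoU of two temporal intervals p = (start, end), q = (start, end)
        inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
        union = max(p[1], q[1]) - min(p[0], q[0])
        return inter / union if union > 0 else 0.0

    def build_proposal_graph(proposals, theta_iou=0.7):
        # proposals: list of (start, end) times; returns a binary adjacency matrix
        n = len(proposals)
        adj = np.eye(n)                                    # self-loops (assumption)
        for i in range(n):
            for j in range(i + 1, n):
                p, q = proposals[i], proposals[j]
                iou = temporal_iou(p, q)
                if iou > 0:                                # overlapping: compare IoU
                    connect = iou > theta_iou
                else:                                      # disjoint: compare relative center distance
                    c_i, c_j = 0.5 * (p[0] + p[1]), 0.5 * (q[0] + q[1])
                    u = max(p[1], q[1]) - min(p[0], q[0])  # span of the union U(p_i, p_j)
                    connect = abs(c_i - c_j) / u > theta_iou
                if connect:
                    adj[i, j] = adj[j, i] = 1.0
        return adj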
The constructed graph is input into the GCN model, i.e. a GCN (graph convolutional network) model is applied on the graph so that each node can aggregate information from its neighborhood.
In this embodiment, the GCN model is used to classify the proposals and comprises a first GCN model and a second GCN model; the first GCN model operates on the graph G built over the proposed features, and its input is the original proposed features, which are used to predict action categories. The input of the second GCN model is extended proposed features: in the first subnet, the start and the end of each proposal are each extended by half the proposal duration, and the RoI pooling layer is then applied to obtain the extended proposed features, which are used to predict completeness labels and action boundaries.
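For illustration, the proposal extension feeding the second GCN model can be written as the small helper below (a sketch; clamping to the video bounds is omitted).

    def extend_proposal(start, end):
        # Extend both the start and the end of a proposal by half its duration
        half = 0.5 * (end - start)
        return start - half, end + half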
Both the first GCN model and the second GCN model consist of two graph convolution layers, implemented as:
X^(i) = A X^(i-1) W^(i)
where A is the adjacency matrix computed from the cosine similarities between proposed features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i.
A fully-connected (FC) layer with a softmax operation is applied on top of the first GCN model to predict action labels, and two fully-connected (FC) layers are employed on top of the second GCN model to predict integrity labels and boundaries.
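A compact PyTorch sketch of the second subnet follows; the hidden width, the ReLU activations, the softmax row-normalisation of A, and the masking of A by the proposal graph are assumptions made for illustration rather than details taken from the patent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def cosine_adjacency(x, graph_mask=None):
        # A from cosine similarities between proposal features, optionally restricted to the
        # edges of the proposal graph (mask expected to include self-loops), then row-normalised.
        x_norm = F.normalize(x, dim=1)
        sim = x_norm @ x_norm.t()
        if graph_mask is not None:
            sim = sim.masked_fill(graph_mask == 0, float('-inf'))
        return torch.softmax(sim, dim=1)

    class TwoLayerGCN(nn.Module):
        # Two graph convolution layers: X^(i) = A X^(i-1) W^(i)
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
            self.w2 = nn.Linear(hid_dim, hid_dim, bias=False)

        def forward(self, x, adj):
            x = F.relu(adj @ self.w1(x))
            return F.relu(adj @ self.w2(x))

    class SecondSubnet(nn.Module):
        # GCN 1 (original proposal features) -> action categories;
        # GCN 2 (extended proposal features) -> completeness scores and boundary offsets.
        def __init__(self, in_dim, ext_dim, hid_dim, num_classes):
            super().__init__()
            self.gcn1 = TwoLayerGCN(in_dim, hid_dim)
            self.gcn2 = TwoLayerGCN(ext_dim, hid_dim)
            self.fc_cls = nn.Linear(hid_dim, num_classes + 1)   # n + 1 category scores (incl. background)
            self.fc_com = nn.Linear(hid_dim, num_classes)       # n completeness scores
            self.fc_reg = nn.Linear(hid_dim, 2 * num_classes)   # (t_c, t_l) offsets per class

        def forward(self, feats, ext_feats, graph_mask):
            h1 = self.gcn1(feats, cosine_adjacency(feats, graph_mask))
            h2 = self.gcn2(ext_feats, cosine_adjacency(ext_feats, graph_mask))
            return self.fc_cls(h1), self.fc_com(h2), self.fc_reg(h2)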
Further, in this embodiment, a completeness classifier is added to predict whether a proposal contains a complete action instance and to filter out proposals that contain only part of an action instance. The completeness classifier comprises n binary classifiers, each corresponding to one action class. For each action class k, the corresponding completeness classifier C_k produces a probability value representing the probability that the proposal captures a complete action instance of class k. By considering both the category score and the completeness score of a proposal, very good performance is obtained in temporal action detection.
To predict precise boundaries of action instances, this embodiment further learns the center-coordinate offset and the length offset between the real action instance and the proposal. The regression offsets are computed as:
t_c = (c_p,i - c_gt,i) / l_p,i,   t_l = log(l_gt,i / l_p,i)
where t_c is the center-coordinate offset, c_p,i is the proposal center coordinate, c_gt,i is the center coordinate of the label, and l_p,i is the proposal length; t_l is the length offset, and l_gt,i is the length of the label.
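The encoding, and the corresponding decoding used when applying predicted offsets back to a proposal, can be sketched as follows; the decode direction is inferred from the stated encoding rather than quoted from the text.

    import math

    def encode_offsets(c_p, l_p, c_gt, l_gt):
        # Regression targets between a proposal (c_p, l_p) and its ground truth (c_gt, l_gt)
        t_c = (c_p - c_gt) / l_p       # center-coordinate offset
        t_l = math.log(l_gt / l_p)     # length offset
        return t_c, t_l

    def decode_offsets(c_p, l_p, t_c, t_l):
        # Invert the encoding to recover the predicted center, length and boundaries
        c = c_p - t_c * l_p
        l = l_p * math.exp(t_l)
        return c - 0.5 * l, c + 0.5 * l   # predicted start and end times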
Further, the training process of the deep hybrid convolutional neural network comprises:
First, taking a sample video as an example, an untrimmed video can be represented as V = {X_t}_{t=1..T}, where T is the number of video frames and X_t is the t-th frame, an image of height H and width W. The annotation information of video V consists of a set of action instances Φ = {φ_n = (c_gt,n, l_gt,n, y_gt,n)}_{n=1..N_g}, where N_g is the number of real action instances in video V, c_gt,n and l_gt,n are the center coordinate and length of action instance φ_n respectively, and y_gt,n is the category label of action instance φ_n.
Given a set of proposals, three types of proposal samples are collected by evaluating their intersection-over-union (IoU) with the annotations: (1) positive proposals whose IoU with the closest annotated instance is greater than 0.7; (2) background proposals that meet the following criteria: their IoU with the closest annotated instance is below 0.01, and their span is greater than 0.01 of the video length; (3) incomplete proposals that meet the following criteria: the ratio of their span lying within the annotated instance is greater than 0.01, while their IoU with that instance is less than 0.3. During training, each mini-batch is guaranteed to contain all three types of proposals: positive and background proposals are used to train the action classifier; positive and incomplete proposals are applied to the completeness classifier; only positive proposals are used to train the regression fully-connected layer.
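A sketch of this sampling rule follows; the exact definition of the span ratio is not fully specified in the text, so the reading used here is an assumption.

    def proposal_type(iou_best, span_ratio_in_gt, span_frac_of_video):
        # iou_best           : IoU with the closest annotated instance
        # span_ratio_in_gt   : fraction of the proposal's span lying inside that instance (assumed reading)
        # span_frac_of_video : proposal length as a fraction of the whole video
        if iou_best > 0.7:
            return "positive"
        if iou_best < 0.01 and span_frac_of_video > 0.01:
            return "background"
        if span_ratio_in_gt > 0.01 and iou_best < 0.3:
            return "incomplete"
        return "ignored"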
In this embodiment, a multi-task loss function is adopted during training, which combines the action classification loss L_cls, the regression loss L_reg, the completeness loss L_com, and an L2 regularization term to train the network. The total loss is defined as
L_total = L_cls + λ_1 L_reg + λ_2 L_com + λ_3 L_2
where λ_1, λ_2 and λ_3 are weight coefficients; L_cls is a cross-entropy loss; L_com is a hinge loss that predicts whether a proposal captures a complete action instance.
The regression loss L_reg is the smooth L1 loss between each positive proposal and its closest real action instance, computed as:
L_reg = (1/N) Σ_i [ SL_1(t*_i,c - t_i,c) + SL_1(t*_i,l - t_i,l) ]
where SL_1 is the smooth L1 loss and N is the number of nodes; t_i,c is the center-coordinate offset between the proposal and the label, and t*_i,c is the predicted value of the center-coordinate offset; t_i,l is the length offset between the proposal and the label, and t*_i,l is the predicted value of the length offset.
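A PyTorch sketch of the multi-task loss is given below; the hinge formulation with ±1 completeness labels and the example λ values are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def total_loss(cls_logits, cls_labels,          # action classification (all sampled proposals)
                   com_scores, com_labels,          # completeness, labels in {+1, -1}
                   reg_pred, reg_target, pos_mask,  # (t_c, t_l) regression, positive proposals only
                   params, lambdas=(1.0, 0.1, 1e-4)):
        lam1, lam2, lam3 = lambdas
        l_cls = F.cross_entropy(cls_logits, cls_labels)
        l_com = F.relu(1.0 - com_labels * com_scores).mean()        # hinge loss
        if pos_mask.any():
            l_reg = F.smooth_l1_loss(reg_pred[pos_mask], reg_target[pos_mask])
        else:
            l_reg = reg_pred.sum() * 0.0                            # no positives in this batch
        l_2 = sum((p ** 2).sum() for p in params)                   # L2 regularization term
        return l_cls + lam1 * l_reg + lam2 * l_com + lam3 * l_2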
Optionally, in this embodiment, an iterative scheme is further adopted to perform boundary regression and classification, where the output boundaries are fed back to the network as input for the next refinement. As shown in fig. 3, the predictions of the RGB and Flow models are fused to obtain the final prediction, which is then fed back to the RGB and Flow models respectively for the next refinement. After k_i iterations, the final boundaries and classification scores are obtained.
At each iteration, the DHCNet outputs n pairs of temporal boundary offset values, n completeness scores, and n+1 category scores. The final confidence score is obtained by multiplying the category score and the completeness score for each of the n non-background categories; that is, for each proposal, the final score for the k-th category is calculated as:
score_k = s_k^cls · s_k^com
where s_k^cls is the category score and s_k^com is the completeness score.
For each proposal p_i, the category with the highest confidence is taken as the prediction, and the corresponding regression offset values are obtained.
Since the predictions of the RGB and Flow models have different confidence levels, a weighted average is used to fuse the RGB and Flow streams. Specifically, the proposed network is trained separately on the two streams, and the predictions of the RGB and Flow streams are then fused with a weight ratio of 1:η:
score_k = (score_k^RGB + η·score_k^Flow) / (1 + η)
t_c = (t_c^RGB + η·t_c^Flow) / (1 + η)
t_l = (t_l^RGB + η·t_l^Flow) / (1 + η)
where score_k^RGB and score_k^Flow are the final prediction scores of the RGB and Flow models for the k-th class, t_c^RGB and t_c^Flow are the center-coordinate offsets predicted by the RGB and Flow models respectively, and t_l^RGB and t_l^Flow are the length offsets predicted by the RGB and Flow models respectively. For each proposal, the final confidence score and the regressed boundary are used for evaluation, and non-maximum suppression (NMS) is used to reduce redundant results.
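The fusion and the temporal NMS step can be sketched as follows; the value of η and the NMS IoU threshold are illustrative assumptions.

    import numpy as np

    def fuse_two_streams(rgb, flow, eta=1.5):
        # Weighted average of RGB and Flow predictions at a 1:eta ratio
        return {k: (rgb[k] + eta * flow[k]) / (1.0 + eta) for k in rgb}

    def temporal_nms(segments, scores, iou_thr=0.5):
        # segments: (M, 2) array of (start, end); returns indices kept after greedy temporal NMS
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            rest = order[1:]
            inter = np.maximum(0.0, np.minimum(segments[i, 1], segments[rest, 1])
                                    - np.maximum(segments[i, 0], segments[rest, 0]))
            union = np.maximum((segments[i, 1] - segments[i, 0])
                               + (segments[rest, 1] - segments[rest, 0]) - inter, 1e-8)
            iou = inter / union
            order = rest[iou <= iou_thr]
        return keep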
The detection results of this embodiment on two public datasets are shown in figs. 4-5; the method of this embodiment achieves better detection performance, verifying its effectiveness.
An embodiment of the present invention further provides a time sequence action detection apparatus based on a deep hybrid convolutional neural network, as shown in fig. 6, the apparatus includes:
a video acquisition module configured to acquire a video to be detected;
a detection module configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time sequence convolution block and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationships between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features;
an output module configured to output the action category and the action start and end times of the video to be detected.
The embodiment of the invention further provides a time sequence action detection system based on the deep hybrid convolutional neural network, which comprises the following steps:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored in the memory and are loaded and executed by the processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
The embodiment of the invention further provides a computer readable storage medium having a plurality of instructions stored therein; the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions to enable a computer device (which may be a personal computer, a physical machine Server, or a network cloud Server, etc., and needs to install a Windows or Windows Server operating system) to perform some steps of the method according to various embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent change and modification made to the above embodiment according to the technical spirit of the present invention are still within the scope of the technical solution of the present invention.

Claims (6)

1. A time sequence action detection method based on a deep hybrid convolutional neural network is characterized by comprising the following steps:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment characteristics and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time series convolution block, and generates a group of proposed features through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationship between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features, specifically:
the second subnet receives the proposed features and constructs all proposals in the video into a proposal graph G = <V(G), E(G)>, a graph comprising N nodes, where v_i ∈ V(G) is a node and e_ij ∈ E(G) is an edge; each proposal is taken as a node, and the proposed feature obtained from the first subnet is taken as the node feature of the corresponding node;
edges are constructed according to the temporal relationships between nodes; if two proposals overlap, the IoU metric is selected to measure the degree of overlap between the proposals, and if IoU(p_i, p_j) > θ_iou, an edge is established between proposals p_i and p_j, where
IoU(p_i, p_j) = |p_i ∩ p_j| / |p_i ∪ p_j|
if there is no overlap between the two proposals, the distance d(p_i, p_j) = |c_i - c_j| / |U(p_i, p_j)| is used to measure the distance between the proposals, where θ_iou is a preset threshold, c_i and c_j are the center coordinates of proposals p_i and p_j respectively, U(p_i, p_j) is the union of the two proposals, and an edge is established between nodes for which d(p_i, p_j) is greater than the threshold;
an edge is established between nodes for which IoU(p_i, p_j) or d(p_i, p_j), as applicable, is greater than the threshold θ_iou;
the GCN model comprises a first GCN model and a second GCN model, wherein the input of the first GCN model is an original proposed characteristic used for predicting action categories; the second GCN model input is an extended proposal feature that extends both the start and end of the proposal feature by half the proposal duration in the first subnet, and then applies the ROI pooling layer to obtain an extended proposal feature for predicting integrity tags and action boundaries;
the first GCN model and the second GCN model each consist of two graph convolution layers, implemented as:
X^(i) = A X^(i-1) W^(i)
where A is the adjacency matrix computed from cosine similarities between proposed features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i;
applying a fully-connected (FC) layer with softmax operations over a first GCN model to predict action tags, and employing two fully-connected (FC) layers over a second GCN model to predict integrity tags and boundaries;
step S103: and outputting the action type, the action starting time and the action ending time of the video to be detected.
2. The time sequence action detection method based on the deep hybrid convolutional neural network of claim 1, wherein the number of channels of the one-dimensional convolutions is set to the same number C_dim as the feature dimension of the segments, and the stride is set to 1.
3. The time sequence action detection method based on the deep hybrid convolutional neural network of claim 1, wherein a multi-task loss function is adopted in training the deep hybrid convolutional neural network, the multi-task loss function combining the action classification loss L_cls, the regression loss L_reg, the completeness loss L_com, and an L2 regularization term to train the network; the total loss is defined as
L_total = L_cls + λ_1 L_reg + λ_2 L_com + λ_3 L_2
where λ_1, λ_2 and λ_3 are weight coefficients; L_cls is a cross-entropy loss; L_com is a hinge loss that predicts whether a proposal captures a complete action instance.
4. A time series motion detection apparatus based on a deep hybrid convolutional neural network, the apparatus comprising:
a video acquisition module configured to acquire a video to be detected;
a detection module configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature coding module, a first subnet and a second subnet;
the feature coding module extracts segment features from the raw video data through a two-stream network;
the first subnet obtains a group of proposed features based on the segment features extracted by the feature coding module; the first subnet comprises a one-dimensional time sequence convolution block and an RoI pooling layer which are sequentially connected, wherein the one-dimensional time sequence convolution block is used for receiving the segment characteristics and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional time series convolution block, and a set of proposed features are generated through pooling;
the second subnet receives the proposed features, constructs a graph based on the relationship between the proposed features, inputs the constructed graph into a GCN model, and enlarges the receptive field of the proposed features, specifically:
the second subnet receives the proposed features and constructs all proposals in the video into a proposal graph G = <V(G), E(G)>, a graph comprising N nodes, where v_i ∈ V(G) is a node and e_ij ∈ E(G) is an edge; each proposal is taken as a node, and the proposed feature obtained from the first subnet is taken as the node feature of the corresponding node;
edges are constructed according to the temporal relationships between nodes; if two proposals overlap, the IoU metric is selected to measure the degree of overlap between the proposals, and if IoU(p_i, p_j) > θ_iou, an edge is established between proposals p_i and p_j, where
IoU(p_i, p_j) = |p_i ∩ p_j| / |p_i ∪ p_j|
if there is no overlap between the two proposals, the distance d(p_i, p_j) = |c_i - c_j| / |U(p_i, p_j)| is used to measure the distance between the proposals, where θ_iou is a preset threshold, c_i and c_j are the center coordinates of proposals p_i and p_j respectively, U(p_i, p_j) is the union of the two proposals, and an edge is established between nodes for which d(p_i, p_j) is greater than the threshold;
an edge is established between nodes for which IoU(p_i, p_j) or d(p_i, p_j), as applicable, is greater than the threshold θ_iou;
the GCN model comprises a first GCN model and a second GCN model, wherein the input of the first GCN model is an original proposed characteristic used for predicting action categories; the second GCN model input is an extended proposal feature that extends both the start and end of the proposal feature by half the proposal duration in the first subnet, and then applies the ROI pooling layer to obtain an extended proposal feature for predicting integrity tags and action boundaries;
the first GCN model and the second GCN model each consist of two graph convolution layers, implemented as:
X^(i) = A X^(i-1) W^(i)
where A is the adjacency matrix computed from cosine similarities between proposed features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i;
applying a fully-connected (FC) layer with softmax operations over a first GCN model to predict action tags, and employing two fully-connected (FC) layers over a second GCN model to predict integrity tags and boundaries;
an output module configured to output the action category and the action start and end times of the video to be detected.
5. A time sequence action detection system based on a deep hybrid convolutional neural network is characterized by comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored by the memory and loaded and executed by the processor to perform the method for detecting the time sequence action based on the deep hybrid convolutional neural network as claimed in any one of claims 1 to 3.
6. A computer-readable storage medium having a plurality of instructions stored therein, the plurality of instructions being loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network as claimed in any one of claims 1 to 3.
CN202011402943.9A 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network Active CN112613349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011402943.9A CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011402943.9A CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Publications (2)

Publication Number Publication Date
CN112613349A CN112613349A (en) 2021-04-06
CN112613349B true CN112613349B (en) 2023-01-10

Family

ID=75228795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011402943.9A Active CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Country Status (1)

Country Link
CN (1) CN112613349B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128395B (en) * 2021-04-16 2022-05-20 重庆邮电大学 Video action recognition method and system based on hybrid convolution multistage feature fusion model
CN113420598B (en) * 2021-05-25 2024-05-14 江苏大学 Time sequence action detection method based on decoupling of context information and proposal classification
CN114863356B (en) * 2022-03-10 2023-02-03 西南交通大学 Group activity identification method and system based on residual aggregation graph network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3249610A1 (en) * 2016-05-26 2017-11-29 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110362715B (en) * 2019-06-28 2021-11-19 西安交通大学 Non-clipped video action time sequence positioning method based on graph convolution network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3249610A1 (en) * 2016-05-26 2017-11-29 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fall detection and recognition based on GCN and 2D Pose; Yin Zheng, et al; 2019 6th International Conference on Systems and Informatics; 2020-02-27; full text *

Also Published As

Publication number Publication date
CN112613349A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN109697434B (en) Behavior recognition method and device and storage medium
CN112613349B (en) Time sequence action detection method and device based on deep hybrid convolutional neural network
Kukleva et al. Unsupervised learning of action classes with continuous temporal embedding
Fong et al. Interpretable explanations of black boxes by meaningful perturbation
Liang et al. Interpretable structure-evolving LSTM
Du et al. Towards explanation of dnn-based prediction with guided feature inversion
US11640714B2 (en) Video panoptic segmentation
Esmaeili et al. Fast-at: Fast automatic thumbnail generation using deep neural networks
CN107169463A (en) Method for detecting human face, device, computer equipment and storage medium
KR102042168B1 (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
CN109063626B (en) Dynamic face recognition method and device
Giraldo et al. Graph CNN for moving object detection in complex environments from unseen videos
Vu et al. Energy-based models for video anomaly detection
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
CN112801063B (en) Neural network system and image crowd counting method based on neural network system
Roy et al. Foreground segmentation using adaptive 3 phase background model
CN112131944B (en) Video behavior recognition method and system
CN112084812A (en) Image processing method, image processing device, computer equipment and storage medium
CN114782997A (en) Pedestrian re-identification method and system based on multi-loss attention adaptive network
Lu et al. Learning the relation between interested objects and aesthetic region for image cropping
Nemade et al. Image segmentation using convolutional neural network for image annotation
CN110956157A (en) Deep learning remote sensing image target detection method and device based on candidate frame selection
CN115147890A (en) System, method and storage medium for creating image data embedding for image recognition
Xiao et al. Self-explanatory deep salient object detection
Baraka et al. Weakly-supervised temporal action localization: a survey

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant