CN112613349A - Time sequence action detection method and device based on deep hybrid convolutional neural network - Google Patents

Time sequence action detection method and device based on deep hybrid convolutional neural network

Info

Publication number
CN112613349A
CN112613349A (application CN202011402943.9A)
Authority
CN
China
Prior art keywords
subnet
action
neural network
features
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011402943.9A
Other languages
Chinese (zh)
Other versions
CN112613349B (en)
Inventor
甘明刚
张琰
刘洁玺
陈杰
窦丽华
陈文颉
陈晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202011402943.9A priority Critical patent/CN112613349B/en
Publication of CN112613349A publication Critical patent/CN112613349A/en
Application granted granted Critical
Publication of CN112613349B publication Critical patent/CN112613349B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention provides a time sequence action detection method and device based on a deep hybrid convolutional neural network. The method comprises: acquiring a video to be detected; and inputting the video into a trained deep hybrid convolutional neural network, wherein the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet. The feature encoding module extracts segment features from the original video data through a dual-stream network; the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the second subnet receives the proposal features, constructs a graph based on the relationships among the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features. The action category, action start time and action end time of the video to be detected are then output. According to the scheme of the invention, the relationships among proposals are effectively exploited, the accuracy of time sequence action detection is improved, and the time sequence action detection problem is effectively solved.

Description

Time sequence action detection method and device based on deep hybrid convolutional neural network
Technical Field
The invention relates to the field of video recognition, and in particular to a time sequence action detection method and device based on a deep hybrid convolutional neural network.
Background
Temporal action detection is a basic and challenging task in understanding human behavior, mainly used for segmenting and classifying long, untrimmed videos. Given a long untrimmed video, the algorithm needs to detect the action segments in the video, and the detection result includes a start time, an end time and an action category. A video may contain one or more identical or different action segments.
Temporal action detection is generally divided into two stages: temporal proposal generation and proposal classification. Temporal proposal generation is analogous to candidate-box generation in object detection, i.e., it generates a series of start and end times; proposal classification classifies the action within each generated proposal interval and determines the action class. For proposal classification, most methods treat it as an action recognition task and directly adopt action recognition methods. However, proposals have a more complex temporal structure than trimmed videos, and there is usually a semantic relationship between proposals. Existing methods ignore these differences, which limits their performance.
Directly adopting action recognition methods for proposal classification, as in the prior art, has the following problems:
(1) In action recognition there is only one action per clip, while a proposal may contain multiple actions.
(2) While action recognition involves a complete action, a proposal typically covers only a portion of an action.
(3) Clips in action recognition have no semantic relationship with one another, whereas multiple proposals generated from the same video or sharing the same action label are semantically related.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a time sequence action detection method and device based on a deep hybrid convolutional neural network, which address the following problems of the prior art: (1) the action recognition methods adopted for proposal classification cannot model the complex temporal structure between proposals; their receptive field is small, they capture only short-term temporal relationships, and they cannot obtain high-quality proposal features; (2) a proposal rarely contains an entire action and lacks sufficient information to generate accurate temporal boundaries, so information needs to be obtained from other proposals.
According to a first aspect of the present invention, there is provided a time series action detection method based on a deep hybrid convolutional neural network, the method comprising the steps of:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
step S103: outputting the action category, action start time and action end time of the video to be detected.
According to a second aspect of the present invention, there is provided a time-series action detection apparatus based on a deep hybrid convolutional neural network, the apparatus comprising:
a video acquisition module, configured to acquire a video to be detected;
a detection module, configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
an output module, configured to output the action category and the action start and end times of the video to be detected.
According to a third aspect of the present invention, there is provided a time-series action detection system based on a deep hybrid convolutional neural network, comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored in the memory and are loaded and executed by the processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having a plurality of instructions stored therein; the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
According to the scheme of the invention, the temporal structure can be modeled while the relationships among proposals are effectively utilized, thereby improving the accuracy of time sequence action detection and effectively solving the time sequence action detection problem.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of a method for detecting a time-series action based on a deep hybrid convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep hybrid convolutional neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fusion RGB and Flow model for prediction according to an embodiment of the present invention;
FIG. 4 is a graph illustrating the results of training tests on a test data set, in accordance with one embodiment of the present invention;
FIG. 5 is a graph illustrating the results of a training test on a test data set, in accordance with yet another embodiment of the present invention;
fig. 6 is a block diagram of a time sequence action detection apparatus based on a deep hybrid convolutional neural network according to an embodiment of the present invention.
Detailed Description
First, the flow of a time sequence action detection method based on a deep hybrid convolutional neural network according to an embodiment of the present invention will be described with reference to fig. 1. As shown in fig. 1, the method comprises the following steps:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
step S103: outputting the action category, action start time and action end time of the video to be detected.
In this embodiment, a dual-stream network (e.g., an I3D network) is used for feature encoding. Each input video is first divided into a set of segments, which are then sent into the dual-stream network for feature extraction.
The feature encoding module extracts segment features from the original video data through a dual-stream network as follows:
In the present embodiment, for the video data, i.e., a given uncut video V, a temporal sliding window of length n_w frames and stride σ is generated. The sliding window is used to divide the video V into a group of video segments

S = {s_h}, h = 1, ..., N

as the input of the feature encoding module, where s_h is a segmented video segment and N is the total number of segments in the video V. The feature encoding module outputs the feature sequence

F = {f_h}, h = 1, ..., N

where f_h is the feature corresponding to s_h, which serves as the input of the first subnet.
Thus, for a given uncut video, a feature sequence F = {f_h} is obtained as the input of the first subnet.
Further, a video feature sequence can also be obtained by concatenating the segment features, which is then fed into the first subnet to obtain high-quality proposal features.
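The following is a minimal Python sketch of this preprocessing step, assuming the sliding-window segmentation and a dual-stream (RGB plus optical-flow) backbone as described above. The function names, the use of PyTorch, and the assumption that the optical-flow segments are precomputed are all illustrative, not part of the patent.

```python
import torch

def segment_video(frames, n_w=64, stride=32):
    """Split a frame tensor (T, C, H, W) into overlapping segments of n_w frames
    using a sliding window with the given stride (n_w and stride are placeholders)."""
    T = frames.shape[0]
    segments = []
    for start in range(0, max(T - n_w, 0) + 1, stride):
        segments.append(frames[start:start + n_w])
    return segments

def encode_segments(segments, flow_segments, rgb_net, flow_net):
    """Encode each segment with a dual-stream backbone (RGB and optical flow)
    and concatenate the two stream features into one segment feature f_h."""
    feats = []
    for seg, flo in zip(segments, flow_segments):
        f_rgb = rgb_net(seg.unsqueeze(0))    # (1, C_rgb)
        f_flow = flow_net(flo.unsqueeze(0))  # (1, C_flow)
        feats.append(torch.cat([f_rgb, f_flow], dim=1))
    return torch.cat(feats, dim=0)           # (N, C_dim): the feature sequence F
```

In use, `rgb_net` and `flow_net` would be the two streams of a pretrained backbone (e.g. I3D) with their classification heads removed.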
The structure of the deep hybrid convolutional neural network of the present embodiment is shown in fig. 2.
The first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module. The first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence; the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers, and the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling, wherein:
The convolution layers in the first subnet use large convolution kernels, each of size at least 9 x 9; the first subnet models complex temporal structure by stacking one-dimensional temporal convolution layers with large kernels.
To model the complex temporal structure of proposals and capture long-term temporal correlations, the present embodiment employs a first subnet consisting of a one-dimensional temporal convolution block and an RoI pooling layer to obtain high-quality proposal features from the video segment features extracted by the feature encoding module.
The one-dimensional temporal convolution block comprises two one-dimensional convolution layers. To avoid losing segment information, the number of channels of the one-dimensional convolutions is set equal to the segment feature dimension C_dim, and the stride is set to 1.
The hidden features of the last layer of the first subnet are added to the incoming segment features, so that the resulting features contain multi-scale temporal information.
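A possible PyTorch sketch of this first subnet is shown below: two stacked 1-D temporal convolutions with C_dim channels and stride 1, the input features added back to the last hidden layer, followed by a simple temporal RoI pooling. Kernel size, the ReLU activations, and the adaptive-max-pool implementation of RoI pooling are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstSubnet(nn.Module):
    """Two stacked 1-D temporal convolutions (channels = C_dim, stride 1, large kernel),
    with the input segment features added to the last hidden layer (multi-scale features)."""
    def __init__(self, c_dim, kernel_size=9):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(c_dim, c_dim, kernel_size, stride=1, padding=pad)
        self.conv2 = nn.Conv1d(c_dim, c_dim, kernel_size, stride=1, padding=pad)

    def forward(self, feats):                 # feats: (B, C_dim, N) segment features
        h = F.relu(self.conv1(feats))
        h = F.relu(self.conv2(h))
        return h + feats                      # hidden features + incoming segment features

def roi_pool_1d(features, proposals, out_len=16):
    """Temporal RoI pooling: crop each (start, end) index range and pool to a fixed length."""
    pooled = []
    for start, end in proposals:
        crop = features[:, :, start:end]                       # (B, C_dim, L)
        pooled.append(F.adaptive_max_pool1d(crop, out_len))    # (B, C_dim, out_len)
    return torch.stack(pooled, dim=1)                          # (B, P, C_dim, out_len)
```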
further, the proposed features generated by the first subnet are sent to a second subnet, a graph network is constructed, and the long-term time sequence dependency relationship is captured through the graph convolution network. The second subnet is used to mine the relationships between the offers and expand the acceptance area of the offer features.
The second subnet receives the proposed features, constructs a graph based on relationships between the proposed features, inputs the constructed graph into a GCN model, expands an acceptance region of the proposed features, wherein:
the second subnet receives the offer features and constructs all offers in the video into an offer graph G<V(G),E(G)>Representing a graph comprising N nodes, where node viE.g. V (G), edge vijE (G); taking each proposal as a node, and taking the proposal characteristics obtained from the first subnet as the node characteristics of the corresponding node;
constructing edges based on the timing relationship between nodes, choosing IoU metric to measure the degree of overlap between two proposals if there is overlap between them, if IoU (p)i,pj)>θiouThen propose piAnd pjAn edge is established between the two edges,
Figure BDA0002817526760000061
if there is no overlap between the two proposals, d (p) is usedi,pj) Measure the distance between the proposals, where θiouIs a certain threshold; c. Ci、cjRespectively represent proposals pi、pjCentral coordinate of (b), U (p)i,pj) Represents the union of two proposals;
at IoU (p)i,pj) Or d (p)i,pj) Greater than a threshold value thetaiouAn edge is established between the nodes.
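A small NumPy sketch of this edge rule follows. It assumes the same threshold θ_iou is used for both the IoU test and the distance test, as the text states; the self-loops on the diagonal and the concrete threshold value are illustrative assumptions.

```python
import numpy as np

def temporal_iou(p, q):
    """IoU between two temporal proposals given as (start, end)."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def build_proposal_graph(proposals, theta=0.7):
    """Adjacency matrix over proposals: connect overlapping proposals with IoU > theta,
    and non-overlapping ones whose measure d = |c_i - c_j| / union exceeds theta."""
    n = len(proposals)
    A = np.eye(n)                              # self-loops (assumption)
    for i in range(n):
        for j in range(i + 1, n):
            p, q = proposals[i], proposals[j]
            iou = temporal_iou(p, q)
            if iou > 0:
                connected = iou > theta
            else:
                c_i, c_j = (p[0] + p[1]) / 2, (q[0] + q[1]) / 2
                union = max(p[1], q[1]) - min(p[0], q[0])
                connected = abs(c_i - c_j) / union > theta
            if connected:
                A[i, j] = A[j, i] = 1.0
    return A
```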
The constructed graph is input into the GCN model, i.e., a GCN (graph convolutional network) model is applied on the graph so that each node can aggregate information from its neighborhood.
In this embodiment, the GCN model is used for proposal classification and includes a first GCN model and a second GCN model, both operating on the graph G built over the proposal features. The input of the first GCN model is the original proposal features, used for predicting the action category. The input of the second GCN model is extended proposal features: in the first subnet, each proposal is extended at both its start and end by half of its duration, and the RoI pooling layer is then applied to obtain the extended proposal features, which are used for predicting the completeness label and the action boundaries.
The first GCN model and the second GCN model are both composed of two graph convolution layers, implemented as follows:

X^(i) = A X^(i-1) W^(i)

where A is the adjacency matrix computed from the cosine similarities between proposal features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i.
A fully-connected (FC) layer with a softmax operation is applied on top of the first GCN model to predict the action label, and two fully-connected (FC) layers are employed on top of the second GCN model to predict the completeness label and the boundaries.
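Below is a hedged PyTorch sketch of such a two-layer GCN with a cosine-similarity adjacency and the three prediction heads. The row-softmax normalization of A, the hidden sizes, the ReLU activations, and the head dimensions are all illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_adjacency(x):
    """Adjacency from cosine similarities between proposal features x: (P, C).
    Row-softmax normalization is an assumption."""
    x_norm = F.normalize(x, dim=1)
    return torch.softmax(x_norm @ x_norm.t(), dim=1)

class TwoLayerGCN(nn.Module):
    """Two graph convolution layers of the form X^(i) = A X^(i-1) W^(i)."""
    def __init__(self, c_in, c_hidden, c_out):
        super().__init__()
        self.w1 = nn.Linear(c_in, c_hidden, bias=False)
        self.w2 = nn.Linear(c_hidden, c_out, bias=False)

    def forward(self, x):                 # x: (P, c_in) proposal features
        a = cosine_adjacency(x)
        h = F.relu(a @ self.w1(x))
        return a @ self.w2(h)

# prediction heads on top of the two GCN branches (dimensions are placeholders)
num_classes = 20
gcn_cls = TwoLayerGCN(1024, 512, 256)      # on original proposal features
gcn_ext = TwoLayerGCN(1024, 512, 256)      # on extended proposal features
action_head = nn.Linear(256, num_classes + 1)    # softmax action label (incl. background)
completeness_head = nn.Linear(256, num_classes)  # per-class completeness
boundary_head = nn.Linear(256, 2 * num_classes)  # per-class (t_c, t_l) offsets
```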
Further, a completeness classifier is added to predict whether a proposal contains a complete action instance, so that proposals containing only part of an action instance are filtered out. The completeness classifier includes n binary classifiers, one per action class. For each action class k, the corresponding completeness classifier C_k generates a probability value representing the probability that the proposal captures a complete action instance of class k. Considering both the category score and the completeness score of a proposal yields strong temporal action detection performance.
To predict precise boundaries of action instances, the present embodiment further learns the center-coordinate offset and the length offset between a real action instance and a proposal. The regression offsets are computed as:

t_c = (c_{p,i} - c_{gt,i}) / l_{p,i},    t_l = log(l_{gt,i} / l_{p,i})

where t_c is the center-coordinate offset of the proposal, c_{p,i} is the proposal center coordinate, c_{gt,i} is the center coordinate of the labeled instance, l_{p,i} is the proposal length, t_l is the length offset, and l_{gt,i} is the length of the labeled instance.
Further, the training process of the deep hybrid convolutional neural network comprises:
first, as explained below with respect to a sample video, an uncut video may be represented as
Figure BDA0002817526760000071
Figure BDA0002817526760000072
Wherein T represents the number of video frames, XtRepresents the height and width of the image of the t-th frame, H represents the height of the image, and W represents the width of the image. Annotation information for video V is composed of a set of action instances
Figure BDA0002817526760000073
Figure BDA0002817526760000074
Wherein N isgNumber representing real action instance in video V, cgt,n,lgt,nRespectively represent an action instance phinCenter coordinate and length of (y)gt,nRepresentative action instance phinThe category label of (1).
Given a set of proposals, three types of training proposals are collected by evaluating their Intersection over Union (IoU) with the annotations: (1) positive proposals, whose IoU with the closest annotated instance is greater than 0.7; (2) background proposals, which satisfy the following criteria: their IoU with the closest annotated instance is below 0.01, and their span is greater than 0.01 of the video length; (3) incomplete proposals, which satisfy the following criteria: the fraction of their span contained in an annotated instance is greater than 0.01, while their IoU with that instance is less than 0.3. During training, each mini-batch is guaranteed to contain all three types of proposals: positive and background proposals are used to train the action classifier; positive and incomplete proposals are applied to the completeness classifier; only positive proposals are used to train the regression fully-connected layer.
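A minimal sketch of this sampling rule is given below, using the thresholds stated above (0.7, 0.01, 0.3). Interpreting the "span ratio" criterion as the fraction of the proposal's span that lies inside the annotated instance is an assumption, as is the function name.

```python
def categorize_proposal(proposal, gt_instances, video_len):
    """Assign a (start, end) proposal to positive / background / incomplete."""
    def iou(p, q):
        inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
        union = max(p[1], q[1]) - min(p[0], q[0])
        return inter / union if union > 0 else 0.0

    best_gt = max(gt_instances, key=lambda gt: iou(proposal, gt), default=None)
    best_iou = iou(proposal, best_gt) if best_gt is not None else 0.0
    span = proposal[1] - proposal[0]

    if best_iou > 0.7:
        return "positive"                      # trains classifier, completeness, regressor
    if best_iou < 0.01 and span > 0.01 * video_len:
        return "background"                    # trains the action classifier
    if best_gt is not None:
        overlap = max(0.0, min(proposal[1], best_gt[1]) - max(proposal[0], best_gt[0]))
        if overlap / span > 0.01 and best_iou < 0.3:
            return "incomplete"                # trains the completeness classifier
    return "ignored"
```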
In this embodiment, a multi-task loss function is adopted in the training process; it combines the action classification loss L_cls, the regression loss L_reg, the completeness loss L_com, and an L2 regularization term to train the network. The total loss is defined as

L_total = L_cls + λ1·L_reg + λ2·L_com + λ3·L2

where λ1, λ2 and λ3 are weight coefficients; L_cls is the cross-entropy loss; L_com is a hinge loss that predicts whether a proposal captures a complete action instance.
The regression loss L_reg is a smooth L1 loss between each positive proposal and its closest real action instance. It is computed as:

L_reg = (1/N) Σ_i [ SL1(t_{i,c} - t̂_{i,c}) + SL1(t_{i,l} - t̂_{i,l}) ]

where SL1 is the smooth L1 loss, N is the number of nodes, t_{i,c} is the center-coordinate offset between the proposal and the label and t̂_{i,c} is its predicted value, and t_{i,l} is the length offset between the proposal and the label and t̂_{i,l} is its predicted value.
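The following sketch shows how the regression targets and the combined multi-task loss could be assembled in PyTorch. The weight values, the use of `hinge_embedding_loss` as a stand-in for the completeness hinge loss, and the function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def regression_targets(c_p, l_p, c_gt, l_gt):
    """Offsets from the text: t_c = (c_p - c_gt) / l_p, t_l = log(l_gt / l_p)."""
    return (c_p - c_gt) / l_p, torch.log(l_gt / l_p)

def total_loss(cls_logits, labels, com_scores, com_labels,
               t_pred, t_target, model, lambdas=(1.0, 1.0, 1e-4)):
    """L_total = L_cls + λ1·L_reg + λ2·L_com + λ3·L2 (weights are placeholders)."""
    l_cls = F.cross_entropy(cls_logits, labels)
    l_reg = F.smooth_l1_loss(t_pred, t_target)              # over positive proposals only
    # completeness: hinge-style loss, targets in {+1, -1}; a placeholder for L_com
    l_com = F.hinge_embedding_loss(com_scores, com_labels)
    l_2 = sum((p ** 2).sum() for p in model.parameters())   # L2 regularization term
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_reg + lam2 * l_com + lam3 * l_2
```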
Optionally, in this embodiment, an iterative method is further adopted in the training process for boundary regression and classification: the output boundaries are fed back to the network as input for the next refinement. As shown in fig. 3, the predictions of the RGB and Flow models are fused to obtain the final prediction, which is fed back to the RGB and Flow models respectively for the next refinement. After k_i iterations, the final boundaries and classification scores are obtained.
For each iteration, the DHCNet outputs n pairs of temporal boundary offsets, n completeness scores, and n+1 category scores. The final confidence score is obtained by multiplying the category score and the completeness score for each of the n non-background categories; that is, for each proposal, the final score for the k-th category is calculated as:

s_{i,k} = s_{i,k}^{cls} · s_{i,k}^{com}

where s_{i,k}^{cls} is the category score and s_{i,k}^{com} is the completeness score.
For each proposal p_i, the category with the highest confidence is taken as the prediction, and the corresponding regression offsets are obtained.
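A short sketch of this score fusion step is shown below. The assumption that the background class occupies the first column of the classification scores, and the function name, are illustrative only.

```python
import torch

def fuse_scores(cls_scores, com_scores):
    """Final per-class confidence s_{i,k} = s^{cls}_{i,k} * s^{com}_{i,k}.
    cls_scores: (P, n+1) softmax scores with background at column 0 (assumed layout);
    com_scores: (P, n) completeness scores for the n action classes."""
    fused = cls_scores[:, 1:] * com_scores      # drop the background column
    conf, cls_idx = fused.max(dim=1)            # highest-confidence class per proposal
    return fused, conf, cls_idx
```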
Since the predictions of the RGB and Flow models have different confidence levels, a weighted-average method is used to fuse the RGB and Flow streams. Specifically, the network of this embodiment is trained separately on the two streams, and their predictions are then fused with a 1 : η weighting: the final prediction scores of the RGB and Flow models for the k-th class, the center-coordinate offsets predicted by the RGB and Flow models, and the length offsets predicted by the RGB and Flow models are each combined with weights 1 and η, respectively. For each proposal, the final confidence score and the regressed boundary are used for evaluation, and non-maximum suppression (NMS) is applied to reduce redundant results.
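A minimal sketch of the 1 : η fusion follows. Normalizing by (1 + η), the dictionary layout, and the example value of η are assumptions; the text only specifies that the class scores, center offsets and length offsets of the two streams are combined with a 1 : η weighting before NMS.

```python
def fuse_streams(rgb_pred, flow_pred, eta=1.5):
    """1 : eta weighted fusion of RGB and Flow predictions (eta is a placeholder).
    Each prediction dict holds per-proposal class scores and (t_c, t_l) offsets."""
    w = 1.0 + eta
    return {
        "scores":   (rgb_pred["scores"]   + eta * flow_pred["scores"])   / w,
        "t_center": (rgb_pred["t_center"] + eta * flow_pred["t_center"]) / w,
        "t_length": (rgb_pred["t_length"] + eta * flow_pred["t_length"]) / w,
    }
```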
The detection results of this embodiment on two public data sets are shown in figs. 4-5; the method achieves a better detection effect, which verifies its effectiveness.
An embodiment of the present invention further provides a time sequence action detection apparatus based on a deep hybrid convolutional neural network. As shown in fig. 6, the apparatus comprises:
a video acquisition module, configured to acquire a video to be detected;
a detection module, configured to input the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
an output module, configured to output the action category and the action start and end times of the video to be detected.
The embodiment of the invention further provides a time sequence action detection system based on the deep hybrid convolutional neural network, which comprises the following steps:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored in the memory and are loaded and executed by the processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
The embodiment of the invention further provides a computer readable storage medium having a plurality of instructions stored therein; the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network described above.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a physical machine Server, or a network cloud Server, etc., and needs to install a Windows or Windows Server operating system) to perform some steps of the method according to various embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent change and modification made to the above embodiment according to the technical spirit of the present invention are still within the scope of the technical solution of the present invention.

Claims (9)

1. A time sequence action detection method based on a deep hybrid convolutional neural network is characterized by comprising the following steps:
step S101: acquiring a video to be detected;
step S102: inputting the video into a trained deep hybrid convolutional neural network;
the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features;
step S103: outputting the action category, action start time and action end time of the video to be detected.
2. The time sequence action detection method based on the deep hybrid convolutional neural network of claim 1, wherein the number of channels of the one-dimensional convolutions is set equal to the segment feature dimension C_dim, and the stride is set to 1.
3. The method of claim 1, wherein the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features, wherein:
the second subnet receives the proposal features and constructs all proposals in the video into a proposal graph G = (V(G), E(G)) with N nodes, where node v_i ∈ V(G) and edge e_ij ∈ E(G); each proposal is taken as a node, and the proposal feature obtained from the first subnet is taken as the node feature of the corresponding node;
edges are constructed based on the temporal relationship between nodes; if two proposals overlap, the IoU metric is chosen to measure their degree of overlap: if IoU(p_i, p_j) > θ_iou, an edge is established between proposals p_i and p_j;

d(p_i, p_j) = |c_i - c_j| / U(p_i, p_j)

if there is no overlap between the two proposals, d(p_i, p_j) is used to measure the distance between them, where θ_iou is a given threshold, c_i and c_j respectively represent the center coordinates of proposals p_i and p_j, and U(p_i, p_j) represents the temporal union of the two proposals; an edge is established between nodes when d(p_i, p_j) is greater than the threshold;
that is, an edge is established between two nodes when IoU(p_i, p_j) or d(p_i, p_j) is greater than the threshold θ_iou.
4. The method of claim 3, wherein the GCN model comprises a first GCN model and a second GCN model; the input of the first GCN model is the original proposal features, used for predicting the action category; the input of the second GCN model is extended proposal features, obtained in the first subnet by extending each proposal at both its start and end by half of its duration and then applying the RoI pooling layer, and used for predicting the completeness label and the action boundaries.
5. The method of claim 4, wherein the first GCN model and the second GCN model are each composed of two graph convolution layers, and a graph convolution layer is implemented as:

X^(i) = A X^(i-1) W^(i)

where A is the adjacency matrix computed from the cosine similarities between proposal features, W^(i) is the parameter matrix to be learned, and X^(i) contains the hidden features of all proposals at layer i;
a fully-connected (FC) layer with a softmax operation is applied on top of the first GCN model to predict the action label, and two fully-connected (FC) layers are employed on top of the second GCN model to predict the completeness label and the boundaries.
6. The time sequence action detection method based on the deep hybrid convolutional neural network of claim 1, wherein a multi-task loss function is adopted in the training process of the deep hybrid convolutional neural network; the multi-task loss function combines the action classification loss L_cls, the regression loss L_reg, the completeness loss L_com and an L2 regularization term to train the network; the total loss is defined as

L_total = L_cls + λ1·L_reg + λ2·L_com + λ3·L2

where λ1, λ2 and λ3 are weight coefficients; L_cls is the cross-entropy loss; L_com is a hinge loss that predicts whether a proposal captures a complete action instance.
7. A time sequence action detection apparatus based on a deep hybrid convolutional neural network, the apparatus comprising:
a video acquisition module, configured to acquire a video to be detected;
a detection module, configured to input the video into a trained deep hybrid convolutional neural network;
wherein the deep hybrid convolutional neural network model comprises a feature encoding module, a first subnet and a second subnet;
the feature encoding module extracts segment features from the original video data through a dual-stream network;
the first subnet obtains a group of proposal features based on the segment features extracted by the feature encoding module; the first subnet comprises a one-dimensional temporal convolution block and an RoI pooling layer connected in sequence, wherein the one-dimensional temporal convolution block receives the segment features and comprises two one-dimensional convolution layers; the RoI pooling layer receives the output of the one-dimensional temporal convolution block and generates a group of proposal features through pooling;
the second subnet receives the proposal features, constructs a graph based on the relationships between the proposal features, inputs the constructed graph into a GCN model, and expands the receptive field of the proposal features; and
an output module, configured to output the action category and the action start and end times of the video to be detected.
8. A time sequence action detection system based on a deep hybrid convolutional neural network is characterized by comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the plurality of instructions are stored by the memory and loaded and executed by the processor to perform the method for detecting the time sequence action based on the deep hybrid convolutional neural network as claimed in any one of claims 1 to 6.
9. A computer-readable storage medium having a plurality of instructions stored therein, wherein the plurality of instructions are loaded and executed by a processor to perform the time sequence action detection method based on the deep hybrid convolutional neural network according to any one of claims 1 to 6.
CN202011402943.9A 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network Active CN112613349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011402943.9A CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011402943.9A CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Publications (2)

Publication Number Publication Date
CN112613349A true CN112613349A (en) 2021-04-06
CN112613349B CN112613349B (en) 2023-01-10

Family

ID=75228795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011402943.9A Active CN112613349B (en) 2020-12-04 2020-12-04 Time sequence action detection method and device based on deep hybrid convolutional neural network

Country Status (1)

Country Link
CN (1) CN112613349B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113420598A (en) * 2021-05-25 2021-09-21 江苏大学 Time sequence action detection method based on context information and proposed classification decoupling
CN114863356A (en) * 2022-03-10 2022-08-05 西南交通大学 Group activity identification method and system based on residual aggregation graph network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3249610A1 (en) * 2016-05-26 2017-11-29 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110362715A (en) * 2019-06-28 2019-10-22 西安交通大学 A kind of non-editing video actions timing localization method based on figure convolutional network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3249610A1 (en) * 2016-05-26 2017-11-29 Nokia Technologies Oy A method, an apparatus and a computer program product for video object segmentation
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110362715A (en) * 2019-06-28 2019-10-22 西安交通大学 A kind of non-editing video actions timing localization method based on figure convolutional network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIN ZHENG, ET AL: "Fall detection and recognition based on GCN and 2D Pose", 《2019 6TH INTERNATIONAL CONFERENCE ON SYSTEMS AND INFORMATICS》 *
张聪聪等: "基于关键帧的双流卷积网络的人体动作识别方法", 《南京信息工程大学学报(自然科学版)》 *
王倩等: "基于双流卷积神经网络的时序动作定位", 《软件导刊》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113128395B (en) * 2021-04-16 2022-05-20 重庆邮电大学 Video action recognition method and system based on hybrid convolution multistage feature fusion model
CN113420598A (en) * 2021-05-25 2021-09-21 江苏大学 Time sequence action detection method based on context information and proposed classification decoupling
CN114863356A (en) * 2022-03-10 2022-08-05 西南交通大学 Group activity identification method and system based on residual aggregation graph network
CN114863356B (en) * 2022-03-10 2023-02-03 西南交通大学 Group activity identification method and system based on residual aggregation graph network

Also Published As

Publication number Publication date
CN112613349B (en) 2023-01-10

Similar Documents

Publication Publication Date Title
CN109697434B (en) Behavior recognition method and device and storage medium
Kukleva et al. Unsupervised learning of action classes with continuous temporal embedding
CN112613349B (en) Time sequence action detection method and device based on deep hybrid convolutional neural network
Fong et al. Interpretable explanations of black boxes by meaningful perturbation
Du et al. Towards explanation of dnn-based prediction with guided feature inversion
Li et al. Contrast-oriented deep neural networks for salient object detection
US11640714B2 (en) Video panoptic segmentation
Esmaeili et al. Fast-at: Fast automatic thumbnail generation using deep neural networks
CN107169463A (en) Method for detecting human face, device, computer equipment and storage medium
Qi et al. Embedding deep networks into visual explanations
KR102042168B1 (en) Methods and apparatuses for generating text to video based on time series adversarial neural network
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
Giraldo et al. Graph CNN for moving object detection in complex environments from unseen videos
Vu et al. Energy-based models for video anomaly detection
CN112801063B (en) Neural network system and image crowd counting method based on neural network system
CN111291817A (en) Image recognition method and device, electronic equipment and computer readable medium
Roy et al. Foreground segmentation using adaptive 3 phase background model
CN112084812A (en) Image processing method, image processing device, computer equipment and storage medium
Lu et al. Learning the relation between interested objects and aesthetic region for image cropping
CN115147890A (en) System, method and storage medium for creating image data embedding for image recognition
Mseddi et al. Real-time scene background initialization based on spatio-temporal neighborhood exploration
Xiao et al. Self-explanatory deep salient object detection
CN112597997A (en) Region-of-interest determining method, image content identifying method and device
Sellars et al. Two cycle learning: clustering based regularisation for deep semi-supervised classification
CN110956157A (en) Deep learning remote sensing image target detection method and device based on candidate frame selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant