CN111898461B - Time sequence behavior segment generation method - Google Patents

Time sequence behavior segment generation method

Info

Publication number
CN111898461B
CN111898461B
Authority
CN
China
Prior art keywords
video unit
behavior
time sequence
sequence
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010651476.7A
Other languages
Chinese (zh)
Other versions
CN111898461A (en)
Inventor
宋井宽
李涛
高联丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Guizhou University
Original Assignee
University of Electronic Science and Technology of China
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, Guizhou University filed Critical University of Electronic Science and Technology of China
Priority to CN202010651476.7A priority Critical patent/CN111898461B/en
Publication of CN111898461A publication Critical patent/CN111898461A/en
Application granted granted Critical
Publication of CN111898461B publication Critical patent/CN111898461B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a time sequence behavior segment generation method, which relates to the technical field of video processing and comprises the following steps: coding the video unit feature sequence with a pyramid context sensing mechanism to obtain multi-scale information features; extracting a fixed feature representation of the time sequence behavior segments from the video unit feature sequence with a learnable boundary matching network; and generating time sequence behavior segments based on the multi-scale information features and the fixed feature representation. The pyramid context sensing mechanism effectively encodes the multi-scale information of the video and has a large receptive field, which solves the problem that time sequence behavior segments in a video have different lengths; the learnable boundary matching network extracts a fixed feature representation of time sequence behavior segments of different lengths through a learned coding scheme, can be added to the network for end-to-end training, and overcomes the defect that the time sequence behavior segment features extracted by existing methods contain a large amount of noise.

Description

Time sequence behavior segment generation method
Technical Field
The invention relates to the technical field of video processing, in particular to a time sequence behavior segment generation method.
Background
Time sequence behavior segment generation means that, given an unsegmented long video, the algorithm must detect the behavior segments in the video, including the start time and end time of each behavior segment, so as to accurately locate the time periods in which behaviors occur in the long video and filter out irrelevant information.
Existing methods can be divided into two categories: anchor-based methods and boundary-based methods.
Anchor-based methods pre-define time sequence windows of different sizes to represent time sequence behavior segments of different lengths, and then classify the predefined time sequence behavior segments with a classifier to distinguish behavior from background. In practical applications, however, data sets contain rich categories and time sequence behavior segments of widely varying lengths, so predefined time sequence windows have the following disadvantages: 1) they cannot effectively cover time sequence behavior segments of different lengths in a long video; 2) setting the window scales requires a large amount of manual intervention; 3) the predefined time sequence windows do not have precise boundaries.
Boundary-based methods classify the time points in a video, estimating the probability of each time point being the start or end of a behavior, in order to determine whether it is a boundary point of a time sequence behavior segment. Existing methods acquire context information by stacking time sequence convolution layers or by using a long short-term memory network (LSTM). However, these methods have the following problems: 1) stacking time sequence convolution layers can only increase the receptive field of the network model to a limited extent; 2) a long short-term memory network models the long video to obtain global information, but time sequence behavior segments differ in length, and the existing methods cannot obtain information of different scales.
In addition, most existing methods model the time sequence behavior segments with simple operations such as average pooling, maximum pooling or random sampling. Such modeling introduces a large amount of noise into the extracted time sequence behavior segment features: some video unit features within a time sequence behavior segment may contain behavior information while other video units contain background information, so directly applying such simple modeling makes the extracted time sequence behavior segment features insufficiently discriminative.
Disclosure of Invention
The invention provides a time sequence behavior segment generation method based on a pyramid context sensing mechanism and a learnable boundary matching network, which can alleviate the problems.
In order to alleviate the above problems, the technical scheme adopted by the invention is as follows:
a time series behavior segment generation method comprises the following steps:
s1, coding the input video by using a double-current (two-stream) convolutional neural network to extract a video unit feature sequence;
s2, coding the video unit feature sequence by adopting a pyramid context sensing mechanism to obtain multi-scale information features;
s3, extracting fixed characteristic representation of the time sequence behavior segment from the video unit characteristic sequence by adopting a learnable boundary matching network;
and S4, generating time sequence behavior segments based on the multi-scale information characteristics and the fixed characteristic representation.
The technical effect of the technical scheme is as follows: the multi-scale information of the video is effectively coded through a pyramid context sensing mechanism, the video has a great receptive field, and the problem that time sequence behavior segments in the video are different in size is solved; the learnable boundary matching network can extract fixed characteristic representation of time sequence behavior segments with different lengths through a learnt coding mode, can be added into the network for end-to-end training, and overcomes the defect that the time sequence behavior segment characteristics extracted by the existing method have a lot of noises.
Further, the step S1 is specifically:
dividing a video picture sequence into a video unit sequence;
and for each video unit in the video unit sequence, coding the video unit by using a double-current convolutional neural network to extract the initial video unit characteristics to obtain an initial video unit characteristic sequence, and then performing dimension reduction processing on the initial video unit characteristic sequence by using a layer of time sequence convolutional layer to obtain a final video unit characteristic sequence.
The technical effect of the technical scheme is as follows: the video picture sequence is divided into the video unit sequence, so that high calculation efficiency is guaranteed, and data are not redundant. Further, in step S2, the pyramid context sensing mechanism includes a plurality of continuous layers of time sequence hole convolution modules, each layer of time sequence hole convolution module is configured to acquire an information feature of one scale in the video unit feature sequence, and the information features acquired by each layer of time sequence hole convolution module are subjected to a splicing operation to obtain the multi-scale information feature.
The technical effect of the technical scheme is as follows: under the condition of ensuring that the time sequence resolution is not changed, the pyramid context sensing mechanism not only has a very large receptive field, but also can fuse information of different scales.
Further, the time sequence hole convolution module comprises a time sequence hole convolution layer, a linear rectification function, a time sequence hole convolution layer and a random deactivation layer which are connected in sequence.
Further, the step S3 specifically includes:
s31, obtaining a clustering center vector through normal distribution initialization;
s32, calculating a video unit distribution graph of which the video unit features belong to the clustering center vector through the full connection layer and softmax;
s33, performing element product operation by using the time sequence behavior mask matrix and the video unit distribution map to obtain a time sequence behavior fragment distribution map;
s34, performing residual error operation on the time sequence behavior fragment distribution graph and the clustering center vector to obtain the characteristics of all possible time sequence behavior fragments;
s35, carrying out normalization processing of L2 on the characteristics of all possible time sequence behavior segments to obtain fixed characteristic representation of the time sequence behavior segments.
The technical effect of the technical scheme is as follows: the learnable boundary matching network can extract fixed characteristic representation of time sequence behavior segments with different lengths in a learnable coding mode, and can be added into the network to carry out end-to-end training.
Further, the step S4 specifically includes:
s41, inputting the multi-scale information characteristics into a prediction layer, and optimizing the output result of the prediction layer to obtain a probability sequence of a video unit as a behavior start and a behavior end;
s42, inputting the fixed characteristic representation of the time sequence behavior segment into a prediction layer, and optimizing the output result of the prediction layer to obtain two time sequence behavior segment confidence score graphs;
s43, setting boundary point judgment conditions, selecting a plurality of candidate behavior starting boundary points and a plurality of candidate behavior ending boundary points from the probability sequence according to the boundary point judgment conditions, and combining the candidate behavior starting boundary points and the candidate behavior ending boundary points to generate a plurality of initial time sequence behavior segments;
and S44, retrieving the confidence score of the initial time sequence behavior segment from the confidence score maps of the two time sequence behavior segments according to the sequence numbers of the boundary points, removing the initial time sequence behavior segment with low confidence score, and keeping the initial time sequence behavior segment with high confidence score as the finally generated time sequence behavior segment.
Furthermore, there are two boundary point determination conditions, namely condition 1 and condition 2; if the probability of a video unit satisfies either of the two conditions, that video unit is taken as a candidate boundary point;
the condition 1 is that the probability of the video unit is higher than 0.5 times of the maximum value in the probability sequence;
the condition 2 is that the probability of the video unit is higher than the probabilities of the previous video unit and the next video unit.
Further, the output result of the prediction layer in step S41 is optimized by a cross entropy loss function, and the output result of the prediction layer in step S42 is optimized by a mean square error and a cross entropy loss function.
Further, the prediction layers used in steps S41 and S42 each include a time-series convolutional layer having a convolutional kernel size of 1 and a Sigmoid function.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic diagram of a network structure for generating time-series behavior segments according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a pyramid context-aware mechanism according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a learnable boundary matching network according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
Referring to fig. 1 to fig. 3, the present embodiment discloses a method for generating time-series behavior segments.
Given a video picture sequence S = {s_t}_{t=1}^{l_s}, where s_t denotes the t-th frame picture and l_s denotes the total number of pictures in the video picture sequence.
In order to ensure high calculation efficiency without data redundancy, the invention first divides the video picture sequence into a video unit sequence U = {u_j}_{j=1}^{T}, where T = l_s / n_u denotes the total number of video units. Each video unit u = {s_f, ..., s_{f+n_u}} is the smallest unit processed by the video unit coding network, where s_f denotes the starting picture frame of the video unit, s_{f+n_u} denotes its ending picture frame, and the unit contains n_u consecutive pictures in total.
Then, each video unit is coded with a double-current convolutional neural network to extract an initial video unit feature f_u ∈ R^{C}, where C denotes the feature dimension. All initial video unit features are then passed through one time sequence convolution layer for dimension reduction to obtain the final video unit feature sequence F = {f_t}_{t=1}^{T}.
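Purely as an illustration of this preprocessing step, the following Python (PyTorch) sketch divides a frame sequence into video units and reduces the feature dimension with one temporal convolution layer; the two-stream backbone is abstracted as an opaque callable, and the helper names, feature dimensions and kernel size are assumptions rather than values given in the patent.

```python
import torch
import torch.nn as nn

def extract_unit_features(frames, two_stream_encoder, n_u, reduce_conv):
    """frames: tensor of shape (l_s, 3, H, W), the video picture sequence.
    two_stream_encoder: assumed callable mapping one video unit (n_u frames) to a C-dim feature.
    reduce_conv: one time sequence convolution layer used for dimension reduction."""
    T = frames.shape[0] // n_u                                    # T = l_s / n_u video units
    units = frames[: T * n_u].reshape(T, n_u, *frames.shape[1:])  # split into video units
    feats = torch.stack([two_stream_encoder(u) for u in units])   # (T, C) initial unit features
    # one temporal convolution layer reduces the feature dimension: (1, C, T) -> (1, C_red, T)
    reduced = reduce_conv(feats.t().unsqueeze(0))
    return reduced.squeeze(0).t()                                 # (T, C_red) final feature sequence

# Example dimension-reduction layer; input/output channels and kernel size are assumed values.
reduce_conv = nn.Conv1d(in_channels=4096, out_channels=256, kernel_size=3, padding=1)
```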
1. Pyramid context-aware mechanism processing
The video unit feature sequence F = {f_t}_{t=1}^{T} is encoded with the pyramid context-aware mechanism to obtain the multi-scale information features.
In this embodiment, the pyramid context-aware mechanism includes a plurality of consecutive time sequence hole convolution modules. As shown in fig. 2, each time sequence hole convolution module consists of a time sequence hole convolution layer, a linear rectification function (ReLU), a second time sequence hole convolution layer and a random deactivation layer (Dropout) connected in sequence; the convolution kernel size of both time sequence hole convolution layers is set to 3, and the random deactivation layer is introduced to prevent overfitting.
In order to acquire multi-scale information, the present invention uses a densely connected structure to splice outputs of different time sequence hole convolution layers, and finally obtains multi-scale information characteristics.
With the densely connected structure, the l-th time sequence hole convolution module takes the outputs of all preceding modules as input, and its output x_l is:
x_l = H_l([x_1, x_2, ..., x_{l-1}], r_l)
where [·] denotes the splicing (concatenation) operation and r_l denotes the hole rate of the time sequence hole convolution in the l-th time sequence hole convolution module, which is set to 2^{l-1}. The output of the last time sequence hole convolution module finally yields the multi-scale information features.
The invention finds that the bottom-layer time sequence hole convolution modules have a smaller receptive field and can be responsible for shorter time sequence behavior segments, while the top-layer time sequence hole convolution modules have a larger receptive field and can be responsible for longer time sequence behavior segments. Therefore, while keeping the time sequence resolution unchanged, the pyramid context-aware mechanism not only has a very large receptive field but also fuses information of different scales.
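As a concrete reading of the above description, the following is a minimal PyTorch sketch of a pyramid context-aware stack of densely connected time sequence hole (dilated) convolution modules; the number of layers, channel widths, dropout rate and the way the per-layer outputs are spliced are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class TemporalHoleConvModule(nn.Module):
    """One time sequence hole convolution module:
    dilated conv -> ReLU -> dilated conv -> dropout, kernel size 3."""
    def __init__(self, in_channels, out_channels, dilation, dropout=0.1):
        super().__init__()
        pad = dilation  # keeps the temporal resolution unchanged for kernel size 3
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, dilation=dilation, padding=pad),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, dilation=dilation, padding=pad),
            nn.Dropout(dropout),
        )

    def forward(self, x):          # x: (batch, channels, T)
        return self.block(x)

class PyramidContextAware(nn.Module):
    """Densely connected stack: module l sees the concatenation of all earlier outputs
    and uses hole rate 2**(l-1); num_layers and channel sizes are assumptions."""
    def __init__(self, feat_dim=256, channels=128, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = feat_dim
        for l in range(1, num_layers + 1):
            self.layers.append(TemporalHoleConvModule(in_ch, channels, dilation=2 ** (l - 1)))
            in_ch += channels      # dense connection grows the input of the next module

    def forward(self, f):          # f: (batch, feat_dim, T) video unit feature sequence
        feats = [f]
        for layer in self.layers:
            x_l = layer(torch.cat(feats, dim=1))   # x_l = H_l([x_1, ..., x_{l-1}], r_l)
            feats.append(x_l)
        return torch.cat(feats[1:], dim=1)         # spliced multi-scale information features
```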
2. Learnable boundary matching network processing
The learnable boundary matching network extracts a fixed feature representation of the time sequence behavior segments from the video unit feature sequence F = {f_t}_{t=1}^{T}, as follows:
first, the learnable boundary matching network passes through positiveInitializing state distribution to obtain K clustering center vectors { c k }. The learnable boundary matching network then computes each video unit feature f t Belong to a cluster center { c k Profile a of } ═ a k }. The calculation formula is as follows:
Figure BDA0002575131710000061
wherein a is k (f t ) Representing the t-th video unit feature f t To the kth cluster center c k Probability of (c) { w } k },{b k And { c }and k Are all learnable parameters in the network model.
Next, in order to obtain a distribution map describing how time sequence behavior segments are assigned to the cluster centers, a time sequence behavior segment mask matrix is introduced to expand the video unit distribution map a into a time sequence behavior segment distribution map A.
The time sequence behavior segment mask matrix gives a binary representation of all possible time sequence behavior segments. Specifically, it is a binary matrix M ∈ {0, 1}^{D×T×1×T} containing D×T time sequence behavior segment mask vectors, where D and T denote the longest length of a time sequence behavior segment in the data set and the length of the video unit feature sequence, respectively. For example, the point M(m, n) in the time sequence behavior segment mask matrix corresponds to the time sequence behavior segment that starts at the n-th video unit, lasts m video units and ends at the (n+m)-th video unit; its mask vector is computed as:
M(m, n, 1, t) = 1 if n ≤ t ≤ n + m, and 0 otherwise.
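A minimal sketch of how such a binary mask matrix could be precomputed (0-based indexing and the helper name build_mask_matrix are assumptions introduced only for illustration):

```python
import torch

def build_mask_matrix(D, T):
    """Binary time sequence behavior segment mask matrix M of shape (D, T, 1, T):
    M[m, n, 0, t] = 1 if the t-th video unit lies inside the segment that starts at
    video unit n and ends at video unit n + m (0-based indices), else 0."""
    M = torch.zeros(D, T, 1, T)
    for m in range(D):
        for n in range(T):
            end = min(n + m, T - 1)          # clip segments that would run past the sequence
            M[m, n, 0, n:end + 1] = 1.0
    return M
```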
next, the present invention first maps the video unit distribution
Figure BDA0002575131710000065
Dimension expansion is carried out to obtain
Figure BDA0002575131710000066
Then, the invention maps the video unit distribution diagram
Figure BDA0002575131710000067
And a time-sequential behavior mask matrix
Figure BDA0002575131710000068
The element product operation is carried out to obtain a behavior segment distribution diagram
Figure BDA0002575131710000069
The calculation formula of the a (m, n, k, t) element is as follows:
A(m,n,k,t)=a k (f t )·M(m,n,1,t)
where A(m, n, k, t) denotes the probability that the t-th video unit feature within the time sequence behavior segment indexed by (m, n) belongs to the k-th cluster center c_k. Finally, the learnable boundary matching network obtains the features V ∈ R^{D×T×K×C} of all possible time sequence behavior segments through a residual operation; V contains the features of D×T time sequence behavior segments, each with feature dimension K×C. The element V(m, n, k, j) is computed as:
V(m, n, k, j) = Σ_{t=1}^{T} A(m, n, k, t) · (f_t(j) - c_k(j))
where f_t(j) denotes the j-th element of the t-th video unit feature and c_k(j) denotes the j-th element of the k-th cluster center.
Finally, L2 normalization is applied to the features V(m, n, k, j) of all possible time sequence behavior segments to obtain the fixed feature representation of the time sequence behavior segments.
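Putting the above steps together, a minimal PyTorch sketch of the learnable boundary matching network might look as follows; the cluster count K, the einsum-based residual aggregation and the reuse of the hypothetical build_mask_matrix helper from the earlier sketch are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableBoundaryMatching(nn.Module):
    def __init__(self, feat_dim, num_clusters, D, T):
        super().__init__()
        # Fully connected layer + softmax produces the video unit distribution map a_k(f_t)
        self.assign = nn.Linear(feat_dim, num_clusters)                    # parameters {w_k}, {b_k}
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))   # {c_k}, normal init
        self.register_buffer("mask", build_mask_matrix(D, T))              # (D, T, 1, T)

    def forward(self, f):                       # f: (T, C) video unit feature sequence
        a = F.softmax(self.assign(f), dim=-1)   # (T, K) video unit distribution map
        a = a.t().unsqueeze(0).unsqueeze(0)     # dimension expansion to (1, 1, K, T)
        A = a * self.mask                       # (D, T, K, T) segment distribution map
        # Residual operation: V(m, n, k, :) = sum_t A(m, n, k, t) * (f_t - c_k)
        resid = f.unsqueeze(0) - self.centers.unsqueeze(1)                 # (K, T, C)
        V = torch.einsum("dnkt,ktc->dnkc", A, resid)                       # (D, T, K, C)
        V = F.normalize(V.flatten(2), p=2, dim=-1)                         # L2-normalized features
        return V                                # (D, T, K*C) fixed feature representation
```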
3. Generation process of time sequence behavior segment
The process is based on multi-scale information characteristics and fixed characteristic representation, and time sequence behavior segments are generated, and the process specifically comprises the following steps:
1) The multi-scale information features are input into one time sequence convolution layer (convolution kernel size 1) followed by a Sigmoid function to obtain the probability sequences of each video unit being a behavior start and a behavior end, denoted here as P^s = {p_t^s}_{t=1}^{T} and P^e = {p_t^e}_{t=1}^{T}, respectively. This prediction layer is optimized with a cross entropy loss function.
A video unit in the behavior start probability sequence or the behavior end probability sequence is selected as a candidate boundary point if it meets either of the following two conditions: condition 1, the probability of the video unit is higher than 0.5 times the maximum value in that probability sequence; condition 2, the probability of the video unit is higher than both the probability of the previous video unit and that of the following video unit, i.e. its probability value is a local peak. The invention then combines the candidate behavior start boundary points and the candidate behavior end boundary points pairwise to generate the initial time sequence behavior segments.
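For illustration, the two boundary point conditions and the pairwise combination could be implemented as in the following sketch; the optional max_duration bound is an assumption carried over from the longest segment length D and is not stated in this paragraph.

```python
import torch

def candidate_boundaries(prob):
    """Select video units whose probability exceeds 0.5 * max of the sequence (condition 1)
    or is a local peak, higher than both neighbours (condition 2). prob: 1-D tensor."""
    keep = prob > 0.5 * prob.max()                                    # condition 1
    peak = torch.zeros_like(keep)
    peak[1:-1] = (prob[1:-1] > prob[:-2]) & (prob[1:-1] > prob[2:])   # condition 2
    return torch.nonzero(keep | peak).flatten().tolist()

def initial_segments(p_start, p_end, max_duration=None):
    """Combine candidate start and end boundary points pairwise into initial
    time sequence behavior segments (t_s, t_e); max_duration is an assumed optional bound."""
    segments = []
    for t_s in candidate_boundaries(p_start):
        for t_e in candidate_boundaries(p_end):
            if t_s < t_e and (max_duration is None or t_e - t_s <= max_duration):
                segments.append((t_s, t_e))
    return segments
```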
2) The fixed feature representation of the time sequence behavior segments is input into one time sequence convolution layer (convolution kernel size 1) followed by a Sigmoid function to obtain two time sequence behavior segment confidence score maps, denoted here as M^cls and M^reg, produced in a classification manner and a regression manner, respectively. This prediction layer is optimized with mean square error and cross entropy loss functions.
According to the boundary point sequence numbers of each generated initial time sequence behavior segment, its confidence scores are retrieved from the two time sequence behavior segment confidence score maps M^cls and M^reg; the initial time sequence behavior segments with low confidence scores are removed, and those with high confidence scores are kept as the finally generated time sequence behavior segments.
For example, the above processing yields time sequence behavior segments of the form (t_s, t_e, p^s_{t_s}, p^e_{t_e}, p_cls, p_reg), where t_s and t_e are the video unit sequence numbers of the start and end of the time sequence behavior segment, p^s_{t_s} and p^e_{t_e} are the probability values of these video units being the start and end of a time sequence behavior segment, and p_cls and p_reg are the values at coordinate [t_e - t_s, t_s] in the confidence score maps M^cls and M^reg, respectively.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for generating time-series behavior segments is characterized by comprising the following steps:
s1, coding the input video by using a double-current convolutional neural network to extract a video unit feature sequence;
s2, coding the video unit feature sequence by adopting a pyramid context sensing mechanism to obtain multi-scale information features;
s3, extracting fixed feature representation of the time sequence behavior segment from the video unit feature sequence by adopting a learnable boundary matching network;
s4, generating time sequence behavior segments based on the multi-scale information features and the fixed feature representation;
the step S3 specifically includes:
s31, obtaining a clustering center vector through normal distribution initialization;
s32, calculating a video unit distribution graph of which the video unit features belong to the clustering center vector through the full connection layer and softmax;
s33, performing element product operation by using the time sequence behavior mask matrix and the video unit distribution map to obtain a time sequence behavior fragment distribution map;
s34, performing residual error operation on the time sequence behavior fragment distribution graph and the clustering center vector to obtain the characteristics of all possible time sequence behavior fragments;
s35, carrying out normalization processing of L2 on the characteristics of all possible time sequence behavior segments to obtain fixed characteristic representation of the time sequence behavior segments.
2. The method for generating time-series behavior segments according to claim 1, wherein the step S1 specifically comprises:
dividing a video picture sequence into a video unit sequence;
and for each video unit in the video unit sequence, coding each video unit by using a double-current convolutional neural network to extract initial video unit characteristics to obtain an initial video unit characteristic sequence, and reducing the dimension of the initial video unit characteristic sequence to obtain a final video unit characteristic sequence.
3. The method of claim 2, wherein the initial video unit feature sequence is dimension-reduced using a layer of time-sequential convolutional layer.
4. The method according to claim 1, wherein the pyramid context awareness mechanism includes a plurality of layers of consecutive time-series hole convolution modules, each layer of time-series hole convolution module is configured to obtain an information feature of one scale in the video unit feature sequence, and the information features obtained by each layer of time-series hole convolution modules are subjected to a splicing operation to obtain the multi-scale information feature.
5. The method of claim 4, wherein each time sequence hole convolution module comprises a time sequence hole convolution layer, a linear rectification function, a time sequence hole convolution layer, and a random deactivation layer connected in sequence.
6. The method for generating time-series behavior segments according to claim 1, wherein the step S4 specifically includes:
s41, inputting the multi-scale information characteristics into a prediction layer, and optimizing the output result of the prediction layer to obtain a probability sequence of a video unit as a behavior start and a behavior end;
s42, inputting the fixed characteristic representation of the time sequence behavior fragment into a prediction layer, and optimizing the output result of the prediction layer to obtain two time sequence behavior fragment confidence score maps;
s43, setting boundary point judgment conditions, selecting a plurality of candidate behavior starting boundary points and a plurality of candidate behavior ending boundary points from the probability sequence according to the boundary point judgment conditions, and combining the candidate behavior starting boundary points and the candidate behavior ending boundary points to generate a plurality of initial time sequence behavior segments;
and S44, retrieving the confidence score of the initial time sequence behavior segment from the confidence score maps of the two time sequence behavior segments according to the sequence numbers of the boundary points, removing the initial time sequence behavior segment with low confidence score, and keeping the initial time sequence behavior segment with high confidence score as the finally generated time sequence behavior segment.
7. The method according to claim 6, wherein there are two boundary point determination conditions, namely condition 1 and condition 2, and if a certain video unit satisfies any one of the two conditions, the certain video unit is taken as a candidate boundary point;
the condition 1 is that the probability of the video unit is higher than 0.5 times of the maximum value in the probability sequence;
the condition 2 is that the probability of the video unit is higher than the probabilities of the previous video unit and the next video unit.
8. The method as claimed in claim 6, wherein the output of the prediction layer in step S41 is optimized by a cross entropy loss function, and the output of the prediction layer in step S42 is optimized by a mean square error and a cross entropy loss function.
9. The method of claim 6, wherein the prediction layers used in steps S41 and S42 each include a time series convolutional layer having a convolutional kernel size of 1 and a Sigmoid function.
CN202010651476.7A 2020-07-08 2020-07-08 Time sequence behavior segment generation method Active CN111898461B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010651476.7A CN111898461B (en) 2020-07-08 2020-07-08 Time sequence behavior segment generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010651476.7A CN111898461B (en) 2020-07-08 2020-07-08 Time sequence behavior segment generation method

Publications (2)

Publication Number Publication Date
CN111898461A CN111898461A (en) 2020-11-06
CN111898461B true CN111898461B (en) 2022-08-30

Family

ID=73191933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010651476.7A Active CN111898461B (en) 2020-07-08 2020-07-08 Time sequence behavior segment generation method

Country Status (1)

Country Link
CN (1) CN111898461B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364852B (en) * 2021-01-13 2021-04-20 成都考拉悠然科技有限公司 Action video segment extraction method fusing global information
CN112990013B (en) * 2021-03-15 2024-01-12 西安邮电大学 Time sequence behavior detection method based on dense boundary space-time network
CN113486754B (en) * 2021-06-29 2024-01-09 中国科学院自动化研究所 Event evolution prediction method and system based on video
CN117152692B (en) * 2023-10-30 2024-02-23 中国市政工程西南设计研究总院有限公司 Traffic target detection method and system based on video monitoring

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805083B (en) * 2018-06-13 2022-03-01 中国科学技术大学 Single-stage video behavior detection method
CN109671125B (en) * 2018-12-17 2023-04-07 电子科技大学 Highly-integrated GAN network device and method for realizing text image generation
CN110222592B (en) * 2019-05-16 2023-01-17 西安特种设备检验检测院 Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2608107A2 (en) * 2011-12-22 2013-06-26 Broadcom Corporation System and method for fingerprinting video
WO2013104205A1 (en) * 2012-01-09 2013-07-18 中兴通讯股份有限公司 Method for encoding and decoding image layer and slice layer, codec and electronic device
CN107430687A (en) * 2015-05-14 2017-12-01 谷歌公司 The segmentation of the time based on entity of video flowing
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN108573246A (en) * 2018-05-08 2018-09-25 北京工业大学 A kind of sequential action identification method based on deep learning
CN109711380A (en) * 2019-01-03 2019-05-03 电子科技大学 A kind of timing behavior segment generation system and method based on global context information
CN109726696A (en) * 2019-01-03 2019-05-07 电子科技大学 System and method is generated based on the iamge description for weighing attention mechanism
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network
CN110602526A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN111368786A (en) * 2020-03-16 2020-07-03 平安科技(深圳)有限公司 Action region extraction method, device, equipment and computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BSN: Boundary Sensitive Network for Temporal Action Proposal Generation;Tianwei Lin 等;《Computer Vision and Pattern Recognition》;20180926;1-17 *
Play and rewind: Context-aware video temporal action proposals;Lianli Gao 等;《Pattern Recognition》;20200606;1-9 *
Research on Temporal Action Detection and Video Description Algorithms Based on Deep Learning; Liu Xiaoning; China Master's Theses Full-text Database, Information Science and Technology; 20190915 (No. 09, 2019); I138-1104 *

Also Published As

Publication number Publication date
CN111898461A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111898461B (en) Time sequence behavior segment generation method
CN110059772B (en) Remote sensing image semantic segmentation method based on multi-scale decoding network
KR101880901B1 (en) Method and apparatus for machine learning
CN109840556B (en) Image classification and identification method based on twin network
US9400918B2 (en) Compact face representation
CN111723645B (en) Multi-camera high-precision pedestrian re-identification method for in-phase built-in supervised scene
CN110929029A (en) Text classification method and system based on graph convolution neural network
CN109063666A (en) The lightweight face identification method and system of convolution are separated based on depth
CN109543727B (en) Semi-supervised anomaly detection method based on competitive reconstruction learning
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN112580523A (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN111310852B (en) Image classification method and system
CN111625675A (en) Depth hash image retrieval method based on feature pyramid under attention mechanism
CN110188827B (en) Scene recognition method based on convolutional neural network and recursive automatic encoder model
CN113673346A (en) Motor vibration data processing and state recognition method based on multi-scale SE-Resnet
CN113269054A (en) Aerial video analysis method based on space-time 2D convolutional neural network
CN108875532A (en) A kind of video actions detection method based on sparse coding and length posterior probability
CN110942057A (en) Container number identification method and device and computer equipment
CN111371611B (en) Weighted network community discovery method and device based on deep learning
CN108805280B (en) Image retrieval method and device
CN115393289A (en) Tumor image semi-supervised segmentation method based on integrated cross pseudo label
CN114565789B (en) Text detection method, system, device and medium based on set prediction
CN116844041A (en) Cultivated land extraction method based on bidirectional convolution time self-attention mechanism
CN112016590A (en) Prediction method combining sequence local feature extraction and depth convolution prediction model
CN114373092A (en) Progressive training fine-grained vision classification method based on jigsaw arrangement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant