CN109409307B - Online video behavior detection method based on space-time context analysis - Google Patents

Online video behavior detection method based on space-time context analysis

Info

Publication number
CN109409307B
CN109409307B (application CN201811298487.0A)
Authority
CN
China
Prior art keywords
behavior
video
chain
frame
online
Prior art date
Legal status
Active
Application number
CN201811298487.0A
Other languages
Chinese (zh)
Other versions
CN109409307A (en)
Inventor
李楠楠
张世雄
张子尧
李革
安欣赏
张伟民
Current Assignee
Institute Of Intelligent Video Audio Technology Longgang Shenzhen
Original Assignee
Institute Of Intelligent Video Audio Technology Longgang Shenzhen
Priority date
Filing date
Publication date
Application filed by Institute Of Intelligent Video Audio Technology Longgang Shenzhen
Priority to CN201811298487.0A priority Critical patent/CN109409307B/en
Publication of CN109409307A publication Critical patent/CN109409307A/en
Application granted granted Critical
Publication of CN109409307B publication Critical patent/CN109409307B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An online video behavior detection method based on spatio-temporal context analysis adopts a deep learning framework combined with spatio-temporal context analysis to detect, online and jointly in the temporal and spatial domains, the behaviors occurring in an input video. The invention comprises two parts: behavior detection within a video segment and linking between video segments. Within a video segment, the algorithm uses an encoding-decoding model to combine the current frame with spatio-temporal dynamic information and generate candidate action regions; linking between video segments connects the candidate action regions into action chains that continuously track a specified action object from its appearance until its end, while predicting the action category in an online manner.

Description

Online video behavior detection method based on space-time context analysis
Technical Field
The invention relates to the technical field of video behavior analysis, in particular to an online video behavior detection system based on space-time context analysis and a method thereof.
Background
Video behavior detection must not only correctly classify the behaviors occurring in a given video but also locate them in both the temporal and spatial domains; it is a key step in research on understanding human behavior in video. In general, existing methods solve this problem with a two-step process: first, single-frame action detection results, including regressed object boxes and the corresponding action classification scores, are generated with a trained action detector; then the final spatio-temporal action chain is formed by concatenating or tracking the single-frame detection results over the entire video duration, usually under certain constraints, for example requiring the overlap between action detection boxes of adjacent frames to be as large as possible. The limitations of this process are mainly reflected in two aspects: 1) single-frame behavior detection uses only the current image or motion information and ignores the temporal continuity of action behaviors; 2) the linking algorithm is usually performed in an offline, batch manner, i.e., the action chain is extended from the beginning of the video to its end, and a separate time-domain pruning algorithm is then used to eliminate false detections. The present invention solves these two problems by the following means: 1) action detection is performed by combining the current frame with spatio-temporal context information; 2) an online detection mode is adopted, so that behavior-chain generation and behavior classification and prediction are completed in a single pass.
In 2017, Zhu et al. (Zhu H., Vial R., and Lu S. 2017. "A Spatio-Temporal Convolutional Regression Network for Video Action Proposal", IEEE International Conference on Computer Vision, pp. 5814-5822) proposed a regression network model for generating action proposals; it is built on ConvLSTM (Convolutional Long Short-Term Memory) and fuses spatio-temporal information with current-frame information for action detection. The drawback of this method is that, within a short video segment, usually only the later video frames can use spatio-temporal dynamic information to assist the current detection.
Disclosure of Invention
The invention aims to provide an online video behavior detection system based on space-time context analysis, which can utilize video sequence context information when detecting the action behavior of a current frame, can incrementally generate a behavior chain along with the continuous input of video frames, and dynamically classify the video behavior.
The invention also aims to provide an online video behavior detection method based on the spatio-temporal context analysis.
Compared with existing methods, the method provided by the invention has two main improvements: 1) the method is based on ConvGRU (Convolutional Gated Recurrent Unit), which, compared with ConvLSTM, is a lightweight recurrent memory model with far fewer parameters, reducing the risk of overfitting on small datasets; 2) the model of Zhu et al. is a single forward model, so only video frames near the end of the input sequence can use the fused spatio-temporal dynamic information for behavior detection, whereas the method proposed by the invention is an encoding-decoding model in which the spatio-temporal context information of the video sequence is available to every frame during decoding.
The principle of the invention is as follows: 1) a deep convolutional neural network extracts single-frame video features; the features of several consecutive frames are input into a ConvGRU to build an encoding-decoding video sequence description model, which encodes the spatio-temporal context information during forward propagation, decodes the encoded spatio-temporal dynamic information back into each frame during backward propagation, and completes action detection by combining it with the current-frame information; 2) a dynamic behavior category candidate pool is maintained, the range of possible behavior categories is gradually narrowed as the input video sequence grows, and the currently generated behavior chains are dynamically maintained through growth, termination and time-domain pruning.
The technical scheme provided by the invention is as follows:
The spatio-temporal behavior detection method provided by the invention comprises two parts: behavior detection within a video segment and linking between video segments. Within a video segment, the algorithm uses an encoding-decoding model to combine the current frame with spatio-temporal dynamic information and generate candidate action regions; linking between video segments connects the candidate action regions into action chains that continuously track a specified action object from its appearance until its end, while predicting the action category in an online manner.
An online video behavior detection system based on spatiotemporal context analysis comprises a video behavior spatiotemporal context information fusion network and a motion frame online linking and classifying algorithm; wherein: the video behavior spatiotemporal context information fusion network is used for fusing current frame information and behavior spatiotemporal context information in a video segment; the motion frame online linking and classifying algorithm is used for linking the motion frames corresponding to the same motion target in an online mode to form a complete behavior chain and classifying the behavior classes of the motion frames.
The video behavior spatio-temporal context information fusion network specifically comprises: a single-frame feature extraction network, used to extract depth expression features of the current-frame RGB image and optical flow image in a video segment; a video-segment spatio-temporal context information fusion network, namely an encoding-decoding module based on the ConvGRU model, used to extract the spatio-temporal context expression features of the video segment and fuse them with the current-frame features to obtain fusion features; and a behavior detection network, used to perform single-frame behavior detection on the fusion features, obtaining behavior classification scores and locating where the behaviors occur to generate motion frames.
The online linking and classifying algorithm for the motion frame specifically comprises the following steps: constructing a behavior category candidate pool for maintaining a specified number of behavior categories that are currently most likely to occur for a given video; the behavior category candidate pool updating algorithm is used for scoring behavior categories, gradually reducing the range of the behavior categories to which the current video possibly belongs, and realizing online rapid classification of a behavior chain; the behavior chain online growth algorithm is used for linking the behavior candidate region corresponding to the video clip to the existing behavior chain to realize online growth of the behavior chain; or determining the behavior candidate region as a new behavior chain.
An online video behavior detection method based on spatiotemporal context analysis comprises the following steps:
step 1: calculating an optical flow image for the current frame, and extracting depth expression characteristics of the RGB image and the optical flow image;
step 2: constructing a coding-decoding network to extract space-time context information of video behaviors, and fusing the space-time context information with current frame information to obtain fusion characteristics;
step 3: classifying and position-regressing the fusion features to generate motion frames, and linking the motion frames by using the Viterbi algorithm to obtain behavior candidate regions;
step 4: constructing a behavior category candidate pool and updating the behavior categories that may appear;
step 5: linking the behavior candidate region to an existing behavior chain in an online mode, or generating a new behavior chain;
step 6: fusing the detection results of the RGB image branch and the optical flow image branch to obtain a final detection result.
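For orientation, the sketch below shows how steps 1-6 could be orchestrated in code. It is a minimal illustration only: the helper callables (extract, encode_decode, detect_and_link, update_pool, grow_chains, fuse_branches) are hypothetical names standing in for the components described in this disclosure, and the per-branch bookkeeping is simplified.

```python
from typing import Callable, Sequence

def detect_online(segments: Sequence,
                  extract: Callable,         # step 1: per-frame RGB / optical-flow features
                  encode_decode: Callable,   # step 2: ConvGRU encoding-decoding fusion
                  detect_and_link: Callable, # step 3: motion frames + intra-segment Viterbi linking
                  update_pool: Callable,     # step 4: behavior category candidate pool
                  grow_chains: Callable,     # step 5: online behavior-chain growth
                  fuse_branches: Callable):  # step 6: RGB / optical-flow fusion
    """Hypothetical orchestration of steps 1-6 over one input video."""
    chains = {"rgb": [], "flow": []}
    pool = {"rgb": None, "flow": None}
    for idx, segment in enumerate(segments):
        for branch in ("rgb", "flow"):
            feats = extract(segment, branch)                                    # step 1
            fused = encode_decode(feats)                                        # step 2
            candidates = detect_and_link(fused)                                 # step 3
            pool[branch] = update_pool(pool[branch], chains[branch], idx)       # step 4
            chains[branch] = grow_chains(chains[branch], candidates, pool[branch])  # step 5
    return fuse_branches(chains["rgb"], chains["flow"])                         # step 6
```

Passing the components in as callables keeps the skeleton independent of any particular network implementation.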
Compared with the prior art, the invention has the beneficial effects that:
By using the technical scheme provided by the invention, behavior detection on a single video frame makes use of the behavior spatio-temporal context information within the video segment, which improves the accuracy of behavior detection. Meanwhile, the video behavior can be detected online; compared with traditional offline batch-processing methods this improves the timeliness of video behavior detection, so the method can be applied in scenarios with high real-time requirements, such as intelligent robots and human-computer interaction systems. Compared with existing video behavior detection techniques, on currently popular public test sets the proposed technique achieves better detection results while using fewer candidate proposals.
The invention will be further illustrated by way of example with reference to the accompanying drawings in which:
drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a framework diagram of the video unit motion information encoder-decoder (En-Decoder) model.
FIG. 3 is a framework diagram of a single-frame detection model of video behavior based on spatio-temporal context analysis.
FIG. 4 is a flow chart of the online dynamic updating of the video behavior chain set T_d.
In the drawings:
1 - single-frame image representation feature p'_i; 2 - ConvGRU unit; 3 - fusion expression feature p_d; 4 - image sequence contained in a video unit; 5 - feature extraction network; 6 - dimension reduction network; 7 - RPN network; 8 - Detection Network; 9 - behavior classification result; 10 - position adjustment; 11 - motion proposal score; 12 - motion proposal; 13 - time-domain pruning; 14 - computing the behavior score; 15 - constructing the behavior candidate pool; 16 - constructing the candidate set P_t; 17 - updating the behavior chain; 18 - adding a new behavior chain.
Detailed Description
The invention discloses an online video behavior detection method based on spatio-temporal context analysis, which comprises the following steps:
1) a video sequence to be detected is uniformly divided into several video segments (each segment contains 8 frames, and adjacent segments overlap by one frame);
2) optical flow is extracted for each video segment, and the original RGB images and the optical-flow images are input into the model as two independent computation branches; the RGB branch is taken as an example below, and the optical-flow branch is handled in the same way;
3) each frame image in a video segment is input into a deep convolutional network trained in advance (for behavior classification) to extract motion features;
4) the extracted motion features are input into an encoding-decoding network built from ConvGRU units to extract the spatio-temporal context information of the video behavior, which is fused with the motion information of the current frame; each frame outputs a fusion feature;
5) the fusion features are fed into a behavior classification network and a position regression network; the behaviors appearing in each frame are classified and, at the same time, the positions where they occur are localized to generate behavior boxes;
6) the Viterbi algorithm is used to link the behavior boxes detected in each frame of the video segment by behavior category, forming several behavior candidate regions; steps 7)-9) below are then executed in a loop until the input video sequence ends;
7) if the ordinal number of the current video segment is a multiple of 10, the behavior-chain time-domain pruning algorithm is executed, the score of each behavior chain with respect to each behavior category is calculated, and the behavior category candidate pool is updated so that it contains only the several behavior categories with the highest scores;
8) for a current behavior chain, if there exist behavior candidate regions whose overlapping area with it (the overlapping area between the last behavior box of the chain and the first behavior box of the candidate region) is larger than a specified threshold, the candidate region with the largest score is linked to the chain; if no behavior candidate region has an overlapping area larger than the specified threshold, the behavior chain is terminated; this operation is performed separately for the different categories in the behavior category candidate pool;
9) if a behavior candidate region is not linked to any behavior chain, it is taken as a new behavior chain;
10) the behavior-chain detection results of the RGB branch and the Optical-flow branch are fused to obtain the final detection result.
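Step 1) of the above flow can be made concrete with a small utility. The following is a minimal sketch, assuming the frames are held in an indexable sequence; 8-frame segments overlapping by one frame imply a stride of 7.

```python
def split_into_segments(frames, seg_len=8, overlap=1):
    """Uniformly split a frame sequence into seg_len-frame segments with
    `overlap` frames shared between adjacent segments (step 1 above)."""
    stride = seg_len - overlap
    segments, start = [], 0
    while start + seg_len <= len(frames):
        segments.append(frames[start:start + seg_len])
        start += stride
    return segments

# Example: 22 frames -> segments covering frames 0-7, 7-14 and 14-21.
print([(s[0], s[-1]) for s in split_into_segments(list(range(22)))])
```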
FIG. 1 is the flow chart of the present invention, in which S1-S8 correspond in sequence to steps 1)-8) of the specific operation flow below. The specific operation flow of the online video behavior detection method based on spatio-temporal context analysis is as follows:
1) Dividing the video evenly into segments (S1): given a segment of input video, it is divided evenly into several video segments, each containing 8 frames. Each segment is taken as an independent video unit from which a behavior candidate region is extracted;
2) Extracting RGB images or optical flow (S2): for each video unit, the optical-flow image of every frame is extracted as a description of the motion information. The original RGB images and the optical-flow images are input into the model as two independent branches for computation. The following mainly describes the RGB image branch; the optical-flow branch is handled in the same way;
3) Extracting the single-frame expression feature p'_i (S3): the behavior detection network framework for a single-frame image is shown in FIG. 3, where 4 denotes the image sequence contained in a video unit. The expression feature of each frame image, denoted p_i, is extracted by the feature extraction network 5. The feature extraction network is obtained by fine-tuning a VGG-16 model (Simonyan K. and Zisserman A. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556), and the conv5-layer features are taken. A dimensionality reduction network 6 is constructed to reduce p_i from 512 to 128 channels, denoted p'_i, which prevents overfitting of the whole network model. The dimensionality reduction network is a convolutional module composed of convolution layers with 128 channels;
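A minimal PyTorch sketch of this single-frame feature path (feature extraction network 5 plus dimension reduction network 6). It assumes torchvision's VGG-16 as the backbone, takes the output just before the last max-pool as the "conv5" feature, and uses a single 1x1 convolution for the 512-to-128 reduction; the exact layer cut and the form of the reduction module are assumptions, and fine-tuned weights would be loaded in practice.

```python
import torch
import torch.nn as nn
import torchvision

class SingleFrameFeature(nn.Module):
    """VGG-16 conv5 features (512 channels) followed by a 1x1 reduction to the
    128-channel per-frame expression feature p'_i."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16()        # fine-tuned weights would be loaded here
        self.backbone = vgg.features[:-1]       # up to relu5_3, i.e. before the last max-pool
        self.reduce = nn.Sequential(nn.Conv2d(512, 128, kernel_size=1),
                                    nn.ReLU(inplace=True))

    def forward(self, img):                     # img: (B, 3, H, W)
        return self.reduce(self.backbone(img))  # (B, 128, H/16, W/16)

p_prime = SingleFrameFeature()(torch.randn(1, 3, 224, 224))
print(p_prime.shape)                            # torch.Size([1, 128, 14, 14])
```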
4) Extracting the fusion expression feature p_d (S4): the expression feature p'_i of each frame image is input into the ConvGRU network 2 to build the spatio-temporal fusion motion encoding of the video unit. The structure of the motion information encoder-decoder (En-Decoder) model is shown in FIG. 2, where 1 denotes the current-frame input feature p'_i and 3 denotes the fusion expression feature p_d. The En-Decoder model works over the entire video unit and includes a forward encoding process and a backward decoding process. The feature p'_i participates in both the forward encoding and the backward decoding; the specific input arrangement is shown in FIG. 2. Forward encoding accumulates the single-frame features p'_i over time to obtain a representation of the motion sequence of the video unit; backward decoding propagates this motion sequence representation back to each frame of the video unit and fuses it with the feature p'_i, yielding the feature p_d that fuses the current frame with the spatio-temporal context information;
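The following PyTorch sketch illustrates the En-Decoder of FIG. 2: a ConvGRU cell, a forward encoding pass over the video unit, and a backward decoding pass that fuses the decoded context with each frame's p'_i. The gate layout of the ConvGRU and the fusion by concatenation plus a 1x1 convolution are assumptions; the text specifies only that p'_i participates in both passes and is fused with the decoded context.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: all gates are 3x3 convolutions over [input, state]."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h=None):
        if h is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], 1))).chunk(2, 1)
        h_new = torch.tanh(self.conv_h(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new

class EnDecoder(nn.Module):
    """Forward pass encodes the video unit's spatio-temporal context; the backward
    pass decodes it into every frame and fuses it with that frame's feature p'_i."""
    def __init__(self, ch=128):
        super().__init__()
        self.encoder = ConvGRUCell(ch, ch)
        self.decoder = ConvGRUCell(ch, ch)
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)   # fusion layer (assumed form)

    def forward(self, feats):                       # feats: list of (B, C, H, W), length T_p
        h = None
        for x in feats:                             # forward encoding, accumulated over time
            h = self.encoder(x, h)
        out, d = [None] * len(feats), h
        for t in reversed(range(len(feats))):       # backward decoding into each frame
            d = self.decoder(feats[t], d)
            out[t] = self.fuse(torch.cat([feats[t], d], 1))   # fused feature p_d for frame t
        return out

p_d = EnDecoder(128)([torch.randn(1, 128, 14, 14) for _ in range(8)])
print(len(p_d), p_d[0].shape)                       # 8 torch.Size([1, 128, 14, 14])
```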
5) Calculating the outputs of the RPN network and the Detection Network (S5): the feature p_d is input into the RPN network 7, which computes the motion proposal score 11, denoted s_r, and the motion proposal 12, denoted p_r. The RPN network is a 2-layer 3x3 convolutional network that slides over p_d and calculates a motion proposal score s_r at each position; a position whose score is greater than a specified value (e.g. 0.5) is regarded as a motion proposal p_r. The Detection Network 8 accepts p_d and p_r as input and outputs the behavior classification result 9, denoted s_c, and the position adjustment 10, denoted delta_r. The Detection Network consists of 2 fully connected layers with 1024 hidden units; the behavior classification result s_c includes a classification score for each behavior class and the background class, and the position adjustment delta_r gives 3 position deviations (center position, width and height) for each behavior class. From p_r and delta_r, a refined behavior candidate box b_t can be calculated;
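A sketch of the two heads described above, under assumptions: anchor handling is omitted (one score per spatial position), and the Detection Network is shown operating on a fixed-size feature pooled from p_d inside each proposal (the pooling step itself, e.g. RoI pooling, is not detailed here).

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Two 3x3 convolution layers sliding over the fused feature p_d; outputs a
    motion-proposal score s_r at every spatial position (threshold e.g. 0.5)."""
    def __init__(self, ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, p_d):                        # p_d: (B, 128, H, W)
        return torch.sigmoid(self.net(p_d))        # (B, 1, H, W) proposal score map

class DetectionHead(nn.Module):
    """Two fully connected layers with 1024 hidden units; outputs classification
    scores s_c for C behavior classes plus background, and 3 position offsets
    (center, width, height) per behavior class, as described above."""
    def __init__(self, n_classes, ch=128, pool=7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(ch * pool * pool, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True))
        self.cls = nn.Linear(1024, n_classes + 1)      # behavior classes + background
        self.reg = nn.Linear(1024, 3 * n_classes)      # per-class (center, width, height)

    def forward(self, roi_feat):                   # roi_feat: (N, 128, 7, 7), pooled from p_d
        h = self.fc(roi_feat)
        return torch.softmax(self.cls(h), dim=1), self.reg(h)

scores, offsets = DetectionHead(n_classes=21)(torch.randn(4, 128, 7, 7))
print(scores.shape, offsets.shape)                 # torch.Size([4, 22]) torch.Size([4, 63])
```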
6) Computing the video unit behavior candidate region p (S6): denote the RPN motion proposal score of b_t as s_r(b_t). The Viterbi algorithm is used to link the boxes b_t of the different image frames within the same video unit into a behavior candidate region p, as shown in formula (1):

p = \arg\max_{\{b_t\}} \left[ \sum_{t=1}^{T_p} s_r(b_t) + \lambda \sum_{t=2}^{T_p} \mathrm{IoU}(b_t, b_{t-1}) \right]    (1)

where T_p is the video unit duration, taken here as 8; IoU(b_t, b_{t-1}) is the Intersection over Union of b_t and b_{t-1}; and the harmonic coefficient λ is taken as 0.5;
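Formula (1) can be implemented as a short dynamic-programming (Viterbi) pass over the per-frame candidate boxes. The sketch below is illustrative: boxes are assumed to be [x1, y1, x2, y2] arrays, and only the single best path per video unit is returned.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def link_unit_boxes(boxes, scores, lam=0.5):
    """Viterbi linking of formula (1): pick one box b_t per frame maximising
    sum_t s_r(b_t) + lam * sum_t IoU(b_t, b_{t-1}).
    boxes:  list over frames, each an array of shape (N_t, 4)
    scores: list over frames, each an array (N_t,) of RPN proposal scores s_r."""
    T = len(boxes)
    dp = np.asarray(scores[0], dtype=float)          # best score ending at each box of frame 0
    back = []
    for t in range(1, T):
        pair = np.array([[dp[i] + lam * iou(boxes[t - 1][i], boxes[t][j])
                          for i in range(len(boxes[t - 1]))]
                         for j in range(len(boxes[t]))])   # shape (N_t, N_{t-1})
        back.append(pair.argmax(axis=1))
        dp = pair.max(axis=1) + np.asarray(scores[t], dtype=float)
    idx = [int(dp.argmax())]                          # backtrack the best-scoring path
    for t in range(T - 2, -1, -1):
        idx.append(int(back[t][idx[-1]]))
    idx.reverse()
    return [boxes[t][idx[t]] for t in range(T)], float(dp.max())
```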
7) Computing the video behavior chain set T_d (S7): as video is continuously input, the behavior candidate region p corresponding to each video unit is obtained, and a dynamically growing video behavior chain set T_d is obtained through the following rules (a)-(f). FIG. 4 is the flow chart of the online dynamic updating of the video behavior chain set T_d; rules (a)-(f) are applied to each behavior chain T in T_d. The main idea is to maintain a dynamically updated behavior category candidate pool, gradually reducing the number of candidate behavior categories according to judgments made on the continuously input video, and to decide, according to the set linking method, whether a newly generated candidate region p is linked to an existing video behavior chain or used as a new behavior chain. As shown in FIG. 4, the specific steps are:
(a) Time-domain pruning 13: if the number of elements in the current set T_d is greater than the upper bound N_d, the update ends. Otherwise, time-domain pruning is performed on T with the Viterbi algorithm, as shown in formula (2):

\{l_t\} = \arg\max_{l_1, \ldots, l_{T_l}} \left[ \sum_{t=1}^{T_l} s_{l_t}(b_t) - \lambda_\omega \sum_{t=2}^{T_l} \omega(l_t, l_{t-1}) \right]    (2)

where T_l is the number of boxes b_t contained in the behavior chain T; l_t ∈ {0, c} is the category label of b_t, with 0 the background category and c the behavior category; if l_t = c, s_{l_t}(b_t) is the Detection Network class-c classification score s_c(b_t), and if l_t = 0, s_{l_t}(b_t) is defined as 1 - s_c(b_t); ω(l_t, l_{t-1}) = 0 if l_t = l_{t-1} and 0.7 otherwise; λ_ω = 0.4. Through time-domain pruning, the background boxes contained in T are removed;
(b) Calculating the behavior score 14: for T, its score s(T) with respect to each behavior category is calculated; s(T) is defined as the average of the scores s(p) of all p belonging to T; similarly, the score s(p) of p is defined as the average of the scores s_c(b_t) of all b_t belonging to p;
(c) Constructing the behavior candidate pool 15: the behavior category candidate pool is constructed in descending order of s(T), specifically: i) at the beginning, all categories are retained; ii) when processing the 10th video unit, the top 5 behavior categories are retained; iii) when processing the 20th video unit, the top 3 behavior categories are retained; iv) from the 30th video unit onward, only the top-ranked behavior category is retained. Let the upper limit of behavior categories in the current candidate pool be N_p; for each behavior category j ≤ N_p in the candidate pool, rules (d)-(e) are executed:
(d) Constructing the candidate set P_t 16: the newly generated behavior candidate region p is added to the set P_t (P_t is initially empty) if the IoU between T and p is greater than a specified threshold (e.g. 0.5). The IoU between T and p is defined as the IoU between the last behavior candidate box belonging to T and the first behavior candidate box belonging to p;
(e) Updating the behavior chain 17: if P_t is not empty, the candidate p' (p' ∈ P_t) with the largest score s(p') is linked to T, i.e. p' is appended to the end of T to form a new behavior chain T', and T is updated to T';
(f) Adding a new behavior chain 18: if a candidate p_new (p_new ∈ P_t) is not linked to any T, p_new is added to the set T_d as a new behavior chain.
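The sketch below illustrates rules (a)-(f) in simplified form: temporal_trim implements the two-label Viterbi of formula (2), pool_limit implements the schedule of rule (c), and grow_or_terminate implements rules (d)-(f). The dictionary structure used for chains and candidate regions, the handling of terminated chains as a separate "finished" list, and the restriction to a single behavior category (in the method, rules (d)-(e) run once per category in the candidate pool) are all simplifying assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2] (same helper as in the sketch above)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def temporal_trim(sc, lam_w=0.4, change=0.7):
    """Formula (2): label each box of a chain as background (0) or class c (1)
    with a two-state Viterbi pass; indices of the class-c boxes are returned.
    sc: per-box class-c Detection Network scores s_c(b_t) of the chain."""
    sc = np.asarray(sc, dtype=float)
    unary = np.stack([1.0 - sc, sc], axis=1)            # columns: label 0, label c
    T = len(sc)
    dp, back = unary[0].copy(), np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new = np.empty(2)
        for l in (0, 1):
            trans = dp - lam_w * change * (np.arange(2) != l)   # label-change penalty
            back[t, l] = int(trans.argmax())
            new[l] = trans.max() + unary[t, l]
        dp = new
    labels = np.empty(T, dtype=int)
    labels[-1] = int(dp.argmax())
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return np.flatnonzero(labels == 1)

def pool_limit(unit_idx, n_classes):
    """Rule (c): all classes at the start, top 5 from the 10th video unit,
    top 3 from the 20th, only the top class from the 30th onward."""
    if unit_idx >= 30:
        return 1
    if unit_idx >= 20:
        return 3
    if unit_idx >= 10:
        return 5
    return n_classes

def grow_or_terminate(chains, candidates, iou_thresh=0.5):
    """Rules (d)-(f) for one behavior category.  A chain / candidate region is a
    dict with 'boxes' (list of [x1, y1, x2, y2]) and 'score' (illustrative)."""
    linked, active, finished = set(), [], []
    for T_chain in chains:
        P_t = [(i, p) for i, p in enumerate(candidates)
               if iou(T_chain["boxes"][-1], p["boxes"][0]) > iou_thresh]
        if P_t:                                          # rule (e): grow with the best candidate
            i, best = max(P_t, key=lambda ip: ip[1]["score"])
            T_chain["boxes"] = T_chain["boxes"] + list(best["boxes"])
            linked.add(i)
            active.append(T_chain)
        else:                                            # no overlap: the chain is terminated
            finished.append(T_chain)
    for i, p in enumerate(candidates):                   # rule (f): unlinked regions start new chains
        if i not in linked:
            active.append({"boxes": list(p["boxes"]), "score": p["score"]})
    return active, finished
```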
8) Fusing the RGB and Optical-flow detection results (S8): the behavior chain detection results of the RGB branch and the Optical-flow branch are fused to obtain the final detection result. The fusion method is as follows: let T_rgb be a behavior chain of the RGB branch and T_opt a behavior chain of the Optical-flow branch; if the IoU between T_rgb and T_opt is greater than a specified threshold (e.g. 0.7), the behavior chain corresponding to max(s(T_rgb), s(T_opt)) is retained and the other is deleted; otherwise, both behavior chains are retained.
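A sketch of this branch fusion, reusing the iou() helper and the chain dictionaries from the previous sketch. Because the text does not define the IoU between two chains, the sketch assumes it is the mean per-frame box IoU over their common length.

```python
def chain_iou(chain_a, chain_b):
    """Overlap of two behavior chains, taken here as the mean per-frame box IoU
    over their common length (assumed definition)."""
    n = min(len(chain_a["boxes"]), len(chain_b["boxes"]))
    if n == 0:
        return 0.0
    return sum(iou(chain_a["boxes"][t], chain_b["boxes"][t]) for t in range(n)) / n

def fuse_branches(chains_rgb, chains_flow, iou_thresh=0.7):
    """S8: when an RGB chain and an Optical-flow chain overlap strongly, keep only
    the higher-scoring one; otherwise keep both."""
    kept, used_flow = [], set()
    for T_rgb in chains_rgb:
        winner = T_rgb
        for j, T_opt in enumerate(chains_flow):
            if j not in used_flow and chain_iou(T_rgb, T_opt) > iou_thresh:
                winner = T_rgb if T_rgb["score"] >= T_opt["score"] else T_opt
                used_flow.add(j)                 # the matched flow chain is consumed
                break
        kept.append(winner)
    kept += [T for j, T in enumerate(chains_flow) if j not in used_flow]
    return kept
```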
Taking mAP (mean Average Precision) as the evaluation criterion, the method provided by the invention achieves the best action detection results on the J-HMDB-21 dataset; a comparison with other methods is shown in Table 1:
Method                  mAP@0.5    mAP@0.5:0.95
Gurkirt et al. [1]      72.0       41.6
ACT [2]                 72.2       43.2
Peng and Schmid [3]     73.1       -
Harkirat et al. [4]     67.3       36.1
The invention           75.9       44.8

TABLE 1. Comparison with other methods ('-' indicates not reported; higher numbers are better)
The methods compared in table 1 are listed below:
[1] G. Singh, S. Saha, and F. Cuzzolin, "Online real time multiple spatiotemporal action localisation and prediction on a single platform," arXiv, 2016.
[2] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, "Action tubelet detector for spatio-temporal action localization," in IEEE International Conference on Computer Vision, 2017, pp. 4415-4423.
[3] X. Peng and C. Schmid, "Multi-region two-stream R-CNN for action detection," European Conference on Computer Vision, pp. 744-759, 2016.
[4] H. Behl, M. Sapienza, G. Singh, S. Saha, F. Cuzzolin, and P. H. Torr, "Incremental tube construction for human action detection," arXiv, 2017.
it is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (5)

1. An online video behavior detection method based on spatio-temporal context analysis is used for detecting online video behaviors and is characterized by comprising the following steps:
step 1: calculating an optical flow image for the current frame, extracting depth expression features of the RGB image and the optical flow image, specifically, constructing an additional convolution network on a conv5 layer of a VGG16 network structure to realize extraction of depth features of specified dimensions;
step 2: constructing a coding-decoding network to extract space-time context information of video behaviors, and fusing the space-time context information with current frame information to obtain fusion characteristics, namely constructing a coding-decoding module by using a ConvGRU model to realize the fusion of the space-time context information and the current frame information;
step 3: classifying and position regressing the fusion features to generate motion frames, and linking the motion frames by using the Viterbi algorithm to obtain behavior candidate regions;
step 4: constructing a behavior category candidate pool and updating the behavior categories that may appear, namely constructing and maintaining a candidate pool of possible behavior categories and gradually reducing the number of behavior categories in the candidate pool as the video is continuously input, so as to judge the behavior category online;
step 5: linking the behavior candidate region to an existing behavior chain in an online mode, or generating a new behavior chain;
step 6: fusing the detection results of the RGB image branch and the optical flow image branch to obtain a final detection result; specifically, if the overlap rate of two behavior chains generated by the two different branches is greater than a specified threshold, the behavior chain with the higher score is retained, otherwise both behavior chains are retained.
2. The on-line video behavior detection method based on spatio-temporal context analysis according to claim 1, characterized in that: before step 1, the method further comprises the following steps:
a video sequence to be detected is evenly divided into a plurality of video segments, wherein each segment contains 8 frames and adjacent segments overlap by one frame.
3. The on-line video behavior detection method based on spatio-temporal context analysis according to claim 1, characterized in that:
in the step 3, the fusion features are accessed into a behavior classification network and a position regression network, the behaviors appearing in each frame are classified, and meanwhile, the positions where the behaviors occur are positioned, so that a behavior frame is generated.
4. The on-line video behavior detection method based on spatio-temporal context analysis according to claim 1, characterized in that: in step 4, if the ordinal number of the current video segment is a multiple of 10, a behavior chain time domain pruning algorithm is executed, the score of the behavior chain relative to each behavior category is calculated, and the behavior category candidate pool is updated so that it includes only the several behavior categories with the largest scores.
5. The on-line video behavior detection method based on spatio-temporal context analysis according to claim 1, characterized in that: in the step 5, for a current behavior chain, if there exist behavior candidate regions whose overlapping area with the chain is larger than a specified threshold, the behavior candidate region with the largest score is linked to the behavior chain; if no behavior candidate region has an overlapping area larger than the specified threshold, the behavior chain is terminated.
CN201811298487.0A 2018-11-02 2018-11-02 Online video behavior detection method based on space-time context analysis Active CN109409307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811298487.0A CN109409307B (en) 2018-11-02 2018-11-02 Online video behavior detection method based on space-time context analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811298487.0A CN109409307B (en) 2018-11-02 2018-11-02 Online video behavior detection method based on space-time context analysis

Publications (2)

Publication Number Publication Date
CN109409307A CN109409307A (en) 2019-03-01
CN109409307B true CN109409307B (en) 2022-04-01

Family

ID=65471476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811298487.0A Active CN109409307B (en) 2018-11-02 2018-11-02 Online video behavior detection method based on space-time context analysis

Country Status (1)

Country Link
CN (1) CN109409307B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059761A (en) * 2019-04-25 2019-07-26 成都睿沿科技有限公司 A kind of human body behavior prediction method and device
CN110348345B (en) * 2019-06-28 2021-08-13 西安交通大学 Weak supervision time sequence action positioning method based on action consistency
CN110472230B (en) * 2019-07-11 2023-09-05 平安科技(深圳)有限公司 Chinese text recognition method and device
CN111414876B (en) * 2020-03-26 2022-04-22 西安交通大学 Violent behavior identification method based on time sequence guide space attention
CN111523421B (en) * 2020-04-14 2023-05-19 上海交通大学 Multi-person behavior detection method and system based on deep learning fusion of various interaction information
CN113569824B (en) * 2021-09-26 2021-12-17 腾讯科技(深圳)有限公司 Model processing method, related device, storage medium and computer program product
CN115424207B (en) * 2022-09-05 2023-04-14 南京星云软件科技有限公司 Self-adaptive monitoring system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1466104A (en) * 2002-07-03 2004-01-07 Institute of Computing Technology, Chinese Academy of Sciences Statistics and rule combination based phonetic driving human face cartoon method
CN107330920A (en) * 2017-06-28 2017-11-07 Huazhong University of Science and Technology A kind of monitor video multi-target tracking method based on deep learning
CN108573246A (en) * 2018-05-08 2018-09-25 Beijing University of Technology A kind of sequential action identification method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027542B2 (en) * 2007-06-18 2011-09-27 The Regents Of The University Of California High speed video action recognition and localization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1466104A (en) * 2002-07-03 2004-01-07 Institute of Computing Technology, Chinese Academy of Sciences Statistics and rule combination based phonetic driving human face cartoon method
CN107330920A (en) * 2017-06-28 2017-11-07 Huazhong University of Science and Technology A kind of monitor video multi-target tracking method based on deep learning
CN108573246A (en) * 2018-05-08 2018-09-25 Beijing University of Technology A kind of sequential action identification method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bo Huang et al., "Convolutional Gated Recurrent Units Fusion for Video Action Recognition", ICONIP 2017: Neural Information Processing, 2017-10-28, Section 3.3 *
Guo Huiwen, "Research on Crowd Abnormal Event Detection Methods Based on Multiple Carriers", China Doctoral Dissertations Full-text Database, Information Science and Technology, No. 5, 2018-05-15, pp. 23-60, Chapters 3-5 *
Guo Huiwen, "Research on Crowd Abnormal Event Detection Methods Based on Multiple Carriers", China Doctoral Dissertations Full-text Database, Information Science and Technology, No. 5, 2018 *

Also Published As

Publication number Publication date
CN109409307A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109409307B (en) Online video behavior detection method based on space-time context analysis
CN107330362B (en) Video classification method based on space-time attention
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN110164476B (en) BLSTM voice emotion recognition method based on multi-output feature fusion
CN113158875B (en) Image-text emotion analysis method and system based on multi-mode interaction fusion network
CN110096950A (en) A kind of multiple features fusion Activity recognition method based on key frame
CN103943107B (en) A kind of audio frequency and video keyword recognition method based on Decision-level fusion
CN110781776B (en) Road extraction method based on prediction and residual refinement network
CN111372123B (en) Video time sequence segment extraction method based on local to global
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
CN101187990A (en) A session robotic system
CN115205730A (en) Target tracking method combining feature enhancement and template updating
CN104461000B (en) A kind of on-line continuous human motion identification method based on a small amount of deleted signal
CN110110648B (en) Action nomination method based on visual perception and artificial intelligence
CN111476771B (en) Domain self-adaption method and system based on distance countermeasure generation network
CN107657625A (en) Merge the unsupervised methods of video segmentation that space-time multiple features represent
CN111652357A (en) Method and system for solving video question-answer problem by using specific target network based on graph
CN110569823B (en) Sign language identification and skeleton generation method based on RNN
CN115495552A (en) Multi-round dialogue reply generation method based on two-channel semantic enhancement and terminal equipment
CN111198966A (en) Natural language video clip retrieval method based on multi-agent boundary perception network
CN115424177A (en) Twin network target tracking method based on incremental learning
CN116564355A (en) Multi-mode emotion recognition method, system, equipment and medium based on self-attention mechanism fusion
CN113989933B (en) Online behavior recognition model training and detecting method and system
Zhao et al. Robust online tracking with meta-updater
CN114399661A (en) Instance awareness backbone network training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant