CN108647591A - Method and system for activity recognition in video based on visual-semantic features - Google Patents

Method and system for activity recognition in video based on visual-semantic features

Info

Publication number
CN108647591A
Authority
CN
China
Prior art keywords
image sequence
feature vector
image
gru
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810379626.6A
Other languages
Chinese (zh)
Inventor
李方敏
尤天宇
刘新华
旷海兰
张韬
栾悉道
阳超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University
Original Assignee
Changsha University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University
Priority to CN201810379626.6A
Publication of CN108647591A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses an activity recognition method for video based on visual-semantic features. Short-term spatio-temporal visual features are first extracted with a three-dimensional convolutional neural network, avoiding the high computational complexity brought by optical-flow or dense-trajectory methods. An object detector based on a convolutional neural network is then used to extract the semantics and spatial positions of people and objects; a person-object spatial position feature is constructed and fused with the spatio-temporal visual features, so that the additional semantic information improves the recognition accuracy of interactive behaviors in video. Finally, on the basis of the extracted generic short-term spatio-temporal visual features, specific long-term behavior features extracted by a recurrent neural network further improve recognition accuracy. The invention solves the technical problems of high computational complexity, low recognition accuracy, and the inability to extract long-term behavior features spanning the entire temporal dimension of the video that exist in current activity recognition methods for video.

Description

Method and system for activity recognition in video based on visual-semantic features
Technical field
The present invention belongs to the technical field of computer vision, and more particularly relates to a method and system for activity recognition in video based on visual-semantic features.
Background art
The problem of activity recognition for video data has become a popular research area in the field of computer vision. At present there are mainly three kinds of methods for activity recognition in video: optical-flow methods, recurrent-neural-network methods, and three-dimensional convolutional neural networks.
Optical-flow methods achieve high recognition accuracy, but the computational complexity of optical flow is high, so they cannot run in real time. The input data of recurrent-neural-network methods is mainly of two kinds: the first is single-frame image features extracted by a convolutional neural network, which lack temporal correlation information and therefore give low recognition accuracy; the second is optical flow or dense-trajectory information, which, as with optical-flow methods, makes the computational complexity high. Three-dimensional convolutional neural networks take fixed-length image sequence segments as input, so they can only extract generic short-term spatio-temporal visual features and cannot extract long-term behavior features spanning the entire temporal dimension of a video.
Summary of the invention
In view of the above drawbacks or improvement needs of the prior art, the present invention provides a method and system for activity recognition in video based on visual-semantic features, the object of which is to solve the technical problems of high computational complexity, low recognition accuracy, and the inability to extract long-term behavior features spanning the entire temporal dimension of the video that exist in current activity recognition methods for video.
To achieve the above object, according to one aspect of the present invention, a method for activity recognition in video based on visual-semantic features is provided, comprising the following steps:
(1) Obtain an image sequence from a data set and downsample it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}; then slice the downsampled image sequence into N fixed-length image sequence segments, where T denotes the length of the image sequence and N denotes the number of image sequence segments.
(2) Scale and crop each image in the N fixed-length image sequence segments, and feed the N image sequence segments into a three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors.
(3) Select one image from each image sequence segment obtained in step (1), scale and crop it, and feed the scaled and cropped image into an object detector to obtain object-class confidences and position offsets; construct a person-object spatial position feature vector from the object-class confidences and position offsets.
(4) Fuse the spatio-temporal visual feature vectors obtained in step (2) with the person-object spatial position feature vectors obtained in step (3).
(5) Feed the feature vector fused in step (4) into a recurrent neural network to obtain a long-term behavior feature.
(6) Classify the long-term behavior feature obtained in step (5) with a Softmax classifier to generate a classification probability corresponding to each behavior type.
Preferably, the image sequence is sliced according to the following formula:
V_n = {v_t}, t ∈ {n·T_c, n·T_c + 1, …, n·T_c + δ - 1}, n ∈ {0, 1, …, N-1}
where T_c is the frame step between image sequence segments, δ is the number of frames per image sequence segment, and T_c = 8, δ = 16.
Preferably, the three-dimensional convolutional neural network used is a C3D network, and the object detector used is a single-shot multibox detector with an input resolution of 300 × 300.
Preferably, feeding the N image sequence segments into the three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors specifically comprises: for each image sequence segment, first feeding the image sequence segment into the C3D network, then using the output of the fifth pooling layer of the C3D network as the short-term spatio-temporal visual feature, and finally reshaping this feature map into one feature vector of length 8192, the output of the fifth pooling layer being of size 1 × 4 × 4 × 512.
Preferably, step (3) specifically comprises: first, the object detector outputs, from the scaled and cropped input image, multiple output vectors corresponding to multiple bounding boxes, each output vector containing the confidences P = {p_l} of L object classes and a position offset [x, y, w, h], where l ∈ {0, 1, …, L-1}, L denotes the number of object classes, and p_l denotes the confidence of the l-th object class; then the output vectors corresponding to all bounding boxes are merged to obtain, for the detected objects, multiple spatial position feature vectors of length 5, [q, x/W_I, y/H_I, w/W_I, h/H_I], where q denotes the confidence of the object class to which a detected object belongs, x and y are the coordinates of the bounding box of the detected object, w and h are the width and height of the bounding box of the detected object, and W_I and H_I are the width and height of the scaled and cropped image; finally, for each of the L object classes, the spatial position feature vectors of its 5 most confident detected objects are used to construct a feature vector of length 5 × L × 5.
Preferably, the recurrent neural network used in step (5) is a 3-layer stacked GRU network, consisting of one fully connected layer and three cascaded GRU layers; the fully connected layer has 4096 neurons, the GRU units of the first two GRU layers have 4096 neurons, the GRU units of the last layer have 256 neurons, and the output of each GRU layer is the input of the next GRU layer.
Preferably, the recurrent neural network used in step (5) is a composite GRU network, consisting of three fully connected layers and one GRU layer; the first two fully connected layers have 4096 neurons, the last fully connected layer has 512 neurons, and the GRU units of the GRU layer have 512 neurons.
According to another aspect of the present invention, an activity recognition system for video based on visual-semantic features is provided, comprising:
a first module for obtaining an image sequence from a data set, downsampling it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}, and slicing the downsampled image sequence into N fixed-length image sequence segments, where T denotes the length of the image sequence and N denotes the number of image sequence segments;
a second module for scaling and cropping each image in the N fixed-length image sequence segments and feeding the N image sequence segments into a three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors;
a third module for selecting one image from each image sequence segment obtained by the first module, scaling and cropping the image, feeding the scaled and cropped image into an object detector to obtain object-class confidences and position offsets, and constructing a person-object spatial position feature vector from the object-class confidences and position offsets;
a fourth module for fusing the spatio-temporal visual feature vectors obtained by the second module with the person-object spatial position feature vectors obtained by the third module;
a fifth module for feeding the feature vector fused by the fourth module into a recurrent neural network to obtain a long-term behavior feature;
a sixth module for classifying the long-term behavior feature obtained by the fifth module with a Softmax classifier to generate a classification probability corresponding to each behavior type.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) The computational complexity of the invention is low, and real-time computation can be guaranteed: since step (2) of the present invention uses a three-dimensional convolutional neural network to extract short-term spatio-temporal features, the high computational complexity brought by optical-flow methods is avoided, and fast, efficient activity recognition is achieved.
(2) The activity recognition accuracy of the invention is high: since the present invention constructs a person-object spatial position feature vector in step (3), the recognition accuracy of interactive behaviors between people and objects in video is improved.
(3) Since the present invention uses improved GRU network structures in step (5) to extract long-term behavior features on the basis of short-term spatio-temporal features, recognition accuracy is further improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the 3-layer stacked GRU network used in step (5) of the method of the present invention.
Fig. 2 is a schematic diagram of the composite GRU network used in step (5) of the method of the present invention.
Fig. 3 is a schematic comparison of the activity recognition accuracy of the GRU networks shown in Fig. 1 and Fig. 2 with that of a conventional single-layer GRU network.
Fig. 4 is a flowchart of the activity recognition method for video based on visual-semantic features of the present invention.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely serve to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict.
The present invention proposes a long-short-term spatio-temporal visual model fusing person-object visual relationships (Long-Short Term Spatio-Temporal Visual Model with Human-Object Visual Relationship). Short-term spatio-temporal visual features are first extracted with a three-dimensional convolutional neural network, avoiding the high computational complexity brought by optical-flow or dense-trajectory methods; then an object detector based on a convolutional neural network is used to extract the semantics and spatial positions of people and objects, and a person-object spatial position feature is constructed and fused with the spatio-temporal visual features, so that the additional semantic information improves the recognition accuracy of interactive behaviors in video; finally, based on the fused short-term features, an improved recurrent neural network is proposed to extract long-term behavior features, i.e., on the basis of the extracted generic short-term spatio-temporal visual features, the specific long-term behavior features extracted by the recurrent neural network improve recognition accuracy.
As shown in Fig. 4, the activity recognition method for video based on visual-semantic features of the present invention comprises the following steps:
(1) Obtain an image sequence from a data set and downsample it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}, where T denotes the length of the image sequence; then slice the downsampled image sequence into N fixed-length image sequence segments, where N denotes the number of image sequence segments, specifically an integer between 5 and 10.
Specifically, the data set used in this step is the UCF101 activity recognition data set collected from YouTube, and the downsampling interval is 5 frames.
The image sequence is sliced according to the following formula:
V_n = {v_t}, t ∈ {n·T_c, n·T_c + 1, …, n·T_c + δ - 1}, n ∈ {0, 1, …, N-1}
where T_c is the frame step between image sequence segments, δ is the number of frames per image sequence segment, and T_c = 8, δ = 16.
For example, an image sequence of length 32 (i.e., T = 32) is sliced into 3 image sequence segments, each containing 16 images, with adjacent segments overlapping by 8 images.
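By way of illustration, the downsampling and slicing of step (1) can be sketched in Python as follows; this is a minimal sketch, and the function name, array layout, and frame loading are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def downsample_and_slice(frames, interval=5, T_c=8, delta=16):
    """Downsample a frame array of shape (num_frames, H, W, 3) and slice
    it into N fixed-length, overlapping segments, as in step (1)."""
    v = frames[::interval]              # keep every `interval`-th frame
    T = len(v)                          # length of the downsampled sequence
    N = (T - delta) // T_c + 1          # number of fixed-length segments
    # Segment n covers frames [n*T_c, n*T_c + delta); with T_c=8 and
    # delta=16 adjacent segments overlap by 8 frames (T=32 gives N=3).
    return np.stack([v[n * T_c : n * T_c + delta] for n in range(N)])
```

With T = 32 this reproduces the example above: segments covering frames 0-15, 8-23, and 16-31.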
(2) Scale and crop each image in the N fixed-length image sequence segments (for example, to a resolution of 112 × 112; the resolution depends on the input resolution of the three-dimensional convolutional neural network), and feed the N image sequence segments into the three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors.
Each image sequence segment fed into the three-dimensional convolutional neural network has dimensions 16 × 112 × 112 × 3.
The three-dimensional convolutional neural network used in this step is a C3D network, which extracts the spatio-temporal visual features of the image sequence segments.
Specifically, for each image sequence segment, this step first feeds the segment into the C3D network, then uses the output of the fifth pooling (pool5) layer of the C3D network as the short-term spatio-temporal visual feature (the output of the fifth pooling layer has size 1 × 4 × 4 × 512, i.e., 512 feature maps of resolution 4 × 4), and finally reshapes this feature map into one feature vector of length 8192.
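A hedged sketch of this step, assuming a PyTorch C3D implementation truncated at the pool5 layer is available as a callable (PyTorch's channel-first layout is used here, so the clip tensor is 3 × 16 × 112 × 112; this is a convention of the sketch, not of the patent):

```python
import torch

def extract_clip_feature(c3d_pool5, clip):
    """Map one clip of shape (3, 16, 112, 112) to the length-8192
    short-term spatio-temporal visual feature of step (2)."""
    with torch.no_grad():
        fmap = c3d_pool5(clip.unsqueeze(0))  # pool5 output: (1, 512, 1, 4, 4)
    return fmap.flatten()                    # 512 * 1 * 4 * 4 = 8192
```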
(3) Select one image from each image sequence segment obtained in step (1), scale and crop it (for example, to a resolution of 300 × 300; the resolution depends on the input resolution of the subsequent object detector), and feed the scaled and cropped image into the object detector to obtain object-class confidences and position offsets; construct a person-object spatial position feature vector from the object-class confidences and position offsets.
Specifically, the object detector used in this step is a single-shot multibox detector with an input resolution of 300 × 300 (SSD300).
Specifically, in this step the object detector first outputs, from the scaled and cropped input image, multiple output vectors corresponding to multiple bounding boxes, each output vector containing the confidences P = {p_l} of L object classes and a position offset [x, y, w, h], where l ∈ {0, 1, …, L-1}, L denotes the number of object classes, and p_l denotes the confidence of the l-th object class. The output vectors corresponding to all bounding boxes are then merged (the merging uses the non-maximum suppression (NMS) algorithm) to obtain, for the detected objects, multiple spatial position feature vectors of length 5, [q, x/W_I, y/H_I, w/W_I, h/H_I], where q denotes the confidence of the object class to which a detected object belongs, x and y are the coordinates of the bounding box of the detected object, w and h are the width and height of the bounding box of the detected object, and W_I and H_I are the width and height of the scaled and cropped image. Finally, for each of the L object classes, the spatial position feature vectors of its 5 most confident detected objects are used to construct a feature vector of length 5 × L × 5. Since the SSD300 used in this step can detect 201 object classes and the 5 most confident detections are kept per class, L = 201 and the resulting feature vector has length 5025.
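The construction can be sketched as follows; the per-detection tuple format after non-maximum suppression and the zero-padding of classes with fewer than 5 detections are assumptions of the sketch:

```python
import numpy as np

def person_object_feature(detections, L=201, top_k=5, W_I=300, H_I=300):
    """Build the person-object spatial position feature of step (3).
    `detections` is assumed to be a list of (label, q, x, y, w, h)
    tuples surviving NMS. For each of the L object classes the top_k
    most confident boxes are encoded as [q, x/W_I, y/H_I, w/W_I, h/H_I];
    classes with fewer detections are left zero-padded (assumption).
    Returns a vector of length 5 * L * top_k (5025 for L=201)."""
    feat = np.zeros((L, top_k, 5), dtype=np.float32)
    for l in range(L):
        boxes = sorted([d for d in detections if d[0] == l],
                       key=lambda d: d[1], reverse=True)[:top_k]
        for k, (_, q, x, y, w, h) in enumerate(boxes):
            feat[l, k] = [q, x / W_I, y / H_I, w / W_I, h / H_I]
    return feat.reshape(-1)
```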
(4) Fuse the spatio-temporal visual feature vectors obtained in step (2) with the person-object spatial position feature vectors obtained in step (3).
Specifically, the feature fusion in this step simply concatenates the spatio-temporal visual feature of length 8192 with the person-object spatial position feature of length 5025, so that the fused feature vector has length 13217.
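In code, the fusion of this step reduces to a single concatenation (illustrative sketch with placeholder tensors):

```python
import torch

visual_feat = torch.zeros(3, 8192)    # N=3 clips, C3D pool5 features
spatial_feat = torch.zeros(3, 5025)   # person-object spatial features
fused = torch.cat([visual_feat, spatial_feat], dim=-1)
assert fused.shape == (3, 13217)      # one fused vector per clip
```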
(5) Feed the feature vector fused in step (4) into a recurrent neural network to obtain a long-term behavior feature.
The recurrent neural network used in this step is the gated recurrent unit (GRU).
The present invention proposes two improved GRU network structures, in which the outer boxes in the figures denote the input feature vectors. For fused features, the input is the fused feature vector of length 13217. The GRU network receives the short-term spatio-temporal visual features over time and generates a long-term behavior feature over the full time scale.
Fig. 1 shows a 3-layer stacked GRU network (3-Layer Stacked GRU, abbreviated sGRU), consisting of one fully connected layer (FC) and three cascaded GRU layers. The fully connected layer has 4096 neurons; the GRU units of the first two GRU layers have 4096 neurons, those of the last layer have 256 neurons, and the output of each GRU layer is the input of the next GRU layer. The purpose of this architecture is to improve the learning ability of the network by increasing the depth of the GRU network.
The long-term behavior feature vector output by the sGRU network has length 256.
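A PyTorch sketch of the sGRU architecture as described; interface details not specified in the text, such as how the per-clip features are batched into a sequence, are assumptions:

```python
import torch.nn as nn

class SGRU(nn.Module):
    """3-layer stacked GRU (sGRU, Fig. 1): one 4096-unit fully connected
    layer, two 4096-unit GRU layers, and a final 256-unit GRU layer."""
    def __init__(self, in_dim=13217):
        super().__init__()
        self.fc = nn.Linear(in_dim, 4096)
        self.gru12 = nn.GRU(4096, 4096, num_layers=2, batch_first=True)
        self.gru3 = nn.GRU(4096, 256, batch_first=True)

    def forward(self, x):               # x: (batch, N_clips, 13217)
        h, _ = self.gru12(self.fc(x))
        h, _ = self.gru3(h)
        return h[:, -1]                 # length-256 long-term feature
```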
Fig. 2 shows a composite GRU network (Composite GRU, abbreviated cGRU), consisting of three fully connected layers and one GRU layer. The first two fully connected layers have 4096 neurons, the last fully connected layer has 512 neurons, and the GRU units of the GRU layer have 512 neurons. The purpose of this architecture is that the fully connected layers reduce the dimensionality of the input features, while the final GRU layer learns the long-term behavior feature.
The long-term behavior feature vector output by the cGRU network has length 512.
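A corresponding sketch of the cGRU network; the ReLU activations between the fully connected layers are an assumption, as the text does not specify them:

```python
import torch.nn as nn

class CGRU(nn.Module):
    """Composite GRU (cGRU, Fig. 2): three fully connected layers
    (4096, 4096, 512 units) for dimensionality reduction followed by a
    single 512-unit GRU layer."""
    def __init__(self, in_dim=13217):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, 512), nn.ReLU())
        self.gru = nn.GRU(512, 512, batch_first=True)

    def forward(self, x):               # x: (batch, N_clips, 13217)
        h, _ = self.gru(self.mlp(x))
        return h[:, -1]                 # length-512 long-term feature
```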
(6) Classify the long-term behavior feature obtained in step (5) with a Softmax classifier to generate a classification probability corresponding to each behavior type.
The final output of this step is a probability vector P_B = {p_b}, where b ∈ {0, 1, …, B-1}, B denotes the number of behavior types, and each element of the probability vector is the classification probability of the corresponding behavior type.
Since the UCF101 data set used in the present invention contains 101 behavior types, B = 101, and the y-th behavior type corresponding to the largest element p_y of the probability vector is the finally recognized behavior type.
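Step (6) then reduces to a linear layer followed by a softmax (a sketch; the 512-dimensional input assumes the cGRU variant):

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(512, 101), nn.Softmax(dim=-1))
p_B = classifier(torch.zeros(1, 512))  # probability vector P_B = {p_b}
y = p_B.argmax(dim=-1)                 # index of the recognized behavior type
```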
Experimental results
The test data set is the video data of the UCF101 activity recognition data set. The videos in the UCF101 data set were collected from YouTube and comprise 101 behavior types and 13320 video clips, offering not only diversity of behavior types but also diversity of camera motion, object pose, object size, shooting viewpoint, background, illumination, and so on. The behavior types in UCF101 can be divided into 5 broad categories: human-object interaction, human-human interaction, body motion, playing musical instruments, and sports.
(1) Recognition accuracy
Recognition accuracy is the fraction of the 3783 test-set samples that a method recognizes correctly. Testing the accuracy of methods with different module combinations helps to analyze the influence of each module on performance. The accuracy of each method is shown in Table 1 below, where italics denote methods that use improved dense trajectories or optical-flow information.
As can be seen, the method of the present invention improves accuracy by 8.2% and 10.2% over the LSTM composite model method and the C3D method, respectively. Compared with other methods that use optical flow or improved dense-trajectory information, the method of the present invention extracts features from the original image sequence using only deep neural networks, so its inference is faster. Improved dense trajectories are in fact handcrafted features based on optical-flow tracking and image gradient histograms, and computing optical flow consumes large amounts of computing resources and time. Between the two methods using different GRU network structures, the method of the present invention achieves the best performance, exceeding by 3.4% the accuracy of the multi-skip feature stacking method, which uses improved dense-trajectory information.
Table 1. Accuracy of each method on the UCF101 data set
(2) Influence of the GRU networks on performance
This section tests the methods using the sGRU network, the cGRU network, and a single-layer GRU network, where the single-layer GRU network is used as the baseline: it contains 512 neurons, the feature vector is fed into it directly, and it is a basic recurrent-neural-network structure.
The accuracy comparison of the methods with respect to the GRU networks is shown in Fig. 3: the method of the present invention using the cGRU network improves accuracy by 3.7% over the method using the sGRU network, and by 5.5% over the method using the single-layer GRU network.
Using the person-object spatial position feature, the methods of the present invention with the single-layer GRU network and with the sGRU network only reach accuracies similar to those of other methods that use optical flow or improved dense-trajectory information, which shows that the expressive power of the long-term behavior features extracted by the single-layer GRU and sGRU networks is poor. For overly long feature vectors, such as the fused feature vector of length 13217, the sGRU network has too many parameters, so inference and training are slow and it easily overfits; the single-layer GRU network is too shallow, learns poorly, and easily underfits. The cGRU network uses a fully connected network to reduce the feature dimensionality and then uses the GRU layer to learn the long-term behavior feature; since the number of network parameters is small, it not only infers and trains faster but also overfits less easily, giving higher accuracy.
In summary, the cGRU network best realizes the extraction of long-term behavior features on the basis of short-term features.
(3) Computation rate
The computation rates of the method of the present invention and of four other activity recognition methods on the UCF101 data set are shown in Table 2 below; the tests use one K40 Tesla GPU. Because the computational complexity of optical-flow algorithms is high, the GPU implementations of the optical-flow algorithms used in improved dense trajectories and in two-stream networks are respectively 91.4 and 274.6 times slower than the C3D method. Because the method of the present invention contains a person-object spatial feature extraction module and a long-term behavior feature extraction module, i.e., the additional SSD300 and cGRU networks, it is 2.5 times slower than the C3D network alone, but it is still far faster than the methods using improved dense trajectories and optical-flow information, reaching 125.2 frames/s and achieving faster-than-real-time computation.
Table 2. Comparison of the computation rate of each method
The person-object spatial position feature extraction module uses a downsampling interval of 16, so person-object spatial features need to be extracted from only one image per video segment, and the computation time is amortized over the images of the segment. In an independent test, the computation rate of SSD300 is 17.8 frames/s, i.e., 56.18 ms/frame, which after amortization is 3.51 ms/frame. The C3D network performs inference on video segments of 16 frames at a time, with a computation rate of 313.9 frames/s, i.e., 3.19 ms/frame. Theoretically, the computation times of the person-object spatial position extraction module and the spatio-temporal visual feature extraction module add up to 6.70 ms/frame, i.e., 149.3 frames/s. In actual tests the computation rate of the method of the present invention is 125.2 frames/s, because preprocessing, cGRU network inference, and other steps consume additional computation time, which is nevertheless much smaller than the inference time of the SSD300 and C3D networks.
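The amortization works out as follows:

```latex
\frac{56.18\,\text{ms/frame}}{16} = 3.51\,\text{ms/frame}, \qquad
3.51 + 3.19 = 6.70\,\text{ms/frame}
\;\Longrightarrow\; \frac{1000}{6.70} \approx 149.3\,\text{frames/s}.
```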
As will be readily understood by those skilled in the art, the foregoing is merely preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent substitution and improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (8)

1. A method for activity recognition in video based on visual-semantic features, characterized by comprising the following steps:
(1) obtaining an image sequence from a data set and downsampling it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}, and slicing the downsampled image sequence into N fixed-length image sequence segments, where T denotes the length of the image sequence and N denotes the number of image sequence segments;
(2) scaling and cropping each image in the N fixed-length image sequence segments, and feeding the N image sequence segments into a three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors;
(3) selecting one image from each image sequence segment obtained in step (1), scaling and cropping the image, feeding the scaled and cropped image into an object detector to obtain object-class confidences and position offsets, and constructing a person-object spatial position feature vector from the object-class confidences and position offsets;
(4) fusing the spatio-temporal visual feature vectors obtained in step (2) with the person-object spatial position feature vectors obtained in step (3);
(5) feeding the feature vector fused in step (4) into a recurrent neural network to obtain a long-term behavior feature;
(6) classifying the long-term behavior feature obtained in step (5) with a Softmax classifier to generate a classification probability corresponding to each behavior type.
2. The method for activity recognition in video according to claim 1, characterized in that the image sequence is sliced according to the following formula:
V_n = {v_t}, t ∈ {n·T_c, n·T_c + 1, …, n·T_c + δ - 1}, n ∈ {0, 1, …, N-1}
where T_c is the frame step between image sequence segments, δ is the number of frames per image sequence segment, and T_c = 8, δ = 16.
3. The method for activity recognition in video according to claim 1 or 2, characterized in that the three-dimensional convolutional neural network used is a C3D network, and the object detector used is a single-shot multibox detector with an input resolution of 300 × 300.
4. The method for activity recognition in video according to any one of claims 1 to 3, characterized in that feeding the N image sequence segments into the three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors specifically comprises: for each image sequence segment, first feeding the image sequence segment into the C3D network, then using the output of the fifth pooling layer of the C3D network as the short-term spatio-temporal visual feature, and finally reshaping this feature map into one feature vector of length 8192, the output of the fifth pooling layer being of size 1 × 4 × 4 × 512.
5. The method for activity recognition in video according to claim 4, characterized in that step (3) specifically comprises: first, the object detector outputs, from the scaled and cropped input image, multiple output vectors corresponding to multiple bounding boxes, each output vector containing the confidences P = {p_l} of L object classes and a position offset [x, y, w, h], where l ∈ {0, 1, …, L-1}, L denotes the number of object classes, and p_l denotes the confidence of the l-th object class; then the output vectors corresponding to all bounding boxes are merged to obtain, for the detected objects, multiple spatial position feature vectors of length 5, [q, x/W_I, y/H_I, w/W_I, h/H_I], where q denotes the confidence of the object class to which a detected object belongs, x and y are the coordinates of the bounding box of the detected object, w and h are the width and height of the bounding box of the detected object, and W_I and H_I are the width and height of the scaled and cropped image; finally, for each of the L object classes, the spatial position feature vectors of its 5 most confident detected objects are used to construct a feature vector of length 5 × L × 5.
6. The method for activity recognition in video according to claim 1, characterized in that the recurrent neural network used in step (5) is a 3-layer stacked GRU network, consisting of one fully connected layer and three cascaded GRU layers, wherein the fully connected layer has 4096 neurons, the GRU units of the first two GRU layers have 4096 neurons, the GRU units of the last layer have 256 neurons, and the output of each GRU layer is the input of the next GRU layer.
7. The method for activity recognition in video according to claim 1, characterized in that the recurrent neural network used in step (5) is a composite GRU network, consisting of three fully connected layers and one GRU layer, wherein the first two fully connected layers have 4096 neurons, the last fully connected layer has 512 neurons, and the GRU units of the GRU layer have 512 neurons.
8. An activity recognition system for video based on visual-semantic features, characterized by comprising:
a first module for obtaining an image sequence from a data set, downsampling it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}, and slicing the downsampled image sequence into N fixed-length image sequence segments, where T denotes the length of the image sequence and N denotes the number of image sequence segments;
a second module for scaling and cropping each image in the N fixed-length image sequence segments and feeding the N image sequence segments into a three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors;
a third module for selecting one image from each image sequence segment obtained by the first module, scaling and cropping the image, feeding the scaled and cropped image into an object detector to obtain object-class confidences and position offsets, and constructing a person-object spatial position feature vector from the object-class confidences and position offsets;
a fourth module for fusing the spatio-temporal visual feature vectors obtained by the second module with the person-object spatial position feature vectors obtained by the third module;
a fifth module for feeding the feature vector fused by the fourth module into a recurrent neural network to obtain a long-term behavior feature;
a sixth module for classifying the long-term behavior feature obtained by the fifth module with a Softmax classifier to generate a classification probability corresponding to each behavior type.
CN201810379626.6A 2018-04-25 2018-04-25 Method and system for activity recognition in video based on visual-semantic features Withdrawn CN108647591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810379626.6A CN108647591A (en) 2018-04-25 2018-04-25 Method and system for activity recognition in video based on visual-semantic features


Publications (1)

Publication Number Publication Date
CN108647591A 2018-10-12

Family

ID=63747734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810379626.6A CN108647591A (en) 2018-04-25 2018-04-25 Method and system for activity recognition in video based on visual-semantic features

Country Status (1)

Country Link
CN (1) CN108647591A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20181012