CN108647591A - Method and system for activity recognition in video based on visual-semantic features - Google Patents

Method and system for activity recognition in video based on visual-semantic features

Info

Publication number
CN108647591A
Authority
CN
China
Prior art keywords
image sequence
feature vector
image
gru
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810379626.6A
Other languages
Chinese (zh)
Inventor
李方敏
尤天宇
刘新华
旷海兰
张韬
栾悉道
阳超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University
Original Assignee
Changsha University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University
Priority to CN201810379626.6A
Publication of CN108647591A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses an activity recognition method for video based on visual-semantic features. Short-term spatio-temporal visual features are first extracted with a three-dimensional convolutional neural network, avoiding the high computational complexity brought by optical-flow or dense-trajectory methods. An object detector based on a convolutional neural network is then used to extract the semantics and spatial positions of people and objects; a person-object spatial position feature is constructed and fused with the spatio-temporal visual features, so that the additional semantic information improves the recognition accuracy of interactive behaviors in video. Finally, on the basis of the extracted generic short-term spatio-temporal visual features, specific long-term behavior features extracted by a recurrent neural network further improve recognition accuracy. The invention solves the technical problems of high computational complexity, low recognition accuracy, and the inability to extract long-term behavior features spanning the entire temporal dimension of the video that exist in current activity recognition methods for video.

Description

Method and system for activity recognition in video based on visual-semantic features
Technical field
The present invention belongs to the technical field of computer vision, and more particularly relates to a method and system for activity recognition in video based on visual-semantic features.
Background art
The problem of activity recognition for video data has become a popular research area in the field of computer vision. At present there are mainly three kinds of methods for activity recognition in video: optical-flow methods, recurrent-neural-network methods, and three-dimensional convolutional neural networks.
Optical-flow methods achieve high recognition accuracy, but the computational complexity of optical flow is high, so they cannot run in real time. The input data of recurrent-neural-network methods is mainly of two kinds: the first is single-frame image features extracted by a convolutional neural network, which lack temporal correlation information and therefore give low recognition accuracy; the second is optical flow or dense-trajectory information, which, as with optical-flow methods, makes the computational complexity high. Three-dimensional convolutional neural networks take fixed-length image sequence segments as input, so they can only extract generic short-term spatio-temporal visual features and cannot extract long-term behavior features spanning the entire temporal dimension of a video.
Summary of the invention
In view of the above drawbacks or improvement needs of the prior art, the present invention provides a method and system for activity recognition in video based on visual-semantic features, the object of which is to solve the technical problems of high computational complexity, low recognition accuracy, and the inability to extract long-term behavior features spanning the entire temporal dimension of the video that exist in current activity recognition methods for video.
To achieve the above object, according to one aspect of the present invention, a method for activity recognition in video based on visual-semantic features is provided, comprising the following steps:
(1) Obtain an image sequence from a data set and downsample it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}; then slice the downsampled image sequence into N fixed-length image sequence segments, where T denotes the length of the image sequence and N denotes the number of image sequence segments.
(2) Scale and crop each image in the N fixed-length image sequence segments, and feed the N image sequence segments into a three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors.
(3) Select one image from each image sequence segment obtained in step (1), scale and crop it, and feed the scaled and cropped image into an object detector to obtain object-class confidences and position offsets; construct a person-object spatial position feature vector from the object-class confidences and position offsets.
(4) Fuse the spatio-temporal visual feature vectors obtained in step (2) with the person-object spatial position feature vectors obtained in step (3).
(5) Feed the feature vector fused in step (4) into a recurrent neural network to obtain a long-term behavior feature.
(6) Classify the long-term behavior feature obtained in step (5) with a Softmax classifier to generate a classification probability corresponding to each behavior type.
Preferably, the image sequence is sliced according to the following formula:
V_n = {v_t}, t ∈ {n·T_c, n·T_c + 1, …, n·T_c + δ - 1}, n ∈ {0, 1, …, N-1}
where T_c is the frame step between image sequence segments, δ is the number of frames per image sequence segment, and T_c = 8, δ = 16.
Preferably, the three-dimensional convolutional neural network used is a C3D network, and the object detector used is a single-shot multibox detector with an input resolution of 300 × 300.
Preferably, feeding the N image sequence segments into the three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors specifically comprises: for each image sequence segment, first feeding the image sequence segment into the C3D network, then using the output of the fifth pooling layer of the C3D network as the short-term spatio-temporal visual feature, and finally reshaping this feature map into one feature vector of length 8192, the output of the fifth pooling layer being of size 1 × 4 × 4 × 512.
Preferably, step (3) specifically comprises: first, the object detector outputs, from the scaled and cropped input image, multiple output vectors corresponding to multiple bounding boxes, each output vector containing the confidences P = {p_l} of L object classes and a position offset [x, y, w, h], where l ∈ {0, 1, …, L-1}, L denotes the number of object classes, and p_l denotes the confidence of the l-th object class; then the output vectors corresponding to all bounding boxes are merged to obtain, for the detected objects, multiple spatial position feature vectors of length 5, [q, x/W_I, y/H_I, w/W_I, h/H_I], where q denotes the confidence of the object class to which a detected object belongs, x and y are the coordinates of the bounding box of the detected object, w and h are the width and height of the bounding box of the detected object, and W_I and H_I are the width and height of the scaled and cropped image; finally, for each of the L object classes, the spatial position feature vectors of its 5 most confident detected objects are used to construct a feature vector of length 5 × L × 5.
Preferably, the recurrent neural network used in step (5) is a 3-layer stacked GRU network, consisting of one fully connected layer and three cascaded GRU layers; the fully connected layer has 4096 neurons, the GRU units of the first two GRU layers have 4096 neurons, the GRU units of the last layer have 256 neurons, and the output of each GRU layer is the input of the next GRU layer.
Preferably, the recurrent neural network used in step (5) is a composite GRU network, consisting of three fully connected layers and one GRU layer; the first two fully connected layers have 4096 neurons, the last fully connected layer has 512 neurons, and the GRU units of the GRU layer have 512 neurons.
According to another aspect of the present invention, an activity recognition system for video based on visual-semantic features is provided, comprising:
a first module for obtaining an image sequence from a data set, downsampling it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}, and slicing the downsampled image sequence into N fixed-length image sequence segments, where T denotes the length of the image sequence and N denotes the number of image sequence segments;
a second module for scaling and cropping each image in the N fixed-length image sequence segments and feeding the N image sequence segments into a three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors;
a third module for selecting one image from each image sequence segment obtained by the first module, scaling and cropping the image, feeding the scaled and cropped image into an object detector to obtain object-class confidences and position offsets, and constructing a person-object spatial position feature vector from the object-class confidences and position offsets;
a fourth module for fusing the spatio-temporal visual feature vectors obtained by the second module with the person-object spatial position feature vectors obtained by the third module;
a fifth module for feeding the feature vector fused by the fourth module into a recurrent neural network to obtain a long-term behavior feature;
a sixth module for classifying the long-term behavior feature obtained by the fifth module with a Softmax classifier to generate a classification probability corresponding to each behavior type.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) The computational complexity of the invention is low, and real-time computation can be guaranteed: since step (2) of the present invention uses a three-dimensional convolutional neural network to extract short-term spatio-temporal features, the high computational complexity brought by optical-flow methods is avoided, and fast, efficient activity recognition is achieved.
(2) The activity recognition accuracy of the invention is high: since the present invention constructs a person-object spatial position feature vector in step (3), the recognition accuracy of interactive behaviors between people and objects in video is improved.
(3) Since the present invention uses improved GRU network structures in step (5) to extract long-term behavior features on the basis of short-term spatio-temporal features, recognition accuracy is further improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the 3-layer stacked GRU network used in step (5) of the method of the present invention.
Fig. 2 is a schematic diagram of the composite GRU network used in step (5) of the method of the present invention.
Fig. 3 is a schematic comparison of the activity recognition accuracy of the GRU networks shown in Fig. 1 and Fig. 2 with that of a conventional single-layer GRU network.
Fig. 4 is a flowchart of the activity recognition method for video based on visual-semantic features of the present invention.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely serve to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict.
The present invention proposes a long-short-term spatio-temporal visual model fusing person-object visual relationships (Long-Short Term Spatio-Temporal Visual Model with Human-Object Visual Relationship). Short-term spatio-temporal visual features are first extracted with a three-dimensional convolutional neural network, avoiding the high computational complexity brought by optical-flow or dense-trajectory methods; then an object detector based on a convolutional neural network is used to extract the semantics and spatial positions of people and objects, and a person-object spatial position feature is constructed and fused with the spatio-temporal visual features, so that the additional semantic information improves the recognition accuracy of interactive behaviors in video; finally, based on the fused short-term features, an improved recurrent neural network is proposed to extract long-term behavior features, i.e., on the basis of the extracted generic short-term spatio-temporal visual features, the specific long-term behavior features extracted by the recurrent neural network improve recognition accuracy.
As shown in Fig. 4, the activity recognition method for video based on visual-semantic features of the present invention comprises the following steps:
(1) Obtain an image sequence from a data set and downsample it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}, where T denotes the length of the image sequence; then slice the downsampled image sequence into N fixed-length image sequence segments, where N denotes the number of image sequence segments, specifically an integer between 5 and 10.
Specifically, the data set used in this step is the UCF101 activity recognition data set collected from YouTube, and the downsampling interval is 5 frames.
The image sequence is sliced according to the following formula:
V_n = {v_t}, t ∈ {n·T_c, n·T_c + 1, …, n·T_c + δ - 1}, n ∈ {0, 1, …, N-1}
where T_c is the frame step between image sequence segments, δ is the number of frames per image sequence segment, and T_c = 8, δ = 16.
For example, an image sequence of length 32 (i.e., T = 32) is sliced into 3 image sequence segments, each containing 16 images, with adjacent segments overlapping by 8 images.
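By way of illustration, the downsampling and slicing of step (1) can be sketched in Python as follows; this is a minimal sketch, and the function name, array layout, and frame loading are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def downsample_and_slice(frames, interval=5, T_c=8, delta=16):
    """Downsample a frame array of shape (num_frames, H, W, 3) and slice
    it into N fixed-length, overlapping segments, as in step (1)."""
    v = frames[::interval]              # keep every `interval`-th frame
    T = len(v)                          # length of the downsampled sequence
    N = (T - delta) // T_c + 1          # number of fixed-length segments
    # Segment n covers frames [n*T_c, n*T_c + delta); with T_c=8 and
    # delta=16 adjacent segments overlap by 8 frames (T=32 gives N=3).
    return np.stack([v[n * T_c : n * T_c + delta] for n in range(N)])
```

With T = 32 this reproduces the example above: segments covering frames 0-15, 8-23, and 16-31.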
(2) Scale and crop each image in the N fixed-length image sequence segments (for example, to a resolution of 112 × 112; the resolution depends on the input resolution of the three-dimensional convolutional neural network), and feed the N image sequence segments into the three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors.
Each image sequence segment fed into the three-dimensional convolutional neural network has dimensions 16 × 112 × 112 × 3.
The three-dimensional convolutional neural network used in this step is a C3D network, which extracts the spatio-temporal visual features of the image sequence segments.
Specifically, for each image sequence segment, this step first feeds the segment into the C3D network, then uses the output of the fifth pooling (pool5) layer of the C3D network as the short-term spatio-temporal visual feature (the output of the fifth pooling layer has size 1 × 4 × 4 × 512, i.e., 512 feature maps of resolution 4 × 4), and finally reshapes this feature map into one feature vector of length 8192.
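A hedged sketch of this step, assuming a PyTorch C3D implementation truncated at the pool5 layer is available as a callable (PyTorch's channel-first layout is used here, so the clip tensor is 3 × 16 × 112 × 112; this is a convention of the sketch, not of the patent):

```python
import torch

def extract_clip_feature(c3d_pool5, clip):
    """Map one clip of shape (3, 16, 112, 112) to the length-8192
    short-term spatio-temporal visual feature of step (2)."""
    with torch.no_grad():
        fmap = c3d_pool5(clip.unsqueeze(0))  # pool5 output: (1, 512, 1, 4, 4)
    return fmap.flatten()                    # 512 * 1 * 4 * 4 = 8192
```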
(3) Select one image from each image sequence segment obtained in step (1), scale and crop it (for example, to a resolution of 300 × 300; the resolution depends on the input resolution of the subsequent object detector), and feed the scaled and cropped image into the object detector to obtain object-class confidences and position offsets; construct a person-object spatial position feature vector from the object-class confidences and position offsets.
Specifically, the object detector used in this step is a single-shot multibox detector with an input resolution of 300 × 300 (SSD300).
Specifically, in this step the object detector first outputs, from the scaled and cropped input image, multiple output vectors corresponding to multiple bounding boxes, each output vector containing the confidences P = {p_l} of L object classes and a position offset [x, y, w, h], where l ∈ {0, 1, …, L-1}, L denotes the number of object classes, and p_l denotes the confidence of the l-th object class. The output vectors corresponding to all bounding boxes are then merged (the merging uses the non-maximum suppression (NMS) algorithm) to obtain, for the detected objects, multiple spatial position feature vectors of length 5, [q, x/W_I, y/H_I, w/W_I, h/H_I], where q denotes the confidence of the object class to which a detected object belongs, x and y are the coordinates of the bounding box of the detected object, w and h are the width and height of the bounding box of the detected object, and W_I and H_I are the width and height of the scaled and cropped image. Finally, for each of the L object classes, the spatial position feature vectors of its 5 most confident detected objects are used to construct a feature vector of length 5 × L × 5. Since the SSD300 used in this step can detect 201 object classes and the 5 most confident detections are kept per class, L = 201 and the resulting feature vector has length 5025.
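The construction can be sketched as follows; the per-detection tuple format after non-maximum suppression and the zero-padding of classes with fewer than 5 detections are assumptions of the sketch:

```python
import numpy as np

def person_object_feature(detections, L=201, top_k=5, W_I=300, H_I=300):
    """Build the person-object spatial position feature of step (3).
    `detections` is assumed to be a list of (label, q, x, y, w, h)
    tuples surviving NMS. For each of the L object classes the top_k
    most confident boxes are encoded as [q, x/W_I, y/H_I, w/W_I, h/H_I];
    classes with fewer detections are left zero-padded (assumption).
    Returns a vector of length 5 * L * top_k (5025 for L=201)."""
    feat = np.zeros((L, top_k, 5), dtype=np.float32)
    for l in range(L):
        boxes = sorted([d for d in detections if d[0] == l],
                       key=lambda d: d[1], reverse=True)[:top_k]
        for k, (_, q, x, y, w, h) in enumerate(boxes):
            feat[l, k] = [q, x / W_I, y / H_I, w / W_I, h / H_I]
    return feat.reshape(-1)
```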
(4) Fuse the spatio-temporal visual feature vectors obtained in step (2) with the person-object spatial position feature vectors obtained in step (3).
Specifically, the feature fusion in this step simply concatenates the spatio-temporal visual feature of length 8192 with the person-object spatial position feature of length 5025, so that the fused feature vector has length 13217.
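In code, the fusion of this step reduces to a single concatenation (illustrative sketch with placeholder tensors):

```python
import torch

visual_feat = torch.zeros(3, 8192)    # N=3 clips, C3D pool5 features
spatial_feat = torch.zeros(3, 5025)   # person-object spatial features
fused = torch.cat([visual_feat, spatial_feat], dim=-1)
assert fused.shape == (3, 13217)      # one fused vector per clip
```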
(5) Feed the feature vector fused in step (4) into a recurrent neural network to obtain a long-term behavior feature.
The recurrent neural network used in this step is the gated recurrent unit (GRU).
The present invention proposes two improved GRU network structures, in which the outer boxes in the figures denote the input feature vectors. For fused features, the input is the fused feature vector of length 13217. The GRU network receives the short-term spatio-temporal visual features over time and generates a long-term behavior feature over the full time scale.
Fig. 1 shows a 3-layer stacked GRU network (3-Layer Stacked GRU, abbreviated sGRU), consisting of one fully connected layer (FC) and three cascaded GRU layers. The fully connected layer has 4096 neurons; the GRU units of the first two GRU layers have 4096 neurons, those of the last layer have 256 neurons, and the output of each GRU layer is the input of the next GRU layer. The purpose of this architecture is to improve the learning ability of the network by increasing the depth of the GRU network.
The long-term behavior feature vector output by the sGRU network has length 256.
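A PyTorch sketch of the sGRU architecture as described; interface details not specified in the text, such as how the per-clip features are batched into a sequence, are assumptions:

```python
import torch.nn as nn

class SGRU(nn.Module):
    """3-layer stacked GRU (sGRU, Fig. 1): one 4096-unit fully connected
    layer, two 4096-unit GRU layers, and a final 256-unit GRU layer."""
    def __init__(self, in_dim=13217):
        super().__init__()
        self.fc = nn.Linear(in_dim, 4096)
        self.gru12 = nn.GRU(4096, 4096, num_layers=2, batch_first=True)
        self.gru3 = nn.GRU(4096, 256, batch_first=True)

    def forward(self, x):               # x: (batch, N_clips, 13217)
        h, _ = self.gru12(self.fc(x))
        h, _ = self.gru3(h)
        return h[:, -1]                 # length-256 long-term feature
```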
Fig. 2 shows a composite GRU network (Composite GRU, abbreviated cGRU), consisting of three fully connected layers and one GRU layer. The first two fully connected layers have 4096 neurons, the last fully connected layer has 512 neurons, and the GRU units of the GRU layer have 512 neurons. The purpose of this architecture is that the fully connected layers reduce the dimensionality of the input features, while the final GRU layer learns the long-term behavior feature.
The long-term behavior feature vector output by the cGRU network has length 512.
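A corresponding sketch of the cGRU network; the ReLU activations between the fully connected layers are an assumption, as the text does not specify them:

```python
import torch.nn as nn

class CGRU(nn.Module):
    """Composite GRU (cGRU, Fig. 2): three fully connected layers
    (4096, 4096, 512 units) for dimensionality reduction followed by a
    single 512-unit GRU layer."""
    def __init__(self, in_dim=13217):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, 512), nn.ReLU())
        self.gru = nn.GRU(512, 512, batch_first=True)

    def forward(self, x):               # x: (batch, N_clips, 13217)
        h, _ = self.gru(self.mlp(x))
        return h[:, -1]                 # length-512 long-term feature
```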
(6) Classify the long-term behavior feature obtained in step (5) with a Softmax classifier to generate a classification probability corresponding to each behavior type.
The final output of this step is a probability vector P_B = {p_b}, where b ∈ {0, 1, …, B-1}, B denotes the number of behavior types, and each element of the probability vector is the classification probability of the corresponding behavior type.
Since the UCF101 data set used in the present invention contains 101 behavior types, B = 101, and the y-th behavior type corresponding to the largest element p_y of the probability vector is the finally recognized behavior type.
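Step (6) then reduces to a linear layer followed by a softmax (a sketch; the 512-dimensional input assumes the cGRU variant):

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(512, 101), nn.Softmax(dim=-1))
p_B = classifier(torch.zeros(1, 512))  # probability vector P_B = {p_b}
y = p_B.argmax(dim=-1)                 # index of the recognized behavior type
```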
Experimental results
The test data set is the video data of the UCF101 activity recognition data set. The videos in the UCF101 data set were collected from YouTube and comprise 101 behavior types and 13320 video clips, offering not only diversity of behavior types but also diversity of camera motion, object pose, object size, shooting viewpoint, background, illumination, and so on. The behavior types in UCF101 can be divided into 5 broad categories: human-object interaction, human-human interaction, body motion, playing musical instruments, and sports.
(1) Recognition accuracy
Recognition accuracy is the fraction of the 3783 test-set samples that a method recognizes correctly. Testing the accuracy of methods with different module combinations helps to analyze the influence of each module on performance. The accuracy of each method is shown in Table 1 below, where italics denote methods that use improved dense trajectories or optical-flow information.
As can be seen, the method of the present invention improves accuracy by 8.2% and 10.2% over the LSTM composite model method and the C3D method, respectively. Compared with other methods that use optical flow or improved dense-trajectory information, the method of the present invention extracts features from the original image sequence using only deep neural networks, so its inference is faster. Improved dense trajectories are in fact handcrafted features based on optical-flow tracking and image gradient histograms, and computing optical flow consumes large amounts of computing resources and time. Between the two methods using different GRU network structures, the method of the present invention achieves the best performance, exceeding by 3.4% the accuracy of the multi-skip feature stacking method, which uses improved dense-trajectory information.
Table 1. Accuracy of each method on the UCF101 data set
(2) Influence of the GRU networks on performance
This section tests the methods using the sGRU network, the cGRU network, and a single-layer GRU network, where the single-layer GRU network is used as the baseline: it contains 512 neurons, the feature vector is fed into it directly, and it is a basic recurrent-neural-network structure.
The accuracy comparison of the methods with respect to the GRU networks is shown in Fig. 3: the method of the present invention using the cGRU network improves accuracy by 3.7% over the method using the sGRU network, and by 5.5% over the method using the single-layer GRU network.
Using the person-object spatial position feature, the methods of the present invention with the single-layer GRU network and with the sGRU network only reach accuracies similar to those of other methods that use optical flow or improved dense-trajectory information, which shows that the expressive power of the long-term behavior features extracted by the single-layer GRU and sGRU networks is poor. For overly long feature vectors, such as the fused feature vector of length 13217, the sGRU network has too many parameters, so inference and training are slow and it easily overfits; the single-layer GRU network is too shallow, learns poorly, and easily underfits. The cGRU network uses a fully connected network to reduce the feature dimensionality and then uses the GRU layer to learn the long-term behavior feature; since the number of network parameters is small, it not only infers and trains faster but also overfits less easily, giving higher accuracy.
In summary, the cGRU network best realizes the extraction of long-term behavior features on the basis of short-term features.
(3) Computation rate
The computation rates of the method of the present invention and of four other activity recognition methods on the UCF101 data set are shown in Table 2 below; the tests use one K40 Tesla GPU. Because the computational complexity of optical-flow algorithms is high, the GPU implementations of the optical-flow algorithms used in improved dense trajectories and in two-stream networks are respectively 91.4 and 274.6 times slower than the C3D method. Because the method of the present invention contains a person-object spatial feature extraction module and a long-term behavior feature extraction module, i.e., the additional SSD300 and cGRU networks, it is 2.5 times slower than the C3D network alone, but it is still far faster than the methods using improved dense trajectories and optical-flow information, reaching 125.2 frames/s and achieving faster-than-real-time computation.
Table 2. Comparison of the computation rate of each method
The person-object spatial position feature extraction module uses a downsampling interval of 16, so person-object spatial features need to be extracted from only one image per video segment, and the computation time is amortized over the images of the segment. In an independent test, the computation rate of SSD300 is 17.8 frames/s, i.e., 56.18 ms/frame, which after amortization is 3.51 ms/frame. The C3D network performs inference on video segments of 16 frames at a time, with a computation rate of 313.9 frames/s, i.e., 3.19 ms/frame. Theoretically, the computation times of the person-object spatial position extraction module and the spatio-temporal visual feature extraction module add up to 6.70 ms/frame, i.e., 149.3 frames/s. In actual tests the computation rate of the method of the present invention is 125.2 frames/s, because preprocessing, cGRU network inference, and other steps consume additional computation time, which is nevertheless much smaller than the inference time of the SSD300 and C3D networks.
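The amortization works out as follows:

```latex
\frac{56.18\,\text{ms/frame}}{16} = 3.51\,\text{ms/frame}, \qquad
3.51 + 3.19 = 6.70\,\text{ms/frame}
\;\Longrightarrow\; \frac{1000}{6.70} \approx 149.3\,\text{frames/s}.
```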
As will be readily understood by those skilled in the art, the foregoing is merely preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent substitution and improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (8)

1. A method for activity recognition in video based on visual-semantic features, characterized by comprising the following steps:
(1) obtaining an image sequence from a data set and downsampling it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}, and slicing the downsampled image sequence into N fixed-length image sequence segments, where T denotes the length of the image sequence and N denotes the number of image sequence segments;
(2) scaling and cropping each image in the N fixed-length image sequence segments, and feeding the N image sequence segments into a three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors;
(3) selecting one image from each image sequence segment obtained in step (1), scaling and cropping the image, feeding the scaled and cropped image into an object detector to obtain object-class confidences and position offsets, and constructing a person-object spatial position feature vector from the object-class confidences and position offsets;
(4) fusing the spatio-temporal visual feature vectors obtained in step (2) with the person-object spatial position feature vectors obtained in step (3);
(5) feeding the feature vector fused in step (4) into a recurrent neural network to obtain a long-term behavior feature;
(6) classifying the long-term behavior feature obtained in step (5) with a Softmax classifier to generate a classification probability corresponding to each behavior type.
2. The method for activity recognition in video according to claim 1, characterized in that the image sequence is sliced according to the following formula:
V_n = {v_t}, t ∈ {n·T_c, n·T_c + 1, …, n·T_c + δ - 1}, n ∈ {0, 1, …, N-1}
where T_c is the frame step between image sequence segments, δ is the number of frames per image sequence segment, and T_c = 8, δ = 16.
3. The method for activity recognition in video according to claim 1 or 2, characterized in that the three-dimensional convolutional neural network used is a C3D network, and the object detector used is a single-shot multibox detector with an input resolution of 300 × 300.
4. The method for activity recognition in video according to any one of claims 1 to 3, characterized in that feeding the N image sequence segments into the three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors specifically comprises: for each image sequence segment, first feeding the image sequence segment into the C3D network, then using the output of the fifth pooling layer of the C3D network as the short-term spatio-temporal visual feature, and finally reshaping this feature map into one feature vector of length 8192, the output of the fifth pooling layer being of size 1 × 4 × 4 × 512.
5. The method for activity recognition in video according to claim 4, characterized in that step (3) specifically comprises: first, the object detector outputs, from the scaled and cropped input image, multiple output vectors corresponding to multiple bounding boxes, each output vector containing the confidences P = {p_l} of L object classes and a position offset [x, y, w, h], where l ∈ {0, 1, …, L-1}, L denotes the number of object classes, and p_l denotes the confidence of the l-th object class; then the output vectors corresponding to all bounding boxes are merged to obtain, for the detected objects, multiple spatial position feature vectors of length 5, [q, x/W_I, y/H_I, w/W_I, h/H_I], where q denotes the confidence of the object class to which a detected object belongs, x and y are the coordinates of the bounding box of the detected object, w and h are the width and height of the bounding box of the detected object, and W_I and H_I are the width and height of the scaled and cropped image; finally, for each of the L object classes, the spatial position feature vectors of its 5 most confident detected objects are used to construct a feature vector of length 5 × L × 5.
6. The method for activity recognition in video according to claim 1, characterized in that the recurrent neural network used in step (5) is a 3-layer stacked GRU network, consisting of one fully connected layer and three cascaded GRU layers, wherein the fully connected layer has 4096 neurons, the GRU units of the first two GRU layers have 4096 neurons, the GRU units of the last layer have 256 neurons, and the output of each GRU layer is the input of the next GRU layer.
7. The method for activity recognition in video according to claim 1, characterized in that the recurrent neural network used in step (5) is a composite GRU network, consisting of three fully connected layers and one GRU layer, wherein the first two fully connected layers have 4096 neurons, the last fully connected layer has 512 neurons, and the GRU units of the GRU layer have 512 neurons.
8. An activity recognition system for video based on visual-semantic features, characterized by comprising:
a first module for obtaining an image sequence from a data set, downsampling it to obtain a downsampled image sequence V = {v_t}, t ∈ {0, 1, …, T-1}, and slicing the downsampled image sequence into N fixed-length image sequence segments, where T denotes the length of the image sequence and N denotes the number of image sequence segments;
a second module for scaling and cropping each image in the N fixed-length image sequence segments and feeding the N image sequence segments into a three-dimensional convolutional neural network to obtain N spatio-temporal visual feature vectors;
a third module for selecting one image from each image sequence segment obtained by the first module, scaling and cropping the image, feeding the scaled and cropped image into an object detector to obtain object-class confidences and position offsets, and constructing a person-object spatial position feature vector from the object-class confidences and position offsets;
a fourth module for fusing the spatio-temporal visual feature vectors obtained by the second module with the person-object spatial position feature vectors obtained by the third module;
a fifth module for feeding the feature vector fused by the fourth module into a recurrent neural network to obtain a long-term behavior feature;
a sixth module for classifying the long-term behavior feature obtained by the fifth module with a Softmax classifier to generate a classification probability corresponding to each behavior type.
CN201810379626.6A 2018-04-25 2018-04-25 Method and system for activity recognition in video based on visual-semantic features Withdrawn CN108647591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810379626.6A CN108647591A (en) 2018-04-25 2018-04-25 Method and system for activity recognition in video based on visual-semantic features


Publications (1)

Publication Number Publication Date
CN108647591A 2018-10-12

Family

ID=63747734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810379626.6A CN108647591A (en) 2018-04-25 2018-04-25 Method and system for activity recognition in video based on visual-semantic features

Country Status (1)

Country Link
CN (1) CN108647591A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20181012