CN105740773B - Activity recognition method based on deep learning and multi-scale information

Info

Publication number
CN105740773B
Authority
CN
China
Prior art keywords
video
segment
coarseness
video segment
depth
Prior art date
Legal status: Expired - Fee Related
Application number
CN201610047682.0A
Other languages
Chinese (zh)
Other versions
CN105740773A (en)
Inventor
刘智
冯欣
张杰
张杰慧
张凌
黄智勇
Current Assignee
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN201610047682.0A priority Critical patent/CN105740773B/en
Publication of CN105740773A publication Critical patent/CN105740773A/en
Application granted granted Critical
Publication of CN105740773B publication Critical patent/CN105740773B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an activity recognition method based on deep learning and multi-scale information. Multiple deep networks are constructed to form a parallel structure for human activity recognition in depth video: the depth video is first split into multiple video segments, each segment is then learned by its own parallel branch neural network, the high-level representations learned by the branches are fused by concatenation, and the fused high-level representation is finally fed into fully connected layers and a classification layer for recognition. The deep learning approach performs activity recognition effectively, significantly improves the recognition rate when the behaviors differ substantially from one another, and runs in real time.

Description

Activity recognition method based on deep learning and multi-scale information
Technical field
The present invention relates to the field of human activity recognition, and more particularly to an activity recognition method based on deep learning and multi-scale information.
Background technique
With the maturation of hardware such as computers and cameras and the growing demands of social management, research on human activity recognition has drawn increasing attention from computer vision researchers and has been widely applied in automatic surveillance, event detection, human-machine interfaces, video retrieval, and many other fields. Traditional human activity recognition methods first extract features from each video describing human behavior, such as Histograms of Oriented Gradients (HOG) and Motion History Images (MHI), and then classify the extracted features with classifiers such as support vector machines or random forests. Research on human activity recognition based on such methods has achieved many excellent results, but some problems remain hard to solve: the hand-crafted features are task-specific and do not generalize easily to other data, and the computational cost is too large to achieve real-time performance.
Deep learning can automatically extract the multi-layer feature representations hidden in data, and deep learning based on convolutional neural networks has achieved great success in image classification, recognition, localization, and related areas. However, the convolution used in image processing is a two-dimensional operation and cannot be applied directly to the three-dimensional videos that describe human behavior.
Summary of the invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide an activity recognition method based on deep learning and multi-scale information. The method performs activity recognition effectively using deep learning, significantly improves the recognition rate when the behaviors differ substantially, and generalizes well: it can be trained on a large data set and then applied to recognition tasks that lack training data. It also greatly reduces the time overhead of activity recognition and runs in real time.
The present invention takes depth video data as the research object, constructs a CNN-based deep neural network structure, fuses multi-scale information such as global human behavior information and local hand motion, and uses traditional two-dimensional CNNs to study three-dimensional human activity recognition.
The present invention studies human activity recognition in depth video by constructing multiple deep networks that form a parallel structure. The depth video is first split into multiple video segments; each segment is learned by its own parallel branch neural network; the high-level representations learned by the branches are then fused by concatenation, i.e., the output of each branch network is vectorized and the vectors are joined into a single one-dimensional vector that feeds the subsequent fully connected layers; finally, the fused high-level representation is fed into fully connected layers and a classification layer for recognition. In addition, because most behaviors in the MSRDailyActivity3D data set differ only subtly at the hands (for example read, write, use laptop, and play game), the invention proposes fusing multi-scale information such as coarse-grained global behavior information and fine-grained hand motion.
The object of the present invention is achieved as follows: an activity recognition method based on deep learning and multi-scale information, comprising the following steps:
(1) Establish a training data set; the coarse-grained global behavior videos in the training data set are selected from the MSRDailyActivity3D data set.
(2) Construct a deep neural network model with several parallel deep convolutional neural networks;
(3) Segment each coarse-grained global behavior video in the training data set with a set stride L_Stride, where the segment length is set to L_Seg; segmentation produces N_Seg coarse-grained video segment matrices, with N_Seg = 1 + (N_F - L_Seg)/L_Stride, where N_F is the number of frames of the coarse-grained global behavior video (see the sketch following these steps);
(4) Obtain a fine-grained local behavior video from the coarse-grained global behavior video of step (3), and segment it with the same method as step (3) to obtain N_Seg fine-grained video segment matrices; each frame of a fine-grained video segment matrix has the same size as each frame of a coarse-grained video segment matrix. The fine-grained local behavior sequence is cropped from each frame of the coarse-grained global behavior video to form the fine-grained local behavior video; the fine-grained local behavior may be hand motion or the detailed motion of another body part. The fine-grained video is obtained as follows (also illustrated in the sketch following these steps): centered on the left hand joint of each frame of the coarse-grained global behavior video, a window of size W/4 × H/4 is cropped, forming a new video of size N_F × W/4 × H/4; this is the fine-grained hand motion video, where W, H, and N_F are the width of the original depth video frames, their height, and the number of frames in the video, respectively. This size equals the size of the coarse-grained video after down-sampling.
(5) Feed the N_Seg coarse-grained video segment matrices from step (3) and the N_Seg fine-grained video segment matrices from step (4) in parallel into the deep neural network model constructed in step (2), which has 2N_Seg parallel deep convolutional neural networks, and train it;
(6) Select a coarse-grained global behavior video to be identified and apply steps (3) and (4) to obtain N_Seg coarse-grained video segment matrices and N_Seg fine-grained video segment matrices; feed them in parallel into the trained deep neural network model obtained in step (5) for activity recognition. The coarse-grained global behavior video to be identified is a preprocessed video.
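Steps (3) and (4) can be illustrated with a short sketch (illustrative Python/NumPy, not the patent's original code; the per-frame left-hand joint pixel coordinates are assumed to come from the accompanying skeleton data, and clamping the crop window at the frame border is an added assumption, since the text does not specify boundary handling):

    import numpy as np

    def segment_video(video, l_seg, l_stride):
        """Step (3): split a (n_f, H, W) depth video into overlapping segments.

        Returns an array of shape (n_seg, l_seg, H, W) with
        n_seg = 1 + (n_f - l_seg) // l_stride.
        """
        n_f = video.shape[0]
        n_seg = 1 + (n_f - l_seg) // l_stride
        return np.stack([video[i * l_stride : i * l_stride + l_seg]
                         for i in range(n_seg)])

    def crop_hand_video(video, left_hand_xy):
        """Step (4): crop a W/4 x H/4 window centered on the left hand joint."""
        n_f, H, W = video.shape
        h, w = H // 4, W // 4
        out = np.empty((n_f, h, w), dtype=video.dtype)
        for i, (x, y) in enumerate(left_hand_xy):
            top = int(min(max(y - h // 2, 0), H - h))    # clamp to the frame (assumption)
            left = int(min(max(x - w // 2, 0), W - w))
            out[i] = video[i, top:top + h, left:left + w]
        return out

    # Example with the normalized size used in the experiments: 192 x 128 x 128,
    # L_Seg = L_Stride = 16 -> 12 segments per stream
    video = np.zeros((192, 128, 128), dtype=np.float32)
    joints = np.full((192, 2), 64)                        # hypothetical joint track
    coarse = segment_video(video, 16, 16)                 # (12, 16, 128, 128)
    fine = segment_video(crop_hand_video(video, joints), 16, 16)  # (12, 16, 32, 32)
    # the coarse segments are further down-sampled to 32 x 32 before network input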
The deep neural network in step (2) uses convolutional neural networks as building blocks and has one classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer. Each parallel deep convolutional neural network comprises, in sequence, a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a third pooling layer, a first fully connected layer, a second fully connected layer, and a classification layer.
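One plausible rendering of a single branch with this layer sequence is sketched below (a PyTorch analogue, not the patent's original Torch code; treating the L_Seg frames of a segment as the input channels of a two-dimensional convolution, and the channel counts and kernel sizes, are assumptions, since Table 1's parameters are not reproduced here):

    import torch.nn as nn

    class BranchCNN(nn.Module):
        """One parallel branch: three conv/pool pairs, two fully connected layers."""
        def __init__(self, l_seg=16, num_classes=16):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(l_seg, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            # for a 16 x 32 x 32 input segment the feature map ends up 128 x 4 x 4
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
                nn.Linear(256, num_classes),  # classification layer (softmax in the loss)
            )

        def forward(self, x):  # x: (batch, l_seg, 32, 32)
            return self.fc(self.features(x))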
Each frame of the coarse-grained global behavior video in step (3) is down-sampled before segmentation. This serves two purposes: (1) it reduces the amount of computation; (2) it makes each frame of a coarse-grained video segment matrix the same size as each frame of a fine-grained video segment matrix, which is convenient for network input.
The coarse-grained global behavior video is a depth video.
The coarse-grained global behavior videos in the training data set are preprocessed videos, and the coarse-grained global behavior video to be identified is a preprocessed video. The preprocessing is as follows: first, interpolation is used to normalize all videos in the data set to a unified length, which is the median of all video lengths; second, the background is removed so that only the person-centered portion of the video is retained, and the video is resized to a fixed size; third, the x, y, and z coordinate values of all videos are normalized to the range [0, 1] with the min-max method; finally, every sample is horizontally flipped to form a new sample, doubling the number of training samples in the data set.
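A sketch of this preprocessing pipeline (illustrative Python/NumPy under stated assumptions: linear interpolation along the frame axis, min-max scaling over the whole video, and flipping along the x axis; the background-removal step depends on the capture setup and is omitted):

    import numpy as np

    def normalize_length(video, target_len):
        """Linearly interpolate a (n_f, H, W) video along time to target_len frames."""
        n_f = video.shape[0]
        src = np.linspace(0.0, n_f - 1.0, target_len)
        lo = np.floor(src).astype(int)
        hi = np.minimum(lo + 1, n_f - 1)
        w = (src - lo)[:, None, None]
        return (1.0 - w) * video[lo] + w * video[hi]

    def min_max_normalize(video):
        """Scale values into [0, 1] with the min-max method."""
        lo, hi = video.min(), video.max()
        return (video - lo) / (hi - lo + 1e-8)

    def flip_augment(videos):
        """Double the sample count by horizontal flipping (last axis is x)."""
        return np.concatenate([videos, videos[..., ::-1]], axis=0)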
An activity recognition method based on deep learning and multi-scale information, comprising the following steps:
(1) Establish a training data set; the depth videos in the training data set are selected from the MSRDailyActivity3D data set;
(2) Construct a deep neural network model with several parallel deep convolutional neural networks;
(3) Segment each behavior video in the training data set with a set stride L_Stride, where the segment length is set to L_Seg; segmentation produces N_Seg video segment matrices, with N_Seg = 1 + (N_F - L_Seg)/L_Stride, where N_F is the number of frames of the depth video;
(4) Feed the N_Seg video segment matrices from step (3) in parallel into the deep neural network model constructed in step (2), which has N_Seg parallel deep convolutional neural networks, and train it;
(5) Select a behavior video to be identified, apply step (3) to obtain N_Seg video segment matrices, and feed them in parallel into the trained deep neural network model for activity recognition. The behavior video to be identified is a preprocessed video.
The deep neural network in step (2) uses convolutional neural networks as building blocks and has one classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer.
The behavior video is a depth video.
The behavior videos in the training data set are preprocessed videos, and the behavior video to be identified is a preprocessed video. The preprocessing is as follows: first, interpolation is used to normalize all videos in the data set to a unified length, which is the median of all video lengths; second, the background is removed so that only the person-centered portion of the video is retained, and the video is resized to a fixed size; third, the x, y, and z coordinate values of all videos are normalized to the range [0, 1] with the min-max method; finally, every sample is horizontally flipped to form a new sample, doubling the number of training samples in the data set.
The beneficial effects of the invention are as follows: the present invention obtains coarse-grained and fine-grained video matrices, uses them to train the designed parallel deep convolutional neural networks, and uses the trained deep neural network for behavior classification, so the invention generalizes well: it can be trained on a large data set and then applied to activity recognition tasks that lack training data.
The present invention designs a parallel deep convolutional neural network; feeding the behavior video in parallel greatly reduces the time overhead of activity recognition and gives good real-time performance.
The present invention takes depth video as the research object; depth video describes the geometry of objects and is insensitive to lighting and color.
Experimental results show that the CNN-based deep learning method proposed by the present invention can effectively recognize human behaviors represented by depth video. On the MSRDailyActivity3D data set, the average recognition rate for the five behaviors whose differences are most pronounced (lie down on sofa, walk, play guitar, stand up, and sit down) is 98%, and the recognition rate for all behaviors on the entire data set is 60.625%.
The present invention will be further explained below with reference to the attached drawings and specific embodiments.
Detailed description of the invention
Fig. 1 is the functional block diagram of the activity recognition method based on deep learning and multi-scale information of the present invention;
Fig. 2 shows behavior videos in MSRDailyActivity3D before preprocessing (top: drink, bottom: write);
Fig. 3 shows behavior videos in MSRDailyActivity3D after preprocessing (top: drink, bottom: write).
Specific embodiment
Embodiment one
Referring to Fig. 1, an activity recognition method based on deep learning and multi-scale information comprises the following steps:
(1) Establish a training data set. The coarse-grained global behavior videos in the training data set are selected from the MSRDailyActivity3D data set and are preprocessed videos; the coarse-grained global behavior video to be identified is also a preprocessed video. The preprocessing is as follows: first, interpolation is used to normalize all videos in the data set to a unified length, which is the median of all video lengths; second, the background is removed so that only the person-centered portion of the video is retained, and the video is resized to a fixed size; third, the x, y, and z coordinate values of all videos are normalized to the range [0, 1] with the min-max method; finally, every sample is horizontally flipped to form a new sample, doubling the number of training samples in the data set.
(2) Construct a deep neural network model with several parallel deep convolutional neural networks. The deep neural network in step (2) uses convolutional neural networks as building blocks and has one classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer. The classification layer of the present invention uses a softmax classifier. Each parallel deep convolutional neural network of this embodiment comprises, in sequence, a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a third convolutional layer, a third pooling layer, a first fully connected layer, a second fully connected layer, and a classification layer.
(3) Segment each coarse-grained global behavior video in the training data set with a set stride L_Stride, where the segment length is set to L_Seg; segmentation produces N_Seg coarse-grained video segment matrices, with N_Seg = 1 + (N_F - L_Seg)/L_Stride, where N_F is the number of frames of the coarse-grained global behavior video. Each frame of the coarse-grained global behavior video is down-sampled before segmentation, which (1) reduces the amount of computation and (2) makes each frame of a coarse-grained video segment matrix the same size as each frame of a fine-grained video segment matrix, convenient for network input. The research object, i.e., the coarse-grained global behavior video, is a depth video.
(4) Obtain a fine-grained local behavior video from the coarse-grained global behavior video of step (3), and segment it with the same method as step (3) to obtain N_Seg fine-grained video segment matrices. Each frame of a fine-grained video segment matrix has the same size as each frame of a coarse-grained video segment matrix. The fine-grained local behavior sequence is cropped from each frame of the coarse-grained global behavior video to form the fine-grained local behavior video. The fine-grained local behavior may be hand motion or the detailed motion of another body part; it is determined by the specific application. The detail motion of this data set is concentrated mainly at the hands; if the detail motion were at another body part, the detail motion of that part could be chosen instead. In this embodiment, a window of a set size is cropped, centered on the hand joint of each frame of the coarse-grained global behavior video, forming a fine-grained local behavior video of N_F frames.
(5) Feed the N_Seg coarse-grained video segment matrices from step (3) and the N_Seg fine-grained video segment matrices from step (4) in parallel into the deep neural network model constructed in step (2), which has 2N_Seg parallel deep convolutional neural networks, and train it;
(6) Select a coarse-grained global behavior video to be identified and apply steps (3) and (4) to obtain N_Seg coarse-grained video segment matrices and N_Seg fine-grained video segment matrices; feed them in parallel into the trained deep neural network model for activity recognition. In this embodiment, the first N_Seg networks process the coarse-grained video and the last N_Seg networks process the fine-grained video.
Embodiment two
This embodiment discloses an activity recognition method based on deep learning and multi-scale information that uses only the coarse-grained global behavior information for activity recognition. It comprises the following steps:
(1) Establish a training data set. The depth videos in the training data set are selected from the MSRDailyActivity3D data set; the behavior videos in the training data set are preprocessed videos, and the behavior video to be identified is a preprocessed video. The preprocessing is as follows: first, interpolation is used to normalize all videos in the data set to a unified length, which is the median of all video lengths; second, the background is removed so that only the person-centered portion of the video is retained, and the video is resized to a fixed size; third, the x, y, and z coordinate values of all videos are normalized to the range [0, 1] with the min-max method; finally, every sample is horizontally flipped to form a new sample, doubling the number of training samples in the data set.
(2) Referring to Fig. 1, construct a deep neural network model with several parallel deep convolutional neural networks. The deep neural network in step (2) uses convolutional neural networks as building blocks and has one classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer.
The classification layer of the present invention uses a softmax classifier.
(3) Segment each depth video in the training data set with a set stride L_Stride, where the segment length is set to L_Seg; segmentation produces N_Seg video segment matrices, with N_Seg = 1 + (N_F - L_Seg)/L_Stride, where N_F is the number of frames of the depth video;
(4) Feed the N_Seg video segment matrices from step (3) in parallel into the deep neural network model constructed in step (2), which has N_Seg parallel deep convolutional neural networks, and train it;
(5) Select a depth video to be identified, apply step (3) to obtain N_Seg video segment matrices, and feed them in parallel into the trained deep neural network model for activity recognition.
The experimental procedure of the present invention is as follows: assume that the size of a normalized video representing one behavior is N_F × W × H (192 × 128 × 128 in the present invention), where W and H are the width and height of a video frame, respectively.
(1) Segment the behavior video of N_F frames with stride L_Stride, where each segment has length L_Seg, so the number of segments is N_Seg = 1 + (N_F - L_Seg)/L_Stride; each video frame is then down-sampled to a quarter of its width and height, so segmentation produces a video segment matrix of size N_Seg × L_Seg × W/4 × H/4;
(2) Centered on the left hand joint of each frame of the depth video, crop a window of size W/4 × H/4 to form a new video of size N_F × W/4 × H/4; apply the same method as step (1) to the new video to obtain a video segment matrix of size N_Seg × L_Seg × W/4 × H/4;
(3) Merge the video segment matrices of steps (1) and (2) to obtain a video segment matrix of size 2N_Seg × L_Seg × W/4 × H/4. This matrix is the input of the deep network; that is, the network has 2N_Seg parallel deep convolutional neural networks, and the input of each deep neural network is a video of size L_Seg × W/4 × H/4 (see the sketch after this procedure);
(4) Train the parallel deep convolutional neural networks with the training data set, then test human activity recognition with the test data set; the training set and test set are completely disjoint. In the present invention, the behavior videos performed by subjects {1, 3, 5, 7, 9} are used for training and the behavior videos performed by subjects {2, 4, 6, 8, 10} are used for testing. The data set was performed by 10 subjects: the data of subjects 1, 3, 5, 7, and 9 are used for training, and the data of subjects 2, 4, 6, 8, and 10 are used for testing.
Assume L_Seg = 16 and L_Stride = 16; the deep neural network framework then needs 24 parallel networks, and the input of each network is a video segment sequence of size 16 × 32 × 32, i.e., each video segment contains 16 frames of video and each frame is 32 × 32.
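As a numerical check (a worked example based on the figures above): with N_F = 192 normalized frames, N_Seg = 1 + (192 - 16)/16 = 12 segments per stream, so the coarse-grained and fine-grained streams together give 2 × 12 = 24 parallel networks, and down-sampling the 128 × 128 frames to a quarter of their width and height yields the 32 × 32 input frames.

The resulting parallel structure, concatenation fusion, and classification described in steps (1)-(3) above can be sketched as follows (an illustrative PyTorch analogue that reuses the hypothetical BranchCNN convolutional stack from the earlier sketch; fusing by concatenating the vectorized branch outputs follows the description, while the fully connected layer widths are assumptions, since Table 1 is not reproduced here):

    import torch
    import torch.nn as nn

    class ParallelDeepNet(nn.Module):
        """2*n_seg parallel conv branches, fused by concatenation, then FC + classifier."""
        def __init__(self, n_seg=12, l_seg=16, num_classes=16):
            super().__init__()
            # one convolutional branch per coarse- and fine-grained video segment
            self.branches = nn.ModuleList(
                [BranchCNN(l_seg).features for _ in range(2 * n_seg)])
            self.classifier = nn.Sequential(
                nn.Linear(2 * n_seg * 128 * 4 * 4, 512), nn.ReLU(),
                nn.Linear(512, num_classes),
            )

        def forward(self, x):  # x: (batch, 2*n_seg, l_seg, 32, 32)
            # vectorize each branch output and join into one one-dimensional vector
            feats = [b(x[:, i]).flatten(1) for i, b in enumerate(self.branches)]
            return self.classifier(torch.cat(feats, dim=1))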
Table 1: The deep networks used by the present invention and their parameters
Experiments and discussion
1. Data set and preprocessing
The present invention uses the MSRDailyActivity3D data set, collected by Microsoft with a Kinect device. The data set contains 16 behaviors common in daily life: drink, eat, read book, call cellphone, write, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lie down on sofa, walk, play guitar, stand up, and sit down. Each behavior is performed by each subject in two different ways: sitting on a sofa or standing. The entire data set contains 320 behavior videos. Fig. 2 shows some behavior samples from the data set. The data set records the human behavior and the surrounding environment simultaneously, the extracted depth information contains a large amount of noise, and most behaviors in the data set differ only locally and subtly, as shown in Fig. 2 and Fig. 3, so it is extremely challenging.
Before the experiments, simple preprocessing is applied to each video: first, interpolation is used to normalize all videos in the data set to a unified length, which is the median of all video lengths; second, the background is removed so that only the person-centered portion of the video is retained, and the video is resized to a fixed size, as shown in Fig. 3; third, the x, y, and z coordinate values of all videos are normalized to the range [0, 1] with the min-max method; finally, every sample is horizontally flipped to form a new sample, doubling the number of training samples in the data set. The experiments of the present invention are written on the Torch platform [20]; the learning rate is 1 × 10^-4 and the loss function is the softmax function provided by the platform.
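A rough PyTorch equivalent of this training setup (the original experiments used the Lua Torch platform; the learning rate of 1 × 10^-4 and the softmax/cross-entropy loss follow the text, while the SGD optimizer and the data loader are assumptions):

    import torch
    import torch.nn as nn

    # ParallelDeepNet is the hypothetical model sketched earlier
    model = ParallelDeepNet(n_seg=12, l_seg=16, num_classes=16)
    criterion = nn.CrossEntropyLoss()                          # softmax loss
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)   # optimizer assumed

    def train_epoch(loader):
        model.train()
        for inputs, labels in loader:   # inputs: (batch, 24, 16, 32, 32)
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()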
2. HAR based on multi-scale information fusion and deep learning
The present invention uses the 2CNN2F network in Table 1 and takes multi-scale information, namely the coarse-grained global behavior video and the fine-grained hand action sequence, as the input of the deep network. In the experiments of this section, the stride L_Stride and the segment length L_Seg are both set to 16; that is, a 12 × 16 × 32 × 32 global behavior sequence and a 12 × 16 × 32 × 32 local hand action sequence are extracted from the whole video and merged to form a 24 × 16 × 32 × 32 input video matrix. Table 2 compares the recognition performance of the proposed method and other methods on the MSRDailyActivity3D data set, where 2CNN2F uses only the coarse-grained global behavior information and 2CNN2F+Joint denotes the multi-scale information fusion method of the present invention. The table shows that the activity recognition accuracy of the present method is 60.625%; using only the coarse-grained global behavior information, the recognition rate drops slightly to 56.875%, which is comparable to methods based on traditional hand-crafted feature extraction. Notably, if only the 11th-16th behaviors (play game, lie down on sofa, walk, play guitar, stand up, and sit down) are recognized, the recognition rate reaches 98%. This may be because the 11th-16th behaviors differ considerably from one another, while the differences between many other behaviors in the data set are very subtle; for example, read, write, and use laptop differ only slightly in hand motion. The experimental results show that the deep learning method performs activity recognition effectively and significantly improves the recognition rate when the behaviors differ substantially.
Table 2: Recognition performance of the proposed method and other methods on the MSRDailyActivity3D data set

Algorithm                         Recognition rate
LOP features [8]                  42.5%
Joint Position features [8]       68%
Dynamic Temporal Warping [21]     54%
2CNN2F                            56.875%
2CNN2F+Joint                      60.625%
3. Influence of network depth on recognition
The present invention also constructs neural networks containing 3 and 4 CNN layers, i.e., 3CNN2F_8 and 4CNN2F (see Table 3), to probe the influence of network depth on recognition. The network parameters are shown in Table 1. Because the network depth increases, to keep the network from overfitting, this experiment uses 24 × 8 × 128 × 128 video sequences as the neural network input; that is, the normalized 192 × 128 × 128 videos are split with stride 8 into 24 video segments of size 8 × 128 × 128, which are fed in parallel into the neural network with 24 parallel branches. As shown in Table 3, the recognition rate of the 3CNN2F_8 network is 52.5% and that of 4CNN2F is 58.75%. The experimental results show that increasing the network depth can effectively improve the behavior recognition rate.
Table 3: Parameter configuration and recognition rates of the different networks
4. Influence of the splitting stride on recognition
To examine the influence of the splitting stride on recognition, the present invention constructs two networks of the 3CNN2F type with different inputs, 3CNN2F_8 and 3CNN2F_4. The input of 3CNN2F_8 is a 24 × 8 × 128 × 128 video sequence, while the input of 3CNN2F_4 has size 47 × 8 × 128 × 128; that is, the normalized 192 × 128 × 128 video is split with stride 4 into 47 video segments of size 8 × 128 × 128, with a 4-frame overlap between adjacent segments. The experimental results are shown in Table 3. With stride 8 the recognition accuracy is 52.5%, and with stride 4 it is 56.875%. The recognition rate improves effectively, mainly because reducing the stride causes two changes: on one hand, a smaller stride yields more video segments, so the deep network needs more parallel branches, becomes wider, and has more parameters, giving it better generalization ability; on the other hand, reducing the stride increases the number of video segments, so the amount of training data also increases and the network trains better.
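As a check of the segment count (a worked example): with N_F = 192, L_Seg = 8, and L_Stride = 8, N_Seg = 1 + (192 - 8)/8 = 24 segments; reducing the stride to 4 gives N_Seg = 1 + (192 - 8)/4 = 47 segments, matching the two input sizes above.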
Considering that depth video describes the geometry of objects and is insensitive to lighting and color, the present invention takes depth video as the research object, constructs a deep neural network model from traditional two-dimensional CNNs (convolutional neural networks), and classifies the behaviors in the MSRDailyActivity3D data set. The experimental results show that the proposed CNN-based deep learning method can effectively recognize human behaviors represented by depth video: on the MSRDailyActivity3D data set, the average recognition rate for the five most distinct behaviors (lie down on sofa, walk, play guitar, stand up, and sit down) is 98%, and the recognition rate for all behaviors on the entire data set is 60.625%. The present invention also carried out some exploratory experiments on how to improve the recognition rate of deep learning. The research finds that reducing the video splitting stride, fusing coarse-grained and fine-grained video information, and appropriately increasing the network depth can effectively improve the recognition rate of the deep network.
The present invention is not limited to the above embodiments; technical solutions obtained by minor modifications that do not depart from the spirit of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. An activity recognition method based on deep learning and multi-scale information, characterized by comprising the following steps:
(1) establishing a training data set;
(2) constructing a deep neural network model with several parallel deep convolutional neural networks;
(3) selecting the coarse-grained global behavior videos in the training data set and segmenting them with a set stride L_Stride, where each segment length is set to L_Seg, so that segmentation produces N_Seg coarse-grained video segment matrices, with N_Seg = 1 + (N_F - L_Seg)/L_Stride, where N_F is the number of frames of the coarse-grained global behavior video;
(4) obtaining a fine-grained local behavior video from the coarse-grained global behavior video of step (3) and segmenting it with the same method as step (3) to obtain N_Seg fine-grained video segment matrices;
(5) feeding the N_Seg coarse-grained video segment matrices from step (3) and the N_Seg fine-grained video segment matrices from step (4) in parallel into the deep neural network model constructed in step (2), which has 2N_Seg parallel deep convolutional neural networks, and training it;
(6) selecting a coarse-grained global behavior video to be identified, applying steps (3) and (4) to obtain N_Seg coarse-grained video segment matrices and N_Seg fine-grained video segment matrices, and feeding them in parallel into the trained deep neural network model obtained in step (5) for activity recognition.
2. The activity recognition method based on deep learning and multi-scale information according to claim 1, characterized in that: the deep neural network model in step (2) uses convolutional neural networks as building blocks and has a classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer.
3. The activity recognition method based on deep learning and multi-scale information according to claim 1, characterized in that: each frame of the coarse-grained global behavior video in step (3) is down-sampled before segmentation, so that each frame of a coarse-grained video segment matrix has the same size as each frame of a fine-grained video segment matrix.
4. The activity recognition method based on deep learning and multi-scale information according to claim 1, characterized in that: the coarse-grained global behavior video is a depth video.
5. The activity recognition method based on deep learning and multi-scale information according to claim 1 or 4, characterized in that: the coarse-grained global behavior videos in the training data set are preprocessed videos, and the coarse-grained global behavior video to be identified is a preprocessed video.
6. The activity recognition method based on deep learning and multi-scale information according to claim 1, characterized in that: the fine-grained local behavior sequence is cropped from each frame of the coarse-grained global behavior video to form the fine-grained local behavior video.
7. An activity recognition method based on deep learning and coarse-grained global behavior information, characterized by comprising the following steps:
(1) establishing a training data set;
(2) constructing a deep neural network model with several parallel deep convolutional neural networks;
(3) selecting the coarse-grained global behavior videos in the training data set and segmenting them with a set stride L_Stride, where each segment length is set to L_Seg, so that segmentation produces N_Seg video segment matrices, with N_Seg = 1 + (N_F - L_Seg)/L_Stride, where N_F is the number of frames of the depth video; the behavior video is a depth video;
(4) feeding the N_Seg video segment matrices from step (3) in parallel into the deep neural network model constructed in step (2), which has N_Seg parallel deep convolutional neural networks, and training it;
(5) selecting a coarse-grained global behavior video to be identified, applying step (3) to obtain N_Seg video segment matrices, and feeding them in parallel into the trained deep neural network model for activity recognition.
8. The activity recognition method based on deep learning and coarse-grained global behavior information according to claim 7, characterized in that: the deep neural network in step (2) uses convolutional neural networks as building blocks and has a classification layer, at least one convolutional layer, at least one pooling layer, and at least one fully connected layer.
9. The activity recognition method based on deep learning and coarse-grained global behavior information according to claim 7, characterized in that: the coarse-grained global behavior videos in the training data set are preprocessed videos, and the coarse-grained global behavior video to be identified is a preprocessed video.
CN201610047682.0A 2016-01-25 2016-01-25 Activity recognition method based on deep learning and multi-scale information Expired - Fee Related CN105740773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610047682.0A CN105740773B (en) 2016-01-25 2016-01-25 Activity recognition method based on deep learning and multi-scale information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610047682.0A CN105740773B (en) 2016-01-25 2016-01-25 Activity recognition method based on deep learning and multi-scale information

Publications (2)

Publication Number Publication Date
CN105740773A CN105740773A (en) 2016-07-06
CN105740773B true CN105740773B (en) 2019-02-01

Family

ID=56247501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610047682.0A Expired - Fee Related CN105740773B (en) 2016-01-25 2016-01-25 Activity recognition method based on deep learning and multi-scale information

Country Status (1)

Country Link
CN (1) CN105740773B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203503B (en) * 2016-07-08 2019-04-05 天津大学 A kind of action identification method based on bone sequence
CN106599789B (en) * 2016-07-29 2019-10-11 北京市商汤科技开发有限公司 The recognition methods of video classification and device, data processing equipment and electronic equipment
CN106228240B (en) * 2016-07-30 2020-09-01 复旦大学 Deep convolution neural network implementation method based on FPGA
CN106504266B (en) * 2016-09-29 2019-06-14 北京市商汤科技开发有限公司 The prediction technique and device of walking behavior, data processing equipment and electronic equipment
CN106778576B (en) * 2016-12-06 2020-05-26 中山大学 Motion recognition method based on SEHM characteristic diagram sequence
CN106951872B (en) * 2017-03-24 2020-11-06 江苏大学 Pedestrian re-identification method based on unsupervised depth model and hierarchical attributes
CN107066979A (en) * 2017-04-18 2017-08-18 重庆邮电大学 A kind of human motion recognition method based on depth information and various dimensions convolutional neural networks
CN107886117A (en) * 2017-10-30 2018-04-06 国家新闻出版广电总局广播科学研究院 The algorithm of target detection merged based on multi-feature extraction and multitask
CN107837087A (en) * 2017-12-08 2018-03-27 兰州理工大学 A kind of human motion state recognition methods based on smart mobile phone
CN108038107B (en) * 2017-12-22 2021-06-25 东软集团股份有限公司 Sentence emotion classification method, device and equipment based on convolutional neural network
CN108182441B (en) * 2017-12-29 2020-09-18 华中科技大学 Parallel multichannel convolutional neural network, construction method and image feature extraction method
CN108280406A (en) * 2017-12-30 2018-07-13 广州海昇计算机科技有限公司 A kind of Activity recognition method, system and device based on segmentation double-stream digestion
CN108182416A (en) * 2017-12-30 2018-06-19 广州海昇计算机科技有限公司 A kind of Human bodys' response method, system and device under monitoring unmanned scene
CN108524209A (en) * 2018-03-30 2018-09-14 江西科技师范大学 Blind-guiding method, system, readable storage medium storing program for executing and mobile terminal
CN108664931B (en) * 2018-05-11 2022-03-01 中国科学技术大学 Multi-stage video motion detection method
CN108805083B (en) * 2018-06-13 2022-03-01 中国科学技术大学 Single-stage video behavior detection method
CN109558805A (en) * 2018-11-06 2019-04-02 南京邮电大学 Human bodys' response method based on multilayer depth characteristic
CN109214375B (en) * 2018-11-07 2020-11-24 浙江大学 Embryo pregnancy result prediction device based on segmented sampling video characteristics
CN109657546A (en) * 2018-11-12 2019-04-19 平安科技(深圳)有限公司 Video behavior recognition methods neural network based and terminal device
CN110119760B (en) * 2019-04-11 2021-08-10 华南理工大学 Sequence classification method based on hierarchical multi-scale recurrent neural network
CN110163127A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 A kind of video object Activity recognition method from thick to thin
CN110222587A (en) * 2019-05-13 2019-09-10 杭州电子科技大学 A kind of commodity attribute detection recognition methods again based on characteristic pattern
CN110222598B (en) * 2019-05-21 2022-09-27 平安科技(深圳)有限公司 Video behavior identification method and device, storage medium and server
CN111460876B (en) 2019-06-05 2021-05-25 北京京东尚科信息技术有限公司 Method and apparatus for identifying video
CN110321963B (en) * 2019-07-09 2022-03-04 西安电子科技大学 Hyperspectral image classification method based on fusion of multi-scale and multi-dimensional space spectrum features
CN111242110B (en) * 2020-04-28 2020-08-14 成都索贝数码科技股份有限公司 Training method of self-adaptive conditional random field algorithm for automatically breaking news items

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866429A (en) * 2010-06-01 2010-10-20 中国科学院计算技术研究所 Training method of multi-moving object action identification and multi-moving object action identification method
CN103593464A (en) * 2013-11-25 2014-02-19 华中科技大学 Video fingerprint detecting and video sequence matching method and system based on visual features
CN104299012A (en) * 2014-10-28 2015-01-21 中国科学院自动化研究所 Gait recognition method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866429A (en) * 2010-06-01 2010-10-20 中国科学院计算技术研究所 Training method of multi-moving object action identification and multi-moving object action identification method
CN103593464A (en) * 2013-11-25 2014-02-19 华中科技大学 Video fingerprint detecting and video sequence matching method and system based on visual features
CN104299012A (en) * 2014-10-28 2015-01-21 中国科学院自动化研究所 Gait recognition method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Action Recognition Based on A Bag of 3D Points; Wanqing Li et al.; 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops; 2010; pp. 9-14
A Survey of Human Action and Behavior Recognition (人体动作行为识别研究综述); Li Ruifeng (李瑞峰) et al.; Pattern Recognition and Artificial Intelligence (模式识别与人工智能); January 2014; Vol. 27, No. 1; pp. 33-46

Also Published As

Publication number Publication date
CN105740773A (en) 2016-07-06

Similar Documents

Publication Publication Date Title
CN105740773B (en) Activity recognition method based on deep learning and multi-scale information
Ahmed The impact of filter size and number of filters on classification accuracy in CNN
Hou et al. Identification of animal individuals using deep learning: A case study of giant panda
CN103988232B (en) Motion manifold is used to improve images match
CN107527351A (en) A kind of fusion FCN and Threshold segmentation milking sow image partition method
KR20160101973A (en) System and method for identifying faces in unconstrained media
CN105574510A (en) Gait identification method and device
CN105975932B (en) Gait Recognition classification method based on time series shapelet
CN103324677B (en) Hierarchical fast image global positioning system (GPS) position estimation method
CN107918772B (en) Target tracking method based on compressed sensing theory and gcForest
Singh et al. Nature and biologically inspired image segmentation techniques
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN109508675A (en) A kind of pedestrian detection method for complex scene
Shang et al. Using lightweight deep learning algorithm for real-time detection of apple flowers in natural environments
Aydogdu et al. Comparison of three different CNN architectures for age classification
CN110532874A (en) A kind of generation method, storage medium and the electronic equipment of thingness identification model
CN109886153A (en) A kind of real-time face detection method based on depth convolutional neural networks
Chalasani et al. Egocentric gesture recognition for head-mounted ar devices
Sun et al. An improved CNN-based apple appearance quality classification method with small samples
CN106845456A (en) A kind of method of falling over of human body monitoring in video monitoring system
CN112861718A (en) Lightweight feature fusion crowd counting method and system
CN110232331A (en) A kind of method and system of online face cluster
Lin et al. Bird posture recognition based on target keypoints estimation in dual-task convolutional neural networks
JP2019204505A (en) Object detection deice, object detection method, and storage medium
CN113869276A (en) Lie recognition method and system based on micro-expression

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190201

Termination date: 20220125

CF01 Termination of patent right due to non-payment of annual fee