CN106909938A - View-angle-independent activity recognition method based on a deep learning network - Google Patents

View-angle-independent activity recognition method based on a deep learning network

Info

Publication number
CN106909938A
Authority
CN
China
Prior art keywords
deep learning
model
viewing angle
learning network
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710082263.5A
Other languages
Chinese (zh)
Other versions
CN106909938B (en)
Inventor
王传旭
胡国锋
刘继超
杨建滨
孙海峰
崔雪红
李辉
刘云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Shengruida Technology Co ltd
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Science and Technology
Priority to CN201710082263.5A
Publication of CN106909938A
Application granted
Publication of CN106909938B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention proposes a view-angle-independent activity recognition method based on a deep learning network, comprising the following steps: the video frame images under a given viewing angle are input, and low-level features are extracted and processed using deep learning; the obtained low-level features are modeled in chronological order to form a cuboid model; the cuboid models of all viewing angles are converted into a view-invariant cylindrical feature-space mapping, which is then input to a classifier for training, yielding a view-independent classifier of video behavior. The technical scheme of the invention analyzes human behavior under multiple viewing angles with a deep learning network and improves the robustness of the classification model; it is especially suited to training and learning on big data, where its advantages are best exploited.

Description

View-angle-independent activity recognition method based on a deep learning network
Technical field
The present invention relates to the technical field of computer vision, and in particular to a view-angle-independent activity recognition method based on a deep learning network.
Background art
With the rapid development of information technology, computer vision, together with emerging concepts such as VR, AR and artificial intelligence, has entered its best period of development, and video behavior analysis, as one of the most important topics in the computer vision field, has attracted growing attention from scholars at home and abroad. Video behavior analysis plays a large role in fields such as video surveillance, human-computer interaction, medical nursing and video retrieval; in currently popular projects such as driverless cars, it remains highly challenging. Owing to the complexity and diversity of human actions, compounded by factors such as self-occlusion of the human body across multiple viewing angles, multiple scales, and rotation and translation of the viewpoint, video activity recognition is very difficult. How to accurately recognize and analyze human behavior observed from multiple angles in real life has always been an important research topic, and the demands placed on behavior analysis keep rising.
Traditional research methods include the following:
Methods based on spatio-temporal interest points: spatio-temporal interest points are extracted from the video frame images, then modeled and analyzed, and finally classified.
Methods based on the human skeleton: human skeleton information is extracted by an algorithm or a depth camera, the behavior is described and modeled from this skeleton information, and the video behavior is then classified.
Behavior analysis methods based on spatio-temporal interest points and skeleton information achieved notable results under the traditional single-view or single-person settings. However, with the emergence of a series of complex problems in high-traffic areas such as streets, airports and stations, including occlusion of the human body, illumination variation and viewpoint changes, these two kinds of analysis methods alone often fail to meet practical requirements in real life, and the robustness of the algorithms is sometimes poor.
Summary of the invention
In order to overcome the above defects of the prior art, the present invention proposes a view-angle-independent activity recognition method based on a deep learning network, which analyzes human behavior under multiple viewing angles with a deep learning network and improves the robustness of the classification model; deep learning networks are especially suited to training and learning on big data, where their advantages are best exploited.
The technical scheme of the invention is realized as follows:
A view-angle-independent activity recognition method based on a deep learning network comprises a training process, in which a classifier is trained from a training sample set, and an identification process, in which the classifier is used to recognize test samples;
The training process comprises the following steps:
S1) the video frame images Image 1 to Image i under a given viewing angle are input in chronological order;
S2) low-level features are extracted from the images input in step S1) using a CNN (Convolutional Neural Network) and pooled, and the pooled low-level features are enhanced using an STN (Spatial Transformer Network);
S3) the enhanced feature maps from step S2) are pooled and input to an RNN (Recurrent Neural Network) layer for temporal modeling, yielding a temporally associated cuboid model;
S4) steps S1) to S3) are repeated to obtain the spatial cuboid models of the same behavior under multiple viewing angles; the spatial cuboid models of the individual viewing angles are converted into a view-invariant cylindrical feature-space mapping, which is input to the classifier for training as a training sample of that behavior class;
S5) the above steps are repeated to obtain the view-independent classifiers of the various behaviors;
The identification process comprises the following steps:
S6) the video frame images under a given viewing angle are input and subjected to low-level feature extraction and modeling according to steps S1) to S3), yielding the spatial cuboid model under that viewing angle;
S7) the spatial cuboid model obtained in step S6) is converted into a view-invariant cylindrical feature-space mapping, which is input to the classifier for identification to obtain the video behavior category.
In the above technical scheme, step S2) preferably extracts the low-level features with a three-layer convolution operation; steps S2) and S3) preferably reduce the dimensionality of the feature maps by max pooling.
In the above technical scheme, step S3) yields the spatial cuboid model of the same behavior under a single viewing angle; steps S1) to S3) are carried out repeatedly to obtain the spatial cuboid models of the same behavior under multiple viewing angles.
In the technical scheme of the invention, an LSTM (Long Short-Term Memory) network is preferably used for the temporal modeling: because the back-propagation of a deep learning network relies on stochastic gradient descent, the special gate operations of the LSTM can prevent the vanishing-gradient problem in each layer.
In the above technical scheme, step S4) specifically comprises:
S41) steps S1) to S3) are repeated to obtain the spatial cuboid model of each viewing angle of the same behavior, and the models are integrated into a cylindrical space with x, y, z as coordinate axes, the cylindrical space representing the trajectory description of the motion features under each viewing angle;
S42) a polar coordinate transform is applied to the model obtained in step S41), yielding an angle-invariant cylindrical space mapping.
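The formula image of the original filing is not reproduced in this text. A plausible reconstruction, assuming the standard Cartesian-to-cylindrical mapping (x, y, z) to (r, θ, z) implied by the surrounding description, is:

```latex
% Hedged reconstruction; the original formula image is unavailable.
\begin{aligned}
r      &= \sqrt{x^{2} + y^{2}} \\
\theta &= \arctan\!\left(y / x\right) \\
z      &= z
\end{aligned}
```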
The above technical scheme preferably further comprises: S0) building a data set; the present invention preferably employs the IXMAS data set.
Compared with the prior art, the technical scheme of the invention differs in the following respects:
1. Low-level features are extracted with a CNN, yielding global features rather than the key points obtained by conventional methods.
2. The obtained global features are enhanced with the STN method instead of being modeled directly.
3. Temporal modeling of the enhanced, dimension-reduced global features is performed with an LSTM network, adding the important temporal information and making the features temporally associated.
4. A polar coordinate transform is applied to the spatial cuboid models of the individual viewing angles of the same behavior, yielding an angle-invariant cylindrical space mapping; training and classification are then completed by a CNN.
The advantage of the invention is that the CNN yields global high-level features which, enhanced by the STN, are robust to real-life video; temporal information is then established with an RNN network, and finally a polar coordinate transform fuses the features of the different viewing angles. The resulting angle-invariant descriptors are trained and classified with a CNN, without the traditional skeleton and key-point extraction, so the global features are more comprehensive; the RNN network captures the inter-frame temporal information, so the behavior description is more complete and more widely applicable.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical schemes of the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative labor.
Fig. 1 is a schematic flow chart of the training process of the invention;
Fig. 2 is a schematic flow chart of the identification process of the invention;
Fig. 3 is a schematic flow chart of general human behavior recognition;
Fig. 4 is a simplified flow chart of low-level feature extraction and modeling;
Fig. 5 is a processing flow chart of a general CNN;
Fig. 6 is a simplified structural diagram of a general RNN;
Fig. 7 is a block diagram of an LSTM;
Fig. 8 is a flow chart of the integrated classification of the viewing angles;
Fig. 9 is a schematic diagram of the Motion History Volume of Fig. 8 after the polar coordinate transform.
Specific embodiments
The technical schemes in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative labor fall within the protection scope of the invention.
As shown in Fig. 1 and Fig. 2, the view-angle-independent activity recognition method based on a deep learning network of the invention comprises a training process, in which a classifier is trained from a training sample set, and an identification process, in which the classifier is used to recognize test samples;
The training process, shown in Fig. 1, comprises the following steps:
S1) the video frame images Image 1 to Image i under a given viewing angle are input in chronological order;
S2) low-level features are extracted from the images input in step S1) using a CNN and pooled, and the pooled low-level features are enhanced using an STN;
S3) the enhanced feature maps from step S2) are pooled and input to an RNN for temporal modeling, yielding a temporally associated cuboid model;
S4) steps S1) to S3) are repeated to obtain the spatial cuboid models of the same behavior under multiple viewing angles; the spatial cuboid models of the individual viewing angles are converted into a view-invariant cylindrical feature-space mapping, which is input to the classifier for training as a training sample of that behavior class;
S5) the above steps are repeated to obtain the view-independent classifiers of the various behaviors.
The identification process, shown in Fig. 2, comprises the following steps:
S6) the video frame images under a given viewing angle are input and subjected to low-level feature extraction and modeling according to steps S1) to S3), yielding the spatial cuboid model under that viewing angle;
S7) the spatial cuboid model obtained in step S6) is converted into a view-invariant cylindrical feature-space mapping, which is input to the classifier for identification to obtain the video behavior category.
In the above technical scheme, step S2) preferably extracts the low-level features with a three-layer convolution operation; steps S2) and S3) preferably reduce the dimensionality of the feature maps by max pooling.
In the above technical scheme, step S3) yields the spatial cuboid model of the same behavior under a single viewing angle; steps S1) to S3) are carried out repeatedly to obtain the spatial cuboid models of the same behavior under multiple viewing angles.
In the technical scheme of the invention, an LSTM (Long Short-Term Memory) network is preferably used for the temporal modeling: because the back-propagation of a deep learning network relies on stochastic gradient descent, the special gate operations of the LSTM can prevent the vanishing-gradient problem in each layer.
In the above technical scheme, step S4) specifically comprises:
S41) steps S1) to S3) are repeated to obtain the spatial cuboid model of each viewing angle of the same behavior, and the models are integrated into a cylindrical space with x, y, z as coordinate axes, the cylindrical space representing the trajectory description of the motion features under each viewing angle;
S42) the polar coordinate transform given above is applied to the model obtained in step S41), yielding an angle-invariant cylindrical space mapping.
The above technical scheme further comprises: S0) building a data set.
The present invention preferably employs the IXMAS data set, which contains five different viewing angles and 14 actions performed by each of 12 persons, with each action repeated three times. Eleven of the persons serve as the training data set and the remaining person as the test data set.
Specifically, to recognize the behavior "running", for example, the running videos of the 12 persons under the five viewing angles are collected first; the running videos of 11 persons serve as the training data set and the remaining person as the validation data set. The video frame images of one person's running video under one viewing angle are first processed according to steps S1) to S3), finally yielding the temporally associated cuboid model of the "running" video behavior under that viewing angle, i.e. the spatial cuboid model of the "running" behavior under that viewing angle; steps S1) to S3) are then repeated to obtain, in turn, the spatial cuboid models of the "running" behavior under the other four viewing angles. The spatial cuboid models of the "running" behavior under the five viewing angles are converted into a view-invariant cylindrical feature-space mapping, which is input for classifier training as that person's training sample of the "running" behavior class. After training with the training samples of several different persons, the view-independent classifier of the "running" behavior is obtained. View-independent classifiers of the various video behaviors can be built in the same way. (A sketch of this data split is given below.)
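As a concrete illustration of this leave-one-subject-out protocol, the following is a minimal sketch assuming a hypothetical per-subject sample index; the names Sample and samples_by_subject are illustrative, not from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Sample:
    frames: list      # video frame images, in chronological order
    view: int         # viewing angle index, 0..4 for IXMAS
    action: str       # e.g. "running"

def leave_one_subject_out(samples_by_subject: Dict[str, List[Sample]],
                          test_subject: str):
    """IXMAS-style split: 11 subjects for training, 1 held out for testing."""
    train = [s for subj, samples in samples_by_subject.items()
             if subj != test_subject for s in samples]
    test = samples_by_subject[test_subject]
    return train, test
```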
During identification, the above steps S6) and S7) are performed: the video frame images of one person in the test sample under a certain viewing angle are first processed according to steps S1) to S3) to obtain the spatial cuboid model of the behavior under that viewing angle, which is then converted by the polar coordinate transform into a cylindrical feature-space mapping and input to the classifier to identify the behavior category. The identification process for the other viewing angles is the same.
For a better understanding and illustration of the technical scheme of the invention, the relevant techniques involved in the above technical scheme are explained and analyzed in detail below.
The model of the method of the invention comprises two main stages: the first extracts and models the low-level features, and the second fuses and classifies the viewing angles. The main innovative work is as follows.
The general flow of human behavior recognition is shown in Fig. 3. The feature extraction and feature representation stages in the figure are the emphasis of behavior recognition; the results of these stages ultimately determine the identification accuracy and the robustness of the algorithm. The present invention performs the feature extraction with deep learning.
Fig. 4 shows a simplified flow chart of low-level feature extraction and modeling.
In the technical scheme of the invention, the deep learning framework used is Caffe. The video frames Image 1 to Image i under a given viewing angle in Fig. 4 are input to the network in chronological order. Features are first extracted from the input images with a CNN and then enhanced with an STN so that they acquire a certain robustness to translation, scale variation and angle change; the feature maps are then pooled (max pooling is used here), and the pooled feature maps are input to the RNN layer for temporal modeling, finally yielding feature map sequences with inter-frame temporal relevance.
Concretely, the technical scheme of the invention extracts the low-level features with a three-layer convolution operation and then reduces their dimensionality by max pooling. The pooled feature maps are input to the STN layer for feature enhancement; the function of the STN network is to make the features robust to translation, rotation and scale variation. The feature maps output by the STN are max-pooled for a second dimensionality reduction and then input to the RNN network, which inserts the temporal information; finally, the resulting feature maps are combined into spatial cuboids in chronological order. The RNN network used in the invention is an LSTM network: because the back-propagation of a deep learning network relies on stochastic gradient descent, the special gate operations of the LSTM can prevent the vanishing-gradient problem in each layer. (A sketch of this pipeline is given below.)
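For illustration only, the following is a minimal sketch of the CNN, STN, pooling and LSTM pipeline just described, written in PyTorch rather than the Caffe framework the patent names; all layer sizes, module names and the 64x64 input resolution are assumptions, not values from the filing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Spatial Transformer: predicts an affine warp and resamples the feature map."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 6),
        )
        # start from the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

class ViewPipeline(nn.Module):
    """Per-view pipeline: 3-layer CNN -> max pool -> STN -> max pool -> LSTM."""
    def __init__(self, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                      # three-layer convolution
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # first dimensionality reduction
        )
        self.stn = STN(32)
        self.pool = nn.MaxPool2d(2)                    # second dimensionality reduction
        self.lstm = nn.LSTM(input_size=32 * 16 * 16, hidden_size=hidden,
                            batch_first=True)

    def forward(self, frames):                         # frames: (B, T, 3, 64, 64)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))             # per-frame low-level features
        x = self.pool(self.stn(x)).flatten(1)          # enhance, pool, vectorize
        out, _ = self.lstm(x.view(b, t, -1))           # temporal modeling
        return out                                     # (B, T, hidden)
```

Under these assumptions, a forward pass on frames of shape (2, 10, 3, 64, 64) returns a (2, 10, 128) feature sequence, which plays the role of the temporally associated cuboid model for one viewing angle.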
In the above technical scheme, the CNN is an efficient recognition method developed in recent years that has attracted wide attention. In the 1960s, while studying the neurons responsible for local sensitivity and direction selection in the cat's visual cortex, Hubel and Wiesel found that a unique network structure could effectively reduce the complexity of a feedback neural network, and the CNN was subsequently proposed. Nowadays the CNN has become one of the research hotspots of numerous scientific fields; particularly in pattern classification, since the network avoids complicated image pre-processing and can take the original image directly as input, it has found wide application.
Generally, the basic structure of a CNN comprises two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and the local feature is extracted; once a local feature is extracted, its positional relationship to the other features is also determined. The second is the feature mapping layer: each computational layer of the network is composed of multiple feature maps, each feature map is a plane, and the weights of all neurons in a plane are equal.
In the technical scheme of the invention, the feature mapping layers are used to extract the global low-level features from the video frame images, after which the low-level features are processed at deeper levels.
The generalized processing flow of a CNN is shown in Fig. 5.
The layers used in the technical scheme of the invention are those producing the feature maps obtained after convolution; the pooling and fully connected layers that follow are ignored. A CNN obtains the feature information of a single image, whereas video information is to be processed here, so temporal information must be introduced; simply using a CNN cannot meet the requirements of processing video behavior.
In the above technical scheme, the RNN, or recurrent neural network, was developed on the basis of feed-forward neural networks (FNNs). Unlike traditional FNNs, the RNN introduces directed cycles and can handle problems in which successive inputs are correlated. The RNN comprises input units, whose input set is denoted {x_0, x_1, ..., x_{t-1}, x_t, x_{t+1}, ...}, and output units, whose output set is denoted {o_0, o_1, ..., o_{t-1}, o_t, o_{t+1}, ...}. The RNN also comprises hidden units, whose output set is denoted {s_0, s_1, ..., s_{t-1}, s_t, s_{t+1}, ...}; these hidden units do the main work.
Fig. 6 shows the simplified structure of a general RNN. In Fig. 6, one one-way information flow runs from the input units to the hidden units, while another one-way information flow runs from the hidden units to the output units. In some cases the RNN breaks the latter limitation and guides information from the output units back to the hidden units; these connections are called "back projections". Moreover, the input of the hidden layer also includes the state of the previous hidden layer, i.e. the nodes within the hidden layer can be self-connected as well as interconnected, so that s_t = f(U x_t + W s_{t-1}). The linking of temporal information is thus achieved inside the hidden layer, and no extra handling of temporal information is needed; this is a great advantage of the RNN when processing video behavior features. Consequently, in deep learning, processing that involves timing information is generally handed to an RNN.
A new model for processing temporal information was in turn developed on the basis of the RNN: Long Short-Term Memory (LSTM). Because of the stochastic gradient descent used in the back-propagation of deep learning networks, the RNN suffers from a vanishing-gradient problem: the perception that later time nodes have of earlier time nodes decreases. The LSTM therefore introduces a core element, the cell. A rough block diagram of the LSTM is shown in Fig. 7.
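For reference, the gate operations the description refers to are the well-known LSTM equations (not reproduced in the patent itself); in standard notation:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of the cell state c_t is what allows gradients to propagate across many time steps without vanishing.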
Fig. 8 shows the flow chart of the integrated classification of the viewing angles.
According to the method of Fig. 4, the spatial cuboid models of the same action under multiple viewing angles are obtained; the spatial cuboid models of the individual viewing angles are then integrated into a cylindrical space with x, y, z as coordinate axes, the cylindrical space representing the trajectory description of the motion features under each viewing angle; a polar coordinate transform is then applied mathematically, carrying the space into one with r, θ, z as coordinate axes, using the formula reconstructed above.
An angle-invariant cylindrical space mapping (Invariant Cylinder Space Map) is thereby obtained, and finally the resulting cylindrical space mapping is input to the classifier to obtain the behavior category. A CNN rather than an SVM classifier is used for the classification here, because the CNN was originally devised for classification. The Motion History Volume of Fig. 8 and the model after the polar coordinate transform are shown in Fig. 9. (A sketch of the cylindrical fusion step is given below.)
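To make the fusion step concrete, the following is a minimal sketch, assuming NumPy, of accumulating (x, y, z) feature-trajectory points into an (r, θ, z) cylindrical map via the Cartesian-to-cylindrical transform reconstructed above; the function name and binning resolutions are illustrative, not from the patent.

```python
import numpy as np

def cylinder_map(points, r_bins=16, theta_bins=36, z_bins=16):
    """Accumulate (x, y, z) feature points into an (r, theta, z) histogram.

    points: (N, 3) array of feature-trajectory coordinates from one or
    more viewing angles, already registered to a common origin.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2)                 # radial distance
    theta = np.arctan2(y, x)                 # angle in (-pi, pi]
    hist, _ = np.histogramdd(
        np.stack([r, theta, z], axis=1),
        bins=(r_bins, theta_bins, z_bins),
        range=[(0, r.max() + 1e-9), (-np.pi, np.pi), (z.min(), z.max() + 1e-9)],
    )
    return hist

# Fusing several views: concatenate their point sets before binning to
# build the cylindrical space mapping described in the patent.
```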
The low-level information extracted with the deep learning method of the technical scheme of the invention is of a higher level than the spatio-temporal interest points and skeleton information of conventional methods, and its robustness is better.
The above are only the preferred embodiments of the present invention and are not intended to limit the invention; any modification, equivalent substitution, improvement and the like made within the spirit and principle of the invention shall be included within the protection scope of the invention.

Claims (6)

1. A view-angle-independent activity recognition method based on a deep learning network, comprising a training process, in which a classifier is trained from a training sample set, and an identification process, in which the classifier is used to recognize test samples, characterized in that:
The training process comprises the following steps:
S1) the video frame images Image 1 to Image i under a given viewing angle are input in chronological order;
S2) low-level features are extracted from the images input in step S1) using a CNN and pooled, and the pooled low-level features are enhanced using an STN;
S3) the enhanced feature maps from step S2) are pooled and input to an RNN for temporal modeling, yielding a temporally associated cuboid model;
S4) steps S1) to S3) are repeated to obtain the spatial cuboid models of the same behavior under multiple viewing angles; the spatial cuboid models of the individual viewing angles are converted into a view-invariant cylindrical feature-space mapping, which is input to the classifier for training as a training sample of that behavior class;
S5) the above steps are repeated to obtain the view-independent classifiers of the various behaviors;
The identification process comprises the following steps:
S6) the video frame images under a given viewing angle are input and subjected to low-level feature extraction and modeling according to steps S1) to S3), yielding the spatial cuboid model under that viewing angle;
S7) the spatial cuboid model obtained in step S6) is converted into a view-invariant cylindrical feature-space mapping, which is input to the classifier for identification to obtain the video behavior category.
2. The view-angle-independent activity recognition method based on a deep learning network according to claim 1, characterized in that:
step S2) extracts the low-level features with a three-layer convolution operation.
3. The view-angle-independent activity recognition method based on a deep learning network according to claim 2, characterized in that:
steps S2) and S3) reduce the dimensionality of the feature maps by max pooling.
4. The view-angle-independent activity recognition method based on a deep learning network according to claim 1, characterized in that:
step S3) performs the temporal modeling with an LSTM network.
5. The view-angle-independent activity recognition method based on a deep learning network according to claim 1, characterized in that step S4) specifically comprises:
S41) steps S1) to S3) are repeated to obtain the spatial cuboid model of each viewing angle of the same behavior, and the models are integrated into a cylindrical space with x, y, z as coordinate axes, the cylindrical space representing the trajectory description of the motion features under each viewing angle;
S42) a polar coordinate transform is applied to the model obtained in step S41), yielding an angle-invariant cylindrical space mapping.
6. The view-angle-independent activity recognition method based on a deep learning network according to claim 1, characterized in that it further comprises:
S0) building a data set.
CN201710082263.5A 2017-02-16 2017-02-16 View-angle-independent activity recognition method based on a deep learning network Active CN106909938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710082263.5A CN106909938B (en) 2017-02-16 2017-02-16 View-angle-independent activity recognition method based on a deep learning network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710082263.5A CN106909938B (en) 2017-02-16 2017-02-16 View-angle-independent activity recognition method based on a deep learning network

Publications (2)

Publication Number Publication Date
CN106909938A 2017-06-30
CN106909938B CN106909938B (en) 2020-02-21

Family

ID=59208388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710082263.5A Active CN106909938B (en) 2017-02-16 2017-02-16 View-angle-independent activity recognition method based on a deep learning network

Country Status (1)

Country Link
CN (1) CN106909938B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463878A (en) * 2017-07-05 2017-12-12 成都数联铭品科技有限公司 Human bodys' response system based on deep learning
CN107609541A (en) * 2017-10-17 2018-01-19 哈尔滨理工大学 A kind of estimation method of human posture based on deformable convolutional neural networks
CN107679522A (en) * 2017-10-31 2018-02-09 内江师范学院 Action identification method based on multithread LSTM
CN108121961A (en) * 2017-12-21 2018-06-05 华自科技股份有限公司 Inspection Activity recognition method, apparatus, computer equipment and storage medium
CN108764050A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Skeleton Activity recognition method, system and equipment based on angle independence
CN112287754A (en) * 2020-09-23 2021-01-29 济南浪潮高新科技投资发展有限公司 Violence detection method, device, equipment and medium based on neural network
CN112686111A (en) * 2020-12-23 2021-04-20 中国矿业大学(北京) Attention mechanism-based multi-view adaptive network traffic police gesture recognition method
CN113111721A (en) * 2021-03-17 2021-07-13 同济大学 Human behavior intelligent identification method based on multi-unmanned aerial vehicle visual angle image data driving
CN113239819A (en) * 2021-05-18 2021-08-10 西安电子科技大学广州研究院 Visual angle normalization-based skeleton behavior identification method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1218936A (en) * 1997-09-26 1999-06-09 松下电器产业株式会社 Hand gesture identifying device
CN101216896A (en) * 2008-01-14 2008-07-09 浙江大学 An identification method for movement by human bodies irrelevant with the viewpoint based on stencil matching
CN103310233A (en) * 2013-06-28 2013-09-18 青岛科技大学 Similarity mining method of similar behaviors between multiple views and behavior recognition method
CN105956560A (en) * 2016-05-06 2016-09-21 电子科技大学 Vehicle model identification method based on pooling multi-scale depth convolution characteristics
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1218936A (en) * 1997-09-26 1999-06-09 松下电器产业株式会社 Hand gesture identifying device
CN101216896A (en) * 2008-01-14 2008-07-09 浙江大学 An identification method for movement by human bodies irrelevant with the viewpoint based on stencil matching
CN103310233A (en) * 2013-06-28 2013-09-18 青岛科技大学 Similarity mining method of similar behaviors between multiple views and behavior recognition method
CN105956560A (en) * 2016-05-06 2016-09-21 电子科技大学 Vehicle model identification method based on pooling multi-scale depth convolution characteristics
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEXANDROS: "View-independent human action recognition based on multi-view action images and discriminant learning", 《IVMSP 2013》 *
JEFF DONAHUE: "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", 《IEEE》 *
MYUNG-CHEOL ROH: "View-independent human action recognition with Volume Motion Template", 《PATTERN RECOGNITION LETTERS》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463878A (en) * 2017-07-05 2017-12-12 成都数联铭品科技有限公司 Human bodys' response system based on deep learning
CN107609541A (en) * 2017-10-17 2018-01-19 哈尔滨理工大学 A kind of estimation method of human posture based on deformable convolutional neural networks
CN107609541B (en) * 2017-10-17 2020-11-10 哈尔滨理工大学 Human body posture estimation method based on deformable convolution neural network
CN107679522B (en) * 2017-10-31 2020-10-13 内江师范学院 Multi-stream LSTM-based action identification method
CN107679522A (en) * 2017-10-31 2018-02-09 内江师范学院 Action identification method based on multithread LSTM
CN108121961A (en) * 2017-12-21 2018-06-05 华自科技股份有限公司 Inspection Activity recognition method, apparatus, computer equipment and storage medium
CN108764050A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Skeleton Activity recognition method, system and equipment based on angle independence
CN108764050B (en) * 2018-04-28 2021-02-26 中国科学院自动化研究所 Method, system and equipment for recognizing skeleton behavior based on angle independence
CN112287754A (en) * 2020-09-23 2021-01-29 济南浪潮高新科技投资发展有限公司 Violence detection method, device, equipment and medium based on neural network
CN112686111A (en) * 2020-12-23 2021-04-20 中国矿业大学(北京) Attention mechanism-based multi-view adaptive network traffic police gesture recognition method
CN113111721A (en) * 2021-03-17 2021-07-13 同济大学 Human behavior intelligent identification method based on multi-unmanned aerial vehicle visual angle image data driving
CN113111721B (en) * 2021-03-17 2022-07-05 同济大学 Human behavior intelligent identification method based on multi-unmanned aerial vehicle visual angle image data driving
CN113239819A (en) * 2021-05-18 2021-08-10 西安电子科技大学广州研究院 Visual angle normalization-based skeleton behavior identification method, device and equipment
CN113239819B (en) * 2021-05-18 2022-05-03 西安电子科技大学广州研究院 Visual angle normalization-based skeleton behavior identification method, device and equipment

Also Published As

Publication number Publication date
CN106909938B (en) 2020-02-21

Similar Documents

Publication Publication Date Title
CN106909938A (en) Viewing angle independence Activity recognition method based on deep learning network
Liao et al. Deep facial spatiotemporal network for engagement prediction in online learning
Ahmed The impact of filter size and number of filters on classification accuracy in CNN
CN106709461B (en) Activity recognition method and device based on video
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
CN111652066B (en) Medical behavior identification method based on multi-self-attention mechanism deep learning
CN108830157A (en) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN107609460A (en) A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN109902546A (en) Face identification method, device and computer-readable medium
CN109902798A (en) The training method and device of deep neural network
Yan et al. Multi-attributes gait identification by convolutional neural networks
CN107679491A (en) A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data
CN110222634A (en) A kind of human posture recognition method based on convolutional neural networks
CN112101241A (en) Lightweight expression recognition method based on deep learning
CN106951858A (en) A kind of recognition methods of personage's affiliation and device based on depth convolutional network
CN110097029B (en) Identity authentication method based on high way network multi-view gait recognition
CN106709482A (en) Method for identifying genetic relationship of figures based on self-encoder
CN106980830A (en) One kind is based on depth convolutional network from affiliation recognition methods and device
CN111950455A (en) Motion imagery electroencephalogram characteristic identification method based on LFFCNN-GRU algorithm model
CN110135244B (en) Expression recognition method based on brain-computer collaborative intelligence
CN106980831A (en) Based on self-encoding encoder from affiliation recognition methods
WO2023108873A1 (en) Brain network and brain addiction connection calculation method and apparatus
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
Zhu et al. Indoor scene segmentation algorithm based on full convolutional neural network
CN112668486A (en) Method, device and carrier for identifying facial expressions of pre-activated residual depth separable convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220114

Address after: Room 403-2, Building A2, Qingdao National University Science Park, No. 127 Huizhiqiao Road, High-tech Zone, Qingdao, Shandong, 266000

Patentee after: Qingdao shengruida Technology Co.,Ltd.

Address before: Laoshan campus, No. 99 Songling Road, Laoshan District, Qingdao, Shandong, China, 266000

Patentee before: QINGDAO University OF SCIENCE AND TECHNOLOGY