CN106845375A - Action identification method based on hierarchical feature learning - Google Patents

Action identification method based on hierarchical feature learning

Info

Publication number
CN106845375A
Authority
CN
China
Prior art keywords
feature
action
blocks
sequence
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710010477.1A
Other languages
Chinese (zh)
Inventor
李文辉
聂为之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201710010477.1A
Publication of CN106845375A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action identification method based on hierarchical feature learning, comprising the following steps: clustering the features of the region blocks in the training set and re-characterizing all region blocks with a bag-of-words model to obtain the features of high-level blocks; mean-pooling the features of all blocks in one video to obtain the feature set of the video sequence; modeling the feature set of the video sequence with a support vector machine to obtain model parameters; selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence. By recognizing actions through hierarchical feature learning, the method extracts features with better discrimination and richness to characterize the action, making model learning more efficient and improving the recognition rate of action recognition. Experiments verify that the method achieves high accuracy and meets various needs of practical applications.

Description

Action identification method based on hierarchical feature learning
Technical Field
The invention relates to the field of image processing and pattern recognition, in particular to an action recognition method based on hierarchical feature learning.
Background
Computer vision technology simulates human vision to identify and understand the surrounding environment by processing and analyzing two-dimensional images or three-dimensional videos. As images and videos increasingly become the means by which people acquire visual information, computer vision has developed rapidly. As part of computer vision research, human motion analysis and recognition based on visual information is one of the currently popular research directions. Human action recognition refers to recognizing human actions in an image sequence or video through computer vision and machine learning methods. In recent years, human action recognition has been widely applied to intelligent monitoring, video retrieval, human-computer interaction, behavior analysis, virtual reality and the like, and good progress has been made.
According to how features are extracted and modeled from samples, human action recognition methods can be divided into two categories: methods based on the spatio-temporal whole and methods based on time series. In spatio-temporal-whole methods, researchers view the video data as a three-dimensional spatio-temporal cube in which the human motion resides. Under the assumption that spatio-temporal data of the same motion are similar (e.g., reference [1]), the foreground portion of the video data is extracted and reorganized, and actions are recognized by comparing the similarity of the foreground data across videos. 3-D automatic segmentation of video data can be achieved by clustering cubes of similar color with a hierarchical mean-shift algorithm (e.g., reference [2]); the subset of the segmented data that best matches a motion model is then searched to realize motion recognition. Motion recognition can also be based on the trajectories of the human body (e.g., reference [3]): the motion of the human body in the video is regarded as a trajectory that changes in space and time, and trajectories formed by different motions differ to a certain extent, so a trajectory can describe a motion. A view-invariant human motion recognition method is obtained by projecting the spatio-temporal curvature of the three-dimensional motion trajectory of the human hand onto the two-dimensional trajectory and expressing the trajectory as the feature of the motion. By extracting the trajectories of important joints (such as the head, hands and feet) during human motion, the similarity between motion samples is judged according to similarity invariance. In recent years, the most widely applied spatio-temporal-whole method characterizes human actions with spatio-temporal interest points. Spatio-temporal interest point features capture the appearance of the human body and the local saliency of the motion; owing to their locality, they are robust to complex backgrounds, scale changes and the diversity of action types in videos. A common spatio-temporal interest point is the STIP feature (e.g., reference [5]), which extends the two-dimensional Harris corner detector (e.g., reference [6]) to Harris3D in 3-D space-time and uses a joint HOG/HOF representation as the interest point descriptor. The Cuboids interest point feature (e.g., reference [7]) increases the number of detected interest points by applying Gabor filtering in the time domain, and its descriptor is obtained by describing brightness gradients over a surrounding region six times the detection scale of the interest point. Feature points can also be selected by dense sampling and trajectory tracking (e.g., reference [8]), yielding dense-trajectory features that use gradients, optical flow and motion boundary histograms as descriptors. Many other features have been applied, such as the 3-dimensional Scale-Invariant Feature Transform (3D-SIFT) (e.g., reference [9]), Speeded-Up Robust Features (SURF) (e.g., reference [10]) and MoSIFT (e.g., reference [11]).
In time-series-based methods, researchers view a video as a sequence of images, each image containing human motion features, and judge the action type by comparing sequences. Because human motions differ across individuals, for example in amplitude and speed, the dynamic time warping algorithm can handle this problem well (e.g., reference [12]). Hidden Markov Models (HMMs) have been used to recognize human actions (e.g., reference [13]): each frame of the video is taken as a feature vector, HMM modeling is performed on these feature vectors to find the implicit state-transition relationships between sequences, a state-based model is established, and actions are then recognized. Furthermore, by coupling multiple HMMs, the Coupled Hidden Markov Model (CHMM) was proposed (e.g., reference [14]) to model the interaction between multiple people. Another widely used method for time series is the Conditional Random Field (CRF) (e.g., reference [15]), which divides the motion sequence into a number of consecutive units and recognizes human motion according to the transition rules between adjacent units. To handle different timing models, CRFs have been extended by many research efforts, such as the hidden-state CRF (e.g., reference [16]), the dynamic CRF (e.g., reference [17]) and the semi-Markov random field model (e.g., reference [18]).
The field of action recognition mainly faces the following challenges:
1. Human actions take diverse forms. In an action sequence, different people often perform the same action differently because of personal habits, which adds difficulty to action recognition. Meanwhile, different devices and different action types further diversify the forms of the action sequences. Providing a detection method that is robust to the forms of human actions is therefore important for recognizing them.
2. The background of the action is complex. To match real-world conditions, the recording environments of many motion sequence samples are not limited to simple, fixed backgrounds; many come from complex and variable environments, and a complex background is a great challenge for modeling human motion.
3. Existing human motion recognition features are deficient: most are manually designed and generally applicable, but they cannot capture the uniqueness of a motion sample well. How to characterize motion samples with features learned from the samples themselves is therefore very important for motion recognition.
Disclosure of Invention
The invention provides an action recognition method based on hierarchical feature learning. It addresses the problems that manually designed features cannot capture the characteristics of individual samples, so that the same action type shows large differences in appearance, the extracted features are monotonous, and model learning is difficult. The method is described in detail as follows:
a motion recognition method based on hierarchical feature learning, the motion recognition method comprising the steps of:
clustering the features of the region blocks in the training set, re-characterizing all region blocks with a bag-of-words model to obtain the features of high-level blocks, and mean-pooling the features of all blocks in one video to obtain the feature set of the video sequence;
modeling the feature set of the video sequence with a support vector machine to obtain model parameters;
selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence.
The motion recognition method further comprises:
selecting training video sequences and candidate prediction video sequences from each class of the motion video data set.
The training video sequences are divided into space-time blocks of equal size, the covariance features of the blocks are constructed from the pixel information of the blocks, and these covariance features serve as the initialization features of the blocks, forming an action data set.
The hierarchical feature learning specifically comprises:
clustering the blocks in the training set with a clustering method, and then re-characterizing all blocks with a bag-of-words model to obtain the features of the bottom-layer blocks;
pooling, which fuses the bottom-layer features of all blocks surrounding a bottom-layer block taken as the center, obtaining the feature representation of a region block that is spatially larger than the bottom-layer block.
The technical scheme provided by the invention has the following beneficial effects: through the action identification method based on hierarchical feature learning, features with better discrimination and richness are extracted to characterize the action, making model learning more efficient and improving the recognition rate of action recognition. Experiments prove that the method achieves high accuracy and meets various requirements of practical applications.
Drawings
Fig. 1 is a flowchart of a motion recognition method based on hierarchical feature learning.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
To solve the problems that manual features cannot mine the distinguishing information between samples and that the features used in action recognition are not rich, which makes action recognition inefficient, an embodiment of the invention provides an action recognition method based on hierarchical feature learning, comprising the following steps:
101: clustering the features of the region blocks in the training set, re-characterizing all region blocks with a bag-of-words model to obtain the features of high-level blocks, and mean-pooling the features of all blocks in one video to obtain the feature set of the video sequence;
102: modeling the feature set of the video sequence with a support vector machine to obtain model parameters;
103: selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence.
The action recognition method further comprises the following steps:
selecting training video sequences and candidate prediction video sequences from each class of the motion video data set.
The training video sequences are divided into space-time blocks of equal size, the covariance features of the blocks are constructed from the pixel information of the blocks, and these covariance features serve as the initialization features of the blocks, forming an action data set.
The hierarchical feature learning specifically comprises:
clustering the blocks in the training set with a clustering method, and then re-characterizing all blocks with a bag-of-words model to obtain the features of the bottom-layer blocks;
pooling, which fuses the bottom-layer features of all blocks surrounding a bottom-layer block taken as the center, obtaining the feature representation of a region block that is spatially larger than the bottom-layer block.
In summary, the embodiment of the invention provides an action recognition method based on hierarchical feature learning: a video sequence is partitioned into blocks, the feature characterization of each layer is learned from the training set, and a model is built; during application, a candidate action sequence is input, its hierarchical features are extracted, and the model predicts the action category. This yields a better recognition result and improves the recognition rate of action recognition.
Example 2
The scheme of embodiment 1 is described in detail below with reference to Fig. 1 and the following specific calculation principles:
201: selecting a training video sequence and a candidate prediction video sequence from each class of the motion video data set;
the selection of the action sequence in the training set can be realized by manual selection or random selection in the class, and if the action set has a divided training set and a divided test set, the divided sample is used as the training set and the test set.
202: dividing all training video sequences into space-time blocks of equal size W × W × T, constructing the covariance features of the blocks from the pixel information of the blocks, and using these covariance features as the initialization features of the blocks, forming an action data set S = {(X_i^(1), y_i)}, i = 1, …, N.
Here N is the total number of action sequences in the data set, i is the index of an action sequence sample, X_i^(1) ∈ R^(D_1 × T_i) is the first-level content of the i-th action sequence, R^(D_1 × T_i) denotes the real-valued space of dimension D_1 × T_i, D_1 is the dimension of the initialization feature of a space-time block, and T_i is the number of space-time blocks in the i-th action sequence.
The dimension D_1 depends on the selected initialization feature, which is not limited by the embodiments of the invention. y_i is the sample label, taking values in Y = {1, 2, …, M}: a value of 1 means that the sample sequence contains action class 1, a value of 2 means action class 2, and M is the total number of action categories in the data set. Without loss of generality, the embodiments of the invention extract covariance features (see reference [21]) of the space-time blocks in all motion samples.
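A minimal sketch of this partitioning, assuming non-overlapping blocks and NumPy arrays; the function name split_into_blocks is illustrative, not from the patent:

```python
import numpy as np

def split_into_blocks(video, w, t):
    """Split a grayscale video of shape (T, H, W) into non-overlapping
    w x w x t spatio-temporal blocks (spatial size w, temporal length t)."""
    T, H, W = video.shape
    blocks = []
    for t0 in range(0, T - t + 1, t):
        for y0 in range(0, H - w + 1, w):
            for x0 in range(0, W - w + 1, w):
                blocks.append(video[t0:t0 + t, y0:y0 + w, x0:x0 + w])
    return blocks  # list of arrays, each of shape (t, w, w)
```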
203: clustering the blocks in the training set by using a clustering method, and then performing feature re-characterization on all the blocks by using a bag-of-words model to obtain the features of the bottom-layer blocks;
the problem solved in the step is the problem of sample initial feature processing and re-characterization in motion recognition, which is embodied by learning the input initial feature, mapping the initial feature to a new feature space by learning a conversion matrix of the new feature space, and obtaining a feature set of the learned sample by re-characterizing the sample
Common methods include k-means clustering, sparse coding and the like, and the embodiment of the invention does not limit the learning of the feature space and the selection of the re-characterization.
204: pooling, with a bottom-layer block as the center, fuses the bottom-layer features of all surrounding blocks to obtain the feature representation of a region block that is spatially larger than the bottom-layer block;
The pooling operation fuses the information around the position of the central block, so that the larger learned features have locality, contain the surrounding spatio-temporal information, and are enriched. Pooling operations generally include mean pooling, sum pooling and max pooling, which the embodiments of the invention do not limit; a sketch is given below.
205: clustering the features of the region blocks of the training set, then re-characterizing all region blocks with a bag-of-words model to obtain the features of the high-level blocks, and mean-pooling the features of all blocks in a video to obtain the feature set of the video sequence;
206: modeling a feature set of a video sequence by using a support vector machine to obtain model parameters;
207: selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence.
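Steps 206–207 could be sketched with scikit-learn's SVC; the library choice, kernel and feature dimension below are assumptions for illustration, and the random arrays are placeholders for the video-level features of step 205:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data with KTH-like sizes: 382 training videos, 6 classes.
rng = np.random.default_rng(0)
X_train = rng.random((382, 1000))        # hypothetical feature dimension 1000
y_train = rng.integers(1, 7, size=382)   # action class numbers in {1, ..., 6}

model = SVC(kernel="linear")             # kernel choice is an assumption
model.fit(X_train, y_train)              # step 206: learn model parameters

x_test = rng.random((1, 1000))           # hierarchical features of a test sequence
print(model.predict(x_test))             # step 207: predicted action class number
```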
In summary, the features obtained by the hierarchical feature learning method are more robust: they retain local information around the feature points that existing hand-crafted features lack, and are refined layer by layer into global features. The amount of information in the features is increased, which further improves the accuracy of action recognition and yields better results.
Example 3
The schemes in examples 1 and 2 are further discussed below in conjunction with specific calculation formulas, as described in detail below:
First, without loss of generality, the embodiments of the invention select the covariance matrix as the initialization feature of a block.
Each pixel of a block is first described by a feature vector built from 10 kinds of information:
F(x, y, t) = [x, y, t, I, I_x, I_y, I_t, I_xx, I_yy, I_tt]^T
where I(x, y, t) is the pixel value at position (x, y, t) of the block; I_x, I_y, I_t denote the first-order partial derivatives of the pixel value with respect to x, y and t; and I_xx, I_yy, I_tt denote the corresponding second-order derivatives. From these 10 kinds of information a representation F(x, y, t) of the point at position (x, y, t) is generated. Since the block is the smallest information carrier, once the description of single points is available, the characterization of the block is initialized with a covariance descriptor:
C_I = 1/(n − 1) · Σ_{i=1}^{n} (F_i − μ)(F_i − μ)^T
where n is the number of pixels in the block, n = W × W × T, F_i = F(x_i, y_i, t_i) is the representation of a point in the block, and μ is the mean of the F_i. All points in the block are thus integrated by the covariance into the initialization descriptor C_I of the block. C_I is a matrix of size dim(F_i) × dim(F_i); since F_i is a 10-dimensional vector, C_I is a 10 × 10 covariance matrix.
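An illustrative NumPy sketch of the covariance descriptor; the exact choice and ordering of the 10 per-pixel quantities is an assumption consistent with the description above:

```python
import numpy as np

def pixel_features(block):
    """Per-pixel 10-D feature F(x, y, t) for a (t, h, w) block:
    coordinates, intensity, first- and second-order derivatives."""
    block = block.astype(float)
    It, Iy, Ix = np.gradient(block)       # derivatives along t, y, x
    Itt = np.gradient(It, axis=0)
    Iyy = np.gradient(Iy, axis=1)
    Ixx = np.gradient(Ix, axis=2)
    t, y, x = np.indices(block.shape)
    F = np.stack([x, y, t, block, Ix, Iy, It, Ixx, Iyy, Itt], axis=-1)
    return F.reshape(-1, 10)              # one row per pixel

def covariance_descriptor(F):
    """C_I = 1/(n-1) * sum_i (F_i - mu)(F_i - mu)^T  ->  10 x 10 matrix."""
    diff = F - F.mean(axis=0)
    return diff.T @ diff / (F.shape[0] - 1)
```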
Secondly, the learning process of the bottom layer features is as follows:
the covariance matrix is a special type of Riemann manifold. A non-euclidean structure of a Symmetric positive definite matrix (SPD) may be applied to the metrics between different covariance matrices. The SPD manifold is embedded into the traditional Euclidean space by utilizing differential homomorphism, and the manifold geometry of a symmetrical positive definite matrix is applied in the process of dictionary learning and coding. In a Riemannian manifold (M, g) space, the tangent space of any point P is TpM,TpM is represented as all tangent vectors passing through point P. The correlation formula of the smooth change in the tangent vector space isThe formula g is a formula for any p ∈ TpM all have positive definiteThe symmetric and bilinear properties have certain robustness to geometric changes. The operators for the conversion of tangent space and manifold space are e-exponential transformations exp, respectivelyP(·)∶TpM → M, mapping the tangent vector △ to a point X in manifold space, and performing logarithmic transformationMapping a point in manifold space to a vector in tangent vector space, expP(. o) and logPThe (-) transform is a pair of inverse transforms. expPThe (-) transformation may be such that the length of tangent vector △ is equal to the geodesic distance of X from P.
To relate data in Euclidean space to the manifold space, the Karcher mean can be used instead of the arithmetic mean when measuring the distance between X_i and X_j (reference [19]). The Karcher mean is the solution of:
μ = argmin_{X ∈ M} Σ_{i=1}^{m} d_g(X, X_i)^2
where d_g(·,·) is the geodesic distance induced by the metric g. In the computation, every mapping into the tangent vector space requires a Cholesky factorization; for a d × d covariance matrix the time complexity is O(d^3).
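A standard iterative sketch of the Karcher mean under the affine-invariant metric; the iteration scheme and step count are assumptions, since the patent only states the minimization problem:

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm

def karcher_mean(mats, iters=10):
    """Frechet/Karcher mean of SPD matrices: log-map the samples to the
    tangent space at the current estimate, average there, and exp-map back."""
    X = np.mean(mats, axis=0)                    # initialise arithmetically
    for _ in range(iters):
        Xh = np.real(sqrtm(X))                   # X^(1/2)
        Xih = np.linalg.inv(Xh)                  # X^(-1/2)
        # log-map every sample into the tangent space at X, then average
        T = np.mean([Xh @ np.real(logm(Xih @ M @ Xih)) @ Xh for M in mats],
                    axis=0)
        X = Xh @ np.real(expm(Xih @ T @ Xih)) @ Xh   # exp-map back
    return X
```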
The set of real d × d SPD matrices, denoted Sym_d^+, forms a real manifold with a group structure, mathematically called a Lie group, so the properties of Riemannian manifolds and all related geometric concepts can be applied to it. Under the Affine-Invariant Riemannian Metric (AIRM) on Sym_d^+, the logarithmic and exponential transformations are defined (reference [20]).
For a symmetric positive definite matrix X, the results of these two transformations can be obtained by singular value decomposition. Let the diagonal matrix be diag(λ_1, λ_2, …, λ_d) and let X decompose as X = U diag(λ_i) U^T; then the transformations can be rewritten as:
log(X) = U diag(log λ_i) U^T,  exp(X) = U diag(exp λ_i) U^T
These give the logarithmic and e-exponential transformation operators of a symmetric positive definite matrix X with manifold structure, i.e. the transformation and inverse transformation between the manifold space and the tangent vector space. Mapping from the manifold space into the tangent vector space allows the computation methods of Euclidean space to be applied. Given a symmetric positive definite matrix X, its log-Euclidean vector characterization α is unique (e.g., reference [21]), defined as α = Vec(log(X)), where for B ∈ Sym(d):
Vec(B) = [B_{1,1}, √2·B_{1,2}, …, √2·B_{1,d}, B_{2,2}, √2·B_{2,3}, …, B_{d,d}]
Mapping the symmetric positive definite matrices of the training data set into vectors in this way, the initialization feature of each block is h_1 = Vec(B).
Third, the feature learning process of the high-level blocks is as follows:
without loss of generality, a k-means algorithm is selected for clustering, and meanwhile, a vector quantization method is used for feature characterization.
The k-means clustering method assigns the n feature points in the feature space to k classes according to the principle of minimizing the sum of within-class variances:
argmin_C Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − μ_i||^2
where C_i denotes the i-th cluster with center μ_i, and x ranges over the S-th-layer features belonging to class C_i. The specific steps of the k-means algorithm are as follows (a sketch follows the list):
(1) Initialize the cluster centers: randomly select k initial centers in the feature space, or select them according to a certain rule;
(2) Classify each feature point: compute the distance between each feature point and the cluster centers, and assign each feature point to the nearest of the k centers;
(3) Update the cluster centers: according to the result of step (2), recompute each center from the feature points assigned to it, obtaining new cluster centers;
(4) Repeat operations (2) and (3) until the convergence condition is met, and output the clustering result D.
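A minimal NumPy sketch following steps (1)–(4); the random initialization and the convergence test are illustrative choices, and it assumes no cluster ever becomes empty:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """k-means: (1) init centres, (2) assign points to the nearest centre,
    (3) recompute centres, (4) repeat until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step (1)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                               # step (2)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):                           # step (4)
            break
        centers = new                                           # step (3)
    return centers, labels
```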
Vector quantization counts the feature points of a sample by computing the distance relationship between each feature point and every word in the dictionary. For each feature descriptor x, given the dictionary D = {d_1, d_2, …, d_K}, a coding method φ yields the sample characterization φ(x), computed as follows: the distance between the feature point and every word of the dictionary is calculated, the word d_min with the minimum distance is taken, a new zero vector of length K is built in which only the entry corresponding to d_min is set to 1, and this vector φ(x) is the characterization of the feature point.
In specific implementation, other algorithms can be adopted to solve the problems of space-time block initialization, initialized feature characterization, high-level feature learning and the like.
Example 4
The following experiments were performed to verify the feasibility of the schemes of embodiments 1 and 2, as described in detail below:
the human body action database used in the experiment was from a database recorded by the royal academy of sciences KTH, sweden, which, once in red, contains 598 video sequences recorded in four different environments, and 6 different actions were performed by 25 volunteers, respectively, each action being repeated for a certain time. The resolution of the video data in the database is 160 × 120, the frame rate is 25fps, and the image of each frame in the video is a gray scale image, wherein the training sample set has 382 samples, and the test set has 216 samples. The recording environment of the motion database and the information and parameter setting of the data acquisition device may refer to documents (e.g., reference [22]), which are not described in detail in the embodiments of the present invention.
According to the literature, action recognition with prior-art features such as Cuboid, HOG3D and Dense HOF reaches an accuracy of about 90%. With the hierarchical feature learning method, the action recognition accuracy reaches 91.7%. This result is superior to those features and proves the feasibility and effectiveness of the method.
In summary, the embodiment of the invention provides an action recognition algorithm based on hierarchical feature learning: training video sequences and candidate prediction video sequences are selected from each class of the action video data set; all training video sequences are divided into space-time blocks of equal size W × W × T; the covariance features of the blocks are constructed from the pixel information of the blocks and used as the initialization features of the blocks, forming the action data set. On this basis, the feature information is enriched by hierarchical learning to obtain the features of the video sequence. Finally, the classifier learns the model parameters and finds the separating surface in the feature space, making the final recognition result satisfactory.
Reference documents:
[1] Bobick A F, Davis J W. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(3): 257-267.
[2] Ke Y, Sukthankar R, Hebert M. Spatio-temporal shape and flow correlation for action recognition. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[3] Rao C, Shah M. View-invariance in action recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001(2): 316-322.
[4] Sheikh Y, Sheikh M, Shah M. Exploring the space of a human action. IEEE International Conference on Computer Vision, 2005.
[5] Laptev I. On space-time interest points. International Journal of Computer Vision, 2005, 64(2-3): 107-123.
[6] Dollár P, Rabaud V, Cottrell G, et al. Behavior recognition via sparse spatio-temporal features. Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2nd Joint IEEE International Workshop on. IEEE, 2005: 65-72.
[7] Andrews S, Tsochantaridis I, Hofmann T. Support vector machines for multiple-instance learning. Advances in Neural Information Processing Systems, 2002: 561-568.
[8] Wang H, Kläser A, Schmid C, et al. Action recognition by dense trajectories. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2011: 3169-3176.
[9] Scovanner P, Ali S, Shah M. A 3-dimensional SIFT descriptor and its application to action recognition. Proceedings of the 15th International Conference on Multimedia. ACM, 2007: 357-360.
[10] Bay H, Tuytelaars T, Van Gool L. SURF: Speeded up robust features. Computer Vision – ECCV. Springer Berlin Heidelberg, 2006: 404-417.
[11] Chen M, Hauptmann A. MoSIFT: Recognizing human actions in surveillance videos. 2009.
[12] Morency L P, Quattoni A, Darrell T. Latent-dynamic discriminative models for continuous gesture recognition. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007: 1-8.
[13] Yamato J, Ohya J, Ishii K. Recognizing human action in time-sequential images using hidden Markov model. 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 1992: 379-385.
[14] Brand M, Oliver N, Pentland A. Coupled hidden Markov models for complex action recognition. Proceedings, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997: 994-999.
[15] Wang J, Liu P, She M, et al. Human action categorization using conditional random field. Robotic Intelligence in Informationally Structured Space, 2011 IEEE Workshop, 2011: 131-135.
[16] Wang H, Schmid C. Action recognition with improved trajectories. IEEE International Conference on Computer Vision, 2013: 3551-3558.
[17] Liu J, Kuipers B, Savarese S. Recognizing human actions by attributes. IEEE Conference on Computer Vision and Pattern Recognition, 2011: 3337-3344.
[18] Wang J, Zucker J D. Solving multiple-instance problem: A lazy learning approach. 2000.
[19] Kläser A, Marszałek M, Schmid C. A spatio-temporal descriptor based on 3d-gradients. 19th British Machine Vision Conference. British Machine Vision Association, 2008: 275:1-10.
[20] Pennec X. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 2006, 25(1): 127-154.
[21] Faraki M, Palhang M, Sanderson C. Log-Euclidean bag of words for human action recognition. IET Computer Vision, 2014, 9(3): 331-339.
[22] Schüldt C, Laptev I, Caputo B. Recognizing human actions: A local SVM approach. In 17th International Conference on Pattern Recognition, pages 32-36, 2004.
in the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. An action recognition method based on hierarchical feature learning, characterized by comprising the following steps:
clustering the features of the region blocks in the training set, re-characterizing all region blocks with a bag-of-words model to obtain the features of high-level blocks, and mean-pooling the features of all blocks in one video to obtain the feature set of the video sequence;
modeling the feature set of the video sequence with a support vector machine to obtain model parameters;
selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence.
2. The action recognition method based on hierarchical feature learning according to claim 1, further comprising:
selecting training video sequences and candidate prediction video sequences from each class of the motion video data set.
3. The action recognition method based on hierarchical feature learning according to claim 2, characterized in that
the training video sequences are divided into space-time blocks of equal size, the covariance features of the blocks are constructed from the pixel information of the blocks, and these covariance features serve as the initialization features of the blocks, forming an action data set.
4. The action recognition method based on hierarchical feature learning according to claim 1, characterized in that the hierarchical feature learning specifically comprises:
clustering the blocks in the training set with a clustering method, and then re-characterizing all blocks with a bag-of-words model to obtain the features of the bottom-layer blocks;
pooling, which fuses the bottom-layer features of all blocks surrounding a bottom-layer block taken as the center, obtaining the feature representation of a region block that is spatially larger than the bottom-layer block.
CN201710010477.1A 2017-01-06 2017-01-06 Action identification method based on hierarchical feature learning Pending CN106845375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710010477.1A CN106845375A (en) 2017-01-06 2017-01-06 Action identification method based on hierarchical feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710010477.1A CN106845375A (en) 2017-01-06 2017-01-06 Action identification method based on hierarchical feature learning

Publications (1)

Publication Number Publication Date
CN106845375A true CN106845375A (en) 2017-06-13

Family

ID=59117921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710010477.1A Pending CN106845375A (en) Action identification method based on hierarchical feature learning

Country Status (1)

Country Link
CN (1) CN106845375A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021381A (en) * 2014-06-19 2014-09-03 天津大学 Human movement recognition method based on multistage characteristics
CN105138953A (en) * 2015-07-09 2015-12-09 浙江大学 Method for identifying actions in video based on continuous multi-instance learning
CN105956604A (en) * 2016-04-20 2016-09-21 广东顺德中山大学卡内基梅隆大学国际联合研究院 Action identification method based on two layers of space-time neighborhood characteristics

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FARAKI M et al.: "Log-Euclidean bag of words for human action recognition", 《IET Computer Vision》 *
李文辉 (Li Wenhui): "基于多示例多特征的人体动作识别" [Human action recognition based on multiple instances and multiple features], 《信息技术》 [Information Technology] *
苏育挺 (Su Yuting) et al.: "基于层级化特征的人体动作识别" [Human action recognition based on hierarchical features], 《信息技术》 [Information Technology] *
黄少年 (Huang Shaonian) et al.: "基于高层语义词袋的人体行为识别方法" [Human behavior recognition method based on high-level semantic bag-of-words], 《电脑与电信》 [Computer & Telecommunication] *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754359A (en) * 2017-11-01 2019-05-14 腾讯科技(深圳)有限公司 Pooling processing method and system applied to convolutional neural networks
US11537857B2 (en) 2017-11-01 2022-12-27 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
US11734554B2 (en) 2017-11-01 2023-08-22 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
CN108960031A (en) * 2018-03-29 2018-12-07 中国科学院软件研究所 Video action classification system and method based on hierarchical dynamic decomposition and coding
CN110555341A (en) * 2018-05-31 2019-12-10 北京深鉴智能科技有限公司 Pooling method and apparatus, detection method and apparatus, electronic device, storage medium
CN109711244A (en) * 2018-11-05 2019-05-03 天津大学 Human behavior recognition method based on covariance descriptors
CN109948446A (en) * 2019-02-20 2019-06-28 北京奇艺世纪科技有限公司 Video clip processing method and device, and computer-readable storage medium
CN109948446B (en) * 2019-02-20 2021-07-16 北京奇艺世纪科技有限公司 Video clip processing method and device and computer readable storage medium
CN113516030A (en) * 2021-04-28 2021-10-19 上海科技大学 Action sequence verification method and device, storage medium and terminal
CN113516030B (en) * 2021-04-28 2024-03-26 上海科技大学 Action sequence verification method and device, storage medium and terminal
CN114550289A (en) * 2022-02-16 2022-05-27 中山职业技术学院 Behavior identification method and system and electronic equipment

Similar Documents

Publication Publication Date Title
Jiang et al. Recognizing human actions by learning and matching shape-motion prototype trees
Torralba et al. Describing visual scenes using transformed dirichlet processes
CN106845375A (en) A kind of action identification method based on hierarchical feature learning
Soomro et al. Action recognition in realistic sports videos
Yi et al. Human action recognition with graph-based multiple-instance learning
CN104616316B (en) Personage's Activity recognition method based on threshold matrix and Fusion Features vision word
Garg et al. Delta descriptors: Change-based place representation for robust visual localization
Gu et al. Multiple stream deep learning model for human action recognition
Rahman et al. Fast action recognition using negative space features
Chen et al. TriViews: A general framework to use 3D depth data effectively for action recognition
Shah et al. A novel biomechanics-based approach for person re-identification by generating dense color sift salience features
Zhang et al. Weakly supervised human fixations prediction
Zheng et al. Fusing shape and spatio-temporal features for depth-based dynamic hand gesture recognition
Afshar et al. Facial expression recognition in the wild using improved dense trajectories and fisher vector encoding
Wang et al. A comprehensive survey of rgb-based and skeleton-based human action recognition
Saqib et al. Intelligent dynamic gesture recognition using CNN empowered by edit distance
Basavaiah et al. Robust Feature Extraction and Classification Based Automated Human Action Recognition System for Multiple Datasets.
Aoun et al. Bag of sub-graphs for video event recognition
Najibi et al. Towards the success rate of one: Real-time unconstrained salient object detection
Liu et al. DISCOV: A framework for discovering objects in video
Zhu et al. Correspondence-free dictionary learning for cross-view action recognition
Zhou et al. Modeling perspective effects in photographic composition
Hashemi et al. View-independent action recognition: A hybrid approach
Said et al. Wavelet networks for facial emotion recognition
Alghyaline et al. Video action classification using symmelets and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170613