CN106845375A - Action identification method based on hierarchical feature learning - Google Patents

Action identification method based on hierarchical feature learning

Info

Publication number
CN106845375A
Authority
CN
China
Prior art keywords
feature
action
blocks
sequence
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710010477.1A
Other languages
Chinese (zh)
Inventor
李文辉
聂为之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201710010477.1A
Publication of CN106845375A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action identification method based on hierarchical feature learning, comprising the following steps: clustering the features of the region blocks in the training set and re-characterizing all region blocks with a bag-of-words model to obtain the features of high-level blocks; mean-pooling the features of all blocks in one video to obtain the feature set of the video sequence; modeling the feature set of the video sequence with a support vector machine to obtain model parameters; selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence. By recognizing actions through hierarchical feature learning, the method extracts features with better discrimination and richness to characterize the action, making model learning more efficient and improving the recognition rate of action recognition. Experiments verify that the method achieves high accuracy and meets various needs of practical applications.

Description

Action identification method based on hierarchical feature learning
Technical Field
The invention relates to the field of image processing and pattern recognition, in particular to an action recognition method based on hierarchical feature learning.
Background
Computer vision technology simulates human vision to identify and understand the surrounding environment by processing and analyzing two-dimensional images or three-dimensional videos. As images and videos increasingly become the means by which people acquire visual information, computer vision has developed rapidly. As part of computer vision research, human motion analysis and recognition based on visual information is one of the currently popular research directions. Human action recognition refers to recognizing human actions in an image sequence or video through computer vision and machine learning methods. In recent years, human action recognition has been widely applied to intelligent monitoring, video retrieval, human-computer interaction, behavior analysis, virtual reality and the like, and good progress has been made.
According to how features are extracted and modeled from samples, human action recognition methods can be divided into two categories: methods based on the spatio-temporal whole and methods based on time series. In spatio-temporal-whole methods, researchers view the video data as a three-dimensional spatio-temporal cube in which the human motion resides. Under the assumption that spatio-temporal data of the same motion are similar (e.g., reference [1]), the foreground portion of the video data is extracted and reorganized, and actions are recognized by comparing the similarity of the foreground data across videos. 3-D automatic segmentation of video data can be achieved by clustering cubes of similar color with a hierarchical mean-shift algorithm (e.g., reference [2]); the subset of the segmented data that best matches a motion model is then searched to realize motion recognition. Motion recognition can also be based on the trajectories of the human body (e.g., reference [3]): the motion of the human body in the video is regarded as a trajectory that changes in space and time, and trajectories formed by different motions differ to a certain extent, so a trajectory can describe a motion. A view-invariant human motion recognition method is obtained by projecting the spatio-temporal curvature of the three-dimensional motion trajectory of the human hand onto the two-dimensional trajectory and expressing the trajectory as the feature of the motion. By extracting the trajectories of important joints (such as the head, hands and feet) during human motion, the similarity between motion samples is judged according to similarity invariance. In recent years, the most widely applied spatio-temporal-whole method characterizes human actions with spatio-temporal interest points. Spatio-temporal interest point features capture the appearance of the human body and the local saliency of the motion; owing to their locality, they are robust to complex backgrounds, scale changes and the diversity of action types in videos. A common spatio-temporal interest point is the STIP feature (e.g., reference [5]), which extends the two-dimensional Harris corner detector (e.g., reference [6]) to Harris3D in 3-D space-time and uses a joint HOG/HOF representation as the interest point descriptor. The Cuboids interest point feature (e.g., reference [7]) increases the number of detected interest points by applying Gabor filtering in the time domain, and its descriptor is obtained by describing brightness gradients over a surrounding region six times the detection scale of the interest point. Feature points can also be selected by dense sampling and trajectory tracking (e.g., reference [8]), yielding dense-trajectory features that use gradients, optical flow and motion boundary histograms as descriptors. Many other features have been applied, such as the 3-dimensional Scale-Invariant Feature Transform (3D-SIFT) (e.g., reference [9]), Speeded-Up Robust Features (SURF) (e.g., reference [10]) and MoSIFT (e.g., reference [11]).
In time-series-based methods, researchers view a video as a sequence of images, each image containing human motion features, and judge the action type by comparing sequences. Because human motions differ across individuals, for example in amplitude and speed, the dynamic time warping algorithm can handle this problem well (e.g., reference [12]). Hidden Markov Models (HMMs) have been used to recognize human actions (e.g., reference [13]): each frame of the video is taken as a feature vector, HMM modeling is performed on these feature vectors to find the implicit state-transition relationships between sequences, a state-based model is established, and actions are then recognized. Furthermore, by coupling multiple HMMs, the Coupled Hidden Markov Model (CHMM) was proposed (e.g., reference [14]) to model the interaction between multiple people. Another widely used method for time series is the Conditional Random Field (CRF) (e.g., reference [15]), which divides the motion sequence into a number of consecutive units and recognizes human motion according to the transition rules between adjacent units. To handle different timing models, CRFs have been extended by many research efforts, such as the hidden-state CRF (e.g., reference [16]), the dynamic CRF (e.g., reference [17]) and the semi-Markov random field model (e.g., reference [18]).
The field of action recognition mainly faces the following challenges:
1. Human actions take diverse forms. In an action sequence, different people often perform the same action differently because of personal habits, which adds difficulty to action recognition. Meanwhile, different devices and different action types further diversify the forms of the action sequences. Providing a detection method that is robust to the forms of human actions is therefore important for recognizing them.
2. The background of the action is complex. To match real-world conditions, the recording environments of many motion sequence samples are not limited to simple, fixed backgrounds; many come from complex and variable environments, and a complex background is a great challenge for modeling human motion.
3. Existing human motion recognition features are deficient: most are manually designed and generally applicable, but they cannot capture the uniqueness of a motion sample well. How to characterize motion samples with features learned from the samples themselves is therefore very important for motion recognition.
Disclosure of Invention
The invention provides an action recognition method based on hierarchical feature learning. It addresses the problems that manually designed features cannot capture the characteristics of individual samples, so that the same action type shows large differences in appearance, the extracted features are monotonous, and model learning is difficult. The method is described in detail as follows:
a motion recognition method based on hierarchical feature learning, the motion recognition method comprising the steps of:
clustering the features of the region blocks in the training set, re-characterizing all region blocks with a bag-of-words model to obtain the features of high-level blocks, and mean-pooling the features of all blocks in one video to obtain the feature set of the video sequence;
modeling the feature set of the video sequence with a support vector machine to obtain model parameters;
selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence.
The motion recognition method further comprises:
selecting training video sequences and candidate prediction video sequences from each class of the motion video data set.
The training video sequences are divided into space-time blocks of equal size, the covariance features of the blocks are constructed from the pixel information of the blocks, and these covariance features serve as the initialization features of the blocks, forming an action data set.
The hierarchical feature learning specifically comprises:
clustering the blocks in the training set with a clustering method, and then re-characterizing all blocks with a bag-of-words model to obtain the features of the bottom-layer blocks;
pooling, which fuses the bottom-layer features of all blocks surrounding a bottom-layer block taken as the center, obtaining the feature representation of a region block that is spatially larger than the bottom-layer block.
The technical scheme provided by the invention has the following beneficial effects: through the action identification method based on hierarchical feature learning, features with better discrimination and richness are extracted to characterize the action, making model learning more efficient and improving the recognition rate of action recognition. Experiments prove that the method achieves high accuracy and meets various requirements of practical applications.
Drawings
Fig. 1 is a flowchart of a motion recognition method based on hierarchical feature learning.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
To solve the problems that manual features cannot mine the distinguishing information between samples and that the features used in action recognition are not rich, which makes action recognition inefficient, an embodiment of the invention provides an action recognition method based on hierarchical feature learning, comprising the following steps:
101: clustering the features of the region blocks in the training set, re-characterizing all region blocks with a bag-of-words model to obtain the features of high-level blocks, and mean-pooling the features of all blocks in one video to obtain the feature set of the video sequence;
102: modeling the feature set of the video sequence with a support vector machine to obtain model parameters;
103: selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence.
The action recognition method further comprises the following steps:
selecting training video sequences and candidate prediction video sequences from each class of the motion video data set.
The training video sequences are divided into space-time blocks of equal size, the covariance features of the blocks are constructed from the pixel information of the blocks, and these covariance features serve as the initialization features of the blocks, forming an action data set.
The hierarchical feature learning specifically comprises:
clustering the blocks in the training set with a clustering method, and then re-characterizing all blocks with a bag-of-words model to obtain the features of the bottom-layer blocks;
pooling, which fuses the bottom-layer features of all blocks surrounding a bottom-layer block taken as the center, obtaining the feature representation of a region block that is spatially larger than the bottom-layer block.
In summary, the embodiment of the invention provides an action recognition method based on hierarchical feature learning: a video sequence is partitioned into blocks, the feature characterization of each layer is learned from the training set, and a model is built; during application, a candidate action sequence is input, its hierarchical features are extracted, and the model predicts the action category. This yields a better recognition result and improves the recognition rate of action recognition.
Example 2
The scheme of embodiment 1 is described in detail below with reference to Fig. 1 and the following specific calculation principles:
201: selecting a training video sequence and a candidate prediction video sequence from each class of the motion video data set;
the selection of the action sequence in the training set can be realized by manual selection or random selection in the class, and if the action set has a divided training set and a divided test set, the divided sample is used as the training set and the test set.
202: dividing all training video sequences into space-time blocks of equal size W × W × T, constructing the covariance features of the blocks from the pixel information of the blocks, and using these covariance features as the initialization features of the blocks, forming an action data set S = {(X_i^(1), y_i)}, i = 1, …, N.
Here N is the total number of action sequences in the data set, i is the index of an action sequence sample, X_i^(1) ∈ R^(D_1 × T_i) is the first-level content of the i-th action sequence, R^(D_1 × T_i) denotes the real-valued space of dimension D_1 × T_i, D_1 is the dimension of the initialization feature of a space-time block, and T_i is the number of space-time blocks in the i-th action sequence.
The dimension D_1 depends on the selected initialization feature, which is not limited by the embodiments of the invention. y_i is the sample label, taking values in Y = {1, 2, …, M}: a value of 1 means that the sample sequence contains action class 1, a value of 2 means action class 2, and M is the total number of action categories in the data set. Without loss of generality, the embodiments of the invention extract covariance features (see reference [21]) of the space-time blocks in all motion samples.
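A minimal sketch of this partitioning, assuming non-overlapping blocks and NumPy arrays; the function name split_into_blocks is illustrative, not from the patent:

```python
import numpy as np

def split_into_blocks(video, w, t):
    """Split a grayscale video of shape (T, H, W) into non-overlapping
    w x w x t spatio-temporal blocks (spatial size w, temporal length t)."""
    T, H, W = video.shape
    blocks = []
    for t0 in range(0, T - t + 1, t):
        for y0 in range(0, H - w + 1, w):
            for x0 in range(0, W - w + 1, w):
                blocks.append(video[t0:t0 + t, y0:y0 + w, x0:x0 + w])
    return blocks  # list of arrays, each of shape (t, w, w)
```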
203: clustering the blocks in the training set by using a clustering method, and then performing feature re-characterization on all the blocks by using a bag-of-words model to obtain the features of the bottom-layer blocks;
the problem solved in the step is the problem of sample initial feature processing and re-characterization in motion recognition, which is embodied by learning the input initial feature, mapping the initial feature to a new feature space by learning a conversion matrix of the new feature space, and obtaining a feature set of the learned sample by re-characterizing the sample
Common methods include k-means clustering, sparse coding and the like, and the embodiment of the invention does not limit the learning of the feature space and the selection of the re-characterization.
204: pooling, with a bottom-layer block as the center, fuses the bottom-layer features of all surrounding blocks to obtain the feature representation of a region block that is spatially larger than the bottom-layer block;
The pooling operation fuses the information around the position of the central block, so that the larger learned features have locality, contain the surrounding spatio-temporal information, and are enriched. Pooling operations generally include mean pooling, sum pooling and max pooling, which the embodiments of the invention do not limit; a sketch is given below.
205: clustering the features of the region blocks of the training set, then re-characterizing all region blocks with a bag-of-words model to obtain the features of the high-level blocks, and mean-pooling the features of all blocks in a video to obtain the feature set of the video sequence;
206: modeling a feature set of a video sequence by using a support vector machine to obtain model parameters;
207: selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence.
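Steps 206–207 could be sketched with scikit-learn's SVC; the library choice, kernel and feature dimension below are assumptions for illustration, and the random arrays are placeholders for the video-level features of step 205:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data with KTH-like sizes: 382 training videos, 6 classes.
rng = np.random.default_rng(0)
X_train = rng.random((382, 1000))        # hypothetical feature dimension 1000
y_train = rng.integers(1, 7, size=382)   # action class numbers in {1, ..., 6}

model = SVC(kernel="linear")             # kernel choice is an assumption
model.fit(X_train, y_train)              # step 206: learn model parameters

x_test = rng.random((1, 1000))           # hierarchical features of a test sequence
print(model.predict(x_test))             # step 207: predicted action class number
```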
In summary, the features obtained by the hierarchical feature learning method are more robust: they retain local information around the feature points that existing hand-crafted features lack, and are refined layer by layer into global features. The amount of information in the features is increased, which further improves the accuracy of action recognition and yields better results.
Example 3
The schemes in examples 1 and 2 are further discussed below in conjunction with specific calculation formulas, as described in detail below:
First, without loss of generality, the embodiments of the invention select the covariance matrix as the initialization feature of a block.
Each pixel of a block is first described by a feature vector built from 10 kinds of information:
F(x, y, t) = [x, y, t, I, I_x, I_y, I_t, I_xx, I_yy, I_tt]^T
where I(x, y, t) is the pixel value at position (x, y, t) of the block; I_x, I_y, I_t denote the first-order partial derivatives of the pixel value with respect to x, y and t; and I_xx, I_yy, I_tt denote the corresponding second-order derivatives. From these 10 kinds of information a representation F(x, y, t) of the point at position (x, y, t) is generated. Since the block is the smallest information carrier, once the description of single points is available, the characterization of the block is initialized with a covariance descriptor:
C_I = 1/(n − 1) · Σ_{i=1}^{n} (F_i − μ)(F_i − μ)^T
where n is the number of pixels in the block, n = W × W × T, F_i = F(x_i, y_i, t_i) is the representation of a point in the block, and μ is the mean of the F_i. All points in the block are thus integrated by the covariance into the initialization descriptor C_I of the block. C_I is a matrix of size dim(F_i) × dim(F_i); since F_i is a 10-dimensional vector, C_I is a 10 × 10 covariance matrix.
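An illustrative NumPy sketch of the covariance descriptor; the exact choice and ordering of the 10 per-pixel quantities is an assumption consistent with the description above:

```python
import numpy as np

def pixel_features(block):
    """Per-pixel 10-D feature F(x, y, t) for a (t, h, w) block:
    coordinates, intensity, first- and second-order derivatives."""
    block = block.astype(float)
    It, Iy, Ix = np.gradient(block)       # derivatives along t, y, x
    Itt = np.gradient(It, axis=0)
    Iyy = np.gradient(Iy, axis=1)
    Ixx = np.gradient(Ix, axis=2)
    t, y, x = np.indices(block.shape)
    F = np.stack([x, y, t, block, Ix, Iy, It, Ixx, Iyy, Itt], axis=-1)
    return F.reshape(-1, 10)              # one row per pixel

def covariance_descriptor(F):
    """C_I = 1/(n-1) * sum_i (F_i - mu)(F_i - mu)^T  ->  10 x 10 matrix."""
    diff = F - F.mean(axis=0)
    return diff.T @ diff / (F.shape[0] - 1)
```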
Secondly, the learning process of the bottom layer features is as follows:
the covariance matrix is a special type of Riemann manifold. A non-euclidean structure of a Symmetric positive definite matrix (SPD) may be applied to the metrics between different covariance matrices. The SPD manifold is embedded into the traditional Euclidean space by utilizing differential homomorphism, and the manifold geometry of a symmetrical positive definite matrix is applied in the process of dictionary learning and coding. In a Riemannian manifold (M, g) space, the tangent space of any point P is TpM,TpM is represented as all tangent vectors passing through point P. The correlation formula of the smooth change in the tangent vector space isThe formula g is a formula for any p ∈ TpM all have positive definiteThe symmetric and bilinear properties have certain robustness to geometric changes. The operators for the conversion of tangent space and manifold space are e-exponential transformations exp, respectivelyP(·)∶TpM → M, mapping the tangent vector △ to a point X in manifold space, and performing logarithmic transformationMapping a point in manifold space to a vector in tangent vector space, expP(. o) and logPThe (-) transform is a pair of inverse transforms. expPThe (-) transformation may be such that the length of tangent vector △ is equal to the geodesic distance of X from P.
To relate data in Euclidean space to the manifold space, the Karcher mean can be used instead of the arithmetic mean when measuring the distance between X_i and X_j (reference [19]). The Karcher mean is the solution of:
μ = argmin_{X ∈ M} Σ_{i=1}^{m} d_g(X, X_i)^2
where d_g(·,·) is the geodesic distance induced by the metric g. In the computation, every mapping into the tangent vector space requires a Cholesky factorization; for a d × d covariance matrix the time complexity is O(d^3).
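A standard iterative sketch of the Karcher mean under the affine-invariant metric; the iteration scheme and step count are assumptions, since the patent only states the minimization problem:

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm

def karcher_mean(mats, iters=10):
    """Frechet/Karcher mean of SPD matrices: log-map the samples to the
    tangent space at the current estimate, average there, and exp-map back."""
    X = np.mean(mats, axis=0)                    # initialise arithmetically
    for _ in range(iters):
        Xh = np.real(sqrtm(X))                   # X^(1/2)
        Xih = np.linalg.inv(Xh)                  # X^(-1/2)
        # log-map every sample into the tangent space at X, then average
        T = np.mean([Xh @ np.real(logm(Xih @ M @ Xih)) @ Xh for M in mats],
                    axis=0)
        X = Xh @ np.real(expm(Xih @ T @ Xih)) @ Xh   # exp-map back
    return X
```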
The set of real d × d SPD matrices, denoted Sym_d^+, forms a real manifold with a group structure, mathematically called a Lie group, so the properties of Riemannian manifolds and all related geometric concepts can be applied to it. Under the Affine-Invariant Riemannian Metric (AIRM) on Sym_d^+, the logarithmic and exponential transformations are defined (reference [20]).
For a symmetric positive definite matrix X, the results of these two transformations can be obtained by singular value decomposition. Let the diagonal matrix be diag(λ_1, λ_2, …, λ_d) and let X decompose as X = U diag(λ_i) U^T; then the transformations can be rewritten as:
log(X) = U diag(log λ_i) U^T,  exp(X) = U diag(exp λ_i) U^T
These give the logarithmic and e-exponential transformation operators of a symmetric positive definite matrix X with manifold structure, i.e. the transformation and inverse transformation between the manifold space and the tangent vector space. Mapping from the manifold space into the tangent vector space allows the computation methods of Euclidean space to be applied. Given a symmetric positive definite matrix X, its log-Euclidean vector characterization α is unique (e.g., reference [21]), defined as α = Vec(log(X)), where for B ∈ Sym(d):
Vec(B) = [B_{1,1}, √2·B_{1,2}, …, √2·B_{1,d}, B_{2,2}, √2·B_{2,3}, …, B_{d,d}]
Mapping the symmetric positive definite matrices of the training data set into vectors in this way, the initialization feature of each block is h_1 = Vec(B).
Third, the feature learning process of the high-level blocks is as follows:
without loss of generality, a k-means algorithm is selected for clustering, and meanwhile, a vector quantization method is used for feature characterization.
The k-means clustering method assigns the n feature points in the feature space to k classes according to the principle of minimizing the sum of within-class variances:
argmin_C Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − μ_i||^2
where C_i denotes the i-th cluster with center μ_i, and x ranges over the S-th-layer features belonging to class C_i. The specific steps of the k-means algorithm are as follows (a sketch follows the list):
(1) Initialize the cluster centers: randomly select k initial centers in the feature space, or select them according to a certain rule;
(2) Classify each feature point: compute the distance between each feature point and the cluster centers, and assign each feature point to the nearest of the k centers;
(3) Update the cluster centers: according to the result of step (2), recompute each center from the feature points assigned to it, obtaining new cluster centers;
(4) Repeat operations (2) and (3) until the convergence condition is met, and output the clustering result D.
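A minimal NumPy sketch following steps (1)–(4); the random initialization and the convergence test are illustrative choices, and it assumes no cluster ever becomes empty:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """k-means: (1) init centres, (2) assign points to the nearest centre,
    (3) recompute centres, (4) repeat until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step (1)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                               # step (2)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):                           # step (4)
            break
        centers = new                                           # step (3)
    return centers, labels
```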
Vector quantization counts the feature points of a sample by computing the distance relationship between each feature point and every word in the dictionary. For each feature descriptor x, given the dictionary D = {d_1, d_2, …, d_K}, a coding method φ yields the sample characterization φ(x), computed as follows: the distance between the feature point and every word of the dictionary is calculated, the word d_min with the minimum distance is taken, a new zero vector of length K is built in which only the entry corresponding to d_min is set to 1, and this vector φ(x) is the characterization of the feature point.
In specific implementation, other algorithms can be adopted to solve the problems of space-time block initialization, initialized feature characterization, high-level feature learning and the like.
Example 4
The following experiments were performed to verify the feasibility of the schemes of embodiments 1 and 2, as described in detail below:
the human body action database used in the experiment was from a database recorded by the royal academy of sciences KTH, sweden, which, once in red, contains 598 video sequences recorded in four different environments, and 6 different actions were performed by 25 volunteers, respectively, each action being repeated for a certain time. The resolution of the video data in the database is 160 × 120, the frame rate is 25fps, and the image of each frame in the video is a gray scale image, wherein the training sample set has 382 samples, and the test set has 216 samples. The recording environment of the motion database and the information and parameter setting of the data acquisition device may refer to documents (e.g., reference [22]), which are not described in detail in the embodiments of the present invention.
According to the literature, action recognition with prior-art features such as Cuboid, HOG3D and Dense HOF reaches an accuracy of about 90%. With the hierarchical feature learning method, the action recognition accuracy reaches 91.7%. This result is superior to those features and proves the feasibility and effectiveness of the method.
In summary, the embodiment of the invention provides an action recognition algorithm based on hierarchical feature learning: training video sequences and candidate prediction video sequences are selected from each class of the action video data set; all training video sequences are divided into space-time blocks of equal size W × W × T; the covariance features of the blocks are constructed from the pixel information of the blocks and used as the initialization features of the blocks, forming the action data set. On this basis, the feature information is enriched by hierarchical learning to obtain the features of the video sequence. Finally, the classifier learns the model parameters and finds the separating surface in the feature space, making the final recognition result satisfactory.
Reference documents:
[1] Bobick A F, Davis J W. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(3): 257-267.
[2] Ke Y, Sukthankar R, Hebert M. Spatio-temporal shape and flow correlation for action recognition. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[3] Rao C, Shah M. View-invariance in action recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001(2): 316-322.
[4] Sheikh Y, Sheikh M, Shah M. Exploring the space of a human action. IEEE International Conference on Computer Vision, 2005.
[5] Laptev I. On space-time interest points. International Journal of Computer Vision, 2005, 64(2-3): 107-123.
[6] Dollár P, Rabaud V, Cottrell G, et al. Behavior recognition via sparse spatio-temporal features. Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2nd Joint IEEE International Workshop on. IEEE, 2005: 65-72.
[7] Andrews S, Tsochantaridis I, Hofmann T. Support vector machines for multiple-instance learning. Advances in Neural Information Processing Systems, 2002: 561-568.
[8] Wang H, Kläser A, Schmid C, et al. Action recognition by dense trajectories. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2011: 3169-3176.
[9] Scovanner P, Ali S, Shah M. A 3-dimensional SIFT descriptor and its application to action recognition. Proceedings of the 15th International Conference on Multimedia. ACM, 2007: 357-360.
[10] Bay H, Tuytelaars T, Van Gool L. SURF: Speeded up robust features. Computer Vision – ECCV. Springer Berlin Heidelberg, 2006: 404-417.
[11] Chen M, Hauptmann A. MoSIFT: Recognizing human actions in surveillance videos. 2009.
[12] Morency L P, Quattoni A, Darrell T. Latent-dynamic discriminative models for continuous gesture recognition. IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2007: 1-8.
[13] Yamato J, Ohya J, Ishii K. Recognizing human action in time-sequential images using hidden Markov model. 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 1992: 379-385.
[14] Brand M, Oliver N, Pentland A. Coupled hidden Markov models for complex action recognition. Proceedings, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997: 994-999.
[15] Wang J, Liu P, She M, et al. Human action categorization using conditional random field. Robotic Intelligence in Informationally Structured Space, 2011 IEEE Workshop, 2011: 131-135.
[16] Wang H, Schmid C. Action recognition with improved trajectories. IEEE International Conference on Computer Vision, 2013: 3551-3558.
[17] Liu J, Kuipers B, Savarese S. Recognizing human actions by attributes. IEEE Conference on Computer Vision and Pattern Recognition, 2011: 3337-3344.
[18] Wang J, Zucker J D. Solving multiple-instance problem: A lazy learning approach. 2000.
[19] Kläser A, Marszałek M, Schmid C. A spatio-temporal descriptor based on 3d-gradients. 19th British Machine Vision Conference. British Machine Vision Association, 2008: 275:1-10.
[20] Pennec X. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 2006, 25(1): 127-154.
[21] Faraki M, Palhang M, Sanderson C. Log-Euclidean bag of words for human action recognition. IET Computer Vision, 2014, 9(3): 331-339.
[22] Schüldt C, Laptev I, Caputo B. Recognizing human actions: A local SVM approach. In 17th International Conference on Pattern Recognition, pages 32-36, 2004.
in the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. An action recognition method based on hierarchical feature learning, characterized by comprising the following steps:
clustering the features of the region blocks in the training set, re-characterizing all region blocks with a bag-of-words model to obtain the features of high-level blocks, and mean-pooling the features of all blocks in one video to obtain the feature set of the video sequence;
modeling the feature set of the video sequence with a support vector machine to obtain model parameters;
selecting an action sequence from the test set as the test sequence, extracting its features through two layers of clustering and bag-of-words models, and inputting the features into the model to obtain the action class number of the action sequence.
2. The action recognition method based on hierarchical feature learning according to claim 1, further comprising:
selecting training video sequences and candidate prediction video sequences from each class of the motion video data set.
3. The action recognition method based on hierarchical feature learning according to claim 2, characterized in that
the training video sequences are divided into space-time blocks of equal size, the covariance features of the blocks are constructed from the pixel information of the blocks, and these covariance features serve as the initialization features of the blocks, forming an action data set.
4. The action recognition method based on hierarchical feature learning according to claim 1, characterized in that the hierarchical feature learning specifically comprises:
clustering the blocks in the training set with a clustering method, and then re-characterizing all blocks with a bag-of-words model to obtain the features of the bottom-layer blocks;
pooling, which fuses the bottom-layer features of all blocks surrounding a bottom-layer block taken as the center, obtaining the feature representation of a region block that is spatially larger than the bottom-layer block.
CN201710010477.1A 2017-01-06 2017-01-06 Action identification method based on hierarchical feature learning Pending CN106845375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710010477.1A CN106845375A (en) 2017-01-06 2017-01-06 Action identification method based on hierarchical feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710010477.1A CN106845375A (en) 2017-01-06 2017-01-06 Action identification method based on hierarchical feature learning

Publications (1)

Publication Number Publication Date
CN106845375A true CN106845375A (en) 2017-06-13

Family

ID=59117921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710010477.1A Pending CN106845375A (en) Action identification method based on hierarchical feature learning

Country Status (1)

Country Link
CN (1) CN106845375A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021381A (en) * 2014-06-19 2014-09-03 天津大学 Human movement recognition method based on multistage characteristics
CN105138953A (en) * 2015-07-09 2015-12-09 浙江大学 Method for identifying actions in video based on continuous multi-instance learning
CN105956604A (en) * 2016-04-20 2016-09-21 广东顺德中山大学卡内基梅隆大学国际联合研究院 Action identification method based on two layers of space-time neighborhood characteristics

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FARAKI M et al.: "Log-Euclidean bag of words for human action recognition", 《IET Computer Vision》 *
李文辉 (Li Wenhui): "基于多示例多特征的人体动作识别" [Human action recognition based on multiple instances and multiple features], 《信息技术》 [Information Technology] *
苏育挺 (Su Yuting) et al.: "基于层级化特征的人体动作识别" [Human action recognition based on hierarchical features], 《信息技术》 [Information Technology] *
黄少年 (Huang Shaonian) et al.: "基于高层语义词袋的人体行为识别方法" [Human behavior recognition method based on high-level semantic bag-of-words], 《电脑与电信》 [Computer & Telecommunication] *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754359A (en) * 2017-11-01 2019-05-14 腾讯科技(深圳)有限公司 Pooling processing method and system applied to convolutional neural networks
US11537857B2 (en) 2017-11-01 2022-12-27 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
US11734554B2 (en) 2017-11-01 2023-08-22 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
CN108960031A (en) * 2018-03-29 2018-12-07 中国科学院软件研究所 Video action classification system and method based on hierarchical dynamic decomposition and coding
CN110555341A (en) * 2018-05-31 2019-12-10 北京深鉴智能科技有限公司 Pooling method and apparatus, detection method and apparatus, electronic device, storage medium
CN109711244A (en) * 2018-11-05 2019-05-03 天津大学 Human behavior recognition method based on covariance descriptors
CN109948446A (en) * 2019-02-20 2019-06-28 北京奇艺世纪科技有限公司 Video clip processing method and device, and computer-readable storage medium
CN109948446B (en) * 2019-02-20 2021-07-16 北京奇艺世纪科技有限公司 Video clip processing method and device and computer readable storage medium
CN113516030A (en) * 2021-04-28 2021-10-19 上海科技大学 Action sequence verification method and device, storage medium and terminal
CN113516030B (en) * 2021-04-28 2024-03-26 上海科技大学 Action sequence verification method and device, storage medium and terminal
CN114550289A (en) * 2022-02-16 2022-05-27 中山职业技术学院 Behavior identification method and system and electronic equipment

Similar Documents

Publication Publication Date Title
Jiang et al. Recognizing human actions by learning and matching shape-motion prototype trees
Torralba et al. Describing visual scenes using transformed dirichlet processes
CN106845375A (en) A kind of action identification method based on hierarchical feature learning
Soomro et al. Action recognition in realistic sports videos
Yi et al. Human action recognition with graph-based multiple-instance learning
CN104616316B (en) Personage's Activity recognition method based on threshold matrix and Fusion Features vision word
Garg et al. Delta descriptors: Change-based place representation for robust visual localization
Gu et al. Multiple stream deep learning model for human action recognition
Rahman et al. Fast action recognition using negative space features
Chen et al. TriViews: A general framework to use 3D depth data effectively for action recognition
Shah et al. A novel biomechanics-based approach for person re-identification by generating dense color sift salience features
Zhang et al. Weakly supervised human fixations prediction
Zheng et al. Fusing shape and spatio-temporal features for depth-based dynamic hand gesture recognition
Afshar et al. Facial expression recognition in the wild using improved dense trajectories and fisher vector encoding
Wang et al. A comprehensive survey of rgb-based and skeleton-based human action recognition
Saqib et al. Intelligent dynamic gesture recognition using CNN empowered by edit distance
Basavaiah et al. Robust Feature Extraction and Classification Based Automated Human Action Recognition System for Multiple Datasets.
Aoun et al. Bag of sub-graphs for video event recognition
Najibi et al. Towards the success rate of one: Real-time unconstrained salient object detection
Liu et al. DISCOV: A framework for discovering objects in video
Zhu et al. Correspondence-free dictionary learning for cross-view action recognition
Zhou et al. Modeling perspective effects in photographic composition
Hashemi et al. View-independent action recognition: A hybrid approach
Said et al. Wavelet networks for facial emotion recognition
Alghyaline et al. Video action classification using symmelets and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170613