CN103593661B - A kind of human motion recognition method based on sort method - Google Patents
- Publication number
- CN103593661B · CN201310614110.2A · CN201310614110A
- Authority
- CN
- China
- Prior art keywords
- state
- training set
- label
- bow
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a human motion recognition method based on a sorting method, comprising the following steps: extracting space-time interest points and their corresponding position information from a video image sequence; building a dictionary; processing the training set and the test set separately with a bag-of-words model to build BoW features for each; training a classifier with the BoW features of the training set; and inputting the test data into the classifier to output discrimination labels, which are compared with the action labels of the test data to compute, for each action, the accuracy of correct discrimination. The method applies a sorting method to action recognition, opening a new direction for action recognition while reducing the complexity of iterative computation and improving recognition accuracy.
Description
Technical field
The present invention relates to the field of computer vision, and in particular to a human motion recognition method based on a sorting method.
Background art
Human action recognition is a key area of computer vision research and an attractive yet challenging problem. The visual analysis of human motion is an emerging frontier research field involving multiple disciplines such as pattern recognition, image processing, computer vision and artificial intelligence. It has broad application prospects in fields such as intelligent video surveillance, video annotation, virtual reality and human-computer interaction, and has become a research hotspot of computer vision and pattern recognition.
Human action recognition detects and tracks moving targets in image sequences containing human motion, and on this basis uses the dynamically changing visual patterns of the human action process to model and identify specific actions. Existing methods fall mainly into two categories: methods based on template matching and methods based on state spaces.
Template matching is a comparatively early approach to human motion recognition: a motion image sequence is converted into one or a group of static templates, and the recognition result is obtained by matching the templates of the sample to be identified against known templates [1]. In action recognition, template-matching algorithms can be divided into frame-to-frame matching and fused matching. The main methods include motion energy images (MEI) and motion history images (MHI), contour-based mean motion shapes (MMS), and average motion energy (AME) based on the moving foreground. However, these algorithms neglect the dynamic characteristics between adjacent frames, are rather sensitive to noise and to changes in the time intervals of the motion, and achieve poor recognition rates.
State-space methods perform action recognition by defining each static posture of an action as a state, or as a set of states connected in a network; the transitions between states are described with probabilities, so that each kind of motion can be regarded as a traversal of the different states (nodes) of the graph. The main algorithms are hidden Markov models (HMM) [2], dynamic Bayesian networks [3], artificial neural networks [4], finite state machines [5] and belief networks [6]. Although state-space methods can overcome the shortcomings of template matching, they usually involve complex iterative computation and rather intricate algorithmic steps, making them difficult to apply in practice.
Summary of the invention
The invention provides a human motion recognition method based on a sorting method. The invention reduces the complexity of iterative computation and improves recognition accuracy, as described below.
A human motion recognition method based on a sorting method comprises the following steps:
(1) extracting space-time interest points and their corresponding position information from a video image sequence;
(2) building a dictionary;
(3) processing the training set and the test set separately with a bag-of-words model, building the BoW features of the training set and the test set respectively;
(4) training a classifier with the BoW features of the training set;
(5) inputting the test data into the classifier to output discrimination labels, and computing, by comparing the discrimination labels with the action labels of the test data, the accuracy of correct discrimination for each action.
The operation of training the classifier with the BoW features of the training set is specifically as follows:
1) from the BoW features of the video sequences in the training set, use the Euclidean distance to build an n-dimensional similarity matrix;
2) obtain the transition probability matrix P by row-normalising the similarity matrix;
3) compute the stationary distribution π1 from the transition probability matrix, select the state i of maximum probability, and record this maximum probability;
4) turn state i into a partially absorbing state, obtain the new transition probability matrix and compute the stationary distribution π2 at this point, then compute the difference between π1 and π2 at each state;
5) if the difference of a state exceeds the threshold T, regard that state as similar to state i; select all states similar to state i, turn them all into partially absorbing states, and record the number m of remaining states;
6) then repeat the computation of the stationary distribution. If all differences are smaller than the threshold T, directly check whether m is 0: m ≠ 0 means no state is similar to the current state, so that state becomes a partially absorbing state, the number of remaining states is decreased to m−1, and the stationary-distribution computation is repeated; m = 0 means all states have been classified, and classifier training ends.
The beneficial effects of the technical scheme provided by the invention are as follows: the method processes the training set and the test set with a bag-of-words model, building the BoW features of each; trains a classifier with the BoW features of the training set; inputs the test data into the classifier to output discrimination labels; and, by comparing the discrimination labels with the action labels of the test data, computes for each action the accuracy of correct discrimination. The method applies a sorting method to action recognition, opening a new direction for action recognition while reducing the complexity of iterative computation and improving the accuracy of action recognition.
Brief description of the drawings
Fig. 1 is a schematic diagram of the database used by the method;
Fig. 2 is the flow chart of space-time interest point extraction;
Fig. 3 is the flow chart of classifier training;
Fig. 4 is the flow chart of classification and identification.
Detailed description of the invention
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below in conjunction with the accompanying drawings.
In order to reduce the complexity of iterative computation and improve the accuracy of recognition, an embodiment of the invention provides a human motion recognition method based on a sorting method, described below.
In traditional action representation methods, accuracy suffers from the precision of tracking and pose estimation; in scenes with many moving objects or complex backgrounds, the robustness of such features is challenged [7]. Recently, a new action representation method based on space-time interest points has appeared: an action is represented by the histograms of oriented gradients and optical-flow histograms computed over the point set formed by the interest points. Such methods are relatively flexible and achieve high action recognition accuracy. Based on this idea, the present invention combines action recognition with a sorting method to realise accurate recognition of human actions.
Expressing image content as a histogram of occurrence counts of specific "visual words" — the bag-of-words (BoW) model — shows a strong advantage for human action classification. However, in this model, which only counts "visual word" occurrences, the spatial position information of the "visual words" and the relations between them are almost completely discarded. Building on Laptev's recognition method [8], the present method computes the Euclidean distance between two "word bags", adding the spatial position information of the "visual words" to human action recognition, which can considerably improve recognition accuracy.
Sorting (ranking) methods are another key area of computer vision research and a very important link in computer vision. They can be applied to fields such as information retrieval and play an important role in reducing the demands placed on the running speed of computer hardware. Research combining the visual analysis of human actions with sorting methods therefore has great commercial value and practical significance.
101: extract space-time interest points (space-time interest point, STIP) and their corresponding position information from the video image sequence;
Referring to Fig. 1, the database used by the method is the TJU (Tianjin University) database, which contains 20 people (10 male, 10 female), 22 actions (walking, jogging, running, boxing (boxing forward), two_hand_waving (waving both hands), handclapping, P_bend (bending sideways), jacks (jumping jacks), jump (long jump), P_jump (jumping in place), side_walking (crab step), single_hand_waving (waving one hand), draw_X, draw_tick, draw_circle, forward_tick (forward kick), side_tick (side kick), tennis_swing, tennis_serve, side_boxing, bend (bending forward), sit_down) and two scenes (bright, dark); each person performs each action four times, for 3520 video clips in total. All videos were captured against a uniform background, at a frame rate of 20 fps and a spatial resolution of 640x480. Following the division method proposed in the reference [9], the TJU database is split into 3 data sets: a training set (train, 1056 clips), a cross-validation set (validation, 1056 clips) and a test set (test, 1408 clips). Each action corresponds to one action label; the action labels can be set to the numbers 1 to 22 (the embodiment of the invention does not limit the concrete setting; it is only required that the same action carries the same label in the training set and the test set).
An interest point, as the name suggests, is a point that has certain characteristics and attributes of interest; in computer vision, an interest point is a point with sharp brightness variation in the image or along the time dimension. Space-time interest points [10] are found by extending spatial interest-point methods to the spatio-temporal domain, locating points with large variation in the time or space dimensions. The purpose of detection is to find the spatio-temporal events occurring in a video sequence; generally, an intensity function is provided, the intensity value is computed at each position in the video sequence, and interest points are found by maximum filtering. Whether the space-time interest points are extracted accurately has a great influence on the subsequent action recognition. Reference [10] extends the Harris corner-detection idea to the spatio-temporal domain, obtaining a detection method for space-time interest points, and represents an action with the point set formed by the interest points.
The method uses the space-time interest point extraction method proposed by Laptev. Referring to Fig. 2, Laptev extends Harris corner detection in two-dimensional images into the three-dimensional spatio-temporal domain: with the Harris3D detector, the pixels (interest points) of maximal variation in the spatio-temporal directions are detected in the videos (Laptev's work used the KTH database); a space-time cube is established centred on each interest point, and the joint optical-flow/gradient histogram descriptor HoG/HoF is extracted to characterise the action. Finally, all STIP points extracted in the video and their position information are obtained; the position information of each STIP point consists of its frame number Ff and its coordinates (xf, yf) in that frame. The embodiment of the invention is illustrated with STIP points of dimension 162 (the first 72 dimensions are the HoG descriptor, the last 90 dimensions the HoF descriptor); the embodiment of the invention does not limit this in concrete implementations.
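The 162-dimensional descriptor layout above (72 HoG dimensions followed by 90 HoF dimensions) can be sketched as a small helper. This is an illustrative assembly step, assuming the HoG and HoF parts have already been computed; the function name is not from the patent:

```python
import numpy as np

def stip_descriptor(hog, hof):
    """Assemble one 162-dim STIP descriptor: the first 72 dimensions
    are the HoG part, the last 90 the HoF part (illustrative helper)."""
    hog = np.asarray(hog, dtype=float)
    hof = np.asarray(hof, dtype=float)
    assert hog.shape == (72,) and hof.shape == (90,)
    return np.concatenate([hog, hof])   # shape (162,)
```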
102: build dictionary;
The STIP points of all samples in the training set are clustered with the K-means algorithm [11] to obtain the cluster centres, i.e. the dictionary. K is the number of cluster centres; the K-means algorithm selects K cluster centres from all input STIP points. Running K-means on the STIPs of the training set therefore yields a K*162 matrix in which each row represents one cluster centre, K rows in total. This K*162 matrix is the dictionary.
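The clustering step above can be sketched with a minimal NumPy-only K-means, assuming the training-set STIPs are stacked into an (N, 162) array; `build_dictionary` and its parameters are illustrative names, not prescribed by the patent:

```python
import numpy as np

def build_dictionary(stips, K, n_iter=20, seed=0):
    """Cluster the (N, 162) array of training-set STIP descriptors into
    K visual words; the returned (K, 162) array of cluster centres is
    the dictionary of step 102."""
    rng = np.random.default_rng(seed)
    # K-means selects its initial centres from the input STIP points.
    centres = stips[rng.choice(len(stips), size=K, replace=False)]
    for _ in range(n_iter):
        # Assign each descriptor to its nearest centre (Euclidean distance).
        d = np.linalg.norm(stips[:, None, :] - centres[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        # Move each centre to the mean of its assigned descriptors.
        for k in range(K):
            if np.any(nearest == k):
                centres[k] = stips[nearest == k].mean(axis=0)
    return centres
```

A library implementation (e.g. scikit-learn's `KMeans`) would normally replace this loop in practice.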
103: process the training set and the test set separately with the bag-of-words model [9], building the BoW features of the training set and the test set respectively.
For the 1056 videos of the training set, the BoW features are built by inputting all STIP points contained in the training-set videos, together with the dictionary, into the BoW model; the output is the BoW feature of the training set. This BoW feature is a 1056*K matrix in which each row represents the feature histogram, i.e. the BoW feature, of one video. Adding the action label of each video as the first column of the matrix yields a 1056*(K+1) matrix: the labelled BoW features of the training set.
The test set contains 1408 videos; the BoW features are extracted in the same way as for the training set (using the dictionary built in step 102). The output BoW feature is a 1408*(K+1) matrix whose first column holds the action labels of the test set.
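The BoW construction above can be sketched as follows, under the assumption that each STIP is quantised to its nearest dictionary word by Euclidean distance; the helper names are illustrative:

```python
import numpy as np

def bow_feature(video_stips, dictionary):
    """Histogram of visual-word occurrences for one video: each STIP is
    assigned to its nearest dictionary word (Euclidean distance) and the
    counts form the length-K BoW feature."""
    d = np.linalg.norm(video_stips[:, None, :] - dictionary[None, :, :], axis=2)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(dictionary)).astype(float)

def labelled_bow_matrix(videos, labels, dictionary):
    """Stack the per-video BoW rows and prepend the action-label column,
    giving the n x (K+1) matrix described above."""
    feats = np.vstack([bow_feature(v, dictionary) for v in videos])
    return np.hstack([np.asarray(labels, float)[:, None], feats])
```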
104: train the classifier with the BoW features of the training set;
The human motion recognition method adopted here combines the GrassHopper-based sorting method [12] with the visual analysis of human behaviour: first, a classifier is trained to obtain a classification model; then the test-set data are judged according to the classification model, and the accuracy is computed by comparing the discrimination labels with the action labels of the test data.
1) For the BoW features of the video sequences in the training set (the BoW feature of each video is regarded as a point, and the distance between any two BoW features is computed), the Euclidean distance is used to build an n-dimensional similarity matrix.
The Euclidean distance represents the actual distance between two points in n-dimensional space, computed as:
D(A, B) = sqrt[Σᵢ (a[i] − b[i])²] (i = 1, 2, …, n), where A = (a[1], a[2], …, a[n]) and B = (b[1], b[2], …, b[n]) represent any two points in n-dimensional space.
Similarity matrix: each row holds the Euclidean distances between one video sequence and all other video sequences; the diagonal elements of this matrix are all 0, and the matrix is symmetric about the diagonal.
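The matrix just described can be computed with a standard vectorised pairwise-distance sketch (note that the patent uses the Euclidean distances directly as the entries of its "similarity matrix"); a minimal NumPy version:

```python
import numpy as np

def similarity_matrix(bow_feats):
    """Step 1: pairwise Euclidean distances between the n BoW features,
    each treated as a point. Row i holds the distances from video i to
    every video; the diagonal is 0 and the matrix is symmetric."""
    X = np.asarray(bow_feats, float)
    sq = (X ** 2).sum(axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped for numerical safety
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0, None)
    return np.sqrt(d2)
```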
2) The transition probability matrix P is obtained by row-normalising the similarity matrix;
i.e. the elements of each row of the transition probability matrix sum to 1, and each element of the transition probability matrix represents the transition probability between BoW features.
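The row normalisation above can be sketched in one line of NumPy (assuming every row of the matrix has a nonzero sum, which holds when the videos are not all identical):

```python
import numpy as np

def transition_matrix(S):
    """Step 2: row-normalise the similarity matrix so that each row of
    the transition probability matrix P sums to 1."""
    S = np.asarray(S, float)
    return S / S.sum(axis=1, keepdims=True)
```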
3) The stationary distribution π1 is computed from the transition probability matrix; the state i of maximum probability is selected, and this maximum probability is recorded.
Stationary distribution: π1 is a column vector computed from the formula π1 = Pπ1; each element of π1 represents the probability of a different state (i.e. a different video), and the elements sum to 1.
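A minimal power-iteration sketch for the stationary distribution follows. The patent writes π1 = Pπ1 for a column vector; for a row-stochastic P this is read here as the fixed point π = Pᵀπ (the usual left stationary vector) — an interpretation, not something the patent states explicitly:

```python
import numpy as np

def stationary_distribution(P, n_iter=1000, tol=1e-12):
    """Power iteration for the stationary distribution of the chain P.
    Reads the patent's pi1 = P*pi1 as the fixed point pi = P^T pi for a
    row-stochastic P; pi is a column vector whose elements sum to 1."""
    pi = np.full(len(P), 1.0 / len(P))
    for _ in range(n_iter):
        new = P.T @ pi
        new /= new.sum()              # keep the elements summing to 1
        done = np.abs(new - pi).max() < tol
        pi = new
        if done:
            break
    return pi
```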
4) State i is turned into a partially absorbing state, the new transition probability matrix is obtained and the stationary distribution π2 at this point is computed; then the difference between π1 and π2 at each state is computed.
A partially absorbing state is a state in which there is a certain probability of being absorbed; the method defines the absorbing probability as 0.75 (i.e. the diagonal element of that row of the transition probability matrix becomes pᵢᵢ = 0.75, and each pᵢⱼ (i ≠ j) becomes the corresponding element of the current transition probability matrix multiplied by 0.25).
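The partial-absorption rule just stated (diagonal entry set to 0.75, remaining entries of the row scaled by 0.25) can be sketched as:

```python
import numpy as np

def make_partially_absorbing(P, i, p_absorb=0.75):
    """Step 4: turn state i into a partially absorbing state.
    The diagonal entry becomes p_ii = 0.75 and every other entry of
    row i is the current probability multiplied by 0.25; the row still
    sums to 1 because the diagonal of the initial P is 0 (self-distance
    is 0 in the similarity matrix)."""
    P = np.asarray(P, float).copy()
    row = P[i] * (1.0 - p_absorb)   # scale p_ij (j != i) by 0.25
    row[i] = p_absorb               # set p_ii = 0.75
    P[i] = row
    return P
```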
5) If the difference of a state exceeds the threshold T, that state is regarded as similar to state i. All states similar to state i are selected and all turned into partially absorbing states (i.e. in the rows of the transition probability matrix corresponding to the partially absorbing states, the element on the diagonal is 0.75, and the other elements become the corresponding elements of the current transition probability matrix multiplied by 0.25), and the number m of remaining states is recorded.
6) The computation of the stationary distribution is then repeated (from step 3). If all differences are smaller than the threshold T, it is directly checked whether m is 0: m ≠ 0 means no state is similar to the current state, so that state becomes a partially absorbing state, the number of remaining states is decreased to m−1, and the stationary-distribution computation is repeated; m = 0 means all states have been classified, and classifier training ends.
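Steps 1) to 6) can be combined into one training-loop sketch. This is an interpretation of the patent's loop, with the stationary vector again read as the fixed point π = Pᵀπ and the threshold T taken as the maximum probability divided by n, as in the worked example that follows; all helper names are illustrative:

```python
import numpy as np

def train_grouping(S, p_absorb=0.75, n_iter=2000):
    """Sketch of the step-104 loop over an n x n pairwise-distance
    matrix S of BoW features. Returns one class index per state."""
    n = len(S)

    def stationary(P):
        # Power iteration for the stationary distribution of P.
        pi = np.full(n, 1.0 / n)
        for _ in range(n_iter):
            new = P.T @ pi
            new /= new.sum()
            done = np.abs(new - pi).max() < 1e-12
            pi = new
            if done:
                break
        return pi

    def absorb(P, i):
        # Make state i partially absorbing: p_ii = 0.75, others x 0.25.
        P = P.copy()
        row = P[i] * (1.0 - p_absorb)
        row[i] = p_absorb
        P[i] = row
        return P

    P = S / S.sum(axis=1, keepdims=True)          # step 2: row-normalise
    labels = np.full(n, -1)                       # -1 = not yet classified
    cls = 0
    while (labels == -1).any():
        pi1 = stationary(P)                       # step 3
        cand = np.where(labels == -1, pi1, -np.inf)
        i = int(cand.argmax())                    # most probable free state
        thr = cand[i] / n                         # threshold T
        P = absorb(P, i)                          # step 4
        diff = np.abs(pi1 - stationary(P))
        group = [i] + [j for j in range(n)        # step 5: similar states
                       if j != i and labels[j] == -1 and diff[j] > thr]
        for j in group:
            if j != i:
                P = absorb(P, j)
            labels[j] = cls
        cls += 1                                  # step 6: repeat until done
    return labels
```

Each pass through the loop classifies at least one state, so the procedure always terminates after at most n passes.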
The process of step 104 is illustrated below with an example.
Suppose a similarity matrix is given; the transition probability matrix P is then obtained from it by row normalisation.
Stationary distribution: π1 = Pπ1, with the elements of π1 summing to 1. In the example, the state of maximum probability is i = 1, and this maximum probability is 0.3095.
After state i is turned into a partially absorbing state, the new transition probability matrix is obtained and the stationary distribution π2 at this point is computed, with the same method as for π1.
The differences between π1 and π2 at each state are then computed (taking positive values).
In actual application, the threshold T is set empirically as a column vector whose elements are 1/n of the maximum probability.
Comparing the differences with the threshold shows that states 3 and 4 are similar to state 1, while state 2 forms a class of its own. When there are many states, all similar states must be turned into partially absorbing states (with the same method as for state 1); the stationary distribution is then computed, the state j of maximum probability is selected and this maximum probability recorded, state j is turned into a partially absorbing state, the stationary distribution at this point is computed, the differences of the two stationary distributions are compared, the states similar to state j are selected and all turned into partially absorbing states, and the computation of the stationary distribution is repeated until all states have been classified.
105: input the test data into the classifier to output discrimination labels; by comparing the discrimination labels with the action labels of the test data, compute for each action the accuracy of correct discrimination.
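A minimal sketch of the per-action accuracy computation, assuming integer action labels (the function name is illustrative):

```python
import numpy as np

def per_action_accuracy(pred_labels, true_labels):
    """Step 105: compare the discrimination labels output by the
    classifier with the action labels of the test data and compute,
    for each action, the fraction of clips judged correctly."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return {int(a): float((pred[true == a] == a).mean())
            for a in np.unique(true)}
```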
In summary, the embodiment of the invention processes the training set and the test set with a bag-of-words model, building the BoW features of each; trains the classifier with the BoW features of the training set; inputs the test data into the classifier to output discrimination labels; and, by comparing the discrimination labels with the action labels of the test data, computes for each action the accuracy of correct discrimination. The method reduces the complexity of iterative computation and improves the accuracy of action recognition.
Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the sequence numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
List of references
[1] Zhang Jie. Research on human action recognition and classification methods in video. Xidian University, 2011.
[2]RABINER L R.A tutorial on hidden Markov models and selected
applications in speech recognition[J].Proc of the IEEE,1989,77(2):257-286.
[3]MURPHY K.Dynamic Bayesian networks:representation,inference and
learning[D].Berkeley:University of California,2002.
[4]BUCCOLIERI F,DISTANTE C,LEONE A.Human posture recognition using
active contours and radial basis function neural network[C].Proc of
Conference on Advanced Video And Signal Based Surveillance.2005.
[5]HONG Peng-yu,TURK M,HUANG T S.Gesture modeling and recognition
using finite State machines[C].Proc of IEEE Conference on Face and Gesture
Recognition.2000.
[6] INTILLE S, BOBICK A. Representation and visual recognition of complex, multi-agent actions using belief networks, NO.454 [R]. [S.l.]: MIT, 1998.
[7] Hu Fei, Luo Limin, Liu Jia, et al. Action recognition based on space-time interest points and topic model [J]. Journal of Southeast University (Natural Science Edition), 2011, 41(5): 962-966.
[8] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[9]Li Feifei,Perona P.A Bayesian Hierarchical Model for Learning
Natural Scene Categories[C],Proc.of CVPR’05.San Diego,CA,USA:[s.n.],2005:524-
531.
[10] Laptev I. On space-time interest points [J]. International Journal of Computer Vision, 2005, 64(2/3): 107-123.
[11]J.B.MacQueen,Some Methods for classification and Analysis of
Multivariate Observations,Proceedings of 5-th Berkeley Symposium on
Mathematical Statistics and Probability,Berkeley,University of California
Press,1967,1:281-297.
[12]Xiaojin Zhu,Andrew B.Goldberg,Jurgen Van Gael,David
Andrzejewski.Improving Diversity in Ranking using Absorbing Random Walks[J],
NAACL HLT 2007,97–104.
Claims (1)
1. A human motion recognition method based on a sorting method, characterised in that the method comprises the following steps:
(1) extracting space-time interest points and their corresponding position information from a video image sequence;
(2) building a dictionary;
(3) processing the training set and the test set separately with a bag-of-words model, building the BoW features of the training set and the test set respectively;
(4) training a classifier with the BoW features of the training set;
(5) inputting the test data into the classifier to output discrimination labels, and computing, by comparing the discrimination labels with the action labels of the test data, the accuracy of correct discrimination for each action respectively;
wherein the operation of training the classifier with the BoW features of the training set is specifically:
1) from the BoW features of the video sequences in the training set, using the Euclidean distance to build an n-dimensional similarity matrix;
2) obtaining the transition probability matrix P by row-normalising the similarity matrix;
3) computing the stationary distribution π1 from the transition probability matrix, selecting the state i of maximum probability, and recording this maximum probability;
4) turning state i into a partially absorbing state, obtaining the new transition probability matrix and computing the stationary distribution π2 at this point, then computing the difference between π1 and π2 at each state;
5) if the difference of a state exceeds the threshold T, regarding that state as similar to state i, selecting all states similar to state i, turning them all into partially absorbing states, and recording the number m of remaining states;
6) then repeating the computation of the stationary distribution; if all differences are smaller than the threshold T, directly checking whether m is 0, where m ≠ 0 means no state is similar to the current state, so that state becomes a partially absorbing state, the number of remaining states is decreased to m−1, and the stationary-distribution computation is repeated; and m = 0 means all states have been classified, and classifier training ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310614110.2A CN103593661B (en) | 2013-11-27 | 2013-11-27 | A kind of human motion recognition method based on sort method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310614110.2A CN103593661B (en) | 2013-11-27 | 2013-11-27 | A kind of human motion recognition method based on sort method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103593661A CN103593661A (en) | 2014-02-19 |
CN103593661B true CN103593661B (en) | 2016-09-28 |
Family
ID=50083793
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310614110.2A Active CN103593661B (en) | 2013-11-27 | 2013-11-27 | A kind of human motion recognition method based on sort method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103593661B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10210391B1 (en) | 2017-08-07 | 2019-02-19 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for detecting actions in videos using contour sequences |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10242266B2 (en) | 2016-03-02 | 2019-03-26 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for detecting actions in videos |
CN106384093B (en) * | 2016-09-13 | 2018-01-02 | 东北电力大学 | A kind of human motion recognition method based on noise reduction autocoder and particle filter |
CN108960031A (en) * | 2018-03-29 | 2018-12-07 | 中国科学院软件研究所 | A kind of video actions categorizing system and method based on layering kinetic resolution and coding |
CN110110598A (en) * | 2019-04-01 | 2019-08-09 | 桂林电子科技大学 | The pedestrian of a kind of view-based access control model feature and space-time restriction recognition methods and system again |
CN110300230B (en) * | 2019-07-01 | 2021-03-19 | 腾讯科技(深圳)有限公司 | Application control method, device, storage medium and terminal |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102945375A (en) * | 2012-11-20 | 2013-02-27 | 天津理工大学 | Multi-view monitoring video behavior detection and recognition method under multiple constraints |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7433494B2 (en) * | 2002-09-19 | 2008-10-07 | Denso Corporation | Moving body detecting apparatus |
-
2013
- 2013-11-27 CN CN201310614110.2A patent/CN103593661B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102945375A (en) * | 2012-11-20 | 2013-02-27 | 天津理工大学 | Multi-view monitoring video behavior detection and recognition method under multiple constraints |
Non-Patent Citations (1)
Title |
---|
Action recognition based on 3D DAISY descriptors; Deng Chao; China Master's Theses Full-text Database, Information Science and Technology; 2012-08-15 (No. 08); abstract, pages 6-9, 20-22, 26-27 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10210391B1 (en) | 2017-08-07 | 2019-02-19 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for detecting actions in videos using contour sequences |
Also Published As
Publication number | Publication date |
---|---|
CN103593661A (en) | 2014-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103593661B (en) | A kind of human motion recognition method based on sort method | |
CN107784293B (en) | A kind of Human bodys' response method classified based on global characteristics and rarefaction representation | |
Liu et al. | Improved human action recognition approach based on two-stream convolutional neural network model | |
US20180247126A1 (en) | Method and system for detecting and segmenting primary video objects with neighborhood reversibility | |
Abdul-Azim et al. | Human action recognition using trajectory-based representation | |
WO2017150032A1 (en) | Method and system for detecting actions of object in scene | |
Jisi et al. | A new feature fusion network for student behavior recognition in education | |
CN103605986A (en) | Human motion recognition method based on local features | |
Jin et al. | Real-time human action recognition using CNN over temporal images for static video surveillance cameras | |
Tekin et al. | Predicting people’s 3D poses from short sequences | |
Vantigodi et al. | Real-time human action recognition from motion capture data | |
CN115862136A (en) | Lightweight filler behavior identification method and device based on skeleton joint | |
Kumar et al. | 3D sign language recognition using spatio temporal graph kernels | |
Dewan et al. | Spatio-temporal Laban features for dance style recognition | |
Barkoky et al. | Complex Network-based features extraction in RGB-D human action recognition | |
Zhou et al. | A study on attention-based LSTM for abnormal behavior recognition with variable pooling | |
Jiang et al. | A unified tree-based framework for joint action localization, recognition and segmentation | |
Talukdar et al. | Human action recognition system using good features and multilayer perceptron network | |
Pang et al. | Analysis of computer vision applied in martial arts | |
Park et al. | Binary dense sift flow based two stream CNN for human action recognition | |
CN111626197B (en) | Recognition method based on human behavior recognition network model | |
Gong et al. | Research on an improved KCF target tracking algorithm based on CNN feature extraction | |
CN103020631B (en) | Human movement identification method based on star model | |
Ramzan et al. | Automatic Unusual Activities Recognition Using Deep Learning in Academia. | |
Yuan et al. | A systematic survey on human behavior recognition methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |