CN106909890A

CN106909890A - Human behavior recognition method based on body-part clustering features

Info

Publication number: CN106909890A
Application number: CN201710057722.4A
Authority: CN
Inventors: 孔德慧; 贾文浩; 孙彬; 王少帆
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-01-23
Filing date: 2017-01-23
Publication date: 2017-06-30
Anticipated expiration: 2037-01-23
Also published as: CN106909890B

Abstract

The invention discloses a human behavior recognition method based on body-part clustering features, comprising: Step 1, in the training stage, the body-part cluster feature points of every frame of each training video are first extracted by pose estimation, and the local and global position offsets of each feature point in each frame are then computed; the feature-point offset information of all training videos is then collected and clustered with the K-means clustering algorithm, the resulting cluster centers form the codebook, and the current training video is then represented, according to the codebook, by a set of histograms of the joint feature points. Step 2, in the test stage, histograms are first built for a test video using the codebook formed in the training stage, and the behavior is then recognized by comparing the test-stage histograms with the training-stage histograms using the Naive-Bayes Nearest-Neighbor classification method. The technical solution of the invention achieves a high recognition rate.

Description

Human behavior recognition method based on body-part clustering features

Technical field

The invention belongs to the fields of computer vision and pattern recognition, and in particular relates to a human behavior recognition method based on body-part clustering features.

Background

In recent years, human behavior recognition has attracted increasing attention. Understanding what a person is doing, and even inferring the person's intent, by analyzing the interaction between people and objects is particularly important, so the automatic understanding and recognition of human actions is critical to many artificial intelligence systems. It has wide practical application in fields such as intelligent video surveillance, motion retrieval, human-computer interaction, and health care. For example, to build a human-computer interaction system that serves people intelligently, the system must not only perceive the motion of the human body but also understand the semantics of human actions and infer their intent.

Traditional action recognition methods mainly rely on video sequences captured by RGB cameras; the video obtained in this case is a 2D RGB image sequence ordered in time. Human action recognition based on RGB information has made great progress over the past decades, and many methods have been proposed, including human key poses, motion templates, silhouettes, and space-time shapes. Methods based on spatio-temporal detection allow accurate similarity measurement, and methods based on dense motion trajectories have attracted attention because of their outstanding performance.

Although the above methods achieve good recognition results on the relevant standard test datasets, human action recognition remains extremely challenging. Human actions are highly flexible; body pose, motion, and clothing vary significantly between individuals; and camera viewpoint, camera motion, changes in illumination, occlusion, self-occlusion, human-object interaction, and complex spatio-temporal structure all have a combined effect. RGB information is also easily affected by environmental factors: changes in illumination, background, and so on all introduce varying degrees of interference. Furthermore, the RGB images of two different behaviors may be very similar, which makes action classification very difficult.

With advances in sensor technology, inexpensive high-resolution depth sensors have become available, such as Microsoft's Kinect and ASUS's Xtion PRO LIVE. Each pixel of a depth image captured by a depth camera records the depth value of the scene, which is entirely different from the light intensity recorded by a pixel of an ordinary RGB image. Depth sensors greatly expand a computer system's ability to perceive the three-dimensional world and extract low-level visual information. Compared with traditional RGB cameras, depth sensors have a clear advantage for human action recognition: they are unaffected by illumination conditions and are invariant to color and texture. An RGBD camera captures an RGB sequence and a depth sequence at the same time, and depth information greatly simplifies object detection and segmentation. Viewed from a single viewpoint, different behaviors may have similar 2D projections; in that case the depth map provides additional body-shape information that distinguishes them. For these reasons, much recent research has focused on behavior recognition using 3D information, and the 3D information obtained by RGBD cameras has significantly improved human pose estimation.

Among these, Lu et al. proposed an effective scheme for recognizing human actions: actions are recognized by computing the local position offsets of the 3D positions of the human joints. However, this method does not take the characteristics of the time series into account, so the histograms that record the joint information lose the sequential continuity of the action; nor does it consider, at the codebook formation stage, the independence of each joint's motion in action recognition.

In addition, when capturing a human body, Microsoft's Kinect camera provides not only a depth map of the body but also the positions of 16 body joints, and most researchers base their human action recognition on the joint information provided by Kinect. However, for roughly the first 20 frames of a capture Kinect is still determining the position of the body in the image and cannot provide joint positions. Moreover, when the amplitude of the motion is large, for example when the body moves from standing upright to kicking, the joint positions given by Kinect are considerably offset and not accurate enough, as shown in Fig. 1.

Summary of the invention

The technical problem to be solved by the invention is to provide a human behavior recognition method based on body-part clustering features that achieves a high recognition rate.

To achieve the above object, the invention adopts the following technical solution:

A human behavior recognition method based on body-part clustering features comprises:

Step 1, in the training stage, the body-part cluster feature points of every frame of each training video are first extracted by pose estimation, and the position offset of each feature point in each frame relative to the corresponding feature point in an earlier frame is then computed; the feature-point offset information of all training videos is then collected and clustered with the K-means clustering algorithm, the resulting cluster centers form the codebook, and the current training video is then represented, according to the codebook, by a set of histograms of the joint feature points;

Step 2, in the test stage, histograms are first built for a test video using the codebook formed in the training stage, and the behavior is then recognized by comparing the test-stage histograms with the training-stage histograms using the Naive-Bayes Nearest-Neighbor (NBNN) classification method.

Preferably, Step 1 comprises the following steps:

Step 1.1, human pose feature point extraction, comprising the following steps:

Step 1.1.1, the positions of the body's extremity points are first located accurately, and the body is then divided into regions centered on the extremity points. Using geodesic distance as the classification criterion and the nearest-neighbor algorithm as the classification tool, the depth pixels of the human body are divided into six parts, namely head, left arm, right arm, left leg, right leg, and trunk. The body parts are classified according to the following formula:

\Omega_{i'} = \{ v \in V : \| v - e^{i'} \|_{geod} \le \min_{j' = 0, \ldots, 5} \| v - e^{j'} \|_{geod} \}, \quad i' = 0, \ldots, 5

where Ω_{i'}, i' = 0, …, 5, denote the six classified body regions, corresponding to the head, left arm, right arm, left leg, right leg, and trunk; v denotes a pixel of the human body; e^{i'} denotes the i'-th extremity point, i.e., the left or right hand or the left or right foot, and for i' = 0, e^{i'} denotes the center point of the body; ‖v − e^{i'}‖_geod denotes the geodesic distance from pixel v to extremity point e^{i'}.

Step 1.1.2, a K-means-based region clustering algorithm is used to extract the per-part feature points of the human body, i.e., clustering is performed within each of the body-part regions obtained above around the extremity points; following the joint-based representation of the human body, the cluster feature points are extracted to characterize different human poses.

Step 1.2, computing the feature vector of the human action sequence

This comprises the following steps:

Step 1.2.1, computing the position offsets: for a video sequence F of n frames, the 3D coordinates f(t) of the m feature points of each frame can be obtained by human pose estimation:

f(t) = \phi(t) = \{ \theta_1(t), \theta_2(t), \ldots, \theta_m(t) \}, \quad t \in \{1, 2, \ldots, n\}

where θ_i(t) = (x_i(t), y_i(t), z_i(t)), i ∈ {1, 2, …, m}; θ_i(t) denotes the 3D coordinates of the i-th human feature point of f(t), and m denotes the number of feature points.

The global offset information of the action sequence is obtained by computing the position offsets of the feature points between the current frame t and the first frame:

f_{i1}(t) = \theta_i(t) - \theta_i(1)

The local offset information of the action sequence is obtained by computing the position offsets of the feature points between the current frame t and frame (t − Δt):

f_{i2}(t) = \theta_i(t) - \theta_i(t - \Delta t)

where Δt is a time interval.

Once the offset information of all human feature points in frame t has been obtained, the feature information of all feature points in frame t can be represented by two parts, the global offset information f₁(t) and the local offset information f₂(t), as follows:

f_1(t) = [f_{11}(t), f_{21}(t), \ldots, f_{m1}(t)]

f_2(t) = [f_{12}(t), f_{22}(t), \ldots, f_{m2}(t)]
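The two offsets can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names are hypothetical, frames are indexed from 0 rather than 1, and the handling of frames with t < Δt (rejected here) is an assumption, since the text does not specify it.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Frame = List[Point]  # the m feature points of one frame


def sub(a: Point, b: Point) -> Point:
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def global_offsets(video: List[Frame], t: int) -> List[Point]:
    """f_1(t): offset of every feature point relative to the first frame."""
    return [sub(p, q) for p, q in zip(video[t], video[0])]


def local_offsets(video: List[Frame], t: int, dt: int) -> List[Point]:
    """f_2(t): offset of every feature point relative to frame t - dt."""
    if t < dt:
        # Assumption: the local offset is simply undefined for early frames.
        raise ValueError("local offset is undefined for t < dt")
    return [sub(p, q) for p, q in zip(video[t], video[t - dt])]


# Toy video: n = 3 frames, m = 2 feature points.
video = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(1.0, 0.0, 0.0), (1.0, 2.0, 1.0)],
    [(3.0, 0.0, 0.0), (1.0, 4.0, 1.0)],
]
g = global_offsets(video, 2)       # relative to frame 0
l = local_offsets(video, 2, dt=1)  # relative to frame 1
```

In the toy data the first point moves along x and the second along y, so the global offsets are larger than the local ones for the last frame.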

Step 1.2.2, obtaining the action sequence feature vector of a video

Assume that all human feature points of each training video are represented by a set of offset information. The global offset vectors of every feature point of all collected videos are denoted R₁, i.e., R₁ collects the global offsets f_{i1}(t) of the i-th feature point in frame t of the j-th training video, over all i, t, and j. Likewise, the local offset vectors of every feature point of all collected videos are denoted R₂, which collects the local offsets f_{i2}(t) of the i-th feature point in frame t of the j-th training video. Let R = R₁ ∪ R₂. R is then clustered with the K-means algorithm, using Euclidean distance as the clustering metric, to form the codebook {b_k}, k = 1, 2, …, K, where each code word is the center of one cluster.

For each training video F = {f(t)}, t = 1, 2, …, n, where n is the number of frames, the global offset vector f_{i1}(t) or local offset vector f_{i2}(t) of each human feature point i in each frame f(t) is assigned to the code word in the codebook {b_k} with the shortest Euclidean distance, i.e.,

[f_{i1}(t) \text{ or } f_{i2}(t)] \leftarrow \arg\min_{b_k} \| [f_{i1}(t) \text{ or } f_{i2}(t)] - b_k \|, \quad k \in \{1, 2, \ldots, K\}

Therefore, the motion of each feature point i in F, i.e., all position offsets f_{i1}(t) and f_{i2}(t) of feature point i in the video, can be further represented by a histogram h_i of code-word frequencies. It consists of h_i^1 and h_i^2, where h_i^1 is the global offset histogram of the i-th feature point and h_i^2 is the local offset histogram of the i-th feature point, i.e.,

h_i^1(k) = \frac{\#\{ b_k : f_{i1}(t) = b_k \}}{n}, \quad i = 1, 2, \ldots, m, \quad k = 1, 2, \ldots, K

h_i^2(k) = \frac{\#\{ b_k : f_{i2}(t) = b_k \}}{n}, \quad i = 1, 2, \ldots, m, \quad k = 1, 2, \ldots, K

h_i = [h_i^1, h_i^2]

where #{·} is a counting function. Finally, F can be represented by the set of histograms of all feature points, i.e., F = {h_i}, i = 1, 2, …, m, where h_i is the histogram of the i-th feature point.

Preferably, in Step 2 the Naive-Bayes Nearest-Neighbor (NBNN) classifier is used to classify the action: given a video sequence F = {h_i}, i = 1, 2, …, m, represented by a set of feature-point histograms, where m is the number of feature points,

the original NBNN concept for image classification is extended to NBNN-based video classification, i.e., behavior recognition. What is computed is the joint-histogram-to-class distance rather than the video-to-class or video-to-video distance, as follows:

F \rightarrow c^*, \quad \text{where } c^* = \arg\min_c \sum_{i=1}^{m} \| h_i - NN_i^c(h_i) \|^2

where NN_i^c(h_i) denotes the nearest neighbor of h_i among the histograms of the i-th feature point in behavior class c, i.e., the histogram h'_i(c) closest to h_i, where h'_i(c) denotes a histogram of the i-th feature point in behavior class c.

Brief description of the drawings

Fig. 1 is a schematic diagram of erroneous joint points given by Microsoft Kinect;

Fig. 2 is a flow chart of the human behavior recognition method of the invention;

Fig. 3 is a schematic diagram of extremity feature detection based on geodesic distance;

Fig. 4 is a schematic diagram of human region labeling based on geodesic distance;

Fig. 5 is a schematic diagram of clustering-based pose feature extraction;

Fig. 6a is a schematic diagram of the global offset of the current frame;

Fig. 6b is a schematic diagram of the local offset of the current frame;

Fig. 7 is a diagram of the process of forming cluster centers and histograms from the global and local position offsets of the feature points;

Fig. 8 is a comparison of action recognition rates under different conditions;

Fig. 9 is a schematic diagram of the action classification results of the method of the invention;

Fig. 10 is a schematic diagram of the action classification results of the method of Lu et al. based on joint point features;

Fig. 11 is a schematic diagram of the action classification results of the method of the invention based on joint point features.

Detailed description

An embodiment of the invention provides a human behavior recognition method based on body-part clustering features. To avoid the inaccuracy of human joint position information, the cluster centers of the separate body parts are used as the feature points that characterize human pose. To exploit the global properties of the action sequence, the invention adds global position offsets to the sequence feature vector, compensating for the shortcomings of recognition based on local position offsets alone. On this basis, the key problems to be solved are: extraction of human pose features; computation of the feature vector of the human action sequence; and action classification.

The invention takes the depth image sequence of a moving human body as input and outputs the computed action class. The core of the computation is to construct a feature vector from the offsets of the spatial positions of the human pose features to describe a behavior sequence (including global and local offset information), and to classify the action on that basis.

Step 1, in the training stage, the body-part cluster feature points of every frame of each training video are first extracted by pose estimation, and the position offset of each feature point in each frame relative to the corresponding feature point in an earlier frame is then computed; the feature-point offset information of all training videos is then collected and clustered with the K-means clustering algorithm, the resulting cluster centers form the codebook, and the current training video is then represented, according to the codebook, by a set of histograms of the joint feature points;

Step 2, in the test stage, histograms are first built for a test video using the codebook formed in the training stage, and the behavior is then recognized by comparing the test-stage histograms with the training-stage histograms using the Naive-Bayes Nearest-Neighbor (NBNN) classification method, as shown in Fig. 2.

Step 1 comprises the following steps:

Step 1.1, human pose feature point extraction

In this stage, Kinect is used to capture depth data of a real human body, and the depth data is then converted into a point cloud.

As shown in Fig. 3, the positions of the body's extremity points (left and right hands, left and right feet, and head) are first located accurately; the extremity points are located using Dijkstra's algorithm on geodesic distance, with the geometric center of the body as the source point. The body is then divided into regions centered on the extremity points.
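The geodesic-distance computation can be sketched with Dijkstra's algorithm on a neighborhood graph of the body points. The text does not say how the graph is built or how successive extremity points are selected, so the adjacency list below and the rule of taking the farthest vertex as the first extremity candidate are assumptions made only for illustration.

```python
import heapq
from typing import Dict, List, Tuple

# vertex -> [(neighbor, edge length)]; a hypothetical body-surface graph
Graph = Dict[int, List[Tuple[int, float]]]


def geodesic_distances(graph: Graph, source: int) -> Dict[int, float]:
    """Dijkstra shortest-path lengths from `source` over the graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy graph: vertex 0 is the body center; 3 and 5 are the ends of two "limbs".
graph: Graph = {
    0: [(1, 1.0), (4, 1.0)],
    1: [(0, 1.0), (2, 1.0)],
    2: [(1, 1.0), (3, 1.0)],
    3: [(2, 1.0)],
    4: [(0, 1.0), (5, 1.0)],
    5: [(4, 1.0)],
}
dist = geodesic_distances(graph, source=0)
farthest = max(dist, key=dist.get)  # assumed rule: first extremity candidate
```

Because distances follow the graph rather than straight lines, a hand held close to the torso is still far from the body center in geodesic terms, which is what makes the measure useful for finding limb ends.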

As shown in Fig. 4, using geodesic distance as the classification criterion and the nearest-neighbor algorithm as the classification tool, the depth pixels of the human body are divided into six parts, namely head, left arm, right arm, left leg, right leg, and trunk. The body parts are classified according to formula (1):

\Omega_{i'} = \{ v \in V : \| v - e^{i'} \|_{geod} \le \min_{j' = 0, \ldots, 5} \| v - e^{j'} \|_{geod} \}, \quad i' = 0, \ldots, 5 \quad (1)

where Ω_{i'}, i' = 0, …, 5, denote the six classified body regions, corresponding to the head, left arm, right arm, left leg, right leg, and trunk; v denotes a pixel of the human body; e^{i'} denotes the i'-th extremity point, i.e., the left or right hand or the left or right foot, and for i' = 0, e^{i'} denotes the center point of the body; ‖v − e^{i'}‖_geod denotes the geodesic distance from pixel v to extremity point e^{i'}. Formula (1) states that the geodesic distance from every pixel in the i'-th part to the i'-th extremity point e^{i'} is no greater than its geodesic distance to any other extremity point.
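Formula (1) amounts to a nearest-anchor assignment. The sketch below assumes the geodesic distances from every pixel to every anchor point e^{i'} are already available as a table; the function name, the table layout, and the tie-breaking rule (lowest index wins) are illustrative assumptions, not details given in the text.

```python
from typing import List


def label_pixels(dist_to_anchor: List[List[float]]) -> List[int]:
    """dist_to_anchor[i][p] is the geodesic distance from pixel p to anchor i.

    Returns, for each pixel p, the index of the region it is assigned to,
    i.e. the anchor with the smallest geodesic distance.
    """
    n_pixels = len(dist_to_anchor[0])
    labels = []
    for p in range(n_pixels):
        column = [row[p] for row in dist_to_anchor]
        labels.append(column.index(min(column)))  # ties go to the lowest index
    return labels


# Toy example with 3 anchors and 4 pixels (the method itself uses 6 anchors).
dist_to_anchor = [
    [0.0, 1.0, 4.0, 5.0],
    [3.0, 0.5, 1.0, 5.0],
    [6.0, 2.0, 3.0, 0.2],
]
labels = label_pixels(dist_to_anchor)
```

Each pixel ends up in exactly one region, so the regions partition the body even though formula (1) uses a non-strict inequality.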

To characterize human pose effectively, the method uses a K-means-based region clustering algorithm to extract the per-part feature points of the human body, i.e., clustering is performed within each of the body-part regions obtained above, as shown in Fig. 5. In practice, when the number of cluster points (i.e., the number of feature points) m is too small the features lack expressiveness, and when it is too large the features become less regular. Following the commonly used 16-joint representation of the human body, the invention extracts m = 15 cluster feature points to characterize different human poses.

Step 1.2, computing the feature vector of the human action sequence

This comprises the following steps:

f(t) = \phi(t) = \{ \theta_1(t), \theta_2(t), \ldots, \theta_m(t) \}, \quad t \in \{1, 2, \ldots, n\} \quad (2)

The invention obtains the global offset information of the action sequence by computing the position offsets of the feature points between the current frame t and the first frame:

f_{i1}(t) = \theta_i(t) - \theta_i(1)

f_{i2}(t) = \theta_i(t) - \theta_i(t - \Delta t)

as shown in Fig. 6, where Δt is a time interval that balances the precision of the offset against robustness to noise. The larger Δt is, the more robust the offset is to noise but the lower its precision; conversely, a smaller Δt gives poorer robustness but higher precision. Its value depends on the particular action sequence database.

f_1(t) = [f_{11}(t), f_{21}(t), \ldots, f_{m1}(t)]

f_2(t) = [f_{12}(t), f_{22}(t), \ldots, f_{m2}(t)]

Step 1.2.2, obtaining the action sequence feature vector of a video: assume that all human feature points of each training video are represented by a set of offset information. The global offset vectors of every feature point of all collected videos are denoted R₁, i.e., R₁ collects the global offsets f_{i1}(t) of the i-th feature point in frame t of the j-th training video, over all i, t, and j. Likewise, the local offset vectors of every feature point of all collected videos are denoted R₂, which collects the local offsets f_{i2}(t) of the i-th feature point in frame t of the j-th training video. Let R = R₁ ∪ R₂. R is then clustered with the K-means algorithm, using Euclidean distance as the clustering metric, to form the codebook {b_k}, k = 1, 2, …, K, where each code word is the center of one cluster.
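Codebook formation can be sketched with a plain Lloyd-style K-means on the pooled offsets R. The text specifies only K-means with Euclidean distance; the initialization (passed in explicitly here), the iteration cap, and the handling of empty clusters are assumptions chosen to keep the example small and deterministic.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Vec]) -> Vec:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(data: List[Vec], init: List[Vec], iters: int = 20) -> List[Vec]:
    """Return the cluster centers (the code words b_k)."""
    centres = list(init)
    for _ in range(iters):
        clusters: List[List[Vec]] = [[] for _ in centres]
        for x in data:
            k = min(range(len(centres)), key=lambda j: sq_dist(x, centres[j]))
            clusters[k].append(x)
        # Assumption: an empty cluster keeps its previous center.
        new = [mean(c) if c else centres[j] for j, c in enumerate(clusters)]
        if new == centres:
            break  # assignments are stable
        centres = new
    return centres


R1 = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)]  # toy global offsets
R2 = [(4.0, 4.0, 4.0), (4.0, 4.2, 4.0)]  # toy local offsets
R = R1 + R2                              # R = R1 ∪ R2
codebook = kmeans(R, init=[R[0], R[2]])  # K = 2
```

Pooling the global and local offsets before clustering means both histogram halves are later expressed over the same K code words.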

For each training video F = {f(t)}, t = 1, 2, …, n, where n is the number of frames, the global offset vector f_{i1}(t) or local offset vector f_{i2}(t) of each human feature point i in each frame f(t) is assigned to the code word in the codebook {b_k} with the shortest Euclidean distance, i.e.,

[f_{i1}(t) \text{ or } f_{i2}(t)] \leftarrow \arg\min_{b_k} \| [f_{i1}(t) \text{ or } f_{i2}(t)] - b_k \|, \quad k \in \{1, 2, \ldots, K\}

Therefore, the motion of each feature point i in F, i.e., all position offsets f_{i1}(t) and f_{i2}(t) of feature point i in the video, can be further represented by a histogram h_i of code-word frequencies. It consists of h_i^1 and h_i^2, where h_i^1 is the global offset histogram of the i-th feature point and h_i^2 is the local offset histogram of the i-th feature point, i.e.,

h_i^1(k) = \frac{\#\{ b_k : f_{i1}(t) = b_k \}}{n}, \quad i = 1, 2, \ldots, m, \quad k = 1, 2, \ldots, K

h_i^2(k) = \frac{\#\{ b_k : f_{i2}(t) = b_k \}}{n}, \quad i = 1, 2, \ldots, m, \quad k = 1, 2, \ldots, K

h_i = [h_i^1, h_i^2]

where #{·} is a counting function. Finally, F can be represented by the set of histograms of all feature points, i.e., F = {h_i}, i = 1, 2, …, m, where h_i is the histogram of the i-th feature point, as shown in Fig. 7.
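Quantization against the codebook and the per-feature-point histogram can be sketched as below. The function names are hypothetical, and the toy data gives the local-offset list the same length as the global one for simplicity; in practice the local offsets of the earliest frames may be unavailable.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def nearest_codeword(x: Sequence[float], codebook: List[Vec]) -> int:
    """Index k of the code word with the smallest Euclidean distance to x."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])),
    )


def histogram(offsets: List[Vec], codebook: List[Vec], n_frames: int) -> List[float]:
    """Code-word frequencies of one feature point, divided by the frame count n."""
    counts = [0] * len(codebook)
    for x in offsets:
        counts[nearest_codeword(x, codebook)] += 1
    return [c / n_frames for c in counts]


codebook = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # K = 2 toy code words
global_i = [(0.1, 0.0, 0.0), (0.9, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
local_i = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.0, 0.1), (0.2, 0.0, 0.0)]
n = 4
# h_i is the concatenation of the global and local histograms of point i.
h_i = histogram(global_i, codebook, n) + histogram(local_i, codebook, n)
```

Keeping one histogram per feature point, rather than one per video, is what later lets the classifier compare each body part separately.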

In Step 2, the Naive-Bayes Nearest-Neighbor (NBNN) classifier is used to classify the action: given a video sequence F = {h_i}, i = 1, 2, …, m, represented by a set of feature-point histograms, where m is the number of feature points, it is usually easy to combine this set of histograms into a single histogram for classification, but doing so loses the spatial independence of the human feature points. The spatial information of the human feature points provides additional cues when distinguishing different behaviors, so the independence of the feature points must be fully taken into account.

The invention extends the original NBNN concept for image classification to NBNN-based video classification, i.e., behavior recognition. What is computed is the joint-histogram-to-class distance rather than the video-to-class or video-to-video distance, as follows:

F \rightarrow c^*, \quad \text{where } c^* = \arg\min_c \sum_{i=1}^{m} \| h_i - NN_i^c(h_i) \|^2 \quad (7)

where NN_i^c(h_i) denotes the nearest neighbor of h_i among the histograms of the i-th feature point in behavior class c, i.e., the histogram h'_i(c) closest to h_i, where h'_i(c) denotes a histogram of the i-th feature point in behavior class c.

Formula (7) states that, for an input test video sequence, the histogram of each feature point is obtained, and the differences between the histograms of the m feature points and those of each behavior class of the training videos are then computed; the behavior class c* with the smallest difference is taken as the behavior class of the current video F.
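The decision rule of formula (7) can be sketched as follows. The data layout (a dictionary from class label to a list of training videos, each a list of m histograms) and the function name are assumptions made for the example.

```python
from typing import Dict, List

Hist = List[float]
Video = List[Hist]  # one histogram per feature point


def sq_dist(a: Hist, b: Hist) -> float:
    """Squared Euclidean distance between two histograms."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nbnn_classify(test: Video, train: Dict[str, List[Video]]) -> str:
    """Return the class minimizing the summed per-feature-point NN distance."""
    best_label, best_cost = "", float("inf")
    for label, videos in train.items():
        cost = 0.0
        for i, h in enumerate(test):
            # Nearest training histogram of feature point i within this class;
            # different feature points may match different training videos.
            cost += min(sq_dist(h, v[i]) for v in videos)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


train = {
    "wave": [[[1.0, 0.0], [0.0, 1.0]], [[0.8, 0.2], [0.1, 0.9]]],
    "kick": [[[0.0, 1.0], [1.0, 0.0]]],
}
test_video = [[0.9, 0.1], [0.2, 0.8]]
predicted = nbnn_classify(test_video, train)
```

Because the nearest neighbor is searched independently for each feature point, the distance is measured from a joint histogram to a class rather than from one whole video to another.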

The above method has been applied to depth image sequences acquired with Kinect2 and achieved good experimental results. In the experiments we used 640 × 480 RGBD images captured indoors under fluorescent lighting. Six subjects were recorded, each performing 7 actions twice, for a total of 84 video sequences and 6343 frames. The actions include raising a hand, waving, squatting, kicking, bending over, swaying the body from side to side, and swinging the body.

In the experiments, the sequences of each action were split at random into a training set and a test set at a ratio of 2:1. Fifty random trials were run, giving an average recognition accuracy of 98.07%. On the same video sequences and under the same experimental conditions (the same number of trials and the same training-to-test ratio), the method of Lu et al. gave an average recognition rate of 95.00%. Applying the method of the invention to the joint points provided by Microsoft Kinect gave an average recognition rate of 96.43%, which shows the effectiveness of body-part cluster feature points as the basis for action classification.
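The evaluation protocol can be sketched as a repeated random 2:1 split per action. The sequence identifiers, the fixed seed, and the helper name are hypothetical; only the 7 actions, 12 sequences per action, 2:1 ratio, and 50 repetitions come from the text. The recognition accuracy would then be averaged over the 50 splits.

```python
import random
from typing import Dict, List, Tuple


def split_two_to_one(
    by_action: Dict[str, List[str]], rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Randomly split each action's sequences 2:1 into training and test sets."""
    train: List[str] = []
    test: List[str] = []
    for seqs in by_action.values():
        shuffled = seqs[:]
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3
        train += shuffled[:cut]
        test += shuffled[cut:]
    return train, test


# 7 actions x 12 sequences (6 subjects x 2 repetitions) = 84 sequences.
by_action = {f"action{a}": [f"a{a}_s{s}" for s in range(12)] for a in range(7)}
rng = random.Random(0)  # fixed seed, an assumption for reproducibility
splits = [split_two_to_one(by_action, rng) for _ in range(50)]
```

Splitting within each action keeps every action represented in both sets, with 56 training and 28 test sequences per trial.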

Figs. 8, 9, 10, and 11 and Table 1 compare the action classification results of the method of the invention, the method of Lu et al., and the present method applied to the joint points provided by Kinect. The proposed method achieves a high recognition rate on most actions.

In summary, the human action recognition and classification method based on body-part clustering features proposed by the invention has been verified to obtain good classification results.

Table 1. Recognition accuracy and recognition results under different conditions

Claims

1. A human behavior recognition method based on body-part clustering features, characterized by comprising:

Step 1, in the training stage, the body-part cluster feature points of every frame of each training video are first extracted by pose estimation, and the position offset of each feature point in each frame relative to the corresponding feature point in an earlier frame is then computed; the feature-point offset information of all training videos is then collected and clustered with the K-means clustering algorithm, the resulting cluster centers form the codebook, and the current training video is then represented, according to the codebook, by a set of histograms of the joint feature points;

Step 2, in the test stage, histograms are first built for a test video using the codebook formed in the training stage, and the behavior is then recognized by comparing the test-stage histograms with the training-stage histograms using the Naive-Bayes Nearest-Neighbor classification method.

2. The human behavior recognition method based on body-part clustering features as claimed in claim 1, characterized in that Step 1 comprises the following steps:

Step 1.1.1, the positions of the body's extremity points are first located accurately, and the body is then divided into regions centered on the extremity points; using geodesic distance as the classification criterion and the nearest-neighbor algorithm as the classification tool, the depth pixels of the human body are divided into six parts, namely head, left arm, right arm, left leg, right leg, and trunk; the body parts are classified according to the following formula:

\Omega_{i'} = \{ v \in V : \| v - e^{i'} \|_{geod} \le \min_{j' = 0, \ldots, 5} \| v - e^{j'} \|_{geod} \}, \quad i' = 0, \ldots, 5

where Ω_{i'}, i' = 0, …, 5, denote the six classified body regions, corresponding to the head, left arm, right arm, left leg, right leg, and trunk; v denotes a pixel of the human body; e^{i'} denotes the i'-th extremity point, i.e., the left or right hand or the left or right foot, and for i' = 0, e^{i'} denotes the center point of the body; ‖v − e^{i'}‖_geod denotes the geodesic distance from pixel v to extremity point e^{i'};

Step 1.1.2, a K-means-based region clustering algorithm is used to extract the per-part feature points of the human body, i.e., clustering is performed within each of the body-part regions obtained above around the extremity points; following the joint-based representation of the human body, the cluster feature points are extracted to characterize different human poses.

Step 1.2, computing the feature vector of the human action sequence

This comprises the following steps:

Step 1.2.1, computing the position offsets: for a video sequence F of n frames, the 3D coordinates f(t) of the m feature points of each frame can be obtained by human pose estimation:

f(t) = \phi(t) = \{ \theta_1(t), \theta_2(t), \ldots, \theta_m(t) \}, \quad t \in \{1, 2, \ldots, n\}

where θ_i(t) = (x_i(t), y_i(t), z_i(t)), i ∈ {1, 2, …, m}; θ_i(t) denotes the 3D coordinates of the i-th human feature point of f(t), and m denotes the number of feature points.

The global offset information of the action sequence is obtained by computing the position offsets of the feature points between the current frame t and the first frame:

f_{i1}(t) = \theta_i(t) - \theta_i(1)

The local offset information of the action sequence is obtained by computing the position offsets of the feature points between the current frame t and frame (t − Δt):

f_{i2}(t) = \theta_i(t) - \theta_i(t - \Delta t)

where Δt is a time interval.

Once the offset information of all human feature points in frame t has been obtained, the feature information of all feature points in frame t can be represented by two parts, the global offset information f₁(t) and the local offset information f₂(t), as follows:

f_1(t) = [f_{11}(t), f_{21}(t), \ldots, f_{m1}(t)]

f_2(t) = [f_{12}(t), f_{22}(t), \ldots, f_{m2}(t)]

Assume that all human feature points of each training video are represented by a set of offset information. The global offset vectors of every feature point of all collected videos are denoted R₁, i.e., R₁ collects the global offsets f_{i1}(t) of the i-th feature point in frame t of the j-th training video, over all i, t, and j. Likewise, the local offset vectors of every feature point of all collected videos are denoted R₂, which collects the local offsets f_{i2}(t) of the i-th feature point in frame t of the j-th training video. Let R = R₁ ∪ R₂. R is then clustered with the K-means algorithm, using Euclidean distance as the clustering metric, to form the codebook {b_k}, k = 1, 2, …, K, where the code words are the K cluster centers.

For each training video F = {f(t)}, t = 1, 2, …, n, where n is the number of frames, the global offset vector f_{i1}(t) or local offset vector f_{i2}(t) of each human feature point i in each frame f(t) is assigned to the code word in the codebook {b_k} with the shortest Euclidean distance, i.e.,

[f_{i1}(t) \text{ or } f_{i2}(t)] \leftarrow \arg\min_{b_k} \| [f_{i1}(t) \text{ or } f_{i2}(t)] - b_k \|, \quad k \in \{1, 2, \ldots, K\}

Therefore, the motion of each feature point i in F, i.e., all position offsets f_{i1}(t) and f_{i2}(t) of feature point i in the video, can be further represented by a histogram h_i of code-word frequencies. It consists of h_i^1 and h_i^2, where h_i^1 is the global offset histogram of the i-th feature point and h_i^2 is the local offset histogram of the i-th feature point, i.e.,

h_i^1(k) = \frac{\#\{ b_k : f_{i1}(t) = b_k \}}{n}, \quad i = 1, 2, \ldots, m, \quad k = 1, 2, \ldots, K

h_i^2(k) = \frac{\#\{ b_k : f_{i2}(t) = b_k \}}{n}, \quad i = 1, 2, \ldots, m, \quad k = 1, 2, \ldots, K

h_i = [h_i^1, h_i^2]

where #{·} is a counting function. Finally, F can be represented by the set of histograms of all feature points, i.e., F = {h_i}, i = 1, 2, …, m, where h_i is the histogram of the i-th feature point.

3. The human behavior recognition method based on body-part clustering features as claimed in claim 1, characterized in that in Step 2 the Naive-Bayes Nearest-Neighbor (NBNN) classifier is used to classify the action: given a video sequence F = {h_i}, i = 1, 2, …, m, represented by a set of feature-point histograms, where m is the number of feature points;

the original NBNN concept for image classification is extended to NBNN-based video classification, i.e., behavior recognition, and what is computed is the joint-histogram-to-class distance rather than the video-to-class or video-to-video distance, as follows:

F \rightarrow c^*, \quad \text{where } c^* = \arg\min_c \sum_{i=1}^{m} \| h_i - NN_i^c(h_i) \|^2