CN105590100A - Discrimination supervoxel-based human movement identification method - Google Patents

Discrimination supervoxel-based human movement identification method

Info

Publication number
CN105590100A
CN105590100A (application CN201510977414.4A)
Authority
CN
China
Prior art keywords
video
super voxel
feature
segment
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510977414.4A
Other languages
Chinese (zh)
Other versions
CN105590100B (en)
Inventor
段立娟 (Duan Lijuan)
郭亚楠 (Guo Yanan)
马伟 (Ma Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510977414.4A priority Critical patent/CN105590100B/en
Publication of CN105590100A publication Critical patent/CN105590100A/en
Application granted granted Critical
Publication of CN105590100B publication Critical patent/CN105590100B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Provided is a discriminative supervoxel-based human action recognition method, comprising: using an unsupervised method to automatically extract, from action videos of the same category, a set of video supervoxel features that differ from those of other categories and can represent the characteristics of the category; then describing those supervoxels with features; and finally recognizing the ongoing action, thereby identifying the category of the human action in a video more accurately. By jointly referring to video supervoxel features and image HOG features, the method extracts the discriminative supervoxels in a video through an iterative process of training and learning, and thus recognizes actions more accurately. Compared with traditional methods, the method can automatically extract the effective parts of a video, which include not only the more discriminative parts of the human body but also the background parts that characterize the action category.

Description

Human action recognition method based on discriminative supervoxels
Technical field
The present invention relates to feature extraction and machine learning methods in video processing, and in particular to a human action recognition method based on discriminative supervoxels.
Background technology
In recent years, with the rapid development of the Internet and multimedia technology, video has become an important channel through which people obtain information and a carrier of massive digital information. Although computer technology has also advanced significantly in recent years, automatically analyzing video content with a computer remains a major challenge in the multimedia field. When the human brain receives visual information, it can analyze it rapidly using knowledge accumulated through a lifetime of subtle learning and common sense about the world, whereas a computer can only perform video analysis by receiving digital information and carrying out numerical computation; it lacks an intelligent process, and is therefore slow and inaccurate.
Video-based human action recognition has a wide range of applications in human-computer interaction, intelligent surveillance, video content analysis, and related areas, and has become a popular research direction in recent years, with many results. The action recognition task nevertheless faces many challenges. First, because of the many degrees of freedom of human motion, the appearance of an action always varies, whether within the same action or across different actions. Even videos showing the same action differ greatly, owing to differences in body posture, movement speed, gait, and so on among different people. An ideal human action recognition algorithm should adapt to variation within the same action while still distinguishing different action categories. Second, the shooting environments and camera settings of videos differ. For example, footage of a person against a complex, moving background may be harder to recognize; changes in color tone are another common variable in recording setups; and videos captured from different viewpoints make action recognition even more challenging. To overcome the problem of distinguishing ambiguous actions, this method seeks an approach that can be applied in the real world and recognize actions accurately in complex shooting environments.
This document introduces a method based on a mid-level feature representation. By extracting discriminative supervoxels, it effectively distinguishes different actions and automatically determines which parts of the video background can help action recognition. A video is first over-segmented; discriminative blocks of the video frames are extracted through a training procedure; and taking the overlap of these blocks with the over-segmentation result yields the discriminative supervoxels. Trajectory features are then extracted from the supervoxels to describe them. Finally, the video feature is built with a bag-of-words (BoW) framework.
Summary of the invention
To address the above problems of the prior art, the present invention proposes a human action recognition method based on discriminative supervoxels. An unsupervised method automatically extracts, from action videos of the same category, a set of video supervoxel features that differ from those of other categories and can characterize the category. These supervoxels are then described with features, and finally the ongoing action is recognized, so that the category of the human action in a video can be identified more accurately.
A human action recognition method based on discriminative supervoxels comprises the following steps.
For training videos, perform the following steps:
Step 1: over-segment the input video to obtain its supervoxels.
Step 2: extract key frames from the input video.
Step 3: extract discriminative patches from the images obtained in step 2.
Step 4: compute the overlap between the positions in the video of the discriminative patches obtained in step 3 and the supervoxels obtained in step 1.
Step 5: describe the video supervoxels using pixel motion-trajectory features and a bag-of-words (BoW) model.
Step 6: using the discriminative supervoxels as a dictionary, obtain the video feature by the BoW method.
Step 7: obtain a classification model with an SVM classifier.
For a video to be recognized, perform the following steps:
Step 8: input the video to be recognized and perform steps 1, 2, 5, and 6 on it to obtain its feature representation.
Step 9: feed the feature of the video to be recognized into the SVM classifier to obtain the recognition result.
The method of the present invention has the following advantages:
1. The present invention jointly refers to video supervoxel features (obtained by computing pixel motion trajectories and color differences) and image HOG features, two kinds of cues at different scales, and extracts the discriminative supervoxels in a video through an iterative process of training and learning, so that an action can be recognized more accurately.
2. Compared with conventional methods, the present invention can automatically extract the effective parts of a video, which include not only the more discriminative parts of the human body but also the background parts that characterize the action category (such as the hoop in a basketball-shooting action).
Brief description of the drawings
Fig. 1 is the first flowchart of the method of the present invention;
Fig. 2 is the second flowchart of the method of the present invention.
Detailed description of the invention
The present invention is described further below with reference to specific embodiments.
The flow of the method of the invention, shown in Fig. 1, comprises the following steps:
Step 1: over-segment the input video.
1.1 Input a video segment, assuming each input video frame is a 3-channel color image I with width W and height H.
Over-segment this video to obtain its supervoxels.
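The patent does not name a particular over-segmentation algorithm for step 1. As a heavily simplified stand-in, the sketch below partitions a (frames, H, W) clip into a regular grid of spatio-temporal blocks and labels each block; a real supervoxel method (e.g. a graph-based streaming video segmentation) would replace this. The function name and block sizes are illustrative, not the patent's.

```python
import numpy as np

def grid_supervoxels(n_frames, H, W, t=5, s=16):
    """Return an (n_frames, H, W) integer label volume whose connected
    regions are t x s x s spatio-temporal blocks (a toy 'supervoxel' map)."""
    f = np.arange(n_frames)[:, None, None] // t   # temporal block index
    r = np.arange(H)[None, :, None] // s          # row block index
    c = np.arange(W)[None, None, :] // s          # column block index
    nr, nc = -(-H // s), -(-W // s)               # blocks per row / column
    return (f * nr * nc + r * nc + c).astype(int)

labels = grid_supervoxels(10, 32, 32)
print(labels.shape, labels.max() + 1)   # (10, 32, 32) and 2*2*2 = 8 labels
```

Each distinct label plays the role of one supervoxel DS_k in the later steps.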
Step 2: extract key frames from the input video.
Key frames are extracted by taking one frame every 10 frames.
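The key-frame rule of step 2 can be sketched in a few lines (the helper name is illustrative; frames stand in for decoded images):

```python
def extract_keyframes(frames, step=10):
    """Return every `step`-th frame, starting with the first."""
    return frames[::step]

# Example: a 45-frame clip yields the frames at indices 0, 10, 20, 30, 40.
video = list(range(45))            # stand-in for a list of decoded frames
keys = extract_keyframes(video)
print(keys)                        # [0, 10, 20, 30, 40]
```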
Step 3: extract discriminative patches from the images obtained in step 2.
3.1 The training images are divided into two groups, D and N, where D contains the training images of one action class and N contains the training images of the other actions in the video set. D and N are each further divided into two halves: D1, D2 and N1, N2.
3.2 For all images in D1 and N1, proceed as follows:
3.2.1 First sample patches from the image. An N*M image is down-sampled twice, so each image appears at three pyramid levels. At all three levels, k*k patches (set to 60*60 in this method) are sampled with overlap, and traditional HOG features are extracted from these patches.
3.2.2 The patches extracted from D1 are processed as follows:
The patches are randomly sampled, and the sampled patches are de-duplicated (if the difference between two patches is below a certain threshold, one of them is removed). The number of remaining patches divided by 10 gives the number of clusters for the following step. The patches are clustered with the k-means method, clusters containing 3 or fewer elements are removed, the elements of each class are denoted P(i), and each remaining class is assigned an SVM classifier.
3.2.3 With the elements P(i) of a class as positive examples and the patches of N1 as negative examples, an SVM is trained. The patches extracted from D2 are then fed into each classifier as test samples, and the t highest-scoring samples are added to the class's original elements P(i). Next, D1 and D2 are exchanged, N1 and N2 are exchanged, and the operations of 3.2.3 are repeated until iteration yields the final SVM models. This method uses 6 iterations.
3.2.4 The patches obtained in 3.2.1 are tested on the SVMs of this class; if a patch's score on some SVM is above a certain threshold, the patch is judged to be discriminative.
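The iterative mining loop of 3.2.2-3.2.3 can be sketched as below. This is a hedged illustration, not the patent's implementation: scikit-learn's KMeans and LinearSVC are assumed as the clustering and SVM components, the synthetic Gaussian "patches" stand in for real HOG vectors, and all names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def mine_discriminative_clusters(D1, D2, N1, N2, n_iters=6, top_t=3):
    # Cluster the first half of the target-class patches; the patent sets
    # the number of clusters to (#remaining patches) / 10.
    n_clusters = max(1, len(D1) // 10)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(D1)
    clusters = [D1[km.labels_ == c] for c in range(n_clusters)]
    clusters = [P for P in clusters if len(P) > 3]   # drop clusters of <= 3
    svms = []
    for _ in range(n_iters):
        svms = []
        for i, P in enumerate(clusters):
            # One linear SVM per cluster: P(i) positive, other-class negative.
            X = np.vstack([P, N1])
            y = np.r_[np.ones(len(P)), np.zeros(len(N1))]
            svm = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
            # Absorb the top-t scoring held-out patches into the cluster.
            top = D2[np.argsort(svm.decision_function(D2))[::-1][:top_t]]
            clusters[i] = np.vstack([P, top])
            svms.append(svm)
        # Swap the two halves between rounds, as in 3.2.3.
        D1, D2, N1, N2 = D2, D1, N2, N1
    return clusters, svms

# Toy demo: two well-separated Gaussian blobs stand in for HOG patches.
rng = np.random.default_rng(0)
pos = rng.normal(2.0, 0.5, size=(40, 8))    # target-class "patches"
neg = rng.normal(-2.0, 0.5, size=(40, 8))   # other-class "patches"
clusters, svms = mine_discriminative_clusters(
    pos[:20], pos[20:], neg[:20], neg[20:], n_iters=2)
```

After the loop, the SVM scores of 3.2.4 decide which patches count as discriminative.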
Step 4: compute the overlap between the positions in the video of the discriminative patches obtained in step 3 and the supervoxels obtained in step 1, according to the following formula:
f(DS_k) = \begin{cases} 1, & \dfrac{S\left(\left(\bigcup_{i,j} P_{ij}\right) \cap DS_k\right)}{S(DS_k)} > T \\ 0, & \text{otherwise} \end{cases}
where F_i is the i-th key frame of video V, P_{ij} is the j-th discriminative patch in F_i, DS_k is the k-th supervoxel of the video, S(·) denotes the number of pixels in a region, and T is the overlap threshold set by this method.
At this point, the discriminative supervoxels have been obtained.
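The overlap test of step 4 is a pixel-count ratio between boolean masks. The sketch below shows it on toy masks (the function name is illustrative; real masks would come from steps 1 and 3):

```python
import numpy as np

def is_discriminative(patch_union_mask, supervoxel_mask, T=0.5):
    """f = 1 iff S(union(P_ij) ∩ DS_k) / S(DS_k) > T, S(.) counting pixels."""
    inter = np.logical_and(patch_union_mask, supervoxel_mask).sum()
    return inter / supervoxel_mask.sum() > T

sv = np.zeros((4, 4), bool); sv[:2, :] = True           # 8-pixel supervoxel
patches = np.zeros((4, 4), bool); patches[0, :] = True  # covers 4 of its pixels
print(is_discriminative(patches, sv, T=0.4))            # 4/8 = 0.5 > 0.4 -> True
```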
Step 5: describe the video supervoxels using pixel motion-trajectory features and BoW.
5.1 Pixels are tracked with a tracking tool; the tracking length is 15 frames, yielding a set of motion trajectories of length 15 frames.
5.2 The trajectories are described as follows.
Each trajectory description has four parts, 426 dimensions in total.
Dimensions 1-30 (30 dimensions) represent the motion direction of a pixel:
S' = \frac{(\Delta P_t, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \lVert \Delta P_j \rVert}
where \Delta P_t = P_{t+1} - P_t = (x_{t+1} - x_t,\; y_{t+1} - y_t),
t denotes the t-th frame, L is 15, and x_t, y_t are the x- and y-coordinates of the pixel in frame t.
The remaining features are obtained by first constructing a spatio-temporal block, as follows:
For the pixel of each frame of the trajectory, take the square of side N (N = 32) centered at the pixel position, obtaining a block with an N*N cross-section and length L. This block is divided into a*a*b small blocks, where a = 2 and b = 3, yielding 12 small blocks. Conventional HOG, HOF, MBHx, and MBHy features are extracted from each of the 12 small blocks and concatenated, giving the features of dimensions 31-426, as follows:
Dimensions 31-126 (96 dimensions, 8*2*2*3): the HOG feature.
Dimensions 127-234 (108 dimensions, 9*2*2*3): the HOF feature.
Dimensions 235-330 (96 dimensions, 8*2*2*3): the MBHx feature.
Dimensions 331-426 (96 dimensions, 8*2*2*3): the MBHy feature.
This yields the motion feature of each trajectory.
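The 30-dimensional trajectory-shape block (dimensions 1-30) is the sequence of frame-to-frame displacements, normalized by the total displacement magnitude, as in the formula above. A minimal numeric sketch (the function name is illustrative):

```python
import numpy as np

def trajectory_shape(points):
    """points: (L+1, 2) array of (x, y) positions over L+1 frames.
    Returns the 2L-dim descriptor S' = (dP_t, ..., dP_{t+L-1}) / sum ||dP_j||."""
    disp = np.diff(points, axis=0)               # (L, 2) displacement vectors
    norm = np.linalg.norm(disp, axis=1).sum()    # total displacement magnitude
    return (disp / norm).ravel()

# A straight horizontal track of L = 15 displacements (16 points).
pts = np.column_stack([np.arange(16.0), np.zeros(16)])
desc = trajectory_shape(pts)
print(desc.shape)        # (30,) -- the 1-30 dim block of the 426-dim feature
```

Note that L = 15 displacements give 30 dimensions, matching the "30 dimensions for L = 15" count in the text.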
5.3 All trajectories are clustered with the k-means algorithm into C1 classes, yielding the trajectory dictionary codebook1 after clustering.
5.4 Trajectories are used to represent the supervoxels.
5.4.1 For each discriminative supervoxel obtained above, find the trajectories that fall on it. Specifically, traverse each trajectory; if 7 or more of its pixels lie inside the supervoxel, the trajectory is judged to fall in that supervoxel. For supervoxels shorter than 7 frames, all trajectories passing through them are considered to fall inside.
5.4.2 For each supervoxel, a BoW statistic over the trajectories falling in it is computed with codebook1 as the dictionary; the resulting histogram is the feature of the supervoxel.
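The BoW statistic of 5.4.2 (and of step 6, which applies the same idea at the video level) assigns each descriptor to its nearest codeword and counts assignments. A toy sketch with 2-D "descriptors" and a 2-word codebook (all data illustrative):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword; return the
    normalized count histogram over the codebook."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)                       # nearest-codeword index
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])    # stand-in for codebook1
tracks = np.array([[0.1, -0.2], [9.8, 10.1], [10.2, 9.9]])
print(bow_histogram(tracks, codebook))             # counts 1/3 and 2/3
```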
Step 6: using the discriminative supervoxels as a codebook, obtain the video feature by the BoW method.
All discriminative supervoxels are clustered with k-means; the resulting dictionary is codebook2. For a training video, its supervoxels are extracted and a BoW statistic over codebook2 is computed; the resulting histogram is the feature of the video.
Step 7: feed the video features of the training videos into an SVM classifier for training, obtaining a multi-class classification model.
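Steps 7-9 reduce to training and querying a multi-class SVM on video-level BoW features. The sketch below uses scikit-learn's LinearSVC on toy data as one plausible choice; the patent does not name a specific SVM implementation, and the feature vectors here are synthetic stand-ins.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
# 3 toy action classes, 20 synthetic 16-dim "BoW features" per class.
X = np.vstack([rng.normal(c, 0.3, size=(20, 16)) for c in range(3)])
y = np.repeat(np.arange(3), 20)

clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)   # step 7: train the model
query = rng.normal(2.0, 0.3, size=(1, 16))        # step 8: a "video to recognize"
print(clf.predict(query))                          # step 9: likely class 2
```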
For a video to be recognized, perform the following steps:
Step 8: input the video to be recognized and perform steps 1, 2, 5, and 6 on it to obtain its feature representation.
Step 9: feed the feature of the video to be recognized into the SVM classifier to obtain the recognition result.
To test the recognition performance of the present invention, the method was applied to a commonly used human action recognition benchmark, the YouTube dataset. This video library contains 1600 videos in 11 classes: basketball shooting, cycling, diving, golf swinging, horse riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. Each video is between 3 and 20 seconds long.

Claims (2)

1. A human action recognition method based on discriminative supervoxels, characterized in that the method performs the following steps for training videos:
Step 1: over-segment the input video to obtain its supervoxels;
Step 2: extract key frames from the input video;
Step 3: extract discriminative patches from the images obtained in step 2;
Step 4: compute the overlap between the positions in the video of the discriminative patches obtained in step 3 and the supervoxels obtained in step 1;
Step 5: describe the video supervoxels using pixel motion-trajectory features and a bag-of-words (BoW) model;
Step 6: using the discriminative supervoxels as a dictionary, obtain the video feature by the BoW method;
Step 7: obtain a classification model with an SVM classifier;
and performs the following steps for a video to be recognized:
Step 8: input the video to be recognized and perform steps 1, 2, 5, and 6 on it to obtain its feature representation;
Step 9: feed the feature of the video to be recognized into the SVM classifier to obtain the recognition result.
2. The human action recognition method based on discriminative supervoxels according to claim 1, characterized in that the flow of the method comprises the following steps:
Step 1: over-segment the input video;
1.1 input a video segment, assuming each input video frame is a 3-channel color image I with width W and height H;
over-segment this video to obtain its supervoxels;
Step 2: extract key frames from the input video;
key frames are extracted by taking one frame every 10 frames;
Step 3: extract discriminative patches from the images obtained in step 2;
3.1 the training images are divided into two groups, D and N, where D contains the training images of one action class and N contains the training images of the other actions in the video set; D and N are each further divided into two halves: D1, D2 and N1, N2;
3.2 for all images in D1 and N1, proceed as follows:
3.2.1 first sample patches from the image; an N*M image is down-sampled twice, so each image appears at three pyramid levels; at all three levels, k*k patches (set to 60*60 in this method) are sampled with overlap, and traditional HOG features are extracted from these patches;
3.2.2 the patches extracted from D1 are processed as follows:
the patches are randomly sampled, and the sampled patches are de-duplicated (if the difference between two patches is below a certain threshold, one of them is removed); the number of remaining patches divided by 10 gives the number of clusters for the following step; the patches are clustered with the k-means method, clusters containing 3 or fewer elements are removed, the elements of each class are denoted P(i), and each remaining class is assigned an SVM classifier;
3.2.3 with the elements P(i) of a class as positive examples and the patches of N1 as negative examples, an SVM is trained; the patches extracted from D2 are then fed into each classifier as test samples, and the t highest-scoring samples are added to the class's original elements P(i); next, D1 and D2 are exchanged, N1 and N2 are exchanged, and the operations of 3.2.3 are repeated until iteration yields the final SVM models; this method uses 6 iterations;
3.2.4 the patches obtained in 3.2.1 are tested on the SVMs of this class; if a patch's score on some SVM is above a certain threshold, the patch is judged to be discriminative;
Step 4: compute the overlap between the positions in the video of the discriminative patches obtained in step 3 and the supervoxels obtained in step 1, according to the following formula:
f(DS_k) = \begin{cases} 1, & \dfrac{S\left(\left(\bigcup_{i,j} P_{ij}\right) \cap DS_k\right)}{S(DS_k)} > T \\ 0, & \text{otherwise} \end{cases}
where F_i is the i-th key frame of video V, P_{ij} is the j-th discriminative patch in F_i, DS_k is the k-th supervoxel of the video, S(·) denotes the number of pixels in a region, and T is the overlap threshold set by this method;
at this point, the discriminative supervoxels have been obtained;
Step 5: describe the video supervoxels using pixel motion-trajectory features and BoW;
5.1 pixels are tracked with a tracking tool; the tracking length is 15 frames, yielding a set of motion trajectories of length 15 frames;
5.2 the trajectories are described as follows:
each trajectory description has four parts, 426 dimensions in total;
dimensions 1-30 (30 dimensions) represent the motion direction of a pixel:
S' = \frac{(\Delta P_t, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \lVert \Delta P_j \rVert}
where \Delta P_t = P_{t+1} - P_t = (x_{t+1} - x_t,\; y_{t+1} - y_t),
t denotes the t-th frame, L is 15, and x_t, y_t are the x- and y-coordinates of the pixel in frame t;
the remaining features are obtained by first constructing a spatio-temporal block, as follows:
for the pixel of each frame of the trajectory, take the square of side N (N = 32) centered at the pixel position, obtaining a block with an N*N cross-section and length L; this block is divided into a*a*b small blocks, where a = 2 and b = 3, yielding 12 small blocks; conventional HOG, HOF, MBHx, and MBHy features are extracted from each of the 12 small blocks and concatenated, giving the features of dimensions 31-426, as follows:
dimensions 31-126 (96 dimensions, 8*2*2*3): the HOG feature;
dimensions 127-234 (108 dimensions, 9*2*2*3): the HOF feature;
dimensions 235-330 (96 dimensions, 8*2*2*3): the MBHx feature;
dimensions 331-426 (96 dimensions, 8*2*2*3): the MBHy feature;
this yields the motion feature of each trajectory;
5.3 all trajectories are clustered with the k-means algorithm into C1 classes, yielding the trajectory dictionary codebook1 after clustering;
5.4 trajectories are used to represent the supervoxels;
5.4.1 for each discriminative supervoxel obtained above, find the trajectories that fall on it; specifically, traverse each trajectory, and if 7 or more of its pixels lie inside the supervoxel, the trajectory is judged to fall in that supervoxel; for supervoxels shorter than 7 frames, all trajectories passing through them are considered to fall inside;
5.4.2 for each supervoxel, a BoW statistic over the trajectories falling in it is computed with codebook1 as the dictionary; the resulting histogram is the feature of the supervoxel;
Step 6: using the discriminative supervoxels as a codebook, obtain the video feature by the BoW method;
all discriminative supervoxels are clustered with k-means, the resulting dictionary being codebook2; for a training video, its supervoxels are extracted and a BoW statistic over codebook2 is computed, the resulting histogram being the feature of the video;
Step 7: feed the video features of the training videos into an SVM classifier for training, obtaining a multi-class classification model;
for a video to be recognized, perform the following steps:
Step 8: input the video to be recognized and perform steps 1, 2, 5, and 6 on it to obtain its feature representation;
Step 9: feed the feature of the video to be recognized into the SVM classifier to obtain the recognition result.
CN201510977414.4A 2015-12-23 2015-12-23 Human action recognition method based on discriminative supervoxels Active CN105590100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510977414.4A CN105590100B (en) 2015-12-23 2015-12-23 Human action recognition method based on discriminative supervoxels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510977414.4A CN105590100B (en) 2015-12-23 2015-12-23 Human action recognition method based on discriminative supervoxels

Publications (2)

Publication Number Publication Date
CN105590100A true CN105590100A (en) 2016-05-18
CN105590100B CN105590100B (en) 2018-11-13

Family

ID=55929670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510977414.4A Active CN105590100B (en) 2015-12-23 2015-12-23 Surpass the human motion recognition method of voxel based on identification

Country Status (1)

Country Link
CN (1) CN105590100B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570480A (en) * 2016-11-07 2017-04-19 南京邮电大学 Posture-recognition-based method for human movement classification
CN110622214A (en) * 2017-07-11 2019-12-27 索尼公司 Fast progressive method for spatio-temporal video segmentation based on hyper-voxels

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104134217A (en) * 2014-07-29 2014-11-05 中国科学院自动化研究所 Video salient object segmentation method based on super voxel graph cut
CN104361581A (en) * 2014-10-22 2015-02-18 北京航空航天大学 CT (computed tomography) scanning data partitioning method based on combination of user interaction and volume rendering

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104134217A (en) * 2014-07-29 2014-11-05 中国科学院自动化研究所 Video salient object segmentation method based on super voxel graph cut
CN104361581A (en) * 2014-10-22 2015-02-18 北京航空航天大学 CT (computed tomography) scanning data partitioning method based on combination of user interaction and volume rendering

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CHENLIANG XU et al.: "Evaluation of super-voxel methods for early video processing", Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on *
K. SOOMRO et al.: "Action Localization in Videos through Context Walk", IEEE International Conference on Computer Vision *
KONG Fanshu et al.: "3D reconstruction algorithm based on isosurface topological simplification", Journal of Yanshan University *
LIANG Yuling: "Research on supervoxel-based video segmentation technology", China Master's Theses Full-text Database, Information Science and Technology *
SHEN Yinghua et al.: "Point cloud registration algorithm based on normal feature histograms", Optics and Precision Engineering *
SU Po et al.: "Superpixel-based multimodal MRI brain glioma segmentation", Journal of Northwestern Polytechnical University *
LU Yong: "Research on key technologies of video detection of abnormal behavior in examination rooms", China Master's Theses Full-text Database, Information Science and Technology *
LU Guiliang: "Research on semantic segmentation modeling of 3D point cloud scenes", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570480A (en) * 2016-11-07 2017-04-19 南京邮电大学 Posture-recognition-based method for human movement classification
CN106570480B (en) * 2016-11-07 2019-04-19 南京邮电大学 Human action classification method based on posture recognition
CN110622214A (en) * 2017-07-11 2019-12-27 索尼公司 Fast progressive method for spatio-temporal video segmentation based on hyper-voxels
CN110622214B (en) * 2017-07-11 2023-05-30 索尼公司 Rapid progressive method for space-time video segmentation based on super-voxels

Also Published As

Publication number Publication date
CN105590100B (en) 2018-11-13

Similar Documents

Publication Publication Date Title
Huang et al. Tracknet: A deep learning network for tracking high-speed and tiny objects in sports applications
CN106778854B (en) Behavior identification method based on trajectory and convolutional neural network feature extraction
CN108229338B (en) Video behavior identification method based on deep convolution characteristics
Yuan et al. Temporal action localization by structured maximal sums
Wang et al. Hierarchical attention network for action recognition in videos
Zhao et al. Temporal action detection with structured segment networks
CN108288015B (en) Human body action recognition method and system in video based on time scale invariance
CN108171112A (en) Vehicle identification and tracking based on convolutional neural networks
CN103605986A (en) Human motion recognition method based on local features
CN107092349A (en) A kind of sign Language Recognition and method based on RealSense
Rangasamy et al. Deep learning in sport video analysis: a review
CN106529477A (en) Video human behavior recognition method based on significant trajectory and time-space evolution information
Yu et al. Weakly semantic guided action recognition
CN105512618A (en) Video tracking method
CN104200218B (en) A kind of across visual angle action identification method and system based on timing information
CN106295532A (en) A kind of human motion recognition method in video image
CN104021381A (en) Human movement recognition method based on multistage characteristics
CN105469050A (en) Video behavior identification method based on local space-time characteristic description and pyramid vocabulary tree
Kindiroglu et al. Temporal accumulative features for sign language recognition
CN103020614A (en) Human movement identification method based on spatio-temporal interest point detection
CN105844204A (en) Method and device for recognizing behavior of human body
CN105590100A (en) Discrimination supervoxel-based human movement identification method
CN103077383A (en) Method for identifying human body movement of parts based on spatial and temporal gradient characteristics
Lan et al. Learning action primitives for multi-level video event understanding
Liu et al. Research on action recognition of player in broadcast sports video

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20160518

Assignee: LUOYANG YAHUI EXOSKELETON POWER-ASSISTED TECHNOLOGY CO.,LTD.

Assignor: Beijing University of Technology

Contract record no.: X2024980000190

Denomination of invention: A Method for Human Action Recognition Based on Discriminant Hypervoxels

Granted publication date: 20181113

License type: Common License

Record date: 20240105

Application publication date: 20160518

Assignee: Henan zhuodoo Information Technology Co.,Ltd.

Assignor: Beijing University of Technology

Contract record no.: X2024980000138

Denomination of invention: A Method for Human Action Recognition Based on Discriminant Hypervoxels

Granted publication date: 20181113

License type: Common License

Record date: 20240104

Application publication date: 20160518

Assignee: Luoyang Lexiang Network Technology Co.,Ltd.

Assignor: Beijing University of Technology

Contract record no.: X2024980000083

Denomination of invention: A Method for Human Action Recognition Based on Discriminant Hypervoxels

Granted publication date: 20181113

License type: Common License

Record date: 20240104