CN105893967B - Human behavior classification detection method and system based on time sequence retention space-time characteristics - Google Patents

Human behavior classification detection method and system based on time sequence retention space-time characteristics

Info

Publication number
CN105893967B
CN105893967B (application CN201610201446.XA)
Authority
CN
China
Prior art keywords
time
video
space
features
interest points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610201446.XA
Other languages
Chinese (zh)
Other versions
CN105893967A (en)
Inventor
Hong Liu (刘宏)
Mengyuan Liu (刘梦源)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Gandong Smart Technology Co ltd
Original Assignee
Shenzhen Gandong Smart Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Gandong Smart Technology Co ltd filed Critical Shenzhen Gandong Smart Technology Co ltd
Priority to CN201610201446.XA
Publication of CN105893967A
Application granted
Publication of CN105893967B
Expired - Fee Related (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human behavior classification detection method and system based on time-sequence-preserving spatio-temporal features. In a system consisting of a video input end, a time-sequence feature extraction output end and an offline training classifier, the method comprises the following steps: 1) detect the human target in a video sequence; 2) extract spatio-temporal interest points from the spatio-temporal domain containing the human target, and cluster them into K categories using a K-means clustering method; 3) for pairs of spatio-temporal interest points, compute the time-axis distribution features of the time-sequence-preserving spatio-temporal features; 4) fuse the time-sequence features with the bag-of-interest-points features by weighting; 5) train human behavior templates using a bag-of-words model and a classifier, and perform classification. By describing the temporal-order relationships among feature points of the same category to build the human behavior model, the invention effectively improves the discrimination between different human behaviors.

Description

Human behavior classification detection method and system based on time sequence retention space-time characteristics
Technical Field
The invention belongs to the fields of object recognition and intelligent human-computer interaction in machine vision, and particularly relates to a robust human behavior classification detection method based on time-sequence-preserving spatio-temporal features.
Background
Human behavior analysis includes human behavior detection, human behavior classification, abnormal behavior analysis and the like. According to the number of human bodies involved, it can be divided into single-person behavior analysis, multi-person behavior analysis and group behavior analysis. The detection, tracking and identification of human bodies all belong to the category of behavior analysis. The behavior analysis discussed here refers to human behavior classification: for a given video sequence containing a certain motion, the video sequence is labeled with the category of that motion. Human behavior analysis began as early as the 1930s. Early successful research focused primarily on rigid-body motion. Around the 1950s, research on non-rigid bodies gradually developed. Human motion analysis in particular has broad application prospects in fields such as intelligent video surveillance, web video retrieval, assisted healthcare and sports video analysis. For example, in virtual reality, the posture of a user in the real physical space is analyzed and understood; in human-computer interaction, a computer or robot can use visual information to interact with people more effectively; in training such as dance and gymnastics, the movements of a practitioner are guided and corrected by analyzing the movements of the joints.
In real-world scenarios, human behavior classification faces a number of difficulties. The performers of human motions are often of different ages and appearances, and the speed of motion and the degree of spatio-temporal variation differ from person to person; at the same time, different motions can look very similar (inter-class similarity), which is a hard problem on top of the intra-class diversity just described. Human behavior classification also faces many classic difficulties of image processing, such as occlusion of the human body, shadows in outdoor scenes, illumination changes and crowd congestion. Facing these difficulties, achieving robust human behavior classification for intelligent surveillance in real scenes has important research significance. We focus on how to describe human behavior in a video sequence, in other words, the process of extracting feature vectors from the video to represent the original video. The feature vector should have the following characteristics: first, the extraction process should be as efficient as possible to meet real-time requirements; second, the vector dimension should be as low as possible to improve classification efficiency; finally, the vector should be representative and robust, with good discrimination against inter-class similarity and good tolerance of intra-class diversity.
In view of the above requirements, human behavior description methods can be divided into two major categories: global features and local features. Global feature extraction is a top-down process: the human behavior is treated as a whole, from which the motion description is extracted. Global features are strong features that can encode most of the information of the motion. However, they are extremely sensitive to view angle, occlusion and noise, and their extraction presupposes that the motion foreground can be segmented well, which places very demanding requirements on the preprocessing needed for human behavior description in complex scenes. Considering these shortcomings of global features, local features have been proposed for describing human behavior in complex scenes. Local feature extraction is a bottom-up process: first, spatio-temporal interest points are detected; then local texture blocks around the interest points are extracted; finally the descriptions of these blocks are combined into the final descriptor. With the introduction of the bag-of-words model, the framework of classifying human behavior with local features has been widely adopted. Unlike global features, local features are less sensitive to noise and partial occlusion, and their extraction requires no foreground segmentation or tracking, so they are well suited to human behavior analysis in complex scenes. The main disadvantage of local features is that the global constraint relationships between points are ignored, so a higher-level description of the spatial relationships is needed to improve the classification performance of the existing bag-of-words model.
Disclosure of Invention
The invention uses local feature points and establishes a human behavior model by describing the temporal-order relationships of the feature points, finally realizing the classification of human behaviors. The extraction and description of the local feature points follow "Evaluation of local spatio-temporal features for action recognition" (2009), H. Wang, M. M. Ullah, A. Kläser, I. Laptev and C. Schmid, in Proc. BMVC'09. The method effectively improves the accuracy and robustness of the traditional approach by describing the temporal-order relationships of the local feature points.
The technical scheme adopted by the invention is as follows:
a human behavior classification detection method based on time sequence retention space-time characteristics comprises the following steps:
1) detecting a human target in a video sequence;
2) extracting space-time interest points from a space-time domain containing a human body target;
3) extracting the characteristics of the space-time interest points, and clustering the space-time interest points into a plurality of categories by using a mean value clustering method;
4) counting distribution information on a time axis of the time-space interest points belonging to each category to obtain time axis distribution characteristics;
5) combining time axis distribution characteristics corresponding to different types of space-time interest points to obtain time sequence characteristics of the video;
6) calculating bag-of-word characteristics of the space-time interest points, and fusing the bag-of-word characteristics with the time sequence characteristics to obtain a fusion characteristic histogram corresponding to the video;
7) using a bag-of-words model in which the histogram features are replaced by the fusion feature histograms obtained through steps 1)-6), and training for the different behavior categories to obtain the feature description templates corresponding to the different behavior categories;
8) when a video to be detected is input, the feature description of the video (namely its fusion feature histogram) is first extracted through steps 1)-6) and then matched by nearest neighbor matching against the feature description templates of the different behavior categories; the behavior category with the highest matching degree is the category of the video (an illustrative sketch of steps 1)-6) in code follows this list).
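For illustration only, the following Python sketch strings steps 1)-6) together. The helper names detect_human, extract_interest_points and timeline_features are assumptions introduced here and are not defined in this disclosure; the weighted concatenation mirrors the fusion of step 6).

```python
import numpy as np

def describe_video(frames, kmeans, alpha=0.5, beta=0.5, L=3):
    """Hypothetical sketch of steps 1)-6): fused feature histogram for one video.

    detect_human() and extract_interest_points() are assumed helpers (steps 1-2);
    timeline_features() stands in for the time-axis statistics of steps 4)-5).
    """
    rois = detect_human(frames)                               # step 1: human target detection
    points = extract_interest_points(frames, rois)            # step 2: spatio-temporal interest points
    descriptors = np.array([p["descriptor"] for p in points])
    labels = kmeans.predict(descriptors)                      # step 3: assign each point to one of K clusters
    frame_ids = np.array([p["frame"] for p in points])
    timeline = timeline_features(frame_ids, labels, kmeans.n_clusters, L)  # steps 4)-5)
    bow = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)   # step 6: bag-of-words counts
    bow /= max(bow.sum(), 1.0)
    return np.concatenate([alpha * timeline, beta * bow])     # step 6: weighted fusion
```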
Furthermore, the human behavior classification in the above method is performed on human behaviors that can be detected in the video.
Further, the spatio-temporal interest points extracted in step 2) refer to points where the gray level changes sharply in the spatio-temporal domain.
Further, the features of the spatio-temporal interest points extracted in step 3) are HOG (histogram of oriented gradients) and HOF (histogram of optical flow) features, or 3D SIFT features, or 3D HOG features.
Furthermore, the time sequence of the spatio-temporal interest points in the above method refers to the relative positional relationships, on the time axis, of all spatio-temporal interest points carrying the same category label. Step 4) further optimizes the extracted time-axis distribution features by ignoring video frames that contain no spatio-temporal interest points, thereby reducing the influence of static video frames on the interest point distribution and obtaining a more robust temporal-order relationship.
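A minimal sketch of this optimization, assuming each interest point is represented only by the index of the frame in which it occurs; frames without interest points are dropped and timestamps re-indexed before any distribution is computed:

```python
import numpy as np

def reindex_timestamps(frame_ids):
    """Map each interest point's frame index to its rank among the frames that
    actually contain interest points, ignoring empty (static) frames."""
    frame_ids = np.asarray(frame_ids)
    active_frames = np.unique(frame_ids)              # frames with at least one interest point
    rank = {f: i for i, f in enumerate(active_frames)}
    return np.array([rank[f] for f in frame_ids]), len(active_frames)

# Example: interest points occur only in frames 2, 5 and 9 of the video
ts, M = reindex_timestamps([2, 2, 5, 9])
print(ts, M)   # [0 0 1 2] 3
```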
Further, step 7) averages the histogram features of the plurality of corresponding videos for each behavior category, and takes the averaged histogram feature as the feature corresponding to the behavior category.
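A minimal sketch of the template construction of step 7) and the matching of step 8), assuming each training video has already been converted into its fused feature histogram; the Euclidean distance used here as the "matching degree" is an assumption for illustration:

```python
import numpy as np

def build_templates(histograms, labels):
    """Average the fused feature histograms of all training videos of each class."""
    templates = {}
    for c in set(labels):
        members = [h for h, l in zip(histograms, labels) if l == c]
        templates[c] = np.mean(members, axis=0)
    return templates

def nearest_template(hist, templates):
    """Return the class whose averaged template best matches the test histogram."""
    return min(templates, key=lambda c: np.linalg.norm(hist - templates[c]))
```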
The invention also provides a human behavior classification detection system based on time-sequence-preserving spatio-temporal features, which comprises: a video input end, a time-sequence feature extraction output end and an offline training classifier;
the video input end comprises a camera device capable of acquiring RGB images;
the time sequence feature extraction output end is used for acquiring the RGB image sequence from the video input end, and extracting and outputting time sequence features corresponding to human behaviors in the video;
the offline training classifier: a) calculating bag-of-word characteristics of an input video, and fusing the bag-of-word characteristics with the time sequence characteristics output by the time sequence characteristic extraction output end to obtain a fusion characteristic histogram corresponding to the video; b) training different classes of behaviors by utilizing a bag-of-words model and a classifier to obtain characteristic description templates corresponding to different classes of behaviors; c) and obtaining the feature description corresponding to the human body behavior in the input test video, carrying out nearest neighbor matching on the feature description corresponding to different behavior categories, wherein the feature description with the highest matching degree is the behavior category corresponding to the test video, and outputting a category label.
The invention realizes a robust human behavior classification method and system based on time-sequence-preserving spatio-temporal features: the temporal structure of the local spatio-temporal interest points is encoded through the order in which they occur, which increases the discrimination between different behavior categories. The invention is an extension of the framework that classifies behavior with a bag-of-words model and local feature points. FIG. 4 shows comparative results, in which the human behavior classification performance of the invention is the best.
Drawings
FIG. 1 is a flow chart of video descriptor (i.e., temporal-spatial feature histogram with order preservation) extraction according to the present invention;
FIG. 2 is a flow chart for combining a time-preserving spatiotemporal feature histogram with conventional bag-of-words features;
FIG. 3 shows samples from the databases used by the present invention;
FIG. 4 is a graph comparing the accuracy of the human behavior classification method of the present invention with the conventional bag-of-words features; wherein, the abscissa represents the number of clustering categories K, and the ordinate is the recognition rate.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1 and FIG. 2, the extraction steps of the time-sequence-preserving spatio-temporal feature histogram for a video containing human behavior are as follows:
1) Extraction and description of spatio-temporal interest points. The invention uses the spatio-temporal interest point detector and descriptor of C. Schüldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in ICPR, pp. 32-36, 2004. The parameters of the spatio-temporal interest point detector are kept consistent with those in the above reference. The interest point descriptor consists of a 90-dimensional HOG (histogram of oriented gradients) feature and a 72-dimensional HOF (histogram of optical flow) feature, concatenated into a 162-dimensional descriptor.
2) Clustering of spatio-temporal interest points. The invention adopts K-means clustering and sets different numbers of clusters for different databases in the experiments. The UT-Interaction and Rochester databases used in the experiments were proposed in M. S. Ryoo, "Human activity prediction: Early recognition of ongoing activities from streaming videos," in ICCV, pp. 1036-1043, 2011 and R. Messing, C. Pal, and H. Kautz, "Activity recognition using the velocity histories of tracked keypoints," in ICCV, pp. 104-111, 2009, respectively; the KTH database was introduced in the Schüldt et al. reference cited above. For the KTH database the number of clusters is set to 900; for the Rochester database, 2300; for the UT-Interaction #1 database, 2100; and for the UT-Interaction #2 database, 1900. It should be noted that FIG. 4 illustrates the relationship between the number of clusters and the recognition rate, from which it can be seen that the method still obtains a high recognition rate when the number of clusters varies over a wide range.
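As an illustrative sketch of this step (not the original implementation), per-point HOG and HOF blocks can be concatenated into 162-dimensional descriptors and clustered with scikit-learn's K-means, using the per-database cluster counts quoted above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster counts quoted in the text, keyed by database name (names are illustrative)
CLUSTERS = {"KTH": 900, "Rochester": 2300, "UT-Interaction#1": 2100, "UT-Interaction#2": 1900}

def build_codebook(hog, hof, dataset="KTH", seed=0):
    """hog: (N, 90) array and hof: (N, 72) array of per-point descriptors."""
    descriptors = np.hstack([hog, hof])            # 90 + 72 = 162-dimensional descriptor
    kmeans = KMeans(n_clusters=CLUSTERS[dataset], random_state=seed, n_init=10)
    labels = kmeans.fit_predict(descriptors)       # cluster label for every interest point
    return kmeans, labels
```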
3) Video feature extraction. The video feature extraction method mainly comprises the following steps (a code sketch follows this list):
a) detecting a time-space interest point from an input video;
b) clustering the space-time interest points into K types;
c) counting the number of spatio-temporal interest points on each frame, and removing video frames which do not contain the spatio-temporal interest points;
d) respectively counting a distribution histogram of each type of time-space interest points on a time axis;
e) softly assigning the distribution histogram of each class of spatio-temporal interest points into an L-dimensional vector; the invention designs a new distance-weighted soft assignment method, shown in formula (9) below;
f) counting a K-class time-space interest point number distribution histogram;
g) fusing, by weighting, the L-dimensional vector representing the time-sequence features with the count distribution histogram to obtain the final video descriptor.
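The following sketch illustrates steps c), d), f) and g) with plain hard binning of the timestamps; the distance-weighted soft assignment of step e) (formula (9)) is sketched separately after the formula definitions below. The array shapes and the equal-width interval choice are assumptions for illustration, not the original implementation.

```python
import numpy as np

def simple_video_descriptor(frame_ids, labels, K, L=3, alpha=0.5, beta=0.5):
    """Steps c), d), f), g) with hard binning; the patent's distance-weighted
    soft assignment of step e) is sketched separately further below."""
    frame_ids = np.asarray(frame_ids)
    labels = np.asarray(labels)
    # step c): keep only frames containing at least one interest point, re-index time
    active = np.unique(frame_ids)
    t = np.searchsorted(active, frame_ids)          # re-indexed timestamps in [0, M)
    M = len(active)
    # step d): per-class distribution over L equal-width time intervals
    timeline = np.zeros((K, L))
    for k in range(K):
        tk = t[labels == k]
        if len(tk):
            hist, _ = np.histogram(tk, bins=L, range=(0, M))
            timeline[k] = hist / len(tk)
    # step f): count histogram of the K classes of interest points
    bow = np.bincount(labels, minlength=K).astype(float)
    bow /= max(bow.sum(), 1.0)
    # step g): weighted fusion into the final video descriptor
    return np.concatenate([alpha * timeline.ravel(), beta * bow])
```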
Let S = {S_1, ..., S_k, ..., S_K} contain all spatio-temporal interest points extracted from one video V, where S_k contains all interest points with label k and k ranges from 1 to the number of clusters K; an interest point with label k that occurs in the n-th frame is denoted p_k^n. The parameter α denotes the weight of the time-sequence feature, β denotes the weight of the bag-of-words feature, and L denotes the dimension of the time-sequence-preserving feature; the default values of α and L are set to 0.5 and 3, respectively.
The feature extraction algorithm involves formulas (1) through (10), which appear as images in the original publication; the quantities they define are described by the accompanying text as follows.
Formula (1): p_k^n denotes the n-th of the N_k interest points of class k; its components are the abscissa, the ordinate, the timestamp and the category of the interest point.
Formula (2): Bow denotes a count distribution histogram; the function η computes the count distribution histogram B over the K classes of spatio-temporal interest points.
Formula (3): M denotes the number of video frames in the video that contain at least one spatio-temporal interest point.
Formula (4): the function δ indicates whether its two arguments are equal, taking the value 1 if they are and 0 otherwise; it is used to record the number of class-k spatio-temporal interest points contained in the i-th frame.
Formula (5): the number of all spatio-temporal interest points contained in the i-th frame is computed, and R_i indicates whether that number is greater than 0.
Formula (6): the set of class-k spatio-temporal interest points has its timestamps taken relative to the video obtained after the interference frames are removed, an interference frame being a video frame that contains no spatio-temporal interest point.
Formula (7): a probability density function over L intervals is estimated; its value on the i-th interval is the number of points observed in that interval divided by the number of sampling points, and the indicator function I determines whether a variable x falls within interval B_i, taking the value 1 if it does and 0 otherwise.
Formula (8): defines the intervals B_i for i from 1 to L.
Formula (9): l_i denotes the number of variables falling within interval B_i, and the function w(l_i) accounts for the influence of the surrounding intervals on the current interval B_i; this influence diminishes as the relative distance between the two intervals increases.
Formula (10): index(l_j) denotes the rank position of the variable l_j in the ordered list concerned, Length denotes the length of that list, and the result records the contribution of all class-k spatio-temporal interest points to interval i.
Substituting formula (10) into formula (7) yields the L-dimensional time-sequence feature of the spatio-temporal interest points, abbreviated as Q. The dimension of Q is reduced to D by the PCA method to obtain the reduced-dimension time-sequence feature.
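Because formulas (9) and (10) appear only as images in the original publication, the following sketch merely illustrates a distance-weighted soft assignment with the stated property that the influence of a neighboring interval diminishes as the interval distance grows, followed by the PCA reduction of Q; the exponential decay and the choice of D are assumptions, not the original formulas.

```python
import numpy as np
from sklearn.decomposition import PCA

def soft_timeline(t, L, M, decay=0.5):
    """Distance-weighted soft assignment of timestamps t (values in [0, M)) to L intervals.

    The exponential decay with interval distance is an assumption chosen only to
    satisfy the stated property that neighboring intervals contribute less as the
    relative distance between intervals increases.
    """
    t = np.asarray(t, dtype=float)
    counts, _ = np.histogram(t, bins=L, range=(0, M))     # hard counts per interval
    q = np.zeros(L)
    for i in range(L):
        for j in range(L):
            q[i] += counts[j] * decay ** abs(i - j)        # farther intervals contribute less
    return q / max(q.sum(), 1e-12)

def reduce_timeline(Q, D=10):
    """Reduce stacked per-video time-sequence features Q (n_videos x dim) to D dimensions."""
    return PCA(n_components=D).fit_transform(np.asarray(Q))
```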
As shown in FIG. 3, the KTH, Rochester and UT-Interaction databases were used for the experiments. KTH contains 6 human behavior actions: "boxing", "hand clapping", "hand waving", "jogging", "running" and "walking", performed 4 times by each of 25 persons, for a total of 600 video segments. Rochester contains 10 human behavior actions: "answer a phone", "chop a banana", "dial a phone", "drink water", "eat a banana", "eat snacks", "look up a phone number in a phone book", "peel a banana", "eat food with silverware" and "write on a whiteboard", performed 3 times by each of 5 persons, for a total of 150 videos. UT-Interaction contains 6 human behaviors: "hug", "kick", "point", "punch", "push" and "shake-hands", each performed 10 times in two scenes, for a total of 120 videos. From top to bottom, FIG. 3 shows behavior examples from the KTH, Rochester, UT-Interaction #1 and UT-Interaction #2 databases, respectively.
FIG. 4 shows the classification results on (a) the KTH database, (b) the Rochester database, (c) the UT database including UT-Interaction #1 and UT-Interaction #2, and (d) the UT-Interaction #2 database. The offline training and classification module adopts leave-one-out cross-validation and uses a support vector machine as the classifier to compare the matching degree between the test sample and the template obtained by training. The support vector machine employs the Chebyshev kernel. FIG. 4 compares the conventional bag-of-words model, the time-sequence feature proposed by the invention, and the combination of the two. The abscissa is the number of clusters K and the ordinate is the recognition rate. It can be seen that when K varies over a wide range, the classification accuracy of the proposed time-sequence feature is higher than that of the bag-of-words model, and the recognition rate obtained by combining the two features is the highest.
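A sketch of this evaluation protocol with scikit-learn; kernel_matrix stands in for whatever histogram kernel is used (the kernel named above) and is passed in as a parameter, since the exact kernel implementation is not reproduced here:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def leave_one_out_accuracy(features, labels, kernel_matrix):
    """features: (n_videos, dim) fused histograms; kernel_matrix: callable returning a Gram matrix."""
    X = np.asarray(features)
    y = np.asarray(labels)
    K_full = kernel_matrix(X, X)                    # precomputed Gram matrix over all videos
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="precomputed")
        clf.fit(K_full[np.ix_(train_idx, train_idx)], y[train_idx])
        pred = clf.predict(K_full[np.ix_(test_idx, train_idx)])
        correct += int(pred[0] == y[test_idx][0])
    return correct / len(y)
```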
It should be noted that the HOG/HOF features used in the invention can be replaced by 3D SIFT (3D Scale-Invariant Feature Transform) or 3D HOG (3D Histogram of Oriented Gradients) features. The PCA dimensionality reduction method used can be replaced by the LDA (Linear Discriminant Analysis) method.
The above examples are merely illustrative of the present invention and although the preferred embodiments of the present invention and the accompanying drawings have been disclosed for illustrative purposes, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. Therefore, the present invention should not be limited to the disclosure of the preferred embodiments and the accompanying drawings.

Claims (8)

1. A robust human behavior classification detection method based on time sequence retention space-time characteristics is characterized by comprising the following steps:
1) detecting a human target in a video sequence;
2) extracting space-time interest points from a space-time domain containing a human body target;
3) extracting the characteristics of the space-time interest points, and clustering the space-time interest points into a plurality of categories by using a mean value clustering method;
4) counting a distribution histogram on a time axis of the time-space interest points belonging to each category to obtain time axis distribution characteristics;
the time axis distribution characteristics comprise complete precedence order among the time-space interest points of the same class labels;
the precedence order considers the relative appearance order between the spatio-temporal interest points, rather than the exact number of frames apart, to increase the robustness of the extracted timing sequence information;
5) combining time axis distribution characteristics corresponding to different types of space-time interest points to obtain time sequence characteristics of the video;
6) calculating bag-of-word characteristics of the space-time interest points, and fusing the bag-of-word characteristics with the time sequence characteristics to obtain a fusion characteristic histogram corresponding to the video;
7) utilizing a bag-of-words model and converting the histogram features in the model into the fusion feature histograms obtained in the steps 1) to 6), and training aiming at different behavior categories to obtain feature description templates corresponding to the different behavior categories;
8) when a video to be detected is input, firstly extracting the feature description of the video from the steps 1) to 6), and then carrying out nearest neighbor matching with feature description templates of different behavior categories, wherein the highest matching degree is the behavior category corresponding to the video.
2. The method of claim 1, wherein the spatio-temporal interest points of step 2) are points in the spatio-temporal domain where the gray level changes sharply.
3. The method of claim 1, wherein the features of the spatio-temporal interest points extracted in step 3) are HOG and HOF features, or 3D SIFT features, or 3D HOG features.
4. The method of claim 1, wherein step 4) describes the time-axis distribution features with the spatio-temporal interest points having the same class label rather than with all spatio-temporal interest points.
5. The method as claimed in claim 4, wherein the step 4) optimizes the extracted time-axis distribution characteristics by omitting video frames containing no spatio-temporal interest points to reduce the influence of the still video frames on the spatio-temporal interest point distribution and obtain a more robust timing relationship.
6. The method of claim 1, wherein step 6) uses a weighted fusion method to perform weighted fusion of the bag-of-words model and the reduced-dimension time-series features.
7. The method according to claim 6, wherein step 7) averages the histogram features of the corresponding plurality of videos for each behavior class, and takes the averaged histogram feature as the feature corresponding to the behavior class.
8. A human behavior classification detection system based on time sequence space-time characteristics and adopting the method of claim 1 is characterized by comprising a video input end, a time sequence characteristic extraction output end and an off-line training classifier;
the video input end comprises a camera device capable of acquiring RGB images;
the time sequence feature extraction output end is used for acquiring an RGB image sequence from the video input end, and extracting and outputting time sequence features corresponding to human behaviors in the video;
the offline training classifier calculates the bag-of-word features of the input video, and fuses the bag-of-word features with the time sequence features output by the time sequence feature extraction output end to obtain a fusion feature histogram corresponding to the video; then training different classes of behaviors by utilizing a bag-of-words model and a classifier to obtain characteristic description templates corresponding to different classes of behaviors; and obtaining the feature description corresponding to the human body behavior in the input test video, carrying out nearest neighbor matching on the feature description corresponding to different behavior categories, wherein the feature description with the highest matching degree is the behavior category corresponding to the test video, and outputting a category label.
CN201610201446.XA 2016-04-01 2016-04-01 Human behavior classification detection method and system based on time sequence retention space-time characteristics Expired - Fee Related CN105893967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610201446.XA CN105893967B (en) 2016-04-01 2016-04-01 Human behavior classification detection method and system based on time sequence retention space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610201446.XA CN105893967B (en) 2016-04-01 2016-04-01 Human behavior classification detection method and system based on time sequence retention space-time characteristics

Publications (2)

Publication Number Publication Date
CN105893967A CN105893967A (en) 2016-08-24
CN105893967B true CN105893967B (en) 2020-04-10

Family

ID=57012137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610201446.XA Expired - Fee Related CN105893967B (en) 2016-04-01 2016-04-01 Human behavior classification detection method and system based on time sequence retention space-time characteristics

Country Status (1)

Country Link
CN (1) CN105893967B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650617A (en) * 2016-11-10 2017-05-10 江苏新通达电子科技股份有限公司 Pedestrian abnormity identification method based on probabilistic latent semantic analysis
CN108764026B (en) * 2018-04-12 2021-07-30 杭州电子科技大学 Video behavior detection method based on time sequence detection unit pre-screening
CN111339980B (en) * 2020-03-04 2020-10-09 镇江傲游网络科技有限公司 Action identification method and device based on space-time histogram

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605986A (en) * 2013-11-27 2014-02-26 天津大学 Human motion recognition method based on local features
CN103854016A (en) * 2014-03-27 2014-06-11 北京大学深圳研究生院 Human body behavior classification and identification method and system based on directional common occurrence characteristics
CN104021381A (en) * 2014-06-19 2014-09-03 天津大学 Human movement recognition method based on multistage characteristics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities; M. S. Ryoo et al.; 2009 IEEE 12th International Conference on Computer Vision; 2010-05-06; pp. 1593-1600 *
A survey of human motion description methods for video sequences; Sun Qianru et al.; CAAI Transactions on Intelligent Systems; 2013-06-30; vol. 8, no. 3; pp. 189-198 *

Also Published As

Publication number Publication date
CN105893967A (en) 2016-08-24

Similar Documents

Publication Publication Date Title
Zhang et al. Animal detection from highly cluttered natural scenes using spatiotemporal object region proposals and patch verification
de Melo et al. Combining global and local convolutional 3d networks for detecting depression from facial expressions
Soomro et al. Action recognition in realistic sports videos
Özyer et al. Human action recognition approaches with video datasets—A survey
Wang et al. Hierarchical attention network for action recognition in videos
CN103854016B (en) Jointly there is human body behavior classifying identification method and the system of feature based on directivity
CA3046035A1 (en) System and method for cnn layer sharing
Xian et al. Evaluation of low-level features for real-world surveillance event detection
Song et al. Unsupervised Alignment of Actions in Video with Text Descriptions.
CN106709419B (en) Video human behavior recognition method based on significant trajectory spatial information
Zhang et al. A survey on face anti-spoofing algorithms
Chen et al. Recognition of aggressive human behavior using binary local motion descriptors
Sekma et al. Human action recognition based on multi-layer fisher vector encoding method
Safaei et al. Still image action recognition by predicting spatial-temporal pixel evolution
Xu et al. Action recognition by saliency-based dense sampling
Gammulle et al. Coupled generative adversarial network for continuous fine-grained action segmentation
Symeonidis et al. Neural attention-driven non-maximum suppression for person detection
Yi et al. Mining human movement evolution for complex action recognition
CN105893967B (en) Human behavior classification detection method and system based on time sequence retention space-time characteristics
Wang et al. Action recognition using edge trajectories and motion acceleration descriptor
Khan et al. Robust head detection in complex videos using two-stage deep convolution framework
Wang et al. Pig face recognition model based on a cascaded network
El‐Henawy et al. Action recognition using fast HOG3D of integral videos and Smith–Waterman partial matching
Bukht et al. A novel framework for human action recognition based on features fusion and decision tree
Bhattacharya et al. Covariance of motion and appearance featuresfor spatio temporal recognition tasks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20181220

Address after: 518000 Guangdong Province Nanshan District Taoyuan Street Xili Town Lishan Road Sangtai Building University Town Pioneer Park 506

Applicant after: SHENZHEN GANDONG SMART TECHNOLOGY Co.,Ltd.

Address before: 518055 North University Campus, Shenzhen University Town, Xili Town, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: PEKING University SHENZHEN GRADUATE SCHOOL

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200410

CF01 Termination of patent right due to non-payment of annual fee