CN104732208A - Video human action recognition method based on sparse subspace clustering

Info

Publication number: CN104732208A
Application number: CN201510114150.XA
Authority: CN
Inventors: 郝宗波; 桑楠; 陆霖霖; 吴杰; 杨眷玉; 万士宁; 赵俊; 朱前芳; 鄢宇烈
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2015-03-16
Filing date: 2015-03-16
Publication date: 2015-06-24
Anticipated expiration: 2035-03-16
Also published as: CN104732208B

Abstract

The invention belongs to the field of computer vision pattern recognition and video image processing. The method comprises: when building the video human action recognition model, establishing three-dimensional spatio-temporal subframe cubes, establishing a human action feature space, performing clustering, and updating labels; and, when recognizing human actions, extracting three-dimensional spatio-temporal subframe cubes from surveillance video, extracting human action features, determining the sub-action category of each video, and merging the videos that carry sub-category labels back into their original categories. On the Hollywood2 human action database, the recognition accuracy of the method is 16.5 percentage points higher than the best accuracy previously reported. The method can therefore automatically extract human action features with higher discriminability, adaptability, generality, and invariance, reduces overfitting and the gradient diffusion problem in the neural network, and effectively improves the accuracy of human action recognition in complex environments; it can be widely applied to on-site video surveillance and video content retrieval.

Description

Video human behavior recognition method based on sparse subspace clustering

Technical field

The invention belongs to the field of computer vision pattern recognition and video image processing, and in particular relates to a video behavior recognition method that uses sparse subspace clustering (SSC) to cluster and subdivide behavior features, and that splits a deep-learning neural network with many layers into several shallower deep-learning neural networks with fewer layers.

Background technology

Video-based human behavior recognition has been a hot topic in computer vision in recent years. As a typical video understanding problem, it identifies and determines human behavior patterns by analyzing the human action features in a video image sequence. More specifically, feature information that can describe the behavior is extracted from the video image sequence, interpreted with techniques such as machine learning, and classified with a classifier, so as to recognize the human behavior.

With the development of modern information technology and rising demands for public security, the need to understand human behavior in daily life is growing. Human behavior recognition has broad application prospects in intelligent video surveillance, video content retrieval, novel human-computer interaction, virtual reality, video coding and transmission, game control, and other areas, and has attracted wide attention. Video human behavior recognition falls into three categories: recognition based on spatio-temporal methods, recognition based on sequence methods, and recognition based on deep learning.

Among these: 1. Human behavior recognition based on spatio-temporal methods treats a video as a 3D volume formed by arranging 2D images along the time axis and builds a spatio-temporal representation of it. It further includes recognition based on three-dimensional spatio-temporal volumes, recognition based on three-dimensional spatio-temporal local features, and recognition based on trajectories. The drawbacks of this class of methods are that the human behavior features are mostly hand-designed and therefore depend heavily on the designer's experience, and that the computational cost is high or the adaptability is poor.

2. Human behavior recognition based on sequence methods extracts a feature vector from each frame of the video, assembles the related feature vectors into a feature sequence that ultimately characterizes the human behavior in the video, and performs classification on that basis. A common approach is recognition based on state model sequences: the video is represented as a sequence of states, each static human posture is defined as a state, different states are linked by probabilities, and a coherent human behavior is regarded as a series of transitions between the states of these static postures. Based on this theory, a probabilistic model is generated and recognition is performed using similarity. Hidden Markov models (HMMs) are the typical representative of this approach.

3. Human behavior recognition based on deep learning draws on biological neural theory. Deep learning is a popular frontier of machine learning whose motivation is to build networks that simulate the neural networks of the human brain, that is, to imitate the way the cerebral cortex interprets data hierarchically. In recent years deep learning has been widely used in human behavior recognition. The method learns features automatically and directly from raw data; unlike traditional feature extraction, these features require no manual design and offer higher adaptability, generality, and invariance (such as translation, scale, and rotation invariance). 3D convolutional neural networks (3D CNNs) are the typical representative of this approach. They extend the traditional convolutional neural network from the two-dimensional image space into the time domain and learn spatio-temporal features automatically and directly from the raw video sequence, replacing traditional spatio-temporal interest points and descriptors, and they achieve good recognition rates for simple human behaviors such as clapping and waving. Although this is currently the most popular and effective approach to human behavior recognition, it is prone to the overfitting that commonly occurs in neural networks. In addition, as the number of layers of a deep-learning neural network increases, the gradient diffusion problem tends to arise when errors are back-propagated during parameter tuning, which affects training; and the approach currently performs poorly on human behavior recognition in more complex scenes (for example with different backgrounds, different camera angles, and different contextual environments).

The patent document with publication number CN103955671A, entitled "Human behavior recognition method based on a fast common vector algorithm", discloses a human behavior recognition method based on fast common vectors, which uses the fast common vector algorithm to improve classification speed and to solve the small-sample problem in human behavior recognition. The input video sequence is first split into frames, converted to grayscale, and denoised; a temporal differencing method is then applied to the frames to detect the moving human target and extract the target foreground; the target region is normalized in size; k-means clustering is used to obtain the key frames of the behavior sequence; and finally the fast common vector is used to classify the behavior. Although this method improves recognition efficiency to some extent, solves the small-sample problem in human behavior recognition, and achieves relatively high accuracy in an ideal environment (that is, a simple background without obvious noise), it relies mainly on traditional image processing, so the extracted features are quite limited and easily affected by the external environment, and it performs poorly on human behavior recognition in more complex scenes.

The patent document with publication number CN103810496A, entitled "3D Gaussian space human behavior recognition method based on image depth information", discloses a 3D Gaussian space human behavior recognition method based on image depth information. The 3D coordinates of the human skeleton are first extracted from the depth information and normalized, and joints and joint motions that contribute little to the recognition rate are filtered out; an interest joint group is then built for each behavior, AP clustering is performed on the spatial features of human actions based on a Gaussian distance kernel, and a behavior feature vocabulary is obtained and cleaned; finally a conditional random field recognition model of human behavior is built and used to classify the human behavior. Although this method is robust to the orientation of the human body, skeleton size, and spatial position, has a degree of generalization ability, and is suitable for human behavior recognition in relatively ideal environments, it requires a costly 3D depth camera, its algorithm is rather complex, and its performance on human behavior recognition in more complex scenes is still unsatisfactory.

Summary of the invention

The object of the invention is to address the defects of the background art by designing a video human behavior recognition method based on sparse subspace clustering. The method automatically extracts human behavior features with higher discriminability, adaptability, generality, and invariance, and reduces overfitting and the gradient diffusion problem in the neural network, so as to effectively improve the accuracy of human behavior recognition in complex environments (for example with different backgrounds, different camera angles, and different contextual environments) and to be widely applicable to on-site video surveillance, video content retrieval, and similar uses.

The solution of the present invention is as follows:

The invention recognizes that, owing to factors such as camera distance, different contextual environments, and different backgrounds, the features of a single class of human behavior in more complex scenes can often be subdivided. After feature extraction has been completed for the input human behavior video samples, that is, after the samples have been mapped from the sample space to the feature space, sparse subspace clustering (SSC) is used to cluster the features of each class of human behavior and subdivide it into several sub-behaviors; the corresponding human behavior class labels are then updated and training is performed again. At the same time, a deep-learning neural network with many layers is split into several shallower deep-learning neural networks with fewer layers, so as to improve network performance and alleviate overfitting and the gradient diffusion problem. During recognition, the recognition results for the sub-behaviors are mapped back to the original behaviors before the recognition rate is computed. The invention thereby further improves the recognition rate of deep-learning behavior recognition algorithms and meets the requirement of higher recognition accuracy for human behavior in more complex scenes, thus achieving its object. The method of the invention accordingly comprises:

A. Establishing the video human behavior recognition model:

A1. Establishing three-dimensional spatio-temporal subframe cubes: each frame of the human behavior videos of each class in the human behavior database used for learning is divided into subframes of equal size, and the length of a run of consecutive frames of the corresponding human behavior video is taken as the thickness, so as to build three-dimensional spatio-temporal subframe cubes; each resulting subframe cube is given the same class label as its original human behavior video;

A2. Establishing the human behavior feature space: each three-dimensional spatio-temporal subframe cube built in step A1, together with the class label of its human behavior video, is input into a deep-learning neural network for a first round of training, so as to extract features for classification whose recognition rate on the given behavior classes in the human behavior database exceeds 50%, thereby establishing the human behavior feature space after the first training;

A3. Clustering: for the human behavior feature space built in step A2, the sparse subspace clustering (SSC) method is applied separately to the features of each class of human behavior in the feature space, so that each class of human behavior features is further subdivided into subclass behavior features; the number of behavior feature subclasses is determined automatically by the SSC method;

A4. Updating labels: according to the subdivision produced by the sparse subspace clustering method in step A3, each behavior feature subclass video obtained by the clustering is given a sub-label under the class label of its original human behavior video, yielding samples with updated labels;

A5. Establishing the video human behavior recognition model: the relabeled samples obtained in step A4 are input into the same deep-learning neural network as in step A2 for a second round of training, so as to extract human behavior features further; the extracted behavior features are then input into a classifier for classification, thereby establishing the model used for video human behavior recognition; the neural network parameters after the second training are saved for later use;

B. Recognizing human behavior:

B1. Extracting three-dimensional spatio-temporal subframe cubes from surveillance video: using the same method as in step A1, three-dimensional spatio-temporal subframe cubes of the same size and number as in step A1 are extracted from each monitored human behavior video segment, and the process then proceeds to step B2;

B2. Extracting human behavior features: the three-dimensional spatio-temporal subframe cubes of each video segment extracted in step B1 are input into the neural network trained and saved in step A5, so as to extract the human sub-behavior features of each video segment;

B3. Determining the sub-behavior category of each video: the human sub-behavior features of each video segment extracted in step B2 are input into the classifier, and each surveillance video segment is classified in turn, yielding videos that carry subclass labels;

B4. Merging videos that carry subclass labels: the videos with subclass labels obtained in step B3 are merged according to the major classes defined by the Hollywood2 human behavior database, giving the behavior category to which the human behavior in each video belongs; the result is stored for later retrieval.

The human behavior database used for learning in steps A1 and B4 is the Hollywood2, KTH, HMDB51, UCF101, or Sports-1M human behavior database.

The deep-learning neural network described in step A2 is an independent subspace analysis (ISA) neural network.
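By way of illustration only, the forward pass of a single ISA layer might be sketched as follows. The sketch uses the usual two-layer ISA formulation, in which learned linear filters W are followed by a fixed pooling matrix V that sums squared responses within each subspace. The layer sizes, the random stand-in for the learned filters, and the function name `isa_forward` are assumptions; the patent does not describe the network architecture or its training.

```python
import numpy as np

def isa_forward(x, W, V):
    """ISA response p_i = sqrt(sum_k V[i, k] * (W[k] @ x) ** 2)."""
    linear = W @ x                    # first layer: linear filter responses
    return np.sqrt(V @ linear ** 2)   # second layer: pooled energy per subspace

# Hypothetical sizes: a flattened 16 x 16 x 10 cube, 8 filters, subspaces of 2.
rng = np.random.default_rng(0)
input_dim, n_filters, subspace_size = 16 * 16 * 10, 8, 2
W = rng.standard_normal((n_filters, input_dim))   # stand-in for learned filters
# V is fixed: row i holds ones over the filters that belong to subspace i.
V = np.kron(np.eye(n_filters // subspace_size), np.ones((1, subspace_size)))

cube = rng.standard_normal(input_dim)
features = isa_forward(cube, W, V)    # one non-negative value per subspace
```

Because the responses are squared before pooling, the features are unchanged when the sign of the input is flipped, which is one source of the invariance attributed to this kind of network.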

The sparse subspace clustering (SSC) method used in step A3 comprises the following steps:

A3-1. The features of each human behavior video in the behavior feature space obtained in step A2 are arranged in column-major order to generate a dictionary, and a sparse coding method is then used to determine the sparse coefficients (C);

A3-2. The sparse coefficients (C) are normalized;

A3-3. Forming the feature graph of each class of human behavior: the absolute value of the sparse coefficients obtained in step A3-2 is taken and added to its transpose to obtain the adjacency matrix; a feature graph of the class is then formed in which each video sample is a node and the adjacency matrix gives the edge weights;

A3-4. Clustering and subdivision: the sparse subspace clustering (SSC) method is used to cluster the feature graph obtained in step A3-3 and subdivide it into the behavior feature subclasses.
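Steps A3-1 to A3-3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the sparse coding step is approximated by an l1-regularized least-squares problem solved with iterative soft thresholding, and each column of C is normalized by its largest absolute entry. The patent names neither the sparse coding solver nor the normalization, and the function names, the parameter `lam`, and the toy dictionary `Y` are hypothetical.

```python
import numpy as np

def sparse_self_representation(Y, lam=0.1, n_iter=500):
    """For each column y_i, approximately solve
    min_c 0.5 * ||y_i - Y c||^2 + lam * ||c||_1  subject to c_i = 0."""
    n = Y.shape[1]
    C = np.zeros((n, n))
    step = 1.0 / np.linalg.norm(Y, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for i in range(n):
        c = np.zeros(n)
        for _ in range(n_iter):
            grad = Y.T @ (Y @ c - Y[:, i])
            c = c - step * grad
            c = np.sign(c) * np.maximum(np.abs(c) - step * lam, 0.0)  # soft threshold
            c[i] = 0.0                       # a sample may not represent itself
        C[:, i] = c
    return C

def adjacency_from_coefficients(C):
    """Steps A3-2 and A3-3: normalize each column, then W = |C| + |C|^T."""
    scale = np.abs(C).max(axis=0)
    scale[scale == 0] = 1.0                  # leave all-zero columns untouched
    C = C / scale
    return np.abs(C) + np.abs(C).T

# Toy dictionary: columns 0-2 lie on one line, columns 3-5 on another.
Y = np.array([[1.0, 2.0, -1.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 0.5, 3.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
C = sparse_self_representation(Y)
W = adjacency_from_coefficients(C)
```

In this toy case each sample is represented only by samples from its own line, so W has no edges between the two groups; that block structure is what step A3-4 exploits.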

The classifier described in steps A5 and B3 is a Softmax classifier.
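For illustration, a Softmax classifier of the kind named above might be sketched as a linear model trained by batch gradient descent on the cross-entropy loss. The learning rate, iteration count, function names, and toy data are assumptions; the patent does not specify how the classifier is trained.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.5, n_iter=200):
    """Fit weights and biases by gradient descent on the mean cross-entropy."""
    n, d = X.shape
    Wt = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(n_iter):
        p = softmax(X @ Wt + b)
        Wt -= lr * X.T @ (p - onehot) / n
        b -= lr * (p - onehot).mean(axis=0)
    return Wt, b

def predict(X, Wt, b):
    """Return the index of the most probable class for each row of X."""
    return softmax(X @ Wt + b).argmax(axis=1)

# Toy features for two (sub)classes; real inputs would be the network features.
X = np.array([[-1.0, -1.0], [-1.0, -0.8], [1.0, 1.0], [0.8, 1.0]])
y = np.array([0, 0, 1, 1])
Wt, b = train_softmax(X, y, n_classes=2)
predicted = predict(X, Wt, b)
```

In the method described here the class indices would correspond to the sub-labels produced in step A4 rather than to the original class labels.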

After feature extraction has been completed for the input human behavior video samples, the invention uses the sparse subspace clustering method to cluster the features of each class of human behavior and subdivide it into several sub-behaviors, then updates the corresponding human behavior class labels and trains again. At the same time, a deep-learning neural network with many layers is split into several shallower deep-learning neural networks with fewer layers, so as to improve network performance and alleviate overfitting and the gradient diffusion problem. During recognition, the recognition results for the sub-behaviors are mapped back to the original behaviors before the recognition rate is computed. As a result, the overall recognition accuracy of the ISA neural network on the Hollywood2 human behavior database is raised to 80.8%, a substantial improvement over the 53.3% obtained using the ISA neural network alone; and compared with the highest known recognition accuracy of 64.3% on the Hollywood2 human behavior database, the invention is 16.5 percentage points higher. The invention therefore automatically extracts human behavior features with higher discriminability, adaptability, generality, and invariance, reduces overfitting and the gradient diffusion problem in the neural network, effectively improves the accuracy of human behavior recognition in complex environments, and can be widely applied to on-site video surveillance and video content retrieval.

Embodiment

The hardware configuration of the invention is: a Dell server with an 8-core 2.60 GHz CPU and 128 GB of memory. The software configuration is: the Windows Server 2003 operating system, the OpenCV open-source computer vision library, the Microsoft Visual Studio 2010 development environment, the Matlab simulation environment, and so on.

The embodiment of the invention comprises a training stage and a recognition stage, implemented as follows:

A. Establishing the video human behavior recognition model:

A1: Establishing three-dimensional spatio-temporal subframe cubes: each frame of the human behavior videos of each class in the Hollywood2 human behavior database used for learning is divided into subframes of equal size (16 × 16 pixels), and the length of a run of consecutive frames (10 frames) of the corresponding human behavior video is taken as the thickness, so as to build three-dimensional spatio-temporal subframe cubes (16 pixels × 16 pixels × 10 frames); each resulting subframe cube is marked with its sub-label under the same class label as its original human behavior video;

The video library used for learning is the Hollywood2 human behavior database, which contains 12 classes of behavior common in real life: making a phone call (class label 1), driving (class label 2), eating (class label 3), fighting (class label 4), getting out of a car (class label 5), shaking hands (class label 6), hugging (class label 7), kissing (class label 8), running (class label 9), sitting down (class label 10), sitting up (class label 11), and standing up (class label 12), with 823 training videos in total;

In this implementation the video resolution of the database is unified to 200 × 160 pixels; the highest recognition accuracy previously reported on this database is 64.3%; there are 884 test videos in total;
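As a rough sketch of step A1, the following code cuts a grayscale video array into cubes of 16 × 16 pixels × 10 frames. It assumes that the cubes do not overlap in space or time and that incomplete cubes at the borders are discarded; the patent states only the cube size, so both choices, like the function name `extract_cubes`, are assumptions.

```python
import numpy as np

def extract_cubes(video, patch=16, depth=10):
    """Split a (frames, height, width) video into non-overlapping cubes."""
    t, h, w = video.shape
    t, h, w = t - t % depth, h - h % patch, w - w % patch   # drop incomplete borders
    v = video[:t, :h, :w]
    v = v.reshape(t // depth, depth, h // patch, patch, w // patch, patch)
    v = v.transpose(0, 2, 4, 1, 3, 5)   # (blocks_t, blocks_h, blocks_w, depth, patch, patch)
    return v.reshape(-1, depth, patch, patch)

# A dummy 30-frame clip at the 200 x 160 resolution used in this embodiment
# (height 160, width 200); real frames would be loaded from the video file.
video = np.zeros((30, 160, 200), dtype=np.float32)
cubes = extract_cubes(video)
labels = np.full(len(cubes), 1)   # every cube inherits the class label of its video
```

With these sizes the clip yields 3 × 10 × 12 = 360 cubes, because only 192 of the 200 columns fit a whole number of 16-pixel subframes.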

A2: Establishing the human behavior feature space: each three-dimensional spatio-temporal subframe cube built in step A1, together with the class label of its human behavior video, is input into the deep-learning neural network ISA (Independent Subspace Analysis) for a first round of training, so as to extract features for classification that give an average recognition rate of 53.3% over the 12 given behavior classes in the Hollywood2 human behavior database, thereby establishing a 3000-dimensional behavior feature space after the first training;

A3: Clustering: for the behavior feature space built in step A2, the sparse subspace clustering (SSC) method is applied separately to the features of each class of human behavior in the feature space, so that each class of human behavior features is further subdivided into subclass behavior features. The SSC clustering procedure is as follows:

A3-1: The features of each human behavior video in the behavior feature space obtained in step A2 are arranged in column-major order, one video per column, to generate a dictionary, and a sparse coding method is then used to determine the sparse coefficients C. In Hollywood2, the dictionary matrix for the phone-call behavior is 3000 × 68 and its C matrix is 68 × 68; for driving, 3000 × 85 and 85 × 85; for eating, 3000 × 39 and 39 × 39; for fighting, 3000 × 54 and 54 × 54; for getting out of a car, 3000 × 48 and 48 × 48; for shaking hands, 3000 × 32 and 32 × 32; for hugging, 3000 × 58 and 58 × 58; for kissing, 3000 × 99 and 99 × 99; for running, 3000 × 122 and 122 × 122; for sitting down, 3000 × 93 and 93 × 93; for sitting up, 3000 × 22 and 22 × 22; and for standing up, 3000 × 110 and 110 × 110;

A3-2: The sparse coefficients C are normalized;

A3-3: Determining the feature graph of each class of human behavior: the absolute value of the sparse coefficients obtained in step A3-2 is taken and added to its transpose to obtain the adjacency matrix, that is, $W = |C| + |C|^{T}$, where W is the adjacency matrix; a feature graph of the class is then formed in which each video sample is a node and the adjacency matrix gives the edge weights. The phone-call behavior in Hollywood2 comprises 68 video samples, that is, 68 nodes; likewise, driving has 85 nodes, eating 39, fighting 54, getting out of a car 48, shaking hands 32, hugging 58, kissing 99, running 122, sitting down 93, sitting up 22, and standing up 110;

A3-4: Clustering and subdivision: the sparse subspace clustering (SSC) method is used to cluster the feature graph obtained in step A3-3 and subdivide it into the behavior feature subclasses;
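One plausible reading of step A3-4 is spectral clustering of the adjacency matrix W, which is how sparse subspace clustering is usually completed. In the sketch below the number of subclasses is chosen by the largest gap between consecutive eigenvalues of the normalized graph Laplacian. Both the spectral clustering step and the eigengap rule are assumptions, since the patent states only that the number of subclasses is determined automatically; the function name, `max_k`, and the toy graph are hypothetical, and the graph is assumed to have at least two nodes.

```python
import numpy as np

def spectral_subclasses(W, max_k=5, n_iter=50):
    """Cluster the nodes of affinity graph W; k is set by the largest eigengap."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]  # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    gaps = np.diff(vals[: max_k + 1])
    k = int(np.argmax(gaps)) + 1                   # count of small eigenvalues before the gap
    U = vecs[:, :k]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Deterministic farthest-point initialization followed by plain k-means.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        assign = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = U[assign == j].mean(axis=0)
    return k, assign

# Toy graph: two groups of three fully connected nodes, no edges between groups.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
k, assign = spectral_subclasses(W)
```

For the toy graph the Laplacian has two zero eigenvalues followed by a clear gap, so two subclasses are found and each group of nodes receives its own cluster index.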

A4: Updating labels: according to the SSC subdivision in step A3, each behavior feature subclass video obtained by the clustering is given a sub-label under the class label of its original human behavior video, yielding samples with updated labels. After step A3 the Hollywood2 human behavior database forms 29 subclasses in total: label 1 is subdivided into labels 1.1 and 1.2; label 2 into 2.1, 2.2, and 2.3; label 4 into 4.1 and 4.2; label 5 into 5.1 and 5.2; label 7 into 7.1 and 7.2; label 8 into 8.1, 8.2, 8.3, and 8.4; label 9 into 9.1, 9.2, 9.3, and 9.4; label 10 into 10.1, 10.2, and 10.3; and label 12 into 12.1, 12.2, 12.3, and 12.4. The behaviors represented by labels 3, 6, and 11 have few videos and are therefore not subdivided in this embodiment;

A5: Establishing the video human behavior recognition model: the relabeled samples obtained in step A4 are input into the same deep-learning neural network ISA as in step A2 for a second round of training, so as to extract human behavior features further; the features are input into a Softmax classifier for classification, thereby establishing the model used for video human behavior recognition; the neural network parameters after the second training are saved for later use;
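The label update of step A4 amounts to simple bookkeeping, sketched below. The function name `update_labels` and the use of strings such as "8.3" to hold sub-labels are illustrative assumptions; cluster indices are taken to start at 0.

```python
def update_labels(class_labels, cluster_ids):
    """Build sub-labels such as '8.3' from a class label and a cluster index.

    Classes whose samples all fall into one cluster keep their original label.
    """
    clusters_per_class = {}
    for c, k in zip(class_labels, cluster_ids):
        clusters_per_class.setdefault(c, set()).add(k)
    new_labels = []
    for c, k in zip(class_labels, cluster_ids):
        if len(clusters_per_class[c]) == 1:
            new_labels.append(str(c))          # class was not subdivided
        else:
            new_labels.append(f"{c}.{k + 1}")  # sub-labels are numbered from 1
    return new_labels

# Three videos of class 1 split into two clusters, two videos of class 3 unsplit.
labels = update_labels([1, 1, 1, 3, 3], [0, 1, 0, 0, 0])
# labels == ['1.1', '1.2', '1.1', '3', '3']
```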

B. Recognizing human behavior: to verify the effect of the method accurately, this embodiment again uses human behavior videos from the Hollywood2 human behavior database as the surveillance video:

B1: Extracting three-dimensional spatio-temporal subframe cubes from surveillance video: using the same method as in step A1, three-dimensional spatio-temporal subframe cubes of the same size and number as in step A1 are extracted from each monitored human behavior video segment, and the cubes extracted from each video segment are then passed to step B2;

B2: Extracting human behavior features: the three-dimensional spatio-temporal subframe cubes of each video segment extracted in step B1 are input into the neural network trained in step A5, so as to extract the human sub-behavior features of each video segment;

B3: Determining the sub-behavior category of each video: the human sub-behavior features of each video segment extracted in step B2 are input into the Softmax classifier, and each surveillance video segment is classified in turn, yielding videos that carry subclass labels;

B4: Merging videos that carry subclass labels: for the videos with subclass labels obtained in step B3, videos labeled 1.1 and 1.2 are merged into label 1; videos labeled 2.1, 2.2, and 2.3 into label 2; videos labeled 4.1 and 4.2 into label 4; videos labeled 5.1 and 5.2 into label 5; videos labeled 7.1 and 7.2 into label 7; videos labeled 8.1, 8.2, 8.3, and 8.4 into label 8; videos labeled 9.1, 9.2, 9.3, and 9.4 into label 9; videos labeled 10.1, 10.2, and 10.3 into label 10; and videos labeled 12.1, 12.2, 12.3, and 12.4 into label 12. The behaviors represented by labels 3, 6, and 11 have few videos and were not subdivided in this embodiment, so they keep their labels. The videos are thus merged according to the major classes defined by the Hollywood2 human behavior database, giving the behavior category to which the human behavior in each video belongs; the result is stored for later retrieval.
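The merge in step B4 can be illustrated with a short sketch that maps each predicted sub-label back to its parent class and then computes accuracy against the original class labels. The function names and the toy predictions are hypothetical, and sub-labels are assumed to be strings of the form "8.3".

```python
def merge_label(sub_label):
    """Map a sub-label such as '8.3' back to its parent class label 8."""
    return int(str(sub_label).split(".")[0])

def accuracy(predicted_sub_labels, true_class_labels):
    """Fraction of videos whose merged label matches the true class label."""
    merged = [merge_label(s) for s in predicted_sub_labels]
    correct = sum(m == t for m, t in zip(merged, true_class_labels))
    return correct / len(true_class_labels)

# Made-up predictions for five videos; the last one is assigned to the wrong class.
predicted = ["1.2", "8.3", "3", "12.4", "9.1"]
truth = [1, 8, 3, 12, 10]
acc = accuracy(predicted, truth)   # 4 of 5 correct, i.e. 0.8
```

Because accuracy is computed after merging, a video assigned to the wrong sub-class of the right parent class still counts as correctly recognized.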

The embodiment tallies the recognition results for the final 12 behavior classes. The overall recognition accuracy of the ISA neural network on the Hollywood2 human behavior database reaches 80.8%, a substantial improvement over the 53.3% obtained using the ISA neural network alone; the highest known recognition accuracy on the Hollywood2 human behavior database is currently 64.3%, and the embodiment is 16.5 percentage points higher.

The recognition rates of the embodiment for each behavior class in the Hollywood2 human behavior database are given in the following table:

Claims

1. A video human behavior recognition method based on sparse subspace clustering, comprising:

A. Establishing the video human behavior recognition model:

A3. Clustering: for the human behavior feature space built in step A2, the sparse subspace clustering method is applied separately to the features of each class of human behavior in the feature space, so that each class of human behavior features is further subdivided into subclass behavior features; the number of behavior feature subclasses is determined automatically by the sparse subspace clustering method;

B. Recognizing human behavior:

2. The video human behavior recognition method based on sparse subspace clustering according to claim 1, characterized in that the human behavior database used for learning in steps A1 and B4 is the Hollywood2, KTH, HMDB51, UCF101, or Sports-1M human behavior database.

3. The video human behavior recognition method based on sparse subspace clustering according to claim 1, characterized in that the deep-learning neural network described in step A2 is an independent subspace analysis neural network.

4. The video human behavior recognition method based on sparse subspace clustering according to claim 1, characterized in that the sparse subspace clustering method used in step A3 comprises the following steps:

A3-1. The features of each human behavior video in the behavior feature space obtained in step A2 are arranged in column-major order to generate a dictionary, and a sparse coding method is then used to determine the sparse coefficients;

A3-2. The sparse coefficients are normalized;

A3-4. Clustering and subdivision: the sparse subspace clustering method is used to cluster the feature graph of each class of human behavior obtained in step A3-3 and subdivide it into the behavior feature subclasses.

5. The video human behavior recognition method based on sparse subspace clustering according to claim 1, characterized in that the classifier described in steps A5 and B3 is a Softmax classifier.