CN107862275A

CN107862275A - Human bodys' response model and its construction method and Human bodys' response method

Info

Publication number: CN107862275A
Application number: CN201711054505.6A
Authority: CN
Inventors: 郝宗波; 林佳月; 王莹; 杨泉; 张舒雨; 王伟国; 孔佑真
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-11-01
Filing date: 2017-11-01
Publication date: 2018-03-30

Abstract

The invention discloses a kind of Human bodys' response model and its construction method and Human bodys' response method, model building method includes pre-processing human body behavior video in Sample Storehouse；Using the characteristic vector of 3D convolutional neural networks extraction human body behavior video；The characteristic vector input coulomb field of force is clustered, penalty values of the characteristic vector in coulomb field of force initial position and final position are calculated using loss function；When penalty values >=given threshold, the error representated by penalty values is subjected to backpropagation, and adjusts classifier parameters, until penalty values ＜ given thresholds；The characteristic vector of extraction is inputted into grader, the error back propagation using the difference of the classification results of grader output and the class label of video sample as grader, and the parameter of grader is adjusted, until error<Given threshold；When penalty values ＜ given thresholds, the current Optimal Parameters of record sort device and corresponding human body behavior video form Human bodys' response model.

Description

Human bodys' response model and its construction method and Human bodys' response method

Technical field

The present invention relates to physics, machine learning and deep learning field, more particularly to a kind of Human bodys' response model And its construction method and Human bodys' response method.

Background technology

Convolutional neural networks (Convolutional Neural Network, CNN) are a kind of feedforward neural networks, it Artificial neuron can respond the unit around in a part of coverage, have outstanding performance for large-scale image procossing.

Due to behavior be one with time and space all relevant physical activity, and traditional 2D convolutional neural networks are right The feature in space is more sensitive, and can not handle the time response in video, is used as the identification of the behavior of timing variations, Target can not be met；

When carrying out feature extraction using convolutional neural networks, with the intensification of its depth, extraction to feature from specific to It is abstract, from simple to complexity, under conditions of sample is few, network depth too depth or noise are too big, very easily produce over-fitting Result；So-called over-fitting is exactly the grader that network training goes out, and only pair input similar to training sample is sensitive, and is that its is defeated Enter similar other unseen test samples, its ability in feature extraction and classification capacity will become very low；In order to prevent Fitting, convolutional neural networks are also introduced into dropout to be improved, but this method will consume more computing resources；

For in most Activity recognition tasks, the cross-distribution of feature is to cause the FAQs of over-fitting, paper " Action Recognition based on Subdivision-Fusion Model " propose improved subdivision fusion mould Type (SFM), for the model in the subdivision stage, the sample characteristics of most classifications are all similar and cross-distributions, and SFM is by such sample One's duty composition is multiple more to be easier to find for the subclass of ga s safety degree, the border of subclass, so as to avoid overfitting. Subsequent fusing stage, more subclass classification results are converted back to original classification identification problem.Subdivision Fusion Model provides The two methods of cluster centre number are determined, but it still suffers from problems with：

The wherein two methods of its determination cluster centre number：One kind is after directly observing high dimensional feature by t-SNE dimensionality reductions Two-dimensional visualization figure, artificial selection should be divided into how many class, another kind be quantity based on each classification sample with it is minimum The ratio of class sample size determines clusters number.Two methods are required for manually observing, it is impossible to are clustered automatically.And its property The personal choice of researcher couple can be depended on., meanwhile, this process must be interrupted by the participation of people.

These problems will all have a strong impact on the stability and automaticity of Activity recognition performance and recognizer.

The content of the invention

For above-mentioned deficiency of the prior art, Human bodys' response model provided by the invention and its construction method and people Body Activity recognition method solves the problems, such as over-fitting easily occur during Human bodys' response and automation is poor.

In order to reach foregoing invention purpose, the technical solution adopted by the present invention is：

First aspect, there is provided a kind of construction method of Human bodys' response model, it includes：

Obtaining includes the Sample Storehouse of some human body behavior videos, and all human body behavior videos in Sample Storehouse is carried out pre- Processing；

The characteristic vector of pretreated human body behavior video is extracted using 3D convolutional neural networks；

By the characteristic vector input coulomb field of force of extraction, all characteristic vectors produce in mutually similar generation gravitation, not species Relative movement is clustered in the presence of repulsion；

When calculating particle current location and the similarity function minimum representated by characteristic vector using loss function feature to Error between the target location of amount；

When the error is more than or equal to given threshold, the error is subjected to backpropagation, and adjust 3D convolutional Neurals Network parameter, until error is less than given threshold；

When the error is less than given threshold, the training of 3D convolutional neural networks is completed, and using grader to feature Vector is trained；

The difference between the classification output result of grader and the label of sample is calculated, is preset when the difference is more than or equal to During value, by the difference backpropagation, and the parameter of grader is updated；

When the difference is less than preset value, sub-line is class label after the current Optimal Parameters of record sort device, cluster And sub-line is human body behavior video corresponding to class label after cluster；

Human bodys' response mould is formed using the 3D convolutional neural networks after the grader and optimization for being provided with Optimal Parameters Type.

Further, the calculation formula of the loss function is：

Wherein,For the penalty values of i-th of sample；W, b be respectively 3D convolutional neural networks weights and Biasing；For the value of i-th of sample jth dimension；When reaching minimum value for similarity function value, i-th of sample jth dimension Desired value；For attenuation term.

Further, the calculation formula of the similarity function is：

Wherein, D (x_i,x_j) it is species x_iWith species x_jSimilarity；m_iFor species i mean vector；m_jFor the equal of species j Value vector；WithFor species i and species j variance within clusters.

Further, the 3D convolutional neural networks include seven layers, and first layer is input layer, and it has three passages, and three Individual passage receives the multiple image of the human body behavior at pretreated upper one second of human body behavior video current time respectively, light stream exists The component of component and light stream on y axis of orientations on x axis of orientations；

The second layer is Three dimensional convolution layer, the image inputted with the convolution kernel that quantity is n, yardstick is cw*ch*cl to first layer Convolution algorithm is carried out with light stream；Third layer is three-dimensional down-sampling layer, and the convolution kernel for being pw*ph*pl with yardstick is to the defeated of the second layer Go out to carry out maximum pond；

4th layer is Three dimensional convolution layer, and convolution fortune is carried out using output of the second layer identical computing mode to third layer Calculate；Layer 5 is NIN layers, is formed using the network by two layers of perceptron convolutional layer, and people is extracted for the output according to the 4th layer The nonlinear characteristic of body behavior；

Layer 6 is pyramid down-sampling layer, is made up of different size of three-dimensional down-sampling layer, for being exported to layer 5 Human body behavior nonlinear characteristic carry out down-sampling processing；Layer 7 is full articulamentum, is consolidated according to the output of layer 6 Determine the characteristic vector of dimension.

Further, the calculation formula that the second layer inputs to first layer image and light stream carry out convolution algorithm is：

Wherein, w is the weight of convolution kernel；U be the image intensity values of three passages of input layer, light stream horizontal component and hang down Straight component；v^xyzFor the output of Three dimensional convolution layer；P, Q is respectively total line number of the two-dimensional matrix of input layer output and total columns；R For the length of human body behavior video；P, q is respectively pth row and q row in the two-dimensional matrix that input layer exports；R is human body behavior The r frames of video；Cw is the width of convolution kernel, and ch is convolution kernel height, and cl is the length of convolution kernel on a timeline；

Further, adopted when carrying out maximum pond using output of the three-dimensional down-sampling layer to the second layer under three-dimensional overlapping maximum The calculation formula of sample is：

Wherein, x is the feature of second layer Three dimensional convolution extraction；Y is the output obtained after sampling, and s, t and r are respectively image Sampling step length in three width, height and length of the video time directions；M, n and l is the feature map of third layer pond layer Element index on x directions, y directions and time shaft；S₁、S₂And S₃It is total line number of second layer output matrix, total columns and total Frame number.

Further, the NIN layers extract the calculation formula of the nonlinear characteristic of human body behavior according to the 4th layer of output For：

Wherein, (i, j) is the feature map of current layer pixel index；x_{I, j}It is input block of the center at (i, j), k_n It is the feature map of current layer index；N is the MPL number of plies.

Further, the nonlinear characteristic for the human body behavior that the pyramid down-sampling layer exports to layer 5 adopt Sample processing further comprises：

The ratio per one-dimensional input and the length of side of output is calculated, is obtained per one-dimensional window size, downwards by rounding up The moving step length obtained per one-dimensional window is rounded, is used afterwards with third layer three-dimensional down-sampling layer identical calculation formula to human body The nonlinear characteristic of behavior carries out down-sampling processing.

Second aspect, there is provided a kind of Human bodys' response model, it is structure side's structure using Human bodys' response model The Human bodys' response model built.

The third aspect, there is provided a kind of method that video human behavior is identified using Human bodys' response model, its It is characterised by, including：

Video image to be identified is obtained, and it is pre-processed；

The characteristic vector of pretreated video image is extracted using 3D convolutional neural networks；

By in the grader of characteristic vector input Human bodys' response model, pass through grader and Human bodys' response Model carries out classification processing to video image, and it is the classification results of the video image of class label to obtain with sub-line after renewal；

Sub-line is the classification results of the video image of class label after obtained band is updated, and divides major class according to sample place Sort out and merge, obtain the affiliated behavior classification of human body behavior in video image.

Compared with Human bodys' response method of the prior art, beneficial effects of the present invention are：

This programme substitutes traditional 2D convolutional neural networks using the 3D convolutional neural networks of deep learning, carries out carrying for feature Take, with Human bodys' response in adaptive video fragment to time and the dual requirementses in space.

The Clustering Effect of subdivision Fusion Model, and the loss function of calculating network are replaced using the coulomb field of force, mould can be strengthened The robustness of type, retain convolutional neural networks feature end to end, reduce the probability of network over-fitting, improve and selected manually in SFM Species number purpose drawback is selected, strengthens the automation characteristic and fluency of program.

Using the coulomb field of force while cluster, the noise spot in training sample can be filtered by interaction force, is made Correct training sample influences to maximize on the parameter of network, and influence of the noise to network parameter minimizes, and network is reached considerable Ability in feature extraction and classification capacity.

After 3D convolutional neural networks are improved by this programme, Three dimensional convolution, three-dimensional down-sampling, NIN, three Vygens are added Word tower structure so that convolutional neural networks can have stronger ability in feature extraction to human body behavior；Assembled for training in specific video Practice, obtain the feature with more classification capacity, strengthen the robustness and accuracy of whole recognizer.

Wherein, 3D convolutional neural networks use the information in Three dimensional convolution extraction time domain, preferably can believe capture movement Breath；Amount of calculation is not only greatly reduced using three-dimensional down-sampling technology, and it is constant to introduce time of the algorithm in time-domain Property, improve the stability of identification and Geng Gao discrimination；The flexibility of system is improved using pyramid structure so that different The video segment of resolution ratio and duration can use the system with without any changes, improve the flexibility and application of system Scope.

Brief description of the drawings

Fig. 1 is the flow chart of construction method one embodiment of Human bodys' response model.

Fig. 2 is Activity recognition 3D convolutional neural networks configuration diagrams provided by the invention.

Fig. 3 is Three dimensional convolution provided by the invention and the comparison diagram of two-dimensional convolution；

Wherein, Fig. 3 a are the schematic diagram of Three dimensional convolution, and Fig. 3 b are the schematic diagram of two-dimensional convolution.

Fig. 4 is linear convolution provided by the invention and MLP convolution schematic diagrames.

Fig. 5 is pyramid structure schematic diagram provided by the invention.

Fig. 6 a~6f are coulomb field of force cluster process schematic diagram provided by the invention.

Fig. 7 a are that 3D convolutional neural networks combination coulombs field of force cluster can in the two dimension of Hollywood2 high-dimensional feature spaces Depending on changing result；Fig. 7 b are the two-dimensional visualization result of the Hollywood2 high-dimensional feature spaces of sparse subspace clustering.

Fig. 8 a~8f are between the 3D convolutional neural networks combination coulomb field of force of this programme and long recursive convolution network clustering Some classification visualizations compare.

Fig. 9 is the flow chart of Human bodys' response method one embodiment.

Embodiment

The embodiment of the present invention is described below, in order to which those skilled in the art understand this hair It is bright, it should be apparent that the invention is not restricted to the scope of embodiment, for those skilled in the art, As long as various change in the spirit and scope of the present invention that appended claim limits and determines, these changes are aobvious and easy See, all are using the innovation and creation of present inventive concept in the row of protection.

With reference to figure 1, Fig. 1 shows the flow chart of construction method one embodiment of Human bodys' response model；Such as Fig. 1 institutes Show, this method 100 includes step 101 to step 109.

In a step 101, obtaining includes the Sample Storehouse of some human body behavior videos, and to all human body rows in Sample Storehouse Pre-processed for video；Pretreatment mainly includes carrying out gray processing, scaling and extraction to the human body behavior video in sample storehouse Streamer.

Wherein, Sample Storehouse is Hollywood2 Human bodys' response databases, and it includes 12 anthropoid behaviors, and 3669 regard Frequency fragment, the duration of about 20.1 hours.

In a step 102, the characteristic vector of pretreated human body behavior video is extracted using 3D convolutional neural networks.Tool It is each layer parameter of random initializtion 3D convolutional neural networks body, and by the network to Hollywood2 Human bodys' responses Sample in data set carries out feature extraction, finally gives the feature of each sample, is entered into feature space.

In one embodiment of the invention, 3D convolutional neural networks include seven layers, specifically may be referred to Fig. 2, but Fig. 2 In the 8th layer reference be grader training；As shown in Fig. 2 first layer is input layer, it has three passages, three passages The multiple image of the human body behavior at pretreated upper one second of human body behavior video current time, light stream are received respectively in x directions The component of component and light stream on y axis of orientations on axle；This programme on the basis of original video input passage by adding two Individual light stream passage can largely strengthen the sensitiveness to behavior act, there is higher discrimination in Activity recognition.

The second layer is Three dimensional convolution layer, the image inputted with the convolution kernel that quantity is n, yardstick is cw*ch*cl to first layer Convolution algorithm is carried out with light stream；This programme adds time domain information using Three dimensional convolution computing, preferably can believe capture movement Breath.

That is, in the task of processing video, the convolutional layer of three-dimensional has been used from the information of multiple continuous frame-grab campaigns, The calculation formula of Three dimensional convolution computing is：

Wherein, w is the weight of convolution kernel；U be the image intensity values of three passages of input layer, light stream horizontal component and hang down Straight component；v^xyzFor the output of Three dimensional convolution layer；P, Q is respectively total line number of the two-dimensional matrix of input layer output and total columns；R For the length of human body behavior video；P, q is respectively pth row and q row in the two-dimensional matrix that input layer exports；R is human body behavior The r frames of video；Cw is the width of convolution kernel, and ch is convolution kernel height, and cl is the length of convolution kernel on a timeline.

Here Three dimensional convolution computing is considered as the cube that the accumulation of multiple frames is got up come convolution with the core of three-dimensional, such as Fig. 3 a show the schematic diagram of Three dimensional convolution, and reference axis indicates three dimensions：The width of image, height and video when Between, the cube of lower section represents the input of convolution, and the cube of top represents the output of convolution.Fig. 3 b are the signals of two-dimensional convolution Figure, its input and output are all the rectangles of two dimension, only include width, the elevation information of image, the information of no time-domain.Using Three dimensional convolution, the feature map of each convolutional layer are the multiple successive frames for being associated with preceding layer, thus capture motion Information.

Third layer is three-dimensional down-sampling layer, and maximum pond is carried out with output of the convolution kernel that yardstick is pw*ph*pl to the second layer Change；The three-dimensional down-sampling technology that this programme uses not only greatly reduces amount of calculation, and introduces algorithm in time-domain Time invariance, improve the stability of identification and Geng Gao discrimination.

Particular content is：As the convolution of three-dimensional, when convolutional neural networks handle video, down-sampling layer also needs It is extended to three-dimensional.Handle the down-sampling layer of the convolutional neural networks of picture, data volume can be caused to strongly reduce, accelerate behind Calculating, while also cause 2D convolutional neural networks there is certain consistency, consistency here is constant in spatial domain Property.

And when video is handled, certain consistency, and the data processing ratio of video are also required in time-domain The picture of single frames is much bigger, therefore it is necessary to which down-sampling is also extended to three-dimensional.Wherein, using three-dimensional down-sampling layer to The calculation formula of three-dimensional overlapping maximum down-sampling is when two layers of output carries out maximum pond：

After being handled using three-dimensional down-sampling layer, feature map data volumes are reduced at double, and amount of calculation can also greatly reduce, Meanwhile network is to the more robust of the change in time-domain.

4th layer is Three dimensional convolution layer, and convolution fortune is carried out using output of the second layer identical computing mode to third layer Calculate；Layer 5 is NIN layers (Network in Network), is formed using the network by two layers of perceptron convolutional layer, for root According to the nonlinear characteristic of the 4th layer of output extraction human body behavior；Use NIN that this method can be allowd to extract more complicated Human body behavior nonlinear characteristic.

The calculation formula of nonlinear characteristic that the NIN layers extract human body behavior according to the 4th layer of output is：

Layer 6 is pyramid down-sampling layer, is made up of different size of three-dimensional down-sampling layer, for being exported to layer 5 Human body behavior nonlinear characteristic carry out down-sampling processing；Obtain the output feature map of different resolution；This programme is adopted Three-dimensional this skill upgrading of pyramid down-sampling layer flexibility of system so that the piece of video of different resolution and duration Section can improve the flexibility of 3D convolutional neural networks with the 3D convolutional neural networks without any changes that can use this programme And application.

Particular content is：The three-dimensional pyramid down-sampling layer of the application is made up of the feature map of a variety of resolution ratio, Conventional down-sampling layer is all to have identical size with the feature map of identical sampling scale and input, so obtaining Feature map have identical dimension.And pyramid down-sampling layer is fixed dimension using a variety of sampling scales feature map。

The sample of Activity recognition is all some video segments, may there is different resolution ratio, it is also possible to has different videos Length.These othernesses so that traditional convolutional neural networks have no idea to handle, because traditional convolutional neural networks are each Feature map are fixed sizes.Cause traditional convolutional neural networks can not handle different resolution and length video Reason does not lie in convolutional layer and down-sampling layer, but at full articulamentum (layer 7 in Fig. 2), because the framework of full articulamentum is Fixed, have no idea to change, the size that result in the feature map of this layer input also must be fixed.And in convolution Layer, the feature map of input size do not interfere with the structure of network, and the feature map simply exported size can be with The change of the feature map sizes of this layer input and change, because convolution kernel is simply sliding on input feature map It is dynamic.In down-sampling layer, size simply is reduced feature map are inputted in a manner, does not also interfere with the knot of network Structure.From this, the processing of appropriate feature map sizes must be just carried out before full articulamentum so that various sizes of Input feature map obtain the output of identical size, and this processing can use different windows size and different step-lengths Overlapping down-sampling is realized.

As shown in figure 5, pyramid is made up of the feature map of some different resolutions, existing relatively large one The resolution ratio of point, also there is small resolution ratio, and also there is the resolution ratio of some transition sizes centre.(16*256-d, 4* in such as Fig. 4 256-d, 256-d) want to obtain the output feature map of different resolution, it is necessary to carry out more from the feature map of input The overlapping down-sampling of kind window size and step-length, each unit of these obtained feature map are spliced into a vector again, Such as Fig. 5 L-Fix layers, then, then full articulamentum is accessed.Example in Figure 5 is 3 grades of pyramids, fixed resolution ratio Respectively 4 × 4,2 × 2,1 × 1, and last layer has 256 feature map.

The output unit total number of pyramidal layer is all fixed, so it is complete to export fixed feature map connections Articulamentum, moreover, the introducing of pyramid model, can form the feature map of a variety of resolution ratio, it also avoid different scale Influenceed caused by video input.

In this programme, that use is all three-dimensional feature map, and three-dimensional overlapping maximum down-sampling uses and third layer Three-dimensional down-sampling layer identical calculation formula carries out down-sampling processing to the nonlinear characteristic of human body behavior.

Window size is obtained by calculating input and the ratio E of the length of side of output, then by rounding up, rounds obtain downwards Obtain step-length.Assuming that input feature map size for a × a × t, it is necessary to be down sampled to size n × n × τ, then in the time Window size is used on axle now：

The time span of slip is：

Wherein, in formulaOperated to round up,For downward floor operation.

Layer 7 is full articulamentum, is fixed the characteristic vector of dimension according to the output of layer 6, there is provided to grader The characteristic of division of (softmax graders) as identification human body behavior.

In step 103, the characteristic vector of extraction is inputted into the coulomb field of force, all characteristic vectors are drawn in mutually similar generation Power, do not relatively move and clustered in the presence of species generation repulsion.

Step 103 concrete implementation mode is：The characteristic vector that extraction obtains is input in coulomb field of force, regards coulomb as An electric charge in the field of force, the intermolecular forces of the electric charge of identical type are gravitation, can be attracted each other, different types of feature to The intermolecular forces of amount are repulsion, can be mutually exclusive, and under force, each characteristic vector is relatively moved, and judges phase Whether reach minimum like function, to determine the stable state S ultimately formed_E, step 104 is carried out if reaching, is otherwise continued It is mobile；Detailed process is as follows：

After the characteristic vector extracted via the 3D convolutional neural networks of random initializtion enters the coulomb field of force for the first time, random point Cloth is in the coulomb field of force, as shown in Figure 6 a, because 3D convolutional neural networks, now without training, the feature of extraction does not possess can Distinction, so its position of each sample is random distribution；In this schematic diagram, there is the sample of three types, respectively with not similar shape Shape represents.

Each characteristic vector is regarded as the electric charge in the coulomb field of force, producing gravitation according to class label, between mutually similar makes them Apart from upper close to each other, repulsion is produced between inhomogeneity, making them, effect schematic diagram is as schemed apart from upper mutually exclusive remote Shown in 6b；

As shown in Fig. 6 c to Fig. 6 e, according to the effect of class Coulomb force via more wheel iteration, eventually arrive at such as Fig. 6 f stabilization State, that is, the result clustered.

In one embodiment of the invention, the calculation formula of the similarity function is：

Wherein, D (x_i,x_j) for the similarity of two species；m_iFor species i mean vector, namely species i central point,C_iIt is the set of all samples of species i, N_iIt is the total number of species i samples, x_nRefer to sample x characteristic vector； m_jFor the central point of species j mean vector, i.e. species j,C_jIt is the set of all samples of species j, N_jIt is The total number of species j samples；WithFor species i and species j variance within clusters,

At step 104, particle current location and the similarity function representated by characteristic vector are calculated using loss function Error when minimum between the target location of characteristic vector.

During implementation, the preferred grader of this programme selects softmax graders.Wherein, the calculation formula of loss function is：

In step 105, when the error is more than or equal to given threshold, the error is subjected to backpropagation, and adjust Whole 3D convolutional neural networks parameter, until error is less than given threshold.

In step 106, when the error is less than given threshold, the training of 3D convolutional neural networks is completed, and is used Grader is trained to characteristic vector.

In step 107, the difference between the classification output result of grader and the label of sample is calculated, when the difference During more than or equal to preset value, by the difference backpropagation, and the parameter of grader is updated.

In step 108, when the difference is less than preset value, son after the current Optimal Parameters of record sort device, cluster Sub-line is human body behavior video corresponding to class label after behavior class label and cluster.

In step 109, people is formed using the 3D convolutional neural networks after the grader and optimization for being provided with Optimal Parameters Body Activity recognition model.

As shown in figure 9, Human bodys' response method it include：Video image to be identified is obtained, and it is located in advance Reason；Wherein, pretreatment mainly includes carrying out the human body behavior video in sample storehouse gray processing, scaling and extraction streamer.

The characteristic vector of pretreated video image is extracted using 3D convolutional neural networks；3D convolutional Neurals net therein Network can use existing 3D convolutional neural networks, can also use 3D volumes built in Human bodys' response model building method Product neutral net.

The Human bodys' response model building method provided below using this programme is respectively to the data set of mesoscale (Hollywood2 and UCF YouTube action datasets) and the data set (UCF101) of large scale carry out human body behavior Identification model is built, and carries out the identification of human body behavior using the Human bodys' response model of structure afterwards.

Hollywood2 data sets：This programme uses the Hollywood2 data sets of standard, and Hollywood2 includes 12 classes 3669 video segments of behavior.Clustered using the coulomb field of force, 12 different behaviors divide into 25 subclasses by this programme.Subdivision Details it is as shown in table 1：

The behavior classification and cluster result of the Hollywood2 data sets of table 1

Because camera position changes greatly, the diversity of the appearance of identification object and posture, and visual angle and background are answered Polygamy, illumination condition etc. influence so that become very challenging to the Activity recognition of Hollywood2 behavioral data collection.Its In, Fig. 7 (a) is the two-dimensional visualization method of the Hollywood2 high-dimensional feature spaces of this programme；Fig. 7 (b) is sparse subspace The two-dimensional visualization of the Hollywood2 high-dimensional feature spaces of cluster.As can be seen that this programme Clustering Effect be substantially better than it is sparse Subspace clustering.By cluster result, as shown in Figure 7a, coulomb field of force Clustering Model is used to filter noise sample first, and Overfitting is avoided, experimental result improves accuracy.

YouTube Action dataset：YouTube action datas collection includes 11 action classifications.The detailed letter of subdivision Breath is as shown in table 2, and recognition accuracy has reached 87.2%.

The YouTube Action of subdivision and the discarding of the behavior class of table 2 sample

Show that the accuracy of this programme reaches by the experimental result and state-of-the-art report result of this programme 81.3%, with other methods (compared with long recursive convolution network and sparse subspace clustering), the method for this programme exists Hollywood2 database achieves more competitive result.The property of identification of this programme on Hollywood2 data sets It can report as shown in table 3：

Comparison of the present networks of table 3 on Hollywood2 databases with other advanced methods

According to table 3, using the method for this programme, performance has substantially improved to be carried from the 82.5% of sparse subspace clustering Rise to 87.2%.

Experiment on UCF101 data sets

UCF101dataset：UCF101 is behavior video identification data set in the actual life of one, is moved comprising 101 classes Make, everyone does 4-7 groups, shares 13320 videos, a length of 7.2 seconds during average each video.

Clustered using the coulomb field of force and 101 class behaviors of UFC101 data sets have been divided into 235 subclasses, the method for this programme Some category distributions visualization compared with the method for long recursive convolution network, as shown in figure 8, wherein Fig. 8 (a), Fig. 8 (c) It is the UCF101 high-dimensional feature space two-dimensional visualizations of long recursive convolution network method with Fig. 8 (e)；Fig. 8 (b), Fig. 8 (d) and Fig. 8 (f) be this programme UCF101 high-dimensional feature space two-dimensional visualizations, in addition, the coordinate range of this programme is much smaller than long-term recurrence The coordinate range of convolutional network method, the aspect ratio length that this programme is can be seen that by the sample in the range of smaller coordinate are passed The feature of convolutional network method is returned more to concentrate.

Experimental result：The method of this programme is as shown in table 4 in the performance report of UCF101 data sets：

Comparison of the present networks of table 4 on UCF101 databases with other advanced methods

Compared with the experimental result of this programme is aobvious with the report result of other advanced methods, the method for indicating this programme reaches Most advanced horizontal recognition performance is arrived.

In summary, this programme is when building Human bodys' response model, using the 3D convolutional Neural nets in deep learning Network substitutes traditional 2D convolutional neural networks with NIN (Network in Network) network integration, carries out the extraction of feature, with suitable Human bodys' response is answered in video segment to time and the dual requirementses in space,；Novel coulomb field of force cluster mode improves Segment the calculating of loss and the classification of feature that Fusion Model carries out network；In addition, coulomb field of force clustering method can filter out Noise sample, improve accuracy.

Claims

1. the construction method of Human bodys' response model, it is characterised in that including：

Obtaining includes the Sample Storehouse of some human body behavior videos, and all human body behavior videos in Sample Storehouse are located in advance Reason；

By the characteristic vector input coulomb field of force of extraction, all characteristic vectors produce repulsion in mutually similar generation gravitation, not species In the presence of relatively move and clustered；

Particle current location representated by characteristic vector and characteristic vector during similarity function minimum are calculated using loss function Error between target location；

When the error is more than or equal to given threshold, the error is subjected to backpropagation, and adjust 3D convolutional neural networks Parameter, until error is less than given threshold；

When the error is less than given threshold, the training of 3D convolutional neural networks is completed, and using grader to characteristic vector It is trained；

The difference between the classification output result of grader and the label of sample is calculated, when the difference is more than or equal to preset value When, by the difference backpropagation, and update the parameter of grader；

When the difference is less than preset value, sub-line is class label and gathered after the current Optimal Parameters of record sort device, cluster Sub-line is human body behavior video corresponding to class label after class；

Human bodys' response model is formed using the 3D convolutional neural networks after the grader and optimization for being provided with Optimal Parameters.

2. the construction method of Human bodys' response model according to claim 1, it is characterised in that the loss function Calculation formula is：

Wherein,For the penalty values of i-th of sample；W, b are respectively weights and the biasing of 3D convolutional neural networks；For the value of i-th of sample jth dimension；When reaching minimum value for similarity function value, the mesh of i-th of sample jth dimension Scale value；For attenuation term.

3. the construction method of Human bodys' response model according to claim 1, it is characterised in that the similarity function Calculation formula be：

Wherein, D (x_i,x_j) it is species x_iWith species x_jSimilarity；m_iFor species i mean vector；m_jFor species j average to Amount；WithFor species i and species j variance within clusters.

4. according to the construction method of any described Human bodys' response models of claim 1-3, it is characterised in that described 3D volumes Product neutral net includes seven layers, and first layer is input layer, and it has three passages, and three passages receive pretreated people respectively Component and light stream of the multiple image, light stream of the human body behavior at upper one second of behavior video current time of body on x axis of orientations are in y side To the component on axle；

The second layer is Three dimensional convolution layer, the image and light inputted with the convolution kernel that quantity is n, yardstick is cw*ch*cl to first layer Stream carries out convolution algorithm；Third layer is three-dimensional down-sampling layer, is entered with output of the convolution kernel that yardstick is pw*ph*pl to the second layer The maximum pond of row；

4th layer is Three dimensional convolution layer, and convolution algorithm is carried out using output of the second layer identical computing mode to third layer；The Five layers are NIN layers, are formed using the network by two layers of perceptron convolutional layer, and human body behavior is extracted for the output according to the 4th layer Nonlinear characteristic；

Layer 6 is pyramid down-sampling layer, is made up of different size of three-dimensional down-sampling layer, for the people exported to layer 5 The nonlinear characteristic of body behavior carries out down-sampling processing；Layer 7 is full articulamentum, and dimension is fixed according to the output of layer 6 The characteristic vector of degree.

5. the construction method of Human bodys' response model according to claim 4, it is characterised in that the second layer is to One layer input image and light stream carry out convolution algorithm calculation formula be：

Wherein, w is the weight of convolution kernel；U is the image intensity values of three passages of input layer, the horizontal component of light stream and vertical point Amount；v^xyzFor the output of Three dimensional convolution layer；P, Q is respectively total line number of the two-dimensional matrix of input layer output and total columns；R is people The length of body behavior video；P, q is respectively pth row and q row in the two-dimensional matrix that input layer exports；R is human body behavior video R frames；Cw is the width of convolution kernel, and ch is convolution kernel height, and cl is the length of convolution kernel on a timeline.

6. the construction method of Human bodys' response model according to claim 4, it is characterised in that use three-dimensional down-sampling The calculation formula of three-dimensional overlapping maximum down-sampling is when output of the layer to the second layer carries out maximum pond：

Wherein, x is the feature of second layer Three dimensional convolution extraction；Y is the output obtained after sampling, and s, t and r are respectively image in width Degree, the sampling step length in three directions of height and length of the video time；M, n and l is the feature map of third layer pond layer in x Element index on direction, y directions and time shaft；S₁、S₂And S₃It is total line number, total columns and the total frame of second layer output matrix Number.

7. the construction method of Human bodys' response model according to claim 4, it is characterised in that the NIN layers according to The calculation formula of nonlinear characteristic of 4th layer of output extraction human body behavior is：

Wherein, (i, j) is the feature map of current layer pixel index；x_{I, j}It is input block of the center at (i, j), k_nIt is to work as The feature map of front layer index；N is the MPL number of plies.

8. the construction method of Human bodys' response model according to claim 4, it is characterised in that adopted under the pyramid The nonlinear characteristic for the human body behavior that sample layer exports to layer 5 carries out down-sampling processing and further comprised：

The ratio per one-dimensional input and the length of side of output is calculated, is obtained per one-dimensional window size by rounding up, rounded downwards The moving step length per one-dimensional window is obtained, is used afterwards with third layer three-dimensional down-sampling layer identical calculation formula to human body behavior Nonlinear characteristic carry out down-sampling processing.

9. a kind of Human bodys' response model, it is characterised in that including any described method structures of use claim 1-8 Into Human bodys' response model.

10. a kind of method that video human behavior is identified Human bodys' response model using described in claim 9, its It is characterised by, including：

Video image to be identified is obtained, and it is pre-processed；

By in the grader of characteristic vector input Human bodys' response model, pass through grader and Human bodys' response model Classification processing is carried out to video image, it is the classification results of the video image of class label to obtain with sub-line after renewal；

Sub-line is the classification results of the video image of class label after obtained band is updated, and is sorted out according to sample place point major class Merge, obtain the affiliated behavior classification of human body behavior in video image.