CN107862275A - Human behavior recognition model, construction method therefor, and human behavior recognition method - Google Patents
- Publication number
- CN107862275A CN107862275A CN201711054505.6A CN201711054505A CN107862275A CN 107862275 A CN107862275 A CN 107862275A CN 201711054505 A CN201711054505 A CN 201711054505A CN 107862275 A CN107862275 A CN 107862275A
- Authority
- CN
- China
- Prior art keywords
- layer
- dimensional
- human
- human body
- body behavior
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a human behavior recognition model, a construction method therefor, and a human behavior recognition method. The model construction method includes: preprocessing the human behavior videos in a sample library; extracting feature vectors of the human behavior videos with a 3D convolutional neural network; feeding the feature vectors into a Coulomb force field for clustering, and using a loss function to compute the loss between each feature vector's initial position and its final position in the Coulomb force field; when the loss ≥ a set threshold, back-propagating the error represented by the loss and adjusting the network parameters until the loss < the set threshold; feeding the extracted feature vectors into a classifier, back-propagating the difference between the classifier's output and the class labels of the video samples as the classifier's error, and adjusting the classifier's parameters until the error < a set threshold; and, when the loss < the set threshold, recording the classifier's current optimal parameters and the corresponding human behavior videos to form the human behavior recognition model.
Description
Technical field
The present invention relates to the fields of physics, machine learning, and deep learning, and in particular to a human behavior recognition model, a construction method therefor, and a human behavior recognition method.
Background art
A convolutional neural network (CNN) is a feed-forward neural network whose artificial neurons respond to the units within a limited receptive field; it performs outstandingly on large-scale image processing.
Because behavior is a physical activity that varies in both time and space, traditional 2D convolutional neural networks, which are sensitive mainly to spatial features and cannot handle the temporal characteristics of video, fall short of the goal of recognizing time-varying behavior.
When a convolutional neural network is used for feature extraction, the extracted features move from specific to abstract and from simple to complex as the network deepens. With few samples, an overly deep network, or too much noise, over-fitting occurs very easily. Over-fitting means that the trained classifier is sensitive only to inputs similar to the training samples; for other, unseen test samples, its feature-extraction and classification abilities become very poor. To prevent over-fitting, dropout has been introduced into convolutional neural networks, but this method consumes more computing resources.
In most behavior recognition tasks, overlapping feature distributions are a common cause of over-fitting. The paper "Action Recognition based on Subdivision-Fusion Model" proposes an improved Subdivision-Fusion Model (SFM). In the subdivision stage, where the sample features of most classes are similar and their distributions overlap, SFM divides such samples into multiple sub-classes of higher distinguishability whose boundaries are easier to find, thereby avoiding over-fitting. In the subsequent fusion stage, the sub-class classification results are converted back to the original classification problem. The Subdivision-Fusion Model provides two methods for determining the number of cluster centers, but it still suffers from the following problems:
Of its two methods for determining the number of cluster centers, one is to observe directly a two-dimensional visualization of the high-dimensional features after t-SNE dimensionality reduction and to choose manually how many classes to divide into; the other determines the number of clusters from the ratio of each class's sample count to the smallest class's sample count. Both methods require manual observation and cannot cluster automatically, so their performance depends on the researcher's personal choices; meanwhile, the process must be interrupted by human participation.
These problems seriously affect the stability of behavior recognition performance and the degree of automation of the recognition algorithm.
Summary of the invention
In view of the above deficiencies in the prior art, the human behavior recognition model, its construction method, and the human behavior recognition method provided by the present invention solve the problems that over-fitting occurs easily during human behavior recognition and that automation is poor.
To achieve the above object of the invention, the technical solution adopted by the present invention is:
In a first aspect, a construction method of a human behavior recognition model is provided, comprising:
obtaining a sample library containing a number of human behavior videos, and preprocessing all the human behavior videos in the sample library;
extracting feature vectors of the preprocessed human behavior videos with a 3D convolutional neural network;
feeding the extracted feature vectors into a Coulomb force field, where all feature vectors of the same class generate attraction and those of different classes generate repulsion, moving relative to one another to form clusters;
using a loss function to compute the error between the current position of the particle represented by each feature vector and the feature vector's target position, i.e. the position at which the similarity function is minimal;
when the error is greater than or equal to a set threshold, back-propagating the error and adjusting the 3D convolutional neural network's parameters until the error is less than the set threshold;
when the error is less than the set threshold, completing the training of the 3D convolutional neural network and training the classifier on the feature vectors;
computing the difference between the classifier's output and the sample labels; when the difference is greater than or equal to a preset value, back-propagating the difference and updating the classifier's parameters;
when the difference is less than the preset value, recording the classifier's current optimal parameters, the post-clustering sub-class labels, and the human behavior videos corresponding to the post-clustering sub-class labels;
forming the human behavior recognition model from the classifier with the optimal parameters and the optimized 3D convolutional neural network.
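As an illustrative sketch only (not the patent's code), the "back-propagate until the error falls below the threshold" loop of the steps above can be reduced to a toy gradient descent, in which the network is collapsed into the feature positions themselves and the loss is the squared distance from each feature to its target position in the force field:

```python
def train_until_threshold(x, target, threshold=1e-4, lr=0.2, max_iter=1000):
    """Iteratively reduce the loss (squared distance from the current
    position x to the target position) by stepping against its gradient,
    mirroring the threshold-controlled loop described above."""
    for _ in range(max_iter):
        loss = sum((xi - ti) ** 2 for xi, ti in zip(x, target))
        if loss < threshold:
            break
        # gradient of the squared error w.r.t. x is 2*(x - t); step against it
        x = [xi - lr * 2 * (xi - ti) for xi, ti in zip(x, target)]
    return x, loss
```

In the real method the gradient would of course flow back through the 3D convolutional neural network's weights rather than move the features directly; the sketch only shows the stopping logic.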
Further, the loss function is computed as:

J(W, b; x^(i)) = (1/2) Σ_j (x_j^(i) − t_j^(i))^2 + (λ/2)‖W‖^2

where J(W, b; x^(i)) is the loss of the i-th sample; W and b are the weights and biases of the 3D convolutional neural network; x_j^(i) is the value of the j-th dimension of the i-th sample; t_j^(i) is the target value of the j-th dimension of the i-th sample when the similarity function reaches its minimum; and (λ/2)‖W‖^2 is the attenuation term.
Further, the similarity function is computed as:

D(x_i, x_j) = (σ_i^2 + σ_j^2) / ‖m_i − m_j‖^2

where D(x_i, x_j) is the similarity of class x_i and class x_j; m_i is the mean vector of class i; m_j is the mean vector of class j; and σ_i^2 and σ_j^2 are the within-class variances of class i and class j. The similarity is minimal when the clusters are compact and well separated.
Further, the 3D convolutional neural network comprises seven layers. The first layer is the input layer with three channels, which respectively receive the frames of human behavior from the second preceding the current moment of the preprocessed human behavior video, the component of the optical flow along the x axis, and the component of the optical flow along the y axis.
The second layer is a 3D convolution layer, which convolves the images and optical flow input by the first layer with n convolution kernels of size cw*ch*cl. The third layer is a 3D down-sampling layer, which max-pools the output of the second layer with a kernel of size pw*ph*pl.
The fourth layer is a 3D convolution layer, which convolves the output of the third layer in the same manner as the second layer. The fifth layer is an NIN layer, formed by a network of two multilayer-perceptron convolution layers, which extracts nonlinear features of the human behavior from the output of the fourth layer.
The sixth layer is a pyramid down-sampling layer composed of 3D down-sampling layers of different sizes, which down-samples the nonlinear human behavior features output by the fifth layer. The seventh layer is a fully connected layer, which produces a feature vector of fixed dimension from the output of the sixth layer.
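The shape flow through the seven layers above can be checked with simple bookkeeping; the kernel sizes and input dimensions below are illustrative assumptions, since the patent leaves n, cw*ch*cl and pw*ph*pl unspecified:

```python
# Shape bookkeeping for the seven-layer network described above.
def conv3d_shape(d, k):
    """Output shape of a valid 3D convolution of volume d with kernel k."""
    return tuple(di - ki + 1 for di, ki in zip(d, k))

def pool3d_shape(d, k):
    """Output shape of non-overlapping 3D max pooling with window k."""
    return tuple(di // ki for di, ki in zip(d, k))

inp = (60, 40, 33)                   # height, width, frames of one channel
c2 = conv3d_shape(inp, (7, 7, 3))    # layer 2: 3D convolution
p3 = pool3d_shape(c2, (2, 2, 2))     # layer 3: 3D down-sampling
c4 = conv3d_shape(p3, (5, 5, 3))     # layer 4: 3D convolution
# layers 5-7 (NIN, pyramid down-sampling, fully connected) then map c4
# to a fixed-dimension vector regardless of the input resolution.
```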
Further, the convolution performed by the second layer on the images and optical flow input by the first layer is computed as:

v^(xyz) = Σ_p Σ_q Σ_r w^(pqr) · u^((x+p)(y+q)(z+r)),  0 ≤ p < ch, 0 ≤ q < cw, 0 ≤ r < cl

where w is the weight of the convolution kernel; u is the image grayscale values of the three input-layer channels together with the horizontal and vertical components of the optical flow; v^(xyz) is the output of the 3D convolution layer; P and Q are the total numbers of rows and columns of the two-dimensional matrix output by the input layer; R is the length of the human behavior video; p and q index the row and column offsets within the kernel; r indexes the frame offset; and cw, ch and cl are the width, height and temporal length of the convolution kernel.
Further, the three-dimensional overlapping max down-sampling performed by the 3D down-sampling layer on the output of the second layer is computed as:

y_(mnl) = max over 0 ≤ i < pw, 0 ≤ j < ph, 0 ≤ k < pl of x_((m·s+i)(n·t+j)(l·r+k))

where x is the feature extracted by the second layer's 3D convolution; y is the output obtained after sampling; s, t and r are the sampling steps along the image width, image height and video time directions; m, n and l are the element indices of the third (pooling) layer's feature maps along the x direction, y direction and time axis; and S1, S2 and S3 are the total rows, total columns and total frames of the second layer's output matrix.
Further, the NIN layer extracts the nonlinear features of the human behavior from the output of the fourth layer as:

f^n_(i,j,k_n) = max((w^n_(k_n))^T · f^(n−1)_(i,j) + b_(k_n), 0)

where (i, j) is the pixel index in the current layer's feature map; x_(i,j) is the input patch centered at (i, j); k_n is the index of the current layer's feature map; and n is the number of MLP layers.
Further, the down-sampling that the pyramid down-sampling layer applies to the nonlinear human behavior features output by the fifth layer further comprises:
computing, for each dimension, the ratio of the input side length to the output side length; obtaining each dimension's window size by rounding the ratio up and each dimension's window step by rounding it down; and then down-sampling the nonlinear human behavior features with the same calculation formula as the third layer's 3D down-sampling layer.
In a second aspect, a human behavior recognition model is provided, which is a human behavior recognition model built with the construction method of the human behavior recognition model.
In a third aspect, a method for recognizing human behavior in video with the human behavior recognition model is provided, comprising:
obtaining a video image to be recognized and preprocessing it;
extracting the feature vector of the preprocessed video image with the 3D convolutional neural network;
feeding the feature vector into the classifier of the human behavior recognition model, classifying the video image with the classifier and the model, and obtaining a classification result of the video image carrying an updated sub-class label;
and merging the obtained sub-class classification results according to the major class each sample belongs to, obtaining the behavior category of the human behavior in the video image.
Compared with human behavior recognition methods in the prior art, the beneficial effects of the present invention are:
This scheme replaces traditional 2D convolutional neural networks with a deep-learning 3D convolutional neural network for feature extraction, to adapt to the dual temporal and spatial requirements of human behavior recognition in video clips.
Replacing the clustering of the Subdivision-Fusion Model with the Coulomb force field, and using it to compute the network's loss function, strengthens the model's robustness, preserves the end-to-end character of the convolutional neural network, reduces the probability of network over-fitting, remedies SFM's drawback of manually choosing the number of classes, and enhances the automation and fluency of the procedure.
While clustering, the Coulomb force field can also filter noise points in the training samples through the interaction forces, so that correct training samples have maximal influence on the network's parameters and noise has minimal influence, giving the network considerable feature-extraction and classification ability.
With the improvements of this scheme, the 3D convolutional neural network adds 3D convolution, 3D down-sampling, NIN, and a three-dimensional pyramid structure, so that it has stronger feature-extraction ability for human behavior; trained on a specific video set, it obtains features with more classification power, strengthening the robustness and accuracy of the whole recognition algorithm.
Specifically, the 3D convolutional neural network uses 3D convolution to extract information in the time domain and can better capture motion information; the 3D down-sampling technique not only greatly reduces the amount of computation but also introduces time invariance in the time domain, improving recognition stability and achieving a higher recognition rate; and the pyramid structure improves the flexibility of the system, so that video clips of different resolutions and durations can use the system without any changes, widening its flexibility and scope of application.
Brief description of the drawings
Fig. 1 is a flowchart of one embodiment of the construction method of the human behavior recognition model.
Fig. 2 is a schematic diagram of the behavior recognition 3D convolutional neural network architecture provided by the invention.
Fig. 3 is a comparison of the 3D convolution and the 2D convolution provided by the invention; Fig. 3a is a schematic diagram of 3D convolution, and Fig. 3b is a schematic diagram of 2D convolution.
Fig. 4 is a schematic diagram of linear convolution and MLP convolution provided by the invention.
Fig. 5 is a schematic diagram of the pyramid structure provided by the invention.
Figs. 6a-6f are schematic diagrams of the Coulomb force field clustering process provided by the invention.
Fig. 7a is the two-dimensional visualization of the Hollywood2 high-dimensional feature space clustered by the 3D convolutional neural network combined with the Coulomb force field; Fig. 7b is the two-dimensional visualization of the Hollywood2 high-dimensional feature space obtained by sparse subspace clustering.
Figs. 8a-8f compare visualizations of several classes clustered by this scheme's 3D convolutional neural network combined with the Coulomb force field and by the long-term recurrent convolutional network.
Fig. 9 is a flowchart of one embodiment of the human behavior recognition method.
Detailed description of the embodiments
Specific embodiments of the present invention are described below so that those skilled in the art can understand the invention; it should be clear, however, that the invention is not limited to the scope of these embodiments. For those of ordinary skill in the art, as long as the various changes fall within the spirit and scope of the invention as defined and determined by the appended claims, these changes are obvious, and all innovations that make use of the inventive concept are under protection.
Referring to Fig. 1, which shows a flowchart of one embodiment of the construction method of the human behavior recognition model: as shown in Fig. 1, the method 100 includes steps 101 to 109.
In step 101, a sample library containing a number of human behavior videos is obtained, and all the human behavior videos in the sample library are preprocessed. Preprocessing mainly includes grayscale conversion, scaling, and optical-flow extraction for the human behavior videos in the sample library.
Here, the sample library is the Hollywood2 human behavior recognition database, which contains 12 classes of human behavior and 3669 video clips, with a total duration of about 20.1 hours.
In step 102, feature vectors of the preprocessed human behavior videos are extracted with the 3D convolutional neural network. Specifically, each layer's parameters of the 3D convolutional neural network are randomly initialized, the network performs feature extraction on the samples in the Hollywood2 human behavior recognition dataset, and the resulting feature of each sample is entered into the feature space.
In one embodiment of the invention, the 3D convolutional neural network includes seven layers; see Fig. 2 (the eighth layer referenced in Fig. 2 is the classifier training). As shown in Fig. 2, the first layer is the input layer with three channels, which respectively receive the frames of human behavior from the second preceding the current moment of the preprocessed human behavior video, the component of the optical flow along the x axis, and the component of the optical flow along the y axis. By adding two optical-flow channels on top of the original video input channel, this scheme greatly strengthens sensitivity to behavioral motion and achieves a higher recognition rate in behavior recognition.
The second layer is a 3D convolution layer, which convolves the images and optical flow input by the first layer with n convolution kernels of size cw*ch*cl. By using 3D convolution, this scheme adds time-domain information and can better capture motion information.
That is, in video-processing tasks, a three-dimensional convolution layer is used to capture motion information from multiple consecutive frames. The 3D convolution is computed as:

v^(xyz) = Σ_p Σ_q Σ_r w^(pqr) · u^((x+p)(y+q)(z+r)),  0 ≤ p < ch, 0 ≤ q < cw, 0 ≤ r < cl

where w is the weight of the convolution kernel; u is the image grayscale values of the three input-layer channels together with the horizontal and vertical components of the optical flow; v^(xyz) is the output of the 3D convolution layer; P and Q are the total numbers of rows and columns of the two-dimensional matrix output by the input layer; R is the length of the human behavior video; p and q index the row and column offsets within the kernel; r indexes the frame offset; and cw, ch and cl are the width, height and temporal length of the convolution kernel.
Here the 3D convolution can be viewed as convolving a cube, stacked from multiple frames, with a three-dimensional kernel. Fig. 3a shows the schematic of 3D convolution: the coordinate axes indicate three dimensions, the width and height of the image and the time of the video; the lower cube represents the input of the convolution and the upper cube its output. Fig. 3b is the schematic of 2D convolution, whose input and output are both two-dimensional rectangles containing only the width and height information of the image, with no time-domain information. With 3D convolution, each convolution layer's feature maps are associated with multiple consecutive frames of the preceding layer, and thus capture motion information.
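The triple sum above can be sketched in NumPy as a valid 3D convolution; this is a minimal single-channel, single-kernel illustration, not the patent's implementation:

```python
import numpy as np

def conv3d(u, w):
    """Valid 3D convolution of input volume u (height, width, frames)
    with kernel w (ch, cw, cl): each output voxel is the triple sum
    of the kernel weights times the overlapped input window."""
    H, W, T = u.shape
    ch, cw, cl = w.shape
    out = np.zeros((H - ch + 1, W - cw + 1, T - cl + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                out[x, y, z] = np.sum(w * u[x:x+ch, y:y+cw, z:z+cl])
    return out
```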
The third layer is a 3D down-sampling layer, which max-pools the output of the second layer with a kernel of size pw*ph*pl. The 3D down-sampling technique used by this scheme not only greatly reduces the amount of computation but also introduces time invariance in the time domain, improving recognition stability and achieving a higher recognition rate.
Specifically: like the convolution, the down-sampling layer must also be extended to three dimensions when the convolutional neural network processes video. In picture-processing convolutional neural networks, the down-sampling layer drastically reduces the data volume and accelerates subsequent computation, while also giving the 2D convolutional neural network a certain invariance, namely invariance in the spatial domain.
When processing video, a certain invariance is also needed in the time domain, and video involves far more data than single-frame pictures, so it is necessary to extend the down-sampling to three dimensions as well. The three-dimensional overlapping max down-sampling performed by the 3D down-sampling layer on the output of the second layer is computed as:

y_(mnl) = max over 0 ≤ i < pw, 0 ≤ j < ph, 0 ≤ k < pl of x_((m·s+i)(n·t+j)(l·r+k))

where x is the feature extracted by the second layer's 3D convolution; y is the output obtained after sampling; s, t and r are the sampling steps along the image width, image height and video time directions; m, n and l are the element indices of the third (pooling) layer's feature maps along the x direction, y direction and time axis; and S1, S2 and S3 are the total rows, total columns and total frames of the second layer's output matrix.
After processing by the 3D down-sampling layer, the data volume of the feature maps is reduced severalfold and the amount of computation likewise drops greatly; at the same time, the network becomes more robust to changes in the time domain.
The fourth layer is a 3D convolution layer, which convolves the output of the third layer in the same manner as the second layer. The fifth layer is an NIN (Network in Network) layer, formed by a network of two multilayer-perceptron convolution layers, which extracts nonlinear features of the human behavior from the output of the fourth layer; using NIN allows this method to extract more complex nonlinear features of human behavior.
The NIN layer extracts the nonlinear features of the human behavior from the output of the fourth layer as:

f^n_(i,j,k_n) = max((w^n_(k_n))^T · f^(n−1)_(i,j) + b_(k_n), 0)

where (i, j) is the pixel index in the current layer's feature map; x_(i,j) is the input patch centered at (i, j); k_n is the index of the current layer's feature map; and n is the number of MLP layers.
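Under the common reading of NIN, the cascaded perceptron layers amount to 1×1(×1) convolutions with ReLU applied across the channel axis at every position; a minimal sketch under that assumption:

```python
import numpy as np

def mlpconv(f, weights, biases):
    """NIN 'micro network': apply successive perceptron layers with ReLU
    across the channel axis of feature maps f (channels, H, W); each
    output channel at (i, j) is max(w . f_{i,j} + b, 0)."""
    for W, b in zip(weights, biases):
        # tensordot contracts W (out_c, in_c) with f (in_c, H, W)
        f = np.maximum(np.tensordot(W, f, axes=([1], [0]))
                       + b[:, None, None], 0.0)
    return f
```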
The sixth layer is a pyramid down-sampling layer composed of 3D down-sampling layers of different sizes, which down-samples the nonlinear human behavior features output by the fifth layer to obtain output feature maps of different resolutions. The three-dimensional pyramid down-sampling technique of this scheme raises the flexibility of the system, so that video clips of different resolutions and durations can use this scheme's 3D convolutional neural network without any changes, improving the network's flexibility and scope of application.
Specifically: the three-dimensional pyramid down-sampling layer of this application is composed of feature maps of multiple resolutions. A conventional down-sampling layer uses a single sampling scale and input feature maps of identical size, so the resulting feature maps all have the same dimensions; the pyramid down-sampling layer instead uses multiple sampling scales to produce feature maps of fixed dimensions.
Behavior recognition samples are all video clips, which may have different resolutions and different lengths. Traditional convolutional neural networks cannot handle such differences, because each of their feature maps has a fixed size. The reason they cannot process videos of different resolution and length lies not in the convolution or down-sampling layers but in the fully connected layer (the seventh layer in Fig. 2): since the architecture of the fully connected layer is fixed and cannot change, the size of the feature map input to that layer must also be fixed. In a convolution layer, the size of the input feature map does not affect the network's structure; only the size of the output feature map changes with the size of the layer's input, because the convolution kernel simply slides over the input feature map. A down-sampling layer likewise just shrinks the input feature map in a fixed manner and does not affect the structure of the network. It follows that appropriate feature-map resizing must be performed before the fully connected layer, so that input feature maps of different sizes yield outputs of identical size; this resizing can be realized with overlapping down-sampling using different window sizes and different steps.
As shown in Fig. 5, the pyramid is composed of feature maps of several different resolutions: some relatively large, some small, and some of transitional sizes in between (e.g. 16*256-d, 4*256-d, 256-d in Fig. 4). To obtain output feature maps of different resolutions, overlapping down-sampling with multiple window sizes and steps must be performed on the input feature map; the units of the resulting feature maps are then spliced into one vector, as in the L-Fix layer of Fig. 5, which is then fed into the fully connected layer. The example in Fig. 5 is a 3-level pyramid with fixed resolutions of 4 × 4, 2 × 2 and 1 × 1, and the preceding layer has 256 feature maps.
The total number of output units of the pyramid layer is fixed, so it can output fixed feature maps connected to the fully connected layer. Moreover, the introduction of the pyramid model forms feature maps of multiple resolutions and avoids the influence of video inputs of different scales.
In this scheme, all feature maps used are three-dimensional, and the three-dimensional overlapping max down-sampling processes the nonlinear human behavior features with the same calculation formula as the third layer's 3D down-sampling layer.
The window size is obtained by computing the ratio of the input side length to the output side length and rounding it up; the step is obtained by rounding the ratio down. Suppose the input feature map has size a × a × t and must be down-sampled to size n × n × τ; then on the time axis the window size is

win = ⌈t / τ⌉

and the sliding step is

str = ⌊t / τ⌋

where ⌈·⌉ denotes rounding up and ⌊·⌋ denotes rounding down; the spatial axes use a and n analogously.
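The per-axis window and step computation above can be sketched as follows (the sizes in the usage example are illustrative only):

```python
import math

def pyramid_params(in_size, out_size):
    """Per-axis (window, step) pairs for down-sampling an a x a x t
    feature map to n x n x tau: window = ceil(in/out), step = floor(in/out),
    as in the formulas above."""
    return [(math.ceil(a / n), math.floor(a / n))
            for a, n in zip(in_size, out_size)]
```

For example, down-sampling a 13 × 13 × 9 feature map to 4 × 4 × 2 uses windows of 4, 4 and 5 with steps of 3, 3 and 4.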
The seventh layer is a fully connected layer, which produces a feature vector of fixed dimension from the output of the sixth layer and provides it to the classifier (a softmax classifier) as the classification feature for recognizing human behavior.
In step 103, the characteristic vector of extraction is inputted into the coulomb field of force, all characteristic vectors are drawn in mutually similar generation
Power, do not relatively move and clustered in the presence of species generation repulsion.
Step 103 concrete implementation mode is:The characteristic vector that extraction obtains is input in coulomb field of force, regards coulomb as
An electric charge in the field of force, the intermolecular forces of the electric charge of identical type are gravitation, can be attracted each other, different types of feature to
The intermolecular forces of amount are repulsion, can be mutually exclusive, and under force, each characteristic vector is relatively moved, and judges phase
Whether reach minimum like function, to determine the stable state S ultimately formedE, step 104 is carried out if reaching, is otherwise continued
It is mobile;Detailed process is as follows:
After the characteristic vector extracted via the 3D convolutional neural networks of random initializtion enters the coulomb field of force for the first time, random point
Cloth is in the coulomb field of force, as shown in Figure 6 a, because 3D convolutional neural networks, now without training, the feature of extraction does not possess can
Distinction, so its position of each sample is random distribution;In this schematic diagram, there is the sample of three types, respectively with not similar shape
Shape represents.
Each characteristic vector is regarded as the electric charge in the coulomb field of force, producing gravitation according to class label, between mutually similar makes them
Apart from upper close to each other, repulsion is produced between inhomogeneity, making them, effect schematic diagram is as schemed apart from upper mutually exclusive remote
Shown in 6b;
As shown in Fig. 6c to Fig. 6e, under the action of the Coulomb-like force and over multiple iterations of movement, the system eventually reaches the stable state of Fig. 6f, which is the clustering result.
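The attract/repel iteration described above can be sketched as a toy simulation in plain Python. The softened inverse-square force, the step size, and the iteration count below are illustrative choices, not values taken from the patent:

```python
import math

def coulomb_cluster(points, labels, steps=200, step_size=0.005, soften=1.0):
    """Toy Coulomb-field clustering: same-class pairs attract,
    different-class pairs repel, moving the points each iteration."""
    pts = [list(p) for p in points]
    n, dim = len(pts), len(pts[0])
    for _ in range(steps):
        forces = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = [pts[j][d] - pts[i][d] for d in range(dim)]
                dist = math.sqrt(sum(c * c for c in diff)) + soften
                # Attractive toward same-class charges, repulsive otherwise,
                # with a softened Coulomb-like (inverse-square) magnitude.
                sign = 1.0 if labels[i] == labels[j] else -1.0
                mag = sign / (dist * dist)
                for d in range(dim):
                    forces[i][d] += mag * diff[d] / dist
        for i in range(n):
            for d in range(dim):
                pts[i][d] += step_size * forces[i][d]
    return pts
```

After enough iterations the same-class samples contract toward each other while the classes drift apart, mirroring the progression from Fig. 6a to Fig. 6f.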
In one embodiment of the invention, the calculation formula of the similarity function is:
wherein, D(x_i, x_j) is the similarity between class i and class j; m_i is the mean vector of class i, i.e. the center point of class i, m_i = (1/N_i) Σ_{x_n ∈ C_i} x_n, where C_i is the set of all samples of class i, N_i is the total number of samples of class i, and x_n is the feature vector of sample n; m_j is the mean vector of class j, i.e. the center point of class j, defined analogously over C_j and N_j; σ_i² and σ_j² are the within-class variances of class i and class j.
In step 104, a loss function is used to calculate the error between the current particle position represented by each feature vector and the target position of the feature vector when the similarity function reaches its minimum.
In implementation, the preferred classifier of this scheme is a softmax classifier. The calculation formula of the loss function is:
wherein, L^(i) is the loss value of the i-th sample; W and b are respectively the weights and biases of the 3D convolutional neural network; ŷ_j^(i) is the value of the j-th dimension of the i-th sample; y_j^(i) is the target value of the j-th dimension of the i-th sample when the similarity function reaches its minimum; and the final term is a weight-decay (attenuation) term.
In step 105, when the error is greater than or equal to the set threshold, the error is back-propagated and the parameters of the 3D convolutional neural network are adjusted, until the error is less than the set threshold.
In step 106, when the error is less than the set threshold, the training of the 3D convolutional neural network is complete, and a classifier is trained on the feature vectors.
In step 107, the difference between the classification output of the classifier and the label of each sample is calculated; when the difference is greater than or equal to a preset value, the difference is back-propagated and the parameters of the classifier are updated.
In step 108, when the difference is less than the preset value, the current optimal parameters of the classifier, the post-clustering sub-behavior class labels, and the human behavior videos corresponding to each post-clustering class label are recorded.
In step 109, the human behavior recognition model is formed from the classifier configured with the optimal parameters and the optimized 3D convolutional neural network.
As shown in Fig. 9, the human behavior recognition method includes: obtaining the video image to be recognized and preprocessing it, where the preprocessing mainly consists of converting the video to grayscale, scaling it, and extracting optical flow.
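The grayscale and scaling steps can be sketched in plain Python. The luma weights and nearest-neighbour scheme below are common choices rather than values specified by the patent, and dense optical-flow extraction (e.g. by Farnebäck's method) is omitted:

```python
def to_gray(frame):
    """Luma conversion of an RGB frame given as nested lists of [r, g, b]
    pixels; 0.299/0.587/0.114 are the standard ITU-R BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in frame]

def scale_nn(img, out_h, out_w):
    """Nearest-neighbour rescale of a 2-D grayscale image to out_h x out_w."""
    in_h, in_w = len(img), len(img[0])
    return [[img[int(y * in_h / out_h)][int(x * in_w / out_w)]
             for x in range(out_w)]
            for y in range(out_h)]
```

Each preprocessed frame would then be stacked with the two optical-flow component channels before entering the 3D convolutional neural network.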
The feature vector of the preprocessed video image is then extracted using a 3D convolutional neural network; this can be an existing 3D convolutional neural network, or the 3D convolutional neural network built by the human behavior recognition model construction method.
The feature vector is input into the classifier of the human behavior recognition model, which, together with the model, classifies the video image to obtain a classification result carrying an updated sub-behavior class label; the obtained classification results with updated sub-behavior class labels are then grouped and merged according to the parent class each sample belongs to, yielding the behavior class of the human behavior in the video image.
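The final grouping step amounts to mapping each predicted sub-behavior label back to the parent behavior class it was split from during Coulomb-field clustering. The label values below are hypothetical examples:

```python
def merge_subclasses(pred_sub_labels, sub_to_parent):
    """Map predicted sub-behavior class labels back to the parent
    behavior classes they were split from during clustering."""
    return [sub_to_parent[s] for s in pred_sub_labels]
```

For instance, with `sub_to_parent = {"run_1": "run", "run_2": "run", "wave_1": "wave"}` (hypothetical labels), predictions `["run_2", "wave_1"]` merge to `["run", "wave"]`.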
The human behavior recognition model construction method provided by this scheme is applied below to medium-scale datasets (Hollywood2 and the UCF YouTube Action dataset) and a large-scale dataset (UCF101) to build human behavior recognition models, which are then used to recognize human behavior.
Hollywood2 dataset: this scheme uses the standard Hollywood2 dataset, which contains 3669 video clips covering 12 behavior classes. Using Coulomb force field clustering, this scheme divides the 12 behaviors into 25 subclasses. The subdivision details are shown in Table 1:
Table 1 Behavior classes and clustering results of the Hollywood2 dataset
Because of large variations in camera position, the diversity of object appearance and posture, the complexity of viewpoint and background, and varying illumination conditions, behavior recognition on the Hollywood2 dataset is very challenging. Fig. 7(a) is the two-dimensional visualization of the Hollywood2 high-dimensional feature space produced by this scheme; Fig. 7(b) is the corresponding visualization for sparse subspace clustering. It can be seen that the clustering effect of this scheme is clearly better than that of sparse subspace clustering. As the clustering result in Fig. 7a shows, the Coulomb force field clustering model first filters noise samples and avoids overfitting, which improves accuracy in the experimental results.
YouTube Action dataset: the YouTube Action dataset contains 11 action classes. The subdivision details are shown in Table 2, and the recognition accuracy reaches 87.2%.
Table 2 Subdivision of YouTube Action behavior classes and discarded samples
Comparing the experimental results of this scheme with state-of-the-art reported results shows that its accuracy reaches 81.3%; compared with other methods (long-term recurrent convolutional networks and sparse subspace clustering), the method of this scheme achieves a more competitive result on the Hollywood2 dataset. Its recognition performance on the Hollywood2 dataset is reported in Table 3:
Table 3 Comparison of this network with other state-of-the-art methods on the Hollywood2 dataset
According to Table 3, the method of this scheme clearly improves performance, raising accuracy from the 82.5% of sparse subspace clustering to 87.2%.
Experiments on the UCF101 dataset
UCF101 dataset: UCF101 is a real-life behavior video recognition dataset containing 101 action classes, with each action performed in 4-7 groups per person, for a total of 13320 videos with an average length of 7.2 seconds per video.
Coulomb force field clustering divides the 101 behavior classes of the UCF101 dataset into 235 subclasses. Fig. 8 compares the visualized distributions of several classes for the method of this scheme and for the long-term recurrent convolutional network: Fig. 8(a), Fig. 8(c) and Fig. 8(e) are two-dimensional visualizations of the UCF101 high-dimensional feature space for the long-term recurrent convolutional network method, while Fig. 8(b), Fig. 8(d) and Fig. 8(f) are those of this scheme. In addition, the coordinate range of this scheme is much smaller than that of the long-term recurrent convolutional network method; the samples falling within this smaller coordinate range show that the features of this scheme are more concentrated than those of the long-term recurrent convolutional network method.
Experimental results: the performance of the method of this scheme on the UCF101 dataset is reported in Table 4:
Table 4 Comparison of this network with other state-of-the-art methods on the UCF101 dataset
Comparing the experimental results of this scheme with the reported results of other state-of-the-art methods indicates that the method of this scheme achieves state-of-the-art recognition performance.
In summary, when building the human behavior recognition model, this scheme replaces the traditional 2D convolutional neural network with a 3D convolutional neural network combined with an NIN (Network in Network) structure for feature extraction, meeting the dual temporal and spatial requirements of human behavior recognition in video clips; the novel Coulomb force field clustering scheme improves the fine-grained fusion model's computation of the network loss and classification of features; in addition, the Coulomb force field clustering method filters out noise samples, improving accuracy.
Claims (10)
1. A construction method of a human behavior recognition model, characterized by comprising:
obtaining a sample library containing a number of human behavior videos, and preprocessing all human behavior videos in the sample library;
extracting feature vectors of the preprocessed human behavior videos using a 3D convolutional neural network;
inputting the extracted feature vectors into a Coulomb force field, where all feature vectors generate attraction between the same class and repulsion between different classes, move relative to one another under these forces, and are thereby clustered;
using a loss function to calculate the error between the current particle position represented by each feature vector and the target position of the feature vector when the similarity function reaches its minimum;
when the error is greater than or equal to a set threshold, back-propagating the error and adjusting the parameters of the 3D convolutional neural network, until the error is less than the set threshold;
when the error is less than the set threshold, completing the training of the 3D convolutional neural network and training a classifier on the feature vectors;
calculating the difference between the classification output of the classifier and the label of each sample, and when the difference is greater than or equal to a preset value, back-propagating the difference and updating the parameters of the classifier;
when the difference is less than the preset value, recording the current optimal parameters of the classifier, the post-clustering sub-behavior class labels, and the human behavior videos corresponding to each post-clustering class label;
forming the human behavior recognition model from the classifier configured with the optimal parameters and the optimized 3D convolutional neural network.
2. The construction method of a human behavior recognition model according to claim 1, characterized in that the calculation formula of the loss function is:
wherein, L^(i) is the loss value of the i-th sample; W and b are respectively the weights and biases of the 3D convolutional neural network; ŷ_j^(i) is the value of the j-th dimension of the i-th sample; y_j^(i) is the target value of the j-th dimension of the i-th sample when the similarity function reaches its minimum; and the final term is a weight-decay (attenuation) term.
3. The construction method of a human behavior recognition model according to claim 1, characterized in that the calculation formula of the similarity function is:
wherein, D(x_i, x_j) is the similarity between class i and class j; m_i is the mean vector of class i; m_j is the mean vector of class j; and σ_i² and σ_j² are the within-class variances of class i and class j.
4. The construction method of a human behavior recognition model according to any one of claims 1-3, characterized in that the 3D convolutional neural network comprises seven layers: the first layer is an input layer with three channels, which respectively receive multiple frames of the preprocessed human behavior video at the current time, the component along the x-axis of the optical flow of the human body in the previous second, and the component along the y-axis of that optical flow;
the second layer is a three-dimensional convolution layer that convolves the images and optical flow input by the first layer with n convolution kernels of scale cw*ch*cl; the third layer is a three-dimensional down-sampling layer that applies max pooling to the output of the second layer with a kernel of scale pw*ph*pl;
the fourth layer is a three-dimensional convolution layer that convolves the output of the third layer in the same way as the second layer; the fifth layer is an NIN layer, formed by a network of two multilayer-perceptron convolution layers, which extracts nonlinear features of human behavior from the output of the fourth layer;
the sixth layer is a pyramid down-sampling layer composed of three-dimensional down-sampling layers of different sizes, which down-samples the nonlinear human behavior features output by the fifth layer; the seventh layer is a fully connected layer that produces a fixed-dimension feature vector from the output of the sixth layer.
5. The construction method of a human behavior recognition model according to claim 4, characterized in that the calculation formula of the convolution operation performed by the second layer on the images and optical flow input by the first layer is:
wherein, w is the weight of the convolution kernel; u is the image gray value of the three input-layer channels together with the horizontal and vertical components of the optical flow; v_xyz is the output of the three-dimensional convolution layer; P and Q are respectively the total number of rows and the total number of columns of the two-dimensional matrix output by the input layer; R is the length of the human behavior video; p and q are respectively the p-th row and the q-th column of the two-dimensional matrix output by the input layer; r is the r-th frame of the human behavior video; cw is the width of the convolution kernel, ch is its height, and cl is its length along the time axis.
6. The construction method of a human behavior recognition model according to claim 4, characterized in that when the three-dimensional down-sampling layer applies max pooling to the output of the second layer, the calculation formula of the three-dimensional overlapping max down-sampling is:
wherein, x is the feature extracted by the second-layer three-dimensional convolution; y is the output obtained after sampling; s, t and r are respectively the sampling step lengths in the three directions of image width, image height and video-time length; m, n and l are the element indices of the third-layer pooling layer's feature map in the x direction, the y direction and along the time axis; and S1, S2 and S3 are the total numbers of rows, columns and frames of the matrix output by the second layer.
7. The construction method of a human behavior recognition model according to claim 4, characterized in that the calculation formula by which the NIN layer extracts the nonlinear features of human behavior from the output of the fourth layer is:
wherein, (i, j) is the pixel index of the current layer's feature map; x_{i,j} is the input block centered at (i, j); k_n is the index of the current layer's feature map; and n is the number of MLP layers.
8. The construction method of a human behavior recognition model according to claim 4, characterized in that the down-sampling, by the pyramid down-sampling layer, of the nonlinear human behavior features output by the fifth layer further comprises:
calculating, for each dimension, the ratio between the side lengths of the input and the output; obtaining each dimension's window size by rounding up and each dimension's window stride by rounding down; and then down-sampling the nonlinear human behavior features using the same calculation formula as the third-layer three-dimensional down-sampling layer.
9. A human behavior recognition model, characterized by comprising a human behavior recognition model constructed using the method of any one of claims 1-8.
10. A method for recognizing human behavior in video using the human behavior recognition model of claim 9, characterized by comprising:
obtaining a video image to be recognized and preprocessing it;
extracting a feature vector of the preprocessed video image using a 3D convolutional neural network;
inputting the feature vector into the classifier of the human behavior recognition model, and classifying the video image through the classifier and the human behavior recognition model to obtain a classification result of the video image carrying an updated sub-behavior class label;
grouping and merging the obtained classification results carrying updated sub-behavior class labels according to the parent class each sample belongs to, to obtain the behavior class of the human behavior in the video image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711054505.6A CN107862275A (en) | 2017-11-01 | 2017-11-01 | Human bodys' response model and its construction method and Human bodys' response method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711054505.6A CN107862275A (en) | 2017-11-01 | 2017-11-01 | Human bodys' response model and its construction method and Human bodys' response method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107862275A true CN107862275A (en) | 2018-03-30 |
Family
ID=61697421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711054505.6A Pending CN107862275A (en) | 2017-11-01 | 2017-11-01 | Human bodys' response model and its construction method and Human bodys' response method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107862275A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108734095A (en) * | 2018-04-10 | 2018-11-02 | 南京航空航天大学 | A kind of motion detection method based on 3D convolutional neural networks |
CN108830185A (en) * | 2018-05-28 | 2018-11-16 | 四川瞳知科技有限公司 | Activity recognition and localization method based on multitask combination learning |
CN108830157A (en) * | 2018-05-15 | 2018-11-16 | 华北电力大学(保定) | Human bodys' response method based on attention mechanism and 3D convolutional neural networks |
CN108860150A (en) * | 2018-07-03 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Automobile brake method, apparatus, equipment and computer readable storage medium |
CN108921204A (en) * | 2018-06-14 | 2018-11-30 | 平安科技(深圳)有限公司 | Electronic device, picture sample set creation method and computer readable storage medium |
CN108960059A (en) * | 2018-06-01 | 2018-12-07 | 众安信息技术服务有限公司 | A kind of video actions recognition methods and device |
CN109255284A (en) * | 2018-07-10 | 2019-01-22 | 西安理工大学 | A kind of Activity recognition method of the 3D convolutional neural networks based on motion profile |
CN109376683A (en) * | 2018-11-09 | 2019-02-22 | 中国科学院计算技术研究所 | A kind of video classification methods and system based on dense graph |
CN109615674A (en) * | 2018-11-28 | 2019-04-12 | 浙江大学 | The double tracer PET method for reconstructing of dynamic based on losses by mixture function 3D CNN |
CN109816714A (en) * | 2019-01-15 | 2019-05-28 | 西北大学 | A kind of point cloud object type recognition methods based on Three dimensional convolution neural network |
CN110059604A (en) * | 2019-04-10 | 2019-07-26 | 清华大学 | The network training method and device that uniform depth face characteristic extracts |
CN110059761A (en) * | 2019-04-25 | 2019-07-26 | 成都睿沿科技有限公司 | A kind of human body behavior prediction method and device |
CN110726898A (en) * | 2018-07-16 | 2020-01-24 | 北京映翰通网络技术股份有限公司 | Power distribution network fault type identification method |
CN111178344A (en) * | 2020-04-15 | 2020-05-19 | 中国人民解放军国防科技大学 | Multi-scale time sequence behavior identification method |
CN112464835A (en) * | 2020-12-03 | 2021-03-09 | 北京工商大学 | Video human behavior identification method based on time sequence enhancement module |
WO2021232172A1 (en) * | 2020-05-18 | 2021-11-25 | 陈永聪 | Interpretable multilayer information screening network |
CN116027941A (en) * | 2022-06-30 | 2023-04-28 | 荣耀终端有限公司 | Service recommendation method and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011159258A1 (en) * | 2010-06-16 | 2011-12-22 | Agency For Science, Technology And Research | Method and system for classifying a user's action |
CN104067308A (en) * | 2012-01-25 | 2014-09-24 | 英特尔公司 | Object selection in an image |
CN106407903A (en) * | 2016-08-31 | 2017-02-15 | 四川瞳知科技有限公司 | Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method |
- 2017-11-01: CN application CN201711054505.6A filed; publication CN107862275A pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011159258A1 (en) * | 2010-06-16 | 2011-12-22 | Agency For Science, Technology And Research | Method and system for classifying a user's action |
CN104067308A (en) * | 2012-01-25 | 2014-09-24 | 英特尔公司 | Object selection in an image |
CN106407903A (en) * | 2016-08-31 | 2017-02-15 | 四川瞳知科技有限公司 | Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method |
Non-Patent Citations (4)
Title |
---|
M.J. HUDAK: "RCE Classifiers: theory and practice", Cybernetics and Systems: An International Journal
万士宁 (Wan Shining): "Research and implementation of face recognition based on convolutional neural networks", China Master's Theses Full-text Database, Information Science and Technology, 2017
吴杰 (Wu Jie): "Research on behavior recognition based on convolutional neural networks", China Master's Theses Full-text Database, Information Science and Technology, 2016
赵小川 (Zhao Xiaochuan): "MATLAB Image Processing: Capability Improvement and Application Cases", 31 January 2014, Beihang University Press
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108734095A (en) * | 2018-04-10 | 2018-11-02 | 南京航空航天大学 | A kind of motion detection method based on 3D convolutional neural networks |
CN108734095B (en) * | 2018-04-10 | 2022-05-20 | 南京航空航天大学 | Motion detection method based on 3D convolutional neural network |
CN108830157A (en) * | 2018-05-15 | 2018-11-16 | 华北电力大学(保定) | Human bodys' response method based on attention mechanism and 3D convolutional neural networks |
CN108830185B (en) * | 2018-05-28 | 2020-11-10 | 四川瞳知科技有限公司 | Behavior identification and positioning method based on multi-task joint learning |
CN108830185A (en) * | 2018-05-28 | 2018-11-16 | 四川瞳知科技有限公司 | Activity recognition and localization method based on multitask combination learning |
CN108960059A (en) * | 2018-06-01 | 2018-12-07 | 众安信息技术服务有限公司 | A kind of video actions recognition methods and device |
CN108921204A (en) * | 2018-06-14 | 2018-11-30 | 平安科技(深圳)有限公司 | Electronic device, picture sample set creation method and computer readable storage medium |
CN108921204B (en) * | 2018-06-14 | 2023-12-26 | 平安科技(深圳)有限公司 | Electronic device, picture sample set generation method, and computer-readable storage medium |
CN108860150A (en) * | 2018-07-03 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Automobile brake method, apparatus, equipment and computer readable storage medium |
CN108860150B (en) * | 2018-07-03 | 2021-05-04 | 百度在线网络技术(北京)有限公司 | Automobile braking method, device, equipment and computer readable storage medium |
CN109255284A (en) * | 2018-07-10 | 2019-01-22 | 西安理工大学 | A kind of Activity recognition method of the 3D convolutional neural networks based on motion profile |
CN110726898B (en) * | 2018-07-16 | 2022-02-22 | 北京映翰通网络技术股份有限公司 | Power distribution network fault type identification method |
CN110726898A (en) * | 2018-07-16 | 2020-01-24 | 北京映翰通网络技术股份有限公司 | Power distribution network fault type identification method |
CN109376683A (en) * | 2018-11-09 | 2019-02-22 | 中国科学院计算技术研究所 | A kind of video classification methods and system based on dense graph |
CN109615674B (en) * | 2018-11-28 | 2020-09-18 | 浙江大学 | Dynamic double-tracing PET reconstruction method based on mixed loss function 3D CNN |
CN109615674A (en) * | 2018-11-28 | 2019-04-12 | 浙江大学 | The double tracer PET method for reconstructing of dynamic based on losses by mixture function 3D CNN |
CN109816714A (en) * | 2019-01-15 | 2019-05-28 | 西北大学 | A kind of point cloud object type recognition methods based on Three dimensional convolution neural network |
CN110059604A (en) * | 2019-04-10 | 2019-07-26 | 清华大学 | The network training method and device that uniform depth face characteristic extracts |
CN110059604B (en) * | 2019-04-10 | 2021-04-27 | 清华大学 | Network training method and device for deeply and uniformly extracting human face features |
CN110059761A (en) * | 2019-04-25 | 2019-07-26 | 成都睿沿科技有限公司 | A kind of human body behavior prediction method and device |
CN111178344A (en) * | 2020-04-15 | 2020-05-19 | 中国人民解放军国防科技大学 | Multi-scale time sequence behavior identification method |
WO2021232172A1 (en) * | 2020-05-18 | 2021-11-25 | 陈永聪 | Interpretable multilayer information screening network |
CN112464835A (en) * | 2020-12-03 | 2021-03-09 | 北京工商大学 | Video human behavior identification method based on time sequence enhancement module |
CN116027941A (en) * | 2022-06-30 | 2023-04-28 | 荣耀终端有限公司 | Service recommendation method and electronic equipment |
CN116027941B (en) * | 2022-06-30 | 2023-10-20 | 荣耀终端有限公司 | Service recommendation method and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862275A (en) | Human bodys' response model and its construction method and Human bodys' response method | |
CN110188725A (en) | The scene Recognition system and model generating method of high-resolution remote sensing image | |
CN103927531B (en) | It is a kind of based on local binary and the face identification method of particle group optimizing BP neural network | |
CN110598598A (en) | Double-current convolution neural network human behavior identification method based on finite sample set | |
CN108133188A (en) | A kind of Activity recognition method based on motion history image and convolutional neural networks | |
CN108009509A (en) | Vehicle target detection method | |
GB2578341A (en) | Method and apparatus for automatically recognizing electrical imaging well logging facies | |
CN112883839B (en) | Remote sensing image interpretation method based on adaptive sample set construction and deep learning | |
CN110322445B (en) | Semantic segmentation method based on maximum prediction and inter-label correlation loss function | |
CN110084285A (en) | Fish fine grit classification method based on deep learning | |
CN106709482A (en) | Method for identifying genetic relationship of figures based on self-encoder | |
CN108268860A (en) | A kind of gas gathering and transportation station equipment image classification method based on convolutional neural networks | |
CN110287985B (en) | Depth neural network image identification method based on variable topology structure with variation particle swarm optimization | |
CN111833322B (en) | Garbage multi-target detection method based on improved YOLOv3 | |
CN110163213A (en) | Remote sensing image segmentation method based on disparity map and multiple dimensioned depth network model | |
CN104809469A (en) | Indoor scene image classification method facing service robot | |
US11695898B2 (en) | Video processing using a spectral decomposition layer | |
CN110852369B (en) | Hyperspectral image classification method combining 3D/2D convolutional network and adaptive spectrum unmixing | |
CN111368660A (en) | Single-stage semi-supervised image human body target detection method | |
CN106980830A (en) | One kind is based on depth convolutional network from affiliation recognition methods and device | |
CN106980831A (en) | Based on self-encoding encoder from affiliation recognition methods | |
CN100416599C (en) | Not supervised classification process of artificial immunity in remote sensing images | |
CN109447014A (en) | A kind of online behavioral value method of video based on binary channels convolutional neural networks | |
CN110096976A (en) | Human behavior micro-Doppler classification method based on sparse migration network | |
CN111160389A (en) | Lithology identification method based on fusion of VGG |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||