CN108805083A - Single-stage video action detection method - Google Patents
- Publication number
- CN108805083A CN108805083A CN201810607804.6A CN201810607804A CN108805083A CN 108805083 A CN108805083 A CN 108805083A CN 201810607804 A CN201810607804 A CN 201810607804A CN 108805083 A CN108805083 A CN 108805083A
- Authority
- CN
- China
- Prior art keywords
- action
- video
- training
- multi-scale
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a single-stage video action detection method, comprising: in the training stage, building a multi-scale action-segment regression network based on convolutional neural networks; taking training videos and frame-level ground-truth action labels as input, and training the multi-scale action-segment regression network with an end-to-end multi-task learning optimization method to obtain a trained multi-scale action-segment regression network model; in the deployment stage, when a new video is input, generating input frame sequences of the same length as the training videos with a sliding window along the time dimension, using the trained multi-scale action-segment regression network model to predict the action class and corresponding temporal location of each input frame sequence, and then applying non-maximum suppression to the predictions to generate the final action detection results. The method improves both detection performance and detection efficiency.
Description
Technical field
The present invention relates to the field of video action detection, and in particular to a single-stage video action detection method.
Background technology
In recent years, video capture devices (smartphones, digital cameras, surveillance cameras, etc.) have become ubiquitous, making it easy for anyone to shoot video, and modern communication equipment has made video ever easier to acquire and share; video has become an important information carrier in modern society. With the continuing growth of demand for intelligent computing and the rapid development of pattern recognition, image processing, and artificial intelligence technology, analyzing video content with computer vision techniques has both enormous practical demand and high commercial value. Human activity is usually the main information in a video, so detecting the human actions in a video is of great significance for video understanding. The video human action detection task is, in an unsegmented long video, to detect the class of every human action instance the video contains while localizing the time at which each instance occurs. Since most surveillance and Internet videos are unsegmented long videos, performing detection in long videos better matches practical needs.
With the development of deep learning, the field of video action detection has produced some research results. However, the field is still in its early stages: current video action detection methods are often immature, and problems such as overly complex models, excessive computational cost, and low action localization accuracy are widespread. To meet the needs of practical applications, new video action detection frameworks and methods are urgently required.
Research on the video action detection task is still sparse, and the proposed methods usually follow a multi-stage detection framework: in a first stage, proposal techniques are used to generate candidate temporal windows with high recall in the video, or additional feature extraction techniques are used to obtain discriminative action features; in a subsequent stage, these candidate windows or action features are classified to predict the action class. The patent "A motion detection model based on convolutional neural networks" uses a two-stage method: a Faster R-CNN network first generates window proposals and extracts action features from video frames and optical-flow maps, and an independent SVM classifier then classifies the features. In the patent "A video action detection method based on convolutional neural networks", the first stage segments the untrimmed video with dense multi-scale sliding windows and recognizes each window with a convolutional neural network containing a spatio-temporal pyramid layer, and a second stage then screens and merges the per-window recognition results to obtain the final detected video segments. The paper "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs" proposes an action detection method based on segment-level 3D convolutional neural networks: one 3D ConvNet first generates sliding-window action-instance proposals, and another 3D ConvNet then classifies the proposals. The paper "Cascaded Boundary Regression for Temporal Action Detection" uses a two-stage detection framework that applies temporal boundary regression to further refine the boundaries of the sliding-window proposals. The paper "Single Shot Temporal Action Detection" proposes a single-pipeline action classifier for action detection, but it still extracts appearance and motion features with separate two-stream networks (two-stream ConvNets).
However, the multi-stage methods above treat feature extraction, sliding-window proposal, and action classification as independent processing stages. The stages cannot be trained jointly, which hinders the coordination and joint optimization of the action detection model; meanwhile, a large amount of computation is repeated across the different stages, hurting the computational efficiency of the algorithm.
Summary of the invention
The object of the present invention is to provide a single-stage video action detection method that improves both detection performance and detection efficiency.
The object of the present invention is achieved through the following technical solutions:
A single-stage video action detection method, comprising:
In the training stage, building a multi-scale action-segment regression network based on convolutional neural networks; taking training videos and frame-level ground-truth action labels as input, and training the multi-scale action-segment regression network with an end-to-end multi-task learning optimization method to obtain a trained multi-scale action-segment regression network model;
In the deployment stage, when a new video is input, generating input frame sequences of the same length as the training videos with a sliding window along the time dimension; using the trained multi-scale action-segment regression network model to predict the action class and corresponding temporal location of each input frame sequence; and then applying non-maximum suppression to the predictions to generate the final action detection results.
As can be seen from the technical solution provided by the invention, first, the constructed multi-scale action-segment regression network completely removes the temporal proposal stage and the additional feature extraction stage of traditional action detection methods: all computation for detecting action instances in untrimmed long videos is completed inside a single convolutional neural network, which can be jointly trained and optimized end to end as a whole, achieving higher detection performance. Second, the network structure is simplified so that the vast majority of the computation can be parallelized, greatly improving the efficiency of action detection.
Description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a single-stage video action detection method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the in-video action detection process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the overall structure of the multi-scale action-segment regression network provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of output results on the THUMOS'14 dataset provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the present invention.
To address the complexity, low detection accuracy, and slow processing speed of existing video action detection methods, embodiments of the present invention provide a single-stage video action detection method. First, to improve computational efficiency, the method encapsulates all computation in one network, completing the action detection task in a single-stage convolutional neural network. Second, to improve detection accuracy, the method uses multi-scale location regression on multi-scale network feature maps to flexibly detect human actions of various temporal extents, outputting frame-level temporal action boundaries and action classes. Finally, so that all parts of the network can be optimized jointly, the method processes the input video in a single network, making the whole network trainable end to end.
As shown in Fig. 1, the flowchart of the single-stage video action detection method provided by the embodiment of the present invention mainly includes:
1. In the training stage, a multi-scale action-segment regression network is built based on convolutional neural networks; training videos and frame-level ground-truth action labels serve as input, and the multi-scale action-segment regression network is trained with an end-to-end multi-task learning optimization method, yielding a trained multi-scale action-segment regression network model.
In the embodiment of the present invention, a single-stage convolutional neural network completes the action detection task, and network feature maps of different scales are associated with anchor action instances of different temporal lengths, enabling the network to flexibly detect human actions of various durations. The method is mainly divided into the following parts:
1) Building the multi-scale action-segment regression network based on convolutional neural networks.
In the embodiment of the present invention, the constructed multi-scale action-segment regression network comprises a base feature-generalization module, an action-instance anchoring module, and an action prediction module, wherein:
A. The base feature-generalization module comprises N1 (e.g., N1 = 5) alternating 3D convolution layers and N2 (e.g., N2 = 5) 3D max-pooling layers; it extracts generalized features from the input video sequence and enlarges the receptive field.
B. The action-instance anchoring module uses N3 (e.g., N3 = 4) 3D convolution layers with stride s1 (e.g., s1 = 2) on the time dimension and stride s2 (e.g., s2 = 1) on the spatial dimensions; this module associates each cell of the anchor feature map output by each 3D convolution layer with anchor action instances of different temporal lengths.
In the embodiment of the present invention, in the action-instance anchoring module, a base temporal scale s_k, k ∈ [1, N3], is defined for each anchor feature map; the s_k are evenly distributed over the range [0, 1]. A set of scale ratios {r_d}, d ∈ [1, D_k], is also defined for each anchor feature map, where D_k is the number of scale ratios. The size of each anchor feature map is written h × w × t, where h, w, and t denote its height, width, and temporal length. Each cell of size h × w × 3 on an anchor feature map is then associated with D_k anchor action instances; the temporal length of each anchor action instance is l_d = s_k · r_d, d ∈ [1, D_k], and its center is the cell center.
C. The action prediction module applies, for each cell of an anchor feature map, D_k·(m + 2) convolution kernels of size h × w × 3, and outputs, for the corresponding cell, the prediction scores of its D_k anchor action instances over m action classes together with two temporal-location offsets.
2) End-to-end optimization with multi-task learning.
In the embodiment of the present invention, training videos and frame-level ground-truth action labels are taken as input, a joint training objective is constructed, and the parameters of the multi-scale action-segment regression network are trained by gradient descent. The training process is as follows:
A. Video frames are extracted from the training videos at a fixed frame rate (e.g., 10 frames/second) to obtain training frame sequences, and each frame is resized to a uniform resolution; sub-sequences of equal length are then produced by sliding a window over the sequence as input frame sequences, the length of each window being the maximum number of frames allowed by GPU memory (e.g., 192 frames).
B. A positive-sample matching strategy establishes the correspondence between anchor action instances and the ground-truth action instances in the labels.
In the embodiment of the present invention, for each training sample, the temporal overlap (Intersection-over-Union, IoU) between each anchor action instance and each ground-truth action instance is computed; if the overlap exceeds a fixed threshold (e.g., 0.5), the corresponding anchor action instance is taken as a positive sample, otherwise as a negative sample. One ground-truth instance may match multiple anchor action instances.
C. A multi-task loss function serves as the training objective of the multi-scale action-segment regression network, which is trained by stochastic gradient descent, iteratively producing the final multi-scale action-segment regression network model.
In the embodiment of the present invention, the multi-task loss function means that the training objective L_loss jointly combines the action classification loss and the temporal-location regression loss, and is expressed as:
L_loss = (1/N_cls) Σ_i L_cls(p_i, g) + α · (1/N_pos) Σ_{i ∈ pos} L_loc(t_i, t_i*) + β · L2(Θ)
In the formula above, L2(Θ) is the L2 regularization loss, where Θ denotes all learned parameters of the multi-scale action-segment regression network; α and β are loss trade-off parameters, used respectively to control the temporal-offset loss and the regularization loss; N_cls and N_pos are the numbers of total training samples and positive samples respectively; p_i = (p_i^1, …, p_i^m) is the prediction vector over the action classes of the i-th anchor instance, where the superscript j indexes the j-th of the m action classes; p_i^g denotes the prediction score of the i-th anchor instance for the ground-truth class, the superscript g indicating that the g-th action class is the ground-truth label class; t_i is the temporal-location offset of the anchor instance, and t_i* is the coordinate transform between the ground-truth location and the anchor location;
L_cls is the action classification loss, set to the multi-class softmax loss L_cls(p_i, g) = −log( exp(p_i^g) / Σ_j exp(p_i^j) );
L_loc is the temporal-location regression loss, set to the smooth L1 loss of the temporal-location offsets.
2. In the deployment stage, when a new video is input, input frame sequences of the same length as the training videos are generated by a sliding window along the time dimension; the trained multi-scale action-segment regression network model predicts the action class and corresponding temporal location of each input frame sequence; non-maximum suppression is then applied to the predictions to generate the final action detection results.
With the scheme of the embodiment above, the constructed multi-scale action-segment regression network completely removes the temporal proposal stage and the additional feature extraction stage of traditional action detection methods, completing all computation for detecting action instances in untrimmed videos in a single convolutional neural network that can be jointly trained and optimized end to end as a whole, achieving higher detection performance. Moreover, the network simplifies its structure by being fully convolutional, so that the vast majority of the computation can be parallelized, greatly improving the efficiency of action detection; in particular, with GPU-accelerated parallel computation it reaches a faster detection speed than any method reported to date.
For ease of understanding, a specific example is described below.
The overall flow of this example is similar to the embodiment above. In the training stage, the multi-scale action-segment regression network is built first; it completes all action detection computation in a single stage, eliminating all other extra links, and since the network is fully convolutional, the vast majority of the computation can be accelerated in parallel, greatly improving computational efficiency. Then, video frame sequences are extracted from the training videos; the frame sequences produced from them by the sliding-window method, together with the frame-level ground-truth action labels, serve as the input to the multi-scale action-segment regression network, whose parameters are trained with the end-to-end multi-task learning optimization method to generate the network model. In the deployment stage, for a newly input video, frames are extracted and sequences of the same length as the training input are generated by sliding windows, then fed into the trained multi-scale action-segment regression network for action detection; non-maximum suppression (NMS) is then applied to the outputs to generate the final action detection results.
The specific detection flow is shown in Fig. 2. The input is a sequence of RGB video frames extracted from the video, R ∈ R^(3×H×W×T), where T, H, and W are the length, height, and width of the input sequence, and 3 is the number of RGB channels. The input sequence passes in turn through the base feature-generalization module, the action-instance anchoring module, and the action prediction module, which output the classification categories and location offsets of the anchor action instances; the final temporal locations of the actions are computed from the location offsets. t, h, and w denote the length, height, and width of an anchor feature map, and m denotes the number of action classes.
The video data used in this implementation come from the representative human action recognition dataset THUMOS'14, provided by (THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/., 2014).
The following describes three aspects: building the multi-scale action-segment regression network, end-to-end optimization training with multi-task learning, and testing and evaluation.
1. Building the multi-scale action-segment regression network.
Fig. 3 is a schematic diagram of the multi-scale action-segment regression network provided by this example. It mainly comprises the base feature-generalization module, the action-instance anchoring module, and the action prediction module, wherein:
The base feature-generalization module consists of five 3D convolution layers (3D convolution) and five 3D max-pooling layers (3D max-pooling). The 3D convolution kernels are of size 3 × 3 × 3; the first three layers contain 64, 128, and 256 kernels respectively, and the remaining two layers 512 kernels each. The kernel size of the first three 3D pooling layers is set to 2 × 2 × 2, and that of the remaining pooling layers to 1 × 2 × 2. The output feature map of the fifth max-pooling layer is denoted F5; its height and width are h = w = 3 and its temporal dimension is t = 24. Each convolution output uses the rectified linear unit (ReLU) as its activation function, adding nonlinear modeling capacity to the network.
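As a minimal sketch (not the patent's code), the shape of F5 can be checked by tracking only the pooling layers; the convolutions are assumed here to be padded so that they preserve size:

```python
# Sketch: verify the base module's output shape by applying the five pooling
# kernels described above to a 192-frame, 96x96 input. Assumes "same"-padded
# convolutions, so only pooling changes the feature-map size.

def base_module_output_shape(t, h, w):
    """Downsample (time, height, width) through the five 3D pooling layers."""
    pool_kernels = [(2, 2, 2)] * 3 + [(1, 2, 2)] * 2  # (time, height, width)
    for kt, kh, kw in pool_kernels:
        t, h, w = t // kt, h // kh, w // kw
    return t, h, w

print(base_module_output_shape(192, 96, 96))  # F5: (24, 3, 3)
```

Time is halved three times (192 → 24) while the spatial dimensions are halved five times (96 → 3), matching the stated F5 size.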
The action-instance anchoring module uses 3D convolution layers; each 3D convolution layer in this module is here called an anchor layer, and the output of each layer an anchor feature map. In all anchor layers the convolution kernel size is set to 3 × 3 × 3, and each layer has 256 convolution kernels along the time dimension. The anchor feature maps output by the anchor layers are denoted F6, F7, F8, and F9, with sizes (256 × 12 × 3 × 3), (256 × 6 × 3 × 3), (256 × 3 × 3 × 3), and (256 × 1 × 3 × 3) respectively. The output F5 of the last layer of the base feature-generalization module, of size (256 × 24 × 3 × 3), is also used as an anchor feature map. The second (temporal) dimension of each anchor feature map is padded (pad) at both ends, so each cell on each anchor feature map has size (256 × 3 × 3 × 3). The base scales of the anchor feature maps F5, F6, F7, F8, and F9 are set in turn to {0.1, 0.3, 0.5, 0.7, 0.9}. The scale ratios of F5, F6, F7, and F8 are set to {0.8, 1, 1.5}, and those of F9 to {0.7, 0.85, 1}. With three scale ratios per anchor feature map, each (256 × 3 × 3 × 3) cell on each anchor feature map is associated with three anchor action instances; the length of each anchor action instance is the base scale of its anchor feature map multiplied by the corresponding scale ratio, and the center of each anchor action instance is the center of its cell.
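The anchor layout above can be enumerated with a short sketch (illustrative only; the patent's tensor layout is not reproduced). Lengths and centers are in normalized time, relative to the input sequence, and cells are indexed along the time axis only since each cell spans the full 3 × 3 spatial extent:

```python
# Sketch: enumerate the anchor action instances of each anchor feature map
# from its base scale and scale ratios, per the configuration in the text.

BASE_SCALES = {"F5": 0.1, "F6": 0.3, "F7": 0.5, "F8": 0.7, "F9": 0.9}
SCALE_RATIOS = {"F5": [0.8, 1, 1.5], "F6": [0.8, 1, 1.5],
                "F7": [0.8, 1, 1.5], "F8": [0.8, 1, 1.5],
                "F9": [0.7, 0.85, 1]}
TEMPORAL_CELLS = {"F5": 24, "F6": 12, "F7": 6, "F8": 3, "F9": 1}

def anchors_for_map(name):
    """Yield (center, length) pairs: one anchor per (temporal cell, ratio)."""
    n = TEMPORAL_CELLS[name]
    for cell in range(n):
        center = (cell + 0.5) / n              # cell center in normalized time
        for r in SCALE_RATIOS[name]:
            yield center, BASE_SCALES[name] * r  # l_d = s_k * r_d

f9 = list(anchors_for_map("F9"))
print([(c, round(l, 4)) for c, l in f9])  # [(0.5, 0.63), (0.5, 0.765), (0.5, 0.9)]
```

F9 has a single cell, so its three anchors all sit at the sequence center and cover long actions (63% to 90% of the input), while F5's 24 cells cover many short ones.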
The action prediction module predicts the location offsets and action classes of the corresponding anchor instances with 3D convolutions. Taking THUMOS'14 as an example, each cell is associated with three anchor action instances and must predict 20 action-class scores, 1 background-class score, and 2 temporal-location offsets; therefore, on each cell of F5, F6, F7, F8, and F9, 3 × (20 + 1 + 2) = 69 convolution kernels of size 3 × 3 × 3 are applied, and the output of each convolution corresponds to the 20 action-class scores, 1 background-class score, and 2 temporal-location offsets of its 3 anchor action instances.
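The channel bookkeeping above, and the total anchor count it implies (a derived figure, not stated in the text), can be sketched as:

```python
# Sketch: prediction-head arithmetic for THUMOS'14
# (20 action classes + 1 background class + 2 temporal offsets per anchor).

NUM_CLASSES, NUM_BG, NUM_OFFSETS, ANCHORS_PER_CELL = 20, 1, 2, 3

kernels_per_cell = ANCHORS_PER_CELL * (NUM_CLASSES + NUM_BG + NUM_OFFSETS)
print(kernels_per_cell)  # 3 * (20 + 1 + 2) = 69

# Total anchors across F5..F9 (temporal cell counts 24, 12, 6, 3, 1):
total_anchors = sum(t * ANCHORS_PER_CELL for t in (24, 12, 6, 3, 1))
print(total_anchors)  # 138 anchors per 192-frame input window
```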
2. End-to-end optimization training with multi-task learning.
Because GPU memory is limited, a complete long video cannot be input at once, so the video must be processed to generate suitable inputs. Therefore, in the training stage, video frames are first extracted from the THUMOS'14 training videos at 10 frames/second, and each frame is resized to 96 × 96, i.e. H = W = 96. Then a window is slid over the video frame sequence to generate runs of T = 192 consecutive frames as input frame sequences.
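A minimal sketch of this windowing step follows; the non-overlapping stride of 192 is the one stated later for test time and is assumed here for illustration (the training stride is not specified in the text):

```python
# Sketch: generate fixed-length input windows over an extracted frame
# sequence. Window length T = 192 per the text; stride = 192 is assumed.

def sliding_windows(num_frames, length=192, stride=192):
    """Return (start, end) index pairs of all full-length windows."""
    return [(s, s + length)
            for s in range(0, num_frames - length + 1, stride)]

# A 10-minute video at 10 fps yields 6000 frames -> 31 full windows.
print(len(sliding_windows(6000)))  # 31
```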
After the input frame sequences are generated, because each sequence contains both action segments and background segments, the specific positive and negative samples must still be determined. The specific method is: for each input frame sequence, the temporal IoU between every anchor action instance of the multi-scale action-segment regression network and the corresponding ground-truth instance is computed; if the IoU exceeds the threshold 0.5, the anchor instance is taken as a positive sample, otherwise as a negative sample. One ground-truth instance may match multiple anchor action instances, but each anchor action instance can match at most one ground-truth instance.
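The matching rule above can be sketched as follows (illustrative only; instances are plain (start, end) pairs, and the patent's tensor layout is not reproduced):

```python
# Sketch: temporal IoU and positive/negative anchor matching with the 0.5
# threshold from the text. Each anchor gets the index of its best-matching
# ground truth if the IoU exceeds the threshold, else -1 (negative sample).

def temporal_iou(a, b):
    """Intersection-over-Union of two 1-D intervals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, ground_truths, threshold=0.5):
    """Label each anchor with a ground-truth index (positive) or -1 (negative)."""
    labels = []
    for anc in anchors:
        ious = [temporal_iou(anc, gt) for gt in ground_truths]
        best = max(range(len(ious)), key=lambda i: ious[i])
        labels.append(best if ious[best] > threshold else -1)
    return labels

anchors = [(0.0, 1.0), (0.9, 2.0), (3.0, 4.0)]
gts = [(0.1, 1.1), (2.9, 4.1)]
print(match_anchors(anchors, gts))  # [0, -1, 1]
```

Note that each anchor receives at most one ground-truth instance, while one ground truth may be assigned to several anchors, matching the constraint in the text.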
After the positive and negative samples are determined, the multi-task loss function serves as the training objective, and the parameters of the multi-scale action-segment regression network are trained by stochastic gradient descent. The multi-task loss function is defined as:
L_loss = (1/N_cls) Σ_i L_cls(p_i, g) + α · (1/N_pos) Σ_{i ∈ pos} L_loc(t_i, t_i*) + β · L2(Θ)
In the formula above, L2(Θ) is the L2 regularization loss, where Θ denotes all learned parameters of the multi-scale action-segment regression network; α and β are loss trade-off parameters, used respectively to control the temporal-offset loss and the regularization loss; N_cls and N_pos are the numbers of total training samples and positive samples respectively; p_i = (p_i^1, …, p_i^m) is the prediction vector over the action classes of the i-th anchor instance, where the superscript j indexes the j-th of the m action classes; p_i^g denotes the prediction score of the i-th anchor instance for the ground-truth class, the superscript g indicating that the g-th action class is the ground-truth label class; t_i is the temporal-location offset of the anchor instance, and t_i* is the coordinate transform between the ground-truth location and the anchor location;
L_cls is the action classification loss, set to the multi-class softmax loss L_cls(p_i, g) = −log( exp(p_i^g) / Σ_j exp(p_i^j) );
L_loc is the temporal-location regression loss, set to the smooth L1 loss of the temporal-location offsets.
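A NumPy sketch of the two loss terms follows, as an assumed formulation of the softmax and smooth L1 losses named above; the α/β weighting and the L2 term are omitted:

```python
# Sketch: per-anchor classification and localization losses.
# softmax_loss: -log softmax probability of the ground-truth class.
# smooth_l1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise, summed over offsets.

import numpy as np

def softmax_loss(scores, gt_class):
    """Multi-class softmax (cross-entropy) loss for one anchor."""
    scores = scores - scores.max()                 # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[gt_class]

def smooth_l1(x):
    """Smooth L1 loss over a vector of temporal-offset residuals."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5).sum()

print(round(float(softmax_loss(np.array([2.0, 1.0, 0.0]), 0)), 4))  # 0.4076
print(float(smooth_l1(np.array([0.5, 2.0]))))  # 0.125 + 1.5 = 1.625
```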
3. Testing and evaluation.
After the multi-scale action-segment regression network has been trained on THUMOS'14, its performance is assessed as follows. On the THUMOS'14 test video set, for each video, video frame sequences are extracted at 10 frames/second, and a sliding window with step length 192 frames generates test video frame sequences of the same length as the training ones; these are fed into the trained multi-scale action-segment regression network, which outputs the predicted action classes and corresponding temporal locations. Non-maximum suppression (NMS) is then applied to the outputs to generate the final action detection results. Finally, the detection results are compared with the ground-truth action labels of the test set to obtain the network's evaluation results. Fig. 4 shows a schematic diagram of action detection results on the THUMOS'14 test video set.
From the description of the embodiments above, those skilled in the art can clearly understand that the embodiments may be implemented in software, or in software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (a CD-ROM, USB flash drive, removable hard disk, etc.) and includes instructions that cause a computing device (a personal computer, server, network device, etc.) to execute the methods described in the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A single-stage video action detection method, characterized by comprising:
in the training stage, building a multi-scale action-segment regression network based on convolutional neural networks; taking training videos and frame-level ground-truth action labels as input, and training the multi-scale action-segment regression network with an end-to-end multi-task learning optimization method to obtain a trained multi-scale action-segment regression network model;
in the deployment stage, when a new video is input, generating input frame sequences of the same length as the training videos with a sliding window along the time dimension; using the trained multi-scale action-segment regression network model to predict the action class and corresponding temporal location of each input frame sequence; and then applying non-maximum suppression to the predictions to generate the final action detection results.
2. The single-stage video behavior detection method according to claim 1, wherein the constructed multi-scale behavior-segment regression network comprises a base generalization module, a behavior-instance anchoring module and a behavior prediction module, wherein:
the base generalization module comprises N1 three-dimensional convolution layers and N2 three-dimensional max-pooling layers arranged alternately, and is used to generalize features of the input video sequence and to enlarge the receptive field;
the behavior-instance anchoring module uses N3 three-dimensional convolution layers with stride s1 on the temporal dimension and stride s2 on the spatial dimensions; each cell of the anchor feature map output by each three-dimensional convolution layer of this module is associated with anchored behavior instances of different temporal lengths;
the behavior prediction module convolves each cell of an anchor feature map with Dk(m+2) convolution kernels of size h × w × 3, and outputs, for that cell, the prediction scores of its Dk anchored behavior instances over the m behavior categories together with two temporal-location offsets; where h and w denote the height and width of the anchor feature map, respectively.
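The Dk(m+2) outputs per cell decompose into m class scores plus 2 temporal offsets for each of the Dk anchors. A minimal sketch of that bookkeeping, assuming a per-anchor contiguous channel layout (the claim does not specify the ordering, so this layout is hypothetical):

```python
def split_cell_predictions(cell_vec, num_anchors, num_classes):
    """Split one cell's Dk*(m+2) prediction channels into, per anchor,
    m class scores and 2 temporal-location offsets."""
    stride = num_classes + 2          # m scores + 2 offsets per anchor
    assert len(cell_vec) == num_anchors * stride
    per_anchor = []
    for d in range(num_anchors):
        chunk = cell_vec[d * stride:(d + 1) * stride]
        per_anchor.append({"scores": chunk[:num_classes],
                           "offsets": chunk[num_classes:]})
    return per_anchor
```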
3. The single-stage video behavior detection method according to claim 2, wherein in the behavior-instance anchoring module, a base temporal scale sk, k ∈ [1, N3], is defined for each anchor feature map, and a group of scale ratios {rd} is defined for each anchor feature map, Dk being the number of scale ratios; the size of each anchor feature map is expressed as h × w × t, where t denotes the temporal length of the anchor feature map; each cell of size h × w × 3 of an anchor feature map is then associated with Dk anchored behavior instances whose temporal lengths are ld = sk · rd, d ∈ [1, Dk], and whose centers coincide with the cell center.
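The anchor construction above reduces to simple arithmetic: each cell spawns Dk segments of length ld = sk · rd centered at the cell. A hypothetical sketch (function and argument names are illustrative):

```python
def anchor_instances(base_scale, ratios, cell_center):
    """Anchored behavior instances for one cell: temporal spans of length
    base_scale * r for each scale ratio r, centered at cell_center."""
    return [(cell_center - base_scale * r / 2.0,
             cell_center + base_scale * r / 2.0) for r in ratios]
```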
4. The single-stage video behavior detection method according to claim 2, wherein, taking training videos and frame-level ground-truth behavior labels as input, a joint training objective function is used and the parameters of the multi-scale behavior-segment regression network are trained by gradient descent, the training process being as follows:
video frames are extracted from the training videos at a fixed frame rate to obtain training frame-picture sequences, and each frame picture is resized to a unified resolution; equal-length subsequences of each sequence are taken by a sliding window as input frame sequences, the length of each sliding window being the maximum number of frames the GPU memory allows;
a positive-sample matching strategy is used to establish correspondences between the anchored behavior instances and the ground-truth behavior instances in the labels;
a multi-task loss function is used as the objective function for training the multi-scale behavior-segment regression network; training proceeds by stochastic gradient descent, and the final multi-scale behavior-segment regression network model is generated through iteration.
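The frame-sequence preparation above can be sketched as follows; the window stride is an assumption, since the claim fixes only the window length (the maximum frame count the GPU memory permits):

```python
def sliding_windows(num_frames, window, stride):
    """Start indices of fixed-length input frame sequences taken from a
    training frame sequence by a temporal sliding window."""
    if num_frames < window:
        return []          # sequence shorter than one window
    return list(range(0, num_frames - window + 1, stride))
```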
5. The single-stage video behavior detection method according to claim 4, wherein using the positive-sample matching strategy to establish correspondences between the anchored behavior instances and the ground-truth behavior instances in the labels comprises:
in each training sample, computing the temporal overlap between each anchored behavior instance and each ground-truth behavior instance; if the overlap exceeds a fixed threshold, the corresponding anchored behavior instance is taken as a positive sample, otherwise as a negative sample; one ground-truth instance may match multiple anchored behavior instances.
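A minimal sketch of the positive-sample matching strategy, using temporal IoU as the overlap measure and an assumed 0.5 threshold (the claim only requires a fixed threshold):

```python
def match_anchors(anchors, gt_segments, thresh=0.5):
    """Flag each anchored (start, end) instance as positive if its temporal
    IoU with any ground-truth segment exceeds thresh; several anchors may
    match the same ground-truth instance."""
    def tiou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0
    return [any(tiou(a, g) > thresh for g in gt_segments) for a in anchors]
```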
6. The single-stage video behavior detection method according to claim 4, wherein the multi-task loss function refers to the objective function Lloss of network training, which jointly combines a behavior classification loss and a temporal-location regression loss, expressed as:
Lloss = (1/Ncls) Σi Lcls(pi) + (α/Npos) Σi Lloc(ti, ti*) + β·L2(Θ)
In the above formula, L2(Θ) is the L2 regularization loss, and Θ denotes all learnable parameters of the multi-scale behavior-segment regression network; α and β are loss trade-off parameters that control the temporal-location-offset loss and the regularization loss, respectively; Ncls and Npos are the numbers of total training samples and of positive samples, respectively; pi = (pi^1, ..., pi^m) is the prediction vector over behavior categories for the i-th anchored location, where the superscript j indexes the j-th of the m behavior categories; pi^g denotes the prediction score of the i-th anchored location for the ground-truth category, the superscript g indicating that the g-th behavior category is the ground-truth label category; ti is the temporal-location offset of the anchored instance, and ti* is the coordinate transform of the ground-truth location with respect to the anchor location;
Lcls is the behavior classification loss, set to the multi-class softmax loss;
Lloc is the temporal-location regression loss, set to the smooth-L1 loss over the temporal-location offsets.
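The joint loss can be sketched in plain Python as below. The β·L2(Θ) regularization term is omitted because it depends only on the network parameters, and all function names here are hypothetical, not from the patent:

```python
import math

def softmax_ce(logits, gt_idx):
    """Multi-class softmax (cross-entropy) loss for one anchored instance."""
    m = max(logits)  # shift for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gt_idx]

def smooth_l1(x):
    """Smooth-L1 penalty on one temporal-offset residual."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def multitask_loss(logits, gt_cls, offsets, gt_offsets, positive, alpha=1.0):
    """Classification loss averaged over all samples plus alpha times the
    offset-regression loss averaged over positive samples (claim 6, without
    the beta * L2 regularization term)."""
    n_cls = len(logits)
    n_pos = max(1, sum(positive))
    l_cls = sum(softmax_ce(l, g) for l, g in zip(logits, gt_cls)) / n_cls
    l_loc = sum(smooth_l1(o[0] - t[0]) + smooth_l1(o[1] - t[1])
                for o, t, p in zip(offsets, gt_offsets, positive) if p) / n_pos
    return l_cls + alpha * l_loc
```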
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810607804.6A CN108805083B (en) | 2018-06-13 | 2018-06-13 | Single-stage video behavior detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108805083A true CN108805083A (en) | 2018-11-13 |
CN108805083B CN108805083B (en) | 2022-03-01 |
Family
ID=64085637
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810607804.6A Active CN108805083B (en) | 2018-06-13 | 2018-06-13 | Single-stage video behavior detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108805083B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103164694A (en) * | 2013-02-20 | 2013-06-19 | 上海交通大学 | Method for recognizing human motion |
CN105740773A (en) * | 2016-01-25 | 2016-07-06 | 重庆理工大学 | Deep learning and multi-scale information based behavior identification method |
CN106407903A (en) * | 2016-08-31 | 2017-02-15 | 四川瞳知科技有限公司 | Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method |
CN107729799A (en) * | 2017-06-13 | 2018-02-23 | 银江股份有限公司 | Crowd's abnormal behaviour vision-based detection and analyzing and alarming system based on depth convolutional neural networks |
CN108133188A (en) * | 2017-12-22 | 2018-06-08 | 武汉理工大学 | A kind of Activity recognition method based on motion history image and convolutional neural networks |
Non-Patent Citations (2)
Title |
---|
JIYANG GAO ET.AL: "Cascaded Boundary Regression for Temporal Action Detection", 《ARXIV:1705.01180》 * |
TIANWEI LIN ET.AL: "Single Shot Temporal Action Detection", 《PROCEEDINGS OF THE 25TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》 * |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697434A (en) * | 2019-01-07 | 2019-04-30 | 腾讯科技(深圳)有限公司 | A kind of Activity recognition method, apparatus and storage medium |
CN109697434B (en) * | 2019-01-07 | 2021-01-08 | 腾讯科技(深圳)有限公司 | Behavior recognition method and device and storage medium |
CN109829398A (en) * | 2019-01-16 | 2019-05-31 | 北京航空航天大学 | A kind of object detection method in video based on Three dimensional convolution network |
CN109829398B (en) * | 2019-01-16 | 2020-03-31 | 北京航空航天大学 | Target detection method in video based on three-dimensional convolution network |
CN109816023A (en) * | 2019-01-29 | 2019-05-28 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating picture tag model |
CN110059584A (en) * | 2019-03-28 | 2019-07-26 | 中山大学 | A kind of event nomination method of the distribution of combination boundary and correction |
CN110059658A (en) * | 2019-04-26 | 2019-07-26 | 北京理工大学 | A kind of satellite-remote-sensing image multidate change detecting method based on Three dimensional convolution neural network |
CN110084202A (en) * | 2019-04-29 | 2019-08-02 | 东南大学 | A kind of video behavior recognition methods based on efficient Three dimensional convolution |
CN110222592A (en) * | 2019-05-16 | 2019-09-10 | 西安特种设备检验检测院 | A kind of construction method of the timing behavioral value network model generated based on complementary timing behavior motion |
CN110222592B (en) * | 2019-05-16 | 2023-01-17 | 西安特种设备检验检测院 | Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation |
CN110348345A (en) * | 2019-06-28 | 2019-10-18 | 西安交通大学 | A kind of Weakly supervised timing operating position fixing method based on continuity of movement |
CN110348345B (en) * | 2019-06-28 | 2021-08-13 | 西安交通大学 | Weak supervision time sequence action positioning method based on action consistency |
CN110610194A (en) * | 2019-08-13 | 2019-12-24 | 清华大学 | Data enhancement method for small data video classification task |
CN110633645A (en) * | 2019-08-19 | 2019-12-31 | 同济大学 | Video behavior detection method based on enhanced three-stream architecture |
CN110659572B (en) * | 2019-08-22 | 2022-08-12 | 南京理工大学 | Video motion detection method based on bidirectional feature pyramid |
CN110659572A (en) * | 2019-08-22 | 2020-01-07 | 南京理工大学 | Video motion detection method based on bidirectional feature pyramid |
CN110796069A (en) * | 2019-10-28 | 2020-02-14 | 广州博衍智能科技有限公司 | Behavior detection method, system, equipment and machine readable medium |
CN111259779A (en) * | 2020-01-13 | 2020-06-09 | 南京大学 | Video motion detection method based on central point trajectory prediction |
CN111259779B (en) * | 2020-01-13 | 2023-08-01 | 南京大学 | Video motion detection method based on center point track prediction |
CN111259783A (en) * | 2020-01-14 | 2020-06-09 | 深圳市奥拓电子股份有限公司 | Video behavior detection method and system, highlight video playback system and storage medium |
CN111325097A (en) * | 2020-01-22 | 2020-06-23 | 陕西师范大学 | Enhanced single-stage decoupled time sequence action positioning method |
CN111553238A (en) * | 2020-04-23 | 2020-08-18 | 北京大学深圳研究生院 | Regression classification module and method for time axis positioning of actions |
CN111814588B (en) * | 2020-06-18 | 2023-08-01 | 浙江大华技术股份有限公司 | Behavior detection method, related equipment and device |
CN111814588A (en) * | 2020-06-18 | 2020-10-23 | 浙江大华技术股份有限公司 | Behavior detection method and related equipment and device |
CN111898461A (en) * | 2020-07-08 | 2020-11-06 | 贵州大学 | Time sequence behavior segment generation method |
CN111832479A (en) * | 2020-07-14 | 2020-10-27 | 西安电子科技大学 | Video target detection method based on improved self-adaptive anchor R-CNN |
CN111832479B (en) * | 2020-07-14 | 2023-08-01 | 西安电子科技大学 | Video target detection method based on improved self-adaptive anchor point R-CNN |
CN113033500A (en) * | 2021-05-06 | 2021-06-25 | 成都考拉悠然科技有限公司 | Motion segment detection method, model training method and device |
CN113033500B (en) * | 2021-05-06 | 2021-12-03 | 成都考拉悠然科技有限公司 | Motion segment detection method, model training method and device |
CN113505266A (en) * | 2021-07-09 | 2021-10-15 | 南京邮电大学 | Two-stage anchor-based dynamic video abstraction method |
CN113505266B (en) * | 2021-07-09 | 2023-09-26 | 南京邮电大学 | Two-stage anchor-based dynamic video abstraction method |
CN114339403A (en) * | 2021-12-31 | 2022-04-12 | 西安交通大学 | Video action fragment generation method, system, equipment and readable storage medium |
CN114882403A (en) * | 2022-05-05 | 2022-08-09 | 杭州电子科技大学 | Video space-time action positioning method based on progressive attention hypergraph |
CN114882403B (en) * | 2022-05-05 | 2022-12-02 | 杭州电子科技大学 | Video space-time action positioning method based on progressive attention hypergraph |
CN116996661A (en) * | 2023-09-27 | 2023-11-03 | 中国科学技术大学 | Three-dimensional video display method, device, equipment and medium |
CN116996661B (en) * | 2023-09-27 | 2024-01-05 | 中国科学技术大学 | Three-dimensional video display method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN108805083B (en) | 2022-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108805083A (en) | Single-stage video behavior detection method | |
CN111611878B (en) | Method for crowd counting and future people flow prediction based on video image | |
Sankaranarayanan et al. | Learning from synthetic data: Addressing domain shift for semantic segmentation | |
Reda et al. | Unsupervised video interpolation using cycle consistency | |
Jiang et al. | Density-aware multi-task learning for crowd counting | |
CN105512289B (en) | Image search method based on deep learning and Hash | |
CN108399380A (en) | A kind of video actions detection method based on Three dimensional convolution and Faster RCNN | |
CN109993095B (en) | Frame level feature aggregation method for video target detection | |
CN110889343B (en) | Crowd density estimation method and device based on attention type deep neural network | |
CN110852267B (en) | Crowd density estimation method and device based on optical flow fusion type deep neural network | |
CN111860693A (en) | Lightweight visual target detection method and system | |
CN109272509A (en) | A kind of object detection method of consecutive image, device, equipment and storage medium | |
CN109697434A (en) | A kind of Activity recognition method, apparatus and storage medium | |
CN110097115B (en) | Video salient object detection method based on attention transfer mechanism | |
CN111027377B (en) | Double-flow neural network time sequence action positioning method | |
Hua et al. | Depth estimation with convolutional conditional random field network | |
CN106815563B (en) | Human body apparent structure-based crowd quantity prediction method | |
CN110532959B (en) | Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network | |
CN114782737A (en) | Image classification method, device and storage medium based on improved residual error network | |
Desai et al. | Next frame prediction using ConvLSTM | |
CN104077742A (en) | GABOR characteristic based face sketch synthetic method and system | |
Chen et al. | Salbinet360: Saliency prediction on 360 images with local-global bifurcated deep network | |
CN112836755B (en) | Sample image generation method and system based on deep learning | |
CN110163103A (en) | A kind of live pig Activity recognition method and apparatus based on video image | |
Zhong et al. | EST-TSANet: Video-Based Remote Heart Rate Measurement Using Temporal Shift Attention Network and ESTmap |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||