CN108805083A - Single-stage video behavior detection method - Google Patents

Single-stage video behavior detection method

Info

Publication number
CN108805083A
CN108805083A (application CN201810607804.6A)
Authority
CN
China
Prior art keywords
behavior
video
training
multi-scale
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810607804.6A
Other languages
Chinese (zh)
Other versions
CN108805083B (en)
Inventor
王子磊
刘志康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810607804.6A priority Critical patent/CN108805083B/en
Publication of CN108805083A publication Critical patent/CN108805083A/en
Application granted granted Critical
Publication of CN108805083B publication Critical patent/CN108805083B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a single-stage video behavior detection method, comprising: in the training stage, building a multi-scale behavior segment regression network based on convolutional neural networks; taking training videos and frame-level ground-truth behavior labels as input, and training the multi-scale behavior segment regression network with an end-to-end multi-task learning optimization method to obtain a trained multi-scale behavior segment regression network model; in the deployment stage, when a new video is input, generating input frame sequences of the same length as the training videos by a sliding window along the time dimension, and using the trained multi-scale behavior segment regression network model to predict the behavior category and corresponding temporal location of each input frame sequence; and applying non-maximum suppression to the predictions to generate the final behavior detection results. The method improves both detection performance and detection efficiency.

Description

Single-stage video behavior detection method
Technical field
The present invention relates to the field of video behavior detection, and more particularly to a single-stage video behavior detection method.
Background technology
In recent years, video capture devices (e.g., smartphones, digital cameras, surveillance cameras) have spread rapidly, letting anyone shoot video easily; modern communication equipment makes videos ever easier to acquire and distribute, and video has become an important information carrier in modern society. With the continuously growing demand for computer intelligence and the rapid development of pattern recognition, image processing, and artificial intelligence technology, analyzing video content with computer vision techniques has enormous practical demand and high commercial value. Human activity is usually the information subject of a video, so detecting human behavior in videos is of great significance for video understanding. The video human behavior detection task is to detect, in an untrimmed long video, the category of every human behavior instance the video contains while localizing the time at which each behavior instance occurs. Since most surveillance videos and Internet videos are untrimmed long videos, performing detection on long videos better matches practical needs.
With the development of deep learning, the field of video behavior detection has achieved some research results. However, the field is still in an early stage of development: current video behavior detection methods are often immature, and problems such as overly complex models, excessive computational cost, and low behavior localization accuracy are widespread. To meet the needs of practical applications, new video behavior detection frameworks and methods are urgently needed.
Research on the video behavior detection task is still scarce, and the proposed methods usually follow a multi-stage detection framework: in the first stage, a proposal technique generates candidate temporal windows with high recall in the video, or an additional feature extraction technique produces discriminative behavior features; in the next stage, these candidate windows or behavior features are classified to obtain the behavior category prediction. The patent 《A motion detection model based on convolutional neural networks》 uses a two-stage method: a Faster RCNN network first generates window-of-interest proposals and extracts behavior features from video frames and optical flow maps, and an independent SVM classifier then classifies the behavior features. In the patent 《A video action detection method based on convolutional neural networks》, the first stage segments the untrimmed video with dense multi-scale sliding windows and recognizes each window with a convolutional neural network equipped with a spatio-temporal pyramid layer, and the next stage screens and merges the recognition results of all windows to obtain the final detected video segments. The paper 《Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs》 proposes a behavior detection method based on segment-wise 3D convolutional neural networks: one 3D convolutional network first generates behavior-instance proposals from sliding windows, and another 3D convolutional network then classifies the proposals. The paper 《Cascaded Boundary Regression for Temporal Action Detection》 uses a two-stage behavior detection framework that applies temporal boundary regression to further refine the boundaries of sliding-window proposals. The paper 《Single Shot Temporal Action Detection》 proposes a single-shot behavior classifier for behavior detection, but it relies on separate two-stream neural networks (two-stream ConvNets) to extract appearance and motion features.
However, these multi-stage methods treat feature extraction, sliding-window proposal, and behavior classification as independent processing stages that cannot be trained jointly, which hinders the coordination and joint optimization of the behavior detection model; meanwhile, a large amount of computation is repeated across stages, hurting the computational efficiency of the algorithm.
Summary of the invention
The object of the present invention is to provide a single-stage video behavior detection method that improves both detection performance and detection efficiency.
The object of the present invention is achieved through the following technical solution:
A single-stage video behavior detection method, comprising:
In the training stage, a multi-scale behavior segment regression network is built based on convolutional neural networks; taking training videos and frame-level ground-truth behavior labels as input, the multi-scale behavior segment regression network is trained with an end-to-end multi-task learning optimization method to obtain a trained multi-scale behavior segment regression network model;
In the deployment stage, when a new video is input, input frame sequences of the same length as the training videos are generated by a sliding window along the time dimension; the trained multi-scale behavior segment regression network model is used to predict the behavior category and corresponding temporal location of each input frame sequence; non-maximum suppression is then applied to the predictions to generate the final behavior detection results.
As can be seen from the above technical solution, first, the constructed multi-scale behavior segment regression network completely eliminates the temporal proposal stage and the additional feature extraction stage of traditional behavior detection methods; all computation for detecting behavior instances in untrimmed long videos is completed within a single convolutional neural network, which can be jointly trained and optimized end to end as a whole, reaching higher detection performance. Second, the simplified network structure allows the vast majority of the computation to run in parallel, greatly improving the efficiency of behavior detection.
Description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a single-stage video behavior detection method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the in-video behavior detection process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the overall structure of the multi-scale behavior segment regression network provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of output results on the THUMOS'14 dataset provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To solve the problems of existing video behavior detection methods, namely complex models, low detection accuracy, and slow processing, the embodiment of the present invention provides a single-stage video behavior detection method. First, to improve computational efficiency, the method encapsulates all computation into one network and completes the behavior detection task within a single-stage convolutional neural network. Second, to improve behavior localization accuracy, the method performs multi-scale position regression on multi-scale network feature maps, flexibly detecting human behaviors of various temporal lengths and outputting frame-level temporal behavior boundaries together with behavior categories. Finally, to let all parts of the network be optimized jointly, the method processes the input video in a single network, making the whole network trainable end to end.
As shown in Fig. 1, the flowchart of the single-stage video behavior detection method provided by the embodiment of the present invention mainly includes:
1. In the training stage, a multi-scale behavior segment regression network is built based on convolutional neural networks; taking training videos and frame-level ground-truth behavior labels as input, the multi-scale behavior segment regression network is trained with the end-to-end multi-task learning optimization method, yielding the trained multi-scale behavior segment regression network model.
In the embodiment of the present invention, the behavior detection task is completed by a single-stage convolutional neural network, and network feature maps of different scales are associated with anchored behavior instances of different temporal lengths, enabling the network to flexibly detect human behaviors of various temporal lengths. The method is mainly divided into the following parts:
1) Building the multi-scale behavior segment regression network based on convolutional neural networks.
In the embodiment of the present invention, the constructed multi-scale behavior segment regression network comprises a base generalization module, a behavior instance anchoring module, and a behavior prediction module, wherein:
A. The base generalization module comprises N1 (e.g., N1 = 5) 3D convolution layers and N2 (e.g., N2 = 5) 3D max-pooling layers arranged alternately; it generalizes the features of the input video sequence and enlarges the receptive field.
B. The behavior instance anchoring module uses a 3D convolutional network of N3 (e.g., N3 = 4) layers with stride s1 (e.g., s1 = 2) on the time dimension and stride s2 (e.g., s2 = 1) on the spatial dimensions; for the anchor feature map output by each 3D convolution layer of this module, each cell is associated with anchored behavior instances of different temporal lengths.
In the embodiment of the present invention, in the behavior instance anchoring module, a base temporal scale $s_k$, $k \in [1, N_3]$, is defined for each anchor feature map, with the $s_k$ regularly distributed over $[0, 1]$; a group of scale ratios $\{r_d\}$, $d \in [1, D_k]$, is defined for each anchor feature map, $D_k$ being the number of scale ratios. Denoting the size of an anchor feature map as h × w × t, where h, w, and t are its height, width, and length, each cell of size h × w × 3 on the map is associated with $D_k$ anchored behavior instances; the temporal length of each instance is $l_d = s_k \cdot r_d$, $d \in [1, D_k]$, and its center is the cell center.
C. For each cell of an anchor feature map, the behavior prediction module applies $D_k(m + 2)$ convolution kernels of size h × w × 3, outputting for the cell's $D_k$ anchored behavior instances the prediction scores over m behavior categories and two temporal offsets.
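For illustration, the following is a minimal PyTorch sketch of these three modules under the example settings above (N1 = N2 = 5, N3 = 4, s1 = 2, s2 = 1, three anchored instances per cell); the class name MSBehaviorNet and the exact channel widths are illustrative assumptions, not taken from the patent. Note that the example below lists 512 kernels for the last two base convolution layers while describing all anchor feature maps, including F5, as 256-channel, so this sketch uses 256 for the final base layer.

```python
import torch
import torch.nn as nn

class MSBehaviorNet(nn.Module):
    """Sketch: base generalization module -> anchoring layers -> prediction heads."""
    def __init__(self, num_classes=20, anchors_per_cell=3):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 256]  # final width assumed, see lead-in
        layers = []
        for i in range(5):  # N1 = N2 = 5 alternating conv / max-pool layers
            layers += [nn.Conv3d(chans[i], chans[i + 1], 3, padding=1),
                       nn.ReLU(inplace=True),
                       # first three pools halve time; the last two keep it
                       nn.MaxPool3d((2, 2, 2) if i < 3 else (1, 2, 2))]
        self.base = nn.Sequential(*layers)          # output F5: (256, 24, 3, 3)
        self.anchor_layers = nn.ModuleList([        # temporal stride 2, spatial 1
            nn.Conv3d(256, 256, 3, stride=(2, 1, 1), padding=1),          # F6: t=12
            nn.Conv3d(256, 256, 3, stride=(2, 1, 1), padding=1),          # F7: t=6
            nn.Conv3d(256, 256, 3, stride=(2, 1, 1), padding=1),          # F8: t=3
            nn.Conv3d(256, 256, 3, stride=(2, 1, 1), padding=(0, 1, 1)),  # F9: t=1
        ])
        # per cell: anchors * (m classes + 1 background + 2 temporal offsets)
        out_ch = anchors_per_cell * (num_classes + 1 + 2)
        self.heads = nn.ModuleList(
            [nn.Conv3d(256, out_ch, 3, padding=(1, 0, 0)) for _ in range(5)])

    def forward(self, x):               # x: (batch, 3, T=192, H=96, W=96)
        f = self.base(x)
        maps = [f]
        for layer in self.anchor_layers:
            f = torch.relu(layer(f))
            maps.append(f)
        # each head output: (batch, out_ch, t_k, 1, 1)
        return [head(m) for head, m in zip(self.heads, maps)]
```

Feeding a (1, 3, 192, 96, 96) tensor through this sketch yields five outputs whose temporal lengths are 24, 12, 6, 3, and 1, matching the anchor feature maps F5 to F9 described in the example below.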
2) End-to-end optimization with multi-task learning.
In the embodiment of the present invention, training videos and frame-level ground-truth behavior labels are taken as input, and the parameters of the multi-scale behavior segment regression network are trained by gradient descent under a joint training objective. The training process is as follows:
A. Video frames are extracted from the training videos at a fixed frame rate (e.g., 10 frames per second) to obtain training frame sequences, and every frame is resized to a uniform resolution; equal-length sub-sequences obtained by sliding a window over the sequence are then used as input frame sequences, the length of each window being the maximum number of frames allowed by GPU memory (e.g., 192 frames).
B. A positive-sample matching strategy establishes the correspondence between anchored behavior instances and the ground-truth behavior instances in the labels, as shown in the sketch below.
In the embodiment of the present invention, for each training sample, the overlap (Intersection-over-Union, IoU) on the time dimension between every anchored behavior instance and every ground-truth behavior instance is computed; if the overlap exceeds a fixed threshold (e.g., 0.5), the corresponding anchored behavior instance is taken as a positive sample, and otherwise as a negative sample. Note that one ground-truth instance may match multiple anchored behavior instances.
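This temporal-overlap matching can be sketched as follows; segments are (start, end) pairs on the time axis, the 0.5 threshold follows the text, and the function names are illustrative.

```python
def temporal_iou(a, b):
    """IoU between two temporal segments a = (start, end), b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, gt_segments, thresh=0.5):
    """Return, for each anchored instance, the index of its best-overlapping
    ground-truth segment if the IoU exceeds thresh, else -1 (negative sample).
    One ground-truth segment may match many anchors."""
    labels = []
    for a in anchors:
        ious = [temporal_iou(a, g) for g in gt_segments]
        best = max(ious, default=0.0)
        labels.append(ious.index(best) if best > thresh else -1)
    return labels
```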
C. The multi-task loss function serves as the training objective of the multi-scale behavior segment regression network; training proceeds by stochastic gradient descent, iteratively generating the final multi-scale behavior segment regression network model.
In the embodiment of the present invention, the multi-task loss function means that the training objective $L_{loss}$ of the network jointly combines the behavior classification loss and the temporal position regression loss, expressed as:

$$L_{loss} = \frac{1}{N_{cls}}\sum_{i} L_{cls}(p_i) + \alpha\,\frac{1}{N_{pos}}\sum_{i} L_{loc}(t_i, t_i^{*}) + \beta\,L_2(\Theta)$$

In the above formula, $L_2(\Theta)$ is the L2 regularization loss and $\Theta$ denotes all learned parameters of the multi-scale behavior segment regression network; $\alpha$ and $\beta$ are loss trade-off parameters controlling the temporal offset loss and the regularization loss, respectively; $N_{cls}$ and $N_{pos}$ are the numbers of total training samples and of positive samples; $p_i = (p_i^1, \ldots, p_i^m)$ is the prediction vector over behavior categories at the $i$-th anchor position, the superscript $j$ indexing the $j$-th of $m$ behavior categories in total; $p_i^g$ denotes the prediction score of the $i$-th anchor position for the ground-truth category, the superscript $g$ indicating that the $g$-th behavior category is the true label; $t_i$ is the temporal offset of the anchored instance, and $t_i^{*}$ is the coordinate transform of the ground-truth position with respect to the anchor position.
$L_{cls}$ is the behavior classification loss, set to the multi-class softmax loss:

$$L_{cls}(p_i) = -\log\frac{\exp(p_i^{g})}{\sum_{j=1}^{m}\exp(p_i^{j})}$$

$L_{loc}$ is the temporal position regression loss, set to the smooth L1 loss of the temporal offsets.
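Under these definitions, one plausible PyTorch realization of the multi-task objective is sketched below (softmax cross-entropy for classification, smooth L1 for the temporal offsets, and an explicit L2 term); the tensor layout and the function name are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_scores, offsets, labels, gt_offsets, params,
                   alpha=1.0, beta=1e-4):
    """cls_scores: (N, m+1) scores for N anchors (index m = background);
    labels: (N,) class targets in [0, m]; offsets, gt_offsets: (N, 2);
    params: iterable of model parameters for the L2 term."""
    n_cls = cls_scores.shape[0]
    background = cls_scores.shape[1] - 1
    pos = labels != background                 # positive anchors only
    n_pos = pos.sum().clamp(min=1)
    l_cls = F.cross_entropy(cls_scores, labels, reduction='sum') / n_cls
    l_loc = F.smooth_l1_loss(offsets[pos], gt_offsets[pos],
                             reduction='sum') / n_pos
    l2 = sum((p ** 2).sum() for p in params)
    return l_cls + alpha * l_loc + beta * l2
```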
2. In the deployment stage, when a new video is input, input frame sequences of the same length as the training videos are generated by a sliding window along the time dimension; the trained multi-scale behavior segment regression network model is used to predict the behavior category and corresponding temporal location of each input frame sequence; non-maximum suppression is then applied to the predictions to generate the final behavior detection results.
With the above scheme of the embodiment of the present invention, the constructed multi-scale behavior segment regression network completely eliminates the temporal proposal stage and the additional feature extraction stage of traditional behavior detection methods, completing within a single convolutional neural network all computation for detecting behavior instances in untrimmed videos; the whole network can be jointly trained and optimized end to end, reaching higher detection performance. Moreover, the network adopts a fully convolutional design that simplifies the structure and allows the vast majority of the computation to run in parallel, greatly improving the efficiency of behavior detection; in particular, with GPU parallel acceleration it reaches a faster detection speed than any method reported so far.
For ease of understanding, the method is illustrated below with a specific example.
The overall flow of this example is similar to the previous embodiment: in the training stage, the multi-scale behavior segment regression network is first built; it completes all behavior detection computation in a single stage and eliminates all other extra links, and since the network is fully convolutional, the vast majority of the computation can be accelerated in parallel, greatly improving computational efficiency. Then video frame sequences are extracted from the training videos, and the frame sequences produced from them by the sliding-window method, together with the frame-level ground-truth behavior labels, serve as the input of the multi-scale behavior segment regression network; the network parameters are trained with the end-to-end multi-task learning optimization method, generating the network model. In the deployment stage, frames are extracted from the newly input video, sequences of the same length as the training input sequences are generated in sliding-window fashion and fed into the trained multi-scale behavior segment regression network for behavior detection; non-maximum suppression (NMS) is then applied to the outputs to generate the final behavior detection results.
The specific detection flow is shown in Fig. 2. The input is a sequence $R^{3 \times H \times W \times T}$ composed of the RGB video frames extracted from the video, where T, H, and W are the length, height, and width of the input sequence and the number of RGB channels is 3. The input sequence passes in turn through the base generalization module, the behavior instance anchoring module, and the behavior prediction module, which output the classification categories and position offsets of the anchored behavior instances; the temporal location where each behavior occurs is computed from the position offsets. t, h, and w denote the length, height, and width of an anchor feature map, and m denotes the number of behavior categories.
The video data used in this implementation comes from THUMOS'14, a representative human behavior recognition dataset, provided by (THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014).
The example is introduced below from three aspects: building the multi-scale behavior segment regression network, end-to-end optimization training with multi-task learning, and testing and evaluation.
1. Building the multi-scale behavior segment regression network.
Fig. 3 shows the schematic diagram of the multi-scale behavior segment regression network provided by this example. It mainly comprises the base generalization module, the behavior instance anchoring module, and the behavior prediction module, wherein:
The base generalization module consists of five 3D convolution layers and five 3D max-pooling layers. The 3D convolution kernels have size 3 × 3 × 3; the first three layers contain 64, 128, and 256 kernels respectively, and the remaining two layers contain 512 kernels each. The kernel sizes of the first three 3D pooling layers are set to 2 × 2 × 2, and those of the remaining pooling layers to 1 × 2 × 2. The output feature map of the fifth max-pooling layer is denoted F5; its height and width are h = w = 3 and its time dimension is t = 24. Every convolution output is passed through a rectified linear unit (ReLU) activation function, adding nonlinear mapping capacity to the network.
The behavior instance anchoring module uses 3D convolution layers; here each 3D convolution layer in this module is called an anchoring layer, and the output of each layer is called an anchor feature map. In all anchoring layers the convolution kernel size is set to 3 × 3 × 3, and the number of kernels per layer along the time dimension is 256. The anchor feature maps output by the anchoring layers are denoted F6, F7, F8, and F9, with sizes (256 × 12 × 3 × 3), (256 × 6 × 3 × 3), (256 × 3 × 3 × 3), and (256 × 1 × 3 × 3) respectively. The output F5 of the last layer of the base generalization module, of size (256 × 24 × 3 × 3), is also used as an anchor feature map. The second dimension of each anchor feature map is padded at both ends, so each cell on each anchor feature map has size (256 × 3 × 3 × 3). The base scales of anchor feature maps F5, F6, F7, F8, and F9 are set to {0.1, 0.3, 0.5, 0.7, 0.9} in turn. The scale ratios of F5, F6, F7, and F8 are set to {0.8, 1, 1.5}, and those of F9 to {0.7, 0.85, 1}. With three scale ratios per anchor feature map, each cell (256 × 3 × 3 × 3) on each anchor feature map is associated with three anchored behavior instances; the length of each anchored behavior instance is the base scale of its anchor feature map multiplied by the corresponding scale ratio, and the center of each anchored behavior instance is the center of the corresponding cell.
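As a concrete illustration of this configuration, the anchored instances implied by the base scales and scale ratios can be enumerated as follows (a sketch; times are normalized to [0, 1] over the input window, and the dictionary layout is an assumption):

```python
# (base scale s_k, scale ratios r_d, temporal length of the map), per the example
ANCHOR_CONFIG = {
    'F5': (0.1, (0.8, 1.0, 1.5), 24),
    'F6': (0.3, (0.8, 1.0, 1.5), 12),
    'F7': (0.5, (0.8, 1.0, 1.5), 6),
    'F8': (0.7, (0.8, 1.0, 1.5), 3),
    'F9': (0.9, (0.7, 0.85, 1.0), 1),
}

def enumerate_anchors(config=ANCHOR_CONFIG):
    """Yield (map, start, end) in normalized time: length = s_k * r_d,
    centered on the cell center along the map's time axis."""
    for name, (s_k, ratios, t_len) in config.items():
        for cell in range(t_len):
            center = (cell + 0.5) / t_len
            for r in ratios:
                half = s_k * r / 2.0
                yield name, max(0.0, center - half), min(1.0, center + half)
```

F5's 24 cells thus carry short anchors (lengths 0.08 to 0.15 of the window) while F9's single cell carries long anchors covering 0.63 to 0.9 of it, which is how the network covers behaviors of widely varying temporal extents.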
The behavior prediction module predicts the position offsets and behavior categories of the corresponding anchor positions with 3D convolution. Taking THUMOS'14 as an example, each cell is associated with three anchored behavior instances and must predict 20 behavior category scores, 1 background category score, and 2 temporal offsets (start and end), so on each cell of F5, F6, F7, F8, and F9, 3 × (20 + 1 + 2) = 69 convolution kernels of size 3 × 3 × 3 are applied; the outputs correspond, for the cell's three anchored behavior instances, to the 20 behavior category scores, the 1 background score, and the 2 temporal offsets.
2. End-to-end optimization training with multi-task learning.
Since GPU memory is limited, a complete long video cannot be input at once, so the video must be processed to generate suitable inputs. Therefore, in the training stage, video frames are first extracted from the training videos of THUMOS'14 at a frame rate of 10 frames per second, and every frame is resized to 96 × 96, i.e., H = W = 96. Then a window is slid over the video frame sequence to generate sequences of T = 192 consecutive frames as input frame sequences.
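A sketch of this input preparation step under the stated settings (10 fps, 96 × 96 frames, T = 192); OpenCV is used here as one plausible way to decode frames, and the function name is an illustrative assumption.

```python
import cv2

def make_input_sequences(video_path, fps=10, size=96, win=192, stride=192):
    """Decode frames at roughly `fps`, resize them to size x size, and cut the
    resulting sequence into sliding windows of `win` consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    native = cap.get(cv2.CAP_PROP_FPS) or fps
    step = max(1, round(native / fps))     # keep every step-th decoded frame
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(cv2.resize(frame, (size, size)))
        i += 1
    cap.release()
    return [frames[s:s + win] for s in range(0, len(frames) - win + 1, stride)]
```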
After the input frame sequences are generated, since each sequence contains both behavior segments and background segments, the specific positive and negative samples must be further determined. The specific method is: for each input frame sequence, compute the overlap on the time dimension between every anchored behavior instance in the multi-scale behavior segment regression network and the corresponding ground-truth instance; if the overlap exceeds the threshold 0.5, take the anchored instance as a positive sample, and otherwise as a negative sample. One ground-truth instance may match multiple anchored behavior instances, but each anchored behavior instance can match only one ground-truth instance.
With the positive and negative samples determined, the multi-task loss function serves as the training objective of the network, and the parameters of the multi-scale behavior segment regression network are trained by stochastic gradient descent. The multi-task loss function is defined as:

$$L_{loss} = \frac{1}{N_{cls}}\sum_{i} L_{cls}(p_i) + \alpha\,\frac{1}{N_{pos}}\sum_{i} L_{loc}(t_i, t_i^{*}) + \beta\,L_2(\Theta)$$

In the above formula, $L_2(\Theta)$ is the L2 regularization loss and $\Theta$ denotes all learned parameters of the multi-scale behavior segment regression network; $\alpha$ and $\beta$ are loss trade-off parameters controlling the temporal offset loss and the regularization loss, respectively; $N_{cls}$ and $N_{pos}$ are the numbers of total training samples and of positive samples; $p_i = (p_i^1, \ldots, p_i^m)$ is the prediction vector over behavior categories at the $i$-th anchor position, the superscript $j$ indexing the $j$-th of $m$ behavior categories in total; $p_i^g$ denotes the prediction score of the $i$-th anchor position for the ground-truth category, the superscript $g$ indicating that the $g$-th behavior category is the true label; $t_i$ is the temporal offset of the anchored instance, and $t_i^{*}$ is the coordinate transform of the ground-truth position with respect to the anchor position.
$L_{cls}$ is the behavior classification loss, set to the multi-class softmax loss:

$$L_{cls}(p_i) = -\log\frac{\exp(p_i^{g})}{\sum_{j=1}^{m}\exp(p_i^{j})}$$

$L_{loc}$ is the temporal position regression loss, set to the smooth L1 loss of the temporal offsets.
3. Testing and evaluation.
After the multi-scale behavior segment regression network has been trained on THUMOS'14, its performance is assessed as follows. On the test video set of THUMOS'14, for each video, the video frame sequence is extracted at 10 frames per second and a window is slid with a step of 192 frames, generating test video frame sequences of the same length as the training sequences; these are fed into the trained multi-scale behavior segment regression network to obtain the predicted behavior categories and corresponding temporal locations. Non-maximum suppression (NMS) is then applied to the outputs to generate the final behavior detection results. Finally, the behavior detection results are compared with the ground-truth behavior labels of the test set to obtain the evaluation results of the network. Fig. 4 shows a schematic diagram of behavior detection results on the test video set of THUMOS'14.
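The NMS step can be sketched as standard 1-D non-maximum suppression over (start, end, score, class) detections, reusing the temporal_iou helper shown earlier; the 0.4 overlap threshold is an assumed typical value, not one specified in the text.

```python
def temporal_nms(dets, iou_thresh=0.4):
    """dets: list of (start, end, score, cls). Greedily keep the highest-scoring
    detection and drop same-class detections that overlap it above the threshold."""
    kept = []
    for d in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(k[3] != d[3] or temporal_iou(d[:2], k[:2]) <= iou_thresh
               for k in kept):
            kept.append(d)
    return kept
```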
Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes instructions that cause a computing device (a personal computer, server, network device, or the like) to execute the methods described in the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A single-stage video behavior detection method, characterized by comprising:
in the training stage, building a multi-scale behavior segment regression network based on convolutional neural networks; taking training videos and frame-level ground-truth behavior labels as input, and training the multi-scale behavior segment regression network with an end-to-end multi-task learning optimization method to obtain a trained multi-scale behavior segment regression network model;
in the deployment stage, when a new video is input, generating input frame sequences of the same length as the training videos by a sliding window along the time dimension; using the trained multi-scale behavior segment regression network model to predict the behavior category and corresponding temporal location of each input frame sequence; and applying non-maximum suppression to the predictions to generate the final behavior detection results.
2. The single-stage video behavior detection method according to claim 1, characterized in that the constructed multi-scale behavior segment regression network comprises a base generalization module, a behavior instance anchoring module, and a behavior prediction module, wherein:
the base generalization module comprises N1 3D convolution layers and N2 3D max-pooling layers arranged alternately, for generalizing the features of the input video sequence and enlarging the receptive field;
the behavior instance anchoring module uses a 3D convolutional network of N3 layers with stride s1 on the time dimension and stride s2 on the spatial dimensions, and associates each cell of the anchor feature map output by each 3D convolution layer of this module with anchored behavior instances of different temporal lengths;
the behavior prediction module applies, to each cell of an anchor feature map, $D_k(m + 2)$ convolution kernels of size h × w × 3, and outputs for the cell's $D_k$ anchored behavior instances the prediction scores over m behavior categories and two temporal offsets, where h and w denote the height and width of the anchor feature map.
3. The single-stage video behavior detection method according to claim 2, characterized in that, in the behavior instance anchoring module, a base temporal scale $s_k$, $k \in [1, N_3]$, is defined for each anchor feature map; a group of scale ratios $\{r_d\}$ is defined for each anchor feature map, $D_k$ being the number of scale ratios; denoting the size of an anchor feature map as h × w × t, t being its length, each cell of size h × w × 3 on the map is associated with $D_k$ anchored behavior instances, the temporal length of each instance being $l_d = s_k \cdot r_d$, $d \in [1, D_k]$, with its center at the cell center.
4. The single-stage video behavior detection method according to claim 2, characterized in that
training videos and frame-level ground-truth behavior labels are taken as input, and the parameters of the multi-scale behavior segment regression network are trained by gradient descent under a joint training objective, with the training process as follows:
extracting video frames from the training videos at a fixed frame rate to obtain training frame sequences, and resizing every frame to a uniform resolution; taking equal-length sub-sequences produced by a sliding window over the sequence as input frame sequences, the length of each window being the maximum number of frames allowed by GPU memory;
establishing the correspondence between anchored behavior instances and the ground-truth behavior instances in the labels with a positive-sample matching strategy;
taking the multi-task loss function as the training objective of the multi-scale behavior segment regression network, training by stochastic gradient descent, and iteratively generating the final multi-scale behavior segment regression network model.
5. The single-stage video behavior detection method according to claim 4, characterized in that establishing the correspondence between anchored behavior instances and the ground-truth behavior instances in the labels with the positive-sample matching strategy comprises:
in each training sample, computing the overlap on the time dimension between every anchored behavior instance and every ground-truth behavior instance; if the overlap exceeds a fixed threshold, taking the corresponding anchored behavior instance as a positive sample, and otherwise as a negative sample, wherein one ground-truth instance may match multiple anchored behavior instances.
6. The single-stage video behavior detection method according to claim 4, characterized in that the multi-task loss function means that the training objective $L_{loss}$ of the network jointly combines the behavior classification loss and the temporal position regression loss, expressed as:

$$L_{loss} = \frac{1}{N_{cls}}\sum_{i} L_{cls}(p_i) + \alpha\,\frac{1}{N_{pos}}\sum_{i} L_{loc}(t_i, t_i^{*}) + \beta\,L_2(\Theta)$$

In the above formula, $L_2(\Theta)$ is the L2 regularization loss and $\Theta$ denotes all learned parameters of the multi-scale behavior segment regression network; $\alpha$ and $\beta$ are loss trade-off parameters controlling the temporal offset loss and the regularization loss, respectively; $N_{cls}$ and $N_{pos}$ are the numbers of total training samples and of positive samples; $p_i = (p_i^1, \ldots, p_i^m)$ is the prediction vector over behavior categories at the $i$-th anchor position, the superscript $j$ indexing the $j$-th of $m$ behavior categories in total; $p_i^g$ denotes the prediction score of the $i$-th anchor position for the ground-truth category, the superscript $g$ indicating that the $g$-th behavior category is the true label; $t_i$ is the temporal offset of the anchored instance, and $t_i^{*}$ is the coordinate transform of the ground-truth position with respect to the anchor position;
$L_{cls}$ is the behavior classification loss, set to the multi-class softmax loss:

$$L_{cls}(p_i) = -\log\frac{\exp(p_i^{g})}{\sum_{j=1}^{m}\exp(p_i^{j})}$$

$L_{loc}$ is the temporal position regression loss, set to the smooth L1 loss of the temporal offsets.
CN201810607804.6A 2018-06-13 2018-06-13 Single-stage video behavior detection method Active CN108805083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810607804.6A CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810607804.6A CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Publications (2)

Publication Number Publication Date
CN108805083A true CN108805083A (en) 2018-11-13
CN108805083B CN108805083B (en) 2022-03-01

Family

ID=64085637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810607804.6A Active CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Country Status (1)

Country Link
CN (1) CN108805083B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164694A (en) * 2013-02-20 2013-06-19 上海交通大学 Method for recognizing human motion
CN105740773A (en) * 2016-01-25 2016-07-06 重庆理工大学 Deep learning and multi-scale information based behavior identification method
CN106407903A (en) * 2016-08-31 2017-02-15 四川瞳知科技有限公司 Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method
CN107729799A (en) * 2017-06-13 2018-02-23 银江股份有限公司 Crowd's abnormal behaviour vision-based detection and analyzing and alarming system based on depth convolutional neural networks
CN108133188A (en) * 2017-12-22 2018-06-08 武汉理工大学 A kind of Activity recognition method based on motion history image and convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jiyang Gao et al., "Cascaded Boundary Regression for Temporal Action Detection", arXiv:1705.01180 *
Tianwei Lin et al., "Single Shot Temporal Action Detection", Proceedings of the 25th ACM International Conference on Multimedia *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium
CN109697434B (en) * 2019-01-07 2021-01-08 腾讯科技(深圳)有限公司 Behavior recognition method and device and storage medium
CN109829398A (en) * 2019-01-16 2019-05-31 北京航空航天大学 A kind of object detection method in video based on Three dimensional convolution network
CN109829398B (en) * 2019-01-16 2020-03-31 北京航空航天大学 Target detection method in video based on three-dimensional convolution network
CN109816023A (en) * 2019-01-29 2019-05-28 北京字节跳动网络技术有限公司 Method and apparatus for generating picture tag model
CN110059584A (en) * 2019-03-28 2019-07-26 中山大学 A kind of event nomination method of the distribution of combination boundary and correction
CN110059658A (en) * 2019-04-26 2019-07-26 北京理工大学 A kind of satellite-remote-sensing image multidate change detecting method based on Three dimensional convolution neural network
CN110084202A (en) * 2019-04-29 2019-08-02 东南大学 A kind of video behavior recognition methods based on efficient Three dimensional convolution
CN110222592A (en) * 2019-05-16 2019-09-10 西安特种设备检验检测院 A kind of construction method of the timing behavioral value network model generated based on complementary timing behavior motion
CN110222592B (en) * 2019-05-16 2023-01-17 西安特种设备检验检测院 Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation
CN110348345A (en) * 2019-06-28 2019-10-18 西安交通大学 A kind of Weakly supervised timing operating position fixing method based on continuity of movement
CN110348345B (en) * 2019-06-28 2021-08-13 西安交通大学 Weak supervision time sequence action positioning method based on action consistency
CN110610194A (en) * 2019-08-13 2019-12-24 清华大学 Data enhancement method for small data video classification task
CN110633645A (en) * 2019-08-19 2019-12-31 同济大学 Video behavior detection method based on enhanced three-stream architecture
CN110659572B (en) * 2019-08-22 2022-08-12 南京理工大学 Video motion detection method based on bidirectional feature pyramid
CN110659572A (en) * 2019-08-22 2020-01-07 南京理工大学 Video motion detection method based on bidirectional feature pyramid
CN110796069A (en) * 2019-10-28 2020-02-14 广州博衍智能科技有限公司 Behavior detection method, system, equipment and machine readable medium
CN111259779A (en) * 2020-01-13 2020-06-09 南京大学 Video motion detection method based on central point trajectory prediction
CN111259779B (en) * 2020-01-13 2023-08-01 南京大学 Video motion detection method based on center point track prediction
CN111259783A (en) * 2020-01-14 2020-06-09 深圳市奥拓电子股份有限公司 Video behavior detection method and system, highlight video playback system and storage medium
CN111325097A (en) * 2020-01-22 2020-06-23 陕西师范大学 Enhanced single-stage decoupled time sequence action positioning method
CN111553238A (en) * 2020-04-23 2020-08-18 北京大学深圳研究生院 Regression classification module and method for time axis positioning of actions
CN111814588B (en) * 2020-06-18 2023-08-01 浙江大华技术股份有限公司 Behavior detection method, related equipment and device
CN111814588A (en) * 2020-06-18 2020-10-23 浙江大华技术股份有限公司 Behavior detection method and related equipment and device
CN111898461A (en) * 2020-07-08 2020-11-06 贵州大学 Time sequence behavior segment generation method
CN111832479A (en) * 2020-07-14 2020-10-27 西安电子科技大学 Video target detection method based on improved self-adaptive anchor R-CNN
CN111832479B (en) * 2020-07-14 2023-08-01 西安电子科技大学 Video target detection method based on improved self-adaptive anchor point R-CNN
CN113033500A (en) * 2021-05-06 2021-06-25 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN113033500B (en) * 2021-05-06 2021-12-03 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN113505266A (en) * 2021-07-09 2021-10-15 南京邮电大学 Two-stage anchor-based dynamic video abstraction method
CN113505266B (en) * 2021-07-09 2023-09-26 南京邮电大学 Two-stage anchor-based dynamic video abstraction method
CN114339403A (en) * 2021-12-31 2022-04-12 西安交通大学 Video action fragment generation method, system, equipment and readable storage medium
CN114882403A (en) * 2022-05-05 2022-08-09 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN114882403B (en) * 2022-05-05 2022-12-02 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN116996661A (en) * 2023-09-27 2023-11-03 中国科学技术大学 Three-dimensional video display method, device, equipment and medium
CN116996661B (en) * 2023-09-27 2024-01-05 中国科学技术大学 Three-dimensional video display method, device, equipment and medium

Also Published As

Publication number Publication date
CN108805083B (en) 2022-03-01

Similar Documents

Publication Publication Date Title
CN108805083A (en) The video behavior detection method of single phase
CN111611878B (en) Method for crowd counting and future people flow prediction based on video image
Sankaranarayanan et al. Learning from synthetic data: Addressing domain shift for semantic segmentation
Reda et al. Unsupervised video interpolation using cycle consistency
Jiang et al. Density-aware multi-task learning for crowd counting
CN105512289B (en) Image search method based on deep learning and Hash
CN108399380A (en) A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN109993095B (en) Frame level feature aggregation method for video target detection
CN110889343B (en) Crowd density estimation method and device based on attention type deep neural network
CN110852267B (en) Crowd density estimation method and device based on optical flow fusion type deep neural network
CN111860693A (en) Lightweight visual target detection method and system
CN109272509A (en) A kind of object detection method of consecutive image, device, equipment and storage medium
CN109697434A (en) A kind of Activity recognition method, apparatus and storage medium
CN110097115B (en) Video salient object detection method based on attention transfer mechanism
CN111027377B (en) Double-flow neural network time sequence action positioning method
Hua et al. Depth estimation with convolutional conditional random field network
CN106815563B (en) Human body apparent structure-based crowd quantity prediction method
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN114782737A (en) Image classification method, device and storage medium based on improved residual error network
Desai et al. Next frame prediction using ConvLSTM
CN104077742A (en) GABOR characteristic based face sketch synthetic method and system
Chen et al. Salbinet360: Saliency prediction on 360 images with local-global bifurcated deep network
CN112836755B (en) Sample image generation method and system based on deep learning
CN110163103A (en) A kind of live pig Activity recognition method and apparatus based on video image
Zhong et al. EST-TSANet: Video-Based Remote Heart Rate Measurement Using Temporal Shift Attention Network and ESTmap

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant