CN108805083A - Single-stage video behavior detection method - Google Patents

Single-stage video behavior detection method

Info

Publication number
CN108805083A
CN108805083A (application CN201810607804.6A)
Authority
CN
China
Prior art keywords
behavior
video
training
multi-scale
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810607804.6A
Other languages
Chinese (zh)
Other versions
CN108805083B (en)
Inventor
王子磊
刘志康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810607804.6A priority Critical patent/CN108805083B/en
Publication of CN108805083A publication Critical patent/CN108805083A/en
Application granted granted Critical
Publication of CN108805083B publication Critical patent/CN108805083B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a single-stage video behavior detection method, comprising: in the training stage, building a multi-scale behavior segment regression network based on convolutional neural networks; taking training videos and frame-level ground-truth behavior labels as input, and training the multi-scale behavior segment regression network with an end-to-end multi-task learning optimization method to obtain a trained multi-scale behavior segment regression network model; in the deployment stage, when a new video is input, generating input frame sequences of the same length as the training videos by a sliding window along the time dimension, and using the trained multi-scale behavior segment regression network model to predict the behavior category and corresponding temporal location of each input frame sequence; and applying non-maximum suppression to the predictions to generate the final behavior detection results. The method improves both detection performance and detection efficiency.

Description

Single-stage video behavior detection method
Technical field
The present invention relates to the field of video behavior detection, and more particularly to a single-stage video behavior detection method.
Background technology
In recent years, video capture devices (e.g., smartphones, digital cameras, surveillance cameras) have spread rapidly, letting anyone shoot video easily; modern communication equipment makes videos ever easier to acquire and distribute, and video has become an important information carrier in modern society. With the continuously growing demand for computer intelligence and the rapid development of pattern recognition, image processing, and artificial intelligence technology, analyzing video content with computer vision techniques has enormous practical demand and high commercial value. Human activity is usually the information subject of a video, so detecting human behavior in videos is of great significance for video understanding. The video human behavior detection task is to detect, in an untrimmed long video, the category of every human behavior instance the video contains while localizing the time at which each behavior instance occurs. Since most surveillance videos and Internet videos are untrimmed long videos, performing detection on long videos better matches practical needs.
With the development of deep learning, the field of video behavior detection has achieved some research results. However, the field is still in an early stage of development: current video behavior detection methods are often immature, and problems such as overly complex models, excessive computational cost, and low behavior localization accuracy are widespread. To meet the needs of practical applications, new video behavior detection frameworks and methods are urgently needed.
Research on the video behavior detection task is still scarce, and the proposed methods usually follow a multi-stage detection framework: in the first stage, a proposal technique generates candidate temporal windows with high recall in the video, or an additional feature extraction technique produces discriminative behavior features; in the next stage, these candidate windows or behavior features are classified to obtain the behavior category prediction. The patent 《A motion detection model based on convolutional neural networks》 uses a two-stage method: a Faster RCNN network first generates window-of-interest proposals and extracts behavior features from video frames and optical flow maps, and an independent SVM classifier then classifies the behavior features. In the patent 《A video action detection method based on convolutional neural networks》, the first stage segments the untrimmed video with dense multi-scale sliding windows and recognizes each window with a convolutional neural network equipped with a spatio-temporal pyramid layer, and the next stage screens and merges the recognition results of all windows to obtain the final detected video segments. The paper 《Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs》 proposes a behavior detection method based on segment-wise 3D convolutional neural networks: one 3D convolutional network first generates behavior-instance proposals from sliding windows, and another 3D convolutional network then classifies the proposals. The paper 《Cascaded Boundary Regression for Temporal Action Detection》 uses a two-stage behavior detection framework that applies temporal boundary regression to further refine the boundaries of sliding-window proposals. The paper 《Single Shot Temporal Action Detection》 proposes a single-shot behavior classifier for behavior detection, but it relies on separate two-stream neural networks (two-stream ConvNets) to extract appearance and motion features.
However, these multi-stage methods treat feature extraction, sliding-window proposal, and behavior classification as independent processing stages that cannot be trained jointly, which hinders the coordination and joint optimization of the behavior detection model; meanwhile, a large amount of computation is repeated across stages, hurting the computational efficiency of the algorithm.
Summary of the invention
The object of the present invention is to provide a single-stage video behavior detection method that improves both detection performance and detection efficiency.
The object of the present invention is achieved through the following technical solution:
A single-stage video behavior detection method, comprising:
In the training stage, a multi-scale behavior segment regression network is built based on convolutional neural networks; taking training videos and frame-level ground-truth behavior labels as input, the multi-scale behavior segment regression network is trained with an end-to-end multi-task learning optimization method to obtain a trained multi-scale behavior segment regression network model;
In the deployment stage, when a new video is input, input frame sequences of the same length as the training videos are generated by a sliding window along the time dimension; the trained multi-scale behavior segment regression network model is used to predict the behavior category and corresponding temporal location of each input frame sequence; non-maximum suppression is then applied to the predictions to generate the final behavior detection results.
As can be seen from the above technical solution, first, the constructed multi-scale behavior segment regression network completely eliminates the temporal proposal stage and the additional feature extraction stage of traditional behavior detection methods; all computation for detecting behavior instances in untrimmed long videos is completed within a single convolutional neural network, which can be jointly trained and optimized end to end as a whole, reaching higher detection performance. Second, the simplified network structure allows the vast majority of the computation to run in parallel, greatly improving the efficiency of behavior detection.
Description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a single-stage video behavior detection method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the in-video behavior detection process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the overall structure of the multi-scale behavior segment regression network provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of output results on the THUMOS'14 dataset provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To solve the problems of existing video behavior detection methods, namely complex models, low detection accuracy, and slow processing, the embodiment of the present invention provides a single-stage video behavior detection method. First, to improve computational efficiency, the method encapsulates all computation into one network and completes the behavior detection task within a single-stage convolutional neural network. Second, to improve behavior localization accuracy, the method performs multi-scale position regression on multi-scale network feature maps, flexibly detecting human behaviors of various temporal lengths and outputting frame-level temporal behavior boundaries together with behavior categories. Finally, to let all parts of the network be optimized jointly, the method processes the input video in a single network, making the whole network trainable end to end.
As shown in Fig. 1, the flowchart of the single-stage video behavior detection method provided by the embodiment of the present invention mainly includes:
1. In the training stage, a multi-scale behavior segment regression network is built based on convolutional neural networks; taking training videos and frame-level ground-truth behavior labels as input, the multi-scale behavior segment regression network is trained with the end-to-end multi-task learning optimization method, yielding the trained multi-scale behavior segment regression network model.
In the embodiment of the present invention, the behavior detection task is completed by a single-stage convolutional neural network, and network feature maps of different scales are associated with anchored behavior instances of different temporal lengths, enabling the network to flexibly detect human behaviors of various temporal lengths. The method is mainly divided into the following parts:
1) Building the multi-scale behavior segment regression network based on convolutional neural networks.
In the embodiment of the present invention, the constructed multi-scale behavior segment regression network comprises a base generalization module, a behavior instance anchoring module, and a behavior prediction module, wherein:
A. The base generalization module comprises N1 (e.g., N1 = 5) 3D convolution layers and N2 (e.g., N2 = 5) 3D max-pooling layers arranged alternately; it generalizes the features of the input video sequence and enlarges the receptive field.
B. The behavior instance anchoring module uses a 3D convolutional network of N3 (e.g., N3 = 4) layers with stride s1 (e.g., s1 = 2) on the time dimension and stride s2 (e.g., s2 = 1) on the spatial dimensions; for the anchor feature map output by each 3D convolution layer of this module, each cell is associated with anchored behavior instances of different temporal lengths.
In the embodiment of the present invention, in the behavior instance anchoring module, a base temporal scale $s_k$, $k \in [1, N_3]$, is defined for each anchor feature map, with the $s_k$ regularly distributed over $[0, 1]$; a group of scale ratios $\{r_d\}$, $d \in [1, D_k]$, is defined for each anchor feature map, $D_k$ being the number of scale ratios. Denoting the size of an anchor feature map as h × w × t, where h, w, and t are its height, width, and length, each cell of size h × w × 3 on the map is associated with $D_k$ anchored behavior instances; the temporal length of each instance is $l_d = s_k \cdot r_d$, $d \in [1, D_k]$, and its center is the cell center.
C. For each cell of an anchor feature map, the behavior prediction module applies $D_k(m + 2)$ convolution kernels of size h × w × 3, outputting for the cell's $D_k$ anchored behavior instances the prediction scores over m behavior categories and two temporal offsets.
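For illustration, the following is a minimal PyTorch sketch of these three modules under the example settings above (N1 = N2 = 5, N3 = 4, s1 = 2, s2 = 1, three anchored instances per cell); the class name MSBehaviorNet and the exact channel widths are illustrative assumptions, not taken from the patent. Note that the example below lists 512 kernels for the last two base convolution layers while describing all anchor feature maps, including F5, as 256-channel, so this sketch uses 256 for the final base layer.

```python
import torch
import torch.nn as nn

class MSBehaviorNet(nn.Module):
    """Sketch: base generalization module -> anchoring layers -> prediction heads."""
    def __init__(self, num_classes=20, anchors_per_cell=3):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 256]  # final width assumed, see lead-in
        layers = []
        for i in range(5):  # N1 = N2 = 5 alternating conv / max-pool layers
            layers += [nn.Conv3d(chans[i], chans[i + 1], 3, padding=1),
                       nn.ReLU(inplace=True),
                       # first three pools halve time; the last two keep it
                       nn.MaxPool3d((2, 2, 2) if i < 3 else (1, 2, 2))]
        self.base = nn.Sequential(*layers)          # output F5: (256, 24, 3, 3)
        self.anchor_layers = nn.ModuleList([        # temporal stride 2, spatial 1
            nn.Conv3d(256, 256, 3, stride=(2, 1, 1), padding=1),          # F6: t=12
            nn.Conv3d(256, 256, 3, stride=(2, 1, 1), padding=1),          # F7: t=6
            nn.Conv3d(256, 256, 3, stride=(2, 1, 1), padding=1),          # F8: t=3
            nn.Conv3d(256, 256, 3, stride=(2, 1, 1), padding=(0, 1, 1)),  # F9: t=1
        ])
        # per cell: anchors * (m classes + 1 background + 2 temporal offsets)
        out_ch = anchors_per_cell * (num_classes + 1 + 2)
        self.heads = nn.ModuleList(
            [nn.Conv3d(256, out_ch, 3, padding=(1, 0, 0)) for _ in range(5)])

    def forward(self, x):               # x: (batch, 3, T=192, H=96, W=96)
        f = self.base(x)
        maps = [f]
        for layer in self.anchor_layers:
            f = torch.relu(layer(f))
            maps.append(f)
        # each head output: (batch, out_ch, t_k, 1, 1)
        return [head(m) for head, m in zip(self.heads, maps)]
```

Feeding a (1, 3, 192, 96, 96) tensor through this sketch yields five outputs whose temporal lengths are 24, 12, 6, 3, and 1, matching the anchor feature maps F5 to F9 described in the example below.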
2) End-to-end optimization with multi-task learning.
In the embodiment of the present invention, training videos and frame-level ground-truth behavior labels are taken as input, and the parameters of the multi-scale behavior segment regression network are trained by gradient descent under a joint training objective. The training process is as follows:
A. Video frames are extracted from the training videos at a fixed frame rate (e.g., 10 frames per second) to obtain training frame sequences, and every frame is resized to a uniform resolution; equal-length sub-sequences obtained by sliding a window over the sequence are then used as input frame sequences, the length of each window being the maximum number of frames allowed by GPU memory (e.g., 192 frames).
B. A positive-sample matching strategy establishes the correspondence between anchored behavior instances and the ground-truth behavior instances in the labels, as shown in the sketch below.
In the embodiment of the present invention, for each training sample, the overlap (Intersection-over-Union, IoU) on the time dimension between every anchored behavior instance and every ground-truth behavior instance is computed; if the overlap exceeds a fixed threshold (e.g., 0.5), the corresponding anchored behavior instance is taken as a positive sample, and otherwise as a negative sample. Note that one ground-truth instance may match multiple anchored behavior instances.
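This temporal-overlap matching can be sketched as follows; segments are (start, end) pairs on the time axis, the 0.5 threshold follows the text, and the function names are illustrative.

```python
def temporal_iou(a, b):
    """IoU between two temporal segments a = (start, end), b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, gt_segments, thresh=0.5):
    """Return, for each anchored instance, the index of its best-overlapping
    ground-truth segment if the IoU exceeds thresh, else -1 (negative sample).
    One ground-truth segment may match many anchors."""
    labels = []
    for a in anchors:
        ious = [temporal_iou(a, g) for g in gt_segments]
        best = max(ious, default=0.0)
        labels.append(ious.index(best) if best > thresh else -1)
    return labels
```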
C. The multi-task loss function serves as the training objective of the multi-scale behavior segment regression network; training proceeds by stochastic gradient descent, iteratively generating the final multi-scale behavior segment regression network model.
In the embodiment of the present invention, the multi-task loss function means that the training objective $L_{loss}$ of the network jointly combines the behavior classification loss and the temporal position regression loss, expressed as:

$$L_{loss} = \frac{1}{N_{cls}}\sum_{i} L_{cls}(p_i) + \alpha\,\frac{1}{N_{pos}}\sum_{i} L_{loc}(t_i, t_i^{*}) + \beta\,L_2(\Theta)$$

In the above formula, $L_2(\Theta)$ is the L2 regularization loss and $\Theta$ denotes all learned parameters of the multi-scale behavior segment regression network; $\alpha$ and $\beta$ are loss trade-off parameters controlling the temporal offset loss and the regularization loss, respectively; $N_{cls}$ and $N_{pos}$ are the numbers of total training samples and of positive samples; $p_i = (p_i^1, \ldots, p_i^m)$ is the prediction vector over behavior categories at the $i$-th anchor position, the superscript $j$ indexing the $j$-th of $m$ behavior categories in total; $p_i^g$ denotes the prediction score of the $i$-th anchor position for the ground-truth category, the superscript $g$ indicating that the $g$-th behavior category is the true label; $t_i$ is the temporal offset of the anchored instance, and $t_i^{*}$ is the coordinate transform of the ground-truth position with respect to the anchor position.
$L_{cls}$ is the behavior classification loss, set to the multi-class softmax loss:

$$L_{cls}(p_i) = -\log\frac{\exp(p_i^{g})}{\sum_{j=1}^{m}\exp(p_i^{j})}$$

$L_{loc}$ is the temporal position regression loss, set to the smooth L1 loss of the temporal offsets.
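Under these definitions, one plausible PyTorch realization of the multi-task objective is sketched below (softmax cross-entropy for classification, smooth L1 for the temporal offsets, and an explicit L2 term); the tensor layout and the function name are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_scores, offsets, labels, gt_offsets, params,
                   alpha=1.0, beta=1e-4):
    """cls_scores: (N, m+1) scores for N anchors (index m = background);
    labels: (N,) class targets in [0, m]; offsets, gt_offsets: (N, 2);
    params: iterable of model parameters for the L2 term."""
    n_cls = cls_scores.shape[0]
    background = cls_scores.shape[1] - 1
    pos = labels != background                 # positive anchors only
    n_pos = pos.sum().clamp(min=1)
    l_cls = F.cross_entropy(cls_scores, labels, reduction='sum') / n_cls
    l_loc = F.smooth_l1_loss(offsets[pos], gt_offsets[pos],
                             reduction='sum') / n_pos
    l2 = sum((p ** 2).sum() for p in params)
    return l_cls + alpha * l_loc + beta * l2
```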
2. In the deployment stage, when a new video is input, input frame sequences of the same length as the training videos are generated by a sliding window along the time dimension; the trained multi-scale behavior segment regression network model is used to predict the behavior category and corresponding temporal location of each input frame sequence; non-maximum suppression is then applied to the predictions to generate the final behavior detection results.
With the above scheme of the embodiment of the present invention, the constructed multi-scale behavior segment regression network completely eliminates the temporal proposal stage and the additional feature extraction stage of traditional behavior detection methods, completing within a single convolutional neural network all computation for detecting behavior instances in untrimmed videos; the whole network can be jointly trained and optimized end to end, reaching higher detection performance. Moreover, the network adopts a fully convolutional design that simplifies the structure and allows the vast majority of the computation to run in parallel, greatly improving the efficiency of behavior detection; in particular, with GPU parallel acceleration it reaches a faster detection speed than any method reported so far.
For ease of understanding, the method is illustrated below with a specific example.
The overall flow of this example is similar to the previous embodiment: in the training stage, the multi-scale behavior segment regression network is first built; it completes all behavior detection computation in a single stage and eliminates all other extra links, and since the network is fully convolutional, the vast majority of the computation can be accelerated in parallel, greatly improving computational efficiency. Then video frame sequences are extracted from the training videos, and the frame sequences produced from them by the sliding-window method, together with the frame-level ground-truth behavior labels, serve as the input of the multi-scale behavior segment regression network; the network parameters are trained with the end-to-end multi-task learning optimization method, generating the network model. In the deployment stage, frames are extracted from the newly input video, sequences of the same length as the training input sequences are generated in sliding-window fashion and fed into the trained multi-scale behavior segment regression network for behavior detection; non-maximum suppression (NMS) is then applied to the outputs to generate the final behavior detection results.
The specific detection flow is shown in Fig. 2. The input is a sequence $R^{3 \times H \times W \times T}$ composed of the RGB video frames extracted from the video, where T, H, and W are the length, height, and width of the input sequence and the number of RGB channels is 3. The input sequence passes in turn through the base generalization module, the behavior instance anchoring module, and the behavior prediction module, which output the classification categories and position offsets of the anchored behavior instances; the temporal location where each behavior occurs is computed from the position offsets. t, h, and w denote the length, height, and width of an anchor feature map, and m denotes the number of behavior categories.
The video data used in this implementation comes from THUMOS'14, a representative human behavior recognition dataset, provided by (THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014).
The example is introduced below from three aspects: building the multi-scale behavior segment regression network, end-to-end optimization training with multi-task learning, and testing and evaluation.
1. Building the multi-scale behavior segment regression network.
Fig. 3 shows the schematic diagram of the multi-scale behavior segment regression network provided by this example. It mainly comprises the base generalization module, the behavior instance anchoring module, and the behavior prediction module, wherein:
The base generalization module consists of five 3D convolution layers and five 3D max-pooling layers. The 3D convolution kernels have size 3 × 3 × 3; the first three layers contain 64, 128, and 256 kernels respectively, and the remaining two layers contain 512 kernels each. The kernel sizes of the first three 3D pooling layers are set to 2 × 2 × 2, and those of the remaining pooling layers to 1 × 2 × 2. The output feature map of the fifth max-pooling layer is denoted F5; its height and width are h = w = 3 and its time dimension is t = 24. Every convolution output is passed through a rectified linear unit (ReLU) activation function, adding nonlinear mapping capacity to the network.
The behavior instance anchoring module uses 3D convolution layers; here each 3D convolution layer in this module is called an anchoring layer, and the output of each layer is called an anchor feature map. In all anchoring layers the convolution kernel size is set to 3 × 3 × 3, and the number of kernels per layer along the time dimension is 256. The anchor feature maps output by the anchoring layers are denoted F6, F7, F8, and F9, with sizes (256 × 12 × 3 × 3), (256 × 6 × 3 × 3), (256 × 3 × 3 × 3), and (256 × 1 × 3 × 3) respectively. The output F5 of the last layer of the base generalization module, of size (256 × 24 × 3 × 3), is also used as an anchor feature map. The second dimension of each anchor feature map is padded at both ends, so each cell on each anchor feature map has size (256 × 3 × 3 × 3). The base scales of anchor feature maps F5, F6, F7, F8, and F9 are set to {0.1, 0.3, 0.5, 0.7, 0.9} in turn. The scale ratios of F5, F6, F7, and F8 are set to {0.8, 1, 1.5}, and those of F9 to {0.7, 0.85, 1}. With three scale ratios per anchor feature map, each cell (256 × 3 × 3 × 3) on each anchor feature map is associated with three anchored behavior instances; the length of each anchored behavior instance is the base scale of its anchor feature map multiplied by the corresponding scale ratio, and the center of each anchored behavior instance is the center of the corresponding cell.
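As a concrete illustration of this configuration, the anchored instances implied by the base scales and scale ratios can be enumerated as follows (a sketch; times are normalized to [0, 1] over the input window, and the dictionary layout is an assumption):

```python
# (base scale s_k, scale ratios r_d, temporal length of the map), per the example
ANCHOR_CONFIG = {
    'F5': (0.1, (0.8, 1.0, 1.5), 24),
    'F6': (0.3, (0.8, 1.0, 1.5), 12),
    'F7': (0.5, (0.8, 1.0, 1.5), 6),
    'F8': (0.7, (0.8, 1.0, 1.5), 3),
    'F9': (0.9, (0.7, 0.85, 1.0), 1),
}

def enumerate_anchors(config=ANCHOR_CONFIG):
    """Yield (map, start, end) in normalized time: length = s_k * r_d,
    centered on the cell center along the map's time axis."""
    for name, (s_k, ratios, t_len) in config.items():
        for cell in range(t_len):
            center = (cell + 0.5) / t_len
            for r in ratios:
                half = s_k * r / 2.0
                yield name, max(0.0, center - half), min(1.0, center + half)
```

F5's 24 cells thus carry short anchors (lengths 0.08 to 0.15 of the window) while F9's single cell carries long anchors covering 0.63 to 0.9 of it, which is how the network covers behaviors of widely varying temporal extents.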
The behavior prediction module predicts the position offsets and behavior categories of the corresponding anchor positions with 3D convolution. Taking THUMOS'14 as an example, each cell is associated with three anchored behavior instances and must predict 20 behavior category scores, 1 background category score, and 2 temporal offsets (start and end), so on each cell of F5, F6, F7, F8, and F9, 3 × (20 + 1 + 2) = 69 convolution kernels of size 3 × 3 × 3 are applied; the outputs correspond, for the cell's three anchored behavior instances, to the 20 behavior category scores, the 1 background score, and the 2 temporal offsets.
2. End-to-end optimization training with multi-task learning.
Since GPU memory is limited, a complete long video cannot be input at once, so the video must be processed to generate suitable inputs. Therefore, in the training stage, video frames are first extracted from the training videos of THUMOS'14 at a frame rate of 10 frames per second, and every frame is resized to 96 × 96, i.e., H = W = 96. Then a window is slid over the video frame sequence to generate sequences of T = 192 consecutive frames as input frame sequences.
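A sketch of this input preparation step under the stated settings (10 fps, 96 × 96 frames, T = 192); OpenCV is used here as one plausible way to decode frames, and the function name is an illustrative assumption.

```python
import cv2

def make_input_sequences(video_path, fps=10, size=96, win=192, stride=192):
    """Decode frames at roughly `fps`, resize them to size x size, and cut the
    resulting sequence into sliding windows of `win` consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    native = cap.get(cv2.CAP_PROP_FPS) or fps
    step = max(1, round(native / fps))     # keep every step-th decoded frame
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(cv2.resize(frame, (size, size)))
        i += 1
    cap.release()
    return [frames[s:s + win] for s in range(0, len(frames) - win + 1, stride)]
```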
After the input frame sequences are generated, since each sequence contains both behavior segments and background segments, the specific positive and negative samples must be further determined. The specific method is: for each input frame sequence, compute the overlap on the time dimension between every anchored behavior instance in the multi-scale behavior segment regression network and the corresponding ground-truth instance; if the overlap exceeds the threshold 0.5, take the anchored instance as a positive sample, and otherwise as a negative sample. One ground-truth instance may match multiple anchored behavior instances, but each anchored behavior instance can match only one ground-truth instance.
With the positive and negative samples determined, the multi-task loss function serves as the training objective of the network, and the parameters of the multi-scale behavior segment regression network are trained by stochastic gradient descent. The multi-task loss function is defined as:

$$L_{loss} = \frac{1}{N_{cls}}\sum_{i} L_{cls}(p_i) + \alpha\,\frac{1}{N_{pos}}\sum_{i} L_{loc}(t_i, t_i^{*}) + \beta\,L_2(\Theta)$$

In the above formula, $L_2(\Theta)$ is the L2 regularization loss and $\Theta$ denotes all learned parameters of the multi-scale behavior segment regression network; $\alpha$ and $\beta$ are loss trade-off parameters controlling the temporal offset loss and the regularization loss, respectively; $N_{cls}$ and $N_{pos}$ are the numbers of total training samples and of positive samples; $p_i = (p_i^1, \ldots, p_i^m)$ is the prediction vector over behavior categories at the $i$-th anchor position, the superscript $j$ indexing the $j$-th of $m$ behavior categories in total; $p_i^g$ denotes the prediction score of the $i$-th anchor position for the ground-truth category, the superscript $g$ indicating that the $g$-th behavior category is the true label; $t_i$ is the temporal offset of the anchored instance, and $t_i^{*}$ is the coordinate transform of the ground-truth position with respect to the anchor position.
$L_{cls}$ is the behavior classification loss, set to the multi-class softmax loss:

$$L_{cls}(p_i) = -\log\frac{\exp(p_i^{g})}{\sum_{j=1}^{m}\exp(p_i^{j})}$$

$L_{loc}$ is the temporal position regression loss, set to the smooth L1 loss of the temporal offsets.
3. Testing and evaluation.
After the multi-scale behavior segment regression network has been trained on THUMOS'14, its performance is assessed as follows. On the test video set of THUMOS'14, for each video, the video frame sequence is extracted at 10 frames per second and a window is slid with a step of 192 frames, generating test video frame sequences of the same length as the training sequences; these are fed into the trained multi-scale behavior segment regression network to obtain the predicted behavior categories and corresponding temporal locations. Non-maximum suppression (NMS) is then applied to the outputs to generate the final behavior detection results. Finally, the behavior detection results are compared with the ground-truth behavior labels of the test set to obtain the evaluation results of the network. Fig. 4 shows a schematic diagram of behavior detection results on the test video set of THUMOS'14.
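The NMS step can be sketched as standard 1-D non-maximum suppression over (start, end, score, class) detections, reusing the temporal_iou helper shown earlier; the 0.4 overlap threshold is an assumed typical value, not one specified in the text.

```python
def temporal_nms(dets, iou_thresh=0.4):
    """dets: list of (start, end, score, cls). Greedily keep the highest-scoring
    detection and drop same-class detections that overlap it above the threshold."""
    kept = []
    for d in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(k[3] != d[3] or temporal_iou(d[:2], k[:2]) <= iou_thresh
               for k in kept):
            kept.append(d)
    return kept
```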
Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes instructions that cause a computing device (a personal computer, server, network device, or the like) to execute the methods described in the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A single-stage video behavior detection method, characterized by comprising:
in the training stage, building a multi-scale behavior segment regression network based on convolutional neural networks; taking training videos and frame-level ground-truth behavior labels as input, and training the multi-scale behavior segment regression network with an end-to-end multi-task learning optimization method to obtain a trained multi-scale behavior segment regression network model;
in the deployment stage, when a new video is input, generating input frame sequences of the same length as the training videos by a sliding window along the time dimension; using the trained multi-scale behavior segment regression network model to predict the behavior category and corresponding temporal location of each input frame sequence; and applying non-maximum suppression to the predictions to generate the final behavior detection results.
2. The single-stage video behavior detection method according to claim 1, characterized in that the constructed multi-scale behavior segment regression network comprises a base generalization module, a behavior instance anchoring module, and a behavior prediction module, wherein:
the base generalization module comprises N1 3D convolution layers and N2 3D max-pooling layers arranged alternately, for generalizing the features of the input video sequence and enlarging the receptive field;
the behavior instance anchoring module uses a 3D convolutional network of N3 layers with stride s1 on the time dimension and stride s2 on the spatial dimensions, and associates each cell of the anchor feature map output by each 3D convolution layer of this module with anchored behavior instances of different temporal lengths;
the behavior prediction module applies, to each cell of an anchor feature map, $D_k(m + 2)$ convolution kernels of size h × w × 3, and outputs for the cell's $D_k$ anchored behavior instances the prediction scores over m behavior categories and two temporal offsets, where h and w denote the height and width of the anchor feature map.
3. The single-stage video behavior detection method according to claim 2, characterized in that, in the behavior instance anchoring module, a base temporal scale $s_k$, $k \in [1, N_3]$, is defined for each anchor feature map; a group of scale ratios $\{r_d\}$ is defined for each anchor feature map, $D_k$ being the number of scale ratios; denoting the size of an anchor feature map as h × w × t, t being its length, each cell of size h × w × 3 on the map is associated with $D_k$ anchored behavior instances, the temporal length of each instance being $l_d = s_k \cdot r_d$, $d \in [1, D_k]$, with its center at the cell center.
4. The single-stage video behavior detection method according to claim 2, characterized in that
training videos and frame-level ground-truth behavior labels are taken as input, and the parameters of the multi-scale behavior segment regression network are trained by gradient descent under a joint training objective, with the training process as follows:
extracting video frames from the training videos at a fixed frame rate to obtain training frame sequences, and resizing every frame to a uniform resolution; taking equal-length sub-sequences produced by a sliding window over the sequence as input frame sequences, the length of each window being the maximum number of frames allowed by GPU memory;
establishing the correspondence between anchored behavior instances and the ground-truth behavior instances in the labels with a positive-sample matching strategy;
taking the multi-task loss function as the training objective of the multi-scale behavior segment regression network, training by stochastic gradient descent, and iteratively generating the final multi-scale behavior segment regression network model.
5. The single-stage video behavior detection method according to claim 4, characterized in that establishing the correspondence between anchored behavior instances and the ground-truth behavior instances in the labels with the positive-sample matching strategy comprises:
in each training sample, computing the overlap on the time dimension between every anchored behavior instance and every ground-truth behavior instance; if the overlap exceeds a fixed threshold, taking the corresponding anchored behavior instance as a positive sample, and otherwise as a negative sample, wherein one ground-truth instance may match multiple anchored behavior instances.
6. The single-stage video behavior detection method according to claim 4, characterized in that the multi-task loss function means that the training objective $L_{loss}$ of the network jointly combines the behavior classification loss and the temporal position regression loss, expressed as:

$$L_{loss} = \frac{1}{N_{cls}}\sum_{i} L_{cls}(p_i) + \alpha\,\frac{1}{N_{pos}}\sum_{i} L_{loc}(t_i, t_i^{*}) + \beta\,L_2(\Theta)$$

In the above formula, $L_2(\Theta)$ is the L2 regularization loss and $\Theta$ denotes all learned parameters of the multi-scale behavior segment regression network; $\alpha$ and $\beta$ are loss trade-off parameters controlling the temporal offset loss and the regularization loss, respectively; $N_{cls}$ and $N_{pos}$ are the numbers of total training samples and of positive samples; $p_i = (p_i^1, \ldots, p_i^m)$ is the prediction vector over behavior categories at the $i$-th anchor position, the superscript $j$ indexing the $j$-th of $m$ behavior categories in total; $p_i^g$ denotes the prediction score of the $i$-th anchor position for the ground-truth category, the superscript $g$ indicating that the $g$-th behavior category is the true label; $t_i$ is the temporal offset of the anchored instance, and $t_i^{*}$ is the coordinate transform of the ground-truth position with respect to the anchor position;
$L_{cls}$ is the behavior classification loss, set to the multi-class softmax loss:

$$L_{cls}(p_i) = -\log\frac{\exp(p_i^{g})}{\sum_{j=1}^{m}\exp(p_i^{j})}$$

$L_{loc}$ is the temporal position regression loss, set to the smooth L1 loss of the temporal offsets.
CN201810607804.6A 2018-06-13 2018-06-13 Single-stage video behavior detection method Active CN108805083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810607804.6A CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810607804.6A CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Publications (2)

Publication Number Publication Date
CN108805083A true CN108805083A (en) 2018-11-13
CN108805083B CN108805083B (en) 2022-03-01

Family

ID=64085637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810607804.6A Active CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Country Status (1)

Country Link
CN (1) CN108805083B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164694A (en) * 2013-02-20 2013-06-19 上海交通大学 Method for recognizing human motion
CN105740773A (en) * 2016-01-25 2016-07-06 重庆理工大学 Deep learning and multi-scale information based behavior identification method
CN106407903A (en) * 2016-08-31 2017-02-15 四川瞳知科技有限公司 Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method
CN107729799A (en) * 2017-06-13 2018-02-23 银江股份有限公司 Crowd's abnormal behaviour vision-based detection and analyzing and alarming system based on depth convolutional neural networks
CN108133188A (en) * 2017-12-22 2018-06-08 武汉理工大学 A kind of Activity recognition method based on motion history image and convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jiyang Gao et al., "Cascaded Boundary Regression for Temporal Action Detection", arXiv:1705.01180 *
Tianwei Lin et al., "Single Shot Temporal Action Detection", Proceedings of the 25th ACM International Conference on Multimedia *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium
CN109697434B (en) * 2019-01-07 2021-01-08 腾讯科技(深圳)有限公司 Behavior recognition method and device and storage medium
CN109829398A (en) * 2019-01-16 2019-05-31 北京航空航天大学 A kind of object detection method in video based on Three dimensional convolution network
CN109829398B (en) * 2019-01-16 2020-03-31 北京航空航天大学 Target detection method in video based on three-dimensional convolution network
CN109816023A (en) * 2019-01-29 2019-05-28 北京字节跳动网络技术有限公司 Method and apparatus for generating picture tag model
CN110059584A (en) * 2019-03-28 2019-07-26 中山大学 A kind of event nomination method of the distribution of combination boundary and correction
CN110059658A (en) * 2019-04-26 2019-07-26 北京理工大学 A kind of satellite-remote-sensing image multidate change detecting method based on Three dimensional convolution neural network
CN110084202A (en) * 2019-04-29 2019-08-02 东南大学 A kind of video behavior recognition methods based on efficient Three dimensional convolution
CN110222592A (en) * 2019-05-16 2019-09-10 西安特种设备检验检测院 A kind of construction method of the timing behavioral value network model generated based on complementary timing behavior motion
CN110222592B (en) * 2019-05-16 2023-01-17 西安特种设备检验检测院 Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation
CN110348345A (en) * 2019-06-28 2019-10-18 西安交通大学 A kind of Weakly supervised timing operating position fixing method based on continuity of movement
CN110348345B (en) * 2019-06-28 2021-08-13 西安交通大学 Weak supervision time sequence action positioning method based on action consistency
CN110610194A (en) * 2019-08-13 2019-12-24 清华大学 Data enhancement method for small data video classification task
CN110633645A (en) * 2019-08-19 2019-12-31 同济大学 Video behavior detection method based on enhanced three-stream architecture
CN110659572B (en) * 2019-08-22 2022-08-12 南京理工大学 Video motion detection method based on bidirectional feature pyramid
CN110659572A (en) * 2019-08-22 2020-01-07 南京理工大学 Video motion detection method based on bidirectional feature pyramid
CN110796069A (en) * 2019-10-28 2020-02-14 广州博衍智能科技有限公司 Behavior detection method, system, equipment and machine readable medium
CN111259779A (en) * 2020-01-13 2020-06-09 南京大学 Video motion detection method based on central point trajectory prediction
CN111259779B (en) * 2020-01-13 2023-08-01 南京大学 Video motion detection method based on center point track prediction
CN111259783A (en) * 2020-01-14 2020-06-09 深圳市奥拓电子股份有限公司 Video behavior detection method and system, highlight video playback system and storage medium
CN111325097A (en) * 2020-01-22 2020-06-23 陕西师范大学 Enhanced single-stage decoupled time sequence action positioning method
CN111553238A (en) * 2020-04-23 2020-08-18 北京大学深圳研究生院 Regression classification module and method for time axis positioning of actions
CN111814588B (en) * 2020-06-18 2023-08-01 浙江大华技术股份有限公司 Behavior detection method, related equipment and device
CN111814588A (en) * 2020-06-18 2020-10-23 浙江大华技术股份有限公司 Behavior detection method and related equipment and device
CN111898461A (en) * 2020-07-08 2020-11-06 贵州大学 Time sequence behavior segment generation method
CN111832479A (en) * 2020-07-14 2020-10-27 西安电子科技大学 Video target detection method based on improved self-adaptive anchor R-CNN
CN111832479B (en) * 2020-07-14 2023-08-01 西安电子科技大学 Video target detection method based on improved self-adaptive anchor point R-CNN
CN113033500A (en) * 2021-05-06 2021-06-25 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN113033500B (en) * 2021-05-06 2021-12-03 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN113505266A (en) * 2021-07-09 2021-10-15 南京邮电大学 Two-stage anchor-based dynamic video abstraction method
CN113505266B (en) * 2021-07-09 2023-09-26 南京邮电大学 Two-stage anchor-based dynamic video abstraction method
CN114339403A (en) * 2021-12-31 2022-04-12 西安交通大学 Video action fragment generation method, system, equipment and readable storage medium
CN114882403A (en) * 2022-05-05 2022-08-09 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN114882403B (en) * 2022-05-05 2022-12-02 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN116996661A (en) * 2023-09-27 2023-11-03 中国科学技术大学 Three-dimensional video display method, device, equipment and medium
CN116996661B (en) * 2023-09-27 2024-01-05 中国科学技术大学 Three-dimensional video display method, device, equipment and medium

Also Published As

Publication number Publication date
CN108805083B (en) 2022-03-01

Similar Documents

Publication Publication Date Title
CN108805083A (en) The video behavior detection method of single phase
CN111611878B (en) Method for crowd counting and future people flow prediction based on video image
Sankaranarayanan et al. Learning from synthetic data: Addressing domain shift for semantic segmentation
Reda et al. Unsupervised video interpolation using cycle consistency
Jiang et al. Density-aware multi-task learning for crowd counting
CN105512289B (en) Image search method based on deep learning and Hash
CN108399380A (en) A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN109993095B (en) Frame level feature aggregation method for video target detection
CN110889343B (en) Crowd density estimation method and device based on attention type deep neural network
CN110852267B (en) Crowd density estimation method and device based on optical flow fusion type deep neural network
CN111860693A (en) Lightweight visual target detection method and system
CN109272509A (en) A kind of object detection method of consecutive image, device, equipment and storage medium
CN109697434A (en) A kind of Activity recognition method, apparatus and storage medium
CN110097115B (en) Video salient object detection method based on attention transfer mechanism
CN111027377B (en) Double-flow neural network time sequence action positioning method
Hua et al. Depth estimation with convolutional conditional random field network
CN106815563B (en) Human body apparent structure-based crowd quantity prediction method
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN114782737A (en) Image classification method, device and storage medium based on improved residual error network
Desai et al. Next frame prediction using ConvLSTM
CN104077742A (en) GABOR characteristic based face sketch synthetic method and system
Chen et al. Salbinet360: Saliency prediction on 360 images with local-global bifurcated deep network
CN112836755B (en) Sample image generation method and system based on deep learning
CN110163103A (en) A kind of live pig Activity recognition method and apparatus based on video image
Zhong et al. EST-TSANet: Video-Based Remote Heart Rate Measurement Using Temporal Shift Attention Network and ESTmap

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant