CN109389035A - Low-latency video action detection method based on multiple features and frame confidence score - Google Patents

Low-latency video action detection method based on multiple features and frame confidence score

Info

Publication number
CN109389035A
Authority
CN
China
Prior art keywords
frame
picture
optical flow
confidence score
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810998778.4A
Other languages
Chinese (zh)
Inventor
宋砚 (Song Yan)
李泽超 (Li Zechao)
孙莉 (Sun Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN201810998778.4A
Publication of CN109389035A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The present invention provides a low-latency video action detection method based on multiple features and frame confidence scores, comprising: step 1, preprocessing the data set to obtain RGB pictures and optical flow pictures; step 2, constructing a three-dimensional convolution-de-convolution (CDC) neural network model; step 3, feeding the RGB picture and optical flow picture training sets obtained in step 1 separately into the network model of step 2 for training, obtaining trained models; step 4, feeding the test sets of RGB pictures and optical flow pictures into the two trained models of step 3 respectively, fusing the outputs of the two models to obtain the confidence score of each frame, and generating action segments; step 5, using the action segments obtained in step 4, selecting different percentages of frames in temporal order and comparing them with the ground truth to obtain the low-latency action detection result.

Description

Low-latency video action detection method based on multiple features and frame confidence score
Technical field
The present invention relates to video-based human action detection techniques in computer vision, and in particular to a low-latency video action detection method based on multiple features and frame confidence scores.
Background art
With the progress of science and the advance of computer technology, people place higher-level demands on the acquisition and analysis of information, increasingly expecting computers to perceive the world through vision as humans do, i.e., computer vision. Human action recognition has become a research hotspot in the field of computer vision, and its techniques and methods have matured considerably. Action detection developed out of action recognition; its purpose is to localize the positions of actions in a long untrimmed video while also assigning the correct label to each action in the video.
Researchers have proposed the concept of low-latency detection (Low-latency Detection). Latency is originally a key metric of interactive experience systems: the time difference between a user performing an action and receiving the system's feedback. Extended to the recognition field, it can be understood as the time difference between observing the data and obtaining a correct recognition result. It can be regarded as a generalization of early, real-time, continuous and online recognition. Simply put, low-latency action detection means that, for a long untrimmed video, the action content observed so far is recognized during playback and the start and end of each action are localized. The difficulty of low-latency recognition and detection stems mainly from two aspects: 1) incomplete observability of the data, i.e., the type of a behavior must be identified, and the start and end of each action located, while only part of the behavioral data has been observed; 2) the timeliness requirement on the algorithm, i.e., the algorithm must detect and identify the type of behavior as soon as possible while the video is being acquired. These two difficulties prevent many traditional algorithms from being applied directly to such problems.
The automatic detection of human activity in video has many potential applications, such as video understanding and detection, automatic video surveillance, and human-computer interaction. Furthermore, many applications require activity to be detected as early as possible. Low-latency human motion analysis is of growing importance in diverse human-computer interaction systems. For an interactive system, minimizing the system's reaction delay is a very important consideration. Excessive delay not only severely degrades the user experience of an interactive system, but can also make certain specific interactive systems, such as gesture-controlled or augmented-perception video games, lose their appeal and become difficult to popularize. In particular, low-latency detection is vital in robotics: for example, before a robot is deployed to help a patient stand up, it must first detect what movement the patient intends to make. Likewise, a robot that communicates emotionally with humans must accurately and rapidly discover a person's emotional state from facial expressions so that it can respond appropriately and in time. In addition, low-latency detection allows a system to forecast in advance: for example, if an early warning can be given before a dangerous behavior has fully occurred, some dangerous events may be prevented. In summary, research on video-based low-latency human action detection has become a very important research direction, with great commercial value and practical significance.
Summary of the invention
The purpose of the present invention is to provide a low-latency video action detection method based on multiple features and frame confidence scores, which can detect actions before the data are completely observed and computes in real time.
The technical solution for achieving the object of the present invention is as follows: a low-latency video action detection method based on multiple features and frame confidence scores, comprising the following steps:
Step 1: preprocess the data set to obtain RGB pictures and optical flow pictures;
Step 2: construct a three-dimensional convolution-de-convolution (CDC) neural network model;
Step 3: feed the RGB picture and optical flow picture training sets obtained in step 1 separately into the network model of step 2 for training, obtaining trained models;
Step 4: feed the test sets of RGB pictures and optical flow pictures into the two trained models of step 3 respectively, fuse the outputs of the two models to obtain the confidence score of each frame, and generate action segments;
Step 5: using the action segments obtained in step 4, select different percentages of frames in temporal order and compare them with the ground truth to obtain the low-latency action detection result.
Compared with the prior art, the present invention has the following advantages. Unlike traditional complete action detection, which first extracts action segments and then feeds them into a network for classification, the present invention only needs to feed the frame sequence into the network in chronological order to obtain the action class of every frame, making it a frame-based action detection method. Meanwhile, the invention introduces a rank loss function that constrains the detection score output by the model for the correct label to be monotonically non-decreasing, so that the start of an action can be detected as early as possible, realizing low-latency action detection. Moreover, the invention uses two kinds of data to train the networks: RGB pictures, which fully exploit spatial features, and optical flow pictures, which fully exploit temporal features; finally, the spatio-temporal training data are combined to extract action information and raise the confidence score of frame classification, thereby improving the precision of action detection.
The invention is further described below with reference to the accompanying drawings of the specification.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic framework of the video low-latency action detection technique.
Fig. 2 shows an optical flow picture.
Fig. 3 shows the CDC network structure.
Fig. 4 shows frame confidence scores.
Specific embodiment
The present invention is described in more detail below with reference to the accompanying drawings:
The present invention proposes a low-latency video action detection method based on multiple features and frame confidence scores, comprising processes such as building a multi-layer three-dimensional convolutional network, extracting RGB and optical flow pictures, extracting frame confidence scores, and low-latency detection. A series of computations is performed on an untrimmed long video, so that the occurrence of an action can be detected, and its class judged, while the video is playing. The basic framework of the video low-latency action detection technique is shown in Fig. 1; the present invention proceeds according to this basic framework.
Step 1: the untrimmed long videos, including training set and test set, are read as pictures in png format at a frame rate of 25 FPS.
Step 2: as shown in Fig. 2, optical flow pictures are obtained with the TVL1 optical flow algorithm from the consecutive RGB pictures read from the untrimmed long videos. Every two RGB frames are turned by the algorithm into a pair of single-channel optical flow pictures in the x and y directions. The optical flow algorithm proceeds as follows:
Assume that the gray value of a point m(x, y) on the image at time t is I(x, y, t). After a time interval dt, the point moves to a new position m'(x+dx, y+dy), whose gray value is I(x+dx, y+dy, t+dt). Assuming that the gray value at the position reached after the point moves equals the gray value before it moved, we have:
I(x, y, t) = I(x+dx, y+dy, t+dt)
Expanding the right-hand side by Taylor's formula gives:
I(x+dx, y+dy, t+dt) = I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt + ε
where ε denotes the second-order infinitesimal terms. Since dt → 0, ε can be ignored, yielding
(∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt = 0
Let u and v be the velocity components of the optical flow along the X and Y axes respectively, with u = dx/dt and v = dy/dt, and let I_x = ∂I/∂x, I_y = ∂I/∂y, I_t = ∂I/∂t; this yields the basic optical flow constraint equation:
I_x u + I_y v + I_t = 0
To obtain a unique solution (u, v) of this equation, additional constraints must be added. The TVL1 algorithm relies on the smoothness assumption, namely that the motion of each pixel follows that of the points in its neighborhood, and adds a smoothness term to build the optical flow model:
E(u, v) = ∫ (|∇u| + |∇v|) dx + λ ∫ |I_x u + I_y v + I_t| dx
where E is the energy function of optical flow estimation, λ is the weight constant of the data term, and ∇u and ∇v are two-dimensional gradients; u and v are obtained by minimizing the energy function E.
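For concreteness, this extraction step can be prototyped with the TVL1 implementation in OpenCV, as in the minimal sketch below (it assumes the opencv-contrib-python package; the ±20 clipping bound used to rescale the flow into png-storable pictures is an assumption, not part of the method):

    import cv2
    import numpy as np

    # TVL1 solver from opencv-contrib; it minimizes the TV-L1 energy E(u, v)
    # described above to obtain the per-pixel flow (u, v).
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

    def flow_pair(prev_bgr, curr_bgr):
        """Two consecutive RGB frames -> two single-channel flow pictures
        (x- and y-direction), rescaled to uint8 for storage as png."""
        prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
        curr = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
        flow = tvl1.calc(prev, curr, None)           # (H, W, 2) float32: u, v
        flow = np.clip(flow, -20, 20)                # assumed clipping bound
        flow = ((flow + 20) / 40.0 * 255).astype(np.uint8)
        return flow[..., 0], flow[..., 1]            # x-direction, y-direction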
Step 3: construct the CDC network; its structure is shown in Fig. 3. The CDC network uses conv1a-conv5b of the C3D network structure as the first part of the CDC, with the pooling of the fifth layer changed to 1 × 2 × 2. The fully connected layers after the three-dimensional convolutional network of C3D are then replaced with CDC filters. Layer CDC6 downsamples the post-convolution output (512, L/8, 4, 4) spatially and upsamples it in time to (4096, L/4, 1, 1); layer CDC7 upsamples the output of CDC6 in time to (4096, L/2, 1, 1); layer CDC8 then continues to upsample the previous layer's output in time to (K+1, L, 1, 1); finally, a softmax layer produces the classification results of the L frames.
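The shape bookkeeping of CDC6-CDC8 can be illustrated with a small PyTorch sketch. This is an approximation only: each CDC filter is emulated here by a spatial 3D convolution and/or a temporal transposed convolution, and the layer decomposition and kernel sizes are assumptions rather than the CDC filter itself:

    import torch
    import torch.nn as nn

    class CDCHead(nn.Module):
        """Sketch of CDC6-CDC8 on top of the C3D conv1a-conv5b features."""
        def __init__(self, k):                      # k action classes (+1 background)
            super().__init__()
            # CDC6: (512, L/8, 4, 4) -> (4096, L/4, 1, 1)
            self.cdc6 = nn.Sequential(
                nn.Conv3d(512, 4096, kernel_size=(1, 4, 4)),   # 4x4 space -> 1x1
                nn.ConvTranspose3d(4096, 4096, kernel_size=(4, 1, 1),
                                   stride=(2, 1, 1), padding=(1, 0, 0)),  # L/8 -> L/4
                nn.ReLU(inplace=True))
            # CDC7: -> (4096, L/2, 1, 1)
            self.cdc7 = nn.Sequential(
                nn.ConvTranspose3d(4096, 4096, kernel_size=(4, 1, 1),
                                   stride=(2, 1, 1), padding=(1, 0, 0)),
                nn.ReLU(inplace=True))
            # CDC8: -> (K+1, L, 1, 1)
            self.cdc8 = nn.ConvTranspose3d(4096, k + 1, kernel_size=(4, 1, 1),
                                           stride=(2, 1, 1), padding=(1, 0, 0))

        def forward(self, x):                       # x: (N, 512, L/8, 4, 4)
            x = self.cdc8(self.cdc7(self.cdc6(x)))  # (N, K+1, L, 1, 1)
            return x.flatten(2).softmax(dim=1)      # per-frame class scores (N, K+1, L)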
Step 4: the whole network has two loss functions: a classification loss function based on cross entropy, and a loss function based on rank loss. The overall loss function is computed as:
L_t = L_c + λ_r · L_r
where L_c is the classification loss function, L_r is the rank loss function, and λ_r is a constant, set to 6 here. The classification loss function is computed with cross entropy, as follows:
L_c = -log p_t^{y_t}
where y_t is the true label of frame t in the training sequence, and p_t^{y_t} is the detection score of frame t for the correct class y_t, i.e., the softmax output of the network model.
On the basis of the classification loss function, the present invention also proposes a rank loss function. As shown in Fig. 4, during low-latency detection of a video, the more frames of an action that have been seen, the higher the detection score for the correct class and the greater the confidence; conversely, the detection score of the action for an incorrect class will be lower, with smaller confidence. Therefore, while an action is occurring, its detection score should form a monotonically non-decreasing curve. According to this property, if no action change occurs up to frame t, the detection score of frame t must be no less than the detection score of the previous frame. A rank loss function is constructed accordingly. If no action change occurs at frame t, the loss function is computed as:
L_r^t = max(0, p_{t-1}^{y_t} - p_t^{y_t})
If the action changes at frame t, i.e., frame t and frame t-1 do not belong to the same class, the loss function is computed as:
L_r^t = max(0, p_t^{y_{t-1}} - p_{t-1}^{y_{t-1}})
that is, once the action has changed, the detection score of the previous class must not increase.
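Both terms can be prototyped per training sequence as in the following PyTorch sketch (probs is the (L, K+1) per-frame softmax output and labels the (L,) ground-truth classes; the form of the action-change branch is an assumption reconstructed from the description above):

    import torch
    import torch.nn.functional as F

    def rank_loss(probs, labels):
        """Monotonicity constraint: within an action, the correct-class score
        must not drop from frame t-1 to frame t; on an action change, the
        previous class's score must not rise (assumed form)."""
        loss = probs.new_zeros(())
        for t in range(1, labels.shape[0]):
            if labels[t] == labels[t - 1]:   # no action change, formula (7)
                loss = loss + F.relu(probs[t - 1, labels[t]] - probs[t, labels[t]])
            else:                            # action change, formula (8), assumed
                loss = loss + F.relu(probs[t, labels[t - 1]] - probs[t - 1, labels[t - 1]])
        return loss / (labels.shape[0] - 1)

    def total_loss(probs, labels, lambda_r=6.0):
        """L_t = L_c + lambda_r * L_r, with the cross entropy taken on the
        softmax outputs of the network."""
        l_c = F.nll_loss(torch.log(probs + 1e-8), labels)
        return l_c + lambda_r * rank_loss(probs, labels)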
Step 5: the input to the first layer of the CDC network is 32 video frames; the video is fed into the network in slices of 32 frames each, i.e., frames (1-32), (33-64), ..., with non-overlapping slice windows. The RGB pictures and optical flow pictures are used as training sets, fed separately into the constructed CDC network in the above manner, and training begins; the objective function is optimized with stochastic gradient descent (SGD), the initial learning rate is set to 1e-6, the batch size is set to 4, and after 25 epochs of iteration two trained models are finally obtained.
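Under these settings the training configuration might look as follows (a sketch only: the CDC model, the data loader and the per-frame labels are placeholders assumed to be defined as in the preceding steps, and total_loss is the function sketched above):

    import torch

    def make_slices(num_frames, win=32):
        """Non-overlapping 32-frame slice windows: (1-32), (33-64), ..."""
        return [(s, s + win) for s in range(0, num_frames - win + 1, win)]

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)  # initial learning rate 1e-6
    for epoch in range(25):                                   # 25 epochs
        for clips, labels in loader:         # clips: (4, 3, 32, 112, 112), batch size 4
            probs = model(clips)             # (4, K+1, 32) per-frame class scores
            loss = sum(total_loss(p.t(), y) for p, y in zip(probs, labels)) / len(probs)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()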
Step 6: classify the RGB and optical flow test set pictures with the two trained models from step 5 respectively, and extract the output of the second-to-last network layer to obtain, for every frame, the confidence score of each class; the maximum confidence score is taken as the action detection score of that class; finally, the output scores of the RGB pictures and the optical flow pictures are fused as the final frame confidence score.
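The fusion and per-frame decision reduce to an elementwise average, e.g. (rgb_scores and flow_scores are assumed to be the (L, K+1) per-frame score matrices produced by the two models on the same test video):

    import numpy as np

    fused = 0.5 * (rgb_scores + flow_scores)  # average the two streams per frame
    frame_cls = fused.argmax(axis=1)          # class of each frame
    frame_conf = fused.max(axis=1)            # final frame confidence score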
Step 7: the class of each frame is obtained from the frame confidence scores of step 6. Within a sequence of consecutive video frames, if two adjacent frames belong to the same class, such frames are successively merged into small fragments.
Step 8: if two small fragments from step 7 are close in the time series, i.e., the number of frames between the two small fragments is less than 20, they are further merged into one large fragment, which becomes a final action segment.
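Steps 7 and 8 amount to a run-length grouping followed by a gap-based merge, as in this sketch (restricting the gap merge to fragments of the same class is an assumption):

    def frames_to_fragments(frame_cls):
        """Step 7: merge runs of consecutive frames with the same class
        into (start, end, class) fragments."""
        fragments, start = [], 0
        for t in range(1, len(frame_cls) + 1):
            if t == len(frame_cls) or frame_cls[t] != frame_cls[start]:
                fragments.append((start, t - 1, frame_cls[start]))
                start = t
        return fragments

    def merge_fragments(fragments, gap=20):
        """Step 8: fuse fragments separated by fewer than `gap` frames
        (same-class fusion assumed)."""
        merged = fragments[:1]
        for s, e, c in fragments[1:]:
            ps, pe, pc = merged[-1]
            if c == pc and s - pe < gap:
                merged[-1] = (ps, e, c)
            else:
                merged.append((s, e, c))
        return merged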
Step 9: using the action segments obtained in step 8, select different percentages of frames in temporal order; for example, for an action segment of 50 frames, take the first 3/10 of its frames, i.e., the first 15 frames of the segment, for low-latency action detection. Intersect these first 15 frames with the first 3/10 of the frames of the true action segment to obtain the overlap of the two, then calculate the average precision (AP) under different IoU thresholds, and finally average over classes to obtain the mean average precision (mAP). The low-latency action detection effect is evaluated by mAP: the higher the mAP, the better the low-latency detection effect; that is, mAP serves as the result of the low-latency detection.
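The overlap computation behind this evaluation reduces to a temporal IoU over truncated frame intervals, e.g. (a sketch; gt_segment is an assumed ground-truth (start, end) interval):

    def truncate(seg, p):
        """Keep the first fraction p of an inclusive (start, end) interval."""
        start, end = seg
        keep = max(1, int((end - start + 1) * p))
        return (start, start + keep - 1)

    def temporal_iou(a, b):
        """Temporal IoU of two inclusive (start, end) frame intervals."""
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
        union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
        return inter / union

    # e.g. a 50-frame predicted segment: its first 3/10 is frames 0-14
    iou = temporal_iou(truncate((0, 49), 0.3), truncate(gt_segment, 0.3))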

Claims (8)

1. A low-latency video action detection method based on multiple features and frame confidence scores, characterized by comprising the following steps:
Step 1: preprocess the data set to obtain RGB pictures and optical flow pictures;
Step 2: construct a three-dimensional convolution-de-convolution (CDC) neural network model;
Step 3: feed the RGB picture and optical flow picture training sets obtained in step 1 separately into the network model of step 2 for training, obtaining trained models;
Step 4: feed the test sets of RGB pictures and optical flow pictures into the two trained models of step 3 respectively, fuse the outputs of the two models to obtain the confidence score of each frame, and generate action segments;
Step 5: using the action segments obtained in step 4, select different percentages of frames in temporal order and compare them with the ground truth to obtain the low-latency action detection result.
2. The method according to claim 1, characterized in that step 1 specifically comprises:
Step 1.1: read the untrimmed long videos, including training set and test set, as pictures in png format at a frame rate of 25 FPS, yielding the RGB pictures;
Step 1.2: obtain optical flow pictures from the consecutive RGB pictures read from the untrimmed long videos using the TVL1 optical flow algorithm.
3. The method according to claim 2, characterized in that the optical flow algorithm in step 1.2 is:
Step 1.2.1: assume that at time t the gray value of a point m(x, y) on the image is I(x, y, t); after a time interval dt, the point moves to a new position m'(x+dx, y+dy), whose gray value is I(x+dx, y+dy, t+dt);
assuming that the gray value at the position reached after the point moves equals the gray value before it moved, i.e.
I(x, y, t) = I(x+dx, y+dy, t+dt) (1)
Step 1.2.2: expand the right-hand side of formula (1) by Taylor's formula, i.e.
I(x+dx, y+dy, t+dt) = I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt + ε (2)
where ε denotes the second-order infinitesimal terms; since dt → 0, ε is ignored, giving
(∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt = 0 (3)
Step 1.2.3: let u and v be the velocity components of the optical flow along the X and Y axes respectively, with u = dx/dt and v = dy/dt; let I_x = ∂I/∂x, I_y = ∂I/∂y and I_t = ∂I/∂t; then obtain the basic optical flow constraint equation:
I_x u + I_y v + I_t = 0 (4)
Step 1.2.4: (u, v) forms the optical flow pictures.
4. The method according to claim 1, characterized in that step 2 specifically comprises the following steps:
Step 2.1: the CDC network uses conv1a-conv5b of the C3D network structure as the first part of the CDC, with the pooling of the fifth layer changed to 1 × 2 × 2; the fully connected layers after the three-dimensional convolutional network of C3D are replaced with CDC filters; layer CDC6 downsamples the post-convolution output (512, L/8, 4, 4) spatially and upsamples it in time to (4096, L/4, 1, 1); layer CDC7 upsamples the output of CDC6 in time to (4096, L/2, 1, 1); layer CDC8 continues to upsample the previous layer's output in time to (K+1, L, 1, 1); finally, a softmax layer produces the classification results of the L frames;
Step 2.2: design the overall loss function L_t:
L_t = L_c + λ_r · L_r (5)
where L_c is the classification loss function, L_r is the rank loss function, and λ_r is a constant;
the classification loss function is
L_c = -log p_t^{y_t} (6)
where y_t is the true label of frame t in the training sequence and p_t^{y_t} is the detection score of frame t for the correct class y_t;
the rank loss function is computed in two cases: if no action change occurs at frame t, the loss function is computed as formula (7):
L_r^t = max(0, p_{t-1}^{y_t} - p_t^{y_t}) (7)
if the action changes at frame t, i.e., frame t and frame t-1 do not belong to the same class, the loss function is computed as formula (8):
L_r^t = max(0, p_t^{y_{t-1}} - p_{t-1}^{y_{t-1}}) (8)
5. The method according to claim 4, characterized in that step 3 specifically comprises the following steps:
the input to the first layer of the CDC network is 32 video frames; the video is fed into the network in slices of 32 frames each, i.e., frames (1-32), (33-64), ..., with non-overlapping slice windows; the RGB pictures and optical flow pictures obtained in step 1 are used as training sets and fed separately into the constructed CDC network in the above manner for training, obtaining two trained models.
6. The method according to claim 5, characterized in that step 4 specifically comprises the following steps:
Step 4.1: classify the RGB and optical flow test set pictures with the two trained models from step 3 respectively, and extract the output of the second-to-last network layer to obtain, for every frame, the confidence score of each class; take the maximum confidence score as the action detection score of that class; finally, average the output scores of the RGB pictures and the optical flow pictures as the final frame confidence score;
Step 4.2: obtain the class of each frame from the frame confidence scores of step 4.1; within a sequence of consecutive video frames, if two adjacent frames belong to the same class, successively merge such frames into small fragments;
Step 4.3: if the number of frames between two small fragments from step 4.2 is less than 20, further merge them into one large fragment, which becomes a final action segment.
7. The method according to claim 6, characterized in that in step 4.2 each frame yields a prediction score for each class, and the class with the highest prediction score is the class of that frame.
8. The method according to claim 6, characterized in that step 5 specifically comprises the following steps:
Step 5.1: using the action segments obtained in step 4, select different percentages of frames in temporal order for low-latency action detection;
Step 5.2: intersect the frames extracted in step 5.1 with the same number of frames of the true action segment to obtain the overlap of the two, then calculate the average precision under different IoU thresholds, and finally average over classes to obtain the mean average precision, yielding the low-latency detection result.
CN201810998778.4A 2018-08-30 2018-08-30 Low-latency video action detection method based on multiple features and frame confidence score Pending CN109389035A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810998778.4A CN109389035A (en) 2018-08-30 2018-08-30 Low-latency video action detection method based on multiple features and frame confidence score

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810998778.4A CN109389035A (en) 2018-08-30 2018-08-30 Low-latency video action detection method based on multiple features and frame confidence score

Publications (1)

Publication Number Publication Date
CN109389035A (en) 2019-02-26

Family

ID=65418545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810998778.4A Pending CN109389035A (en) Low-latency video action detection method based on multiple features and frame confidence score

Country Status (1)

Country Link
CN (1) CN109389035A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170336398A1 (en) * 2016-04-26 2017-11-23 Washington State University Compositions and methods for antigen detection incorporating inorganic nanostructures to amplify detection signals
CN107292912A (en) * 2017-05-26 2017-10-24 浙江大学 A kind of light stream method of estimation practised based on multiple dimensioned counter structure chemistry

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
VAN-MINH KHONG等: "Improving human action recognition with two-stream 3D convolutional neural network", 《2018 1ST INTERNATIONAL CONFERENCE ON MULTIMEDIA ANALYSIS AND PATTERN RECOGNITION (MAPR)》 *
ZHENG SHOU等: "CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION(CVPR)》 *
赵谦 (Zhao Qian) et al.: 《智能视频图像处理技术与应用》 (Intelligent Video Image Processing Technology and Applications), Xidian University Press, 30 September 2016 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886225A (en) * 2019-02-27 2019-06-14 浙江理工大学 A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN109886225B (en) * 2019-02-27 2020-09-15 浙江理工大学 Image gesture action online detection and recognition method based on deep learning
CN110007675A (en) * 2019-04-12 2019-07-12 北京航空航天大学 A kind of Vehicular automatic driving decision system based on driving situation map and the training set preparation method based on unmanned plane
CN113678137A (en) * 2019-08-18 2021-11-19 聚好看科技股份有限公司 Display device
CN113678137B (en) * 2019-08-18 2024-03-12 聚好看科技股份有限公司 Display apparatus
US20210158483A1 (en) * 2019-11-26 2021-05-27 Samsung Electronics Co., Ltd. Jointly learning visual motion and confidence from local patches in event cameras
US11694304B2 (en) * 2019-11-26 2023-07-04 Samsung Electronics Co., Ltd. Jointly learning visual motion and confidence from local patches in event cameras
CN111027472A (en) * 2019-12-09 2020-04-17 北京邮电大学 Video identification method based on fusion of video optical flow and image space feature weight
CN113221633A (en) * 2021-03-24 2021-08-06 西安电子科技大学 Weak supervision time sequence behavior positioning method based on hierarchical category model
CN113221633B (en) * 2021-03-24 2023-09-19 西安电子科技大学 Weak supervision time sequence behavior positioning method based on hierarchical category model
CN116453010A (en) * 2023-03-13 2023-07-18 彩虹鱼科技(广东)有限公司 Ocean biological target detection method and system based on optical flow RGB double-path characteristics

Similar Documents

Publication Publication Date Title
CN109389035A (en) Low-latency video action detection method based on multiple features and frame confidence score
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN109919031B (en) Human behavior recognition method based on deep neural network
Hu et al. 3D separable convolutional neural network for dynamic hand gesture recognition
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN109886358B (en) Human behavior recognition method based on multi-time-space information fusion convolutional neural network
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN110084228A (en) A kind of hazardous act automatic identifying method based on double-current convolutional neural networks
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN112766172B (en) Facial continuous expression recognition method based on time sequence attention mechanism
CN108764019A (en) A kind of Video Events detection method based on multi-source deep learning
CN113158861B (en) Motion analysis method based on prototype comparison learning
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN104537684A (en) Real-time moving object extraction method in static scene
Iosifidis et al. Neural representation and learning for multi-view human action recognition
CN114529984A (en) Bone action recognition method based on learnable PL-GCN and ECLSTM
CN113313123A (en) Semantic inference based glance path prediction method
CN112906520A (en) Gesture coding-based action recognition method and device
CN114360067A (en) Dynamic gesture recognition method based on deep learning
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN116363738A (en) Face recognition method, system and storage medium based on multiple moving targets
Nale et al. Suspicious human activity detection using pose estimation and lstm
CN105956604B (en) Action identification method based on two-layer space-time neighborhood characteristics
Wei et al. Learning facial expression and body gesture visual information for video emotion recognition
CN113850182A (en) Action identification method based on DAMR-3 DNet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190226