CN110210430A - Behavior recognition method and device - Google Patents
Behavior recognition method and device
- Publication number
- CN110210430A CN110210430A CN201910491344.XA CN201910491344A CN110210430A CN 110210430 A CN110210430 A CN 110210430A CN 201910491344 A CN201910491344 A CN 201910491344A CN 110210430 A CN110210430 A CN 110210430A
- Authority
- CN
- China
- Prior art keywords
- scoring
- behavior classification
- video
- current
- identified
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The present invention provides a behavior recognition method and device. The method comprises: presetting at least two behavior categories; dividing a video to be identified into at least two video segments; for each video segment, extracting a key frame, a stacked optical flow, and consecutive-frame still images, performing behavior recognition separately on each of these three inputs, and determining a first score, a second score, and a third score for each behavior category of the current video segment; determining, from the first, second, and third scores of each behavior category of every video segment, the spatial-stream score, the temporal-stream score, and the 3D score of each behavior category of the video to be identified; and generating, from the spatial-stream score, temporal-stream score, and 3D score of each behavior category, the final score of each behavior category of the video to be identified. The method and device can improve the accuracy of behavior recognition.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a behavior recognition method and device.
Background technique
Behavior recognition in video refers to automatically analyzing a video to identify the behavior performed by a human body in it. The simplest form of behavior recognition, also called behavior classification, categorizes the human behavior in an unknown video into one of several predefined behavior categories.
In the prior art, several frames of still images are extracted from the video to be identified, and behavior recognition is performed on these still images to generate the final recognition result.
As can be seen from the above description, the prior art considers only the appearance information in still images when performing behavior recognition, so the recognition result is inaccurate.
Summary of the invention
Embodiments of the present invention provide a behavior recognition method and device, which can improve the accuracy of behavior recognition.
In one aspect, an embodiment of the present invention provides a behavior recognition method, comprising:
presetting at least two behavior categories;
dividing a video to be identified into at least two video segments;
for each video segment, performing: extracting a key frame, a stacked optical flow, and consecutive-frame still images of the current video segment; performing behavior recognition on the current video segment according to the key frame to determine a first score for each behavior category of the current video segment; performing behavior recognition on the current video segment according to the stacked optical flow to determine a second score for each behavior category of the current video segment; and performing behavior recognition on the current video segment according to the consecutive-frame still images to determine a third score for each behavior category of the current video segment;
determining, according to the first score of each behavior category of each video segment, the spatial-stream score of each behavior category of the video to be identified; determining, according to the second score of each behavior category of each video segment, the temporal-stream score of each behavior category of the video to be identified; and determining, according to the third score of each behavior category of each video segment, the 3D score of each behavior category of the video to be identified;
generating, according to the spatial-stream score, temporal-stream score, and 3D score of each behavior category of the video to be identified, the final score of each behavior category of the video to be identified.
Optionally, the method further comprises:
presetting a weight for the spatial-stream score, a weight for the temporal-stream score, and a weight for the 3D score;
wherein generating the final score of each behavior category of the video to be identified comprises:
for each behavior category, performing: determining the final score of the current behavior category of the video to be identified using formula four, according to the spatial-stream score, temporal-stream score, and 3D score of the current behavior category and the three preset weights, wherein formula four is:
O = aS + bT + cM;
where O is the final score of the current behavior category of the video to be identified, S is the spatial-stream score of the current behavior category, T is the temporal-stream score of the current behavior category, M is the 3D score of the current behavior category, a is the weight of the spatial-stream score, b is the weight of the temporal-stream score, and c is the weight of the 3D score.
Optionally, generating the final score of each behavior category of the video to be identified comprises:
inputting the spatial-stream score, temporal-stream score, and 3D score of each behavior category of the video to be identified into a trained linear SVM classifier, and determining the final score of each behavior category of the video to be identified using the linear SVM classifier;
wherein the kernel function of the linear SVM classifier is:
k(x, x_i) = ((x · x_i) + 1)^d, where d is a preset constant and a positive integer.
Optionally, performing behavior recognition on the current video segment according to the key frame to determine the first score of each behavior category of the current video segment comprises:
inputting the key frame of the current video segment into a trained 2D-convolution spatial-stream model, and performing behavior recognition on the key frame of the current video segment with the 2D-convolution spatial-stream model to determine the first score of each behavior category of the current video segment.
Optionally, performing behavior recognition on the current video segment according to the stacked optical flow to determine the second score of each behavior category of the current video segment comprises:
inputting the stacked optical flow of the current video segment into a trained 2D-convolution temporal-stream model, and performing behavior recognition on the stacked optical flow of the current video segment with the 2D-convolution temporal-stream model to determine the second score of each behavior category of the current video segment.
Optionally, performing behavior recognition on the current video segment according to the consecutive-frame still images to determine the third score of each behavior category of the current video segment comprises:
inputting the consecutive-frame still images of the current video segment into a trained 3D convolution model, and performing behavior recognition on the consecutive-frame still images of the current video segment with the 3D convolution model to determine the third score of each behavior category of the current video segment.
Optionally, determining the spatial-stream score of each behavior category of the video to be identified according to the first score of each behavior category of each video segment comprises:
for each behavior category, performing: determining the spatial-stream score of the current behavior category of the video to be identified using formula one, according to the first scores of the current behavior category of the video segments, wherein formula one is:
Vid_α = (1/K) Σ_{k=1}^{K} P_k^α
where Vid_α is the spatial-stream score of the current behavior category of the video to be identified, K is the total number of the at least two video segments, and P_k^α is the first score of the current behavior category of the k-th video segment.
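The image for formula one is not reproduced in the text; a natural reading of the surrounding definitions (Vid_α obtained from the per-segment first scores over K segments) is an average consensus, sketched below in Python. The symbol P_k^α for the k-th segment's score is an assumption, since the original formula is lost.

```python
def video_level_score(segment_scores):
    """Formula one, read as an average consensus (assumed):
    Vid_alpha = (1/K) * sum_{k=1}^{K} P_k_alpha,
    where segment_scores holds the first score of the current
    behavior category for each of the K video segments."""
    return sum(segment_scores) / len(segment_scores)
```

Formulas two and three for the temporal-stream and 3D scores would take the same form over the second and third scores, respectively.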
Optionally, determining the temporal-stream score of each behavior category of the video to be identified according to the second score of each behavior category of each video segment comprises:
for each behavior category, performing: determining the temporal-stream score of the current behavior category of the video to be identified using formula two, according to the second scores of the current behavior category of the video segments, wherein formula two is:
Vid_β = (1/K) Σ_{k=1}^{K} P_k^β
where Vid_β is the temporal-stream score of the current behavior category of the video to be identified, K is the total number of the at least two video segments, and P_k^β is the second score of the current behavior category of the k-th video segment.
Optionally, determining the 3D score of each behavior category of the video to be identified according to the third score of each behavior category of each video segment comprises:
for each behavior category, performing: determining the 3D score of the current behavior category of the video to be identified using formula three, according to the third scores of the current behavior category of the video segments, wherein formula three is:
Vid_γ = (1/K) Σ_{k=1}^{K} P_k^γ
where Vid_γ is the 3D score of the current behavior category of the video to be identified, K is the total number of the at least two video segments, and P_k^γ is the third score of the current behavior category of the k-th video segment.
In another aspect, an embodiment of the present invention provides a behavior recognition device, comprising:
a first setting unit, configured to set at least two behavior categories;
a cutting unit, configured to divide a video to be identified into at least two video segments;
a fragment processing unit, configured to, for each video segment, perform: extracting a key frame, a stacked optical flow, and consecutive-frame still images of the current video segment; performing behavior recognition on the current video segment according to the key frame to determine a first score for each behavior category of the current video segment; performing behavior recognition on the current video segment according to the stacked optical flow to determine a second score for each behavior category of the current video segment; and performing behavior recognition on the current video segment according to the consecutive-frame still images to determine a third score for each behavior category of the current video segment;
a segment composition unit, configured to determine the spatial-stream score of each behavior category of the video to be identified according to the first score of each behavior category of each video segment, determine the temporal-stream score of each behavior category of the video to be identified according to the second score of each behavior category of each video segment, and determine the 3D score of each behavior category of the video to be identified according to the third score of each behavior category of each video segment;
a final integration unit, configured to generate the final score of each behavior category of the video to be identified according to the spatial-stream score, temporal-stream score, and 3D score of each behavior category of the video to be identified.
Optionally, the device further comprises:
a second setting unit, configured to set a weight for the spatial-stream score, a weight for the temporal-stream score, and a weight for the 3D score;
wherein the final integration unit is configured to, for each behavior category, perform: determining the final score of the current behavior category of the video to be identified using formula four, according to the spatial-stream score, temporal-stream score, and 3D score of the current behavior category and the three preset weights, wherein formula four is:
O = aS + bT + cM;
where O is the final score of the current behavior category of the video to be identified, S is the spatial-stream score of the current behavior category, T is the temporal-stream score of the current behavior category, M is the 3D score of the current behavior category, a is the weight of the spatial-stream score, b is the weight of the temporal-stream score, and c is the weight of the 3D score.
Optionally, the final integration unit is configured to input the spatial-stream score, temporal-stream score, and 3D score of each behavior category of the video to be identified into a trained linear SVM classifier, and determine the final score of each behavior category of the video to be identified using the linear SVM classifier;
wherein the kernel function of the linear SVM classifier is:
k(x, x_i) = ((x · x_i) + 1)^d, where d is a preset constant and a positive integer.
Optionally, when performing behavior recognition on the current video segment according to the key frame to determine the first score of each behavior category of the current video segment, the fragment processing unit is specifically configured to:
input the key frame of the current video segment into a trained 2D-convolution spatial-stream model, and perform behavior recognition on the key frame of the current video segment with the 2D-convolution spatial-stream model to determine the first score of each behavior category of the current video segment.
Optionally, when performing behavior recognition on the current video segment according to the stacked optical flow to determine the second score of each behavior category of the current video segment, the fragment processing unit is specifically configured to:
input the stacked optical flow of the current video segment into a trained 2D-convolution temporal-stream model, and perform behavior recognition on the stacked optical flow of the current video segment with the 2D-convolution temporal-stream model to determine the second score of each behavior category of the current video segment.
Optionally, when performing behavior recognition on the current video segment according to the consecutive-frame still images to determine the third score of each behavior category of the current video segment, the fragment processing unit is specifically configured to:
input the consecutive-frame still images of the current video segment into a trained 3D convolution model, and perform behavior recognition on the consecutive-frame still images of the current video segment with the 3D convolution model to determine the third score of each behavior category of the current video segment.
Optionally, when determining the spatial-stream score of each behavior category of the video to be identified according to the first score of each behavior category of each video segment, the segment composition unit is specifically configured to:
for each behavior category, perform: determining the spatial-stream score of the current behavior category of the video to be identified using formula one, according to the first scores of the current behavior category of the video segments, wherein formula one is:
Vid_α = (1/K) Σ_{k=1}^{K} P_k^α
where Vid_α is the spatial-stream score of the current behavior category of the video to be identified, K is the total number of the at least two video segments, and P_k^α is the first score of the current behavior category of the k-th video segment.
Optionally, when determining the temporal-stream score of each behavior category of the video to be identified according to the second score of each behavior category of each video segment, the segment composition unit is specifically configured to:
for each behavior category, perform: determining the temporal-stream score of the current behavior category of the video to be identified using formula two, according to the second scores of the current behavior category of the video segments, wherein formula two is:
Vid_β = (1/K) Σ_{k=1}^{K} P_k^β
where Vid_β is the temporal-stream score of the current behavior category of the video to be identified, K is the total number of the at least two video segments, and P_k^β is the second score of the current behavior category of the k-th video segment.
Optionally, when determining the 3D score of each behavior category of the video to be identified according to the third score of each behavior category of each video segment, the segment composition unit is specifically configured to:
for each behavior category, perform: determining the 3D score of the current behavior category of the video to be identified using formula three, according to the third scores of the current behavior category of the video segments, wherein formula three is:
Vid_γ = (1/K) Σ_{k=1}^{K} P_k^γ
where Vid_γ is the 3D score of the current behavior category of the video to be identified, K is the total number of the at least two video segments, and P_k^γ is the third score of the current behavior category of the k-th video segment.
In embodiments of the present invention, the video to be identified is divided into at least two video segments, and a key frame, a stacked optical flow, and consecutive-frame still images are extracted from each video segment. Behavior recognition is performed on each video segment from these three aspects, and the results are then fused into a final recognition result for the video to be identified. Because the video is segmented and each segment is analyzed from the three aspects, behavior recognition is performed from the temporal, spatial, and spatio-temporal angles, so the video to be identified can be recognized comprehensively based on more angles and more information; fusing these results into a final recognition result greatly improves the accuracy of behavior recognition.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a behavior recognition method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another behavior recognition method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a behavior recognition device provided by an embodiment of the present invention.
Specific embodiment
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a behavior recognition method, which may comprise the following steps:
Step 101: preset at least two behavior categories;
Step 102: divide the video to be identified into at least two video segments;
Step 103: for each video segment, perform: extract a key frame, a stacked optical flow, and consecutive-frame still images of the current video segment; perform behavior recognition on the current video segment according to the key frame to determine the first score of each behavior category of the current video segment; perform behavior recognition on the current video segment according to the stacked optical flow to determine the second score of each behavior category of the current video segment; perform behavior recognition on the current video segment according to the consecutive-frame still images to determine the third score of each behavior category of the current video segment;
Step 104: determine the spatial-stream score of each behavior category of the video to be identified according to the first score of each behavior category of each video segment; determine the temporal-stream score of each behavior category of the video to be identified according to the second score of each behavior category of each video segment; determine the 3D score of each behavior category of the video to be identified according to the third score of each behavior category of each video segment;
Step 105: generate the final score of each behavior category of the video to be identified according to the spatial-stream score, temporal-stream score, and 3D score of each behavior category of the video to be identified.
In embodiments of the present invention, the video to be identified is divided into at least two video segments, and a key frame, a stacked optical flow, and consecutive-frame still images are extracted from each video segment. Behavior recognition is performed on each video segment from these three aspects, and the results are then fused into a final recognition result for the video to be identified. Because the video is segmented and each segment is analyzed from the three aspects, behavior recognition is performed from the temporal, spatial, and spatio-temporal angles, so the video to be identified can be recognized comprehensively based on more angles and more information; fusing these results into a final recognition result greatly improves the accuracy of behavior recognition.
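Steps 101 to 105 can be sketched end to end. The three per-segment scoring functions below are stand-ins for the trained spatial-stream, temporal-stream, and 3D models; the average consensus per stream and the default fusion weights are illustrative assumptions:

```python
def recognize(video_segments, categories, score_fns, weights=(0.3, 0.3, 0.4)):
    """Steps 103-105 in miniature: score each segment three ways,
    average each stream over segments, then fuse with O = a*S + b*T + c*M.
    score_fns is a triple of stand-in functions (key-frame, optical-flow,
    3D) mapping a segment to a {category: score} dict."""
    fused = {}
    for cat in categories:
        stream_scores = []
        for fn in score_fns:  # spatial, temporal, 3D streams
            per_segment = [fn(seg)[cat] for seg in video_segments]
            stream_scores.append(sum(per_segment) / len(per_segment))
        fused[cat] = sum(w * s for w, s in zip(weights, stream_scores))
    # the predicted behavior category is the one with the highest final score
    return max(fused, key=fused.get), fused
```

Here the segments are represented abstractly; in practice each would be a tensor of frames fed to the respective model.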
In embodiments of the present invention, the score of a behavior category indicates how strongly the behavior in the video belongs to that behavior category; the higher the score, the more likely it is that the current video segment belongs to that behavior category.
Specifically, the first score is the score that the behavior in the current video segment belongs to a certain behavior category when behavior recognition is performed based on the key frame; the second score is the corresponding score when behavior recognition is performed based on the stacked optical flow; and the third score is the corresponding score when behavior recognition is performed based on the consecutive-frame still images.
The spatial-stream score is the score that the behavior in the video to be identified belongs to a certain behavior category when behavior recognition is performed based on key frames; the temporal-stream score is the corresponding score when behavior recognition is performed based on stacked optical flow; and the 3D score is the corresponding score when behavior recognition is performed based on consecutive-frame still images.
The final score is the score that the behavior in the video to be identified belongs to a certain behavior category after the key-frame mode, the stacked-optical-flow mode, and the consecutive-frame still-image mode of behavior recognition have been fused.
In embodiments of the present invention, dividing the video to be identified into at least two video segments and extracting the key frame, stacked optical flow, and consecutive-frame still images of the current video segment can be realized by sparse sampling.
In embodiments of the present invention, dividing the video to be identified into at least two video segments comprises: dividing the video to be identified evenly into at least two video segments, so that every video segment has the same duration.
In embodiments of the present invention, extracting the key frame of the current video segment comprises: randomly selecting one frame image from the current video segment and using that frame image as the key frame. Here the key frame may be a single still image.
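A minimal sketch of the even split and random key-frame selection described above, with frame indices standing in for frames and the helper names chosen for illustration:

```python
import random

def split_segments(num_frames, k):
    # divide frame indices evenly into k segments of (near-)equal duration
    bounds = [num_frames * i // k for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]

def sample_key_frame(segment):
    # the key frame is a single still image drawn at random from the segment
    return random.choice(segment)
```

The stacked optical flow and the consecutive-frame still images would be sampled from each segment in a similar sparse fashion.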
In an embodiment of the present invention, the method further comprises:
presetting a weight for the spatial-stream score, a weight for the temporal-stream score, and a weight for the 3D score;
and generating the final score of each behavior category of the video to be identified comprises:
for each behavior category, performing: determining the final score of the current behavior category of the video to be identified using formula four, according to the spatial-stream score, temporal-stream score, and 3D score of the current behavior category and the three preset weights, wherein formula four is:
O = aS + bT + cM;
where O is the final score of the current behavior category of the video to be identified, S is the spatial-stream score of the current behavior category, T is the temporal-stream score of the current behavior category, M is the 3D score of the current behavior category, a is the weight of the spatial-stream score, b is the weight of the temporal-stream score, and c is the weight of the 3D score.
In embodiments of the present invention, the spatial-stream score, temporal-stream score, and 3D score are fused into the final score by setting a weight for each of them. For example, a may be 0.3, b may be 0.3, and c may be 0.4. Fusing the three scores in this way makes the final score more accurate.
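Formula four with the example weights can be sketched directly; the dictionary layout of the per-category scores is an assumption:

```python
def fuse_scores(spatial, temporal, conv3d, a=0.3, b=0.3, c=0.4):
    """Late fusion per formula four, O = a*S + b*T + c*M, applied to
    every behavior category; the weights default to the example values."""
    return {cat: a * spatial[cat] + b * temporal[cat] + c * conv3d[cat]
            for cat in spatial}
```

The weights would typically be tuned on a validation set rather than fixed in advance.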
In an embodiment of the present invention, generating the final score of each behavior category of the video to be identified comprises:
inputting the spatial-stream score, temporal-stream score, and 3D score of each behavior category of the video to be identified into a trained linear SVM classifier, and determining the final score of each behavior category of the video to be identified using the linear SVM classifier;
wherein the kernel function of the linear SVM classifier is:
k(x, x_i) = ((x · x_i) + 1)^d, where d is a preset constant and a positive integer.
In the embodiment of the present invention, the trained SVM classifier is used to fuse the spatial flow scoring, the time flow scoring and the 3D scoring, and the above kernel function is set, so that the final scoring can be made more accurate. Here, d may be 1, 2, 3, etc.; specifically, d may be 9.
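The kernel above is the standard polynomial kernel. A minimal sketch follows; the three-component score vectors (spatial flow, time flow, 3D scoring for one behavior classification) and the choice d = 2 are hypothetical illustrations, not values fixed by the patent:

```python
def poly_kernel(x, x_i, d=2):
    """Polynomial kernel k(x, x_i) = ((x . x_i) + 1)^d used for the SVM fusion."""
    dot = sum(a * b for a, b in zip(x, x_i))
    return (dot + 1) ** d

# Hypothetical score vectors for one behavior classification.
x = [0.8, 0.7, 0.9]   # (spatial flow, time flow, 3D) scorings of one sample
xi = [0.6, 0.5, 0.7]  # same scorings for a training sample
k = poly_kernel(x, xi, d=2)  # ((0.48 + 0.35 + 0.63) + 1)^2
```

A library implementation such as an SVM with a polynomial kernel would compute exactly this quantity internally for every pair of training vectors.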
In addition, when training the linear SVM classifier, the SVM classifier is trained by inputting the video-level prediction scores and labels of the training set into the linear SVM classifier.
In an embodiment of the present invention, the performing Activity recognition on the current video segment according to the key frame and determining the first scoring of each behavior classification of the current video segment comprises:
inputting the key frame of the current video segment into a trained spatial flow model of 2D convolution, and performing Activity recognition on the key frame of the current video segment by using the spatial flow model of 2D convolution to determine the first scoring of each behavior classification of the current video segment.
In the embodiment of the present invention, the key frame is processed by the spatial flow model of 2D convolution in a two-stream network, so that Activity recognition of the current video segment is realized. The spatial flow model of 2D convolution can perform Activity recognition on the current video segment from the spatial perspective.
In an embodiment of the present invention, the performing Activity recognition on the current video segment according to the stacked optical flow and determining the second scoring of each behavior classification of the current video segment comprises:
inputting the stacked optical flow of the current video segment into a trained time flow model of 2D convolution, and performing Activity recognition on the stacked optical flow of the current video segment by using the time flow model of 2D convolution to determine the second scoring of each behavior classification of the current video segment.
In the embodiment of the present invention, the stacked optical flow is processed by the time flow model of 2D convolution in a two-stream network, so that Activity recognition of the current video segment is realized. The time flow model of 2D convolution can perform Activity recognition on the current video segment from the temporal perspective.
In an embodiment of the present invention, the performing Activity recognition on the current video segment according to the successive frame still images and determining the third scoring of each behavior classification of the current video segment comprises:
inputting the successive frame still images of the current video segment into a trained 3D convolution model, and performing Activity recognition on the successive frame still images of the current video segment by using the 3D convolution model to determine the third scoring of each behavior classification of the current video segment.
In the embodiment of the present invention, the successive frame still images are processed by the 3D convolution model, so that Activity recognition of the current video segment is realized. The 3D convolution model can perform Activity recognition on the current video segment from the spatio-temporal perspective.
In addition, the spatial flow model of 2D convolution, the time flow model of 2D convolution and the 3D convolution model may be trained in the following manner:
J1: constructing three network models, namely the spatial flow model of 2D convolution, the time flow model of 2D convolution and the 3D convolution model; feeding the acquired training set data into each network model; and outputting the video action classification scores through a series of operations of convolution, pooling, nonlinear activation functions, normalization, fully connected layers and the softmax function, thereby completing the forward propagation process of the network;
the softmax function is: y_i = e^{X_i} / Σ_{j=1}^{N} e^{X_j}, wherein X_i is the output of the i-th neuron of the last layer of the network, i ∈ [1, N], and N is the total number of behavior classifications.
J2: calculating the cross-entropy loss between the final output layer data of each network model and the true value, and adjusting the parameters of each layer in each network model through back propagation, thereby completing the back propagation process of the network;
the specific form of the cross-entropy loss function is as follows:
L = -Σ_i z_i ln y_i;
wherein z_i is the true classification result.
J3: continuously iterating the forward propagation and back propagation processes of the first two steps until the networks converge.
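The softmax output of step J1 and the cross-entropy loss of step J2 can be sketched in plain Python as follows; the three-class logits are a hypothetical example:

```python
import math

def softmax(x):
    """y_i = exp(X_i) / sum_j exp(X_j); the max is subtracted for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y, z):
    """L = -sum_i z_i * ln(y_i), where z is the one-hot true classification result."""
    return -sum(zi * math.log(yi) for yi, zi in zip(y, z))

logits = [2.0, 1.0, 0.1]            # hypothetical last-layer outputs X_i, N = 3
y = softmax(logits)                 # predicted class probabilities
loss = cross_entropy(y, [1, 0, 0])  # true class is the first one
```

During training (step J2) the gradient of this loss with respect to the logits drives the parameter updates of each of the three networks.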
In an embodiment of the present invention, the determining the spatial flow scoring of each behavior classification of the video to be identified according to the first scoring of each behavior classification of each video clip comprises:
for each behavior classification, executing: according to the first scoring of the current behavior classification of each video clip, determining the spatial flow scoring of the current behavior classification of the video to be identified by using formula one, wherein formula one is:
Vid_α = (1/K) Σ_{k=1}^{K} P_k^α;
wherein Vid_α is the spatial flow scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^α is the first scoring of the current behavior classification of the k-th video clip.
In an embodiment of the present invention, the determining the time flow scoring of each behavior classification of the video to be identified according to the second scoring of each behavior classification of each video clip comprises:
for each behavior classification, executing: according to the second scoring of the current behavior classification of each video clip, determining the time flow scoring of the current behavior classification of the video to be identified by using formula two, wherein formula two is:
Vid_β = (1/K) Σ_{k=1}^{K} P_k^β;
wherein Vid_β is the time flow scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^β is the second scoring of the current behavior classification of the k-th video clip.
In an embodiment of the present invention, the determining the 3D scoring of each behavior classification of the video to be identified according to the third scoring of each behavior classification of each video clip comprises:
for each behavior classification, executing: according to the third scoring of the current behavior classification of each video clip, determining the 3D scoring of the current behavior classification of the video to be identified by using formula three, wherein formula three is:
Vid_γ = (1/K) Σ_{k=1}^{K} P_k^γ;
wherein Vid_γ is the 3D scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^γ is the third scoring of the current behavior classification of the k-th video clip.
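Formulas one to three each combine the K clip-level scorings of one behavior classification into a single video-level scoring. The formula images are not legible in this copy, so the averaging form below is an assumption (averaging over clips is the common consensus choice for segment-based recognition):

```python
def video_level_score(clip_scores):
    """Combine the per-clip scorings of one behavior classification into a
    video-level scoring by averaging over the K video clips
    (assumed form of formulas one/two/three)."""
    K = len(clip_scores)
    return sum(clip_scores) / K

# Hypothetical first scorings of "running" for K = 3 clips (P1, P2, P3).
spatial_running = video_level_score([0.9, 0.8, 0.7])  # Vid_alpha for "running"
```

The same function applies unchanged to the second scorings (time flow) and third scorings (3D).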
As shown in Fig. 2, an embodiment of the present invention provides an Activity recognition method, which may comprise the following steps:
Step 201: presetting at least two behavior classifications.
Specifically, the behavior classifications may include: running, jumping, walking, climbing, playing basketball, playing tennis, playing volleyball, playing football, etc.
Step 202: dividing the video to be identified into at least two video clips.
Specifically, the video to be identified may be evenly divided into at least two video clips.
Step 203: for each video clip, executing: extracting the key frame, the stacked optical flow and the successive frame still images of the current video clip; inputting the key frame of the current video clip into a trained spatial flow model of 2D convolution, and performing Activity recognition on the key frame of the current video clip by using the spatial flow model of 2D convolution to determine the first scoring of each behavior classification of the current video clip; inputting the stacked optical flow of the current video clip into a trained time flow model of 2D convolution, and performing Activity recognition on the stacked optical flow of the current video clip by using the time flow model of 2D convolution to determine the second scoring of each behavior classification of the current video clip; inputting the successive frame still images of the current video clip into a trained 3D convolution model, and performing Activity recognition on the successive frame still images of the current video clip by using the 3D convolution model to determine the third scoring of each behavior classification of the current video clip.
For example, for video clip 1, executing:
extracting the key frame, the stacked optical flow and the successive frame still images of video clip 1;
inputting the key frame of video clip 1 into the trained spatial flow model of 2D convolution, and performing Activity recognition on the key frame of video clip 1 by using the spatial flow model of 2D convolution to determine the first scoring of each behavior classification of video clip 1, such as the first scoring of the behavior classification "running", the first scoring of the behavior classification "jumping", the first scoring of the behavior classification "playing basketball", etc.;
inputting the stacked optical flow of video clip 1 into the trained time flow model of 2D convolution, and performing Activity recognition on the stacked optical flow of video clip 1 by using the time flow model of 2D convolution to determine the second scoring of each behavior classification of video clip 1, such as the second scoring of the behavior classification "running", the second scoring of the behavior classification "jumping", the second scoring of the behavior classification "playing basketball", etc.;
inputting the successive frame still images of video clip 1 into the trained 3D convolution model, and performing Activity recognition on the successive frame still images of video clip 1 by using the 3D convolution model to determine the third scoring of each behavior classification of video clip 1, such as the third scoring of the behavior classification "running", the third scoring of the behavior classification "jumping", the third scoring of the behavior classification "playing basketball", etc.
In addition, when extracting the stacked optical flow, the extraction may be performed at 5 frames per second.
The size of the key frame may be 1 × 3 × L × W (e.g., the number of RGB channels is 3), the size of the stacked optical flow may be 5 × 2 × L × W (five optical flow maps directly extracted from six consecutive still images, wherein each optical flow map has two channels representing the pixel changes in the x direction and the pixel changes in the y direction), and the size of the successive frame still images may be 16 × 3 × L × W (16 consecutive RGB frames), wherein L and W are the length and width of the input image.
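The three input sizes listed above can be checked with a small shape sketch; the resolution L = W = 224 is a hypothetical choice, not one fixed by the patent:

```python
def stacked_flow_shape(n_frames, L, W):
    """n_frames consecutive still images yield n_frames - 1 optical flow maps,
    each with an x-displacement channel and a y-displacement channel."""
    return (n_frames - 1, 2, L, W)

L, W = 224, 224  # hypothetical input height and width

key_frame = (1, 3, L, W)            # one RGB key frame for the spatial flow model
flow = stacked_flow_shape(6, L, W)  # stacked optical flow for the time flow model
clip = (16, 3, L, W)                # 16 consecutive RGB frames for the 3D model
```

With six consecutive frames this reproduces the 5 × 2 × L × W stacked optical flow size stated in the text.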
Step 204: according to the first scoring of each behavior classification of each video clip, determining the spatial flow scoring of each behavior classification of the video to be identified; according to the second scoring of each behavior classification of each video clip, determining the time flow scoring of each behavior classification of the video to be identified; according to the third scoring of each behavior classification of each video clip, determining the 3D scoring of each behavior classification of the video to be identified.
For example, suppose the video to be identified has three video clips in total, namely video clip 1, video clip 2 and video clip 3. For the behavior classification "running", the first scoring of the behavior classification "running" of video clip 1 is P1, the first scoring of the behavior classification "running" of video clip 2 is P2, and the first scoring of the behavior classification "running" of video clip 3 is P3. Then, according to P1, P2 and P3, the spatial flow scoring of the behavior classification "running" of the video to be identified is determined.
Step 205: according to the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification, generating the final scoring of each behavior classification of the video to be identified.
For example, for the behavior classification "running", the spatial flow scoring of the behavior classification "running" of the video to be identified is S1, the time flow scoring of the behavior classification "running" of the video to be identified is T1, and the 3D scoring of the behavior classification "running" of the video to be identified is M1. Then, according to S1, T1 and M1, the final scoring of the behavior classification "running" of the video to be identified is generated.
The embodiment of the present invention makes full use of the temporal, spatial and spatio-temporal information in the video by fusing multiple models, takes the influence of long actions into account, and fuses multiple segment-level Activity recognition results to obtain a video-level Activity recognition result, thereby obtaining a more accurate recognition result.
The embodiment of the present invention fuses the 2D convolution and 3D convolution models, wherein 2D convolution refers to the two-stream network, which analyzes the video content from the temporal and spatial perspectives respectively, making full use of the motion change information and the appearance information, while the 3D convolution model analyzes the video content from the spatio-temporal perspective, making full use of the spatio-temporal information between multiple frames. By fusing the 2D convolution and 3D convolution models, more comprehensive information is obtained. Meanwhile, the embodiment of the present invention divides the video into multiple segments and obtains the video-level Activity recognition result by fusing the prediction results of the multiple segments. Therefore, the embodiment of the present invention not only utilizes the information of multiple dimensions, but also widens the field of view of the time dimension and obtains a video-level prediction result. Meanwhile, the embodiment of the present invention uses sparse sampling, which reduces the training parameters while obtaining global information. The embodiment of the present invention can obtain more complete and wider information, thereby achieving a more accurate recognition result.
As shown in Fig. 3, an embodiment of the present invention provides an Activity recognition device, comprising:
a first setting unit 301, configured to set at least two behavior classifications;
a cutting unit 302, configured to divide a video to be identified into at least two video clips;
a fragment processing unit 303, configured to, for each video clip, execute: extracting the key frame, the stacked optical flow and the successive frame still images of the current video clip; performing Activity recognition on the current video clip according to the key frame to determine the first scoring of each behavior classification of the current video clip; performing Activity recognition on the current video clip according to the stacked optical flow to determine the second scoring of each behavior classification of the current video clip; and performing Activity recognition on the current video clip according to the successive frame still images to determine the third scoring of each behavior classification of the current video clip;
a segment composition unit 304, configured to determine the spatial flow scoring of each behavior classification of the video to be identified according to the first scoring of each behavior classification of each video clip; determine the time flow scoring of each behavior classification of the video to be identified according to the second scoring of each behavior classification of each video clip; and determine the 3D scoring of each behavior classification of the video to be identified according to the third scoring of each behavior classification of each video clip;
a final integrated unit 305, configured to generate the final scoring of each behavior classification of the video to be identified according to the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification.
In an embodiment of the present invention, the device further comprises:
a second setting unit, configured to set the weight of the spatial flow scoring, the weight of the time flow scoring and the weight of the 3D scoring;
the final integrated unit is configured to, for each behavior classification, execute: according to the spatial flow scoring of the current behavior classification, the time flow scoring of the current behavior classification, the 3D scoring of the current behavior classification, the weight of the spatial flow scoring, the weight of the time flow scoring and the weight of the 3D scoring, determining the final scoring of the current behavior classification of the video to be identified by using formula four, wherein formula four is:
O = aS + bT + cM;
wherein O is the final scoring of the current behavior classification of the video to be identified, S is the spatial flow scoring of the current behavior classification, T is the time flow scoring of the current behavior classification, M is the 3D scoring of the current behavior classification, a is the weight of the spatial flow scoring, b is the weight of the time flow scoring, and c is the weight of the 3D scoring.
In an embodiment of the present invention, the final integrated unit is configured to input the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification into a trained linear SVM classifier, and determine the final scoring of each behavior classification of the video to be identified by using the linear SVM classifier;
wherein the kernel function of the linear SVM classifier is:
k(x, x_i) = ((x·x_i) + 1)^d, where d is a preset constant and d is a positive integer.
In an embodiment of the present invention, when performing Activity recognition on the current video segment according to the key frame and determining the first scoring of each behavior classification of the current video segment, the fragment processing unit is specifically configured to:
input the key frame of the current video segment into a trained spatial flow model of 2D convolution, and perform Activity recognition on the key frame of the current video segment by using the spatial flow model of 2D convolution to determine the first scoring of each behavior classification of the current video segment.
In an embodiment of the present invention, when performing Activity recognition on the current video segment according to the stacked optical flow and determining the second scoring of each behavior classification of the current video segment, the fragment processing unit is specifically configured to:
input the stacked optical flow of the current video segment into a trained time flow model of 2D convolution, and perform Activity recognition on the stacked optical flow of the current video segment by using the time flow model of 2D convolution to determine the second scoring of each behavior classification of the current video segment.
In an embodiment of the present invention, when performing Activity recognition on the current video segment according to the successive frame still images and determining the third scoring of each behavior classification of the current video segment, the fragment processing unit is specifically configured to:
input the successive frame still images of the current video segment into a trained 3D convolution model, and perform Activity recognition on the successive frame still images of the current video segment by using the 3D convolution model to determine the third scoring of each behavior classification of the current video segment.
In an embodiment of the present invention, when determining the spatial flow scoring of each behavior classification of the video to be identified according to the first scoring of each behavior classification of each video clip, the segment composition unit is specifically configured to:
for each behavior classification, execute: according to the first scoring of the current behavior classification of each video clip, determining the spatial flow scoring of the current behavior classification of the video to be identified by using formula one, wherein formula one is:
Vid_α = (1/K) Σ_{k=1}^{K} P_k^α;
wherein Vid_α is the spatial flow scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^α is the first scoring of the current behavior classification of the k-th video clip.
In an embodiment of the present invention, when determining the time flow scoring of each behavior classification of the video to be identified according to the second scoring of each behavior classification of each video clip, the segment composition unit is specifically configured to:
for each behavior classification, execute: according to the second scoring of the current behavior classification of each video clip, determining the time flow scoring of the current behavior classification of the video to be identified by using formula two, wherein formula two is:
Vid_β = (1/K) Σ_{k=1}^{K} P_k^β;
wherein Vid_β is the time flow scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^β is the second scoring of the current behavior classification of the k-th video clip.
In an embodiment of the present invention, when determining the 3D scoring of each behavior classification of the video to be identified according to the third scoring of each behavior classification of each video clip, the segment composition unit is specifically configured to:
for each behavior classification, execute: according to the third scoring of the current behavior classification of each video clip, determining the 3D scoring of the current behavior classification of the video to be identified by using formula three, wherein formula three is:
Vid_γ = (1/K) Σ_{k=1}^{K} P_k^γ;
wherein Vid_γ is the 3D scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^γ is the third scoring of the current behavior classification of the k-th video clip.
For the information interaction, execution process and other contents between the units in the above device, since they are based on the same concept as the method embodiments of the present invention, the details can be found in the description of the method embodiments of the present invention and are not repeated here.
An embodiment of the present invention provides a readable medium comprising execution instructions, wherein when a processor of a storage controller executes the execution instructions, the storage controller executes any one of the Activity recognition methods provided by the embodiments of the present invention.
An embodiment of the present invention provides a storage controller, comprising: a processor, a memory and a bus;
the memory is configured to store execution instructions, the processor is connected with the memory through the bus, and when the storage controller runs, the processor executes the execution instructions stored in the memory, so that the storage controller executes any one of the Activity recognition methods provided by the embodiments of the present invention.
Each embodiment of the present invention has at least the following beneficial effects:
1. In the embodiment of the present invention, the video to be identified is divided into at least two video clips, and the key frame, the stacked optical flow and the successive frame still images of each video clip are extracted. Activity recognition is performed on each video clip from the three aspects of the key frame, the stacked optical flow and the successive frame still images, and the results are then fused into the final recognition result of the video to be identified. In the identification process, the video to be identified is segmented and Activity recognition is performed on each video clip from these three aspects, so that Activity recognition can be performed on the video to be identified from the three perspectives of time, space and space-time, and the video to be identified can be comprehensively identified based on more perspectives and more information, which are finally fused into the final recognition result, greatly improving the accuracy of Activity recognition.
2. The embodiment of the present invention makes full use of the temporal, spatial and spatio-temporal information in the video by fusing multiple models, takes the influence of long actions into account, and fuses multiple segment-level Activity recognition results to obtain a video-level Activity recognition result, thereby obtaining a more accurate recognition result.
3. The embodiment of the present invention fuses the 2D convolution and 3D convolution models, wherein 2D convolution refers to the two-stream network, which analyzes the video content from the temporal and spatial perspectives respectively, making full use of the motion change information and the appearance information, while the 3D convolution model analyzes the video content from the spatio-temporal perspective, making full use of the spatio-temporal information between multiple frames. By fusing the 2D convolution and 3D convolution models, more comprehensive information is obtained. Meanwhile, the embodiment of the present invention divides the video into multiple segments and obtains the video-level Activity recognition result by fusing the prediction results of the multiple segments. Therefore, the embodiment of the present invention not only utilizes the information of multiple dimensions, but also widens the field of view of the time dimension and obtains a video-level prediction result, and can obtain more complete and wider information, thereby achieving a more accurate recognition result.
It should be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the sentence "including a ..." does not exclude the existence of other identical elements in the process, method, article or device comprising the element.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium, and when the program is executed, the steps of the above method embodiments are executed. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.
Finally, it should be noted that the above are only preferred embodiments of the present invention, merely intended to illustrate the technical solutions of the present invention and not to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. An Activity recognition method, characterized in that at least two behavior classifications are preset, the method comprising:
dividing a video to be identified into at least two video clips;
for each video clip, executing: extracting the key frame, the stacked optical flow and the successive frame still images of the current video clip; performing Activity recognition on the current video clip according to the key frame to determine the first scoring of each behavior classification of the current video clip; performing Activity recognition on the current video clip according to the stacked optical flow to determine the second scoring of each behavior classification of the current video clip; and performing Activity recognition on the current video clip according to the successive frame still images to determine the third scoring of each behavior classification of the current video clip;
according to the first scoring of each behavior classification of each video clip, determining the spatial flow scoring of each behavior classification of the video to be identified; according to the second scoring of each behavior classification of each video clip, determining the time flow scoring of each behavior classification of the video to be identified; according to the third scoring of each behavior classification of each video clip, determining the 3D scoring of each behavior classification of the video to be identified;
according to the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification, generating the final scoring of each behavior classification of the video to be identified.
2. The method according to claim 1, further comprising:
presetting a weight of the spatial flow scoring, a weight of the time flow scoring and a weight of the 3D scoring;
wherein the generating a final scoring of each behavior classification of the video to be identified according to the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification comprises:
for each behavior classification, performing: determining, using formula four, the final scoring of the current behavior classification of the video to be identified according to the spatial flow scoring of the current behavior classification, the time flow scoring of the current behavior classification, the 3D scoring of the current behavior classification, the weight of the spatial flow scoring, the weight of the time flow scoring and the weight of the 3D scoring, wherein formula four is:
O = aS + bT + cM;
where O is the final scoring of the current behavior classification of the video to be identified, S is the spatial flow scoring of the current behavior classification, T is the time flow scoring of the current behavior classification, M is the 3D scoring of the current behavior classification, a is the weight of the spatial flow scoring, b is the weight of the time flow scoring, and c is the weight of the 3D scoring.
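Formula four is a per-class weighted sum of the three stream scorings. A minimal sketch of this late-fusion step follows; the function name and the example weight values are illustrative, since the patent leaves the preset weights open:

```python
def fuse_scores(spatial, temporal, conv3d, a=0.4, b=0.4, c=0.2):
    """Formula four, O = a*S + b*T + c*M, applied per behavior classification.

    spatial / temporal / conv3d: per-class scorings of the video to be
    identified; a, b, c: preset weights (example values, not from the patent).
    """
    return [a * s + b * t + c * m
            for s, t, m in zip(spatial, temporal, conv3d)]

# Two behavior classifications, three stream scorings each.
final = fuse_scores([0.9, 0.1], [0.8, 0.2], [0.7, 0.3])
best = max(range(len(final)), key=final.__getitem__)  # index of top class
```

The recognized behavior is then simply the classification with the highest final scoring.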
3. The method according to claim 1, wherein
the generating a final scoring of each behavior classification of the video to be identified according to the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification comprises:
inputting the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification into a trained SVM classifier, and determining the final scoring of each behavior classification of the video to be identified using the SVM classifier;
wherein the kernel function of the SVM classifier is:
k(x, x_i) = ((x · x_i) + 1)^d, where d is a preset constant and a positive integer.
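The kernel given is the standard inhomogeneous polynomial kernel. A sketch of the kernel and of forming one SVM feature vector from the three stream scorings of a class follows; the training of the classifier itself is outside the claim, and all variable names and numbers are illustrative:

```python
def poly_kernel(x, x_i, d=2):
    """Claimed kernel: k(x, x_i) = ((x . x_i) + 1)^d, with d a preset
    positive-integer constant."""
    dot = sum(a * b for a, b in zip(x, x_i))
    return (dot + 1) ** d

# Feature vector for one behavior classification: its spatial flow,
# time flow and 3D scorings (example values).
feature = [0.82, 0.75, 0.60]
support = [0.80, 0.70, 0.65]  # e.g. a support vector of the trained SVM
k_val = poly_kernel(feature, support)
```

With d = 1 the kernel reduces to an affine dot product, which matches the "linear" reading of the source text; larger d gives a polynomial decision surface.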
4. The method according to any one of claims 1 to 3, wherein
the performing activity recognition on the current video clip according to the key frame and determining the first scoring of each behavior classification of the current video clip comprises:
inputting the key frame of the current video clip into a trained spatial flow model of 2D convolution, and performing activity recognition on the key frame of the current video clip using the spatial flow model of 2D convolution to determine the first scoring of each behavior classification of the current video clip;
and/or
the performing activity recognition on the current video clip according to the stacked optical flow and determining the second scoring of each behavior classification of the current video clip comprises:
inputting the stacked optical flow of the current video clip into a trained time flow model of 2D convolution, and performing activity recognition on the stacked optical flow of the current video clip using the time flow model of 2D convolution to determine the second scoring of each behavior classification of the current video clip;
and/or
the performing activity recognition on the current video clip according to the successive-frame still images and determining the third scoring of each behavior classification of the current video clip comprises:
inputting the successive-frame still images of the current video clip into a trained 3D convolution model, and performing activity recognition on the successive-frame still images of the current video clip using the 3D convolution model to determine the third scoring of each behavior classification of the current video clip.
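Claim 4 routes each input modality to its own trained network. A structural sketch follows, with stub functions standing in for the trained 2D and 3D networks; the real model architectures, input shapes and class counts are not specified in the claim, so everything numeric here is illustrative:

```python
# Stubs standing in for the three trained networks of claim 4:
# a 2D-convolution spatial flow model, a 2D-convolution time flow model,
# and a 3D convolution model. Each maps its modality to per-class scorings.
def spatial_flow_model(key_frame):      # key frame -> first scorings
    return [0.7, 0.3]

def time_flow_model(stacked_flow):      # stacked optical flow -> second scorings
    return [0.6, 0.4]

def conv3d_model(still_frames):         # successive still frames -> third scorings
    return [0.8, 0.2]

def score_clip(clip):
    """Per-clip step of claim 4: three modalities in, three scorings out."""
    return (spatial_flow_model(clip["key_frame"]),
            time_flow_model(clip["stacked_flow"]),
            conv3d_model(clip["still_frames"]))

first, second, third = score_clip(
    {"key_frame": None, "stacked_flow": None, "still_frames": None})
```

The "and/or" structure of the claim means any subset of the three branches may be used; the sketch simply shows all three.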
5. The method according to any one of claims 1 to 3, wherein
the determining a spatial flow scoring of each behavior classification of the video to be identified according to the first scoring of each behavior classification of each video clip comprises:
for each behavior classification, performing: determining, using formula one, the spatial flow scoring of the current behavior classification of the video to be identified according to the first scoring of the current behavior classification of each video clip, wherein formula one is:
Vid_α = (1/K) · Σ_{k=1}^{K} P_k^α;
where Vid_α is the spatial flow scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^α is the first scoring of the current behavior classification of the k-th video clip;
and/or
the determining a time flow scoring of each behavior classification of the video to be identified according to the second scoring of each behavior classification of each video clip comprises:
for each behavior classification, performing: determining, using formula two, the time flow scoring of the current behavior classification of the video to be identified according to the second scoring of the current behavior classification of each video clip, wherein formula two is:
Vid_β = (1/K) · Σ_{k=1}^{K} P_k^β;
where Vid_β is the time flow scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^β is the second scoring of the current behavior classification of the k-th video clip;
and/or
the determining a 3D scoring of each behavior classification of the video to be identified according to the third scoring of each behavior classification of each video clip comprises:
for each behavior classification, performing: determining, using formula three, the 3D scoring of the current behavior classification of the video to be identified according to the third scoring of the current behavior classification of each video clip, wherein formula three is:
Vid_γ = (1/K) · Σ_{k=1}^{K} P_k^γ;
where Vid_γ is the 3D scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^γ is the third scoring of the current behavior classification of the k-th video clip.
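Formulas one to three aggregate the K per-clip scorings into one video-level scoring per behavior classification. The formulas themselves appear only as images in the source, so the per-clip average below (the usual consensus function in segment-based pipelines such as temporal segment networks) is an assumption consistent with the symbol definitions:

```python
def video_level_scores(per_clip_scores):
    """Aggregate per-clip, per-class scorings into video-level scorings.

    per_clip_scores: K lists, one per video clip, each holding one scoring
    per behavior classification. Averaging over K is an assumption; the
    patent's formulas one to three are rendered as images in the source.
    """
    K = len(per_clip_scores)
    n_classes = len(per_clip_scores[0])
    return [sum(clip[c] for clip in per_clip_scores) / K
            for c in range(n_classes)]

# Three clips (K = 3), two behavior classifications: Vid_alpha per class.
vid_alpha = video_level_scores([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
```

The same function serves all three streams: apply it to the first, second and third scorings to obtain Vid_α, Vid_β and Vid_γ respectively.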
6. An activity recognition device, comprising:
a first setting unit configured to set at least two behavior classifications;
a cutting unit configured to split a video to be identified into at least two video clips;
a fragment processing unit configured to, for each video clip, perform: extracting a key frame, a stacked optical flow and successive-frame still images of the current video clip; performing activity recognition on the current video clip according to the key frame to determine a first scoring of each behavior classification of the current video clip; performing activity recognition on the current video clip according to the stacked optical flow to determine a second scoring of each behavior classification of the current video clip; and performing activity recognition on the current video clip according to the successive-frame still images to determine a third scoring of each behavior classification of the current video clip;
a segment composition unit configured to determine a spatial flow scoring of each behavior classification of the video to be identified according to the first scoring of each behavior classification of each video clip, determine a time flow scoring of each behavior classification of the video to be identified according to the second scoring of each behavior classification of each video clip, and determine a 3D scoring of each behavior classification of the video to be identified according to the third scoring of each behavior classification of each video clip; and
a final integrated unit configured to generate a final scoring of each behavior classification of the video to be identified according to the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification.
7. The device according to claim 6, further comprising:
a second setting unit configured to set a weight of the spatial flow scoring, a weight of the time flow scoring and a weight of the 3D scoring;
wherein the final integrated unit is configured to, for each behavior classification, perform: determining, using formula four, the final scoring of the current behavior classification of the video to be identified according to the spatial flow scoring of the current behavior classification, the time flow scoring of the current behavior classification, the 3D scoring of the current behavior classification, the weight of the spatial flow scoring, the weight of the time flow scoring and the weight of the 3D scoring, wherein formula four is:
O = aS + bT + cM;
where O is the final scoring of the current behavior classification of the video to be identified, S is the spatial flow scoring of the current behavior classification, T is the time flow scoring of the current behavior classification, M is the 3D scoring of the current behavior classification, a is the weight of the spatial flow scoring, b is the weight of the time flow scoring, and c is the weight of the 3D scoring.
8. The device according to claim 6, wherein
the final integrated unit is configured to input the spatial flow scoring of each behavior classification of the video to be identified, the time flow scoring of each behavior classification and the 3D scoring of each behavior classification into a trained SVM classifier, and to determine the final scoring of each behavior classification of the video to be identified using the SVM classifier;
wherein the kernel function of the SVM classifier is:
k(x, x_i) = ((x · x_i) + 1)^d, where d is a preset constant and a positive integer.
9. The device according to any one of claims 6 to 8, wherein
when performing the activity recognition on the current video clip according to the key frame and determining the first scoring of each behavior classification of the current video clip, the fragment processing unit is specifically configured to:
input the key frame of the current video clip into a trained spatial flow model of 2D convolution, and perform activity recognition on the key frame of the current video clip using the spatial flow model of 2D convolution to determine the first scoring of each behavior classification of the current video clip;
and/or
when performing the activity recognition on the current video clip according to the stacked optical flow and determining the second scoring of each behavior classification of the current video clip, the fragment processing unit is specifically configured to:
input the stacked optical flow of the current video clip into a trained time flow model of 2D convolution, and perform activity recognition on the stacked optical flow of the current video clip using the time flow model of 2D convolution to determine the second scoring of each behavior classification of the current video clip;
and/or
when performing the activity recognition on the current video clip according to the successive-frame still images and determining the third scoring of each behavior classification of the current video clip, the fragment processing unit is specifically configured to:
input the successive-frame still images of the current video clip into a trained 3D convolution model, and perform activity recognition on the successive-frame still images of the current video clip using the 3D convolution model to determine the third scoring of each behavior classification of the current video clip.
10. The device according to any one of claims 6 to 8, wherein
when determining the spatial flow scoring of each behavior classification of the video to be identified according to the first scoring of each behavior classification of each video clip, the segment composition unit is specifically configured to:
for each behavior classification, perform: determining, using formula one, the spatial flow scoring of the current behavior classification of the video to be identified according to the first scoring of the current behavior classification of each video clip, wherein formula one is:
Vid_α = (1/K) · Σ_{k=1}^{K} P_k^α;
where Vid_α is the spatial flow scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^α is the first scoring of the current behavior classification of the k-th video clip;
and/or
when determining the time flow scoring of each behavior classification of the video to be identified according to the second scoring of each behavior classification of each video clip, the segment composition unit is specifically configured to:
for each behavior classification, perform: determining, using formula two, the time flow scoring of the current behavior classification of the video to be identified according to the second scoring of the current behavior classification of each video clip, wherein formula two is:
Vid_β = (1/K) · Σ_{k=1}^{K} P_k^β;
where Vid_β is the time flow scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^β is the second scoring of the current behavior classification of the k-th video clip;
and/or
when determining the 3D scoring of each behavior classification of the video to be identified according to the third scoring of each behavior classification of each video clip, the segment composition unit is specifically configured to:
for each behavior classification, perform: determining, using formula three, the 3D scoring of the current behavior classification of the video to be identified according to the third scoring of the current behavior classification of each video clip, wherein formula three is:
Vid_γ = (1/K) · Σ_{k=1}^{K} P_k^γ;
where Vid_γ is the 3D scoring of the current behavior classification of the video to be identified, K is the total number of the at least two video clips, and P_k^γ is the third scoring of the current behavior classification of the k-th video clip.
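Taken together, the device of claims 6 to 10 is a split, per-clip scoring, per-stream aggregation, weighted fusion pipeline. An end-to-end sketch follows under the same assumptions as above: stub functions in place of the trained networks, an assumed per-clip average for formulas one to three, and illustrative weights for formula four:

```python
def recognize(video_frames, n_clips=3, weights=(0.4, 0.4, 0.2)):
    """End-to-end flow of the claimed device, with stub per-clip scorings.

    Splits the video into n_clips clips (cutting unit), scores each clip
    with three stub "models" (fragment processing unit), averages each
    stream over clips (segment composition unit, assumed consensus), then
    applies O = a*S + b*T + c*M per class (final integrated unit).
    All numeric details are illustrative, not from the patent.
    """
    step = max(1, len(video_frames) // n_clips)
    clips = [video_frames[i:i + step]
             for i in range(0, len(video_frames), step)][:n_clips]

    def stub_scores(clip, bias):        # stand-in for a trained network
        return [0.5 + bias, 0.5 - bias]

    firsts  = [stub_scores(c, 0.2) for c in clips]   # spatial flow, per clip
    seconds = [stub_scores(c, 0.1) for c in clips]   # time flow, per clip
    thirds  = [stub_scores(c, 0.3) for c in clips]   # 3D, per clip

    def avg(per_clip):                  # assumed consensus over K clips
        K = len(per_clip)
        return [sum(s[c] for s in per_clip) / K for c in range(2)]

    S, T, M = avg(firsts), avg(seconds), avg(thirds)
    a, b, c = weights
    return [a * s + b * t + c * m for s, t, m in zip(S, T, M)]

final = recognize(list(range(30)))  # 30 dummy "frames", two dummy classes
```

Claim 8's alternative replaces the last line's weighted sum with an SVM over the (S, T, M) triples; everything upstream is unchanged.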
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910491344.XA CN110210430A (en) | 2019-06-06 | 2019-06-06 | A kind of Activity recognition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110210430A true CN110210430A (en) | 2019-09-06 |
Family
ID=67791406
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910491344.XA Pending CN110210430A (en) | 2019-06-06 | 2019-06-06 | A kind of Activity recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110210430A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109101896A (en) * | 2018-07-19 | 2018-12-28 | 电子科技大学 | A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism |
CN109376603A (en) * | 2018-09-25 | 2019-02-22 | 北京周同科技有限公司 | A kind of video frequency identifying method, device, computer equipment and storage medium |
2019-06-06: CN application CN201910491344.XA filed, published as CN110210430A (en), status pending.
Non-Patent Citations (2)
Title |
---|
HAIMA1998: "Action Recognition in the Video Analysis field: an overview" (in Chinese), HTTPS://BLOG.CSDN.NET/HAIMA1998/ARTICLE/DETAILS/78846442, 19 December 2017 (2017-12-19) * |
She Yumei et al.: "Principles and Applications of Artificial Intelligence" (in Chinese), Shanghai Jiao Tong University Press, page 177 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111768668A (en) * | 2020-03-31 | 2020-10-13 | 杭州海康威视数字技术股份有限公司 | Experimental operation scoring method, device, equipment and storage medium |
CN111768668B (en) * | 2020-03-31 | 2022-09-02 | 杭州海康威视数字技术股份有限公司 | Experimental operation scoring method, device, equipment and storage medium |
CN113807222A (en) * | 2021-09-07 | 2021-12-17 | 中山大学 | Video question-answering method and system for end-to-end training based on sparse sampling |
CN113807222B (en) * | 2021-09-07 | 2023-06-27 | 中山大学 | Video question-answering method and system for end-to-end training based on sparse sampling |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Teinet: Towards an efficient architecture for video recognition | |
Varol et al. | Long-term temporal convolutions for action recognition | |
US12051275B2 (en) | Video processing method and apparatus for action recognition | |
Huang et al. | Multi-scale dense convolutional networks for efficient prediction | |
Peng et al. | Two-stream collaborative learning with spatial-temporal attention for video classification | |
JP7147078B2 (en) | Video frame information labeling method, apparatus, apparatus and computer program | |
Bilen et al. | Dynamic image networks for action recognition | |
Zhang et al. | Gender and smile classification using deep convolutional neural networks | |
CN102334118B (en) | Promoting method and system for personalized advertisement based on interested learning of user | |
Tran et al. | Two-stream flow-guided convolutional attention networks for action recognition | |
CN110516536A (en) | A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification | |
CN106778854A (en) | Activity recognition method based on track and convolutional neural networks feature extraction | |
EP2908268A2 (en) | Face detector training method, face detection method, and apparatus | |
Kulhare et al. | Key frame extraction for salient activity recognition | |
WO2019228316A1 (en) | Action recognition method and apparatus | |
US20230351718A1 (en) | Apparatus and method for image classification | |
Hammam et al. | Real-time multiple spatiotemporal action localization and prediction approach using deep learning | |
CN110210430A (en) | A kind of Activity recognition method and device | |
CN110046568A (en) | A kind of video actions recognition methods based on Time Perception structure | |
Huang et al. | Human action recognition based on temporal pose CNN and multi-dimensional fusion | |
Symeonidis et al. | Neural attention-driven non-maximum suppression for person detection | |
CN112200096A (en) | Method, device and storage medium for realizing real-time abnormal behavior recognition based on compressed video | |
Zhao et al. | Saliency-guided video classification via adaptively weighted learning | |
Aliakbarian et al. | Deep action-and context-aware sequence learning for activity recognition and anticipation | |
Pouthier et al. | Active speaker detection as a multi-objective optimization with uncertainty-based multimodal fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||