CN106980826A - Action recognition method based on a neural network - Google Patents
Action recognition method based on a neural network
- Publication number
- CN106980826A (application CN201710156415.1A)
- Authority
- CN
- China
- Prior art keywords
- video
- trained
- feature vector
- network
- feature extractor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a human action recognition method based on a neural network, the method comprising the following steps: training N mutually independent 3D convolutional neural networks on a video database to serve as a video feature extractor; training a multiple-instance learning classifier according to the video feature extractor; and inputting a video to be recognized, extracting video features with the trained networks, and classifying the action with the classifier. The invention avoids the influence of numerous noise features on the classification result and the negative effect of a fixed sample length on the action recognition result.
Description
Technical field
The present invention relates to the field of human action recognition, and more particularly to an action recognition method based on a neural network.
Background art
With the development of the mobile Internet, the carriers of information have gradually expanded from text to audio, images, video, and other forms. In recent years, the volume of video data has grown explosively, and its applications have become increasingly diverse, spanning security, surveillance, entertainment, and many other fields [1]. Faced with data of such magnitude, traditional manual processing can no longer meet people's needs. Using the powerful storage and computing capabilities of computers to recognize and understand video information therefore has important research value and broad application prospects.
In fact, research on video in the field of computer vision has been under way for decades, with topics including action recognition, anomaly detection, and video retrieval. Human action recognition is an important research direction among them and has made considerable progress; its results are applied in intelligent surveillance, medical care, video retrieval, human-computer interaction, behavior analysis, virtual reality, and other fields [2]. Among these, human-computer interaction is the most mature; for example, Microsoft's Kinect (motion-sensing) camera can capture and understand human actions. However, research on human action recognition still faces great difficulties and challenges, such as action recognition in natural scenes and group action recognition. These problems keep human action recognition a long way from being applied effectively in real scenarios.
With the development of parallel computing hardware (GPUs, CPU clusters) and the emergence of large-scale training data, convolutional neural networks (CNNs) have risen again and achieved breakthroughs in object recognition, natural language processing, speech classification, human-computer interaction, human tracking, image restoration, denoising, segmentation, and other directions. In the field of video recognition, however, applications of convolutional neural networks are still rare.
Summary of the invention
The invention provides an action recognition method based on a neural network. The invention avoids the influence of numerous noise features on the classification result and the negative effect of a fixed sample length on the action recognition result, as described below.
A human action recognition method based on a neural network comprises the following steps:
Training N mutually independent 3D convolutional neural networks on a video database to serve as a video feature extractor;
Training a multiple-instance learning classifier according to the video feature extractor;
Inputting a video to be recognized, extracting video features with the trained networks, and classifying the action with the classifier.
Wherein, the step of training N mutually independent 3D convolutional neural networks on a video database to serve as a video feature extractor is specifically:
Each video in the video library is divided into several video clips of frame length Fi; each clip serves as one training sample for network i, with which a 3D convolutional neural network is trained. The N independent 3D convolutional neural networks together form the video feature extractor.
Wherein, the step of training the multiple-instance learning classifier according to the video feature extractor is specifically:
Each video in the database is fed into the video feature extractor and feature vectors are extracted; each video is then regarded as one bag of multiple-instance learning, with its feature vectors as the instances in the bag, and multiple-instance learning is carried out.
Wherein, the step of feeding each video in the database into the video feature extractor and extracting feature vectors is specifically:
Given a video M, it is divided into Mi video clips of frame length Fi; taking these clips as the input of network i, Mi n-dimensional feature vectors are extracted. Video M thus yields (M1+M2+...+MN) feature vectors in total.
Wherein, the step of inputting a video to be recognized, extracting video features with the trained networks, and classifying the action with the classifier is specifically:
P n-dimensional feature vectors are extracted with the trained networks; the whole video is regarded as one bag of multiple-instance learning, with each feature vector as an instance in the bag, and the action is classified by multiple-instance learning.
The beneficial effects of the technical scheme provided by the present invention are:
1. On the basis of C3D (3D convolution) features, a method of producing multiple features for the same video is introduced, and multiple-instance learning is used to reduce the influence of numerous noise features on the classification result;
2. Considering the influence of sequence length on the action recognition result, video clips of several different lengths are used for feature learning, avoiding the negative effect of a fixed sample length on the action recognition result.
Brief description of the drawings
Fig. 1 is a flow chart of the action recognition method based on a neural network;
Fig. 2 is a structural diagram of the 3D convolutional neural network;
Fig. 3 is a schematic diagram of 3D convolutional neural network training;
Fig. 4 is a schematic diagram of C3D feature extraction.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.
To solve the above problems, video features must be extracted comprehensively, automatically, and accurately, and then classified. Studies show that C3D features achieve high accuracy in video classification, and that multiple-instance learning can eliminate the influence of numerous noise features on the classification result.
Embodiment 1
The embodiment of the present invention proposes an action recognition method based on a neural network. Referring to Fig. 1, the action recognition method comprises the following steps:
101: Train N mutually independent 3D convolutional neural networks on a video database to serve as a video feature extractor;
102: Train a multiple-instance learning classifier according to the video feature extractor;
103: Input a video to be recognized, extract video features with the trained networks, and classify the action with the classifier.
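At a high level, the three steps combine into a single recognition pipeline. The following is a minimal sketch of that flow; the `extractors` and `mil_classifier` callables are hypothetical stand-ins for the trained 3D ConvNets and the multiple-instance learning classifier, not the patent's actual implementation:

```python
def recognize_action(video, extractors, mil_classifier):
    """Classify the action in `video` using N trained clip-feature
    extractors and a multiple-instance learning (MIL) classifier."""
    bag = []                         # one MIL bag per video
    for extract in extractors:       # N independent 3D ConvNets
        bag.extend(extract(video))   # each yields several feature vectors
    return mil_classifier(bag)       # one label for the whole bag
```

The point of the design is that all feature vectors from all N networks land in a single bag, so the classifier can down-weight noisy instances instead of being forced to trust every clip.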
Wherein, the step in 101 of training N mutually independent 3D convolutional neural networks on a video database to serve as a video feature extractor is specifically:
Each video in the video library is divided into several video clips of frame length Fi; each clip serves as one training sample for network i, with which a 3D convolutional neural network is trained. The N independent 3D convolutional neural networks together form the video feature extractor.
Wherein, the step in 102 of training the multiple-instance learning classifier according to the video feature extractor is specifically:
Each video in the database is fed into the video feature extractor and feature vectors are extracted; each video is then regarded as one bag of multiple-instance learning, with its feature vectors as the instances in the bag, and multiple-instance learning is carried out.
Wherein, the above step of feeding each video in the database into the video feature extractor and extracting feature vectors is specifically:
Given a video M, it is divided into Mi video clips of frame length Fi; taking these clips as the input of network i, Mi n-dimensional feature vectors are extracted. Video M thus yields (M1+M2+...+MN) feature vectors in total.
Wherein, the step in 103 of inputting a video to be recognized, extracting video features with the trained networks, and classifying the action with the classifier is specifically:
P n-dimensional feature vectors are extracted with the trained networks; the whole video is regarded as one bag of multiple-instance learning, with each feature vector as an instance in the bag, and the action is classified by multiple-instance learning.
In summary, through steps 101-103 the embodiment of the present invention avoids the influence of numerous noise features on the classification result and the negative effect of a fixed sample length on the action recognition result, substantially improving the robustness and accuracy of human action recognition.
Embodiment 2
The scheme in Embodiment 1 is further described below with reference to specific examples and Figs. 2-4:
201: A video database is established, and N mutually independent 3D convolutional neural networks are trained on it to serve as the video feature extractor, producing C3D features;
Wherein, the C3D features are learned with 3D ConvNets (3D convolutional neural networks), whose structure is shown in Fig. 2. All convolution filters are of size 3*3*3 with a spatio-temporal stride of 1. Except for Pool1 (1*2*2), all pooling layers are of size 2*2*2 with a stride of 1. Finally, 4096-dimensional outputs are obtained at the fully connected layers fc6 and fc7.
Wherein, the video feature extractor requires training N mutually independent 3D ConvNets, and the training process of each network is the same. Referring to Fig. 3 and taking network i (i=1, 2, 3, ..., N) as an example, the process is: each video in the database is divided into several video clips of frame length Fi, and each clip serves as one training sample for network i, with which a 3D ConvNet is trained. Varying the frame length Fi and repeating this procedure yields N different 3D ConvNets, which together form the video feature extractor of the human action recognition system.
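The segmentation of each video into fixed-length training clips can be sketched as follows. This is a minimal illustration under the author's own naming; the concrete clip lengths, and what the patent does with a trailing remainder shorter than Fi, are assumptions (the remainder is dropped here):

```python
def split_into_clips(num_frames, clip_len):
    """Return (start, end) frame-index pairs for non-overlapping clips
    of length clip_len. A trailing remainder shorter than clip_len is
    dropped; the patent does not specify remainder handling."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]

# Network i trains on clips of its own frame length Fi, e.g.:
clip_lengths = [8, 16, 32]   # hypothetical choices of Fi
clips_per_network = {F: split_into_clips(100, F) for F in clip_lengths}
```

Because each network sees a different Fi, the ensemble covers several temporal scales of the same action, which is what lets the method avoid committing to one fixed sample length.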
202: A multiple-instance learning classifier is trained according to the video feature extractor;
Wherein, the feature vectors of each video in the database are extracted with the N trained 3D ConvNets, and the feature extraction process of each network is the same. Referring to Fig. 4 and taking network i (i=1, 2, 3, ..., N) as an example, the process is: video M is divided into Mi video clips of frame length Fi, which serve as the input of network i, from which Mi feature vectors are extracted. Video M therefore yields (M1+M2+...+MN) feature vectors in total through the feature extractor (the N 3D ConvNets).
Finally, each video in the video library is regarded as one bag of multiple-instance learning, each feature vector extracted by the video feature extractor is regarded as an instance in the bag, and multiple-instance learning is carried out to train the classifier model.
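The bag construction described above can be sketched as follows. The extractor functions here are hypothetical stand-ins for the trained networks (in the patent each would be a forward pass through a 3D ConvNet up to fc6/fc7):

```python
def build_bag(frames, extractors_by_len):
    """Collect all feature vectors of one video into a single MIL bag.
    `extractors_by_len` maps a clip length Fi to a function that maps a
    clip (a slice of frames) to one feature vector; both are stand-ins."""
    bag = []
    for F, extract in extractors_by_len.items():
        # network i consumes non-overlapping clips of its length Fi
        for start in range(0, len(frames) - F + 1, F):
            bag.append(extract(frames[start:start + F]))
    return bag   # M1 + M2 + ... + MN instances for this video
```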
203: A video to be recognized is input, video features are extracted with the trained networks, and the action is classified with the classifier.
When performing action recognition, a video K to be recognized is input. It is first passed through each of the N trained 3D ConvNets to extract (K1+K2+...+KN) feature vectors; video K is regarded as one bag of multiple-instance learning, with the feature vectors as the instances in the bag; and the classification result is obtained with the classifier trained in step 202.
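The bag-level decision in step 203 can be sketched as below. The patent does not name a specific multiple-instance learning algorithm, so the max-instance aggregation rule used here (a bag gets the class of its highest-scoring instance, a standard MIL assumption) and the `instance_scores` callable are both the author's assumptions:

```python
def classify_bag(bag, instance_scores, num_classes):
    """Assign the bag the class whose best instance score is highest.
    `instance_scores(x)` is assumed to return a per-class score list
    for one instance x (e.g. from a trained instance-level model)."""
    best = [max(instance_scores(x)[c] for x in bag)
            for c in range(num_classes)]
    return best.index(max(best))
```

Under this rule a few noisy instances with uniformly low scores cannot drag the bag toward a wrong class, which mirrors the patent's stated goal of suppressing noise features.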
In summary, through steps 201-203 the embodiment of the present invention avoids the influence of numerous noise features on the classification result and the negative effect of a fixed sample length on the action recognition result, substantially improving the robustness and accuracy of human action recognition.
References:
[1]Turaga P,Chellappa R,Subrahmanian V S,et al.Machine recognition of
human activities:A survey[J].IEEE Transactions on Circuits and Systems for
Video Technology,2008,18(11):1473-1488.
[2]Aggarwal J K,Ryoo M S.Human activity analysis:A review[J].ACM
Computing Surveys(CSUR),2011,43(3):16.
[3]Laptev I.On space-time interest points[J].International Journal of
Computer Vision,2005,64(2-3):107-123.
[4]Ji S,Xu W,Yang M,et al.3D convolutional neural networks for human
action recognition[J].IEEE transactions on pattern analysis and machine
intelligence,2013,35(1):221-231.
[5]Tran D,Bourdev L,Fergus R,et al.C3D:generic features for video
analysis[J].CoRR,abs/1412.0767,2014,2:7.
Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the merits of the embodiments.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (5)
1. A human action recognition method based on a neural network, characterized in that the method comprises the following steps:
Training N mutually independent 3D convolutional neural networks on a video database to serve as a video feature extractor;
Training a multiple-instance learning classifier according to the video feature extractor;
Inputting a video to be recognized, extracting video features with the trained networks, and classifying the action with the classifier.
2. The human action recognition method based on a neural network according to claim 1, characterized in that the step of training N mutually independent 3D convolutional neural networks on a video database to serve as a video feature extractor is specifically:
Each video in the video library is divided into several video clips of frame length Fi; each clip serves as one training sample for network i, with which a 3D convolutional neural network is trained. The N independent 3D convolutional neural networks together form the video feature extractor.
3. The human action recognition method based on a neural network according to claim 1, characterized in that the step of training the multiple-instance learning classifier according to the video feature extractor is specifically:
Each video in the database is fed into the video feature extractor and feature vectors are extracted; each video is then regarded as one bag of multiple-instance learning, with its feature vectors as the instances in the bag, and multiple-instance learning is carried out.
4. The human action recognition method based on a neural network according to claim 3, characterized in that the step of feeding each video in the database into the video feature extractor and extracting feature vectors is specifically:
Given a video M, it is divided into Mi video clips of frame length Fi; taking these clips as the input of network i, Mi n-dimensional feature vectors are extracted. Video M thus yields (M1+M2+...+MN) feature vectors in total.
5. The human action recognition method based on a neural network according to claim 1, characterized in that the step of inputting a video to be recognized, extracting video features with the trained networks, and classifying the action with the classifier is specifically:
P n-dimensional feature vectors are extracted with the trained networks; the whole video is regarded as one bag of multiple-instance learning, with each feature vector as an instance in the bag, and the action is classified by multiple-instance learning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710156415.1A CN106980826A (en) | 2017-03-16 | 2017-03-16 | Action recognition method based on a neural network
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710156415.1A CN106980826A (en) | 2017-03-16 | 2017-03-16 | Action recognition method based on a neural network
Publications (1)
Publication Number | Publication Date |
---|---|
CN106980826A true CN106980826A (en) | 2017-07-25 |
Family
ID=59338802
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710156415.1A Pending CN106980826A (en) | 2017-03-16 | 2017-03-16 | Action recognition method based on a neural network
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106980826A (en) |
- 2017-03-16: Application CN201710156415.1A filed; CN106980826A (en) published, status Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102509084A (en) * | 2011-11-18 | 2012-06-20 | 中国科学院自动化研究所 | Multi-instance-learning-based method for identifying horror video scenes
CN104778457A (en) * | 2015-04-18 | 2015-07-15 | 吉林大学 | Video face recognition algorithm based on multi-instance learning
CN105930792A (en) * | 2016-04-19 | 2016-09-07 | 武汉大学 | Human action classification method based on a video local feature dictionary
CN106203283A (en) * | 2016-06-30 | 2016-12-07 | 重庆理工大学 | Action recognition method based on a 3D convolutional deep neural network and depth video
CN106407903A (en) * | 2016-08-31 | 2017-02-15 | 四川瞳知科技有限公司 | Multi-scale convolutional neural network-based real-time human abnormal behavior recognition method
CN106504255A (en) * | 2016-11-02 | 2017-03-15 | 南京大学 | Multi-target image joint segmentation method based on multi-label multi-instance learning
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108734095A (en) * | 2018-04-10 | 2018-11-02 | 南京航空航天大学 | Motion detection method based on a 3D convolutional neural network
CN108734095B (en) * | 2018-04-10 | 2022-05-20 | 南京航空航天大学 | Motion detection method based on 3D convolutional neural network
CN108846852A (en) * | 2018-04-11 | 2018-11-20 | 杭州电子科技大学 | Surveillance video abnormal event detection method based on multiple instances and time series
CN108846852B (en) * | 2018-04-11 | 2022-03-08 | 杭州电子科技大学 | Monitoring video abnormal event detection method based on multiple examples and time sequence
CN108960059A (en) * | 2018-06-01 | 2018-12-07 | 众安信息技术服务有限公司 | Video action recognition method and device
CN108965585A (en) * | 2018-06-22 | 2018-12-07 | 成都博宇科技有限公司 | User identification method based on smartphone sensors
CN109376696A (en) * | 2018-11-28 | 2019-02-22 | 北京达佳互联信息技术有限公司 | Video action classification method, apparatus, computer device, and storage medium
CN109376696B (en) * | 2018-11-28 | 2020-10-23 | 北京达佳互联信息技术有限公司 | Video motion classification method and device, computer equipment and storage medium
CN111160117A (en) * | 2019-12-11 | 2020-05-15 | 青岛联合创智科技有限公司 | Abnormal behavior detection method based on multi-instance learning modeling
WO2022116479A1 (en) * | 2020-12-01 | 2022-06-09 | 南京智谷人工智能研究院有限公司 | End-to-end multi-instance learning method based on automatic instance selection
CN113011322A (en) * | 2021-03-17 | 2021-06-22 | 南京工业大学 | Detection model training method and detection method for specific abnormal behaviors in surveillance video
CN113011322B (en) * | 2021-03-17 | 2023-09-05 | 贵州安防工程技术研究中心有限公司 | Detection model training method and detection method for monitoring specific abnormal behavior of video
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106980826A (en) | Action recognition method based on a neural network | |
CN107944442B (en) | Object detection device and method based on an improved convolutional neural network | |
CN110353675B (en) | Electroencephalogram signal emotion recognition method and device based on picture generation | |
WO2021139324A1 (en) | Image recognition method and apparatus, computer-readable storage medium and electronic device | |
CN109034210A (en) | Object detection method based on hyper-feature fusion and a multi-scale pyramid network | |
Srinivasan et al. | Interpretable human action recognition in compressed domain | |
CN108921037B (en) | Emotion recognition method based on a BN-Inception two-stream network | |
CN111488805B (en) | Video behavior recognition method based on salient feature extraction | |
CN101866429A (en) | Training method of multi-moving object action identification and multi-moving object action identification method | |
CN106355171A (en) | Networked video surveillance system | |
CN115240117A (en) | Helmet wearing detection method in construction site construction scene | |
CN115578770A (en) | Small sample facial expression recognition method and system based on self-supervision | |
Koli et al. | Human action recognition using deep neural networks | |
CN116403286A (en) | Social grouping method for large-scene video | |
CN114187546B (en) | Combined action recognition method and system | |
CN111862031A (en) | Face synthetic image detection method and device, electronic equipment and storage medium | |
US20220027688A1 (en) | Image identification device, method for performing semantic segmentation, and storage medium | |
Monisha et al. | Enhanced automatic recognition of human emotions using machine learning techniques | |
Liang et al. | Mask-guided multiscale feature aggregation network for hand gesture recognition | |
CN114155572A (en) | Facial expression recognition method and system | |
CN111881803B (en) | Face recognition method based on improved YOLOv3 | |
CN103778445B (en) | Cold-rolled strip steel surface defect reason analyzing method and system | |
Lin et al. | Micro-expression recognition based on spatiotemporal Gabor filters | |
CN114565772A (en) | Set feature extraction method and device, electronic equipment and storage medium | |
CN110555342B (en) | Image identification method and device and image equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20170725 |