CN107808150A - Human body video action recognition method, device, storage medium and processor - Google Patents
- Publication number: CN107808150A
- Application number: CN201711154691.0A
- Authority: CN (China)
- Prior art keywords: network model, neural network, video, convolution, target
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
Abstract
The invention discloses a human body video action recognition method, device, storage medium and processor. The method includes: creating a first convolutional neural network model from preset full-channel three-dimensional convolution kernels; training the first convolutional neural network model on a preset action recognition data set to obtain a second convolutional neural network model; replacing at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units to obtain a third convolutional neural network model; training the third convolutional neural network model using the preset action recognition data set and the second convolutional neural network model to obtain a target convolutional neural network model; and inputting a video to be recognized into the target convolutional neural network model to obtain a target recognition result. The invention solves the technical problems of low computational accuracy and poor computational efficiency in existing human action recognition methods.
Description
Technical field
The present invention relates to the field of video processing, and in particular to a human body video action recognition method, device, storage medium and processor.
Background
With the informatization of society and the rapid development of the web, videos of all kinds have emerged in large numbers, such as security surveillance videos, self-shot videos and network media videos. Intelligent motion analysis and recognition technology plays an important role in applications such as large-scale video retrieval, human-computer interaction, security monitoring and early warning, and video classification.
Conventional action recognition is carried out with techniques such as optical flow and dense trajectory analysis, using manually designed and selected features; it is computationally complex and has performance bottlenecks. With the breakthrough development of deep learning in the field of image classification, deep learning techniques have gradually penetrated the field of video action recognition. However, current human action recognition methods still suffer from low computational accuracy and poor computational efficiency.
No effective solution to the above problems has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a human body video action recognition method, device, storage medium and processor, so as to at least solve the technical problems of low computational accuracy and poor computational efficiency in existing human action recognition methods.
According to one aspect of the embodiments of the present invention, a human body video action recognition method is provided. The method includes: creating a first convolutional neural network model from preset full-channel three-dimensional convolution kernels; training the first convolutional neural network model on a preset action recognition data set to obtain a second convolutional neural network model, where the second convolutional neural network model is the first convolutional neural network model after reaching a convergence state; replacing at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units to obtain a third convolutional neural network model; training the third convolutional neural network model using the preset action recognition data set and the second convolutional neural network model to obtain a target convolutional neural network model, where the target convolutional neural network model is the third convolutional neural network model after reaching a convergence state; and inputting a video to be recognized into the target convolutional neural network model to obtain a target recognition result.
Further, before training the first convolutional neural network model on the preset action recognition data set, the method also includes: obtaining video data from a target video; segmenting the video data into multiple video clips, where each video clip contains only a single action category; and adding preset category labels to the video clips to obtain the preset action recognition data set.
Further, replacing at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units includes: replacing the at least part of the full-channel three-dimensional convolution layers with single-channel three-dimensional convolution layers; and adding, in cascade after each single-channel three-dimensional convolution layer, a batch normalization layer, a nonlinear layer, a residual branch, a superposition unit and a 1x1 grouped convolution layer, to obtain the single-channel three-dimensional convolution unit.
Further, inputting the video to be recognized into the target convolutional neural network model to obtain the target recognition result includes: segmenting the video to be recognized to obtain multiple second video sequences of the same preset length; inputting the multiple second video sequences into the target convolutional neural network to obtain preliminary recognition results corresponding to the multiple second video sequences; and processing the preliminary recognition results according to a preset data processing mode to obtain the target recognition result, where the preset data processing mode includes at least one of: taking the extremum of the preliminary recognition results, taking the average of the preliminary recognition results, and taking a weighted sum of the preliminary recognition results.
According to another aspect of the embodiments of the present invention, a human body video action recognition device is also provided. The device includes: a creating unit, configured to create a first convolutional neural network model from preset full-channel three-dimensional convolution kernels; a first training unit, configured to train the first convolutional neural network model on a preset action recognition data set to obtain a second convolutional neural network model, where the second convolutional neural network model is the first convolutional neural network model after reaching a convergence state; a replacement unit, configured to replace at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units to obtain a third convolutional neural network model; a second training unit, configured to train the third convolutional neural network model using the preset action recognition data set and the second convolutional neural network model to obtain a target convolutional neural network model, where the target convolutional neural network model is the third convolutional neural network model after reaching a convergence state; and a processing unit, configured to input a video to be recognized into the target convolutional neural network model to obtain a target recognition result.
Further, the device also includes: an acquiring unit, configured to obtain video data from a target video; a segmentation unit, configured to segment the video data into multiple video clips, where each video clip contains only a single action category; and an adding unit, configured to add preset category labels to the video clips to obtain the preset action recognition data set.
Further, the replacement unit includes: a replacing subunit, configured to replace the at least part of the full-channel three-dimensional convolution layers with single-channel three-dimensional convolution layers; and an adding subunit, configured to add, in cascade after each single-channel three-dimensional convolution layer, a batch normalization layer, a nonlinear layer, a residual branch, a superposition unit and a 1x1 grouped convolution layer, to obtain the single-channel three-dimensional convolution unit.
Further, the processing unit includes: a splitting subunit, configured to segment the video to be recognized to obtain multiple second video sequences of the same preset length; an input subunit, configured to input the multiple second video sequences into the target convolutional neural network to obtain preliminary recognition results corresponding to the multiple second video sequences; and a processing subunit, configured to process the preliminary recognition results according to a preset data processing mode to obtain the target recognition result, where the preset data processing mode includes at least one of: taking the extremum of the preliminary recognition results, taking the average of the preliminary recognition results, and taking a weighted sum of the preliminary recognition results.
According to yet another aspect of the embodiments of the present invention, a storage medium is provided. The storage medium includes a stored program, where, when the program runs, the device on which the storage medium is located is controlled to perform the above human body video action recognition method.
According to yet another aspect of the embodiments of the present invention, a processor is provided. The processor is configured to run a program, where the above human body video action recognition method is performed when the program runs.
In the embodiments of the present invention, a first convolutional neural network model is created from preset full-channel three-dimensional convolution kernels; the first convolutional neural network model is trained on a preset action recognition data set to obtain a second convolutional neural network model, where the second convolutional neural network model is the first convolutional neural network model after reaching a convergence state; at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model are replaced with single-channel three-dimensional convolution units to obtain a third convolutional neural network model; the third convolutional neural network model is trained using the preset action recognition data set and the second convolutional neural network model to obtain a target convolutional neural network model, where the target convolutional neural network model is the third convolutional neural network model after reaching a convergence state; and a video to be recognized is input into the target convolutional neural network model to obtain a target recognition result, thereby achieving the technical effect of improving the accuracy and efficiency of human action recognition.
Brief description of the drawings
The accompanying drawings described herein are used to provide a further understanding of the present invention and form a part of this application. The schematic embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a schematic flowchart of an optional human body video action recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another optional human body video action recognition method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of yet another optional human body video action recognition method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of still another optional human body video action recognition method according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an optional human body video action recognition device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an optional first convolutional neural network model according to an embodiment of the present invention.
Detailed description of the embodiments
In order that those skilled in the art may better understand the present solution, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to such a process, method, product or device.
Embodiment 1
According to the embodiments of the present invention, an embodiment of a human body video action recognition method is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the order herein.
Fig. 1 is a schematic flowchart of an optional human body video action recognition method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: create a first convolutional neural network model from preset full-channel three-dimensional convolution kernels;
Step S104: train the first convolutional neural network model on a preset action recognition data set to obtain a second convolutional neural network model, where the second convolutional neural network model is the first convolutional neural network model after reaching a convergence state;
Step S106: replace at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units to obtain a third convolutional neural network model;
Step S108: train the third convolutional neural network model using the preset action recognition data set and the second convolutional neural network model to obtain a target convolutional neural network model, where the target convolutional neural network model is the third convolutional neural network model after reaching a convergence state;
Step S110: input a video to be recognized into the target convolutional neural network model to obtain a target recognition result.
In the embodiments of the present invention, a first convolutional neural network model is created from preset full-channel three-dimensional convolution kernels; the first convolutional neural network model is trained on a preset action recognition data set to obtain a second convolutional neural network model, where the second convolutional neural network model is the first convolutional neural network model after reaching a convergence state; at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model are replaced with single-channel three-dimensional convolution units to obtain a third convolutional neural network model; the third convolutional neural network model is trained using the preset action recognition data set and the second convolutional neural network model to obtain a target convolutional neural network model, where the target convolutional neural network model is the third convolutional neural network model after reaching a convergence state; and a video to be recognized is input into the target convolutional neural network model to obtain a target recognition result, thereby achieving the technical effect of improving the accuracy and efficiency of human action recognition.
Optionally, the first convolutional neural network in step S102 includes: an input layer, three-dimensional convolution layers, three-dimensional pooling layers, nonlinear layers, a fully connected layer and an output layer. The input layer size is [H, W, 3, F], where H and W are the height and width of the input video respectively, and F is the number of image frames contained in the video. The three-dimensional pooling layers use the max-pooling function.
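The three-dimensional max-pooling operation mentioned above can be sketched as follows. This is a minimal illustrative implementation, not code from the patent; the [H, W, C, F] layout and the function name are assumptions chosen to match the input-layer description.

```python
import numpy as np

def max_pool_3d(x, pool):
    """Naive 3D max pooling over a feature map laid out as [H, W, C, F].

    pool = (ph, pw, pf): window sizes over height, width and frames.
    Channels are pooled independently. Assumes each dimension divides
    evenly by its window size.
    """
    H, W, C, F = x.shape
    ph, pw, pf = pool
    out = np.empty((H // ph, W // pw, C, F // pf))
    for i in range(H // ph):
        for j in range(W // pw):
            for t in range(F // pf):
                # Take the max over the spatial-temporal window, per channel.
                window = x[i*ph:(i+1)*ph, j*pw:(j+1)*pw, :, t*pf:(t+1)*pf]
                out[i, j, :, t] = window.max(axis=(0, 1, 3))
    return out
```

A production model would use a framework's pooling layer; this loop form only makes the window arithmetic explicit.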
Optionally, in step S104, the videos in the preset action recognition data set may be segmented into non-overlapping video sequences of length F and input into the first convolutional neural network model, which is trained by gradient descent with a cross-entropy error objective function.
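The segmentation into non-overlapping length-F sequences described above can be sketched as a short helper. The function name and the choice to discard a trailing remainder shorter than F are assumptions; the patent does not say how incomplete tails are handled.

```python
def split_into_clips(frames, clip_len):
    """Cut a sequence of frames into non-overlapping clips of length
    clip_len, discarding any trailing remainder shorter than clip_len."""
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]
```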
Optionally, in step S106, replacing at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units includes: replacing the full-channel connection mode of the three-dimensional convolution kernels in the three-dimensional convolution layers with a single-channel connection mode to obtain single-channel three-dimensional convolution layers. Let the input feature map be X[h, w, c, f], the output feature map be Y[h1, w1, c, f1], and the convolution kernels be K[k, k, c, d], with stride 1 and bias vector b. Each output channel of the single-channel three-dimensional convolution is then computed only from the input channel with the same index:

Y[h1, w1, c, f1] = b[c] + Σ_{i=1..k} Σ_{j=1..k} Σ_{τ=1..d} K[i, j, c, τ] · X[h1+i−1, w1+j−1, c, f1+τ−1]

A batch normalization layer, a nonlinear layer, a residual branch, a superposition unit and a 1x1 grouped convolution layer are then added in cascade after the single-channel three-dimensional convolution layer.
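As a sketch of the single-channel connection mode described above — each output channel convolved only with the input channel of the same index, stride 1 — a naive NumPy implementation might look like this. The function name and the valid-convolution (no padding) boundary handling are assumptions.

```python
import numpy as np

def single_channel_conv3d(x, k, b):
    """Single-channel ("depthwise") 3D convolution: output channel c is
    computed only from input channel c, with stride 1 and no padding.

    x: input feature map, shape [H, W, C, F]
    k: per-channel kernels, shape [kh, kw, C, kf]
    b: bias vector, length C
    Returns y with shape [H-kh+1, W-kw+1, C, F-kf+1].
    """
    H, W, C, F = x.shape
    kh, kw, _, kf = k.shape
    y = np.empty((H - kh + 1, W - kw + 1, C, F - kf + 1))
    for c in range(C):
        for i in range(y.shape[0]):
            for j in range(y.shape[1]):
                for t in range(y.shape[3]):
                    # Correlate one channel's patch with that channel's kernel.
                    patch = x[i:i+kh, j:j+kw, c, t:t+kf]
                    y[i, j, c, t] = (patch * k[:, :, c, :]).sum() + b[c]
    return y
```

Because each channel is processed in isolation, the multiply count drops by a factor of C relative to a full-channel kernel of the same size, which is the computation saving the patent relies on.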
Optionally, in step S108, the videos in the preset action recognition data set may be segmented into non-overlapping video sequences of length F and input into both the second convolutional neural network model and the third convolutional neural network model, yielding soft labels and predicted output values respectively. The cross-entropy error between the predicted output values and the soft labels is computed, as is the cross-entropy error between the predicted output values and the true class labels of the videos; the two errors are summed with weights to obtain the overall error, and the model is trained by gradient descent.
Optionally, the human body video action recognition method in the embodiments of the present application constructs an action recognition convolutional neural network based on single-channel three-dimensional convolution units, which can exploit the temporal and spatial information in the input video simultaneously. Compared with traditional two-dimensional convolutional neural networks, it is better suited to processing video data and enhances action recognition accuracy.
Optionally, the single-channel three-dimensional convolution unit in the human body video action recognition method of the embodiments of the present application includes a single-channel three-dimensional convolution layer, a batch normalization layer, a nonlinear layer, a residual branch, a superposition unit and a 1x1 grouped convolution layer. Compared with the original three-dimensional convolution, the single-channel three-dimensional convolution reduces the amount of computation and the number of parameters, while the residual branch and the 1x1 grouped convolution layer effectively compensate for the accuracy loss caused by the parameter reduction, thereby solving the technical problems of low recognition accuracy and poor computational efficiency in existing action recognition technology.
Optionally, Fig. 2 is a schematic flowchart of another optional human body video action recognition method according to an embodiment of the present invention. As shown in Fig. 2, before step S104 is performed, that is, before the first convolutional neural network model is trained on the preset action recognition data set, the method may also include:
Step S202: obtain video data from a target video;
Step S204: segment the video data into multiple video clips, where each video clip contains only a single action category;
Step S206: add preset category labels to the video clips to obtain the preset action recognition data set.
Optionally, Fig. 3 is a schematic flowchart of yet another optional human body video action recognition method according to an embodiment of the present invention. As shown in Fig. 3, performing step S106, that is, replacing at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units, includes:
Step S302: replace the at least part of the full-channel three-dimensional convolution layers with single-channel three-dimensional convolution layers;
Step S304: add, in cascade after each single-channel three-dimensional convolution layer, a batch normalization layer, a nonlinear layer, a residual branch, a superposition unit and a 1x1 grouped convolution layer, to obtain the single-channel three-dimensional convolution unit.
Optionally, Fig. 4 is a schematic flowchart of still another optional human body video action recognition method according to an embodiment of the present invention. As shown in Fig. 4, performing step S110, that is, inputting the video to be recognized into the target convolutional neural network model to obtain the target recognition result, includes:
Step S402: segment the video to be recognized to obtain multiple second video sequences of the same preset length;
Step S404: input the multiple second video sequences into the target convolutional neural network to obtain preliminary recognition results corresponding to the multiple second video sequences;
Step S406: process the preliminary recognition results according to a preset data processing mode to obtain the target recognition result, where the preset data processing mode includes at least one of: taking the extremum of the preliminary recognition results, taking the average of the preliminary recognition results, and taking a weighted sum of the preliminary recognition results.
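The three preset data processing modes for fusing per-clip results (extremum, average, weighted sum) can be sketched in one small helper; the interface and names are assumptions.

```python
import numpy as np

def fuse_clip_results(clip_scores, mode="mean", weights=None):
    """Fuse per-clip recognition scores, shape [num_clips, num_classes],
    into a video-level prediction using one of the preset modes:
    "max" (extremum), "mean" (average) or "weighted" (weighted sum)."""
    scores = np.asarray(clip_scores, dtype=float)
    if mode == "max":
        fused = scores.max(axis=0)
    elif mode == "mean":
        fused = scores.mean(axis=0)
    elif mode == "weighted":
        fused = (np.asarray(weights, dtype=float)[:, None] * scores).sum(axis=0)
    else:
        raise ValueError("unknown mode: " + mode)
    # Return the winning class index along with the fused score vector.
    return int(np.argmax(fused)), fused
```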
In the embodiments of the present invention, a first convolutional neural network model is created from preset full-channel three-dimensional convolution kernels; the first convolutional neural network model is trained on a preset action recognition data set to obtain a second convolutional neural network model, where the second convolutional neural network model is the first convolutional neural network model after reaching a convergence state; at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model are replaced with single-channel three-dimensional convolution units to obtain a third convolutional neural network model; the third convolutional neural network model is trained using the preset action recognition data set and the second convolutional neural network model to obtain a target convolutional neural network model, where the target convolutional neural network model is the third convolutional neural network model after reaching a convergence state; and a video to be recognized is input into the target convolutional neural network model to obtain a target recognition result, thereby achieving the technical effect of improving the accuracy and efficiency of human action recognition.
Embodiment 2
According to another aspect of the embodiments of the present invention, a human body video action recognition device is also provided. As shown in Fig. 5, the device includes:
a creating unit 501, configured to create a first convolutional neural network model from preset full-channel three-dimensional convolution kernels; a first training unit 503, configured to train the first convolutional neural network model on a preset action recognition data set to obtain a second convolutional neural network model, where the second convolutional neural network model is the first convolutional neural network model after reaching a convergence state; a replacement unit 505, configured to replace at least part of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units to obtain a third convolutional neural network model; a second training unit 507, configured to train the third convolutional neural network model using the preset action recognition data set and the second convolutional neural network model to obtain a target convolutional neural network model, where the target convolutional neural network model is the third convolutional neural network model after reaching a convergence state; and a processing unit 509, configured to input a video to be recognized into the target convolutional neural network model to obtain a target recognition result.
Optionally, Fig. 6 is a schematic structural diagram of an optional first convolutional neural network model according to an embodiment of the present invention. As shown in Fig. 6, the first convolutional neural network model includes an input layer, twelve full-channel three-dimensional convolution layers, five three-dimensional pooling layers, a two-dimensional convolution layer, a fully connected layer and an output layer. Specifically, the parameters of each layer in the first convolutional neural network model may be as follows: the input layer size is [H, W, 3, F], where H and W are the height and width of the input video respectively, and F is the number of image frames contained in the video. Optionally, H is set to 128, W is set to 171 and F is set to 16. The kernel size of the twelve full-channel three-dimensional convolution layers is 3x3x3 with stride [1, 1, 1], and their channel numbers are 16, 32, 64, 64, 64, 128, 128, 128, 256, 256, 512 and 512 respectively. The pooling sizes of the five three-dimensional pooling layers are [2, 2, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2] and [2, 2, 3] respectively, using the max-pooling function.
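Under the stated layer parameters, the feature-map sizes can be traced through the five pooling stages. The sketch below assumes size-preserving padding in the 3x3x3 convolutions and ceil-mode pooling at borders — assumptions the patent does not state.

```python
import math

def trace_pool_sizes(h, w, f, pools):
    """Trace the [H, W, F] grid through a list of 3D pooling windows
    (ph, pw, pf), assuming the 3x3x3 convolutions preserve size and
    pooling rounds up at borders (ceil mode)."""
    sizes = [(h, w, f)]
    for ph, pw, pf in pools:
        h, w, f = math.ceil(h / ph), math.ceil(w / pw), math.ceil(f / pf)
        sizes.append((h, w, f))
    return sizes

# The five pooling layers from the text, for a 128 x 171 x 16-frame input.
POOLS = [(2, 2, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 3)]
```

Note how the first pool keeps the temporal axis intact ([2, 2, 1]) so early layers retain motion information, while the last pool ([2, 2, 3]) collapses what remains of the frame axis.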
Optionally, the device may also include: an acquiring unit, configured to obtain video data from a target video; a cutting unit, configured to segment the video data into multiple video clips, where each video clip contains only a single action category; and an adding unit, configured to add preset category labels to the video clips to obtain the preset action recognition data set.
Optionally, the replacement unit includes: a replacing subunit, configured to replace the at least part of the full-channel three-dimensional convolution layers with single-channel three-dimensional convolution layers; and an adding subunit, configured to add, in cascade after each single-channel three-dimensional convolution layer, a batch normalization layer, a nonlinear layer, a residual branch, a superposition unit and a 1x1 grouped convolution layer, to obtain the single-channel three-dimensional convolution unit.
Optionally, the processing unit includes: a splitting subunit, configured to segment the video to be recognized to obtain multiple second video sequences of the same preset length; an input subunit, configured to input the multiple second video sequences into the target convolutional neural network to obtain preliminary recognition results corresponding to the multiple second video sequences; and a processing subunit, configured to process the preliminary recognition results according to a preset data processing mode to obtain the target recognition result, where the preset data processing mode includes at least one of: taking the extremum of the preliminary recognition results, taking the average of the preliminary recognition results, and taking a weighted sum of the preliminary recognition results.
According to yet another aspect of the embodiments of the present invention, a storage medium is provided. The storage medium includes a stored program, where, when the program runs, the device on which the storage medium is located is controlled to perform the human body video action recognition method in Embodiment 1 of the present application.
According to yet another aspect of the embodiments of the present invention, a processor is provided. The processor is configured to run a program, where the program, when running, performs the human body video action recognition method of Embodiment 1 of the present application.
In the embodiments of the present invention, a first convolutional neural network model is created according to preset full-channel three-dimensional convolution kernels; the first convolutional neural network model is trained on a preset action-recognition dataset to obtain a second convolutional neural network model, where the second convolutional neural network model is the first convolutional neural network model having reached a convergence state; at least some of the full-channel three-dimensional convolution layers in the first convolutional neural network model are replaced with single-channel three-dimensional convolution units to obtain a third convolutional neural network model; the third convolutional neural network model is trained according to the preset action-recognition dataset and the second convolutional neural network model to obtain a target convolutional neural network model, where the target convolutional neural network model is the third convolutional neural network model having reached a convergence state; and the video to be recognized is input into the target convolutional neural network model to obtain a target recognition result. This improves the accuracy and efficiency of human action recognition, thereby solving the technical problem of low computational accuracy and poor computational efficiency in existing human action recognition methods.
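The disclosure trains the third model "according to the preset action-recognition dataset and the second convolutional neural network model," but does not spell out how the converged second model participates. One plausible reading, sketched below purely as an assumption, is teacher-student training in which the third model's loss mixes the ground-truth label with the second model's output distribution:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_student_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Hypothetical training objective for the third model: alpha weights the
    hard-label cross-entropy against a term matching the second (teacher)
    model's predictions. This interpretation is an assumption, not stated
    in the patent."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    hard = -math.log(p_s[label])                             # vs. ground truth
    soft = -sum(t * math.log(s) for t, s in zip(p_t, p_s))   # vs. teacher
    return alpha * hard + (1 - alpha) * soft
```

Under this reading, a third model that agrees with both the label and the converged second model incurs a much lower loss than one that contradicts them.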
The serial numbers of the embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in a given embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (10)
- 1. A human body video action recognition method, characterized by comprising:
creating a first convolutional neural network model according to preset full-channel three-dimensional convolution kernels;
training the first convolutional neural network model according to a preset action-recognition dataset to obtain a second convolutional neural network model, wherein the second convolutional neural network model is the first convolutional neural network model having reached a convergence state;
replacing at least some of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units to obtain a third convolutional neural network model;
training the third convolutional neural network model according to the preset action-recognition dataset and the second convolutional neural network model to obtain a target convolutional neural network model, wherein the target convolutional neural network model is the third convolutional neural network model having reached a convergence state;
inputting a video to be recognized into the target convolutional neural network model to obtain a target recognition result.
- 2. The method according to claim 1, characterized in that, before training the first convolutional neural network model according to the preset action-recognition dataset, the method further comprises:
acquiring the video data of a target video;
splitting the video data into multiple video clips, wherein each video clip contains only a single action category;
adding preset category labels to the video clips to obtain the preset action-recognition dataset.
- 3. The method according to claim 1, characterized in that replacing at least some of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units comprises:
replacing the at least some full-channel three-dimensional convolution layers with single-channel three-dimensional convolution layers;
adding, after the single-channel three-dimensional convolution layer, a batch normalization layer, a nonlinear layer, a residual branch, a superposition unit, and a 1x1 grouped convolution layer, to obtain the single-channel three-dimensional convolution unit.
- 4. The method according to claim 1, characterized in that inputting the video to be recognized into the target convolutional neural network model to obtain the target recognition result comprises:
splitting the video to be recognized into multiple second video sequences of the same preset length;
inputting the multiple second video sequences into the target convolutional neural network model to obtain preliminary recognition results corresponding to the multiple second video sequences;
processing the preliminary recognition results according to a preset data-processing mode to obtain the target recognition result, wherein the preset data-processing mode comprises at least one of: taking the extreme value of the preliminary recognition results, taking the average of the preliminary recognition results, and computing a weighted sum of the preliminary recognition results.
- 5. A human body video action recognition device, characterized by comprising:
a creating unit, configured to create a first convolutional neural network model according to preset full-channel three-dimensional convolution kernels;
a first training unit, configured to train the first convolutional neural network model according to a preset action-recognition dataset to obtain a second convolutional neural network model, wherein the second convolutional neural network model is the first convolutional neural network model having reached a convergence state;
a replacement unit, configured to replace at least some of the full-channel three-dimensional convolution layers in the first convolutional neural network model with single-channel three-dimensional convolution units to obtain a third convolutional neural network model;
a second training unit, configured to train the third convolutional neural network model according to the preset action-recognition dataset and the second convolutional neural network model to obtain a target convolutional neural network model, wherein the target convolutional neural network model is the third convolutional neural network model having reached a convergence state;
a processing unit, configured to input a video to be recognized into the target convolutional neural network model to obtain a target recognition result.
- 6. The device according to claim 5, characterized in that the device further comprises:
an acquiring unit, configured to acquire the video data of a target video;
a splitting unit, configured to split the video data into multiple video clips, wherein each video clip contains only a single action category;
an adding unit, configured to add preset category labels to the video clips to obtain the preset action-recognition dataset.
- 7. The device according to claim 5, characterized in that the replacement unit comprises:
a replacing subunit, configured to replace the at least some full-channel three-dimensional convolution layers with single-channel three-dimensional convolution layers;
an adding subunit, configured to add, after the single-channel three-dimensional convolution layer, a batch normalization layer, a nonlinear layer, a residual branch, a superposition unit, and a 1x1 grouped convolution layer, to obtain the single-channel three-dimensional convolution unit.
- 8. The device according to claim 5, characterized in that the processing unit comprises:
a splitting subunit, configured to split the video to be recognized into multiple second video sequences of the same preset length;
an input subunit, configured to input the multiple second video sequences into the target convolutional neural network model to obtain preliminary recognition results corresponding to the multiple second video sequences;
a processing subunit, configured to process the preliminary recognition results according to a preset data-processing mode to obtain the target recognition result, wherein the preset data-processing mode comprises at least one of: taking the extreme value of the preliminary recognition results, taking the average of the preliminary recognition results, and computing a weighted sum of the preliminary recognition results.
- 9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to perform the human body video action recognition method according to any one of claims 1 to 4.
- 10. A processor, characterized in that the processor is configured to run a program, wherein, when the program runs, the human body video action recognition method according to any one of claims 1 to 4 is performed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711154691.0A CN107808150A (en) | 2017-11-20 | 2017-11-20 | The recognition methods of human body video actions, device, storage medium and processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711154691.0A CN107808150A (en) | 2017-11-20 | 2017-11-20 | The recognition methods of human body video actions, device, storage medium and processor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107808150A true CN107808150A (en) | 2018-03-16 |
Family
ID=61580278
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711154691.0A Pending CN107808150A (en) | 2017-11-20 | 2017-11-20 | The recognition methods of human body video actions, device, storage medium and processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107808150A (en) |
- 2017-11-20: CN application CN201711154691.0A filed; published as CN107808150A/en; status: Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110182469A1 (en) * | 2010-01-28 | 2011-07-28 | Nec Laboratories America, Inc. | 3d convolutional neural networks for automatic human action recognition |
US20110222724A1 (en) * | 2010-03-15 | 2011-09-15 | Nec Laboratories America, Inc. | Systems and methods for determining personal characteristics |
CN104866810A (en) * | 2015-04-10 | 2015-08-26 | 北京工业大学 | Face recognition method of deep convolutional neural network |
WO2017164478A1 (en) * | 2016-03-25 | 2017-09-28 | 한국과학기술원 | Method and apparatus for recognizing micro-expressions through deep learning analysis of micro-facial dynamics |
CN106845549A (en) * | 2017-01-22 | 2017-06-13 | 珠海习悦信息技术有限公司 | A kind of method and device of the scene based on multi-task learning and target identification |
CN107316079A (en) * | 2017-08-08 | 2017-11-03 | 珠海习悦信息技术有限公司 | Processing method, device, storage medium and the processor of terminal convolutional neural networks |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110705331A (en) * | 2018-07-09 | 2020-01-17 | 中国科学技术大学 | Sign language identification method and device |
CN110705331B (en) * | 2018-07-09 | 2023-03-24 | 中国科学技术大学 | Sign language recognition method and device |
CN109063824A (en) * | 2018-07-25 | 2018-12-21 | 深圳市中悦科技有限公司 | Creation method, device, storage medium and the processor of deep layer Three dimensional convolution neural network |
CN109063824B (en) * | 2018-07-25 | 2023-04-07 | 深圳市中悦科技有限公司 | Deep three-dimensional convolutional neural network creation method and device, storage medium and processor |
CN109214282A (en) * | 2018-08-01 | 2019-01-15 | 中南民族大学 | A kind of three-dimension gesture critical point detection method and system neural network based |
CN111382758B (en) * | 2018-12-28 | 2023-12-26 | 杭州海康威视数字技术股份有限公司 | Training image classification model, image classification method, device, equipment and medium |
CN111382758A (en) * | 2018-12-28 | 2020-07-07 | 杭州海康威视数字技术股份有限公司 | Training image classification model, image classification method, device, equipment and medium |
CN109829398A (en) * | 2019-01-16 | 2019-05-31 | 北京航空航天大学 | A kind of object detection method in video based on Three dimensional convolution network |
CN109829398B (en) * | 2019-01-16 | 2020-03-31 | 北京航空航天大学 | Target detection method in video based on three-dimensional convolution network |
CN110070867A (en) * | 2019-04-26 | 2019-07-30 | 珠海普林芯驰科技有限公司 | Voice instruction recognition method, computer installation and computer readable storage medium |
CN110070867B (en) * | 2019-04-26 | 2022-03-11 | 珠海普林芯驰科技有限公司 | Speech instruction recognition method, computer device and computer-readable storage medium |
CN110287820B (en) * | 2019-06-06 | 2021-07-23 | 北京清微智能科技有限公司 | Behavior recognition method, device, equipment and medium based on LRCN network |
CN110287820A (en) * | 2019-06-06 | 2019-09-27 | 北京清微智能科技有限公司 | Activity recognition method, apparatus, equipment and medium based on LRCN network |
WO2020258498A1 (en) * | 2019-06-26 | 2020-12-30 | 平安科技(深圳)有限公司 | Football match behavior recognition method and apparatus based on deep learning, and terminal device |
CN111598026A (en) * | 2020-05-20 | 2020-08-28 | 广州市百果园信息技术有限公司 | Action recognition method, device, equipment and storage medium |
CN112257526A (en) * | 2020-10-10 | 2021-01-22 | 中国科学院深圳先进技术研究院 | Action identification method based on feature interactive learning and terminal equipment |
CN112997192A (en) * | 2021-02-03 | 2021-06-18 | 深圳市锐明技术股份有限公司 | Gesture recognition method and device, terminal device and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107808150A (en) | The recognition methods of human body video actions, device, storage medium and processor | |
CN106570477B (en) | Vehicle cab recognition model building method and model recognizing method based on deep learning | |
Zhang et al. | Ga-net: Guided aggregation net for end-to-end stereo matching | |
CN105095862B (en) | A kind of human motion recognition method based on depth convolution condition random field | |
CN106157319B (en) | The conspicuousness detection method in region and Pixel-level fusion based on convolutional neural networks | |
KR102302725B1 (en) | Room Layout Estimation Methods and Techniques | |
CN105657402B (en) | A kind of depth map restoration methods | |
CN106204499B (en) | Removing rain based on single image method based on convolutional neural networks | |
CN107292256A (en) | Depth convolved wavelets neutral net expression recognition method based on secondary task | |
CN106709461A (en) | Video based behavior recognition method and device | |
CN109376681A (en) | A kind of more people's Attitude estimation method and system | |
CN109871781A (en) | Dynamic gesture identification method and system based on multi-modal 3D convolutional neural networks | |
CN110210603A (en) | Counter model construction method, method of counting and the device of crowd | |
CN111626184B (en) | Crowd density estimation method and system | |
CN110781893B (en) | Feature map processing method, image processing method, device and storage medium | |
CN108062562A (en) | A kind of object recognition methods and device again | |
CN109272509A (en) | A kind of object detection method of consecutive image, device, equipment and storage medium | |
CN107292250A (en) | A kind of gait recognition method based on deep neural network | |
CN107784322A (en) | Abnormal deviation data examination method, device, storage medium and program product | |
CN108121931A (en) | two-dimensional code data processing method, device and mobile terminal | |
CN107633226A (en) | A kind of human action Tracking Recognition method and system | |
CN110222760A (en) | A kind of fast image processing method based on winograd algorithm | |
CN115100574A (en) | Action identification method and system based on fusion graph convolution network and Transformer network | |
CN104881640A (en) | Method and device for acquiring vectors | |
CN112651360B (en) | Skeleton action recognition method under small sample |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20220415 |