CN110321761A - Activity recognition method, terminal device and computer-readable storage medium - Google Patents
Activity recognition method, terminal device and computer-readable storage medium
- Publication number
- Publication number: CN110321761A
- Application number: CN201810272399.7A
- Authority
- CN
- China
- Prior art keywords
- network
- sub
- layer
- base net
- indicate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
This application relates to the field of neural network technology and provides an activity recognition method, a terminal device, and a computer-readable storage medium. The method includes: constructing a recognition model comprising at least two sub-networks and training each sub-network in the recognition model separately; after training, identifying a video sequence to be recognized with each sub-network to obtain an initial recognition result corresponding to each sub-network; and fusing the initial recognition results of the sub-networks to obtain the activity recognition result. This application can improve the robustness of activity recognition methods.
Description
Technical field
This application belongs to the field of neural network technology, and in particular relates to an activity recognition method, a terminal device, and a computer-readable storage medium.
Background
As an important research field, activity recognition has been widely applied to video surveillance, human-computer interaction, robot learning, and so on. Moreover, with the development of low-cost depth sensors, the three-dimensional coordinates of skeleton joints can now be recorded accurately, which greatly benefits the development of activity recognition.
At present, activity recognition based on 3D video sequences mainly relies on algorithms based on recurrent neural networks and algorithms based on two-dimensional convolutional neural networks. However, neither approach can accurately extract features from the time dimension and the spatial dimension simultaneously, so current activity recognition methods suffer from poor robustness.
Summary of the invention
In view of this, embodiments of the present application provide an activity recognition method, a terminal device, and a computer-readable storage medium, to address the poor robustness of current activity recognition methods.
A first aspect of the embodiments of the present application provides an activity recognition method, comprising:
constructing a recognition model that includes at least two sub-networks, and training each sub-network in the recognition model separately;
after training, identifying a video sequence to be recognized with each sub-network, to obtain an initial recognition result corresponding to each sub-network;
fusing the initial recognition results corresponding to the sub-networks to obtain an activity recognition result.
A second aspect of the embodiments of the present application provides a terminal device, comprising:
a construction and training module, configured to construct a recognition model that includes at least two sub-networks and to train each sub-network in the recognition model separately;
an initial-recognition-result module, configured to identify, after training, a video sequence to be recognized with each sub-network, to obtain an initial recognition result corresponding to each sub-network;
an activity-recognition-result module, configured to fuse the initial recognition results corresponding to the sub-networks to obtain an activity recognition result.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method provided by the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, implements the steps of the method provided by the first aspect.
A fifth aspect of the embodiments of the present application provides a computer program product comprising a computer program which, when executed by one or more processors, implements the steps of the method provided by the first aspect.
In the embodiments of the present application, a recognition model including at least two sub-networks is constructed and each sub-network in the recognition model is trained separately; after training, a video sequence to be recognized is identified by each sub-network to obtain an initial recognition result corresponding to each sub-network, and the initial recognition results corresponding to the sub-networks are fused to obtain the activity recognition result. Since different sub-networks extract different behavioral features, and the test results of all sub-networks are finally fused, the robustness of the activity recognition method can be improved.
Description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an activity recognition method provided by an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a base network provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a two-stream network provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a limb-separated network provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an attention network provided by an embodiment of the present application;
Fig. 6 is a schematic block diagram of a terminal device provided by an embodiment of the present application;
Fig. 7 is a schematic block diagram of a terminal device provided by another embodiment of the present application.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it will be clear to those skilled in the art that the present application may also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, wholes, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or sets thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in this specification and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be construed, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be construed, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
To illustrate the technical solutions described herein, specific embodiments are described below.
Fig. 1 is a schematic flowchart of the activity recognition method provided by an embodiment of the present application. As shown in the figure, the method may include the following steps.
Step S101: construct a recognition model that includes at least two sub-networks, and train each sub-network in the recognition model separately.
In the embodiments of the present application, the model for activity recognition may include four sub-networks, may include more than four sub-networks, and may of course include only one or more of these sub-networks. The sub-networks include: a two-stream network, a limb-separated network, an attention network, and a frame-difference network. The two-stream network, limb-separated network, attention network, and frame-difference network are all built from a base network.
As shown in Fig. 2, the base network includes:
a one-dimensional convolutional layer, at least two basic blocks, an average pooling layer, and a fully connected layer, connected in sequence;
two adjacent basic blocks are connected by a residual connection, which is expressed by the following formula:
x_(i+1) = F_layer(x_i) + x_i
where x_i is the input of the i-th basic block and x_(i+1) is the output of the i-th basic block, which is also the input of the (i+1)-th basic block.
Fig. 2 shows three basic blocks, Block1, Block2, and Block3: the one-dimensional convolutional layer precedes Block1, the average pooling layer (Avg pool) follows Block3, and the fully connected layer (Fc) follows the average pooling layer.
The basic block includes:
at least two convolutional layers; between the convolutional layers are a batch normalization layer, a nonlinear activation function, and a dropout layer, which are expressed by the following formula:
F_layer(x) = Dropout(ReLU(BN(f(x * w))))
where w denotes the weight of the convolution kernel, x denotes the input of the convolutional layer, f(x * w) denotes the output of the previous convolutional layer and also the input of the batch normalization layer, BN(f(x * w)) denotes the output of the batch normalization layer and also the input of the nonlinear activation function, ReLU(BN(f(x * w))) denotes the output of the nonlinear activation function and also the input of the dropout layer, F_layer(x) denotes the output of the dropout layer and also the input of the next convolutional layer, and * denotes the convolution operation.
The right part of Fig. 2 shows the structure of a basic block provided by an embodiment of the present application: Conv1D denotes the convolutional layer, batch normalization (also written BN) denotes the batch normalization layer, ReLU denotes the nonlinear activation function, and the last layer is the dropout layer; the convolutional layer, batch normalization layer, nonlinear activation function, and dropout layer are connected in sequence.
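The F_layer composition and the residual shortcut above can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation: it uses an inference-mode batch normalization without learned scale/shift, disables dropout, and picks arbitrary channel and sequence sizes.

```python
import numpy as np

def conv1d(x, w):
    """'Same'-padded 1-D convolution. x: (C_in, T), w: (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    y = np.zeros((c_out, t))
    for o in range(c_out):
        for i in range(c_in):
            for j in range(k):
                y[o] += w[o, i, j] * xp[i, j:j + t]
    return y

def batch_norm(x, eps=1e-5):
    # Normalize each channel over the time axis (inference-style, no scale/shift).
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def f_layer(x, w, drop_mask=None):
    """F_layer(x) = Dropout(ReLU(BN(f(x*w)))); dropout is the identity at test time."""
    y = np.maximum(batch_norm(conv1d(x, w)), 0.0)
    return y if drop_mask is None else y * drop_mask

def basic_block(x, w1, w2):
    """Two convolutional layers plus the residual shortcut x_(i+1) = F_layer(x_i) + x_i."""
    h = f_layer(x, w1)
    h = f_layer(h, w2)
    return h + x  # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))           # 8 channels, 16 time steps (illustrative)
w1 = rng.standard_normal((8, 8, 3)) * 0.1
w2 = rng.standard_normal((8, 8, 3)) * 0.1
out = basic_block(x, w1, w2)
print(out.shape)  # (8, 16) — same shape as the input, so the shortcut is valid
```

Because the output shape equals the input shape, stacking such blocks with residual connections is well defined; with all-zero kernels the block degenerates to the identity, which is exactly what the shortcut guarantees.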
The two-stream network includes:
a base network on the spatial stream with its corresponding softmax layer, and a base network on the temporal stream with its corresponding softmax layer.
In the embodiments of the present application, the object to be recognized is a video sequence in which every image contains the three-dimensional coordinates of the human skeleton, so two dimensions arise: the time dimension and the spatial dimension. The time dimension records the motion information of the human body, while the spatial dimension records the interaction information of the important joints of the human body.
Fig. 3 shows the two-stream network provided by an embodiment of the present application: joints denotes the joint points, time denotes the time series, spatial stream denotes the spatial stream, temporal stream denotes the temporal stream, and score fusion denotes the fusion of scores.
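The score fusion of the two streams can be sketched as below. The class logits are made-up numbers standing in for the outputs of the two base networks (in the patent, the spatial stream's kernel slides over joints and the temporal stream's over frames before each softmax):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-class logits produced by the two base networks.
spatial_logits = np.array([1.2, 0.3, -0.5])   # stream convolving over joints
temporal_logits = np.array([0.8, 0.1, -0.2])  # stream convolving over frames

spatial_score = softmax(spatial_logits)
temporal_score = softmax(temporal_logits)

# Score fusion: element-wise product of the two streams' class scores.
fused = spatial_score * temporal_score
pred = int(np.argmax(fused))
print(pred)  # 0 — the class favored by both streams
```

Multiplying the two softmax outputs rewards classes on which both streams agree, which is why the patent multiplies the spatial and temporal scores for end-to-end training.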
The limb-separated network includes:
five base networks, each with its corresponding softmax layer. The five base networks correspond to five parts of the human body: the trunk, the left arm, the right arm, the left leg, and the right leg.
Fig. 4 shows the limb-separated network provided by an embodiment of the present application, where j1 to j20 denote the three-dimensional coordinates of the 20 labeled joints of the human body, and T denotes the total number of images in the video sequence.
In practice, some human behaviors involve only part of the limbs: when waving, for example, only the arm participates in the motion while the other parts remain static. The human body can therefore be divided into five parts (other partitions are also possible in practice). The limb-separated network can capture subtle limb motion information while also learning which limbs contribute most to the behavior class; from this angle it can be regarded as a limb-based attention mechanism. In this network, the convolution kernel slides only along the time dimension.
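The limb separation step can be sketched as a pure indexing operation. The grouping of the 20 joints j1–j20 into five parts below is a hypothetical assignment for illustration; the actual indices depend on the skeleton layout of the dataset:

```python
import numpy as np

# Hypothetical grouping of the 20 joints (j1..j20, zero-indexed) into five parts.
PARTS = {
    "trunk":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def split_limbs(seq):
    """seq: (T, 20, 3) joint coordinates -> one (T, n_joints, 3) array per body part."""
    return {name: seq[:, idx, :] for name, idx in PARTS.items()}

rng = np.random.default_rng(0)
seq = rng.standard_normal((30, 20, 3))   # T = 30 frames, 20 joints, (x, y, z)
parts = split_limbs(seq)
print([p.shape for p in parts.values()])  # five (30, 4, 3) arrays
```

Each of the five arrays would then be fed to its own base network, and the five softmax scores multiplied, as described above.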
The attention network includes:
a base network fused with an attention mechanism, where the attention mechanism includes two fully connected layers and a softmax layer.
Fig. 5 shows the attention network provided by an embodiment of the present application. A video containing a behavior can be regarded as a set of temporally continuous frames, but not all frames are equally important, and some frames may even mislead the classification; likewise, different feature channels in the network contribute differently to behavior classification. An attention mechanism is therefore designed to learn the important frames and feature channels. As can be seen in Fig. 5, the attention mechanism is placed behind a convolutional layer or a basic block of the base network. Taking the basic-block case as an example, the input of the basic block passes through the basic block to obtain the corresponding feature values; at the same time, that input also passes through the first fully connected layer (FC layer1), an activation function, the second fully connected layer (FC layer2), and the softmax layer to obtain normalized weights, which indicate which frames or feature channels are important, similar to weights in a mathematical formula. The two results (the features and the weight corresponding to each feature) are then combined by multiply-and-accumulate (each feature is multiplied by its corresponding weight, and the products are summed) to obtain the input of the next layer.
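The two-FC-plus-softmax weighting just described can be sketched as follows. This is a toy illustration under stated assumptions: the feature dimensionality, the hidden width, and the reduction of each channel to a single scalar score are choices made here for brevity, not taken from the patent:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(x, W1, b1, W2, b2):
    """x: (C, D) — C frames/feature channels with D-dim features each.
    Two FC layers plus softmax produce one normalized weight per channel;
    the next layer's input is the weighted sum of the channels."""
    scores = np.empty(len(x))
    for c, xc in enumerate(x):
        y1 = np.maximum(W1 @ xc + b1, 0.0)   # first FC layer + activation
        scores[c] = W2 @ y1 + b2             # second FC layer -> scalar score
    alpha = softmax(scores)                  # normalized weights over channels
    out = (alpha[:, None] * x).sum(axis=0)   # multiply-and-accumulate
    return alpha, out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))              # 6 channels, 4-dim features (illustrative)
W1 = rng.standard_normal((8, 4)); b1 = rng.standard_normal(8)
W2 = rng.standard_normal(8);      b2 = 0.0
alpha, out = attention(x, W1, b1, W2, b2)
print(round(alpha.sum(), 6), out.shape)      # 1.0 (4,)
```

The softmax guarantees the weights are non-negative and sum to one, so the weighted sum is a convex combination of the channels, emphasizing the frames or channels the network has learned to consider important.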
The frame-difference network includes:
a base network and its corresponding softmax layer. For behaviors of different classes, the most discriminative feature is the motion information, but the original frame sequence cannot represent motion information directly.
Step S102: after training, identify the video sequence to be recognized with each sub-network, to obtain an initial recognition result corresponding to each sub-network.
In the embodiments of the present application, each sub-network in the recognition model is trained independently. For the two-stream network, the limb-separated network, the attention network, and the frame-difference network, the loss function used during training is the cross-entropy loss:
L = -Σ_(i=1..n) y_i · log(ŷ_i)
where y_i denotes the true class label, ŷ_i denotes the prediction, and n denotes the number of classes.
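A minimal sketch of the cross-entropy loss for a single one-hot sample (the probabilities are made-up numbers; a small epsilon guards the logarithm, which the patent text does not specify):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_i y_i * log(yhat_i) over the n classes (one-hot y_true)."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])   # true class label (one-hot)
y_pred = np.array([0.1, 0.8, 0.1])   # a sub-network's softmax prediction
loss = cross_entropy(y_true, y_pred)
print(round(loss, 4))                # 0.2231, i.e. -log(0.8)
```

For a one-hot label the sum reduces to the negative log-probability the network assigns to the true class, so the loss shrinks toward zero as that probability approaches one.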
The training of each sub-network is explained below.
For the two-stream network, the convolution kernel on the spatial stream slides along the spatial dimension, and the softmax layer corresponding to the base network on the spatial stream yields the score of the spatial stream; the convolution kernel on the temporal stream slides along the time dimension, and the softmax layer corresponding to the base network on the temporal stream yields the score of the temporal stream. The score of the spatial stream is multiplied by the score of the temporal stream for end-to-end training.
For the limb-separated network, the features of the five parts of the human body are fed into the corresponding five base networks to obtain the scores of the five base networks, and these scores are multiplied together for training.
For the attention network, the attention mechanism is placed behind a convolutional layer or a basic block. The attention mechanism is:
y_c1 = Activation(W_1 · x_ic + b_1)
y_c2 = W_2 · y_c1 + b_2
α_c = softmax(W_α · y_c2)
x_oc = F(x_ic)
O = Σ_c α_c · x_oc
where c denotes the c-th channel of the attention mechanism's input, x_ic denotes the output of the previous layer, W_1 denotes the weight of the first fully connected layer, b_1 denotes the bias of the first fully connected layer, y_c1 denotes the output of the first fully connected layer and also the input of the second, W_2 denotes the weight of the second fully connected layer, b_2 denotes the bias of the second fully connected layer, y_c2 denotes the output of the second fully connected layer and the input of the softmax layer, W_α denotes the learned attention weight, α_c denotes the normalized weight obtained by the softmax layer, x_oc denotes the output of the current layer of the attention mechanism, and O denotes the input of the next layer.
For the frame-difference network, the convolution kernel slides along the time dimension. The input of the frame-difference network is:
S_m = {M_2, M_3, …, M_t, …, M_N}
where S_m denotes the input of the frame-difference network, M_t = F_t − F_(t−1), F_t = {J_1, J_2, …, J_i, …, J_K} is the set of joint coordinates at frame t (K = 20 in Fig. 4), J_i = (x_i, y_i, z_i), and N denotes the total number of frames in the video sequence.
In other words, a three-dimensional joint coordinate can be written J = (x, y, z); the skeleton at frame t is F_t = {J_1, J_2, …, J_i, …, J_K}; and a video with N frames can be expressed as S = {F_1, F_2, …, F_t, …, F_N}. The motion information of the skeleton is computed by the formula M_t = F_t − F_(t−1), so the motion information of the whole video can be expressed as S_m = {M_2, M_3, …, M_t, …, M_N}.
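The frame-difference input S_m above can be computed in one vectorized step; the frame and joint counts below are illustrative:

```python
import numpy as np

def frame_difference(seq):
    """seq: (N, K, 3) — N frames of K joint coordinates J = (x, y, z).
    Returns S_m = {M_2, ..., M_N} with M_t = F_t - F_(t-1)."""
    return seq[1:] - seq[:-1]

rng = np.random.default_rng(0)
seq = rng.standard_normal((10, 20, 3))   # N = 10 frames, 20 joints
sm = frame_difference(seq)
print(sm.shape)  # (9, 20, 3): one difference per consecutive frame pair
```

The result has N−1 entries because M_t is only defined from the second frame onward, matching the index range {M_2, …, M_N} above.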
Step S103: fuse the initial recognition results corresponding to the sub-networks to obtain the activity recognition result.
In the embodiments of the present application, the fusion may be performed as follows. The initial recognition results corresponding to the sub-networks are fused through the formula
y_test = ∏_(i=1..n) y_i   or   y_test = Σ_(i=1..n) y_i
to obtain the activity recognition result, where y_test denotes the activity recognition result, y_i denotes the initial recognition result of the i-th sub-network, and n denotes the number of sub-networks included in the recognition model.
The product ∏ denotes multiplicative fusion, in which the initial recognition results obtained by the sub-networks are multiplied together; the sum Σ denotes additive fusion, in which the initial recognition results obtained by the sub-networks are added.
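The two fusion rules can be sketched as below; the four score vectors are made-up stand-ins for the softmax outputs of the four sub-networks:

```python
import numpy as np

def fuse(scores, mode="multiply"):
    """scores: (n, classes) — initial recognition results y_i of n sub-networks.
    Multiplicative fusion: y_test = prod_i y_i; additive fusion: y_test = sum_i y_i."""
    return scores.prod(axis=0) if mode == "multiply" else scores.sum(axis=0)

scores = np.array([
    [0.7, 0.2, 0.1],   # two-stream network (hypothetical output)
    [0.6, 0.3, 0.1],   # limb-separated network
    [0.5, 0.4, 0.1],   # attention network
    [0.8, 0.1, 0.1],   # frame-difference network
])
print(int(np.argmax(fuse(scores))), int(np.argmax(fuse(scores, "add"))))  # 0 0
```

Multiplicative fusion sharply penalizes a class that any single sub-network scores low, while additive fusion averages out individual mistakes; both yield the same prediction here, but they can differ when the sub-networks disagree.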
In the embodiments of the present application, a recognition model including at least two sub-networks is constructed and each sub-network in the recognition model is trained separately; after training, a video sequence to be recognized is identified by each sub-network to obtain an initial recognition result corresponding to each sub-network, and the initial recognition results corresponding to the sub-networks are fused to obtain the activity recognition result. Since different sub-networks extract different behavioral features, and the test results of all sub-networks are finally fused, the robustness of the activity recognition method can be improved.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present application.
Fig. 6 is a schematic block diagram of the terminal device provided by an embodiment of the present application. For ease of description, only the parts relevant to this embodiment are shown.
The terminal device 6 may be a software unit, a hardware unit, or a combined software/hardware unit built into an existing terminal device such as a mobile phone, a notebook, or a computer; it may also be integrated into such an existing terminal device as an independent component, or exist as an independent terminal device.
The terminal device 6 includes:
a construction and training module 61, configured to construct a recognition model that includes at least two sub-networks and to train each sub-network in the recognition model separately;
an initial-recognition-result module 62, configured to identify, after training, a video sequence to be recognized with each sub-network, to obtain an initial recognition result corresponding to each sub-network;
an activity-recognition-result module 63, configured to fuse the initial recognition results corresponding to the sub-networks to obtain an activity recognition result.
Optionally, the sub-networks include: a two-stream network, a limb-separated network, an attention network, and a frame-difference network.
Optionally, the two-stream network includes:
a base network on the spatial stream with its corresponding softmax layer, and a base network on the temporal stream with its corresponding softmax layer;
the limb-separated network includes:
five base networks, each with its corresponding softmax layer, the five base networks corresponding to five parts of the human body: the trunk, the left arm, the right arm, the left leg, and the right leg;
the attention network includes:
a base network fused with an attention mechanism, the attention mechanism including two fully connected layers and a softmax layer;
the frame-difference network includes:
a base network and its corresponding softmax layer.
Optionally, the base network includes:
a one-dimensional convolutional layer, at least two basic blocks, an average pooling layer, and a fully connected layer, connected in sequence;
two adjacent basic blocks are connected by a residual connection, which is expressed by the following formula:
x_(i+1) = F_layer(x_i) + x_i
where x_i is the input of the i-th basic block and x_(i+1) is the output of the i-th basic block, which is also the input of the (i+1)-th basic block.
Optionally, the basic block includes:
at least two convolutional layers; between the convolutional layers are a batch normalization layer, a nonlinear activation function, and a dropout layer, which are expressed by the following formula:
F_layer(x) = Dropout(ReLU(BN(f(x * w))))
where w denotes the weight of the convolution kernel, x denotes the input of the convolutional layer, f(x * w) denotes the output of the previous convolutional layer and also the input of the batch normalization layer, BN(f(x * w)) denotes the output of the batch normalization layer and also the input of the nonlinear activation function, ReLU(BN(f(x * w))) denotes the output of the nonlinear activation function and also the input of the dropout layer, and F_layer(x) denotes the output of the dropout layer and also the input of the next convolutional layer.
Optionally, the construction and training module 61 includes:
a two-stream-network training unit, configured so that, based on the two-stream network, the convolution kernel on the spatial stream slides along the spatial dimension and the softmax layer corresponding to the base network on the spatial stream yields the score of the spatial stream, the convolution kernel on the temporal stream slides along the time dimension and the softmax layer corresponding to the base network on the temporal stream yields the score of the temporal stream, and the score of the spatial stream is multiplied by the score of the temporal stream for end-to-end training;
a limb-separated-network training unit, configured to feed, based on the limb-separated network, the features of the five parts of the human body into the corresponding five base networks, obtain the scores of the five base networks, and multiply these scores together for training;
an attention-network training unit, configured so that, based on the attention network, the attention mechanism is placed behind a convolutional layer or a basic block, the attention mechanism being:
y_c1 = Activation(W_1 · x_ic + b_1)
y_c2 = W_2 · y_c1 + b_2
α_c = softmax(W_α · y_c2)
x_oc = F(x_ic)
O = Σ_c α_c · x_oc
where c denotes the c-th channel of the attention mechanism's input, x_ic denotes the output of the previous layer, W_1 denotes the weight of the first fully connected layer, b_1 denotes the bias of the first fully connected layer, y_c1 denotes the output of the first fully connected layer and also the input of the second, W_2 denotes the weight of the second fully connected layer, b_2 denotes the bias of the second fully connected layer, y_c2 denotes the output of the second fully connected layer and the input of the softmax layer, W_α denotes the learned attention weight, α_c denotes the normalized weight obtained by the softmax layer, x_oc denotes the output of the current layer of the attention mechanism, and O denotes the input of the next layer;
a frame-difference-network training unit, configured so that, based on the frame-difference network, the convolution kernel slides along the time dimension and the input of the frame-difference network is:
S_m = {M_2, M_3, …, M_t, …, M_N}
where S_m denotes the input of the frame-difference network, M_t = F_t − F_(t−1), F_t = {J_1, J_2, …, J_i, …, J_K} is the set of joint coordinates at frame t, J_i = (x_i, y_i, z_i), and N denotes the total number of frames in the video sequence.
For the two-stream network, the limb-separated network, the attention network, and the frame-difference network, the loss function used during training is the cross-entropy loss:
L = -Σ_(i=1..n) y_i · log(ŷ_i)
where y_i denotes the true class label, ŷ_i denotes the prediction, and n denotes the number of classes.
Optionally, the activity-recognition-result module 63 is further configured to:
fuse the initial recognition results corresponding to the sub-networks through the formula
y_test = ∏_(i=1..n) y_i   or   y_test = Σ_(i=1..n) y_i
to obtain the activity recognition result, where y_test denotes the activity recognition result, y_i denotes the initial recognition result of the i-th sub-network, and n denotes the number of sub-networks included in the recognition model.
It is clear to those skilled in the art that, for convenience and brevity of description, the division into the functional units and modules above is only illustrative. In practical applications, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the terminal device may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, may exist physically separately, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other and are not intended to limit the protection scope of this application. For the specific working processes of the units and modules in the above apparatus, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
Fig. 7 is a schematic block diagram of the terminal device provided by another embodiment of the present application. As shown in Fig. 7, the terminal device 7 of this embodiment includes one or more processors 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processors 70. When executing the computer program 72, the processor 70 implements the steps in the above embodiments of the activity recognition method, such as steps S101 to S103 shown in Fig. 1; alternatively, when executing the computer program 72, the processor 70 implements the functions of the modules/units in the above terminal-device embodiments, such as the functions of modules 61 to 63 shown in Fig. 6.
Illustratively, the computer program 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 72 in the terminal device 7. For example, the computer program 72 may be divided into a construction and training module, an initial-recognition-result module, and an activity-recognition-result module.
The building training module, for constructing the identification model including at least two sub-networks, and to the identification mould
Each sub-network in type is trained respectively;
The initial recognition result obtains module, for being identified by each sub-network to be identified after training
Video sequence obtains initial recognition result corresponding with each sub-network;
The Activity recognition result obtains module, will obtain behavior after the corresponding initial recognition result fusion of each sub-network
Recognition result.
Other modules or unit can refer to the description in embodiment shown in fig. 6, and details are not described herein.
The terminal device includes, but is not limited to, the processor 70 and the memory 71. Those skilled in the art will understand that Fig. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7; it may include more or fewer components than shown, combine certain components, or include different components. For example, the terminal device may further include an input device, an output device, a network access device, a bus, and the like.
The processor 70 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the terminal device 7. Further, the memory 71 may include both the internal storage unit and the external storage device of the terminal device 7. The memory 71 is configured to store the computer program and other programs and data required by the terminal device. The memory 71 may also be configured to temporarily store data that has been output or is to be output.
In the above embodiments, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
In the embodiments provided in this application, it should be understood that the disclosed terminal device and method may be implemented in other ways. For example, the terminal device embodiments described above are merely illustrative; the division of the modules or units is only a logical function division, and there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application may also be completed by a computer program instructing relevant hardware. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments described above can be implemented. The computer program includes computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, a computer-readable medium does not include electrical carrier signals or telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.
Claims (10)
1. An activity recognition method, characterized by comprising:
constructing a recognition model including at least two sub-networks, and training each sub-network in the recognition model separately;
after training, recognizing a video sequence to be recognized through each sub-network to obtain an initial recognition result corresponding to each sub-network;
fusing the initial recognition results corresponding to the sub-networks to obtain an activity recognition result.
2. The activity recognition method according to claim 1, characterized in that the sub-networks include: a two-stream network, a limb separation network, an attention network, and a frame-difference network.
3. The activity recognition method according to claim 2, characterized in that:
the two-stream network includes a base network on a spatial stream with a softmax layer corresponding to that base network, and a base network on a temporal stream with a softmax layer corresponding to that base network;
the limb separation network includes five base networks and five softmax layers corresponding to the respective base networks, the five base networks corresponding respectively to five parts of the human body: the trunk, the left arm, the right arm, the left leg, and the right leg;
the attention network includes a base network fused with an attention mechanism, the attention mechanism including two fully connected layers and a softmax layer;
the frame-difference network includes a base network and a softmax layer corresponding to the base network.
4. The activity recognition method according to claim 3, characterized in that the base network includes:
a one-dimensional convolutional layer, at least two basic blocks, an average pooling layer, and a fully connected layer connected in sequence;
two adjacent basic blocks are connected by a residual connection, the residual connection being expressed by the following formula:
x_{i+1} = F_layer(x_i) + x_i
where x_i is the input of the i-th basic block, and x_{i+1} is the output of the i-th basic block as well as the input of the (i+1)-th basic block.
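The residual connection of claim 4 can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation: `basic_block` stands in for the claimed convolution/batch-normalization/ReLU/dropout stack and is reduced here to a toy linear map.

```python
import numpy as np

def basic_block(x):
    # Stand-in for F_layer: in the claim this is a stack of convolution,
    # batch normalization, ReLU, and dropout layers.
    return 0.5 * x

def residual_step(x):
    # x_{i+1} = F_layer(x_i) + x_i  (claim 4's residual connection)
    return basic_block(x) + x

x0 = np.ones(4)
x1 = residual_step(x0)  # each element: 0.5 * 1 + 1 = 1.5
```

The skip path adds the block's input to its output, so the block only has to learn a residual correction rather than the full mapping.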
5. The activity recognition method according to claim 4, characterized in that a basic block includes:
at least two convolutional layers, with a batch normalization layer, a nonlinear activation function, and a dropout layer arranged between convolutional layers; the batch normalization layer, the nonlinear activation function, and the dropout layer are expressed by the following formula:
F_layer(x) = Dropout(ReLU(BN(f(x*w))))
where w denotes the weight of the convolution kernel, x denotes the input of a convolutional layer, f(x*w) denotes the output of the previous convolutional layer and also the input of the batch normalization layer, BN(f(x*w)) denotes the output of the batch normalization layer and also the input of the nonlinear activation function, ReLU(BN(f(x*w))) denotes the output of the nonlinear activation function and also the input of the dropout layer, and F_layer(x) denotes the output of the dropout layer and also the input of the next convolutional layer.
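The layer ordering of claim 5 can be sketched in numpy as follows. This is a simplified sketch, not the patent's layers: the convolution f(x*w) is reduced to a matrix product, BN has no learned scale/shift, and dropout is shown in inference mode (identity).

```python
import numpy as np

def bn(x, eps=1e-5):
    # Batch normalization over the batch axis, without learned scale/shift.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def dropout(x, p=0.5, training=False, rng=None):
    # Identity at inference; randomly zeroes and rescales activations in training.
    if not training:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def f_layer(x, w):
    # F_layer(x) = Dropout(ReLU(BN(f(x*w)))); f(x*w) simplified to x @ w.
    return dropout(relu(bn(x @ w)))

x = np.random.default_rng(1).normal(size=(8, 3))   # batch of 8, 3 features
w = np.random.default_rng(2).normal(size=(3, 2))   # toy "kernel"
out = f_layer(x, w)                                # shape (8, 2), non-negative
```

ReLU is applied after normalization, so the output of each F_layer is non-negative before being passed to the next convolutional layer.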
6. The activity recognition method according to claim 3, characterized in that training each sub-network in the recognition model separately comprises:
for the two-stream network, sliding the convolution kernel on the spatial stream along the spatial dimension and obtaining a score on the spatial stream through the softmax layer corresponding to the base network on the spatial stream, sliding the convolution kernel on the temporal stream along the temporal dimension and obtaining a score on the temporal stream through the softmax layer corresponding to the base network on the temporal stream, and multiplying the score on the spatial stream by the score on the temporal stream for end-to-end training;
for the limb separation network, feeding the features of the five parts of the human body into the corresponding five base networks respectively, obtaining the scores corresponding to the five base networks, and multiplying those scores for training;
for the attention network, arranging the attention mechanism after a convolutional layer or a basic block, the attention mechanism being:
y_c1 = Activation(W_1 * x_ic + b_1)
y_c2 = W_2 * y_c1 + b_2
x_oc = F(x_ic)
where c denotes the c-th channel of the attention mechanism's input, x_ic denotes the output of the layer preceding the attention mechanism, W_1 denotes the weight of the first fully connected layer, b_1 denotes the bias of the first fully connected layer, y_c1 denotes the output of the first fully connected layer and the input of the second fully connected layer, W_2 denotes the weight of the second fully connected layer, b_2 denotes the bias of the second fully connected layer, y_c2 denotes the output of the second fully connected layer and the input of the softmax layer, W_α denotes the learned attention weight, α_c denotes the normalized weight obtained by the softmax layer, and x_oc denotes the output of the attention mechanism and the input of the layer following the attention mechanism;
for the frame-difference network, sliding the convolution kernel along the temporal dimension, the input of the frame-difference network being:
S_m = {M_2, M_3, …, M_t, …, M_N}
where S_m denotes the input of the frame-difference network, M_t = F_t − F_{t−1}, F_t = {J_1, J_2, …, J_i, …, J_t}, J_i = (x_i, y_i, z_i), and N denotes the total number of frames in the video sequence;
for the two-stream network, the limb separation network, the attention network, and the frame-difference network, the loss function used in training is the cross-entropy loss function:
Loss = −Σ_{i=1}^{n} y_i · log(ŷ_i)
where y_i denotes the true class label, ŷ_i denotes the predicted label, and n denotes the number of classes.
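The frame-difference input of claim 6 (M_t = F_t − F_{t−1} over N frames of 3-D joints) and the cross-entropy loss can be sketched as follows. The joint count, frame count, and score values are arbitrary illustrations, not values from the patent.

```python
import numpy as np

def frame_difference_input(frames):
    # frames: (N, J, 3) array of N skeleton frames, each with J joints
    # J_i = (x_i, y_i, z_i).  Returns S_m = {M_2, ..., M_N}, M_t = F_t - F_{t-1}.
    return frames[1:] - frames[:-1]

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Loss = -sum_i y_i * log(y_hat_i) over n classes (one-hot y_true).
    return -np.sum(y_true * np.log(y_pred + eps))

frames = np.arange(5 * 4 * 3, dtype=float).reshape(5, 4, 3)  # N=5 frames, J=4 joints
s_m = frame_difference_input(frames)                         # shape (4, 4, 3)

y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.2, 0.7, 0.1])
loss = cross_entropy(y_true, y_pred)  # -log(0.7), about 0.357
```

Differencing consecutive frames removes static pose information and leaves per-joint motion, which is what the frame-difference sub-network's temporal convolution slides over.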
7. The activity recognition method according to any one of claims 1 to 6, characterized in that fusing the initial recognition results corresponding to the sub-networks to obtain the activity recognition result comprises:
fusing the initial recognition results corresponding to the sub-networks according to either of the two formulas to obtain the activity recognition result,
where y_test denotes the activity recognition result, y_i denotes the result of the i-th sub-network, and n denotes that the recognition model includes n sub-networks in total.
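The two fusion formulas referenced in claim 7 did not survive extraction from the source. Averaging and element-wise multiplication of the sub-network score vectors are two common fusion rules consistent with the surrounding definitions (y_test the fused result, y_i the i-th sub-network's scores, n sub-networks), so both are sketched below as assumptions, not as the patent's exact formulas.

```python
import numpy as np

def fuse_average(scores):
    # Assumed rule 1: y_test = (1/n) * sum_i y_i
    return np.mean(scores, axis=0)

def fuse_product(scores):
    # Assumed rule 2: y_test = prod_i y_i (element-wise)
    return np.prod(scores, axis=0)

# Softmax score vectors from n = 2 sub-networks over 3 classes (toy values).
scores = np.array([[0.2, 0.5, 0.3],
                   [0.1, 0.8, 0.1]])
avg = fuse_average(scores)    # [0.15, 0.65, 0.20]
prod = fuse_product(scores)   # [0.02, 0.40, 0.03]
label = int(np.argmax(avg))   # class 1 under either rule here
```

Averaging is more forgiving of one weak sub-network, while the product rule suppresses any class that a single sub-network scores near zero.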
8. A terminal device, characterized by comprising:
a construction-and-training module, configured to construct a recognition model including at least two sub-networks and to train each sub-network in the recognition model separately;
an initial-recognition-result obtaining module, configured to recognize, after training, a video sequence to be recognized through each sub-network to obtain an initial recognition result corresponding to each sub-network;
an activity-recognition-result obtaining module, configured to fuse the initial recognition results corresponding to the sub-networks to obtain an activity recognition result.
9. A terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and when the computer program is executed by one or more processors, the steps of the method according to any one of claims 1 to 7 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810272399.7A CN110321761B (en) | 2018-03-29 | 2018-03-29 | Behavior identification method, terminal equipment and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810272399.7A CN110321761B (en) | 2018-03-29 | 2018-03-29 | Behavior identification method, terminal equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110321761A true CN110321761A (en) | 2019-10-11 |
CN110321761B CN110321761B (en) | 2022-02-11 |
Family
ID=68110943
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810272399.7A Active CN110321761B (en) | 2018-03-29 | 2018-03-29 | Behavior identification method, terminal equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321761B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110222653A (en) * | 2019-06-11 | 2019-09-10 | 中国矿业大学(北京) | A kind of skeleton data Activity recognition method based on figure convolutional neural networks |
CN111161306A (en) * | 2019-12-31 | 2020-05-15 | 北京工业大学 | Video target segmentation method based on motion attention |
CN111310707A (en) * | 2020-02-28 | 2020-06-19 | 山东大学 | Skeleton-based method and system for recognizing attention network actions |
CN111539290A (en) * | 2020-04-16 | 2020-08-14 | 咪咕文化科技有限公司 | Video motion recognition method and device, electronic equipment and storage medium |
CN112597824A (en) * | 2020-12-07 | 2021-04-02 | 深延科技(北京)有限公司 | Behavior recognition method and device, electronic equipment and storage medium |
CN112926453A (en) * | 2021-02-26 | 2021-06-08 | 电子科技大学 | Examination room cheating behavior analysis method based on motion feature enhancement and long-term time sequence modeling |
WO2023147778A1 (en) * | 2022-02-07 | 2023-08-10 | 北京字跳网络技术有限公司 | Action recognition method and apparatus, and electronic device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8462996B2 (en) * | 2008-05-19 | 2013-06-11 | Videomining Corporation | Method and system for measuring human response to visual stimulus based on changes in facial expression |
CN106599789A (en) * | 2016-07-29 | 2017-04-26 | 北京市商汤科技开发有限公司 | Video class identification method and device, data processing device and electronic device |
US20170220854A1 (en) * | 2016-01-29 | 2017-08-03 | Conduent Business Services, Llc | Temporal fusion of multimodal data from multiple data acquisition systems to automatically recognize and classify an action |
CN107025420A (en) * | 2016-01-29 | 2017-08-08 | 中兴通讯股份有限公司 | The method and apparatus of Human bodys' response in video |
CN107679522A (en) * | 2017-10-31 | 2018-02-09 | 内江师范学院 | Action identification method based on multithread LSTM |
CN109522874A (en) * | 2018-12-11 | 2019-03-26 | 中国科学院深圳先进技术研究院 | Human motion recognition method, device, terminal device and storage medium |
- 2018
  - 2018-03-29 CN CN201810272399.7A patent/CN110321761B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8462996B2 (en) * | 2008-05-19 | 2013-06-11 | Videomining Corporation | Method and system for measuring human response to visual stimulus based on changes in facial expression |
US20170220854A1 (en) * | 2016-01-29 | 2017-08-03 | Conduent Business Services, Llc | Temporal fusion of multimodal data from multiple data acquisition systems to automatically recognize and classify an action |
CN107025420A (en) * | 2016-01-29 | 2017-08-08 | 中兴通讯股份有限公司 | The method and apparatus of Human bodys' response in video |
CN106599789A (en) * | 2016-07-29 | 2017-04-26 | 北京市商汤科技开发有限公司 | Video class identification method and device, data processing device and electronic device |
CN107679522A (en) * | 2017-10-31 | 2018-02-09 | 内江师范学院 | Action identification method based on multithread LSTM |
CN109522874A (en) * | 2018-12-11 | 2019-03-26 | 中国科学院深圳先进技术研究院 | Human motion recognition method, device, terminal device and storage medium |
Non-Patent Citations (3)
Title |
---|
INWOONG LEE ET AL.: "Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 * |
YANGYANG XU ET AL.: "NTU RGB+D: A large scale dataset for 3-D human activity analysis", 《2017 3RD IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATIONS (ICCC)》 * |
LI YANDI ET AL.: "Human behavior recognition algorithm based on decision-level fusion of spatial-temporal domain features", 《光学学报》 (Acta Optica Sinica) * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110222653A (en) * | 2019-06-11 | 2019-09-10 | 中国矿业大学(北京) | A kind of skeleton data Activity recognition method based on figure convolutional neural networks |
CN110222653B (en) * | 2019-06-11 | 2020-06-16 | 中国矿业大学(北京) | Skeleton data behavior identification method based on graph convolution neural network |
CN111161306A (en) * | 2019-12-31 | 2020-05-15 | 北京工业大学 | Video target segmentation method based on motion attention |
CN111161306B (en) * | 2019-12-31 | 2023-06-02 | 北京工业大学 | Video target segmentation method based on motion attention |
CN111310707A (en) * | 2020-02-28 | 2020-06-19 | 山东大学 | Skeleton-based method and system for recognizing attention network actions |
CN111310707B (en) * | 2020-02-28 | 2023-06-20 | 山东大学 | Bone-based graph annotation meaning network action recognition method and system |
CN111539290A (en) * | 2020-04-16 | 2020-08-14 | 咪咕文化科技有限公司 | Video motion recognition method and device, electronic equipment and storage medium |
CN111539290B (en) * | 2020-04-16 | 2023-10-20 | 咪咕文化科技有限公司 | Video motion recognition method and device, electronic equipment and storage medium |
CN112597824A (en) * | 2020-12-07 | 2021-04-02 | 深延科技(北京)有限公司 | Behavior recognition method and device, electronic equipment and storage medium |
CN112926453A (en) * | 2021-02-26 | 2021-06-08 | 电子科技大学 | Examination room cheating behavior analysis method based on motion feature enhancement and long-term time sequence modeling |
CN112926453B (en) * | 2021-02-26 | 2022-08-05 | 电子科技大学 | Examination room cheating behavior analysis method based on motion feature enhancement and long-term time sequence modeling |
WO2023147778A1 (en) * | 2022-02-07 | 2023-08-10 | 北京字跳网络技术有限公司 | Action recognition method and apparatus, and electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110321761B (en) | 2022-02-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321761A (en) | A kind of Activity recognition method, terminal device and computer readable storage medium | |
CN110020620A (en) | Face identification method, device and equipment under a kind of big posture | |
US11710041B2 (en) | Feature map and weight selection method and accelerating device | |
Sun et al. | Lattice long short-term memory for human action recognition | |
CN110321910A (en) | Feature extracting method, device and equipment towards cloud | |
CN108830211A (en) | Face identification method and Related product based on deep learning | |
WO2020253852A1 (en) | Image identification method and device, identification model training method and device, and storage medium | |
WO2021057056A1 (en) | Neural architecture search method, image processing method and device, and storage medium | |
WO2021248859A1 (en) | Video classification method and apparatus, and device, and computer readable storage medium | |
CN109064428A (en) | A kind of image denoising processing method, terminal device and computer readable storage medium | |
CN108765278A (en) | A kind of image processing method, mobile terminal and computer readable storage medium | |
CN105631398A (en) | Method and apparatus for recognizing object, and method and apparatus for training recognizer | |
CN110263909A (en) | Image-recognizing method and device | |
CN110222718B (en) | Image processing method and device | |
CN108510982A (en) | Audio event detection method, device and computer readable storage medium | |
CN109117773A (en) | A kind of characteristics of image point detecting method, terminal device and storage medium | |
CN109584992A (en) | Exchange method, device, server, storage medium and sand play therapy system | |
CN112035671B (en) | State detection method and device, computer equipment and storage medium | |
CN111047022A (en) | Computing device and related product | |
Gao et al. | Natural scene recognition based on convolutional neural networks and deep Boltzmannn machines | |
CN110633624A (en) | Machine vision human body abnormal behavior identification method based on multi-feature fusion | |
CN110046941A (en) | A kind of face identification method, system and electronic equipment and storage medium | |
CN113191479A (en) | Method, system, node and storage medium for joint learning | |
CN109086871A (en) | Training method, device, electronic equipment and the computer-readable medium of neural network | |
WO2022183805A1 (en) | Video classification method, apparatus, and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||