CN107423721A

CN107423721A - Interactive action detection method, device, storage medium and processor

Info

Publication number: CN107423721A
Application number: CN201710670075.4A
Authority: CN
Inventors: 王志鹏; 周文明
Original assignee: Zhuhai Xi Yue Information Technology Co Ltd
Current assignee: Zhuhai Xi Yue Information Technology Co Ltd
Priority date: 2017-08-08
Filing date: 2017-08-08
Publication date: 2017-12-01

Abstract

The invention discloses a kind of interactive action detection method, device, storage medium and processor.Wherein, this method includes：Target Photo is detected according to default multilayer convolutional neural networks, obtains classification corresponding at least one destination object and frame coordinate present in Target Photo；Determine that confidence level highest destination object is target detection object at least one destination object；Classification corresponding to target detection object and frame coordinate were inputted to the default multistage and return convolutional neural networks, and then the position that human synovial position is carried out to target detection object is detected, and obtains the position coordinates at the human synovial position in target detection object；Position coordinates is normalized, and then the position coordinates after completing normalized detected according to default multilayer recurrent neural network, obtains the class label of Target Photo.The present invention solves interactive action accuracy in detection present in prior art and less efficient technical problem.

Description

Interactive action detection method, device, storage medium and processor

Technical field

The present invention relates to field of human-computer interaction, in particular to a kind of interactive action detection method, device, deposits Storage media and processor.

Background technology

Interactive action detects and classification is a basic technology of man-machine interaction, for smart home, safety-protection system It is significant in the scene interacted with mankind such as patient cares with electronic equipment.Such as medical industry, in gesture identification Under help, deaf-mutes patient can nurse not when, demand is communicated to by hospital by a camera and simple gesture, solve The problems such as independent electronics of having determined configuration costliness and patient will not use computer.

It is the method based on two-stream convolutional neural networks to be currently used in human action to know method for distinguishing, when it will contain Between information optical flow field and RBG images input convolutional neural networks simultaneously and go forward side by side row information fusion, the class of whole section of video of final output Distinguishing label.Because the temporal information that optical flow field contains is confined to several frames nearby, the accuracy of result is limited, and because output is The class label of one section of video calculates a large amount of duplicate messages, limits system, it is necessary to enter line slip to time window frame by frame Efficiency and real-time.To sum up, there is the degree of accuracy and less efficient technical problem in interactive action detection of the prior art.

For it is above-mentioned the problem of, not yet propose effective solution at present.

The content of the invention

The embodiments of the invention provide a kind of interactive action detection method, device, storage medium and processor, so that Solve interactive action accuracy in detection and less efficient technical problem present in prior art less.

One side according to embodiments of the present invention, there is provided a kind of interactive action detection method, this method include： Target Photo is detected according to default multilayer convolutional neural networks, obtains at least one mesh present in above-mentioned Target Photo Mark frame coordinate corresponding to classification corresponding to object and above-mentioned at least one destination object；Determine above-mentioned at least one target pair As the middle above-mentioned destination object of confidence level highest is target detection object；By above-mentioned classification corresponding to above-mentioned target detection object and Above-mentioned frame coordinate corresponding to above-mentioned target detection object inputs to the default multistage and returns convolutional neural networks, and then according to upper State the position that default multistage recurrence convolutional neural networks carry out above-mentioned target detection object at human synovial position to detect, obtain The position coordinates at the above-mentioned human synovial position in above-mentioned target detection object；Above-mentioned position coordinates is normalized, And then the above-mentioned position coordinates after completing above-mentioned normalized is detected according to default multilayer recurrent neural network, obtain To the testing result of above-mentioned Target Photo, wherein, including at least the class label of above-mentioned Target Photo in above-mentioned testing result.

Further, in the default multilayer recurrent neural network of basis to the above-mentioned position after completing above-mentioned normalized Before coordinate is detected, the above method also includes：According to default loss function and preset algorithm to above-mentioned default multilayer recurrence Neutral net is trained, wherein, above-mentioned default loss function is classification function, and above-mentioned preset algorithm is based on time scale Back-propagation algorithm.

Further, above-mentioned basis presets multilayer recurrent neural network to the upper rheme after completing above-mentioned normalized Put coordinate to be detected, obtaining the testing result of above-mentioned Target Photo includes：According to above-mentioned default multilayer recurrent neural network pair The above-mentioned position coordinates after above-mentioned normalized is completed to be detected, obtain multiple classifications corresponding to above-mentioned Target Photo with And multiple activation values corresponding to each above-mentioned classification in above-mentioned multiple classifications；Each above-mentioned classification is obtained in preset time window The average value of corresponding above-mentioned multiple activation values；Above-mentioned classification corresponding to maximum average value in multiple above-mentioned average values is determined For the class label of above-mentioned Target Photo, so as to obtain above-mentioned testing result.

Further, before being detected according to default multilayer convolutional neural networks to Target Photo, the above method is also Including：Obtain the human body attitude video image photographed in default camera；Will be any in above-mentioned human body attitude video image One frame picture is defined as above-mentioned Target Photo.

Another aspect according to embodiments of the present invention, a kind of interactive action detection means is additionally provided, the device bag Include：Detection unit, for being detected according to default multilayer convolutional neural networks to Target Photo, obtain in above-mentioned Target Photo Frame coordinate corresponding to classification corresponding to existing at least one destination object and above-mentioned at least one destination object；First is true Order member, for determining that the above-mentioned destination object of confidence level highest is target detection object in above-mentioned at least one destination object； First processing units, for will be above-mentioned corresponding to above-mentioned classification corresponding to above-mentioned target detection object and above-mentioned target detection object Frame coordinate inputs to the default multistage and returns convolutional neural networks, and then returns convolutional Neural net according to the above-mentioned default multistage The position that network carries out human synovial position to above-mentioned target detection object is detected, and obtains the above-mentioned people in above-mentioned target detection object The position coordinates of body joint part；Second processing unit, for above-mentioned position coordinates to be normalized, and then according to pre- If multilayer recurrent neural network detects to the above-mentioned position coordinates after completing above-mentioned normalized, above-mentioned target is obtained The testing result of picture, wherein, including at least the class label of above-mentioned Target Photo in above-mentioned testing result.

Further, said apparatus also includes：Training unit, for the default loss function of basis and preset algorithm to above-mentioned Default multilayer recurrent neural network is trained, wherein, above-mentioned default loss function is classification function, and above-mentioned preset algorithm is base In the back-propagation algorithm of time scale.

Further, above-mentioned second processing unit includes：Detection sub-unit, for according to above-mentioned default multilayer recurrent neural Network detects to the above-mentioned position coordinates after completing above-mentioned normalized, obtains multiple corresponding to above-mentioned Target Photo Multiple activation values corresponding to each above-mentioned classification in classification and above-mentioned multiple classifications；Subelement is obtained, for when default Between the average value of above-mentioned multiple activation values corresponding to each above-mentioned classification is obtained in window；Determination subelement, for will be multiple above-mentioned Above-mentioned classification is defined as the class label of above-mentioned Target Photo corresponding to maximum average value in average value, so as to obtain above-mentioned inspection Survey result.

Further, said apparatus also includes：Acquiring unit, for obtaining the human body attitude photographed in default camera Video image；Second determining unit, for any one frame picture in above-mentioned human body attitude video image to be defined as into above-mentioned mesh Mark on a map piece.

Another aspect according to embodiments of the present invention, additionally provides a kind of storage medium, and above-mentioned storage medium includes storage Program, wherein, equipment where above-mentioned storage medium is controlled when said procedure is run performs above-mentioned interactive action inspection Survey method.

Another aspect according to embodiments of the present invention, additionally provides a kind of processor, and above-mentioned processor is used for operation program, Wherein, above-mentioned interactive action detection method is performed when said procedure is run.

In embodiments of the present invention, Target Photo is detected using according to default multilayer convolutional neural networks, obtained Frame coordinate corresponding to classification corresponding at least one destination object present in Target Photo and at least one destination object Mode, by determining that confidence level highest destination object is target detection object at least one destination object；So as to by mesh Frame coordinate corresponding to classification corresponding to mark detection object and target detection object, which was inputted to the default multistage, returns convolutional Neural Network, and then convolutional neural networks are returned according to the default multistage position at target detection object progress human synovial position is examined Survey, obtain the position coordinates at the human synovial position in target detection object；Reach and position coordinates be normalized, And then the position coordinates after completing normalized is detected according to default multilayer recurrent neural network, obtain target figure The purpose of the testing result of piece, wherein, including at least the class label of Target Photo in testing result.The embodiment of the present invention is realized The accuracy rate of lifting interactive action detection, improve interactive action detection efficiency technique effect, and then solve Interactive action accuracy in detection and less efficient technical problem present in prior art.

Brief description of the drawings

Accompanying drawing described herein is used for providing a further understanding of the present invention, forms the part of the application, this hair Bright schematic description and description is used to explain the present invention, does not form inappropriate limitation of the present invention.In the accompanying drawings：

Fig. 1 is a kind of schematic flow sheet of optional interactive action detection method according to embodiments of the present invention；

Fig. 2 is the schematic flow sheet of the optional interactive action detection method of another kind according to embodiments of the present invention；

Fig. 3 is the schematic flow sheet of another optional interactive action detection method according to embodiments of the present invention；

Fig. 4 is a kind of structural representation of optional interactive action detection means according to embodiments of the present invention.

Embodiment

In order that those skilled in the art more fully understand the present invention program, below in conjunction with the embodiment of the present invention Accompanying drawing, the technical scheme in the embodiment of the present invention is clearly and completely described, it is clear that described embodiment is only The embodiment of a part of the invention, rather than whole embodiments.Based on the embodiment in the present invention, ordinary skill people The every other embodiment that member is obtained under the premise of creative work is not made, it should all belong to the model that the present invention protects Enclose.

It should be noted that term " first " in description and claims of this specification and above-mentioned accompanying drawing, " Two " etc. be for distinguishing similar object, without for describing specific order or precedence.It should be appreciated that so use Data can exchange in the appropriate case, so as to embodiments of the invention described herein can with except illustrating herein or Order beyond those of description is implemented.In addition, term " comprising " and " having " and their any deformation, it is intended that cover Cover it is non-exclusive include, be not necessarily limited to for example, containing the process of series of steps or unit, method, system, product or equipment Those steps or unit clearly listed, but may include not list clearly or for these processes, method, product Or the intrinsic other steps of equipment or unit.

Embodiment 1

According to embodiments of the present invention, there is provided a kind of embodiment of interactive action detection method, it is necessary to explanation, It can be performed the step of the flow of accompanying drawing illustrates in the computer system of such as one group computer executable instructions, and And although showing logical order in flow charts, in some cases, can be with different from order execution institute herein The step of showing or describing.

Fig. 1 is a kind of schematic flow sheet of optional interactive action detection method according to embodiments of the present invention, such as Shown in Fig. 1, this method comprises the following steps：

Step S102, Target Photo is detected according to default multilayer convolutional neural networks, obtain depositing in Target Photo At least one destination object corresponding to frame coordinate corresponding to classification and at least one destination object；

Step S104, determine that confidence level highest destination object is target detection object at least one destination object；

Step S106, frame coordinate corresponding to classification corresponding to target detection object and target detection object is inputted to pre- If the multistage returns convolutional neural networks, and then returns convolutional neural networks according to the default multistage and target detection object is carried out The position detection at human synovial position, obtains the position coordinates at the human synovial position in target detection object；

Step S108, position coordinates is normalized, and then according to default multilayer recurrent neural network to completing Position coordinates after normalized is detected, and obtains the testing result of Target Photo, wherein, at least wrapped in testing result Include the class label of Target Photo.

Alternatively, a large amount of problems such as object classification in terms of computer vision in recent years of convolutional neural networks technology, knowledge Not, detection etc. embodies good performance, and it is mainly suitable for the analysis of hollow static schema of visual signal.And recurrent neural net Network achieves leading result in machine translation, visual classification problem in recent years, and it is mainly suitable for the dynamic to time series Characteristic is modeled.Therefore, depth convolutional neural networks and recurrent neural network are combined by the embodiment of the present application, and description regards Feel the Dynamic mode of time and space in signal, improve motion detection and the accuracy rate of classification.

Alternatively, before step S102 is performed, classification can be marked to every frame Target Photo, so as to build training sample Collection.For example, interactive action classification can be divided into 6 kinds：Raise one's hand, wave, swing arm, draw circle, both hands intersect and be not belonging to this 5 Other actions of kind action.

Alternatively, during step S102 is performed, multilayer convolutional neural networks can be constructed Target Photo is carried out Target detection, obtain the classification and frame coordinate of multiple targets.The network includes carrying out the feature of picture space feature extraction Network is extracted, proposes the candidate region network (Region Proposal Network) of possible target bezel locations to be detected, And candidate region is classified and frame return classification Recurrent networks.Space characteristics extraction network can select Zeiler＆Fergus networks, VGG-16/19 networks or residual error neutral net.Candidate region network is made up of 3 convolutional layers, Convolution kernel size is respectively 512 × 3 × 3,18 × 1 × 1 and 36 × 1 × 1.The input of wherein second and third convolutional layer is One convolutional layer, output are respectively the frame coordinate and fraction of candidate region.Classify Recurrent networks by region of interest (region of Interest) pond layer, the full articulamentum of two layers 4096 dimensions, and parallel classification and the full articulamentum composition of recurrence.Wherein feel The vector table that the provincial characteristics of arbitrary size is mapped as fixed length by region of interest pond layer reaches, and classifies and returns the output of full articulamentum Respectively for input frame, the offset of all kinds of fractions and frame.Frame coordinate is carried out according to frame offset Fine setting, and after being screened according to fraction, the testing result of the step can be obtained.

Alternatively, during step S106 is performed, multistage recurrence convolutional neural networks can be constructed and carry out human body Joint position is estimated, obtains the coordinate of important joint position.The network is formed by the stacking of multiple identical networks, each sub-network Input be using target's center as average Gaussian Profile, pre-process network output and a upper sub-network output.Wherein, in advance Processing network is made up of four convolutional layers, and convolution kernel size is respectively 128 × 9 × 9, pad=4,128 × 9 × 9, pad=4, 128 × 9 × 9, pad=4,32 × 5 × 5, pad=2, size 9 × 9, the maximum pondization behaviour that step-length is 2 are carried out after three first layers Make.Sub-network increases three-layer coil lamination on the basis of network is pre-processed, and size is respectively 512 × 9 × 9, pad=4,512 × 1 × 1 and 512 × 1 × 1.Number of stages is higher, then the effect of joint estimation amendment is better.

Alternatively, during step S106 is performed, presetting multistage recurrence convolutional neural networks can be by same structure Convolutional neural networks stack, and finely tune output result respectively in the multistage, and obtain head, neck, right and left shoulders, left and right elbow, Left and right wrist, left and right stock, left and right knee, the two-dimensional coordinate in 14 joints of left and right ankle.

Alternatively, during step S108 is performed, the joint coordinates of output can be normalized.Take it The head position h of first frame^{I, 0}For initial point.In view of the Scale invariant on front and each visual angle vertical direction in side, head is taken Apart from sum it is scale factor s to neck, hip joint to double kneesⁱ, then the coordinate after normalizingFor：

Alternatively, the position coordinates after completing normalized is examined according to default multilayer recurrent neural network Before survey, method also includes：Default multilayer recurrent neural network is trained according to default loss function and preset algorithm, its In, it is classification function to preset loss function, and preset algorithm is the back-propagation algorithm based on time scale.Specifically, the loss Function can be softmax classification functions.

It is alternatively possible to build multilayer recurrent neural network and be trained.According to Rye moral criterion selecting video length It is the video near average in 3 times of standard deviations as training set, and takes the time step of a length of recurrent neural network of largest frames (time step).Data enhancing is carried out using the mode that joint coordinates are added with random noise, and using trellis search method to net The hyper parameters such as the ratio of network layers number, every layer of neuron number and dropout optimize.

Specifically, during being trained to default multilayer recurrent neural network, can be lost according to Rye moral criterion Video length is abandoned as the average video outside 3 times of standard deviations, and take the time step of a length of recurrent neural network of largest frames nearby (time step) is deployed.It is neat using full 0 value complement for the video pictures feature of the maximum frame length of deficiency, and by class label 0 is arranged to, representative is not belonging to any classification.Loss function is arranged to softmax functions, the prediction classification to each frame asks loss Value, and take average as total loss, this is also referred to as perplexity.Sample set is pressed into sample number 7：3 random divisions are instruction Practice collection and test set, and joint coordinates are added fromNoise, with increase sample number carry out data enhancing.Training When using BPTT algorithms carry out network weight renewal.Using trellis search method to the network number of plies, every layer of neuron number, with And the hyper parameter such as dropout ratio optimizes, iterations elects 200 as.It is chosen at the best super ginseng of effect on test set Model under number and iterations is used to test.Indexes of Evaluation Effect is F₁Fraction, computational methods are： Wherein precision is classification accuracy, and recall is classification recall rate.

Alternatively, Fig. 2 is the flow of the optional interactive action detection method of another kind according to embodiments of the present invention Schematic diagram, as shown in Fig. 2 performing step S108, i.e., after basis presets multilayer recurrent neural network to completion normalized Position coordinates detected, obtaining the testing result of Target Photo includes：

Step S202, the position coordinates after completing normalized is examined according to default multilayer recurrent neural network Survey, obtain multiple activation values corresponding to multiple classifications corresponding to Target Photo and each classification in multiple classifications；

Step S204, the average value of multiple activation values corresponding to each classification is obtained in preset time window；

Step S206, classification corresponding to the maximum average value in multiple average values is defined as to the classification mark of Target Photo Label, so as to obtain testing result.

Alternatively, Fig. 3 is the flow of another optional interactive action detection method according to embodiments of the present invention Schematic diagram, as shown in figure 3, before step S102 is performed, i.e., Target Photo is entered according to default multilayer convolutional neural networks Before row detection, this method can also include：

Step S302, obtain the human body attitude video image photographed in default camera；

Step S304, any one frame picture in human body attitude video image is defined as Target Photo.

Specifically, the default camera can be USB camera or IP Camera.One in human body attitude video image As include multiframe target image.

Alternatively, the present invention extracts the spatial vision feature of image by convolutional neural networks, and by training recurrence refreshing Dynamic modeling is carried out to time series through network, all kinds of activation values fixed time in window previous to certain moment tire out in test Add, and take the value of maximum to correspond to result of the classification as the moment.The present invention can complete simultaneously carry out human action detection with Identification, and there is preferable real-time and robustness.

Alternatively, the detection of man-machine interaction human action and classification involved in the present invention based on depth Space-time Neural Network Method, the spatial vision feature of image can be extracted using convolutional neural networks, and human body is moved using recurrent neural network State is modeled and predicted, extends the use range of deep learning method, improves the utilization ratio of time scale information, and Simultaneously can to human body behavior at the beginning of carve and detect, expand the use range of the technology.

Embodiment 2

Another aspect according to embodiments of the present invention, a kind of interactive action detection means is additionally provided, such as Fig. 4 institutes Show, the device includes：Detection unit 401, the first determining unit 403, first processing units 405, second processing unit 407.

Wherein, detection unit 401, for being detected according to default multilayer convolutional neural networks to Target Photo, obtain Frame coordinate corresponding to classification corresponding at least one destination object present in Target Photo and at least one destination object； First determining unit 403, for determining that confidence level highest destination object is target detection object at least one destination object； First processing units 405, for frame coordinate corresponding to classification corresponding to target detection object and target detection object to be inputted Convolutional neural networks are returned to the default multistage, and then convolutional neural networks are returned to target detection object according to the default multistage The position detection at human synovial position is carried out, obtains the position coordinates at the human synovial position in target detection object；At second Unit 407 is managed, for position coordinates to be normalized, and then according to default multilayer recurrent neural network to completing normalizing Position coordinates after change processing is detected, and obtains the testing result of Target Photo, wherein, mesh is comprised at least in testing result Mark on a map the class label of piece.

Alternatively, the device can also include：Training unit, for the default loss function of basis and preset algorithm to default Multilayer recurrent neural network is trained, wherein, default loss function is classification function, and preset algorithm is based on time scale Back-propagation algorithm.

Alternatively, second processing unit 407 can include：Detection sub-unit, for according to default multilayer recurrent neural net Network detects to the position coordinates after completing normalized, obtains multiple classifications corresponding to Target Photo and multiple classes Multiple activation values corresponding to each classification in not；Subelement is obtained, it is corresponding for obtaining each classification in preset time window Multiple activation values average value；Determination subelement, for classification corresponding to the maximum average value in multiple average values to be determined For the class label of Target Photo, so as to obtain testing result.

Alternatively, the device can also include：Acquiring unit, for obtaining the human body attitude photographed in default camera Video image；Second determining unit, for any one frame picture in human body attitude video image to be defined as into Target Photo.

Another aspect according to embodiments of the present invention, additionally provides a kind of storage medium, and storage medium includes the journey of storage Sequence, wherein, equipment performs above-mentioned interactive action detection method where controlling storage medium when program is run.

Another aspect according to embodiments of the present invention, additionally provides a kind of processor, and processor is used for operation program, its In, program performs above-mentioned interactive action detection method when running.

The embodiments of the present invention are for illustration only, do not represent the quality of embodiment.

In the above embodiment of the present invention, the description to each embodiment all emphasizes particularly on different fields, and does not have in some embodiment The part of detailed description, it may refer to the associated description of other embodiment.

In several embodiments provided herein, it should be understood that disclosed technology contents, others can be passed through Mode is realized.Wherein, device embodiment described above is only schematical, such as the division of the unit, Ke Yiwei A kind of division of logic function, can there is an other dividing mode when actually realizing, for example, multiple units or component can combine or Person is desirably integrated into another system, or some features can be ignored, or does not perform.Another, shown or discussed is mutual Between coupling or direct-coupling or communication connection can be INDIRECT COUPLING or communication link by some interfaces, unit or module Connect, can be electrical or other forms.

The unit illustrated as separating component can be or may not be physically separate, show as unit The part shown can be or may not be physical location, you can with positioned at a place, or can also be distributed to multiple On unit.Some or all of unit therein can be selected to realize the purpose of this embodiment scheme according to the actual needs.

In addition, each functional unit in each embodiment of the present invention can be integrated in a processing unit, can also That unit is individually physically present, can also two or more units it is integrated in a unit.Above-mentioned integrated list Member can both be realized in the form of hardware, can also be realized in the form of SFU software functional unit.

If the integrated unit is realized in the form of SFU software functional unit and is used as independent production marketing or use When, it can be stored in a computer read/write memory medium.Based on such understanding, technical scheme is substantially The part to be contributed in other words to prior art or all or part of the technical scheme can be in the form of software products Embody, the computer software product is stored in a storage medium, including some instructions are causing a computer Equipment (can be personal computer, server or network equipment etc.) perform each embodiment methods described of the present invention whole or Part steps.And foregoing storage medium includes：USB flash disk, read-only storage (ROM, Read-Only Memory), arbitrary access are deposited Reservoir (RAM, Random Access Memory), mobile hard disk, magnetic disc or CD etc. are various can be with store program codes Medium.

Described above is only the preferred embodiment of the present invention, it is noted that for the ordinary skill people of the art For member, under the premise without departing from the principles of the invention, some improvements and modifications can also be made, these improvements and modifications also should It is considered as protection scope of the present invention.

Claims

A kind of 1. interactive action detection method, it is characterised in that including：

Target Photo is detected according to default multilayer convolutional neural networks, obtained at least one present in the Target Photo Frame coordinate corresponding to classification corresponding to individual destination object and at least one destination object；

Determine that destination object described in confidence level highest is target detection object at least one destination object；

By the frame coordinate input corresponding to the classification corresponding to the target detection object and the target detection object Convolutional neural networks are returned to the default multistage, and then convolutional neural networks are returned to the target according to the default multistage Detection object carries out the position detection at human synovial position, obtains the human synovial position in the target detection object Position coordinates；

The position coordinates is normalized, and then according to default multilayer recurrent neural network to completing the normalization The position coordinates after processing is detected, and obtains the testing result of the Target Photo, wherein, in the testing result Including at least the class label of the Target Photo.
2. according to the method for claim 1, it is characterised in that in the default multilayer recurrent neural network of basis to described in completion Before the position coordinates after normalized is detected, methods described also includes：According to default loss function and in advance Imputation method is trained to the default multilayer recurrent neural network, wherein, the default loss function is classification function, described Preset algorithm is the back-propagation algorithm based on time scale.
3. according to the method for claim 1, it is characterised in that the basis presets multilayer recurrent neural network to completing institute State the position coordinates after normalized to be detected, obtaining the testing result of the Target Photo includes：

The position coordinates after completing the normalized is examined according to the default multilayer recurrent neural network Survey, obtain multiple sharp corresponding to multiple classifications corresponding to the Target Photo and each classification in the multiple classification Value living；

The average value of the multiple activation value corresponding to each classification is obtained in preset time window；

The classification corresponding to maximum average value in multiple average values is defined as to the class label of the Target Photo, So as to obtain the testing result.
4. according to the method for claim 1, it is characterised in that in the default multilayer convolutional neural networks of basis to Target Photo Before being detected, methods described also includes：

Obtain the human body attitude video image photographed in default camera；

Any one frame picture in the human body attitude video image is defined as the Target Photo.
A kind of 5. interactive action detection means, it is characterised in that including：

Detection unit, for being detected according to default multilayer convolutional neural networks to Target Photo, obtain the Target Photo Present in frame coordinate corresponding to classification corresponding at least one destination object and at least one destination object；

First determining unit, for determining that destination object described in confidence level highest is target at least one destination object Detection object；

First processing units, for by corresponding to the classification corresponding to the target detection object and the target detection object The frame coordinate inputs to the default multistage and returns convolutional neural networks, and then returns convolution god according to the default multistage The position that human synovial position is carried out to the target detection object through network is detected, and obtains the institute in the target detection object State the position coordinates at human synovial position；

Second processing unit, for the position coordinates to be normalized, and then according to default multilayer recurrent neural net Network detects to the position coordinates after completing the normalized, obtains the testing result of the Target Photo, Wherein, the class label of the Target Photo is comprised at least in the testing result.
6. device according to claim 5, it is characterised in that described device also includes：

Training unit, for being instructed according to default loss function and preset algorithm to the default multilayer recurrent neural network Practice, wherein, the default loss function is classification function, and the preset algorithm is the back-propagation algorithm based on time scale.
7. device according to claim 5, it is characterised in that the second processing unit includes：

Detection sub-unit, for according to the default multilayer recurrent neural network to described in completing after the normalized Position coordinates is detected, and obtains multiple classifications corresponding to the Target Photo and each class in the multiple classification Not corresponding multiple activation values；

Subelement is obtained, for obtaining being averaged for the multiple activation value corresponding to each classification in preset time window Value；

Determination subelement, for the classification corresponding to the maximum average value in multiple average values to be defined as into the target The class label of picture, so as to obtain the testing result.
8. device according to claim 5, it is characterised in that described device also includes：

Acquiring unit, for obtaining the human body attitude video image photographed in default camera；

Second determining unit, for any one frame picture in the human body attitude video image to be defined as into the target figure Piece.
A kind of 9. storage medium, it is characterised in that the storage medium includes the program of storage, wherein, run in described program When control the storage medium where equipment perform claim require that 1 man-machine interaction into claim 4 described in any one is moved Make detection method.
A kind of 10. processor, it is characterised in that the processor is used for operation program, wherein, right of execution when described program is run Profit requires the 1 interactive action detection method into claim 4 described in any one.