CN108573197A - Video action detection method and device - Google Patents

Video action detection method and device

Info

Publication number
CN108573197A
CN108573197A
Authority
CN
China
Prior art keywords
neural network
frame image
video
information
identification information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710146933.5A
Other languages
Chinese (zh)
Inventor
刘春晖
厉扬豪
胡越予
刘家瑛
郭宗明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN201710146933.5A
Publication of CN108573197A
Current legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a video action detection method and device. The method includes: selecting a training video from a training set, and obtaining the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image; processing the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image; optimizing the neural network according to the identification information and recognition result and the prediction information and prediction result; repeating the above steps until the neural network converges; and, after the neural network converges, processing the skeleton data of every frame image in a video under test with the neural network to obtain the corresponding identification information and prediction information. The video action detection method and device provided by the present invention can recognize every frame image in a video without manually extracting video clips, improving detection efficiency and accuracy.

Description

Video action detection method and device
Technical field
The present invention relates to computer vision technology, and in particular to a video action detection method and device.
Background technology
The goal of video action detection is, given a video sequence, to identify the segments in which actions occur and their corresponding types. The development of Microsoft's Kinect device has made human joint skeleton data much easier to acquire; the joint skeleton is a more abstract representation of a person and is of great help for action detection and prediction problems.
In the prior art, video action detection is built on action recognition. The task of action recognition is, given a short video clip, to identify its action type. Traditional action recognition methods extract hand-crafted features of the video images, such as histograms of oriented gradients, and classify them. On this basis, it has also been proposed to use the motion trajectories and optical flow of adjacent frames as a new feature, combine them with the traditional features, compress and encode them with Fisher vectors, and then classify. These methods all treat the video as a whole, so only one action type can be recognized per video segment. When a long video contains multiple actions, video clips must be extracted manually and the above methods applied to each clip in turn, which is inefficient and inaccurate.
Summary of the invention
The present invention provides a video action detection method and device to solve the technical problem of inefficient video action recognition in the prior art.
The present invention provides a video action detection method, including:
selecting a training video from a training set, and obtaining the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image;
processing the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image;
optimizing the neural network according to the identification information and recognition result and the prediction information and prediction result;
repeating the above steps until the neural network converges;
after the neural network converges, processing the skeleton data of every frame image in a video under test with the neural network to obtain the corresponding identification information and prediction information.
Further, processing the skeleton data of each frame image with the neural network to obtain the identification information and prediction information of each frame image includes:
inputting the skeleton data of each frame image separately into a feature extraction part of the neural network to obtain corresponding feature information;
inputting the feature information into a multi-task part of the neural network to obtain the identification information and prediction information of each frame image.
Further, optimizing the neural network according to the identification information and recognition result and the prediction information and prediction result includes:
calculating an identification error according to the identification information and the recognition result;
calculating a prediction error according to the prediction information and the prediction result;
obtaining a total error as the weighted sum of the identification error and the prediction error, and back-propagating the neural network parameters using stochastic gradient descent.
Further, after the neural network converges, processing the skeleton data of every frame image in the video under test with the neural network to obtain the corresponding identification information and prediction information includes:
after the neural network converges, obtaining the skeleton data of the video under test;
inputting the skeleton data of each frame image in the video under test into the feature extraction part of the neural network to obtain corresponding feature information;
inputting the feature information corresponding to the video under test into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image;
determining, according to the identification information and the prediction information, the action category corresponding to each frame image in the video and the action end time or the start time of the next action.
Further, obtaining the recognition result and prediction result for the action in each frame image of the training video includes:
playing the training video to a user, and receiving the action category and the action end time or start time of each frame image input by the user according to the training video;
determining the recognition result and prediction result according to the action category and the action end time or start time.
The present invention also provides a video action detection device, including:
an acquisition module, configured to select a training video from a training set and obtain the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image;
a processing module, configured to process the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image;
an optimization module, configured to optimize the neural network according to the identification information and recognition result and the prediction information and prediction result;
a repetition module, configured to judge whether the neural network converges, trigger the acquisition module if the judgment result is no, and trigger a detection module if the judgment result is yes;
the detection module, configured to process, after the neural network converges, the skeleton data of every frame image in a video under test with the neural network to obtain the corresponding identification information and prediction information.
Further, the processing module is specifically configured to:
input the skeleton data of each frame image separately into the feature extraction part of the neural network to obtain corresponding feature information;
input the feature information into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image.
Further, the optimization module is specifically configured to:
calculate an identification error according to the identification information and the recognition result;
calculate a prediction error according to the prediction information and the prediction result;
obtain a total error as the weighted sum of the identification error and the prediction error, and back-propagate the neural network parameters using stochastic gradient descent.
Further, the detection module is specifically configured to:
obtain, after the neural network converges, the skeleton data of the video under test;
input the skeleton data of each frame image in the video under test into the feature extraction part of the neural network to obtain corresponding feature information;
input the feature information corresponding to the video under test into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image;
determine, according to the identification information and the prediction information, the action category corresponding to each frame image in the video and the action end time or the start time of the next action.
Further, the acquisition module is specifically configured to:
select a training video from the training set, and obtain the skeleton data of each frame image in the training video;
play the training video to a user, and receive the action category and the action end time or start time of each frame image input by the user according to the training video;
determine the recognition result and prediction result according to the action category and the action end time or start time.
The video action detection method and device provided by the present invention select a training video from a training set; obtain the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image; process the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image; optimize the neural network according to the identification information and recognition result and the prediction information and prediction result; and repeat the above steps until the neural network converges. After the neural network converges, the skeleton data of every frame image in a video under test is processed with the neural network to obtain the corresponding identification information and prediction information. Each frame image in the video can thus be recognized without manually extracting video clips, improving detection efficiency and accuracy.
Description of the drawings
Fig. 1 is a flowchart of the video action detection method provided by Embodiment One of the present invention;
Fig. 2 is a structural diagram of the video action detection device provided by Embodiment Two of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The terms used in the embodiments of the present application are for the purpose of describing specific embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present application are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
Depending on the context, the word "if" as used herein may be interpreted as "when", "while", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (the stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
It should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a product or system including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a product or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the product or system including that element.
Embodiment One
Embodiment One of the present invention provides a video action detection method. Fig. 1 is a flowchart of the video action detection method provided by Embodiment One of the present invention. As shown in Fig. 1, the method in this embodiment may include:
Step 101: select a training video from the training set, and obtain the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image.
Here, the training set may be a set containing multiple training videos, and a training video may be a video used for training the neural network.
This embodiment uses a neural network to recognize the action corresponding to each frame image in a video under test. The specific method can be divided into two parts: a training process and a detection process. Steps 101 to 104 can be regarded as the training process, in which the training videos in the training set are used to optimize the neural network; step 105 can be regarded as the detection process, in which the trained neural network is used to process the video under test and analyze the actions in it.
Specifically, obtaining the recognition result and prediction result for the action in each frame image of the training video may include: playing the training video to a user, and receiving the action category and the action end time or start time of each frame image input by the user according to the training video; and determining the recognition result and prediction result according to the action category and the action end time or start time.
The recognition result of a frame image indicates the action category of the frame image. Action categories may include, but are not limited to: sitting down, standing up, walking, pouring, and raising a hand. The prediction result of a frame image indicates the time at which the action corresponding to the frame image ends, or the start time of the next action. For example, if the action category of a certain frame image is walking and the walking action ends 10 frames later, the prediction result may be 10 frames. The action category and the action end time or start time of each image can be determined and input by the user according to the training video.
The skeleton data of each frame image may include information such as the position and size of each joint of the human skeleton in the image. The joints may include, but are not limited to: head, left shoulder, right shoulder, left wrist, right wrist, and so on. The acquisition of the skeleton data of an image belongs to the prior art and is not described again in this embodiment.
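For concreteness, a per-frame training sample assembled from these inputs might look as follows. This is an illustrative Python sketch only; the field names (joints, action_onehot, frames_to_end), the joint list, and the coordinate values are hypothetical and do not appear in the patent.

# Hypothetical per-frame training sample, assuming M = 4 action categories
# and five tracked joints; none of these names are taken from the patent.
frame_sample = {
    "joints": {                      # skeleton data: (x, y) position per joint
        "head":           (0.51, 0.12),
        "left_shoulder":  (0.44, 0.25),
        "right_shoulder": (0.58, 0.25),
        "left_wrist":     (0.40, 0.47),
        "right_wrist":    (0.62, 0.18),
    },
    "action_onehot": [0, 0, 0, 1],   # recognition result z_t: (sit, stand up, walk, raise hand)
    "frames_to_end": 10,             # prediction result L_t: frames until the current action ends
}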
Step 102: process the skeleton data of each frame image with the neural network to obtain the identification information and prediction information of each frame image.
In this embodiment, the neural network mainly includes a feature extraction part and a multi-task part. Each of the two parts may in turn include several components; for example, the feature extraction part includes long short-term memory (LSTM) neural networks, and the multi-task part includes fully connected neural networks.
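A minimal sketch of such a two-part network is given below, in Python with PyTorch, assuming per-frame skeleton vectors of dimension input_dim and num_classes action categories. The class name, layer sizes, and the plain regression head are illustrative assumptions rather than the patent's exact architecture; in particular, the patent's prediction branch additionally applies the class-level selection described later in step 102.

import torch
import torch.nn as nn

class ActionDetectionNet(nn.Module):
    """Sketch: feature extraction part (LSTM) + multi-task part (FC heads).
    Layer sizes and the plain regression head are assumptions."""
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        # feature extraction part: temporal modelling of the skeleton sequence
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc_feat = nn.Linear(hidden_dim, hidden_dim)
        self.drop = nn.Dropout(p=0.5)        # random parameter loss against overfitting
        # multi-task part: per-frame classification and countdown regression
        self.fc_cls = nn.Linear(hidden_dim, num_classes)
        self.fc_reg = nn.Linear(hidden_dim, 1)

    def forward(self, skeleton_seq):         # skeleton_seq: (batch, N, input_dim)
        h, _ = self.lstm(skeleton_seq)       # intermediate parameters h_1 ... h_N
        g = self.drop(torch.relu(self.fc_feat(h)))   # feature information g_1 ... g_N
        y = torch.softmax(self.fc_cls(g), dim=-1)    # identification information y_t
        p = self.fc_reg(g).squeeze(-1)               # prediction information p_t
        return y, p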
In the feature extraction part of the neural network, the model extracts action features by stacking three layers of a composite network. In each layer, a recurrent neural network models and stores the temporal characteristics of the action, a fully connected layer then processes the features carried by the hidden states together with the joint correlation information, and a mechanism of randomly dropping parameters is introduced afterwards to mitigate overfitting in neural network learning.
In the multi-task part of the neural network, in order to obtain a more accurate action detection result, the training set is pre-processed so that each action is subdivided into three parts: before the action occurs, while the action is occurring, and when the action is about to end. Using the output of the feature extraction part, a fully connected neural network classifies the action type of each frame, so that the recognition and detection tasks are completed at the same time. On the basis of the per-frame classification, the multi-task part handles the action prediction task, which is mainly divided into predicting the start of an action and predicting its end. A selection neural network uses the action classification obtained during detection to screen the parameters of the fully connected neural network at the action-type level, converting the prediction problem into a regression problem of counting down to the action's occurrence.
Processing the skeleton data of each frame image with the neural network to obtain the identification information and prediction information of each frame image may include: inputting the skeleton data of each frame image separately into the feature extraction part of the neural network to obtain corresponding feature information; and inputting the feature information into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image.
Here, the identification information indicates the action category of the image obtained by the neural network computation, and the prediction information indicates the end time of the action, or the start time of the next action, obtained by the neural network computation.
Specifically, suppose the training set contains S training videos, denoted V_0 … V_{S-1}, where each training video may include one or more actions. A training video is randomly selected from the training set; suppose the selected training video is V_i and consists of N frame images, denoted f_1 … f_N.
First, the skeleton data f_1 … f_N of the frame images in the training video is input into the feature extraction part of the neural network to obtain the corresponding feature information. The specific steps are as follows:
Step 1021: input the skeleton data into the long short-term memory (LSTM) neural network (a kind of recurrent neural network), and obtain the corresponding outputs h_1 … h_N using formula (1):
h_1, h_2, …, h_N = LSTM(f_1, f_2, …, f_N)   (1)
In formula (1), f_t is the skeleton data corresponding to the t-th frame image of the video, h_t is the intermediate parameter of the t-th frame image and is an L × 1 vector, and LSTM is the long short-term memory neural network function. The parameters of this function are generated randomly at initialization and are later corrected according to the error of step 103.
Step 1022: input the computed h_1 … h_N into the fully connected neural network, and obtain the corresponding results using formula (2).
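The published text does not reproduce formula (2) itself. Given the dimensions defined just below (h_i and g_i both L × 1 vectors, W an L × L matrix), it presumably takes the form of a standard fully connected layer; the following is a reconstruction under that assumption, not the patent's literal formula:

g_i(j) = Σ_{k=1..L} W_{k,j} · h_i(k),   j = 1, …, L   (2)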
In formula (2), W_{k,j} is the matrix to be solved in the fully connected neural network function, an L × L matrix; its parameters are random numbers at initialization and are later corrected according to the error of step 103. L depends on the vector length of the skeleton information. g_i denotes the feature information of the i-th frame image, and h_i denotes the intermediate parameter of the i-th frame image obtained by formula (1), with i ranging from 1 to N. Both h_i and g_i are L × 1 vectors, so j ranges from 1 to L.
Step 1023: after obtaining g_i(j), randomly discard some of the values of the vector within a preset range to prevent overfitting.
Steps 1021 to 1023 are repeated three times to finally obtain the corresponding feature information g_1 … g_N, where g_i ∈ R^(L×1), i.e. a real vector of length L. The values of L in the three passes are [100, 120, 100] respectively; by varying L from low to high and back to low, the processing goes from low dimension to high dimension and back again, which makes the neural network more robust.
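The three-pass stack can be sketched as follows, again in PyTorch and assuming the dimensions [100, 120, 100] stated above; the class and key names are illustrative:

import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of the stacked feature extraction part: three passes of
    LSTM (step 1021) + fully connected layer (step 1022) + dropout (step 1023)."""
    def __init__(self, input_dim, dims=(100, 120, 100), p_drop=0.5):
        super().__init__()
        self.layers = nn.ModuleList()
        prev = input_dim
        for L in dims:                                 # low -> high -> low dimension
            self.layers.append(nn.ModuleDict({
                "lstm": nn.LSTM(prev, L, batch_first=True),
                "fc":   nn.Linear(L, L),
                "drop": nn.Dropout(p_drop),
            }))
            prev = L

    def forward(self, x):                              # x: (batch, N, input_dim)
        for layer in self.layers:
            h, _ = layer["lstm"](x)                    # h_1 ... h_N, formula (1)
            x = layer["drop"](layer["fc"](h))          # g_1 ... g_N, formula (2) + dropout
        return x                                       # final feature information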
Then, after the feature information of each frame image is obtained, the feature information g_1 … g_N is input into the multi-task part of the neural network to obtain the identification information and prediction information of each frame. The specific steps are as follows:
Step 1024: input the feature information g_1 … g_N into the fully connected neural network, and obtain the identification information using formula (3).
In formula (3), W'_{k,j} is the matrix to be solved of the fully connected neural network (an L × M matrix, as the dimensions below imply); its parameters are random numbers at initialization and are later corrected according to the error of step 103. g_t is the feature information of the t-th frame image, an L × 1 vector with L = 100. y_t denotes the identification information of the t-th frame image, i.e. the result vector for judging the frame's action category; y_t is an M × 1 vector, so j in formula (3) ranges from 1 to M, where M is the number of action categories. Each value in the vector expresses the confidence of the corresponding action.
Suppose there are 4 action categories, namely sitting, standing up, walking and raising a hand, so that M = 4. Then y_t = (0, 0.1, 0.2, 0.5) means that the confidences of sitting, standing up, walking and raising a hand are 0, 0.1, 0.2 and 0.5 respectively. The higher the confidence, the more likely the image corresponds to that action; raising a hand has the highest confidence, 0.5, indicating that the action in the image is most likely raising a hand.
Finally, the prediction information is obtained according to the skeleton data and the identification information. The specific steps are as follows:
Step 1025: generate the prediction matrix using formula (4).
In formula (4), W'' is the matrix to be solved of the fully connected neural network, an L × L matrix; its parameters are random numbers at initialization and are later corrected according to the error of step 103. g_t is the feature information of the t-th frame image, and p' is an L × 1 vector.
The data in the vector p'_t are grouped M values at a time and converted into a matrix G_{k,j}, where G is an (L/M) × M matrix.
Step 1026: select from the matrix using the final result y_t obtained in step 1024 and formula (5).
In formula (5), p'' is an (L/M) × 1 vector, and j ranges from 1 to L/M.
Step 1027: obtain the prediction information using formula (6).
In formula (6), W''' is the matrix to be solved of the fully connected neural network, a 1 × (L/M) matrix; its parameters are random numbers at initialization and are later corrected according to the error of step 103. p_t ∈ R is the prediction of the frame's action occurrence; the value represents the length of time from the current frame until the action ends or until the next action occurs.
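Putting steps 1025 to 1027 together, the regression branch can be sketched as below. Reading formula (5) as weighting the columns of G by the class confidences y_t is an interpretation of the text, not the patent's literal computation, and all names and sizes are illustrative:

import torch

def predict_countdown(g_t, W2, y_t, W3):
    """Sketch of steps 1025-1027. Shapes: g_t (L,), W2 (L, L), y_t (M,), W3 (L//M,).
    An interpretation of formulas (4)-(6), not the patent's literal formulas."""
    L, M = W2.shape[0], y_t.shape[0]
    p1 = W2 @ g_t                    # formula (4): prediction vector p', length L
    G = p1.reshape(L // M, M)        # group p' into an (L/M) x M matrix G
    p2 = G @ y_t                     # formula (5): select columns by class confidence -> p''
    return W3 @ p2                   # formula (6): scalar countdown prediction p_t

# usage sketch with the illustrative sizes L = 100, M = 4 (untrained weights)
L, M = 100, 4
g_t = torch.randn(L)
y_t = torch.softmax(torch.randn(M), dim=0)
W2, W3 = torch.randn(L, L), torch.randn(L // M)
print(predict_countdown(g_t, W2, y_t, W3))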
Step 103: optimize the neural network according to the identification information and recognition result and the prediction information and prediction result.
In this embodiment, optimizing the neural network according to the identification information and recognition result and the prediction information and prediction result may include: calculating an identification error according to the identification information and the recognition result; calculating a prediction error according to the prediction information and the prediction result; obtaining a total error as the weighted sum of the identification error and the prediction error; and back-propagating the neural network parameters using stochastic gradient descent.
Specifically, the results obtained in step 102 are compared with the actual results from step 101 to obtain the total error, formula (7):
err = err_cls + λ · err_pred   (7)
The total error is thus a weighted combination of the identification error err_cls and the prediction error err_pred, where λ is a manually chosen coefficient, generally taken as 0.1.
The identification error is calculated by the following formula:
Here, y_t is the identification information of the t-th frame image obtained in step 102, and y_{t,k} denotes the value corresponding to the k-th action category in the identification information. z_t is the recognition result of the t-th frame image obtained in step 101, i.e. the correct prediction vector, and z_{t,k} denotes the value corresponding to the k-th action category in the recognition result, with k ranging from 1 to M. The value corresponding to the correct action is 1 and the values corresponding to the other actions are 0. Suppose there are 4 action categories, namely sitting, standing up, walking and raising a hand: if the user identifies the action in a certain frame image as raising a hand, the recognition result is z_t = (0, 0, 0, 1); if the user identifies the action in another frame image as standing up, the recognition result is z_t = (0, 1, 0, 0).
The prediction error is calculated by the following formula:
Here, p_t is the prediction information obtained in step 102, and L_t is the prediction result obtained in step 101, i.e. the correct prediction information.
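The published text does not reproduce the two error formulas themselves. A standard choice consistent with the variables defined above would be a cross-entropy identification error and a squared prediction error; the following is a reconstruction under that assumption, not the patent's literal formulas:

err_cls = −Σ_{t=1..N} Σ_{k=1..M} z_{t,k} · ln(y_{t,k})

err_pred = Σ_{t=1..N} (p_t − L_t)²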
On the basis of the total error obtained by formula (7), the neural network parameters are back-propagated using stochastic gradient descent.
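A minimal training-step sketch combining the two error terms, assuming the ActionDetectionNet sketched earlier and the reconstructed error forms above; the 1e-8 numerical guard and the calling convention are illustrative:

import torch
import torch.nn.functional as F

def training_step(model, optimizer, skeleton_seq, z, countdown, lam=0.1):
    """One pass of step 103. skeleton_seq: (1, N, D) skeleton data;
    z: (N,) ground-truth class indices; countdown: (N,) frames-to-end.
    The loss forms are assumptions, not the patent's literal formulas."""
    y, p = model(skeleton_seq)                    # identification / prediction information
    # y holds softmax probabilities (see ActionDetectionNet), so take log for NLL
    err_cls = F.nll_loss(torch.log(y.squeeze(0) + 1e-8), z)   # assumed cross-entropy
    err_pred = F.mse_loss(p.squeeze(0), countdown)            # assumed squared error
    total = err_cls + lam * err_pred              # formula (7): weighted sum, lambda ~ 0.1
    optimizer.zero_grad()
    total.backward()                              # back-propagation
    optimizer.step()                              # gradient descent update
    return total.item()

Passing a torch.optim.SGD optimizer makes this update the stochastic-gradient-descent back-propagation described above.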
Step 104: repeat the above steps until the neural network converges.
Specifically, steps 101 to 103 may be repeated. When step 101 is subsequently repeated, a training video that has not yet been processed may be selected from the training set, so that the neural network is better optimized, until the neural network converges.
Step 105: after the neural network converges, process the skeleton data of every frame image in the video under test with the neural network to obtain the corresponding identification information and prediction information.
After the neural network converges, the video under test can be detected with the neural network. Specifically, the skeleton data of the video under test may first be obtained, and the skeleton data of each frame image in the video under test is input into the feature extraction part of the neural network to obtain the corresponding feature information. Then, the feature information corresponding to the video under test may be input into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image; the identification information and prediction information can be computed with the formulas in step 102. Finally, the action category corresponding to each frame image in the video and the action end time or the start time of the next action are determined according to the identification information and the prediction information: the identification information of a frame image contains the confidence of each action category, and the action category with the highest confidence can be regarded as the action in that frame image, while the prediction information directly indicates how many frames, or how much time, remain until the action ends or the next action starts.
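The detection process can then be sketched as follows, again assuming the ActionDetectionNet above; the category names are illustrative:

import torch

@torch.no_grad()
def detect_actions(model, skeleton_seq, category_names):
    """Sketch of step 105: per-frame action category plus countdown.
    skeleton_seq: (1, N, D); category_names: list of M labels (illustrative)."""
    model.eval()
    y, p = model(skeleton_seq)              # identification / prediction information
    classes = y.squeeze(0).argmax(dim=-1)   # highest-confidence category per frame
    return [(category_names[c], frames_left)              # e.g. ("raise hand", 10.0)
            for c, frames_left in zip(classes.tolist(), p.squeeze(0).tolist())]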
On the basis of the technical solution provided in this embodiment, the order of the steps can be adapted. For example, the acquisition in step 101 of the recognition result and prediction result for the action in each frame image of the training video may be adjusted to be executed after step 102.
In practical applications, the user can determine the action category of each frame image and the action end time or the start time of the next action according to the training videos in the training set. With the method in the above steps, the neural network can be optimized according to information such as the actual action categories in the video, so that the identification information and prediction information output by the neural network are close enough to the actual values. The video under test can then be processed with the neural network to obtain its identification information and prediction information. Since each frame in the video is processed, multiple actions in one video can be recognized, and the end time of each action can also be predicted.
The video action detection method provided by this embodiment selects a training video from a training set; obtains the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image; processes the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image; optimizes the neural network according to the identification information and recognition result and the prediction information and prediction result; and repeats the above steps until the neural network converges. After the neural network converges, the skeleton data of every frame image in a video under test is processed with the neural network to obtain the corresponding identification information and prediction information. Each frame image in the video can thus be recognized without manually extracting video clips, improving detection efficiency and accuracy.
Embodiment Two
Embodiment Two of the present invention provides a video action detection device. Fig. 2 is a structural diagram of the video action detection device provided by Embodiment Two of the present invention. As shown in Fig. 2, the device in this embodiment may include:
an acquisition module 201, configured to select a training video from a training set and obtain the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image;
a processing module 202, configured to process the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image;
an optimization module 203, configured to optimize the neural network according to the identification information and recognition result and the prediction information and prediction result;
a repetition module 204, configured to judge whether the neural network converges, trigger the acquisition module 201 if the judgment result is no, and trigger the detection module 205 if the judgment result is yes;
the detection module 205, configured to process, after the neural network converges, the skeleton data of every frame image in a video under test with the neural network to obtain the corresponding identification information and prediction information.
The video action detection device in this embodiment can be used to execute the video action detection method described in Embodiment One. Its implementation principle can be found in Embodiment One and is not repeated here.
The video action detection device provided by this embodiment selects a training video from a training set; obtains the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image; processes the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image; optimizes the neural network according to the identification information and recognition result and the prediction information and prediction result; and repeats the above steps until the neural network converges. After the neural network converges, the skeleton data of every frame image in a video under test is processed with the neural network to obtain the corresponding identification information and prediction information. Each frame image in the video can thus be recognized without manually extracting video clips, improving detection efficiency and accuracy.
Further, the processing module 202 is specifically configured to:
input the skeleton data of each frame image separately into the feature extraction part of the neural network to obtain corresponding feature information;
input the feature information into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image.
Further, the optimization module 203 is specifically configured to:
calculate an identification error according to the identification information and the recognition result;
calculate a prediction error according to the prediction information and the prediction result;
obtain a total error as the weighted sum of the identification error and the prediction error, and back-propagate the neural network parameters using stochastic gradient descent.
Further, the detection module 205 is specifically configured to:
obtain, after the neural network converges, the skeleton data of the video under test;
input the skeleton data of each frame image in the video under test into the feature extraction part of the neural network to obtain corresponding feature information;
input the feature information corresponding to the video under test into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image;
determine, according to the identification information and the prediction information, the action category corresponding to each frame image in the video and the action end time or the start time of the next action.
Further, the acquisition module 201 is specifically configured to:
select a training video from the training set, and obtain the skeleton data of each frame image in the training video;
play the training video to a user, and receive the action category and the action end time or start time of each frame image input by the user according to the training video;
determine the recognition result and prediction result according to the action category and the action end time or start time.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A video action detection method, characterized by comprising:
selecting a training video from a training set, and obtaining the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image;
processing the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image;
optimizing the neural network according to the identification information and recognition result and the prediction information and prediction result;
repeating the above steps until the neural network converges;
after the neural network converges, processing the skeleton data of every frame image in a video under test with the neural network to obtain the corresponding identification information and prediction information.
2. The method according to claim 1, characterized in that processing the skeleton data of each frame image with the neural network to obtain the identification information and prediction information of each frame image comprises:
inputting the skeleton data of each frame image separately into a feature extraction part of the neural network to obtain corresponding feature information;
inputting the feature information into a multi-task part of the neural network to obtain the identification information and prediction information of each frame image.
3. The method according to claim 1, characterized in that optimizing the neural network according to the identification information and recognition result and the prediction information and prediction result comprises:
calculating an identification error according to the identification information and the recognition result;
calculating a prediction error according to the prediction information and the prediction result;
obtaining a total error as the weighted sum of the identification error and the prediction error, and back-propagating the neural network parameters using stochastic gradient descent.
4. The method according to any one of claims 1 to 3, characterized in that, after the neural network converges, processing the skeleton data of every frame image in the video under test with the neural network to obtain the corresponding identification information and prediction information comprises:
after the neural network converges, obtaining the skeleton data of the video under test;
inputting the skeleton data of each frame image in the video under test into the feature extraction part of the neural network to obtain corresponding feature information;
inputting the feature information corresponding to the video under test into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image;
determining, according to the identification information and the prediction information, the action category corresponding to each frame image in the video and the action end time or the start time of the next action.
5. The method according to claim 4, characterized in that obtaining the recognition result and prediction result for the action in each frame image of the training video comprises:
playing the training video to a user, and receiving the action category and the action end time or start time of each frame image input by the user according to the training video;
determining the recognition result and prediction result according to the action category and the action end time or start time.
6. A video action detection device, characterized by comprising:
an acquisition module, configured to select a training video from a training set and obtain the skeleton data of each frame image in the training video and the recognition result and prediction result for the action in each image;
a processing module, configured to process the skeleton data of each frame image with a neural network to obtain the identification information and prediction information of each frame image;
an optimization module, configured to optimize the neural network according to the identification information and recognition result and the prediction information and prediction result;
a repetition module, configured to judge whether the neural network converges, trigger the acquisition module if the judgment result is no, and trigger a detection module if the judgment result is yes;
the detection module, configured to process, after the neural network converges, the skeleton data of every frame image in a video under test with the neural network to obtain the corresponding identification information and prediction information.
7. The device according to claim 6, characterized in that the processing module is specifically configured to:
input the skeleton data of each frame image separately into the feature extraction part of the neural network to obtain corresponding feature information;
input the feature information into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image.
8. The device according to claim 6, characterized in that the optimization module is specifically configured to:
calculate an identification error according to the identification information and the recognition result;
calculate a prediction error according to the prediction information and the prediction result;
obtain a total error as the weighted sum of the identification error and the prediction error, and back-propagate the neural network parameters using stochastic gradient descent.
9. The device according to any one of claims 6 to 8, characterized in that the detection module is specifically configured to:
obtain, after the neural network converges, the skeleton data of the video under test;
input the skeleton data of each frame image in the video under test into the feature extraction part of the neural network to obtain corresponding feature information;
input the feature information corresponding to the video under test into the multi-task part of the neural network to obtain the identification information and prediction information of each frame image;
determine, according to the identification information and the prediction information, the action category corresponding to each frame image in the video and the action end time or the start time of the next action.
10. The device according to claim 9, characterized in that the acquisition module is specifically configured to:
select a training video from the training set, and obtain the skeleton data of each frame image in the training video;
play the training video to a user, and receive the action category and the action end time or start time of each frame image input by the user according to the training video;
determine the recognition result and prediction result according to the action category and the action end time or start time.
CN201710146933.5A 2017-03-13 2017-03-13 Video actions detection method and device Pending CN108573197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710146933.5A CN108573197A (en) 2017-03-13 2017-03-13 Video actions detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710146933.5A CN108573197A (en) 2017-03-13 2017-03-13 Video actions detection method and device

Publications (1)

Publication Number Publication Date
CN108573197A (en) 2018-09-25

Family

ID=63578612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710146933.5A Pending CN108573197A (en) 2017-03-13 2017-03-13 Video actions detection method and device

Country Status (1)

Country Link
CN (1) CN108573197A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264668A (en) * 2019-07-10 2019-09-20 四川长虹电器股份有限公司 Multi-strategy elderly care method based on machine vision technology
CN111382764A (en) * 2018-12-29 2020-07-07 北大方正集团有限公司 Neural network model establishing method and device and computer readable storage medium
CN111382306A (en) * 2018-12-28 2020-07-07 杭州海康威视数字技术股份有限公司 Method and device for inquiring video frame
CN112364695A (en) * 2020-10-13 2021-02-12 杭州城市大数据运营有限公司 Behavior prediction method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082926A (en) * 2007-07-03 2007-12-05 浙江大学 Modeling approach used for trans-media digital city scenic area
CN106022229A (en) * 2016-05-11 2016-10-12 北京航空航天大学 Abnormal behavior identification method in error BP Adaboost network based on video motion information feature extraction and adaptive boost algorithm
CN106471492A (en) * 2014-06-24 2017-03-01 谷歌公司 Indexing actions for resources

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082926A (en) * 2007-07-03 2007-12-05 浙江大学 Modeling approach used for trans-media digital city scenic area
CN106471492A (en) * 2014-06-24 2017-03-01 谷歌公司 Indexing actions for resources
CN106022229A (en) * 2016-05-11 2016-10-12 北京航空航天大学 Abnormal behavior identification method in error BP Adaboost network based on video motion information feature extraction and adaptive boost algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yanghao Li et al.: "Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks", arXiv:1604.05633v2 [cs.CV], 26 Jul 2016 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382306A (en) * 2018-12-28 2020-07-07 杭州海康威视数字技术股份有限公司 Method and device for inquiring video frame
CN111382306B (en) * 2018-12-28 2023-12-01 杭州海康威视数字技术股份有限公司 Method and device for inquiring video frame
CN111382764A (en) * 2018-12-29 2020-07-07 北大方正集团有限公司 Neural network model establishing method and device and computer readable storage medium
CN111382764B (en) * 2018-12-29 2024-02-13 新方正控股发展有限责任公司 Neural network model building method and device for face recognition or gesture recognition and computer readable storage medium
CN110264668A (en) * 2019-07-10 2019-09-20 四川长虹电器股份有限公司 Multi-strategy elderly care method based on machine vision technology
CN112364695A (en) * 2020-10-13 2021-02-12 杭州城市大数据运营有限公司 Behavior prediction method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
Zhang et al. Dynamic hand gesture recognition based on short-term sampling neural networks
CN107403154B (en) Gait recognition method based on dynamic vision sensor
CN110287844B (en) Traffic police gesture recognition method based on convolutional pose machines and long short-term memory network
CN108875708A (en) Behavior analysis method, device, equipment, system and storage medium based on video
JP6517681B2 (en) Image pattern learning apparatus, method and program
CN111881705A (en) Data processing, training and recognition method, device and storage medium
CN110084228A (en) Automatic hazardous behavior identification method based on two-stream convolutional neural networks
Kumar et al. An object detection technique for blind people in real-time using deep neural network
US11417095B2 (en) Image recognition method and apparatus, electronic device, and readable storage medium using an update on body extraction parameter and alignment parameter
CN108629326A (en) Action behavior recognition method and device for a target body
CN107680119A (en) Tracking algorithm based on spatio-temporal context fusing multiple features and a scale filter
CN110070029B (en) Gait recognition method and device
CN108573197A (en) Video action detection method and device
CN104361316B (en) Dimension emotion recognition method based on multi-scale time sequence modeling
CN108304820A (en) Face detection method, device and terminal device
JP2012178036A (en) Similarity evaluation device and method, and similarity evaluation program and storage medium for the same
CN113642431A (en) Training method and device of target detection model, electronic equipment and storage medium
CN111914676A (en) Human body tumbling detection method and device, electronic equipment and storage medium
CN112149616A (en) Figure interaction behavior recognition method based on dynamic information
CN110705428A (en) Facial age recognition system and method based on impulse neural network
CN114495006A (en) Detection method and device for left-behind object and storage medium
CN112926522A (en) Behavior identification method based on skeleton posture and spatio-temporal graph convolutional network
CN112906520A (en) Gesture coding-based action recognition method and device
JP2018005638A (en) Image recognition model learning device, image recognition unit, method and program
Putra et al. The performance of Siamese neural network for face recognition using different activation functions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180925