CN110210344A - Video action recognition method and apparatus, electronic device, and storage medium - Google Patents

Video action recognition method and apparatus, electronic device, and storage medium

Info

Publication number
CN110210344A
CN110210344A
Authority
CN
China
Prior art keywords
video
sequence
original feature
dimension
video images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910419763.2A
Other languages
Chinese (zh)
Inventor
倪烽
易阳
赵世杰
邱日明
李峰
左小祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910419763.2A priority Critical patent/CN110210344A/en
Publication of CN110210344A publication Critical patent/CN110210344A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of this application disclose a video action recognition method and apparatus. The method comprises: obtaining an original feature extracted, in the spatial dimension and on preset channels, from a video image sequence in a video stream, the original feature being a joint representation of the features of the video image sequence in the temporal, spatial, and channel dimensions; performing separation processing in the channel dimension and the temporal dimension on the original feature to obtain a separated feature of the video image sequence; merging the separated feature and the original feature to obtain a target feature of the video image sequence; and recognizing, according to the target feature, the action contained in the video image sequence to obtain an action recognition result. With this method, the actions contained in a video stream can be recognized accurately.

Description

Video action recognition method and apparatus, electronic device, storage medium
Technical field
This application relates to the technical field of image processing, and in particular to a video action recognition method and apparatus, an electronic device, and a computer-readable storage medium.
Background technique
In the field of action recognition, recognizing actions in real-time video is of great importance. For example, in sign language translation scenarios, an action recognition scheme for real-time video can greatly reduce the dependence on human interpreters and thereby free up manpower.
In existing implementations, action recognition on real-time video is generally achieved with artificial intelligence algorithms, most commonly the TSN (Temporal Segment Networks) algorithm. Its principle is to divide the input video into several short segments at equal intervals, use a segment consensus function to combine the category scores of the multiple segments into a consensus about the class hypothesis among them, and finally, based on this consensus, use a prediction function to predict the probability that the entire video belongs to each action type.
Although the TSN algorithm can perform action recognition on real-time video, it makes insufficient use of temporal information in doing so, which limits the accuracy of the recognition.
Summary of the invention
To solve the problem that the prior art achieves only limited accuracy in action recognition on real-time video, embodiments of this application provide a video action recognition method and apparatus, an electronic device, and a computer-readable storage medium for recognizing the actions contained in real-time video.
The technical solutions adopted by this application are as follows:
A video action recognition method, comprising: obtaining an original feature extracted, in the spatial dimension and on preset channels, from a video image sequence in a video stream, the original feature being a joint representation of the features of the video image sequence in the temporal, spatial, and channel dimensions; performing separation processing in the channel dimension and the temporal dimension on the original feature to obtain a separated feature of the video image sequence; merging the separated feature and the original feature to obtain a target feature of the video image sequence; and recognizing, according to the target feature, the action contained in the video image sequence to obtain an action recognition result.
A video action recognition apparatus, comprising: an original feature obtaining module, configured to obtain an original feature extracted, in the spatial dimension and on preset channels, from a video image sequence in a video stream, the original feature being a joint representation of the features of the video image sequence in the temporal, spatial, and channel dimensions; a separation processing module, configured to perform separation processing in the channel dimension and the temporal dimension on the original feature to obtain a separated feature of the video image sequence; a feature merging module, configured to merge the separated feature and the original feature to obtain a target feature of the video image sequence; and an action recognition module, configured to recognize, according to the target feature, the action contained in the video image sequence to obtain an action recognition result.
An electronic device, comprising a processor and a memory, the memory storing computer-readable instructions which, when executed by the processor, implement the video action recognition method described above.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the video action recognition method described above.
In the technical solution of this application, a video stream refers to real-time transmitted video. After the original feature of a video image sequence in the video stream is obtained, separation processing in the channel dimension and the temporal dimension is applied to the original feature, and the separated feature produced by this processing is merged with the original feature to obtain the target feature, so that the video feature information of the video image sequence gains an enhanced representation in the temporal dimension of the target feature.
Since the actions in a video stream are embodied by consecutive video images, they are highly sensitive to the video feature information expressed in the temporal dimension. Because this application performs video action recognition according to the target feature, it can use the target feature's enhanced expression of video feature information in the temporal dimension to accurately recognize the actions in the video stream.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
Detailed description of the invention
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the principles of this application.
Fig. 1 is a schematic diagram of an implementation environment involved in this application;
Fig. 2 is a schematic structural diagram of the video action recognition network model in Fig. 1;
Fig. 3 is a flowchart of a video action recognition method according to an exemplary embodiment;
Fig. 4 is a flowchart of a video action recognition method according to another exemplary embodiment;
Fig. 5 is a flowchart of one embodiment of step 230 in the embodiment corresponding to Fig. 4;
Fig. 6 is a flowchart of one embodiment of step 130 in the embodiment corresponding to Fig. 3;
Fig. 7 is a schematic diagram of the first pointwise convolutional layer performing convolution on the original feature, according to an exemplary embodiment;
Fig. 8 is a schematic diagram of the depthwise convolutional layer performing convolution on the intermediate feature, according to an exemplary embodiment;
Fig. 9 is a schematic diagram of the second pointwise convolutional layer performing convolution on the separated feature, according to an exemplary embodiment;
Fig. 10 is a flowchart of a video action recognition method according to yet another exemplary embodiment;
Fig. 11 is a schematic structural diagram of one embodiment of the temporal enhancement module in Fig. 2;
Fig. 12 is a block diagram of a video action recognition apparatus according to an exemplary embodiment;
Fig. 13 is a hardware block diagram of an electronic device according to an exemplary embodiment.
The above drawings show specific embodiments of this application that will be described in more detail below. These drawings and their written descriptions are not intended to limit the scope of this application's concepts in any way, but rather to illustrate the concepts of this application to those skilled in the art by reference to specific embodiments.
Specific embodiment
Exemplary embodiments will now be described in detail, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
Fig. 1 is a schematic diagram of an implementation environment involved in this application. A video image sequence comprises several video images extracted, in chronological order, from the video stream on which video action recognition is to be performed. As shown in Fig. 1, the picture content of each video image in the sequence contains a corresponding posture, and the postures in the individual video images together constitute the action contained in the video image sequence.
The video action recognition network model is a network model built according to the video action recognition method disclosed in this application. It recognizes the action contained in a video image sequence according to the picture content of each video image.
By inputting a video image sequence into the video action recognition network model, the model recognizes the action contained in the sequence and outputs a corresponding action recognition result. Illustratively, as shown in Fig. 1, the action recognition result output by the model consists of the five most likely action types together with the probability of each.
A video stream generally yields several video image sequences; by feeding each sequence into the video action recognition network model in turn, recognition is performed over the entire video stream.
In one illustrative application scenario, the video image sequences fed into the model are extracted from a real-time sign language video. According to the action recognition results output by the model, the sign language in the video can be translated in real time without human involvement.
In another exemplary scenario, the video image sequences fed into the model are extracted from the live video stream output by a live-streaming platform or a video chat tool, in order to recognize the real-time actions of the streamer or the chat participants and thus make the live stream or the video chat more entertaining.
Fig. 2 is a schematic structural diagram of a video action recognition network model according to an exemplary embodiment. In one exemplary embodiment, as shown in Fig. 2, the model consists of a backbone network layer 101, a pooling network layer 102, a temporal enhancement module 103, and a fully connected layer 104.
The backbone network layer 101 extracts convolution features of the video image sequence in the spatial dimension and on the preset channels; the pooling network layer 102 compresses these convolution features to obtain the original feature of the video image sequence; the temporal enhancement module 103 applies temporal enhancement processing to the original feature to obtain the target feature of the video image sequence; and the fully connected layer 104 recognizes the action contained in the video image sequence according to the target feature and outputs the corresponding action recognition result.
In one exemplary embodiment, the backbone network layer 101 is composed of several sequentially connected sub-network layers; for example, in the video action recognition network model shown in Fig. 2, the backbone network 101 is formed by sequentially connected sub-network layers 1-5. Illustratively, the backbone network layer 101 may be a ResNet-50 convolutional neural network, or a lightweight convolutional neural network such as MobileNet or ShuffleNet, which is not limited here.
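For orientation, this four-part structure can be sketched as a PyTorch module. This is a hypothetical rendering of Fig. 2, not the patented implementation: the ResNet-50 feature width, the temporal averaging before the classifier, and the placeholder for module 103 (whose sketch appears after the Fig. 9 discussion below) are all assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoActionModel(nn.Module):
    """Sketch of the Fig. 2 pipeline: backbone (101) -> pooling (102)
    -> temporal enhancement (103) -> fully connected classifier (104)."""
    def __init__(self, num_classes: int, feat_dim: int = 2048):
        super().__init__()
        resnet = models.resnet50(weights=None)        # backbone layer 101
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)           # pooling layer 102
        self.temporal = nn.Identity()                 # stand-in for module 103,
                                                      # sketched further below
        self.fc = nn.Linear(feat_dim, num_classes)    # fully connected layer 104

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -- one video image sequence per sample
        b, t, c, h, w = frames.shape
        x = self.backbone(frames.reshape(b * t, c, h, w))
        x = self.pool(x).reshape(b, t, -1)   # original feature, shape (B, T, C)
        x = self.temporal(x)                 # target feature, shape (B, T, C)
        return self.fc(x.mean(dim=1))        # assumed temporal averaging before scores
```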
Fig. 3 is a flowchart of a video action recognition method according to an exemplary embodiment. The method is applied to the video action recognition network model shown in Fig. 1 and Fig. 2. As shown in Fig. 3, the method includes at least the following steps:
Step 110: obtain the original feature extracted, in the spatial dimension and on the preset channels, from a video image sequence in the video stream, the original feature being a joint representation of the features of the video image sequence in the temporal, spatial, and channel dimensions.
As mentioned above, a video image sequence in the video stream comprises several video images extracted in chronological order from the video stream on which video action recognition is performed. Through action recognition on the video image sequences of the video stream, the actions in the video stream are recognized; once the video stream has been obtained, there is no longer any need to watch it to learn the video content it records, such as events and behaviors, which is particularly important in scenarios such as security monitoring.
Video action recognition based on this exemplary embodiment can also be applied to video streams collected in batches, so that the actions contained in the batch of video streams are recognized quickly and the corresponding action recognition results are obtained quickly.
In any scenario where video action recognition is required for a video stream, an original feature is extracted, in the spatial dimension and on the preset channels, from the video image sequences in that stream, so that video action recognition on the sequences can be performed on the basis of the original feature.
Image feature information is extracted from each video image, and these pieces of image feature information jointly constitute the video feature information of the video image sequence, yielding the video feature representation of the video image sequence in the video stream. This video feature representation is therefore called the original feature of the video image sequence.
The original feature of a video image sequence is obtained by performing feature extraction on the sequence in the spatial dimension and on the preset channels. The space corresponding to a video image sequence can be understood as the image size information of each video image in the sequence. Illustratively, the image size information of a video image is expressed as its resolution; assuming the resolution of a video image is 2048 × 1024, the image is 2048 pixels high and 1024 pixels wide.
The preset channels are determined by the neural network layer that performs the extraction of the original feature of the video image sequence: the number of neurons used in that layer to perform the extraction is the number of channels on which the original feature is extracted.
The extraction of the original feature from the video image sequence by the neural network layer includes performing neuron computations on the image information represented by each video image in the sequence to obtain the image feature of each video image, and then stacking the image features of the individual video images. The image information represented by each video image includes its image size information, the color channel information of each image pixel, and so on.
Stacking the image features of the individual video images means superimposing them to obtain an image feature carrying more information. This stacked feature contains the image feature of every video image and is taken as the original feature of the video image sequence.
Within the original feature, the image features are superimposed in the order of the time series in which the corresponding video images were extracted, thereby forming the temporal dimension of the original feature. During this superposition, the image features do not change in the spatial or channel dimensions, so it follows that the original feature is a joint representation of the video image sequence's features in the temporal, spatial, and channel dimensions.
Illustratively, the original feature of a video image sequence can be represented as a fourth-order tensor (H, W, C, T), where 'H' and 'W' denote the spatial dimension of the original feature, 'C' denotes its channel dimension, and 'T' denotes its temporal dimension.
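As a concrete illustration of this tetrad, an original feature is simply a 4-D tensor; the sizes below are chosen to match the worked example used later with Fig. 7 and are purely illustrative.

```python
import torch

H, W, C, T = 1, 1, 6, 8                 # spatial, channel, temporal sizes (illustrative)
original_feature = torch.randn(H, W, C, T)
print(original_feature.shape)           # torch.Size([1, 1, 6, 8])
```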
It should also be noted that this embodiment performs video action recognition on the basis of the original feature of a video image sequence, so the video action recognition method described in this embodiment can be carried out as soon as the original feature of a video image sequence has been obtained.
In other words, once any neural network layer has performed original feature extraction and output an original feature, the recognition of the action contained in the video image sequence can be initiated by executing step 110. The video action recognition realized by this embodiment is therefore compatible with any neural network framework containing any neural network layer, which gives it outstanding versatility.
Step 130: perform separation processing in the channel dimension and the temporal dimension on the original feature to obtain the separated feature of the video image sequence.
The execution of step 110 yields the joint representation of the video image sequence's features in the temporal, spatial, and channel dimensions, i.e., the original feature. However, since this original feature is merely a simple stack of the image features of the individual video images, it cannot express the associations between those image features, and thus cannot reflect the video feature information of the sequence at a holistic level, which affects the accuracy of video action recognition. It is therefore necessary to perform separation processing on the original feature in the channel and temporal dimensions to strengthen the association between the video feature information and the individual image features.
Performing separation processing in the channel dimension and the temporal dimension on the original feature of a video image sequence means first performing feature extraction in the temporal dimension on the original feature to obtain a feature representation in which the temporal dimension has been separated out of the original feature, and then performing feature extraction in the channel dimension on that representation to obtain the separated feature of the video image sequence.
It follows that the separated feature of a video image sequence is obtained by successively performing feature extraction on the original feature in the temporal and channel dimensions. Relative to the original feature, the video feature information of the sequence expressed by the separated feature is separated in the channel and temporal dimensions.
By successively performing feature extraction on the original feature in the temporal and channel dimensions, this embodiment establishes associations between the image features of the individual video images in the temporal and spatial dimensions respectively, enabling the separated feature to express the video feature information of the sequence as a whole and enhancing the expression of that information in the temporal dimension, which helps improve the accuracy of video action recognition.
Step 150: merge the separated feature and the original feature to obtain the target feature of the video image sequence.
As mentioned above, the separated feature establishes the associations between the image features of the individual video images and can therefore express the video feature information of the sequence as a whole, while the original feature better preserves the image feature of each individual video image. Performing video action recognition according to the separated feature and the original feature jointly therefore guarantees recognition accuracy to the greatest extent, which is why the merging of the separated feature and the original feature is performed.
Merging the separated feature and the original feature of the video image sequence means adding the separated feature to the original feature; the resulting feature is the target feature of the sequence. The separated feature may be added to the original feature directly, or a weighted sum of the two may be computed according to preset weights, which is not limited here.
Since the target feature results from adding the separated feature to the original feature, it fuses the video feature information expressed by each of them. In other words, the target feature expresses not only the image feature information of each individual video image but also the associated feature information between the video images, so it can express the video feature information of the sequence to the greatest extent as a whole and thereby enables accurate recognition of video actions.
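A minimal sketch of this merging step, assuming the weighted variant with a default weight of 1 (i.e., direct addition); the function name is invented for illustration.

```python
import torch

def merge_features(separated: torch.Tensor, original: torch.Tensor,
                   weight: float = 1.0) -> torch.Tensor:
    """Target feature: (optionally weighted) sum of the separated
    feature and the original feature, as described in step 150."""
    return weight * separated + original
```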
Step 170: recognize, according to the target feature, the action contained in the video image sequence to obtain the action recognition result.
Video action recognition according to the target feature is the process of mapping the target feature to a sample label space and obtaining the probability with which the target feature points to each sample type in that space, thereby yielding the corresponding action recognition result. In this embodiment, the sample label space consists of the preset action types.
In the video action recognition network model shown in Fig. 2, recognition based on the target feature is performed by the fully connected layer 104, which plays the role of the 'classifier' in the model and outputs the action recognition result corresponding to the target feature. The content described in steps 110 to 150 is executed by the temporal enhancement module 103, which applies temporal enhancement processing to the original feature so that the video feature information of the sequence gains an enhanced representation in the temporal dimension, producing a target feature that expresses the video feature information of the sequence as a whole.
In addition, in this embodiment, the content described in steps 110 to 150 can be implemented as a standalone program that performs temporal enhancement processing on an original feature. This program can be applied in any neural network framework to apply temporal enhancement to the input original feature and output a target feature that expresses the video feature information of the video image sequence as a whole.
For a video stream, the actions it contains are embodied by consecutive video images and are therefore highly sensitive to the video feature information expressed in the temporal dimension. In video action recognition performed according to the target feature, the enhanced representation of the video feature information in the temporal dimension allows the actions contained in the video stream to be recognized accurately.
Thus, with the method provided by this embodiment, the actions contained in real-time video can be recognized accurately.
Fig. 4 is a flowchart of a video action recognition method according to another exemplary embodiment. As shown in Fig. 4, before step 110 is executed, the video action recognition method further includes at least the following steps:
Step 210: extract video images from the video stream on which video action recognition is performed, to form a video image sequence.
As mentioned above, extracting video images from the video stream on which video action recognition is performed means extracting several video images from the stream in chronological order to form a video image sequence. This extraction yields at least one video image sequence.
In one exemplary embodiment, a specified number of video images are extracted from the video stream in turn at a specified time interval, and those images constitute one video image sequence, as sketched below. Illustratively, assuming each sequence contains 8 video images, the extraction takes 8 video images from the video stream at the specified time interval to form one sequence, then continues with the extraction of the next 8-image sequence, and so on until the video stream ends.
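A minimal sketch of this extraction scheme, assuming the frames arrive as an iterable of decoded images; the sequence length of 8 follows the example above, while the stride value is illustrative.

```python
from typing import Iterable, Iterator, List

def extract_sequences(frames: Iterable, seq_len: int = 8,
                      stride: int = 4) -> Iterator[List]:
    """Yield video image sequences: one frame is kept at each specified
    time interval, and every seq_len kept frames form one sequence."""
    kept: List = []
    for i, frame in enumerate(frames):
        if i % stride == 0:            # specified time interval
            kept.append(frame)
        if len(kept) == seq_len:       # one full video image sequence
            yield kept
            kept = []
```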
In another exemplary embodiment, the extracted video images are the key frames of the video stream, which is not limited here.
Step 230: perform feature extraction in the spatial dimension and the channel dimension on the video image sequence to obtain the original feature, the original feature being output in the channel dimension as a feature representation corresponding to the spatial and temporal dimensions.
Performing feature extraction on the video image sequence in the spatial and channel dimensions means using a convolutional neural network to perform convolution on the image data of each video image in the sequence, and stacking the resulting features in chronological order to obtain the original feature of the sequence.
As described above, the channel dimension of the original feature is assigned by the convolutional neural network that performs the convolution, the temporal dimension corresponds to the number of video images in the sequence, and the spatial dimension corresponds to the image size information of the video images. It can therefore be understood that the original feature of a video image sequence is output in the channel dimension as a feature representation corresponding to the spatial and temporal dimensions.
As shown in Fig. 5, in an exemplary embodiment, step 230 includes at least the following steps:
Step 231: obtain the convolution feature of each video image by performing convolution in the spatial and channel dimensions on the video images in the video image sequence.
As mentioned above, convolution in the spatial and channel dimensions on the video images in the sequence is executed by the backbone network layer; that is, the convolutional neural network provided by the backbone network layer performs the convolution of the video images in the spatial and channel dimensions, yielding the convolution feature of each video image in the sequence.
It should be understood that the convolution feature of each video image is the feature representation output by the backbone network layer, and the size and number of these representations are determined by the properties of the convolutional neural network provided by the backbone network layer. For example, the size of the representations is determined by the size of the convolution kernels performing the convolution, and their number is determined by the number of convolution kernels, which in turn determines the number of output channels set in the backbone network layer.
Step 233: perform feature compression on the convolution features to obtain the compressed convolution features of the video images.
Feature compression of the convolution features of the video images is executed by the pooling network layer. The feature compression performed by the pooling layer is essentially a convolution over the convolution features; it extracts the important features from them and reduces the number of parameters in the action recognition subsequently performed by the video action recognition network model.
Illustratively, the convolution features of the video images can be compressed using average pooling, which better preserves the background information of the video images and benefits the subsequent recognition of the action contained in the sequence.
Step 235: stack the compressed convolution features of the video image sequence in the temporal dimension to generate the original feature.
Since the video images in the sequence are arranged according to the chronological order of their extraction, the temporal dimension can be taken to refer to the ordering of the video images in the sequence.
Stacking the compressed convolution features obtained in step 233 according to the ordering of the video images in the sequence produces the original feature of the sequence. The temporal dimension of the original feature can also be understood as the stacking order of the compressed convolution features.
It should be noted that the pooling network layer acts only on the spatial dimension of the convolution features and does not change their channel dimension, so in the original feature of the video image sequence, the channel dimension is still determined by the backbone network layer.
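Steps 231-235 can be condensed into a short sketch; the average-pooling choice follows the paragraph on step 233, while the backbone output shape and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def build_original_feature(conv_feats: torch.Tensor) -> torch.Tensor:
    """conv_feats: (T, C, h, w), the convolution features of T video images
    (step 231). Average pooling compresses each spatial map to 1x1 while
    leaving the channel dimension untouched (step 233); flattening then
    leaves the per-image features stacked along the time axis (step 235)."""
    pooled = F.adaptive_avg_pool2d(conv_feats, 1)   # (T, C, 1, 1)
    return pooled.flatten(1)                        # original feature, (T, C)
```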
With the method provided by this embodiment, the original feature that jointly represents a video image sequence in the temporal, spatial, and channel dimensions can be obtained from the video stream.
Fig. 6 is a flowchart of one embodiment of step 130 in the embodiment corresponding to Fig. 3. As shown in Fig. 6, step 130 includes at least the following steps:
Step 131: perform feature extraction in the temporal dimension on the original feature to obtain the intermediate feature, the intermediate feature being a feature representation in which the temporal dimension has been separated out of the original feature.
Performing feature extraction in the temporal dimension on the original feature means performing convolution on the original feature, in its temporal dimension, through the first pointwise convolutional layer, and stacking the feature representations produced by the convolution to obtain the intermediate feature of the video image sequence.
The convolutional neural network used by the first pointwise convolutional layer is a pointwise convolutional neural network, which contains pointwise convolution kernels. Moving along the temporal dimension of the original feature, each pointwise kernel operates on all channels of the original feature simultaneously, thereby performing the convolution of the original feature.
In the first pointwise convolutional layer, the size of each pointwise kernel is 1 × 1 × M, where 'M' is the number of channels of the output original feature, i.e., the number of output channels set in the backbone network layer. The number of pointwise kernels can be preset; it may be set equal to or different from the number of channels of the output original feature, which is not limited here.
It should be recognized, however, that setting the number of pointwise kernels to be smaller than the number of channels of the output original feature achieves a dimensionality reduction of the original feature in the channel dimension, which reduces the parameter count of the entire video action recognition network model.
For ease of understanding, Fig. 7 is a schematic diagram of the first pointwise convolutional layer performing convolution on the original feature, according to an exemplary embodiment. As shown in Fig. 7, assume the original feature of the video image sequence is represented as 1 × 1 × 6 × 8, where '1 × 1' is its spatial dimension, '6' its channel dimension, and '8' its temporal dimension, and that the first pointwise convolutional layer contains 3 pointwise kernels of size 1 × 1 × 6.
In the first pointwise convolutional layer shown in Fig. 7, each pointwise kernel performs its own convolution on the original feature, and the convolution process is identical for each kernel. The convolution performed by one pointwise kernel is described in detail below:
Moving along the temporal dimension of the original feature, the pointwise kernel takes one time point at a time and performs the convolution on the original feature there, thereby obtaining each feature element of the intermediate feature on one channel. Since the pointwise kernel matches the original feature in the channel dimension, at each time point it convolves all feature elements of the original feature at that time point; that is, the pointwise kernel operates on all channels of the original feature at the corresponding time point simultaneously and obtains the feature representation for that time point.
Thus, each pointwise kernel in the first pointwise convolutional layer performs the above convolution on the original feature and obtains a corresponding feature representation. Stacking these representations in turn yields the intermediate feature of the video image sequence.
As can be seen from Fig. 7, the intermediate feature obtained by the first pointwise convolutional layer's convolution of the original feature is represented as 1 × 1 × 3 × 8; its channel dimension matches the number of pointwise kernels in the first pointwise convolutional layer. Compared with the original feature, the intermediate feature has changed in the channel dimension but keeps the temporal dimension unchanged, and its feature elements are separated from each other in the temporal dimension. The intermediate feature can therefore be understood as a feature representation in which the temporal dimension has been separated out of the original feature.
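The Fig. 7 example (an original feature of 1 × 1 × 6 × 8 convolved with three 1 × 1 × 6 pointwise kernels to give a 1 × 1 × 3 × 8 intermediate feature) maps naturally onto a kernel-size-1 1-D convolution. Treating the channel axis as the convolved axis and time as the length axis is an assumed PyTorch rendering, not the patent's own formulation.

```python
import torch
import torch.nn as nn

original = torch.randn(1, 6, 8)      # (batch, channels=6, time=8)
pw1 = nn.Conv1d(in_channels=6, out_channels=3, kernel_size=1)  # three 1x1x6 kernels
intermediate = pw1(original)         # each output element mixes all 6 channels
print(intermediate.shape)            # torch.Size([1, 3, 8])
```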
Step 133: in the channel dimension of the intermediate feature, perform feature extraction on the intermediate feature to obtain the separated feature of the video image sequence, the separated feature being a feature representation in which the channel dimension has been separated out of the intermediate feature.
Performing feature extraction on the intermediate feature in the channel dimension means performing convolution on the intermediate feature of the video image sequence through the depthwise convolutional layer and then stacking the feature representations produced by the convolution.
The convolutional neural network used by the depthwise convolutional layer is a depthwise convolutional neural network, a special kind of grouped convolutional neural network in which the number of groups of depthwise kernels equals the number of channels of the intermediate feature, so that each depthwise kernel performs its own depthwise convolution of the intermediate feature in the channel dimension.
Fig. 8 is a schematic diagram of the depthwise convolutional layer performing convolution on the intermediate feature, according to an exemplary embodiment. As shown in Fig. 8, again assume the intermediate feature output by the first pointwise convolutional layer is represented as 1 × 1 × 3 × 8, i.e., the intermediate feature has 3 channels, and the depthwise convolutional layer contains 3 depthwise kernels of size 1 × 1 × 3.
The depthwise convolutional layer assigns a channel to each depthwise kernel. For example, as shown in Fig. 8, the first depthwise kernel corresponds to the top channel of the intermediate feature, the second to its middle channel, and the third to its bottom channel, so that each depthwise kernel performs the convolution of the intermediate feature on its assigned channel and obtains a corresponding feature representation.
Stacking the feature representations produced by the depthwise kernels' convolutions yields the separated feature of the video image sequence, which is represented as 1 × 1 × 3 × 8.
It can be seen that although each depthwise kernel in the depthwise convolutional network performs its own convolution of the intermediate feature in the channel dimension, it does not change the channel or temporal dimensions of the intermediate feature; it only performs separation processing on the intermediate feature in its channel dimension. The separated feature of the video image sequence can therefore be understood as a feature representation in which the channel dimension has been separated out of the intermediate feature.
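Continuing the Fig. 8 example, depthwise convolution is grouped convolution with one group per channel. A sketch, assuming the 1 × 1 × 3 kernels slide along the time axis with padding so that the time length of 8 is preserved.

```python
import torch
import torch.nn as nn

intermediate = torch.randn(1, 3, 8)   # (batch, channels=3, time=8), as in Fig. 8
dw = nn.Conv1d(in_channels=3, out_channels=3, kernel_size=3,
               padding=1, groups=3)   # one depthwise kernel per channel
separated = dw(intermediate)          # channels convolved independently over time
print(separated.shape)                # torch.Size([1, 3, 8])
```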
Since the separated feature results from feature extraction in the channel dimension applied to the intermediate feature, and the intermediate feature results from feature extraction in the temporal dimension applied to the original feature, the separated feature, compared with the original feature of the video image sequence, enhances the expression of the sequence's video feature information in the temporal dimension.
By way of comparison, depthwise separable convolution decomposes an ordinary convolution operation into the two processes of depthwise and pointwise convolution, realizing the separation of the channel and spatial dimensions of a feature representation with fewer parameters and higher efficiency than ordinary convolution. This embodiment, in turn, applies the pointwise and depthwise processes in sequence to realize the separation processing of the original feature of the video image sequence in the channel and temporal dimensions and so obtain the separated feature, so that in the target feature merged from the original and separated features, the video feature information of the sequence gains an enhanced expression in the temporal dimension, enabling accurate recognition of the action contained in the sequence.
Moreover, compared with an ordinary convolution operation, obtaining the separated feature of the video image sequence by applying the pointwise and depthwise processes in sequence also requires fewer parameters and is more efficient.
In another exemplary embodiment, when the number of pointwise kernels in the first pointwise convolutional layer differs from the number of channels of the output original feature, the channel dimension of the separated feature will not match that of the original feature, and the merging of the two cannot be carried out. Therefore, after the separated feature of the video image sequence is obtained, convolution must also be performed on it through the second pointwise convolutional layer to restore its channel dimension to that of the original feature.
The second pointwise convolutional layer is roughly the same in structure and function as the first: it also uses pointwise kernels to perform convolution on the separated feature along the temporal dimension. The difference is that, because the channel dimension of the separated feature differs from that of the original feature, the size and number of the pointwise kernels in the second pointwise convolutional layer differ from those of the first.
Note that in the second pointwise convolutional layer, the number of pointwise kernels must equal the number of channels of the output original feature, and in the kernel size 1 × 1 × M, 'M' equals the number of channels of the separated feature.
As shown in Fig. 9, the second pointwise convolutional layer contains 6 pointwise kernels of size 1 × 1 × 3. Each pointwise kernel performs its own convolution of the separated feature along the temporal dimension, and the feature representations produced by the kernels are stacked to obtain the separated feature'. The separated feature' thus obtained is represented as 1 × 1 × 6 × 8; its channel dimension matches the original feature of the video image sequence, so the merging of the separated feature' and the original feature can be executed.
Through the second pointwise convolutional layer, this embodiment applies dimensionality-raising processing in the channel dimension to the separated feature of the video image sequence, obtaining a separated feature' whose channel dimension matches the original feature. Merging the separated feature' with the original feature yields the target feature of the video image sequence.
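Chaining the two pointwise layers around the depthwise layer, plus the merge of step 150, gives a complete sketch of the temporal enhancement path. This is a hypothetical PyTorch rendering, with the bottleneck width (the number of kernels in the first pointwise layer) as an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class TemporalEnhancement(nn.Module):
    """Sketch: first pointwise layer (channel reduction) -> depthwise layer
    (per-channel temporal convolution) -> second pointwise layer (channel
    restoration) -> merge with the original feature."""
    def __init__(self, channels: int, bottleneck: int = 256):
        super().__init__()
        self.pw1 = nn.Conv1d(channels, bottleneck, kernel_size=1)
        self.dw = nn.Conv1d(bottleneck, bottleneck, kernel_size=3,
                            padding=1, groups=bottleneck)
        self.pw2 = nn.Conv1d(bottleneck, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) original feature, time and channel already permuted
        separated = self.pw2(self.dw(self.pw1(x)))   # separated feature'
        return separated + x                          # target feature (step 150)
```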
Fig. 10 is a flowchart of a video action recognition method according to yet another exemplary embodiment. As shown in Fig. 10, after step 110 the method further includes step 310: permute the original feature in the temporal and channel dimensions; and after step 150 it further includes step 330: permute the target feature in the temporal and channel dimensions.
Permuting the original feature in the temporal and channel dimensions means rearranging the rows, columns, and dimensions of the original feature so that its temporal and channel dimensions are exchanged, in order to meet the input format requirements of the first pointwise convolutional layer.
Similarly, permuting the target feature in the temporal and channel dimensions means rearranging the rows, columns, and dimensions of the target feature so that its temporal and channel dimensions are exchanged, restoring the feature representation of the video image sequence to the same form as the original feature and thereby meeting the input format requirements of the next network layer.
Fig. 11 is a schematic structural diagram of a temporal enhancement module according to an exemplary embodiment. The original feature output by the pooling network layer is represented as (B, T, C), where 'B' denotes the spatial dimension of the original feature. The temporal enhancement module permutes the temporal and channel dimensions of the original feature by means of a reshape function, after which the original feature is represented as (B, C, T). The permuted original feature is then fed into the first pointwise convolutional layer, satisfying that layer's input format requirements.
Before the target feature is output to the fully connected layer, the temporal enhancement module uses another reshape function to permute the temporal and channel dimensions of the target feature, so that the form of the target feature fed into the fully connected layer meets that layer's requirements.
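In tensor terms, the two reshape steps of Fig. 11 amount to swapping the time and channel axes before the module and swapping them back afterwards; a minimal illustration with assumed sizes.

```python
import torch

x = torch.randn(4, 8, 2048)      # (B, T, C), as emitted by the pooling layer
y = x.permute(0, 2, 1)           # (B, C, T): form expected by the pointwise layer
z = y.permute(0, 2, 1)           # (B, T, C): form expected by the FC layer
assert torch.equal(x, z)         # the two permutations are inverses
```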
It should also be noted that in the temporal enhancement module shown in Fig. 11, the skip connection is used to convey the original feature of the video image sequence, which is merged with the output of the second pointwise convolutional layer to produce the target feature.
In another exemplary embodiment, the video action recognition method further includes the following steps:
The original feature and the target feature are each normalized, so that the separation processing in the channel and temporal dimensions is performed on the normalized original feature, and the recognition of the action contained in the video image sequence is performed on the normalized target feature.
Normalizing the original feature and the target feature means preprocessing them so that their means and variances tend toward stable values.
Still referring to Fig. 11, the normalization of the original feature and the target feature is performed by the two BN (Batch Normalization) network layers in the temporal enhancement module. After the original feature is normalized by a BN layer, it is output to the first pointwise convolutional layer for the separation in the channel and temporal dimensions; after the target feature is normalized by a BN layer, it is output to the fully connected layer for the corresponding action recognition.
Taking the original feature as an example, the process by which the BN layer normalizes it is described below:
First, the mean and variance of the feature elements in the original feature are computed, and each feature element is normalized according to them. Normalizing each feature element essentially means evaluating a normalization function on it to obtain a corresponding result. The results then undergo a linear transformation according to a certain rule, producing the final output for each feature element of the original feature.
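The computation just described can be written out directly; a minimal sketch of the BN layer's per-element normalization followed by the learned linear transform (gamma, beta), with the normalization axis as an assumption.

```python
import torch

def batch_normalize(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                    eps: float = 1e-5) -> torch.Tensor:
    """Compute mean and variance of the feature elements, normalize each
    element, then apply the linear transform that yields the final output."""
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    normalized = (x - mean) / torch.sqrt(var + eps)
    return gamma * normalized + beta
```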
By normalizing the original feature and the target feature of the video image sequence, this embodiment enables the video action recognition network model to use a higher learning rate, giving it fast convergence and also, to a certain extent, improving its generalization ability.
Fig. 12 is a block diagram of a video action recognition apparatus according to an exemplary embodiment. As shown in Fig. 12, the apparatus includes an original feature obtaining module 410, a separation processing module 430, a feature merging module 450, and an action recognition module 470.
The original feature obtaining module 410 is configured to obtain the original feature extracted, in the spatial dimension and on the preset channels, from a video image sequence in the video stream, the original feature being a joint representation of the features of the video image sequence in the temporal, spatial, and channel dimensions.
The separation processing module 430 is configured to perform separation processing in the channel and temporal dimensions on the original feature to obtain the separated feature of the video image sequence.
The feature merging module 450 is configured to merge the separated feature and the original feature to obtain the target feature of the video image sequence.
The action recognition module 470 is configured to recognize, according to the target feature, the action contained in the video image sequence to obtain the action recognition result.
In another exemplary embodiment, the apparatus further includes a video image extraction module and a feature extraction module.
The video image extraction module is configured to extract video images from the video stream on which video action recognition is performed, to form a video image sequence.
The feature extraction module is configured to perform feature extraction in the spatial and channel dimensions on the video image sequence to obtain the original feature, the original feature being output in the channel dimension as a feature representation corresponding to the spatial and temporal dimensions.
In the embodiment of another exemplary, feature extraction module includes convolution feature acquiring unit, convolution Feature Compression Unit and convolution compressive features stackable unit.
Convolution feature acquiring unit is used for by carrying out Spatial Dimension and channel to the video image in sequence of video images Convolutional calculation in dimension obtains the convolution feature of the video image.
Convolution Feature Compression unit is used to carry out Feature Compression to convolution feature, and the convolution compression for obtaining video image is special Sign.
Convolution compressive features stackable unit is used on time dimension carry out the convolution compression for sequence of video images special The stacking of sign generates primitive character.
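A minimal sketch of this extraction pipeline, assuming PyTorch; the 3×3 convolution and the average-pooling compression are illustrative choices, since the embodiment does not fix the kernel size or the compression operator.

```python
import torch
import torch.nn as nn

class OriginalFeatureExtractor(nn.Module):
    """Per-frame spatial/channel convolution, compression, temporal stacking."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(3, out_channels, kernel_size=3, padding=1)
        self.compress = nn.AdaptiveAvgPool2d(1)   # one plausible feature compression

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) — the video image sequence
        x = self.conv(frames)                     # convolution over space and channels
        x = self.compress(x).flatten(1)           # compress each frame to a C-vector
        return x                                  # (T, C): frame features stacked over time

extractor = OriginalFeatureExtractor()
original = extractor(torch.randn(8, 3, 56, 56))   # 8 frames -> (8, 64)
```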
In another exemplary embodiment, the separation processing module 430 includes a first feature extraction unit and a second feature extraction unit.
The first feature extraction unit is configured to perform feature extraction on the original feature over the time dimension, obtaining an intermediate feature; the intermediate feature is the feature representation obtained by separating the time dimension out of the original feature.
The second feature extraction unit is configured to perform, on the channel dimension of the intermediate feature, feature extraction on the intermediate feature to obtain the separated feature of the video image sequence; the separated feature is the feature representation obtained by separating the channel dimension out of the intermediate feature.
In another exemplary embodiment, the first feature extraction unit includes a pointwise convolution subunit and an intermediate feature acquisition subunit.
The pointwise convolution subunit is configured to perform convolution computation on the original feature over the time dimension through the first pointwise convolutional network layer, which contains the pointwise convolution kernels that execute the convolution computation.
The intermediate feature acquisition subunit is configured to obtain the intermediate feature as the feature representation formed by stacking the outputs of the pointwise convolution kernels.
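A minimal sketch, assuming PyTorch and illustrative sizes: with the feature permuted so that time occupies the channel axis of `nn.Conv1d`, each pointwise (kernel-size-1) kernel mixes information across time steps, and stacking the kernels' outputs yields the intermediate feature.

```python
import torch
import torch.nn as nn

T, C = 8, 64                                 # illustrative sequence length / channels
first_pointwise = nn.Conv1d(in_channels=T, out_channels=T, kernel_size=1)

original = torch.randn(1, T, C)              # (batch, time, channels) after permutation
intermediate = first_pointwise(original)     # stacked kernel outputs: (1, T, C)
```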
In another exemplary embodiment, the separation processing module 430 further includes a third feature extraction unit, configured to perform, when the number of pointwise convolution kernels in the first pointwise convolutional network layer is inconsistent with the number of channels of the output original feature, convolution computation on the separated feature of the video image sequence through a second pointwise convolutional network layer, in which the number of pointwise convolution kernels is identical to the number of channels of the output original feature.
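A minimal sketch of the shape-matching role of this second pointwise layer, under the same permuted layout as above and with illustrative sizes only.

```python
import torch
import torch.nn as nn

# If the first layer used T_out kernels while fusion with the original feature
# expects T positions on that axis, a second pointwise layer with exactly T
# kernels restores a fusable shape.
T_out, T, C = 6, 8, 64
second_pointwise = nn.Conv1d(in_channels=T_out, out_channels=T, kernel_size=1)

separated = torch.randn(1, T_out, C)
fusable = second_pointwise(separated)        # (1, T, C), matching the original feature
```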
In another exemplary embodiment, the second feature extraction unit includes a depthwise convolution subunit and a separated feature acquisition subunit.
The depthwise convolution subunit is configured to perform convolution computation on the intermediate feature through a depthwise convolutional network layer, in which each depthwise convolution kernel separately executes the depthwise convolution of the intermediate feature on the channel dimension.
The separated feature acquisition subunit is configured to obtain the separated feature as the feature representation formed by stacking the outputs of the depthwise convolution kernels.
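A minimal sketch, assuming PyTorch; setting `groups` equal to the channel count makes each kernel convolve exactly one channel independently, which is the depthwise behavior described above. The sizes are illustrative.

```python
import torch
import torch.nn as nn

C, T = 64, 8
depthwise = nn.Conv1d(in_channels=C, out_channels=C, kernel_size=3,
                      padding=1, groups=C)   # one kernel per channel

intermediate = torch.randn(1, C, T)          # (batch, channels, time)
separated = depthwise(intermediate)          # per-channel outputs stacked: (1, C, T)
```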
In another exemplary embodiment, the apparatus further includes a first feature permutation module and a second feature permutation module.
The first feature permutation module, arranged after the original feature acquisition module 410, is configured to permute the original feature on the time dimension and the channel dimension.
The second feature permutation module, arranged before the action recognition module 470, is configured to permute the target feature on the time dimension and the channel dimension.
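A minimal sketch of the two permutations, assuming a (batch, channels, time) layout, which the embodiment does not fix.

```python
import torch

x = torch.randn(1, 64, 8)                        # original feature: (batch, C, T)
x_for_separation = x.permute(0, 2, 1)            # (batch, T, C): time leads for the pointwise layer
x_for_recognition = x_for_separation.permute(0, 2, 1)  # permuted back before action recognition
```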
In another exemplary embodiment, the apparatus further includes a first normalization module and a second normalization module.
The first normalization module, arranged after the original feature acquisition module 410, is configured to normalize the original feature, so that the separation processing of the channel dimension and the time dimension is executed on the original feature after normalization.
The second normalization module, arranged before the action recognition module 470, is configured to normalize the target feature, so that the identification of the action is executed on the target feature after normalization.
It should be noted that the apparatus provided by the above embodiments and the method provided by the above embodiments belong to the same inventive concept; the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.
In an exemplary embodiment, an electronic device comprises:
a processor; and
a memory on which computer-readable instructions are stored, the computer-readable instructions, when executed by the processor, implementing the video action recognition method of the embodiments described above.
Figure 13 is a hardware block diagram of an electronic device according to an exemplary embodiment.
It should be noted that this electronic device is merely one example suited to the present application and must not be taken as imposing any restriction on the scope of use of the present application. Nor is the electronic device to be construed as needing to rely on, or necessarily including, one or more components of the exemplary electronic device shown in Figure 13.
The hardware configuration of the electronic device may vary considerably depending on configuration or performance. As shown in Figure 13, the electronic device includes a power supply 510, an interface 530, at least one memory 550, and at least one central processing unit (CPU) 570.
The power supply 510 is used to provide an operating voltage for each hardware device of the electronic device.
The interface 530 includes at least one wired or wireless network interface 531, at least one serial-to-parallel conversion interface 533, at least one input/output interface 535, at least one USB interface 537, and the like, and is used for communicating with external devices.
The memory 550, as a carrier for storing resources, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like; the resources stored on it include an operating system 551, application programs 553, data 555, and the like, and the storage mode may be transient or permanent. The operating system 551 manages and controls the hardware devices of the electronic device and the application programs 553, so that the central processing unit 570 can compute and process the massive data 555; it may be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, FreeRTOS, or the like. An application program 553 is a computer program that completes at least one specific task on top of the operating system 551, and may include at least one module (not shown in Figure 13), each of which may contain a series of computer-readable instructions for the electronic device. The data 555 may be videos, pictures, and the like stored on the magnetic disk.
The central processing unit 570 may include one or more processors, and is arranged to communicate with the memory 550 through a bus in order to compute and process the massive data 555 in the memory 550.
As described in detail above, an electronic device to which the present application applies completes the video action recognition method by having the central processing unit 570 read the series of computer-readable instructions stored in the memory 550.
In addition, the present application can equally be realized by a hardware circuit or by a hardware circuit combined with software instructions; its realization is therefore not limited to any specific hardware circuit, software, or combination of the two.
In a further exemplary embodiment, a computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the video action recognition method of the embodiments described above.
The above is only the preferred exemplary embodiments of the present application and is not intended to limit its embodiments. A person of ordinary skill in the art can easily make corresponding adaptations or modifications according to the main idea and spirit of the present application, so the protection scope of the present application shall be subject to the protection scope defined by the claims.

Claims (12)

1. A video action recognition method, comprising:
obtaining an original feature extracted over space and a set of channels from a video image sequence in a video stream, the original feature being a joint feature representation of the video image sequence over the time, space, and channel dimensions;
performing separation processing of the channel dimension and the time dimension on the original feature, obtaining a separated feature of the video image sequence;
fusing the separated feature with the original feature, obtaining a target feature of the video image sequence;
identifying, according to the target feature, an action contained in the video image sequence, obtaining an action recognition result.
2. The method according to claim 1, wherein before the obtaining of the original feature extracted over space and the set of channels from the video image sequence in the video stream, the method further comprises:
extracting video images from the video stream on which video action recognition is performed, forming the video image sequence;
performing feature extraction on the video image sequence over the spatial dimension and the channel dimension, obtaining the original feature, the original feature being a feature representation output on the channel dimension that corresponds to the spatial dimension and the time dimension.
3. The method according to claim 2, wherein the performing of the feature extraction on the video image sequence over the spatial dimension and the channel dimension to obtain the original feature comprises:
obtaining a convolution feature of each video image by performing convolution computation over the spatial dimension and the channel dimension on the video images in the video image sequence;
performing feature compression on the convolution feature, obtaining a convolution compressed feature of the video image;
stacking, on the time dimension, the convolution compressed features of the video image sequence, generating the original feature.
4. The method according to claim 1, wherein the performing of the separation processing of the channel dimension and the time dimension on the original feature to obtain the separated feature of the video image sequence comprises:
performing feature extraction on the original feature over the time dimension to obtain an intermediate feature, the intermediate feature being the feature representation obtained by separating the time dimension out of the original feature;
performing, on the channel dimension of the intermediate feature, feature extraction on the intermediate feature to obtain the separated feature of the video image sequence, the separated feature being the feature representation obtained by separating the channel dimension out of the intermediate feature.
5. The method according to claim 4, wherein the performing of the feature extraction on the original feature over the time dimension to obtain the intermediate feature comprises:
performing, over the time dimension, convolution computation on the original feature through a first pointwise convolutional network layer, the first pointwise convolutional network layer containing pointwise convolution kernels that execute the convolution computation;
obtaining the intermediate feature as the feature representation formed by stacking the outputs of the pointwise convolution kernels.
6. The method according to claim 5, wherein the method further comprises:
if the number of the pointwise convolution kernels in the first pointwise convolutional network layer is inconsistent with the number of channels of the output original feature, performing convolution computation on the separated feature of the video image sequence through a second pointwise convolutional network layer, the number of pointwise convolution kernels in the second pointwise convolutional network layer being identical to the number of channels of the output original feature.
7. The method according to claim 4, wherein the performing, on the channel dimension of the intermediate feature, of the feature extraction on the intermediate feature to obtain the separated feature of the video image sequence comprises:
performing convolution computation on the intermediate feature through a depthwise convolutional network layer, each depthwise convolution kernel in the depthwise convolutional network layer separately executing the depthwise convolution of the intermediate feature on the channel dimension;
obtaining the separated feature as the feature representation formed by stacking the outputs of the depthwise convolution kernels.
8. The method according to claim 1, wherein after the obtaining of the original feature extracted over space and the set of channels from the video image sequence in the video stream, the method further comprises:
permuting the original feature on the time dimension and the channel dimension;
and before the identifying, according to the target feature, of the action contained in the video image sequence to obtain the action recognition result, the method further comprises:
permuting the target feature on the time dimension and the channel dimension.
9. The method according to claim 1, wherein the method further comprises:
normalizing the original feature and the target feature respectively, so that the separation processing of the channel dimension and the time dimension is executed on the original feature after the normalization, and the identification of the action is executed on the target feature after the normalization.
10. A video action recognition apparatus, comprising:
an original feature acquisition module, configured to obtain an original feature extracted over space and a set of channels from a video image sequence in a video stream, the original feature being a joint feature representation of the video image sequence over the time, space, and channel dimensions;
a separation processing module, configured to perform separation processing of the channel dimension and the time dimension on the original feature, obtaining a separated feature of the video image sequence;
a feature fusion module, configured to fuse the separated feature with the original feature, obtaining a target feature of the video image sequence;
an action recognition module, configured to identify, according to the target feature, an action contained in the video image sequence, obtaining an action recognition result.
11. An electronic device, comprising:
a memory storing computer-readable instructions;
a processor that reads the computer-readable instructions stored in the memory to perform the method of any one of claims 1-9.
12. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1-9.
CN201910419763.2A 2019-05-20 2019-05-20 Video actions recognition methods and device, electronic equipment, storage medium Pending CN110210344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910419763.2A CN110210344A (en) 2019-05-20 2019-05-20 Video actions recognition methods and device, electronic equipment, storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910419763.2A CN110210344A (en) 2019-05-20 2019-05-20 Video actions recognition methods and device, electronic equipment, storage medium

Publications (1)

Publication Number Publication Date
CN110210344A true CN110210344A (en) 2019-09-06

Family

ID=67787777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910419763.2A Pending CN110210344A (en) 2019-05-20 2019-05-20 Video actions recognition methods and device, electronic equipment, storage medium

Country Status (1)

Country Link
CN (1) CN110210344A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241985A (en) * 2020-01-08 2020-06-05 腾讯科技(深圳)有限公司 Video content identification method and device, storage medium and electronic equipment
CN111241985B (en) * 2020-01-08 2022-09-09 腾讯科技(深圳)有限公司 Video content identification method and device, storage medium and electronic equipment
CN111242068A (en) * 2020-01-17 2020-06-05 科大讯飞(苏州)科技有限公司 Behavior recognition method and device based on video, electronic equipment and storage medium
KR20220153929A (en) * 2021-05-12 2022-11-21 인하대학교 산학협력단 Deep learning method and apparatus based on tsn for real-time hand gesture recognition in video
KR102587234B1 (en) * 2021-05-12 2023-10-10 인하대학교 산학협력단 Deep learning method and apparatus based on tsn for real-time hand gesture recognition in video


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination