CN112464807A - Video motion recognition method and device, electronic equipment and storage medium


Info

Publication number
CN112464807A
Authority
CN
China
Prior art keywords
pixel
value
video
image information
motion recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011351589.1A
Other languages
Chinese (zh)
Inventor
吴臻志
马欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN202011351589.1A priority Critical patent/CN112464807A/en
Publication of CN112464807A publication Critical patent/CN112464807A/en
Priority to PCT/CN2021/132696 priority patent/WO2022111506A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The application discloses a video action recognition method and apparatus, an electronic device, and a storage medium, belonging to the technical field of neural networks. The video motion recognition method comprises the following steps: acquiring a target video clip; performing difference processing on the image frames in the target video clip to obtain a difference image information sequence, where the difference image information sequence comprises at least one frame of difference image information; and inputting the difference image information sequence into a video action recognition network to determine the action recognition result of the target video segment. Embodiments of the application can improve the calculation speed of video motion recognition.

Description

Video motion recognition method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of neural networks, and particularly relates to a video motion recognition method and device, electronic equipment and a storage medium.
Background
Recognizing the actions in a captured video has promising application prospects in video surveillance and user interaction.
In the related art, motion recognition suffers from drawbacks such as a large amount of calculation and a slow calculation speed.
Disclosure of Invention
The embodiments of the present application aim to provide a video motion recognition method, a video motion recognition apparatus, an electronic device, and a storage medium that can solve the slow calculation speed of video motion recognition methods in the related art.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a video motion recognition method, where the method includes:
acquiring a target video clip;
carrying out differential processing on the image frames in the target video clip to obtain a differential image information sequence, wherein the differential image information sequence comprises at least one frame of differential image information;
and inputting the differential image information sequence into a video action recognition network to determine an action recognition result of the target video segment.
In a second aspect, an embodiment of the present application provides a video motion recognition apparatus, where the apparatus includes:
the acquisition module is used for acquiring a target video clip;
the difference module is used for carrying out difference processing on the image frames in the target video clip to obtain a difference image information sequence, and the difference image information sequence comprises at least one frame of difference image information;
and the identification module is used for inputting the difference image information sequence into a video action identification network so as to determine the action identification result of the target video segment.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiments of the application, a target video clip is acquired; difference processing is performed on the image frames in the target video clip to obtain a difference image information sequence comprising at least one frame of difference image information; and the difference image information sequence is input into a video action recognition network to determine the action recognition result of the target video segment. Because the video motion recognition network performs motion recognition based on the difference image information sequence, the amount of calculation of the network is reduced, and the calculation speed of the video motion recognition process can be improved.
Drawings
Fig. 1 is a flowchart of a video motion recognition method according to an embodiment of the present application;
fig. 2 is one of schematic structural diagrams of a video motion recognition network to which a video motion recognition method according to an embodiment of the present application can be applied;
fig. 3 is a second schematic structural diagram of a video motion recognition network to which a video motion recognition method according to an embodiment of the present application can be applied;
fig. 4 is a structural diagram of a video motion recognition apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It should be appreciated that data so used may be interchanged under appropriate circumstances, so that embodiments of the application can be practiced in orders other than those illustrated or described herein. Moreover, the terms "first", "second", and the like are generally used generically and do not limit the number of objects; for example, the first object may be one object or more than one. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects it connects.
The video motion recognition method, the video motion recognition apparatus, the electronic device, and the readable storage medium provided in the embodiments of the present application are described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
In the related art, video motion recognition may be performed by:
the first method is as follows: video motion is predicted by a spatio-temporal dual-Stream Network structure (which may also be referred to as a Two Stream Network).
The space-time double-flow network structure comprises two branches, wherein one branch network extracts image information according to an input single-frame image, and image classification is carried out. And the other branch network extracts motion information between frames according to an input continuous 10-frame optical flow (optical flow) motion field, the network structures of the two branches are the same, and the excitation function of the output layer is the same as the softmax function, namely the softmax function is adopted for prediction. And finally, fusing the results of the two branch networks in a mode of direct averaging or Support Vector Machine (SVM).
In this embodiment, it is necessary to calculate and store the optical flow of the video image in advance, and the data amount of the optical flow is large, so that a large storage space is required.
Therefore, the space-time dual-stream network has the following disadvantages:
the first disadvantage is that: and the training process of the space-time double-flow network is repeated.
On one hand, the training process is complex and the training time is long because the training needs to be carried out on the two branches respectively; on the other hand, since the optical flow in a video segment may be displaced in a particular direction, it is necessary to subtract the average value of all optical flow vectors from the optical flow in advance during the training process.
The second disadvantage is that: in the process of predicting video motion, the speed of identifying the result of the video motion is slow due to the large amount of calculation.
In the application process, the video is also required to be converted into the optical flow through the optical flow model, the converted optical flow is calculated by adopting a space-time double-flow network, and the defect of large calculation amount exists when the converted optical flow is calculated by adopting the space-time double-flow network based on the characteristic that the optical flow has large data volume, so that the speed of identifying the video action result is low.
The third disadvantage is that: can only be applied to motion recognition in images or segment video.
Since the spatio-temporal dual-stream network operates only one frame (spatial network) or operates a single stack of frames in a short segment (temporal network), access to temporal context is limited, and thus modeling for a long-range temporal structure cannot be achieved.
The second method is as follows: video motion is predicted by a three-dimensional Convolutional Neural Network (3D CNN).
In this approach, a video is divided into a plurality of fixed-length segments, and the motion information of each video segment is extracted separately. When the 3D CNN is applied to video motion recognition, its large number of parameters makes training more difficult and demands more training data; the training process of the 3D CNN is therefore complicated and time-consuming.
The third method is as follows: video motion is predicted by a Convolutional Long Short-Term Memory network (ConvLSTM).
In this approach, the features of each image frame in the video are extracted by a CNN, and an LSTM network then mines the temporal relationships among the frame features. However, the ConvLSTM approach is rarely used in video analysis, because the long-range dependencies an LSTM can capture are limited, and the network model is harder and slower to train than spatio-temporal dual-stream networks and 3D convolutional networks.
The fourth method is as follows: video motion is predicted by a Temporal Segment Network (TSN).
Like the spatio-temporal dual-stream network, the TSN is composed of a spatial-stream convolutional network and a temporal-stream convolutional network. But unlike the two-stream network, which uses a single frame or a single stack of frames, the TSN uses a series of short segments sparsely sampled from the whole video; each segment gives its own preliminary prediction of the behavior class, and a video-level prediction is derived from these. Moreover, during TSN learning, the loss of the video-level prediction is optimized by iteratively updating the model parameters.
The TSN is essentially an improved version of the spatio-temporal dual-stream network and shares its defects: the training process is complicated and the calculation is slow.
In view of the above drawbacks in the related art, in the embodiments of the present application the image frames of a video segment are difference-processed in advance to obtain two-dimensional difference image information, and the action recognition result in the video is determined by extracting features from the difference image information and linearly weighting the extracted features. This simplifies the model structure, reduces the amount of calculation in the video motion recognition process, and increases the calculation speed.
Referring to fig. 1, which is a flowchart of a video motion recognition method according to an embodiment of the present disclosure, as shown in fig. 1, the method may include the following steps:
Step 101: acquiring a target video clip.
In some optional embodiments, the target video clip may be acquired by a video acquisition device such as a camera. In addition, when the video acquired by the video acquisition device is long, it may be divided into multiple segments of a preset duration, for example 4 s (seconds) or 5 s; the preset duration is not specifically limited here. In this case, the target video clip may include some or all of the segments of the preset duration.
Step 102: performing difference processing on the image frames in the target video clip to obtain a difference image information sequence, where the difference image information sequence comprises at least one frame of difference image information.
The difference image information may be the difference image produced by the difference processing, or image information obtained by further processing the difference image — for example, an image frame obtained by binarizing the difference image, in which the value of each pixel is binary data such as 1 or 0.
When the pixel values in the difference image information are binary data, the data complexity is lower than that of difference values taking many possible values. This simplifies the computational complexity of the video motion recognition network, allows the method to be applied to a video motion recognition network built on a spiking neural network, and improves the training and inference speed of the network.
In some alternative embodiments, the difference processing may be understood as follows: the target video segment comprises multiple image frames arranged in temporal order; image-data difference processing is performed on two or more adjacent images one by one, and the difference image information sequence of the target video segment is obtained after every image frame in the segment has been traversed.
For example, assume a video segment includes image frames 1, 2, 3, and 4, and difference processing is performed on each pair of adjacent frames: the corresponding pixel values of image frame 2 are subtracted from the pixel values of image frame 1, the corresponding pixel values of image frame 3 are subtracted from the pixel values of image frame 2, and the corresponding pixel values of image frame 4 are subtracted from the pixel values of image frame 3. Difference processing of adjacent image frames in a video segment thus yields several difference images arranged in order, from which the difference image information sequence can be determined.
Preferably, L may be set equal to 2, so that difference processing is performed on two adjacent image frames to find the motion difference between them; of course, in some alternative implementations, L may take the value 2, 3, 4, or any other integer greater than or equal to 2, and is not specifically limited here.
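For illustration, the following is a minimal sketch of this pairwise differencing step with L = 2, assuming grayscale frames held as NumPy arrays (the function name and array handling are illustrative, not from the patent):

```python
import numpy as np

def difference_sequence(frames):
    """Difference adjacent grayscale frames pairwise (L = 2).

    frames: list of HxW uint8 arrays in temporal order.
    Returns one signed difference image per adjacent pair.
    """
    diffs = []
    for earlier, later in zip(frames[:-1], frames[1:]):
        # Cast to a signed type so both positive (enhanced) and
        # negative (weakened) pixel changes survive the subtraction;
        # the sign convention follows the frame-1-minus-frame-2 example above.
        diffs.append(earlier.astype(np.int16) - later.astype(np.int16))
    return diffs
```

Each signed difference image can then be binarized into the enhancement and reduction channels described below.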
As an optional implementation manner, the performing differential processing on the image frames in the target video segment to obtain a differential image information sequence includes:
converting the target video clip into image frames arranged according to time sequence;
respectively carrying out gray processing on the image frames, and respectively carrying out difference processing on L adjacent image frames in the image frames after the gray processing to obtain at least one frame of difference image, wherein L is an integer greater than or equal to 2;
respectively generating differential image information corresponding to each frame of differential image so as to determine a differential image information sequence according to at least one frame of differential image information, wherein the differential image information comprises the pixel enhancement information and the pixel weakening information.
For example, assume that the image frame a and the image frame B after the grayscale processing are subjected to difference processing to obtain a difference image. Generating the difference image information corresponding to each frame of the difference image may be understood as generating the difference image information including the pixel enhancement information and the pixel reduction information according to the difference value in the difference image.
For example, the difference image includes a plurality of difference values, and a difference value greater than or equal to a first threshold value may be determined as a pixel enhancement value, and a difference value less than or equal to a second threshold value in the difference image may be determined as a pixel reduction value. Wherein the pixel enhancement value may refer to an enhanced pixel value. The pixel reduction value may refer to a reduced pixel value. The pixel enhancement value and the pixel reduction value may be understood as action edge data.
The pixel enhancement information can be understood as an image channel determined from the pixel enhancement values in the difference image; correspondingly, the pixel reduction information can be understood as an image channel determined from the pixel reduction values in the difference image.
In this embodiment, grayscale processing converts the color image frames into grayscale images, so that unnecessary color features are not analyzed during feature extraction and analysis, which reduces the amount of data calculation in the video motion recognition process. Generating difference image information from the action edge data and performing action recognition based on the difference image information reduces the storage space the data occupies during recognition and improves the recognition speed.
Further, the generating the differential image information corresponding to each frame of the differential image includes:
determining a pixel enhancement value and a pixel reduction value in the plurality of differential values;
generating the pixel enhancement information according to the pixel enhancement value;
and generating the pixel weakening information according to the pixel weakening value.
In this embodiment, the pixel enhancement information and the pixel reduction information are generated according to the difference value, and are input to the video motion recognition network, so that the two-channel two-dimensional data is provided for the video motion recognition network.
In this way, compared with the prior-art approach of extracting optical-flow data from multiple image frames in advance and applying 3D convolution to it — which suffers from a large amount of calculation, low feature-extraction precision, and similar defects — the embodiment of the application can simply extract the motion difference between difference images based on the pixel enhancement information and the pixel reduction information, which reduces the complexity of recognizing actions from the difference images.
In some optional implementations, in the difference image information sequence of the target video segment, the difference values of each piece of difference image information may be analog information or digital information; the difference values can be divided into pixel enhancement values and pixel reduction values, determined according to the value of the analog or digital information. For example, when the difference values of the target video segment form an analog information sequence, an analog value greater than or equal to a first threshold (e.g., +5) is determined as a pixel enhancement value, and an analog value less than or equal to a second threshold (e.g., -5) is determined as a pixel reduction value.
In addition to the analog information sequence, the difference image obtained by the difference processing may be binarized to obtain difference image information having binary data as pixel values.
In some optional embodiments, determining the difference image information sequence includes converting the difference value sequence into a digital information sequence, i.e., binary data; binary data can be applied to a Spiking Neural Network (SNN), whose model structure is simpler.
In some optional embodiments, the difference image includes N difference values, the pixel enhancement information includes N pixel values corresponding to the N difference values, respectively, the pixel reduction information includes N pixel values corresponding to the N difference values, respectively, and N is an integer greater than 1;
wherein the generating the pixel enhancement information according to the pixel enhancement value comprises:
determining a first pixel value corresponding to the pixel enhancement value as 1, and determining the pixel values other than the first pixel value among the N pixel values as 0, to obtain the pixel enhancement information;
wherein the generating the pixel reduction information according to the pixel reduction value comprises:
determining a second pixel value corresponding to the pixel reduction value as 1, and determining the pixel values other than the second pixel value among the N pixel values as 0, to obtain the pixel reduction information.
In this embodiment, converting the difference image into pixel enhancement information and pixel reduction information provides two-channel two-dimensional data for an SNN model in SNN application scenarios, which simplifies the computational complexity of the SNN model.
Of course, in some embodiments, the sequence of difference images (analog information) obtained after the difference processing may be input into the video motion recognition network, and the normalization processing may be performed in a batch normalization layer or the like in the video motion recognition network, which may also achieve the determination of the motion recognition result of the target video segment based on the sequence of difference image information, and is not limited in detail herein.
In some optional embodiments, the performing difference processing on L adjacent image frames in the image frames after the gray processing respectively to obtain at least one frame of difference image includes:
respectively carrying out differential processing on L adjacent image frames in the image frames after the gray processing to obtain at least one frame of differential image, wherein each frame of differential image comprises N differential values, the pixel enhancement information comprises N pixel values respectively corresponding to the N differential values, the pixel attenuation information comprises N pixel values respectively corresponding to the N differential values, and N is an integer greater than 1;
determining that a pixel value corresponding to a first differential value in the pixel emphasis information is equal to 1 and a pixel value corresponding to the first differential value in the pixel reduction information is equal to 0, in a case where the first differential value of the N differential values is greater than or equal to a first threshold;
determining that a pixel value corresponding to a second differential value in the pixel emphasis information is equal to 0 and determining that a pixel value corresponding to the second differential value in the pixel reduction information is equal to 1, in a case where the second differential value of the N differential values is less than or equal to a second threshold value;
determining that a pixel value corresponding to a third differential value in the pixel emphasis information is equal to 0 and determining that a pixel value corresponding to the third differential value in the pixel reduction information is equal to 0, in a case where the third differential value in the differential value sequence is between the first threshold and the second threshold.
In implementation, the difference values are analog information. A pixel value equal to 1 in the pixel enhancement information indicates that the pixel is enhanced, while a pixel value equal to 0 indicates that it is not enhanced (it may be unchanged or weakened); a pixel value equal to 1 in the pixel reduction information indicates that the pixel is weakened, while a pixel value equal to 0 indicates that it is not weakened (it may be unchanged or enhanced).
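As a sketch of the three-case rule above, assuming a signed difference image stored as a NumPy array and illustrative threshold values (the names and thresholds are assumptions, not the patent's):

```python
import numpy as np

def binarize_difference(diff, first_threshold=5, second_threshold=-5):
    """Map a signed difference image to binary enhancement and
    reduction channels following the three cases described above."""
    enhance = (diff >= first_threshold).astype(np.uint8)   # case 1 -> (1, 0)
    reduce_ = (diff <= second_threshold).astype(np.uint8)  # case 2 -> (0, 1)
    # Case 3: values strictly between the thresholds are 0 in both
    # channels, which the two comparisons above already produce.
    return enhance, reduce_
```

Stacking the two returned channels yields the two-channel difference image information.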
Compared with analog information, converting the difference values into digital signals simplifies the data processing of the video recognition network and allows application to a video recognition network built on a spiking neural network, thereby improving the operating efficiency of the network.
Further, the pixel enhancement information is transmitted to a video identification network through a pixel enhancement channel, and the pixel reduction information is transmitted to the video identification network through a pixel reduction channel.
That is, three-channel RGB images can be converted into two-channel images through image difference processing, which simplifies the data complexity. In addition, difference processing exposes the inter-frame relationships of the images, so video motion features are easier to obtain when extracting features from the difference images, which improves the speed of video motion recognition.
In an implementation, a first identifier may be added to the pixel enhancement information and a second identifier to the pixel reduction information, so that after the pixel enhancement information and the pixel reduction information are transmitted together to the video recognition network, the network separates the pixel enhancement information from the pixel reduction information according to the first and second identifiers; this is not specifically limited here.
Of course, in an application scenario where the calculation power is sufficient, the image frames may not be subjected to the grayscale processing, but a plurality of adjacent image frames may be directly subjected to the difference processing to obtain the difference image information, which is not limited in this respect.
In some optional embodiments, the difference image information may further include an all-zero channel; for example, the difference image information may include the pixel enhancement information, the pixel reduction information, and an all-zero channel.
Step 103: inputting the difference image information sequence into a video motion recognition network to determine the action recognition result of the target video segment.
The video motion recognition network can be any trained neural network for motion recognition.
In a possible implementation, the video motion recognition network is built on a spiking neural network, and the input to the network may be a difference image information sequence determined from the target video segment, comprising at least one frame of difference image information. A frame of difference image information may include two image channels, pixel enhancement information and pixel reduction information respectively; each channel may include multiple pixels, with pixel values of 0 or 1. A pixel valued 1 in the pixel enhancement information can be understood as an enhanced pixel, and a pixel valued 0 as a non-enhanced pixel; a pixel valued 1 in the pixel reduction information can be understood as a weakened pixel, and a pixel valued 0 as a non-weakened pixel.
In addition, the video motion recognition network may extract the feature values of the difference image information using a convolutional leaky integrate-and-fire (ConvLIF) model, and the feature values may include temporal feature values and spatial feature values.
In some optional embodiments, the video action recognition network may extract the feature values of a video segment and, after weighting them, obtain multiple tag values corresponding to multiple preset action tags; the action recognition result of the target video segment may then be determined as the preset action corresponding to the tag with the largest value.
A larger tag value indicates that the action in the video is closer to the preset action corresponding to that tag.
In practical applications, the action in the video may not exactly match any preset action, so the obtained result will often include several tag values that are close to one another or all larger than a preset threshold; in this case, the action recognition result of the target video segment may also be determined as: the video action is close to each of the preset actions corresponding to those values.
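By way of illustration, reading out such a result might look as follows; the threshold-based tie handling is an assumption for illustration, not the patent's rule:

```python
import numpy as np

def decode_action(tag_values, action_names, threshold=None):
    """Return the single best-matching preset action, or — when several
    tag values are close — all actions whose values exceed a threshold."""
    tag_values = np.asarray(tag_values)
    if threshold is None:
        return action_names[int(np.argmax(tag_values))]
    return [name for name, v in zip(action_names, tag_values) if v > threshold]
```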
For example, fig. 2 is a schematic structural diagram of a video motion recognition network to which the video motion recognition method provided in the embodiment of the present application can be applied. As shown in fig. 2, the video motion recognition network includes a convolutional leaky integrate-and-fire (ConvLIF) module 10 and a fully connected layer module 20. Extracting the feature values of the difference image information through the video motion recognition network and weighting the feature values to determine the action recognition result of the target video segment includes:
extracting the feature values of the difference image information through the ConvLIF module 10, and weighting the feature values through the fully connected layer module 20 to determine the action recognition result of the target video clip.
As an alternative implementation, as shown in fig. 2, the ConvLIF module includes: a convolutional leaky integrate-and-fire (e.g., ConvLIF or ConvLIAF) layer 11, a Batch Normalization (BN) layer 12, a Rectified Linear Unit (ReLU) layer 13, and a global pooling (which may also be referred to as Avg Pooling) layer 14;
the extracting the feature value of the difference image information by the convolution leakage integral distribution module 10 includes:
performing time sequence convolution processing and leakage integral distribution processing on the difference image information through a convolution leakage integral distribution layer 11 to respectively extract a time sequence characteristic value and a space characteristic value of the target video clip, wherein the characteristic value of the difference image information comprises the time sequence characteristic value and the space characteristic value, and the convolution leakage integral distribution layer 11 adopts a pulse neural network model;
performing batch normalization processing on the feature values of the target video segments through a batch normalization layer 12, wherein the feature values of the target video segments comprise the time sequence feature values and the spatial feature values;
performing linear correction processing on the characteristic values subjected to batch standardization processing through a linear rectification layer 13;
the feature values after the linear correction processing are subjected to average pooling processing by the global pooling layer 14.
In implementation, the fully connected layer module 20 takes the feature data after average pooling and performs weighted summation, reassembling the feature values extracted by each ConvLIF module 10 into a complete feature map, so as to obtain the tag value corresponding to the feature map as the recognition result of the video action.
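As an illustrative sketch of the module's BN → ReLU → global average pooling tail, assuming PyTorch (the ConvLIF layer itself is sketched after the equations below; the layer choices mirror fig. 2 but are not the patent's code):

```python
import torch
import torch.nn as nn

class ConvLIFModuleTail(nn.Module):
    """Post-processing applied to the ConvLIF layer's feature maps:
    batch normalization, linear rectification, global average pooling."""

    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)    # layer 12
        self.relu = nn.ReLU()                 # layer 13
        self.pool = nn.AdaptiveAvgPool2d(1)   # layer 14 (global pooling)

    def forward(self, x):
        # x: feature maps for one time step, shape (N, C, H, W)
        return self.pool(self.relu(self.bn(x))).flatten(1)  # -> (N, C)
```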
In some optional embodiments, the temporal convolution processing and leaky integrate-and-fire processing performed by the ConvLIF layer 11 on the difference image information to extract the temporal and spatial feature values of the target video segment can be implemented as follows.
The original LIF model is described by a differential equation that captures the dynamic behavior of a neuron:
τ · dV(t)/dt = -(V(t) - V_reset) + Σ_{i=1..n} W_i · X_i(t)
where τ is the time constant of the neuron, V_reset is the reset potential, X_i(t) is the input signal (a pulse or no signal) of the i-th neuron connected to the current neuron through weight W_i, and n denotes the total number of neurons. When V(t) reaches a threshold V_th, a pulse signal is emitted and V(t) is reset to its initial value V_reset. To facilitate derivation and training, an iterative version of LIF over discrete time is used, and each iteration may include the following steps:
1) Synaptic integration, expressed as:
I_t = Conv(X_t, W)
where I_t denotes the synaptic integration at time t, X_t represents the activation values of the presynaptic neurons, and W denotes the synaptic weights. Synaptic integration may take a fully connected or a convolutional form; Conv in the above equation denotes convolution.
2) Combining the spatial information and the temporal information, expressed as:
U_t = I_t + V_{t-1}
where V_{t-1} and U_t are the previous and current membrane potentials, respectively.
3) Comparing with the threshold and transmitting pulses, expressed as:
F_t = 1 if U_t ≥ V_th, otherwise F_t = 0
where F_t is the transmitted signal: F_t = 1 denotes a pulse-emission event, and F_t = 0 indicates that no pulse event is transmitted.
4) Resetting the membrane potential, expressed as:
R_t = F_t · V_reset + (1 - F_t) · U_t
where R_t denotes the membrane potential after the reset.
5) Applying leakage, expressed as:
V_t = α · R_t + β
where α and β represent the multiplicative and additive attenuation coefficients, respectively.
6) Outputting F_t.
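Condensing steps 1) to 6) into one discrete-time update, the following is a minimal sketch assuming PyTorch, a 3×3 convolution kernel, and illustrative hyper-parameter values (not the patent's implementation):

```python
import torch.nn.functional as F

def conv_lif_step(x_t, v_prev, weight, v_th=1.0, v_reset=0.0,
                  alpha=0.9, beta=0.0):
    """One iteration of the discrete-time LIF cell.

    x_t: presynaptic activations, shape (N, C_in, H, W).
    v_prev: membrane potential from the previous step, shape
            (N, C_out, H, W) (zeros at t = 0).
    weight: convolution kernel, shape (C_out, C_in, 3, 3).
    """
    i_t = F.conv2d(x_t, weight, padding=1)   # 1) synaptic integration
    u_t = i_t + v_prev                       # 2) combine space and time
    f_t = (u_t >= v_th).float()              # 3) threshold, emit pulses
    r_t = f_t * v_reset + (1.0 - f_t) * u_t  # 4) reset fired neurons
    v_t = alpha * r_t + beta                 # 5) multiplicative/additive leak
    return f_t, v_t                          # 6) output F_t, keep the state
```

In a ConvLIAF-style variant, an analog activation derived from u_t would typically be passed downstream instead of the binary f_t.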
In addition, the batch normalization processing, the linear correction processing, and the global pooling processing are the same as the corresponding processing methods in the prior art, and are not described herein again.
In the present embodiment, temporal convolution processing is performed on consecutive image frames by the ConvLIAF layer; the output feature values are normalized by the Batch Normalization layer, which keeps the video motion recognition network stable while effectively reducing the probability of overfitting during training; the ReLU layer adds nonlinear relationships between the layers of the neural network; and the AvgPooling layer, on one hand, prevents useless parameters from increasing the time complexity and, on the other hand, increases the degree of integration of the feature values.
In some alternative embodiments, the AvgPooling layer may be an AvgPooling2D (two-dimensional) layer wrapped in a time-distribution (which may also be referred to as TimeDistributed) layer, or an AvgPooling3D (three-dimensional) layer, which is not specifically limited here.
In addition, apart from the convolutional leaky integrate-and-fire layer 11, the batch normalization layer 12, the linear rectification layer 13, the global pooling layer 14, and the fully connected layer module may all adopt an Artificial Neural Network (ANN), so that by fusing the ANN and the SNN the video motion recognition network achieves better processing capability for mixed spatio-temporal applications. The SNN has a differentiated advantage in scenarios with low precision requirements but high calculation-speed requirements: its error rate can approach convergence in a short time, whereas the traditional CNN method takes longer under the same conditions. Therefore, when the SNN is applied to feature processing of the image frames in short video segments, its error rate approaches convergence quickly and little time is consumed; the feature values extracted by the SNN are then further processed by the more accurate ANN to improve the processing capability.
It should be noted that, in practical applications, the video motion recognition network may include more or less network layers than the video motion recognition network shown in fig. 2 according to a difference of a model structure or an algorithm of the video motion recognition network, and is not limited in this respect.
Compared with the conventional CNN method, the embodiment of the present application can use the SNN-mode convolutional leaky integrate-and-fire layer 11 to recognize actions with a limited number of labeled videos (i.e., the output of the video action recognition network may include tag values corresponding to a limited number of action tags, where a tag value indicates the similarity between the action in the video and the corresponding action tag). The method extracts pulse information from the original video data while preserving the temporal correlation between different frames, and summarizes the large amount of dynamic activity in the video into the actions corresponding to the tags in the repository. A video action recognition network fusing ANN and SNN can therefore be both efficient and accurate, and because the amounts of calculation and storage in the recognition process are reduced, the network's requirements on the storage capacity and computing power of its operating environment are lowered, which improves its applicability.
In the training process, in order to learn the motion corresponding to each motion label in the video motion recognition network, the video motion recognition network may be trained through the following processes:
step 1: shooting a preset action.
In this step, a plurality of objects may be photographed, and each object performs the preset action during the photographing process, and the photographing time of each preset action may be set as a preset duration.
For example: during the photographing process, each person performs 10 preset movements (left arm rotation, right arm rotation, left hand bending, etc.), and each movement of each person is photographed for 20 s.
Step 2: and (5) segmenting the video.
In this step, the video of each preset action is divided into a plurality of segments to increase the number of samples.
For example: the 20 s video is divided evenly into four 5 s video segments.
Preferably, considering the effects of start-up delay and stop delay during shooting, the first 2 s and the last 2 s of the 20 s video may be discarded, and the remaining 16 s of video divided into four 4 s video segments.
In this way, category errors caused by action transitions can be effectively cut off, ensuring the validity of the samples.
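A minimal sketch of this trimming-and-splitting step, assuming the clip is held as a list of frames and an illustrative frame rate (the fps value is an assumption):

```python
def segment_video(frames, fps=25, trim_s=2, segment_s=4):
    """Drop the first/last trim_s seconds of a clip, then split the
    remainder into segment_s-second pieces (e.g. 20 s -> 4 x 4 s)."""
    trim = trim_s * fps
    core = frames[trim:len(frames) - trim]
    step = segment_s * fps
    return [core[i:i + step] for i in range(0, len(core) - step + 1, step)]
```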
And step 3: and (5) classifying the marks.
In this step, video clips of different preset actions are associated with different tags.
For example: the 4 video segments of the ith preset action can be respectively marked as: i × 5, i × 5+1, i × 5+2, and i × 5+3, where i may be any one of 0 to 9.
And 4, step 4: the video is converted into pictures.
In this step, all video clips may be converted into pictures using the computer vision and machine learning software library OpenCV.
Of course, in a specific implementation, other tools may be used to convert the video into the picture frame, and are not limited in this respect.
And 5: and dividing a training set and a testing set.
In specific implementation, the picture frame in step 4 may be used as a sample according to a preset proportion, and divided into a training set and a test set, for example: 80% of the samples were used as training set and 20% as test set. Of course, the samples may be divided into training sets and testing sets according to other proportions, which are not specifically limited herein.
Step 6: and (6) difference processing.
In this step, the pictures may first be resized to reduce and standardize the size of each picture; the resized pictures are then converted into grayscale pictures; finally, consecutive (i.e., adjacent) pictures are difference-processed, pixel enhancement information and pixel reduction information are obtained from the result of the difference processing, and the pixel enhancement information is transmitted on the pixel enhancement channel while the pixel reduction information is transmitted on the pixel reduction channel.
Thus, the original three-channel RGB image data can be changed into the current two-channel image data.
In practical applications, when a moving object moves between two consecutive image frames, the difference image signals are not all 0; when there is no relative movement between two consecutive frames, the difference image signals are all 0. The relationship between the image frames is thereby captured.
Finally, the difference image signals obtained from the difference processing are input into the video motion recognition network through the pixel enhancement channel and the pixel reduction channel respectively for training, until the accuracy of the trained network meets a preset condition or all samples have been used for training.
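Putting step 6 together, the following is a sketch of the preprocessing chain using OpenCV; the frame size, thresholds, and file handling are illustrative assumptions:

```python
import cv2
import numpy as np

def preprocess_segment(frame_paths, size=(128, 128), t_enh=5, t_red=-5):
    """Resize -> grayscale -> adjacent-frame difference -> two binary
    channels (enhancement, reduction) per difference image."""
    grays = []
    for path in frame_paths:
        img = cv2.resize(cv2.imread(path), size)          # standardize size
        grays.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.int16))
    sample = []
    for earlier, later in zip(grays[:-1], grays[1:]):
        diff = earlier - later                            # signed difference
        enhance = (diff >= t_enh).astype(np.uint8)        # enhancement channel
        reduce_ = (diff <= t_red).astype(np.uint8)        # reduction channel
        sample.append(np.stack([enhance, reduce_], axis=0))
    return np.stack(sample)                               # (T - 1, 2, H, W)
```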
As an optional implementation manner, in order to avoid overfitting of the video motion recognition network in the training process, a discarding layer may be added in the video motion recognition network to discard the neural network unit from the network according to a preset probability.
For example, in the embodiment shown in fig. 3, the fully connected layer module 20 includes a fully connected layer 21 and a Dropout layer 22. The Dropout layer 22 temporarily discards neural network units from the video motion recognition network with a certain probability, which effectively prevents overfitting and increases the training speed of video motion recognition. The fully connected layer 21 performs weighted summation on the features output by the ConvLIF module to obtain the action tag values.
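A sketch of such a fully connected module with a Dropout layer, assuming PyTorch; the dimensions, dropout probability, and layer order are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FullyConnectedModule(nn.Module):
    """Fully connected layer 21 paired with Dropout layer 22."""

    def __init__(self, in_features=256, num_actions=10, p=0.5):
        super().__init__()
        self.fc = nn.Linear(in_features, num_actions)  # weighted summation
        self.drop = nn.Dropout(p)  # randomly silences units during training

    def forward(self, pooled_features):
        # pooled_features: (N, in_features) from the global pooling layer
        return self.drop(self.fc(pooled_features))     # (N, num_actions)
```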
As can be seen from the above, in this embodiment the video motion recognition network is trained on two-dimensional difference image information, and the ANN layers and SNN layers do not need to be trained separately, so the training process is simple and the training time is short.
As an optional implementation, as shown in fig. 3, there are at least two ConvLIF modules 10, connected in sequence to perform multi-stage feature extraction on the difference image information; the input of the fully connected layer module 20 is connected to the output of the last of the at least two ConvLIF modules 10;
and/or
there are at least two fully connected layer modules 20, connected in sequence to perform multi-stage linear processing on the feature values; the input of the first of the at least two fully connected layer modules 20 is connected to the output of the ConvLIF module 10.
The embodiment shown in fig. 3 uses three ConvLIF modules 10 and two fully connected layer modules 20 for illustration only; the image motion recognition network is not limited to these combinations and may include several ConvLIF modules 10 with one fully connected layer module 20, one ConvLIF module 10 with several fully connected layer modules 20, or other numbers of ConvLIF modules 10 and fully connected layer modules 20.
In this embodiment, the video motion recognition network cascades multiple ConvLIF modules 10 and multiple fully connected layer modules 20 to extract and process deeper features of the video images.
In the embodiments of the application, a target video clip is acquired; difference processing is performed on its image frames to obtain difference image information; the difference image information is input into a video action recognition network, which extracts the feature values of the difference image information and weights them to determine the action recognition result of the target video segment. The video action recognition network thus only needs to extract feature values from two-dimensional difference image information to obtain the differences between image frames, and can obtain the action recognition result of the target video clip by weighting these difference features, without processing three-dimensional image-frame data; this reduces the amount of calculation of the network and improves the calculation speed of video action recognition.
It should be noted that, in the video motion recognition method provided in the embodiments of the present application, the execution subject may be a video motion recognition apparatus, or a control module in the apparatus for executing the video motion recognition method. The embodiments of the present application describe the video motion recognition apparatus provided herein by taking as an example the case where the apparatus executes the video motion recognition method.
Referring to fig. 4, which is a structural diagram of a video motion recognition apparatus according to an embodiment of the present disclosure, as shown in fig. 4, the video motion recognition apparatus 400 may include:
an obtaining module 401, configured to obtain a target video segment;
a difference module 402, configured to perform difference processing on image frames in the target video segment to obtain a difference image information sequence, where the difference image information sequence includes at least one frame of difference image information;
the identifying module 403 is configured to input the difference image information sequence into a video motion identification network to determine a motion identification result of the target video segment.
Optionally, the video motion recognition network is built on a spiking neural network, and the pixel values in the difference image information are binary data.
Optionally, the video action recognition network includes a convolutional leaky integrate-and-fire (ConvLIF) module and a fully connected layer module, and the recognition module 403 is specifically configured to:
extract the feature values of the difference image information sequence through the ConvLIF module, and weight the feature values through the fully connected layer module to determine the action recognition result of the target video clip.
Optionally, the identifying module 403 includes:
the conversion unit is used for converting the target video clip into image frames arranged according to time sequence;
the difference processing unit is used for carrying out gray processing on the image frames and respectively carrying out difference processing on L adjacent image frames in the image frames after the gray processing to obtain at least one frame of difference image, wherein L is an integer greater than or equal to 2;
the determining unit is used for respectively generating differential image information corresponding to each frame of differential image so as to determine a differential image information sequence according to at least one frame of differential image information;
wherein the differential image information includes pixel enhancement information and pixel reduction information.
Optionally, the difference image includes a plurality of difference values, where the determining unit includes:
a first determining subunit configured to determine a pixel enhancement value and a pixel reduction value among the plurality of difference values;
a first generating subunit, configured to generate the pixel enhancement information according to the pixel enhancement value;
a second generating subunit, configured to generate the pixel reduction information according to the pixel reduction value.
Optionally, the difference image includes N difference values, the pixel enhancement information includes N pixel values respectively corresponding to the N difference values, the pixel reduction information includes N pixel values respectively corresponding to the N difference values, and N is an integer greater than 1;
wherein the first generating subunit is specifically configured to:
determine a first pixel value corresponding to the pixel enhancement value as 1, and determine the pixel values other than the first pixel value among the N pixel values as 0, to obtain the pixel enhancement information;
wherein the second generating subunit is specifically configured to:
determining a second pixel value corresponding to the pixel reduction value as 1, and determining the pixel values, among the N pixel values, other than the second pixel value as 0, so as to obtain the pixel reduction information.
Optionally, the first determining subunit includes:
a first determining sub-unit, configured to determine a difference value greater than or equal to a first threshold as the pixel enhancement value;
a second determining sub-unit, configured to determine a difference value less than or equal to a second threshold as the pixel reduction value.
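The determining subunits above amount to a two-threshold binarization. A minimal sketch, with illustrative threshold values, splits each signed difference image into the two binary channels, matching the binary pixel data noted earlier:

    import numpy as np

    def encode_binary(diff, first_threshold=10, second_threshold=-10):
        # Pixel enhancement information: 1 where the difference value is
        # greater than or equal to the first threshold, otherwise 0.
        enhancement = (diff >= first_threshold).astype(np.uint8)
        # Pixel reduction information: 1 where the difference value is
        # less than or equal to the second threshold, otherwise 0.
        reduction = (diff <= second_threshold).astype(np.uint8)
        return np.stack([enhancement, reduction])  # shape (2, H, W), binary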
Optionally, the convolution leakage integral distribution module includes: a convolution leakage integral distribution layer, a batch normalization layer, a linear rectification layer and a global pooling layer;
an identification module 403, comprising:
a convolution leakage integral distribution unit, configured to perform time sequence convolution processing and leakage integral distribution processing on the difference image information through the convolution leakage integral distribution layer to extract a time sequence characteristic value and a space characteristic value of the target video segment, respectively, where the characteristic value of the difference image information includes the time sequence characteristic value and the space characteristic value, and the convolution leakage integral distribution layer adopts a pulse neural network model;
a batch normalization unit, configured to perform batch normalization processing on the characteristic values of the target video segment through the batch normalization layer, where the characteristic values of the target video segment include the time sequence characteristic value and the space characteristic value;
a linear rectification unit, configured to perform linear rectification processing on the batch-normalized characteristic values through the linear rectification layer;
a global pooling unit, configured to perform average pooling on the linearly rectified characteristic values through the global pooling layer.
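A minimal PyTorch sketch of one such module is given below, assuming hard-reset leaky integrate-and-fire dynamics with illustrative decay and threshold values. A 2x2 average pool stands in for the pooling stage so that several blocks can be stacked spatially (an assumption on our part; the text itself specifies global pooling), and the surrogate gradient needed to train the spiking threshold is omitted.

    import torch
    import torch.nn as nn

    class ConvLIFBlock(nn.Module):
        # One convolution leakage integral distribution module: a per-step 2D
        # convolution extracts spatial features, leaky integrate-and-fire
        # dynamics over the time steps extract temporal features, followed by
        # batch normalization, linear rectification and average pooling.
        def __init__(self, in_ch, out_ch, decay=0.5, v_th=1.0):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU()
            self.pool = nn.AvgPool2d(2)
            self.decay, self.v_th = decay, v_th

        def forward(self, x):  # x: (T, B, C, H, W) binary/spike tensor
            v, outputs = None, []
            for t in range(x.shape[0]):
                current = self.conv(x[t])  # input current at step t
                v = current if v is None else self.decay * v + current  # leak + integrate
                spikes = (v >= self.v_th).float()  # fire where threshold is reached
                v = v * (1.0 - spikes)  # hard reset of fired neurons
                outputs.append(self.pool(self.relu(self.bn(spikes))))
            return torch.stack(outputs)  # (T, B, out_ch, H/2, W/2)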
Optionally, the full connection layer module adopts an artificial neural network model.
Optionally, the number of the convolution leakage integral distribution modules is at least two, and the at least two convolution leakage integral distribution modules are sequentially connected to extract multi-stage features of the difference image information; the input end of the full connection layer module is connected with the output end of the last-stage convolution leakage integral distribution module among the at least two convolution leakage integral distribution modules;
and/or
the number of the full connection layer modules is at least two, and the at least two full connection layer modules are sequentially connected to perform multi-stage linear processing on the characteristic values; and the input end of the first-stage full connection layer module among the at least two full connection layer modules is connected with the output end of the convolution leakage integral distribution module.
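Continuing the sketch above (and reusing the hypothetical ConvLIFBlock), the multi-stage arrangement might be assembled as follows, with two convolution leakage integral distribution blocks in sequence and a two-stage full connection head; all channel and layer sizes are illustrative:

    class VideoActionSNN(nn.Module):
        # At least two ConvLIFBlocks connected in sequence extract multi-stage
        # features; at least two full connection layers (a conventional ANN
        # head) then perform multi-stage linear processing on those features.
        def __init__(self, num_actions, in_ch=2):
            super().__init__()
            self.features = nn.Sequential(ConvLIFBlock(in_ch, 16),
                                          ConvLIFBlock(16, 32))
            self.head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                                      nn.Linear(64, num_actions))

        def forward(self, x):  # x: (T, B, 2, H, W) binary difference info
            spikes = self.features(x)
            # Firing rate over time plus global average pooling over space.
            rate = spikes.mean(dim=(0, 3, 4))  # (B, 32)
            return self.head(rate)  # one score per action class

    # Usage sketch: 8 time steps of 2-channel 64x64 binary difference maps.
    # x = torch.rand(8, 1, 2, 64, 64).bernoulli()
    # scores = VideoActionSNN(num_actions=10)(x)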
The video motion recognition apparatus provided in the embodiments of the present application has a simple model structure and processes a small amount of data during video motion recognition, so that the computation amount of the video motion recognition process can be reduced and the computation efficiency improved.
The video motion recognition apparatus in the embodiments of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA); the non-mobile electronic device may be a network attached storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, or the like. The embodiments of the present application are not specifically limited in this respect.
The video motion recognition apparatus in the embodiments of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system; the embodiments of the present application are not specifically limited in this respect.
The video motion recognition device provided in the embodiment of the present application can implement each process implemented by the method embodiment shown in fig. 1, and is not described here again to avoid repetition.
Optionally, as shown in fig. 5, an embodiment of the present application further provides an electronic device 500, including a processor 501, a memory 502, and a program or instructions stored in the memory 502 and executable on the processor 501. When executed by the processor 501, the program or instructions implement each process of the above video motion recognition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.
An embodiment of the present application further provides a readable storage medium. The readable storage medium stores a program or instructions which, when executed by a processor, implement each process of the above video motion recognition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
An embodiment of the present application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement each process of the above video motion recognition method embodiment and achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatuses of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may also include performing the functions in a substantially simultaneous manner or in a reverse order, depending on the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (13)

1. A video motion recognition method, the method comprising:
acquiring a target video clip;
carrying out differential processing on the image frames in the target video clip to obtain a differential image information sequence, wherein the differential image information sequence comprises at least one frame of differential image information;
and inputting the differential image information sequence into a video action recognition network to determine an action recognition result of the target video segment.
2. The video motion recognition method of claim 1, wherein the video motion recognition network is constructed according to a pulse neural network, and the pixel values in the difference image information are binary data.
3. The video motion recognition method of claim 1, wherein the video motion recognition network comprises a convolution leakage integral distribution module and a full connection layer module, and the inputting the differential image information sequence into the video motion recognition network to determine the motion recognition result of the target video segment comprises:
and extracting the characteristic value of the differential image information sequence through the convolution leakage integral distribution module, and performing weighting processing on the characteristic value through the full-connection layer module to determine the action recognition result of the target video clip.
4. The video motion recognition method of claim 1, wherein the differentiating the image frames in the target video segment to obtain a sequence of differential image information comprises:
converting the target video clip into image frames arranged according to time sequence;
performing grayscale processing on the image frames, and performing difference processing on every L adjacent image frames among the grayscale-processed image frames to obtain at least one frame of difference image, wherein L is an integer greater than or equal to 2;
respectively generating differential image information corresponding to each frame of differential image so as to determine a differential image information sequence according to at least one frame of differential image information;
wherein the differential image information includes pixel enhancement information and pixel reduction information.
5. The video motion recognition method of claim 4, wherein the difference image includes a plurality of difference values, and wherein the generating difference image information corresponding to each frame of difference image includes:
determining a pixel enhancement value and a pixel reduction value in the plurality of differential values;
generating the pixel enhancement information according to the pixel enhancement value;
and generating the pixel reduction information according to the pixel reduction value.
6. The video motion recognition method according to claim 5, wherein the difference image includes N difference values, the pixel enhancement information includes N pixel values corresponding to the N difference values, respectively, the pixel reduction information includes N pixel values corresponding to the N difference values, respectively, and N is an integer greater than 1;
wherein the generating the pixel enhancement information according to the pixel enhancement value comprises:
determining a first pixel value corresponding to the pixel enhancement value as 1, and determining the pixel values, among the N pixel values, other than the first pixel value as 0, so as to obtain the pixel enhancement information;
wherein the generating the pixel reduction information according to the pixel reduction value comprises:
and determining a second pixel value corresponding to the pixel reduction value as 1, and determining the pixel values, among the N pixel values, other than the second pixel value as 0, so as to obtain the pixel reduction information.
7. The video motion recognition method of claim 5, wherein the determining the pixel enhancement value and the pixel reduction value of the plurality of difference values comprises:
determining a differential value greater than or equal to a first threshold value as the pixel enhancement value;
determining a differential value less than or equal to a second threshold value as the pixel reduction value.
8. The video motion recognition method of claim 3, wherein the convolution leakage integral distribution module comprises: a convolution leakage integral distribution layer, a batch normalization layer, a linear rectification layer and a global pooling layer;
the extracting the characteristic value of the differential image information through the convolution leakage integral distribution module comprises:
performing time sequence convolution processing and leakage integral distribution processing on the difference image information through the convolution leakage integral distribution layer to respectively extract a time sequence characteristic value and a space characteristic value of the target video clip, wherein the characteristic value of the difference image information comprises the time sequence characteristic value and the space characteristic value, and the convolution leakage integral distribution layer adopts a pulse neural network model;
performing batch normalization processing on the characteristic values of the target video clips through the batch normalization layer, wherein the characteristic values of the target video clips comprise the time sequence characteristic values and the space characteristic values;
performing linear rectification processing on the batch-normalized characteristic values through the linear rectification layer;
and carrying out average pooling on the linearly rectified characteristic values through the global pooling layer.
9. The video motion recognition method of claim 8, wherein the fully-connected layer module employs an artificial neural network model.
10. The video motion recognition method according to claim 3, wherein the number of the convolution leakage integral distribution modules is at least two, and the at least two convolution leakage integral distribution modules are sequentially connected to perform multi-stage feature extraction on the difference image information; the input end of the full connection layer module is connected with the output end of the last stage of convolution leakage integral distribution module in the at least two convolution leakage integral distribution modules;
and/or
The number of the full connection layer modules is at least two, and the at least two full connection layer modules are sequentially connected to perform multi-stage linear processing on the characteristic values; and the input end of the first-stage full connection layer module among the at least two full connection layer modules is connected with the output end of the convolution leakage integral distribution module.
11. A video motion recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target video clip;
the difference module is used for carrying out difference processing on the image frames in the target video clip to obtain a difference image information sequence, and the difference image information sequence comprises at least one frame of difference image information;
and the identification module is used for inputting the difference image information sequence into a video action identification network so as to determine the action identification result of the target video segment.
12. An electronic device comprising a processor, a memory and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the video action recognition method according to any of claims 1-10.
13. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the video motion recognition method according to any one of claims 1-10.
CN202011351589.1A 2020-11-26 2020-11-26 Video motion recognition method and device, electronic equipment and storage medium Pending CN112464807A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011351589.1A CN112464807A (en) 2020-11-26 2020-11-26 Video motion recognition method and device, electronic equipment and storage medium
PCT/CN2021/132696 WO2022111506A1 (en) 2020-11-26 2021-11-24 Video action recognition method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351589.1A CN112464807A (en) 2020-11-26 2020-11-26 Video motion recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112464807A 2021-03-09

Family

ID=74808033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351589.1A Pending CN112464807A (en) 2020-11-26 2020-11-26 Video motion recognition method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112464807A (en)
WO (1) WO2022111506A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818958A (en) * 2021-03-24 2021-05-18 苏州科达科技股份有限公司 Action recognition method, device and storage medium
CN113052091A (en) * 2021-03-30 2021-06-29 中国北方车辆研究所 Action recognition method based on convolutional neural network
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113239855A (en) * 2021-05-27 2021-08-10 北京字节跳动网络技术有限公司 Video detection method and device, electronic equipment and storage medium
CN113269264A (en) * 2021-06-04 2021-08-17 北京灵汐科技有限公司 Object recognition method, electronic device, and computer-readable medium
CN114466153A (en) * 2022-04-13 2022-05-10 深圳时识科技有限公司 Self-adaptive pulse generation method and device, brain-like chip and electronic equipment
CN114495178A (en) * 2022-04-14 2022-05-13 深圳时识科技有限公司 Pulse sequence randomization method and device, brain-like chip and electronic equipment
WO2022111506A1 (en) * 2020-11-26 2022-06-02 北京灵汐科技有限公司 Video action recognition method and apparatus, electronic device and storage medium
CN115171221A (en) * 2022-09-06 2022-10-11 上海齐感电子信息科技有限公司 Action recognition method and action recognition system
CN115908954A (en) * 2023-03-01 2023-04-04 四川省公路规划勘察设计研究院有限公司 Geological disaster hidden danger identification system and method based on artificial intelligence and electronic equipment
CN116311003A (en) * 2023-05-23 2023-06-23 澳克多普有限公司 Video detection method and system based on dual-channel loading mechanism

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461468A (en) * 2022-01-21 2022-05-10 电子科技大学 Microprocessor application scene recognition method based on artificial neural network
CN116614666B (en) * 2023-07-17 2023-10-20 微网优联科技(成都)有限公司 AI-based camera feature extraction system and method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275646B2 (en) * 2017-08-03 2019-04-30 Gyrfalcon Technology Inc. Motion recognition via a two-dimensional symbol having multiple ideograms contained therein
CN110309720A (en) * 2019-05-27 2019-10-08 北京奇艺世纪科技有限公司 Video detecting method, device, electronic equipment and computer-readable medium
CN110555523B (en) * 2019-07-23 2022-03-29 中建三局智能技术有限公司 Short-range tracking method and system based on impulse neural network
CN110503081B (en) * 2019-08-30 2022-08-26 山东师范大学 Violent behavior detection method, system, equipment and medium based on interframe difference
CN111539290B (en) * 2020-04-16 2023-10-20 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112464807A (en) * 2020-11-26 2021-03-09 北京灵汐科技有限公司 Video motion recognition method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
WO2022111506A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
CN112464807A (en) Video motion recognition method and device, electronic equipment and storage medium
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN112597941B (en) Face recognition method and device and electronic equipment
CN110070029B (en) Gait recognition method and device
CN113378600B (en) Behavior recognition method and system
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN111523378B (en) Human behavior prediction method based on deep learning
CN109063626B (en) Dynamic face recognition method and device
CN112562255B (en) Intelligent image detection method for cable channel smoke and fire conditions in low-light-level environment
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN112487913A (en) Labeling method and device based on neural network and electronic equipment
Arya et al. Object detection using deep learning: a review
CN113591674A (en) Real-time video stream-oriented edge environment behavior recognition system
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
Kadim et al. Deep-learning based single object tracker for night surveillance.
WO2022205329A1 (en) Object detection method, object detection apparatus, and object detection system
CN111652181B (en) Target tracking method and device and electronic equipment
CN112766179A (en) Fire smoke detection method based on motion characteristic hybrid depth network
CN115798055A (en) Violent behavior detection method based on corersort tracking algorithm
CN114241573A (en) Facial micro-expression recognition method and device, electronic equipment and storage medium
CN114821777A (en) Gesture detection method, device, equipment and storage medium
CN110826469A (en) Person detection method and device and computer readable storage medium
CN116645727B (en) Behavior capturing and identifying method based on Openphase model algorithm
CN114155475B (en) Method, device and medium for identifying end-to-end personnel actions under view angle of unmanned aerial vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination