CN112464807A - Video motion recognition method and device, electronic equipment and storage medium


Info

Publication number
CN112464807A
Authority
CN
China
Prior art keywords
pixel
value
video
image information
motion recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011351589.1A
Other languages
Chinese (zh)
Inventor
吴臻志
马欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN202011351589.1A priority Critical patent/CN112464807A/en
Publication of CN112464807A publication Critical patent/CN112464807A/en
Priority to PCT/CN2021/132696 priority patent/WO2022111506A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The application discloses a video action recognition method and apparatus, an electronic device, and a storage medium, belonging to the technical field of neural networks. The video motion recognition method comprises the following steps: acquiring a target video clip; performing difference processing on the image frames in the target video clip to obtain a difference image information sequence, where the difference image information sequence comprises at least one frame of difference image information; and inputting the difference image information sequence into a video action recognition network to determine the action recognition result of the target video segment. Embodiments of the application can improve the calculation speed of video motion recognition.

Description

Video motion recognition method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of neural networks, and particularly relates to a video motion recognition method and device, electronic equipment and a storage medium.
Background
Recognizing the actions in a captured video has promising application prospects in video surveillance and user interaction.
In the related art, motion recognition suffers from drawbacks such as a large amount of calculation and a slow calculation speed.
Disclosure of Invention
The embodiments of the present application aim to provide a video motion recognition method, a video motion recognition apparatus, an electronic device, and a storage medium that can solve the slow calculation speed of video motion recognition methods in the related art.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a video motion recognition method, where the method includes:
acquiring a target video clip;
carrying out differential processing on the image frames in the target video clip to obtain a differential image information sequence, wherein the differential image information sequence comprises at least one frame of differential image information;
and inputting the differential image information sequence into a video action recognition network to determine an action recognition result of the target video segment.
In a second aspect, an embodiment of the present application provides a video motion recognition apparatus, where the apparatus includes:
the acquisition module is used for acquiring a target video clip;
the difference module is used for carrying out difference processing on the image frames in the target video clip to obtain a difference image information sequence, and the difference image information sequence comprises at least one frame of difference image information;
and the identification module is used for inputting the difference image information sequence into a video action identification network so as to determine the action identification result of the target video segment.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiments of the application, a target video clip is acquired; difference processing is performed on the image frames in the target video clip to obtain a difference image information sequence comprising at least one frame of difference image information; and the difference image information sequence is input into a video action recognition network to determine the action recognition result of the target video segment. Because the video motion recognition network performs motion recognition based on the difference image information sequence, the amount of calculation of the network is reduced, and the calculation speed of the video motion recognition process can be improved.
Drawings
Fig. 1 is a flowchart of a video motion recognition method according to an embodiment of the present application;
fig. 2 is one of schematic structural diagrams of a video motion recognition network to which a video motion recognition method according to an embodiment of the present application can be applied;
fig. 3 is a second schematic structural diagram of a video motion recognition network to which a video motion recognition method according to an embodiment of the present application can be applied;
fig. 4 is a structural diagram of a video motion recognition apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It should be appreciated that data so used may be interchanged under appropriate circumstances, so that embodiments of the application can be practiced in orders other than those illustrated or described herein. Moreover, the terms "first", "second", and the like are generally used generically and do not limit the number of objects; for example, the first object may be one object or more than one. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects it connects.
The video motion recognition method, the video motion recognition apparatus, the electronic device, and the readable storage medium provided in the embodiments of the present application are described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
In the related art, video motion recognition may be performed by:
the first method is as follows: video motion is predicted by a spatio-temporal dual-Stream Network structure (which may also be referred to as a Two Stream Network).
The space-time double-flow network structure comprises two branches, wherein one branch network extracts image information according to an input single-frame image, and image classification is carried out. And the other branch network extracts motion information between frames according to an input continuous 10-frame optical flow (optical flow) motion field, the network structures of the two branches are the same, and the excitation function of the output layer is the same as the softmax function, namely the softmax function is adopted for prediction. And finally, fusing the results of the two branch networks in a mode of direct averaging or Support Vector Machine (SVM).
In this embodiment, it is necessary to calculate and store the optical flow of the video image in advance, and the data amount of the optical flow is large, so that a large storage space is required.
Therefore, the space-time dual-stream network has the following disadvantages:
the first disadvantage is that: and the training process of the space-time double-flow network is repeated.
On one hand, the training process is complex and the training time is long because the training needs to be carried out on the two branches respectively; on the other hand, since the optical flow in a video segment may be displaced in a particular direction, it is necessary to subtract the average value of all optical flow vectors from the optical flow in advance during the training process.
The second disadvantage is that: in the process of predicting video motion, the speed of identifying the result of the video motion is slow due to the large amount of calculation.
In the application process, the video is also required to be converted into the optical flow through the optical flow model, the converted optical flow is calculated by adopting a space-time double-flow network, and the defect of large calculation amount exists when the converted optical flow is calculated by adopting the space-time double-flow network based on the characteristic that the optical flow has large data volume, so that the speed of identifying the video action result is low.
The third disadvantage is that: can only be applied to motion recognition in images or segment video.
Since the spatio-temporal dual-stream network operates only one frame (spatial network) or operates a single stack of frames in a short segment (temporal network), access to temporal context is limited, and thus modeling for a long-range temporal structure cannot be achieved.
The second method is as follows: video motion is predicted by a three-dimensional Convolutional Neural Network (3D CNN).
In this approach, a video is divided into a plurality of fixed-length segments, and the motion information of each video segment is extracted separately. When the 3D CNN is applied to video motion recognition, its large number of parameters makes training more difficult and demands more training data; the training process of the 3D CNN is therefore complicated and time-consuming.
The third method is as follows: video motion is predicted by a Convolutional Long Short-Term Memory network (ConvLSTM).
In this approach, the features of each image frame in the video are extracted by a CNN, and an LSTM network then mines the temporal relationships among the frame features. However, the ConvLSTM approach is rarely used in video analysis, because the long-range dependencies an LSTM can capture are limited, and the network model is harder and slower to train than spatio-temporal dual-stream networks and 3D convolutional networks.
The fourth method is as follows: video motion is predicted by a Temporal Segment Network (TSN).
Like the spatio-temporal dual-stream network, the TSN is composed of a spatial-stream convolutional network and a temporal-stream convolutional network. But unlike the two-stream network, which uses a single frame or a single stack of frames, the TSN uses a series of short segments sparsely sampled from the whole video; each segment gives its own preliminary prediction of the behavior class, and a video-level prediction is derived from these. Moreover, during TSN learning, the loss of the video-level prediction is optimized by iteratively updating the model parameters.
The TSN is essentially an improved version of the spatio-temporal dual-stream network and shares its defects: the training process is complicated and the calculation is slow.
In view of the above drawbacks in the related art, in the embodiments of the present application the image frames of a video segment are difference-processed in advance to obtain two-dimensional difference image information, and the action recognition result in the video is determined by extracting features from the difference image information and linearly weighting the extracted features. This simplifies the model structure, reduces the amount of calculation in the video motion recognition process, and increases the calculation speed.
Referring to fig. 1, which is a flowchart of a video motion recognition method according to an embodiment of the present disclosure, as shown in fig. 1, the method may include the following steps:
Step 101: acquiring a target video clip.
In some optional embodiments, the target video clip may be acquired by a video acquisition device such as a camera. In addition, when the video acquired by the video acquisition device is long, it may be divided into multiple segments of a preset duration, for example 4 s (seconds) or 5 s; the preset duration is not specifically limited here. In this case, the target video clip may include some or all of the segments of the preset duration.
Step 102: performing difference processing on the image frames in the target video clip to obtain a difference image information sequence, where the difference image information sequence comprises at least one frame of difference image information.
The difference image information may be the difference image produced by the difference processing, or image information obtained by further processing the difference image — for example, an image frame obtained by binarizing the difference image, in which the value of each pixel is binary data such as 1 or 0.
When the pixel values in the difference image information are binary data, the data complexity is lower than that of difference values taking many possible values. This simplifies the computational complexity of the video motion recognition network, allows the method to be applied to a video motion recognition network built on a spiking neural network, and improves the training and inference speed of the network.
In some alternative embodiments, the difference processing may be understood as follows: the target video segment comprises multiple image frames arranged in temporal order; image-data difference processing is performed on two or more adjacent images one by one, and the difference image information sequence of the target video segment is obtained after every image frame in the segment has been traversed.
For example, assume a video segment includes image frames 1, 2, 3, and 4, and difference processing is performed on each pair of adjacent frames: the corresponding pixel values of image frame 2 are subtracted from the pixel values of image frame 1, the corresponding pixel values of image frame 3 are subtracted from the pixel values of image frame 2, and the corresponding pixel values of image frame 4 are subtracted from the pixel values of image frame 3. Difference processing of adjacent image frames in a video segment thus yields several difference images arranged in order, from which the difference image information sequence can be determined.
Preferably, L may be set equal to 2, so that difference processing is performed on two adjacent image frames to find the motion difference between them; of course, in some alternative implementations, L may take the value 2, 3, 4, or any other integer greater than or equal to 2, and is not specifically limited here.
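For illustration, the following is a minimal sketch of this pairwise differencing step with L = 2, assuming grayscale frames held as NumPy arrays (the function name and array handling are illustrative, not from the patent):

```python
import numpy as np

def difference_sequence(frames):
    """Difference adjacent grayscale frames pairwise (L = 2).

    frames: list of HxW uint8 arrays in temporal order.
    Returns one signed difference image per adjacent pair.
    """
    diffs = []
    for earlier, later in zip(frames[:-1], frames[1:]):
        # Cast to a signed type so both positive (enhanced) and
        # negative (weakened) pixel changes survive the subtraction;
        # the sign convention follows the frame-1-minus-frame-2 example above.
        diffs.append(earlier.astype(np.int16) - later.astype(np.int16))
    return diffs
```

Each signed difference image can then be binarized into the enhancement and reduction channels described below.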
As an optional implementation manner, the performing differential processing on the image frames in the target video segment to obtain a differential image information sequence includes:
converting the target video clip into image frames arranged according to time sequence;
respectively carrying out gray processing on the image frames, and respectively carrying out difference processing on L adjacent image frames in the image frames after the gray processing to obtain at least one frame of difference image, wherein L is an integer greater than or equal to 2;
respectively generating differential image information corresponding to each frame of differential image so as to determine a differential image information sequence according to at least one frame of differential image information, wherein the differential image information comprises the pixel enhancement information and the pixel weakening information.
For example, assume that the image frame a and the image frame B after the grayscale processing are subjected to difference processing to obtain a difference image. Generating the difference image information corresponding to each frame of the difference image may be understood as generating the difference image information including the pixel enhancement information and the pixel reduction information according to the difference value in the difference image.
For example, the difference image includes a plurality of difference values, and a difference value greater than or equal to a first threshold value may be determined as a pixel enhancement value, and a difference value less than or equal to a second threshold value in the difference image may be determined as a pixel reduction value. Wherein the pixel enhancement value may refer to an enhanced pixel value. The pixel reduction value may refer to a reduced pixel value. The pixel enhancement value and the pixel reduction value may be understood as action edge data.
The pixel enhancement information can be understood as an image channel determined from the pixel enhancement values in the difference image; correspondingly, the pixel reduction information can be understood as an image channel determined from the pixel reduction values in the difference image.
In this embodiment, grayscale processing converts the color image frames into grayscale images, so that unnecessary color features are not analyzed during feature extraction and analysis, which reduces the amount of data calculation in the video motion recognition process. Generating difference image information from the action edge data and performing action recognition based on the difference image information reduces the storage space the data occupies during recognition and improves the recognition speed.
Further, the generating the differential image information corresponding to each frame of the differential image includes:
determining a pixel enhancement value and a pixel reduction value in the plurality of differential values;
generating the pixel enhancement information according to the pixel enhancement value;
and generating the pixel weakening information according to the pixel weakening value.
In this embodiment, the pixel enhancement information and the pixel reduction information are generated according to the difference value, and are input to the video motion recognition network, so that the two-channel two-dimensional data is provided for the video motion recognition network.
In this way, compared with the prior-art approach of extracting optical-flow data from multiple image frames in advance and applying 3D convolution to it — which suffers from a large amount of calculation, low feature-extraction precision, and similar defects — the embodiment of the application can simply extract the motion difference between difference images based on the pixel enhancement information and the pixel reduction information, which reduces the complexity of recognizing actions from the difference images.
In some optional implementations, in the difference image information sequence of the target video segment, the difference values of each piece of difference image information may be analog information or digital information; the difference values can be divided into pixel enhancement values and pixel reduction values, determined according to the value of the analog or digital information. For example, when the difference values of the target video segment form an analog information sequence, an analog value greater than or equal to a first threshold (e.g., +5) is determined as a pixel enhancement value, and an analog value less than or equal to a second threshold (e.g., -5) is determined as a pixel reduction value.
In addition to the analog information sequence, the difference image obtained by the difference processing may be binarized to obtain difference image information having binary data as pixel values.
In some optional embodiments, determining the difference image information sequence includes converting the difference value sequence into a digital information sequence, i.e., binary data; binary data can be applied to a Spiking Neural Network (SNN), whose model structure is simpler.
In some optional embodiments, the difference image includes N difference values, the pixel enhancement information includes N pixel values corresponding to the N difference values, respectively, the pixel reduction information includes N pixel values corresponding to the N difference values, respectively, and N is an integer greater than 1;
wherein the generating the pixel enhancement information according to the pixel enhancement value comprises:
determining a first pixel value corresponding to the pixel enhancement value as 1, and determining the pixel values other than the first pixel value among the N pixel values as 0, to obtain the pixel enhancement information;
wherein the generating the pixel reduction information according to the pixel reduction value comprises:
determining a second pixel value corresponding to the pixel reduction value as 1, and determining the pixel values other than the second pixel value among the N pixel values as 0, to obtain the pixel reduction information.
In this embodiment, converting the difference image into pixel enhancement information and pixel reduction information provides two-channel two-dimensional data for an SNN model in SNN application scenarios, which simplifies the computational complexity of the SNN model.
Of course, in some embodiments, the sequence of difference images (analog information) obtained after the difference processing may be input into the video motion recognition network, and the normalization processing may be performed in a batch normalization layer or the like in the video motion recognition network, which may also achieve the determination of the motion recognition result of the target video segment based on the sequence of difference image information, and is not limited in detail herein.
In some optional embodiments, the performing difference processing on L adjacent image frames in the image frames after the gray processing respectively to obtain at least one frame of difference image includes:
respectively carrying out differential processing on L adjacent image frames in the image frames after the gray processing to obtain at least one frame of differential image, wherein each frame of differential image comprises N differential values, the pixel enhancement information comprises N pixel values respectively corresponding to the N differential values, the pixel attenuation information comprises N pixel values respectively corresponding to the N differential values, and N is an integer greater than 1;
determining that a pixel value corresponding to a first differential value in the pixel emphasis information is equal to 1 and a pixel value corresponding to the first differential value in the pixel reduction information is equal to 0, in a case where the first differential value of the N differential values is greater than or equal to a first threshold;
determining that a pixel value corresponding to a second differential value in the pixel emphasis information is equal to 0 and determining that a pixel value corresponding to the second differential value in the pixel reduction information is equal to 1, in a case where the second differential value of the N differential values is less than or equal to a second threshold value;
determining that a pixel value corresponding to a third differential value in the pixel emphasis information is equal to 0 and determining that a pixel value corresponding to the third differential value in the pixel reduction information is equal to 0, in a case where the third differential value in the differential value sequence is between the first threshold and the second threshold.
In implementation, the difference values are analog information. A pixel value equal to 1 in the pixel enhancement information indicates that the pixel is enhanced, while a pixel value equal to 0 indicates that it is not enhanced (it may be unchanged or weakened); a pixel value equal to 1 in the pixel reduction information indicates that the pixel is weakened, while a pixel value equal to 0 indicates that it is not weakened (it may be unchanged or enhanced).
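As a sketch of the three-case rule above, assuming a signed difference image stored as a NumPy array and illustrative threshold values (the names and thresholds are assumptions, not the patent's):

```python
import numpy as np

def binarize_difference(diff, first_threshold=5, second_threshold=-5):
    """Map a signed difference image to binary enhancement and
    reduction channels following the three cases described above."""
    enhance = (diff >= first_threshold).astype(np.uint8)   # case 1 -> (1, 0)
    reduce_ = (diff <= second_threshold).astype(np.uint8)  # case 2 -> (0, 1)
    # Case 3: values strictly between the thresholds are 0 in both
    # channels, which the two comparisons above already produce.
    return enhance, reduce_
```

Stacking the two returned channels yields the two-channel difference image information.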
Compared with analog information, converting the difference values into digital signals simplifies the data processing of the video recognition network and allows application to a video recognition network built on a spiking neural network, thereby improving the operating efficiency of the network.
Further, the pixel enhancement information is transmitted to a video identification network through a pixel enhancement channel, and the pixel reduction information is transmitted to the video identification network through a pixel reduction channel.
That is, three-channel RGB images can be converted into two-channel images through image difference processing, which simplifies the data complexity. In addition, difference processing exposes the inter-frame relationships of the images, so video motion features are easier to obtain when extracting features from the difference images, which improves the speed of video motion recognition.
In an implementation, a first identifier may be added to the pixel enhancement information and a second identifier to the pixel reduction information, so that after the pixel enhancement information and the pixel reduction information are transmitted together to the video recognition network, the network separates the pixel enhancement information from the pixel reduction information according to the first and second identifiers; this is not specifically limited here.
Of course, in an application scenario where the calculation power is sufficient, the image frames may not be subjected to the grayscale processing, but a plurality of adjacent image frames may be directly subjected to the difference processing to obtain the difference image information, which is not limited in this respect.
In some optional embodiments, the difference image information may further include an all-zero channel; for example, the difference image information may include the pixel enhancement information, the pixel reduction information, and an all-zero channel.
Step 103: inputting the difference image information sequence into a video motion recognition network to determine the action recognition result of the target video segment.
The video motion recognition network can be any trained neural network for motion recognition.
In a possible implementation, the video motion recognition network is built on a spiking neural network, and the input to the network may be a difference image information sequence determined from the target video segment, comprising at least one frame of difference image information. A frame of difference image information may include two image channels, pixel enhancement information and pixel reduction information respectively; each channel may include multiple pixels, with pixel values of 0 or 1. A pixel valued 1 in the pixel enhancement information can be understood as an enhanced pixel, and a pixel valued 0 as a non-enhanced pixel; a pixel valued 1 in the pixel reduction information can be understood as a weakened pixel, and a pixel valued 0 as a non-weakened pixel.
In addition, the video motion recognition network may extract the feature values of the difference image information using a convolutional leaky integrate-and-fire (ConvLIF) model, and the feature values may include temporal feature values and spatial feature values.
In some optional embodiments, the video action recognition network may extract the feature values of a video segment and, after weighting them, obtain multiple tag values corresponding to multiple preset action tags; the action recognition result of the target video segment may then be determined as the preset action corresponding to the tag with the largest value.
A larger tag value indicates that the action in the video is closer to the preset action corresponding to that tag.
In practical applications, the action in the video may not exactly match any preset action, so the obtained result will often include several tag values that are close to one another or all larger than a preset threshold; in this case, the action recognition result of the target video segment may also be determined as: the video action is close to each of the preset actions corresponding to those values.
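By way of illustration, reading out such a result might look as follows; the threshold-based tie handling is an assumption for illustration, not the patent's rule:

```python
import numpy as np

def decode_action(tag_values, action_names, threshold=None):
    """Return the single best-matching preset action, or — when several
    tag values are close — all actions whose values exceed a threshold."""
    tag_values = np.asarray(tag_values)
    if threshold is None:
        return action_names[int(np.argmax(tag_values))]
    return [name for name, v in zip(action_names, tag_values) if v > threshold]
```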
For example, fig. 2 is a schematic structural diagram of a video motion recognition network to which the video motion recognition method provided in the embodiment of the present application can be applied. As shown in fig. 2, the video motion recognition network includes a convolutional leaky integrate-and-fire (ConvLIF) module 10 and a fully connected layer module 20. Extracting the feature values of the difference image information through the video motion recognition network and weighting the feature values to determine the action recognition result of the target video segment includes:
extracting the feature values of the difference image information through the ConvLIF module 10, and weighting the feature values through the fully connected layer module 20 to determine the action recognition result of the target video clip.
As an alternative implementation, as shown in fig. 2, the ConvLIF module includes: a convolutional leaky integrate-and-fire (e.g., ConvLIF or ConvLIAF) layer 11, a Batch Normalization (BN) layer 12, a Rectified Linear Unit (ReLU) layer 13, and a global pooling (which may also be referred to as Avg Pooling) layer 14;
the extracting the feature value of the difference image information by the convolution leakage integral distribution module 10 includes:
performing time sequence convolution processing and leakage integral distribution processing on the difference image information through a convolution leakage integral distribution layer 11 to respectively extract a time sequence characteristic value and a space characteristic value of the target video clip, wherein the characteristic value of the difference image information comprises the time sequence characteristic value and the space characteristic value, and the convolution leakage integral distribution layer 11 adopts a pulse neural network model;
performing batch normalization processing on the feature values of the target video segments through a batch normalization layer 12, wherein the feature values of the target video segments comprise the time sequence feature values and the spatial feature values;
performing linear correction processing on the characteristic values subjected to batch standardization processing through a linear rectification layer 13;
the feature values after the linear correction processing are subjected to average pooling processing by the global pooling layer 14.
In implementation, the fully connected layer module 20 takes the feature data after average pooling and performs weighted summation, reassembling the feature values extracted by each ConvLIF module 10 into a complete feature map, so as to obtain the tag value corresponding to the feature map as the recognition result of the video action.
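As an illustrative sketch of the module's BN → ReLU → global average pooling tail, assuming PyTorch (the ConvLIF layer itself is sketched after the equations below; the layer choices mirror fig. 2 but are not the patent's code):

```python
import torch
import torch.nn as nn

class ConvLIFModuleTail(nn.Module):
    """Post-processing applied to the ConvLIF layer's feature maps:
    batch normalization, linear rectification, global average pooling."""

    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)    # layer 12
        self.relu = nn.ReLU()                 # layer 13
        self.pool = nn.AdaptiveAvgPool2d(1)   # layer 14 (global pooling)

    def forward(self, x):
        # x: feature maps for one time step, shape (N, C, H, W)
        return self.pool(self.relu(self.bn(x))).flatten(1)  # -> (N, C)
```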
In some optional embodiments, the temporal convolution processing and leaky integrate-and-fire processing performed by the ConvLIF layer 11 on the difference image information to extract the temporal and spatial feature values of the target video segment can be implemented as follows.
The original LIF model is described by a differential equation that captures the dynamic behavior of a neuron:
τ · dV(t)/dt = -(V(t) - V_reset) + Σ_{i=1..n} W_i · X_i(t)
where τ is the time constant of the neuron, V_reset is the reset potential, X_i(t) is the input signal (a pulse or no signal) of the i-th neuron connected to the current neuron through weight W_i, and n denotes the total number of neurons. When V(t) reaches a threshold V_th, a pulse signal is emitted and V(t) is reset to its initial value V_reset. To facilitate derivation and training, an iterative version of LIF over discrete time is used, and each iteration may include the following steps:
1) Synaptic integration, expressed as:
I_t = Conv(X_t, W)
where I_t denotes the synaptic integration at time t, X_t represents the activation values of the presynaptic neurons, and W denotes the synaptic weights. Synaptic integration may take a fully connected or a convolutional form; Conv in the above equation denotes convolution.
2) Combining the spatial information and the temporal information, expressed as:
U_t = I_t + V_{t-1}
where V_{t-1} and U_t are the previous and current membrane potentials, respectively.
3) Comparing with the threshold and transmitting pulses, expressed as:
F_t = 1 if U_t ≥ V_th, otherwise F_t = 0
where F_t is the transmitted signal: F_t = 1 denotes a pulse-emission event, and F_t = 0 indicates that no pulse event is transmitted.
4) Resetting the membrane potential, expressed as:
R_t = F_t · V_reset + (1 - F_t) · U_t
where R_t denotes the membrane potential after the reset.
5) Applying leakage, expressed as:
V_t = α · R_t + β
where α and β represent the multiplicative and additive attenuation coefficients, respectively.
6) Outputting F_t.
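Condensing steps 1) to 6) into one discrete-time update, the following is a minimal sketch assuming PyTorch, a 3×3 convolution kernel, and illustrative hyper-parameter values (not the patent's implementation):

```python
import torch.nn.functional as F

def conv_lif_step(x_t, v_prev, weight, v_th=1.0, v_reset=0.0,
                  alpha=0.9, beta=0.0):
    """One iteration of the discrete-time LIF cell.

    x_t: presynaptic activations, shape (N, C_in, H, W).
    v_prev: membrane potential from the previous step, shape
            (N, C_out, H, W) (zeros at t = 0).
    weight: convolution kernel, shape (C_out, C_in, 3, 3).
    """
    i_t = F.conv2d(x_t, weight, padding=1)   # 1) synaptic integration
    u_t = i_t + v_prev                       # 2) combine space and time
    f_t = (u_t >= v_th).float()              # 3) threshold, emit pulses
    r_t = f_t * v_reset + (1.0 - f_t) * u_t  # 4) reset fired neurons
    v_t = alpha * r_t + beta                 # 5) multiplicative/additive leak
    return f_t, v_t                          # 6) output F_t, keep the state
```

In a ConvLIAF-style variant, an analog activation derived from u_t would typically be passed downstream instead of the binary f_t.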
In addition, the batch normalization processing, the linear correction processing, and the global pooling processing are the same as the corresponding processing methods in the prior art, and are not described herein again.
In the present embodiment, temporal convolution processing is performed on consecutive image frames by the ConvLIAF layer; the output feature values are normalized by the Batch Normalization layer, which keeps the video motion recognition network stable while effectively reducing the probability of overfitting during training; the ReLU layer adds nonlinear relationships between the layers of the neural network; and the AvgPooling layer, on one hand, prevents useless parameters from increasing the time complexity and, on the other hand, increases the degree of integration of the feature values.
In some alternative embodiments, the AvgPooling layer may be an AvgPooling2D (two-dimensional) layer wrapped in a time-distribution (which may also be referred to as TimeDistributed) layer, or an AvgPooling3D (three-dimensional) layer, which is not specifically limited here.
In addition, apart from the convolutional leaky integrate-and-fire layer 11, the batch normalization layer 12, the linear rectification layer 13, the global pooling layer 14, and the fully connected layer module may all adopt an Artificial Neural Network (ANN), so that by fusing the ANN and the SNN the video motion recognition network achieves better processing capability for mixed spatio-temporal applications. The SNN has a differentiated advantage in scenarios with low precision requirements but high calculation-speed requirements: its error rate can approach convergence in a short time, whereas the traditional CNN method takes longer under the same conditions. Therefore, when the SNN is applied to feature processing of the image frames in short video segments, its error rate approaches convergence quickly and little time is consumed; the feature values extracted by the SNN are then further processed by the more accurate ANN to improve the processing capability.
It should be noted that, in practical applications, the video motion recognition network may include more or less network layers than the video motion recognition network shown in fig. 2 according to a difference of a model structure or an algorithm of the video motion recognition network, and is not limited in this respect.
Compared with the conventional CNN method, the embodiment of the present application can use the SNN-mode convolutional leaky integrate-and-fire layer 11 to recognize actions with a limited number of labeled videos (i.e., the output of the video action recognition network may include tag values corresponding to a limited number of action tags, where a tag value indicates the similarity between the action in the video and the corresponding action tag). The method extracts pulse information from the original video data while preserving the temporal correlation between different frames, and summarizes the large amount of dynamic activity in the video into the actions corresponding to the tags in the repository. A video action recognition network fusing ANN and SNN can therefore be both efficient and accurate, and because the amounts of calculation and storage in the recognition process are reduced, the network's requirements on the storage capacity and computing power of its operating environment are lowered, which improves its applicability.
In the training process, in order to learn the motion corresponding to each motion label in the video motion recognition network, the video motion recognition network may be trained through the following processes:
step 1: shooting a preset action.
In this step, a plurality of objects may be photographed, and each object performs the preset action during the photographing process, and the photographing time of each preset action may be set as a preset duration.
For example: during the photographing process, each person performs 10 preset movements (left arm rotation, right arm rotation, left hand bending, etc.), and each movement of each person is photographed for 20 s.
Step 2: and (5) segmenting the video.
In this step, the video of each preset action is divided into a plurality of segments to increase the number of samples.
For example: the 20 s video is divided evenly into four 5 s video segments.
Preferably, considering the effects of start-up delay and stop delay during shooting, the first 2 s and the last 2 s of the 20 s video may be discarded, and the remaining 16 s of video divided into four 4 s video segments.
In this way, category errors caused by action transitions can be effectively cut off, ensuring the validity of the samples.
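A minimal sketch of this trimming-and-splitting step, assuming the clip is held as a list of frames and an illustrative frame rate (the fps value is an assumption):

```python
def segment_video(frames, fps=25, trim_s=2, segment_s=4):
    """Drop the first/last trim_s seconds of a clip, then split the
    remainder into segment_s-second pieces (e.g. 20 s -> 4 x 4 s)."""
    trim = trim_s * fps
    core = frames[trim:len(frames) - trim]
    step = segment_s * fps
    return [core[i:i + step] for i in range(0, len(core) - step + 1, step)]
```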
And step 3: and (5) classifying the marks.
In this step, video clips of different preset actions are associated with different tags.
For example: the 4 video segments of the ith preset action can be respectively marked as: i × 5, i × 5+1, i × 5+2, and i × 5+3, where i may be any one of 0 to 9.
And 4, step 4: the video is converted into pictures.
In this step, all video clips may be converted into pictures using the computer vision and machine learning software library OpenCV.
Of course, in a specific implementation, other tools may be used to convert the video into the picture frame, and are not limited in this respect.
And 5: and dividing a training set and a testing set.
In specific implementation, the picture frame in step 4 may be used as a sample according to a preset proportion, and divided into a training set and a test set, for example: 80% of the samples were used as training set and 20% as test set. Of course, the samples may be divided into training sets and testing sets according to other proportions, which are not specifically limited herein.
Step 6: and (6) difference processing.
In this step, the pictures may first be resized to reduce and standardize the size of each picture; the resized pictures are then converted into grayscale pictures; finally, consecutive (i.e., adjacent) pictures are difference-processed, pixel enhancement information and pixel reduction information are obtained from the result of the difference processing, and the pixel enhancement information is transmitted on the pixel enhancement channel while the pixel reduction information is transmitted on the pixel reduction channel.
Thus, the original three-channel RGB image data can be changed into the current two-channel image data.
In practical applications, when a moving object moves between two consecutive image frames, the difference image signals are not all 0; when there is no relative movement between two consecutive frames, the difference image signals are all 0. The relationship between the image frames is thereby captured.
Finally, the difference image signals obtained from the difference processing are input into the video motion recognition network through the pixel enhancement channel and the pixel reduction channel respectively for training, until the accuracy of the trained network meets a preset condition or all samples have been used for training.
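Putting step 6 together, the following is a sketch of the preprocessing chain using OpenCV; the frame size, thresholds, and file handling are illustrative assumptions:

```python
import cv2
import numpy as np

def preprocess_segment(frame_paths, size=(128, 128), t_enh=5, t_red=-5):
    """Resize -> grayscale -> adjacent-frame difference -> two binary
    channels (enhancement, reduction) per difference image."""
    grays = []
    for path in frame_paths:
        img = cv2.resize(cv2.imread(path), size)          # standardize size
        grays.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.int16))
    sample = []
    for earlier, later in zip(grays[:-1], grays[1:]):
        diff = earlier - later                            # signed difference
        enhance = (diff >= t_enh).astype(np.uint8)        # enhancement channel
        reduce_ = (diff <= t_red).astype(np.uint8)        # reduction channel
        sample.append(np.stack([enhance, reduce_], axis=0))
    return np.stack(sample)                               # (T - 1, 2, H, W)
```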
As an optional implementation manner, in order to avoid overfitting of the video motion recognition network in the training process, a discarding layer may be added in the video motion recognition network to discard the neural network unit from the network according to a preset probability.
For example, in the embodiment shown in fig. 3, the fully connected layer module 20 includes a fully connected layer 21 and a Dropout layer 22. The Dropout layer 22 temporarily discards neural network units from the video motion recognition network with a certain probability, which effectively prevents overfitting and increases the training speed of video motion recognition. The fully connected layer 21 performs weighted summation on the features output by the ConvLIF module to obtain the action tag values.
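A sketch of such a fully connected module with a Dropout layer, assuming PyTorch; the dimensions, dropout probability, and layer order are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FullyConnectedModule(nn.Module):
    """Fully connected layer 21 paired with Dropout layer 22."""

    def __init__(self, in_features=256, num_actions=10, p=0.5):
        super().__init__()
        self.fc = nn.Linear(in_features, num_actions)  # weighted summation
        self.drop = nn.Dropout(p)  # randomly silences units during training

    def forward(self, pooled_features):
        # pooled_features: (N, in_features) from the global pooling layer
        return self.drop(self.fc(pooled_features))     # (N, num_actions)
```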
As can be seen from the above, in this embodiment the video motion recognition network is trained on two-dimensional difference image information, and the ANN layers and SNN layers do not need to be trained separately, so the training process is simple and the training time is short.
As an optional implementation, as shown in fig. 3, there are at least two ConvLIF modules 10, connected in sequence to perform multi-stage feature extraction on the difference image information; the input of the fully connected layer module 20 is connected to the output of the last of the at least two ConvLIF modules 10;
and/or
there are at least two fully connected layer modules 20, connected in sequence to perform multi-stage linear processing on the feature values; the input of the first of the at least two fully connected layer modules 20 is connected to the output of the ConvLIF module 10.
The embodiment shown in fig. 3 uses three ConvLIF modules 10 and two fully connected layer modules 20 for illustration only; the image motion recognition network is not limited to these combinations and may include several ConvLIF modules 10 with one fully connected layer module 20, one ConvLIF module 10 with several fully connected layer modules 20, or other numbers of ConvLIF modules 10 and fully connected layer modules 20.
In this embodiment, the video motion recognition network cascades multiple ConvLIF modules 10 and multiple fully connected layer modules 20 to extract and process deeper features of the video images.
In the embodiments of the application, a target video clip is acquired; difference processing is performed on its image frames to obtain difference image information; the difference image information is input into a video action recognition network, which extracts the feature values of the difference image information and weights them to determine the action recognition result of the target video segment. The video action recognition network thus only needs to extract feature values from two-dimensional difference image information to obtain the differences between image frames, and can obtain the action recognition result of the target video clip by weighting these difference features, without processing three-dimensional image-frame data; this reduces the amount of calculation of the network and improves the calculation speed of video action recognition.
It should be noted that, in the video motion recognition method provided in the embodiments of the present application, the execution subject may be a video motion recognition apparatus, or a control module in the apparatus for executing the video motion recognition method. The embodiments of the present application describe the video motion recognition apparatus provided herein by taking as an example the case where the apparatus executes the video motion recognition method.
Referring to fig. 4, which is a structural diagram of a video motion recognition apparatus according to an embodiment of the present disclosure, as shown in fig. 4, the video motion recognition apparatus 400 may include:
an obtaining module 401, configured to obtain a target video segment;
a difference module 402, configured to perform difference processing on image frames in the target video segment to obtain a difference image information sequence, where the difference image information sequence includes at least one frame of difference image information;
the identifying module 403 is configured to input the difference image information sequence into a video motion identification network to determine a motion identification result of the target video segment.
Optionally, the video motion recognition network is built on a spiking neural network, and the pixel values in the difference image information are binary data.
Optionally, the video action recognition network includes a convolutional leaky integrate-and-fire (ConvLIF) module and a fully connected layer module, and the recognition module 403 is specifically configured to:
extract the feature values of the difference image information sequence through the ConvLIF module, and weight the feature values through the fully connected layer module to determine the action recognition result of the target video clip.
Optionally, the identifying module 403 includes:
the conversion unit is used for converting the target video clip into image frames arranged according to time sequence;
the difference processing unit is used for carrying out gray processing on the image frames and respectively carrying out difference processing on L adjacent image frames in the image frames after the gray processing to obtain at least one frame of difference image, wherein L is an integer greater than or equal to 2;
the determining unit is used for respectively generating differential image information corresponding to each frame of differential image so as to determine a differential image information sequence according to at least one frame of differential image information;
wherein the differential image information includes pixel enhancement information and pixel reduction information.
Optionally, the difference image includes a plurality of difference values, where the determining unit includes:
a first determining subunit configured to determine a pixel enhancement value and a pixel reduction value among the plurality of difference values;
a first generating subunit, configured to generate the pixel enhancement information according to the pixel enhancement value;
a second generating subunit, configured to generate the pixel reduction information according to the pixel reduction value.
Optionally, the difference image includes N difference values, the pixel enhancement information includes N pixel values respectively corresponding to the N difference values, the pixel reduction information includes N pixel values respectively corresponding to the N difference values, and N is an integer greater than 1;
wherein the first generating subunit is specifically configured to:
determine a first pixel value corresponding to the pixel enhancement value as 1, and determine the pixel values other than the first pixel value among the N pixel values as 0, to obtain the pixel enhancement information;
wherein the second generating subunit is specifically configured to:
determining a second pixel value corresponding to the pixel reduction value as 1, and determining the pixel values, among the N pixel values, other than the second pixel value as 0, so as to obtain the pixel reduction information.
Optionally, the first determining subunit includes:
a first determining sub-unit, configured to determine a difference value greater than or equal to a first threshold as the pixel enhancement value;
a second determining sub-unit, configured to determine a difference value less than or equal to a second threshold as the pixel reduction value.
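The determining subunits above amount to a two-threshold binarization. A minimal sketch, with illustrative threshold values, splits each signed difference image into the two binary channels, matching the binary pixel data noted earlier:

    import numpy as np

    def encode_binary(diff, first_threshold=10, second_threshold=-10):
        # Pixel enhancement information: 1 where the difference value is
        # greater than or equal to the first threshold, otherwise 0.
        enhancement = (diff >= first_threshold).astype(np.uint8)
        # Pixel reduction information: 1 where the difference value is
        # less than or equal to the second threshold, otherwise 0.
        reduction = (diff <= second_threshold).astype(np.uint8)
        return np.stack([enhancement, reduction])  # shape (2, H, W), binary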
Optionally, the convolution leakage integral distribution module includes: a convolution leakage integral distribution layer, a batch normalization layer, a linear rectification layer and a global pooling layer;
an identification module 403, comprising:
a convolution leakage integral distribution unit, configured to perform time sequence convolution processing and leakage integral distribution processing on the difference image information through the convolution leakage integral distribution layer to extract a time sequence characteristic value and a space characteristic value of the target video segment, respectively, where the characteristic value of the difference image information includes the time sequence characteristic value and the space characteristic value, and the convolution leakage integral distribution layer adopts a pulse neural network model;
a batch normalization unit, configured to perform batch normalization processing on the characteristic values of the target video segment through the batch normalization layer, where the characteristic values of the target video segment include the time sequence characteristic value and the space characteristic value;
a linear rectification unit, configured to perform linear rectification processing on the batch-normalized characteristic values through the linear rectification layer;
a global pooling unit, configured to perform average pooling on the linearly rectified characteristic values through the global pooling layer.
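A minimal PyTorch sketch of one such module is given below, assuming hard-reset leaky integrate-and-fire dynamics with illustrative decay and threshold values. A 2x2 average pool stands in for the pooling stage so that several blocks can be stacked spatially (an assumption on our part; the text itself specifies global pooling), and the surrogate gradient needed to train the spiking threshold is omitted.

    import torch
    import torch.nn as nn

    class ConvLIFBlock(nn.Module):
        # One convolution leakage integral distribution module: a per-step 2D
        # convolution extracts spatial features, leaky integrate-and-fire
        # dynamics over the time steps extract temporal features, followed by
        # batch normalization, linear rectification and average pooling.
        def __init__(self, in_ch, out_ch, decay=0.5, v_th=1.0):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU()
            self.pool = nn.AvgPool2d(2)
            self.decay, self.v_th = decay, v_th

        def forward(self, x):  # x: (T, B, C, H, W) binary/spike tensor
            v, outputs = None, []
            for t in range(x.shape[0]):
                current = self.conv(x[t])  # input current at step t
                v = current if v is None else self.decay * v + current  # leak + integrate
                spikes = (v >= self.v_th).float()  # fire where threshold is reached
                v = v * (1.0 - spikes)  # hard reset of fired neurons
                outputs.append(self.pool(self.relu(self.bn(spikes))))
            return torch.stack(outputs)  # (T, B, out_ch, H/2, W/2)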
Optionally, the full connection layer module adopts an artificial neural network model.
Optionally, the number of the convolution leakage integral distribution modules is at least two, and the at least two convolution leakage integral distribution modules are sequentially connected to extract multi-stage features of the difference image information; the input end of the full connection layer module is connected with the output end of the last-stage convolution leakage integral distribution module among the at least two convolution leakage integral distribution modules;
and/or
the number of the full connection layer modules is at least two, and the at least two full connection layer modules are sequentially connected to perform multi-stage linear processing on the characteristic values; and the input end of the first-stage full connection layer module among the at least two full connection layer modules is connected with the output end of the convolution leakage integral distribution module.
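Continuing the sketch above (and reusing the hypothetical ConvLIFBlock), the multi-stage arrangement might be assembled as follows, with two convolution leakage integral distribution blocks in sequence and a two-stage full connection head; all channel and layer sizes are illustrative:

    class VideoActionSNN(nn.Module):
        # At least two ConvLIFBlocks connected in sequence extract multi-stage
        # features; at least two full connection layers (a conventional ANN
        # head) then perform multi-stage linear processing on those features.
        def __init__(self, num_actions, in_ch=2):
            super().__init__()
            self.features = nn.Sequential(ConvLIFBlock(in_ch, 16),
                                          ConvLIFBlock(16, 32))
            self.head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                                      nn.Linear(64, num_actions))

        def forward(self, x):  # x: (T, B, 2, H, W) binary difference info
            spikes = self.features(x)
            # Firing rate over time plus global average pooling over space.
            rate = spikes.mean(dim=(0, 3, 4))  # (B, 32)
            return self.head(rate)  # one score per action class

    # Usage sketch: 8 time steps of 2-channel 64x64 binary difference maps.
    # x = torch.rand(8, 1, 2, 64, 64).bernoulli()
    # scores = VideoActionSNN(num_actions=10)(x)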
The video motion recognition apparatus provided in the embodiments of the present application has a simple model structure and processes a small amount of data during video motion recognition, so that the computation amount of the video motion recognition process can be reduced and the computation efficiency improved.
The video motion recognition apparatus in the embodiments of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA); the non-mobile electronic device may be a network attached storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, or the like. The embodiments of the present application are not specifically limited in this respect.
The video motion recognition apparatus in the embodiments of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system; the embodiments of the present application are not specifically limited in this respect.
The video motion recognition device provided in the embodiment of the present application can implement each process implemented by the method embodiment shown in fig. 1, and is not described here again to avoid repetition.
Optionally, as shown in fig. 5, an embodiment of the present application further provides an electronic device 500, including a processor 501, a memory 502, and a program or instructions stored in the memory 502 and executable on the processor 501. When executed by the processor 501, the program or instructions implement each process of the above video motion recognition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.
An embodiment of the present application further provides a readable storage medium. The readable storage medium stores a program or instructions which, when executed by a processor, implement each process of the above video motion recognition method embodiment and can achieve the same technical effect; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
An embodiment of the present application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement each process of the above video motion recognition method embodiment and achieve the same technical effect; to avoid repetition, details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatuses of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may also include performing the functions in a substantially simultaneous manner or in a reverse order, depending on the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (13)

1. A video motion recognition method, the method comprising:
acquiring a target video clip;
carrying out differential processing on the image frames in the target video clip to obtain a differential image information sequence, wherein the differential image information sequence comprises at least one frame of differential image information;
and inputting the differential image information sequence into a video action recognition network to determine an action recognition result of the target video segment.
2. The video motion recognition method of claim 1, wherein the video motion recognition network is constructed according to a pulse neural network, and the pixel values in the difference image information are binary data.
3. The video motion recognition method of claim 1, wherein the video motion recognition network comprises a convolution leakage integral distribution module and a full connection layer module, and the inputting the differential image information sequence into the video motion recognition network to determine the motion recognition result of the target video segment comprises:
and extracting the characteristic value of the differential image information sequence through the convolution leakage integral distribution module, and performing weighting processing on the characteristic value through the full-connection layer module to determine the action recognition result of the target video clip.
4. The video motion recognition method of claim 1, wherein the differentiating the image frames in the target video segment to obtain a sequence of differential image information comprises:
converting the target video clip into image frames arranged according to time sequence;
performing grayscale processing on the image frames, and performing difference processing on every L adjacent image frames among the grayscale-processed image frames to obtain at least one frame of difference image, wherein L is an integer greater than or equal to 2;
respectively generating differential image information corresponding to each frame of differential image so as to determine a differential image information sequence according to at least one frame of differential image information;
wherein the differential image information includes pixel enhancement information and pixel reduction information.
5. The video motion recognition method of claim 4, wherein the difference image includes a plurality of difference values, and wherein the generating difference image information corresponding to each frame of difference image includes:
determining a pixel enhancement value and a pixel reduction value in the plurality of differential values;
generating the pixel enhancement information according to the pixel enhancement value;
and generating the pixel reduction information according to the pixel reduction value.
6. The video motion recognition method according to claim 5, wherein the difference image includes N difference values, the pixel enhancement information includes N pixel values corresponding to the N difference values, respectively, the pixel reduction information includes N pixel values corresponding to the N difference values, respectively, and N is an integer greater than 1;
wherein the generating the pixel enhancement information according to the pixel enhancement value comprises:
determining a first pixel value corresponding to the pixel enhancement value as 1, and determining the pixel values, among the N pixel values, other than the first pixel value as 0, so as to obtain the pixel enhancement information;
wherein the generating the pixel reduction information according to the pixel reduction value comprises:
and determining a second pixel value corresponding to the pixel reduction value as 1, and determining the pixel values, among the N pixel values, other than the second pixel value as 0, so as to obtain the pixel reduction information.
7. The video motion recognition method of claim 5, wherein the determining the pixel enhancement value and the pixel reduction value of the plurality of difference values comprises:
determining a differential value greater than or equal to a first threshold value as the pixel enhancement value;
determining a differential value less than or equal to a second threshold value as the pixel reduction value.
8. The video motion recognition method of claim 3, wherein the convolution leakage integral distribution module comprises: a convolution leakage integral distribution layer, a batch normalization layer, a linear rectification layer and a global pooling layer;
the extracting the characteristic value of the differential image information through the convolution leakage integral distribution module comprises:
performing time sequence convolution processing and leakage integral distribution processing on the difference image information through the convolution leakage integral distribution layer to respectively extract a time sequence characteristic value and a space characteristic value of the target video clip, wherein the characteristic value of the difference image information comprises the time sequence characteristic value and the space characteristic value, and the convolution leakage integral distribution layer adopts a pulse neural network model;
performing batch normalization processing on the characteristic values of the target video clips through the batch normalization layer, wherein the characteristic values of the target video clips comprise the time sequence characteristic values and the space characteristic values;
performing linear rectification processing on the batch-normalized characteristic values through the linear rectification layer;
and carrying out average pooling on the linearly rectified characteristic values through the global pooling layer.
9. The video motion recognition method of claim 8, wherein the fully-connected layer module employs an artificial neural network model.
10. The video motion recognition method according to claim 3, wherein the number of the convolution leakage integral distribution modules is at least two, and the at least two convolution leakage integral distribution modules are sequentially connected to perform multi-stage feature extraction on the difference image information; the input end of the full connection layer module is connected with the output end of the last stage of convolution leakage integral distribution module in the at least two convolution leakage integral distribution modules;
and/or
The number of the full connection layer modules is at least two, and the at least two full connection layer modules are sequentially connected to perform multi-stage linear processing on the characteristic values; and the input end of the first-stage full connection layer module among the at least two full connection layer modules is connected with the output end of the convolution leakage integral distribution module.
11. A video motion recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target video clip;
the difference module is used for carrying out difference processing on the image frames in the target video clip to obtain a difference image information sequence, and the difference image information sequence comprises at least one frame of difference image information;
and the identification module is used for inputting the difference image information sequence into a video action identification network so as to determine the action identification result of the target video segment.
12. An electronic device comprising a processor, a memory and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the video action recognition method according to any of claims 1-10.
13. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the video motion recognition method according to any one of claims 1-10.
CN202011351589.1A 2020-11-26 2020-11-26 Video motion recognition method and device, electronic equipment and storage medium Pending CN112464807A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011351589.1A CN112464807A (en) 2020-11-26 2020-11-26 Video motion recognition method and device, electronic equipment and storage medium
PCT/CN2021/132696 WO2022111506A1 (en) 2020-11-26 2021-11-24 Video action recognition method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351589.1A CN112464807A (en) 2020-11-26 2020-11-26 Video motion recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112464807A 2021-03-09

Family

ID=74808033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351589.1A Pending CN112464807A (en) 2020-11-26 2020-11-26 Video motion recognition method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112464807A (en)
WO (1) WO2022111506A1 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818958A (en) * 2021-03-24 2021-05-18 苏州科达科技股份有限公司 Action recognition method, device and storage medium
CN113052091A (en) * 2021-03-30 2021-06-29 中国北方车辆研究所 Action recognition method based on convolutional neural network
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113239855A (en) * 2021-05-27 2021-08-10 北京字节跳动网络技术有限公司 Video detection method and device, electronic equipment and storage medium
CN113269264A (en) * 2021-06-04 2021-08-17 北京灵汐科技有限公司 Object recognition method, electronic device, and computer-readable medium
CN114466153A (en) * 2022-04-13 2022-05-10 深圳时识科技有限公司 Self-adaptive pulse generation method and device, brain-like chip and electronic equipment
CN114495178A (en) * 2022-04-14 2022-05-13 深圳时识科技有限公司 Pulse sequence randomization method and device, brain-like chip and electronic equipment
WO2022111506A1 (en) * 2020-11-26 2022-06-02 北京灵汐科技有限公司 Video action recognition method and apparatus, electronic device and storage medium
CN115171221A (en) * 2022-09-06 2022-10-11 上海齐感电子信息科技有限公司 Action recognition method and action recognition system
CN115908954A (en) * 2023-03-01 2023-04-04 四川省公路规划勘察设计研究院有限公司 Geological disaster hidden danger identification system and method based on artificial intelligence and electronic equipment
CN116311003A (en) * 2023-05-23 2023-06-23 澳克多普有限公司 Video detection method and system based on dual-channel loading mechanism

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461468A (en) * 2022-01-21 2022-05-10 电子科技大学 Microprocessor application scene recognition method based on artificial neural network
CN116614666B (en) * 2023-07-17 2023-10-20 微网优联科技(成都)有限公司 AI-based camera feature extraction system and method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275646B2 (en) * 2017-08-03 2019-04-30 Gyrfalcon Technology Inc. Motion recognition via a two-dimensional symbol having multiple ideograms contained therein
CN110309720A (en) * 2019-05-27 2019-10-08 北京奇艺世纪科技有限公司 Video detecting method, device, electronic equipment and computer-readable medium
CN110555523B (en) * 2019-07-23 2022-03-29 中建三局智能技术有限公司 Short-range tracking method and system based on impulse neural network
CN110503081B (en) * 2019-08-30 2022-08-26 山东师范大学 Violent behavior detection method, system, equipment and medium based on interframe difference
CN111539290B (en) * 2020-04-16 2023-10-20 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112464807A (en) * 2020-11-26 2021-03-09 北京灵汐科技有限公司 Video motion recognition method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
WO2022111506A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
CN112464807A (en) Video motion recognition method and device, electronic equipment and storage medium
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN112597941B (en) Face recognition method and device and electronic equipment
CN110070029B (en) Gait recognition method and device
CN113378600B (en) Behavior recognition method and system
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN111523378B (en) Human behavior prediction method based on deep learning
CN109063626B (en) Dynamic face recognition method and device
CN112562255B (en) Intelligent image detection method for cable channel smoke and fire conditions in low-light-level environment
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN112487913A (en) Labeling method and device based on neural network and electronic equipment
Arya et al. Object detection using deep learning: a review
CN113591674A (en) Real-time video stream-oriented edge environment behavior recognition system
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
Kadim et al. Deep-learning based single object tracker for night surveillance.
WO2022205329A1 (en) Object detection method, object detection apparatus, and object detection system
CN111652181B (en) Target tracking method and device and electronic equipment
CN112766179A (en) Fire smoke detection method based on motion characteristic hybrid depth network
CN115798055A (en) Violent behavior detection method based on corersort tracking algorithm
CN114241573A (en) Facial micro-expression recognition method and device, electronic equipment and storage medium
CN114821777A (en) Gesture detection method, device, equipment and storage medium
CN110826469A (en) Person detection method and device and computer readable storage medium
CN116645727B (en) Behavior capturing and identifying method based on Openphase model algorithm
CN114155475B (en) Method, device and medium for identifying end-to-end personnel actions under view angle of unmanned aerial vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination