WO2022111506A1 - Video action recognition method and apparatus, electronic device and storage medium - Google Patents

Video action recognition method and apparatus, electronic device and storage medium

Info

Publication number
WO2022111506A1
Authority
WO
WIPO (PCT)
Prior art keywords
pixel
video
action recognition
image information
differential image
Prior art date
Application number
PCT/CN2021/132696
Other languages
English (en)
Chinese (zh)
Inventor
吴臻志
马欣
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京灵汐科技有限公司
Publication of WO2022111506A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present application relates to the technical field of video processing, and in particular, to a video action recognition method, apparatus, electronic device and storage medium.
  • the recognition of actions in the captured video has a good application prospect in video surveillance and user interaction.
  • the purpose of the embodiments of the present application is to provide a video action recognition method, apparatus, electronic device and storage medium, which can solve the problem of slow calculation speed in the video action recognition method in the related art.
  • an embodiment of the present application provides a video action recognition method, the method including: acquiring a target video segment; performing differential processing on image frames in the target video segment to obtain a differential image information sequence, where the differential image information sequence includes at least one frame of differential image information; and inputting the differential image information sequence into a video action recognition network to determine the action recognition result of the target video segment.
  • an embodiment of the present application provides a video action recognition device, the device including: an acquisition module for acquiring a target video segment; a difference module for performing differential processing on image frames in the target video segment to obtain a differential image information sequence, where the differential image information sequence includes at least one frame of differential image information; and a recognition module for inputting the differential image information sequence into a video action recognition network to determine the action recognition result of the target video segment.
  • an embodiment of the present application provides a method for training a video action recognition network, the method including: acquiring a training data set and a test data set, where the training data set includes training image frames of multiple training video clips and the action labels corresponding to each training video clip, and the test data set includes test image frames of multiple test video clips and the action labels corresponding to each test video clip; performing differential processing on the training image frames in the training video clips to obtain a training differential image information sequence, and performing differential processing on the test image frames in the test video clips to obtain a test differential image information sequence; performing network training on the video action recognition network by using the training differential image information sequence corresponding to the training video clips; and verifying the trained video action recognition network by using the test differential image information sequence corresponding to the test video clips, so as to adjust the network parameters of the video action recognition network according to the verification result.
  • an embodiment of the present application provides an electronic device, the electronic device including a processor, a memory, and a program or instruction stored on the memory and executable on the processor, where, when the program or instruction is executed by the processor, the steps of the method described in the first aspect, or the steps of the method described in the third aspect, are implemented.
  • an embodiment of the present application provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or instruction is executed by a processor, the steps of the method described in the first aspect, or the steps of the method described in the third aspect, are implemented.
  • an embodiment of the present application provides a chip, the chip including a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement the steps of the method described in the first aspect, or the steps of the method described in the third aspect.
  • a target video segment is acquired; image frames in the target video segment are subjected to differential processing to obtain a differential image information sequence, where the differential image information sequence includes at least one frame of differential image information; and the differential image information sequence is input into the video action recognition network to determine the action recognition result of the target video segment.
  • the video action recognition network performs action recognition based on the differential image information sequence, which reduces the calculation amount of the video action recognition network, and can improve the calculation speed in the video action recognition process.
  • FIG. 1 is a schematic flowchart of a video action recognition method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a specific implementation manner of step 102 in FIG. 1;
  • FIG. 3 is a schematic flowchart of a specific implementation manner of step 203 in FIG. 2;
  • FIG. 4 is a schematic structural diagram of a video action recognition network provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of another video action recognition network provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a training method of a video action recognition network provided by an embodiment of the present application
  • FIG. 7 is a schematic structural diagram of a video action recognition device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The terms "first", "second" and the like in the description and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in sequences other than those illustrated or described herein. The objects distinguished by "first", "second", etc. are usually of one type, and the number of objects is not limited; for example, the first object may be one or more than one.
  • "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.
  • video action recognition can be performed in the following ways:
  • Method 1 Predict video actions through a spatiotemporal dual stream network structure (Two Stream Network).
  • the spatiotemporal dual-stream network structure includes two branches, wherein one branch network extracts image information according to the input single frame image, that is, image classification. Another branch network extracts the motion information between frames according to the input optical flow motion field of 10 consecutive frames.
  • the network structures of the two branches are the same, and the excitation function of the output layer of each branch is the softmax function, that is, the softmax function is used to make predictions.
  • the results of the two branch networks are fused by means of direct averaging or Support Vector Machine (SVM). Since the optical flow in a video clip may be displaced in a particular direction, during the training process, the average value of all optical flow vectors needs to be subtracted from the optical flow in advance.
  • the spatiotemporal dual-stream network has the following disadvantages. Disadvantage 1: the training process of the spatiotemporal dual-stream network is complex and requires a large amount of storage; because the two branches need to be trained separately, the training process is complicated, the training time is long, and a large storage space is required. Disadvantage 2: in the process of predicting video actions, recognition is slow due to the large amount of calculation; in the application process, the video also needs to be converted into optical flow through an optical flow model, and the spatiotemporal dual-stream network then performs calculation on the converted optical flow, and calculating the optical flow involves a large amount of computation, resulting in a slow speed of recognizing video actions. Disadvantage 3: it can only be applied to action recognition in images or short video clips.
  • Because spatiotemporal dual-stream networks operate on only one frame (spatial network) or on a single stack of frames in a short video clip (temporal network), access to the temporal context is limited, which is not conducive to modeling long-range temporal structures.
  • Method 2 Predict video actions through 3D Convolutional Neural Networks (3D CNN).
  • the video is divided into multiple fixed-length segments, and then the motion information of each video segment is extracted separately.
  • When the 3D CNN is applied to video action recognition, training becomes more difficult and requires more training data due to the large number of parameters of the 3D CNN. Therefore, the training process of the 3D CNN is complicated and time-consuming.
  • Method 3 Predict video actions through Convolutional Long Short-Term Memory (ConvLSTM).
  • the CNN network is used to extract the features of each frame image in the video, and then the LSTM network is used to mine the temporal relationship between the features of the frame images.
  • the ConvLSTM method has poor practicality in video analysis and is not widely used.
  • Method 4 Predict video actions through Temporal Segment Networks (TSN).
  • TSN is also composed of a spatial stream convolutional network and a temporal stream convolutional network.
  • TSN uses a series of short video clips sparsely sampled from the entire video; each clip gives its own initial prediction of the action category, and the video-level prediction result is obtained from the "consensus" of these clips.
  • the loss value of video-level prediction needs to be optimized by iteratively updating the model parameters.
  • TSN is essentially an improved network for the spatiotemporal dual-stream network, which has the same defects as the spatiotemporal dual-stream network, that is, the training process is complex, the storage capacity is large, and the calculation process is slow.
  • the embodiments of the present application perform differential processing on the frame images of the video clips in advance to obtain two-dimensional differential image information, perform feature extraction on the differential image information, and determine the action recognition result in the video by linear weighted summation of the extracted features, which can simplify the model structure and reduce the amount of calculation in the video action recognition process, thereby effectively improving the calculation speed of video action recognition.
  • FIG. 1 is a schematic flowchart of a video action recognition method provided by an embodiment of the present application. As shown in FIG. 1 , the method may include the following steps: step 101 to step 103 .
  • Step 101 Acquire a target video segment.
  • the above-mentioned target video segment may be acquired by a video acquisition device such as a camera.
  • the video can be divided into multiple segments of a preset time length, for example, 4 s (seconds) or 5 s.
  • the preset time length is not specifically limited, and in this case, the above-mentioned target video segment may include part or all of the above-mentioned multiple pieces of video of the preset time length.
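As a rough illustration of this splitting step, the sketch below reads a video with OpenCV and groups its frames into consecutive clips of a preset duration; the function name, the fallback frame rate and the handling of a trailing shorter clip are assumptions, not part of the description above.

```python
import cv2

def split_into_clips(video_path, clip_seconds=4):
    """Group the frames of a video into consecutive clips of a preset duration."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25          # assumed fallback if the FPS is unavailable
    frames_per_clip = int(round(fps * clip_seconds))

    clips, current = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        current.append(frame)
        if len(current) == frames_per_clip:
            clips.append(current)                  # one target video segment
            current = []
    cap.release()
    return clips                                   # a trailing shorter group is discarded here
```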
  • Step 102 Perform differential processing on the image frames in the target video segment to obtain a differential image information sequence, where the differential image information sequence includes at least one frame of differential image information.
  • difference image information may refer to a difference image obtained after difference processing, or to image information obtained after further processing the difference image; for example, it may be an image frame obtained after binarization processing is performed according to the difference image, where, in the binarized image frame, the pixel value of each pixel is binary data, for example, 1 or 0.
  • the data complexity of binary data is lower than that of pixel values with multiple possible values, which can simplify the computational complexity of the video action recognition network; binary data is also suitable for a video action recognition network constructed with a spiking neural network, and can improve the training speed and inference speed of the video action recognition network.
  • the above differential processing can be understood as follows: the target video segment includes multiple image frames arranged in time sequence; in the image frames arranged in time sequence, every L (L is an integer greater than or equal to 2) adjacent image frames are taken as a group; image data differential processing is performed on the image frames of each group in turn, and after traversing each group of image frames in the target video segment, one or more differential images are obtained; the differential image information sequence of the target video segment is then obtained according to the one or more differential images.
  • the above differential processing may be:
  • the pixel values of image frame 1 are respectively subtracted from the corresponding pixel values of image frame 2
  • the pixel values of image frame 2 are respectively subtracted from the corresponding pixel values of image frame 3
  • the pixel values of image frame 3 are respectively subtracted from the corresponding pixel values of image frame 4, and so on.
  • the above-mentioned L may be equal to 2, so as to perform differential processing on every two adjacent image frames, so as to find the motion difference between every two adjacent image frames.
  • the foregoing L may also take any one of integers greater than 2, such as 3, 4, and 5, which is not specifically limited herein.
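For the L = 2 case described above, a minimal sketch of the differential processing could look as follows, assuming the frames have already been converted to grayscale numpy arrays; the signed 16-bit cast is an assumption made here to avoid unsigned wrap-around.

```python
import numpy as np

def pairwise_differences(gray_frames):
    """L = 2 differential processing: each pair of adjacent grayscale frames
    yields one signed difference image (frame i+1 minus frame i)."""
    frames = [f.astype(np.int16) for f in gray_frames]   # avoid uint8 wrap-around
    return [b - a for a, b in zip(frames, frames[1:])]   # the differential image sequence
```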
  • FIG. 2 is a schematic flowchart of a specific implementation manner of step 102 in FIG. 1 .
  • performing differential processing on the image frames in the target video segment to obtain the differential image information sequence, that is, step 102, may further include steps 201 to 203.
  • Step 201 Convert the target video segment into image frames arranged in time sequence.
  • Step 202 Perform grayscale processing on the image frames respectively, and perform differential processing on every L adjacent image frames in all the image frames after the grayscale processing, to obtain at least one frame of differential image, where the aforementioned L is an integer greater than or equal to 2.
  • Step 203 Generate differential image information corresponding to each frame of differential image respectively, so as to determine a differential image information sequence according to at least one frame of differential image information, wherein the differential image information includes pixel enhancement information and pixel weakening information.
  • the difference image C is obtained by performing difference processing between the adjacent image frames A and B after the grayscale processing.
  • Generating difference image information corresponding to the difference image C includes: according to the difference value in the difference image C, generating difference image information including pixel enhancement information and pixel reduction information.
  • the difference image C includes a plurality of difference values; a difference value greater than or equal to a first threshold may be determined as a pixel enhancement value, and a difference value less than or equal to a second threshold in the difference image C may be determined as a pixel attenuation value.
  • the pixel enhancement value may refer to an enhanced pixel value.
  • a pixel attenuation value may refer to a reduced pixel value.
  • the pixel enhancement value and pixel weakening value can be understood as action edge data.
  • the pixel enhancement information can be understood as: image channels determined according to pixel enhancement values in the differential image.
  • the pixel attenuation information can be understood as: an image channel determined according to the pixel attenuation value in the differential image.
  • converting the color image frame into a grayscale image means that unnecessary color features do not need to be analyzed in the process of feature extraction and analysis, so that the amount of computation in the video action recognition process can be reduced.
  • the differential image information is generated by the action edge data, and the action recognition is performed based on the differential image information, which can reduce the storage space occupied by the data during the recognition process and improve the recognition speed.
  • FIG. 3 is a schematic flowchart of a specific implementation manner of step 203 in FIG. 2 .
  • the differential image includes a plurality of differential values.
  • the generating of the differential image information corresponding to the differential images of each frame includes steps 301 to 303 .
  • Step 301 Determine a pixel enhancement value and a pixel weakening value in the plurality of difference values.
  • Step 302 Generate the pixel enhancement information according to the pixel enhancement value.
  • Step 303 Generate the pixel attenuation information according to the pixel attenuation value.
  • pixel enhancement information and pixel reduction information are generated according to the difference values, so as to respectively input the pixel enhancement information and pixel reduction information to the video action recognition network, so as to provide dual-channel two-dimensional data for the video action recognition network.
  • the motion difference between the difference images can be easily extracted based on the pixel enhancement information and the pixel reduction information, and the complexity of the motion recognition of the difference image can be simplified.
  • the difference value of each piece of differential image information may be analog information or digital information, and the difference values in the differential image information may be divided into pixel enhancement values and pixel attenuation values, where the pixel enhancement value and pixel attenuation value may be determined according to the values of the analog information or digital information. For example, when the difference value is analog information, an analog information value greater than or equal to the first threshold (for example, +5) is determined as a pixel enhancement value, and an analog information value less than or equal to the second threshold (for example, -5) is determined as a pixel attenuation value.
  • the differential value of the differential image can also be represented by digital information.
  • For example, the differential image obtained after the above differential processing is subjected to binarization processing to obtain differential image information whose pixel values are binary data, and the binary data is digital information.
  • determining the differential image information sequence includes: converting the differential value sequence of the differential images into a digital information sequence, where the digital information is binary data, and the binary data can be applied to a video action recognition network constructed based on a Spiking Neural Network (SNN), which has a simpler model structure.
  • the differential image information sequence is a sequence composed of at least one piece of differential image information; the differential value sequence is a sequence composed of the multiple differential values of a differential image; and the digital information sequence is a sequence of digital information corresponding to the multiple differential values.
  • the difference image includes N difference values, where N is an integer greater than 1, and the pixel enhancement information includes N pixel values corresponding to the N difference values respectively.
  • generating the pixel enhancement information according to the pixel enhancement value includes: determining the first pixel value corresponding to the pixel enhancement value as 1, and determining the pixel values other than the first pixel value among the N pixel values as 0, to obtain the pixel enhancement information.
  • the difference image includes N difference values, where N is an integer greater than 1, and the pixel attenuation information includes N pixel values corresponding to the N difference values respectively.
  • generating the pixel attenuation information according to the pixel attenuation value includes: determining the second pixel value corresponding to the pixel attenuation value as 1, and determining the pixel values other than the second pixel value among the N pixel values as 0, to obtain the pixel attenuation information.
  • In this way, the differential image is converted into pixel enhancement information and pixel reduction information, and the SNN neural network model can be provided with dual-channel two-dimensional data, which simplifies the computational complexity of the SNN neural network model.
  • Alternatively, the sequence of differential images (analog information) obtained after differential processing can also be input into the video action recognition network and standardized by a batch normalization layer in the video action recognition network; determining the action recognition result of the target video segment according to the differential image information sequence can also be implemented in this way, which is not specifically limited here.
  • each frame of differential image includes N differential values, the pixel enhancement information includes N pixel values corresponding to the N differential values respectively, and the pixel reduction information also includes N pixel values corresponding to the N differential values respectively, where N is an integer greater than 1; in step 203, generating the differential image information corresponding to each frame of differential image includes steps 2031 to 2033.
  • Step 2031 In the case where a first difference value among the N difference values is greater than or equal to the first threshold, determine that the pixel value corresponding to the first difference value in the pixel enhancement information is equal to 1, and determine that the pixel value corresponding to the first difference value in the pixel attenuation information is equal to 0.
  • Step 2032 In the case where a second difference value among the N difference values is less than or equal to the second threshold, determine that the pixel value corresponding to the second difference value in the pixel enhancement information is equal to 0, and determine that the pixel value corresponding to the second difference value in the pixel attenuation information is equal to 1.
  • Step 2033 In the case where a third difference value among the N difference values is between the first threshold and the second threshold, determine that the pixel value corresponding to the third difference value in the pixel enhancement information is equal to 0, and determine that the pixel value corresponding to the third difference value in the pixel attenuation information is equal to 0.
  • Here, the difference value is analog information. If a pixel value in the pixel enhancement information is equal to 1, it means that the pixel is enhanced; if a pixel value in the pixel enhancement information is equal to 0, it means that the pixel is not enhanced (the pixel value may be unchanged or attenuated). If a pixel value in the pixel attenuation information is equal to 1, it means that the pixel is weakened; if a pixel value in the pixel attenuation information is equal to 0, it means that the pixel is not weakened (the pixel value may be unchanged or enhanced).
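A minimal sketch of steps 2031 to 2033 is given below, turning one signed difference image into the binary pixel enhancement channel and pixel attenuation channel; the threshold values +5 and -5 follow the example given earlier and are otherwise assumptions.

```python
import numpy as np

def encode_difference(diff, first_threshold=5, second_threshold=-5):
    """Steps 2031-2033: build the two binary channels from one difference image."""
    enhancement = (diff >= first_threshold).astype(np.uint8)    # 1 where the pixel is enhanced
    attenuation = (diff <= second_threshold).astype(np.uint8)   # 1 where the pixel is weakened
    # Difference values between the two thresholds yield 0 in both channels (step 2033).
    return np.stack([enhancement, attenuation], axis=0)         # dual-channel binary frame
```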
  • Converting the difference values into digital information can simplify the data processing of the video action recognition network, and the digital information can be applied to a video action recognition network constructed based on a spiking neural network, thereby improving the operation efficiency of the video action recognition network.
  • the pixel enhancement information is transmitted to the video action recognition network through the pixel enhancement channel
  • the pixel reduction information is transmitted to the video action recognition network through the pixel reduction channel.
  • the three-channel RGB image can be converted into a two-channel image through image difference processing, thereby simplifying the data complexity.
  • the inter-frame relationship of the images can be found through the difference processing, so that when the feature extraction is performed on the difference image, the video action features can be more easily obtained, thereby improving the speed of video action recognition.
  • When there is relative movement between two consecutive image frames, the differential image information will not be all 0; when there is no relative movement between two consecutive image frames, the differential image information will be all 0. In this way, the relationship between the acquired image frames is captured.
  • In addition, a first identifier can also be added to the pixel enhancement information and a second identifier can be added to the pixel reduction information, so as to transmit the above-mentioned pixel enhancement information and pixel reduction information to the video action recognition network together; the video action recognition network then distinguishes the pixel enhancement information from the pixel reduction information according to the first identifier and the second identifier, which is not specifically limited here.
  • the differential image information may further include an all-zero channel, that is, the differential image information includes pixel enhancement information, pixel attenuation information, and an all-zero channel, where the all-zero channel refers to an image channel whose pixel values are all zero.
  • Step 103 Input the differential image information sequence into a video action recognition network to determine the action recognition result of the target video segment.
  • the video action recognition network may be any neural network trained for video action recognition.
  • the video action recognition network is constructed based on a spiking neural network
  • the input data of the video action recognition network may be a differential image information sequence determined according to the target video segment
  • the differential image information sequence includes at least one frame of differential image information.
  • One frame of differential image information may include two image channels, respectively pixel enhancement information and pixel reduction information, each image channel may include multiple pixels, and the pixel value may be 0 or 1.
  • a pixel with a pixel value of 1 in the pixel enhancement information can be understood as an enhanced pixel
  • a pixel with a pixel value of 0 is a non-enhanced pixel.
  • In the pixel reduction information, a pixel with a pixel value of 1 can be understood as a weakened pixel
  • a pixel with a pixel value of 0 is a non-attenuated pixel.
  • In step 103, the differential image information sequence is input into the video action recognition network, the feature values of the differential image information sequence are extracted through the video action recognition network, and the feature values are weighted to determine the action recognition result of the target video clip.
  • the above-mentioned extraction of the feature values of the differential image information sequence through the video action recognition network may be that the video action recognition network uses a convolutional leaky integrate-and-fire module to extract the feature values of the differential image information sequence, where the feature values of the differential image information sequence may include the time-series feature value and the spatial feature value corresponding to each piece of differential image information.
  • the video action recognition network can extract the feature values of a video clip and, after weighting the feature values, obtain multiple label values corresponding respectively to multiple preset action labels; the above-mentioned determining of the action recognition result of the target video segment may be determining that the action recognition result of the target video segment is the preset action corresponding to the target label with the largest value among the plurality of label values.
  • In some cases, the action in the video may not completely match any preset action; therefore, the obtained video action recognition result may include multiple label values that are close to each other, or multiple label values that are greater than a preset threshold.
  • the above-mentioned determination of the action recognition result of the target video segment may also be: determining that the video action is close to the preset actions corresponding to the multiple values respectively.
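A sketch of this result-selection logic, assuming the network outputs one label value per preset action label, might look like the following; the threshold parameter and function name are illustrative assumptions.

```python
import numpy as np

def pick_actions(label_values, action_labels, threshold=None):
    """Select the recognition result from the weighted label values."""
    label_values = np.asarray(label_values)
    if threshold is None:
        # Single result: the preset action with the largest label value.
        return [action_labels[int(label_values.argmax())]]
    # Multiple results: every preset action whose label value exceeds the preset threshold.
    return [label for label, value in zip(action_labels, label_values) if value > threshold]
```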
  • FIG. 4 is a schematic structural diagram of a video action recognition network provided by an embodiment of the present application.
  • the video action recognition network includes: a convolutional leaky integrate-and-fire (ConvLIF) module 10 and a fully connected layer module 20. The convolutional leaky integrate-and-fire module 10 extracts the feature values of the differential image information sequence, and the feature values are weighted through the fully connected layer module 20 to determine the action recognition result of the target video segment.
  • FIG. 5 is a schematic structural diagram of another video action recognition network provided by an embodiment of the present application.
  • the convolutional leaky integrate-and-fire module 10 includes: a convolutional leaky integrate-and-fire (e.g., ConvLIF or ConvLIAF) layer 11, a Batch Normalization (BN) layer 12, a Rectified Linear Unit (ReLU) layer 13, and a global average pooling (AvgPooling) layer 14.
  • the extraction of the feature values of the differential image information sequence through the convolutional leaky integrate-and-fire module 10 includes steps a to d.
  • Step a Perform time-series convolution processing and leaky integrate-and-fire processing on the differential image information sequence through the convolutional leaky integrate-and-fire layer 11, so as to extract the feature values corresponding to each piece of differential image information in the differential image information sequence, where the feature values of the differential image information include time-series feature values and spatial feature values.
  • the convolutional leaky integrate-and-fire layer 11 employs a spiking neural network model.
  • Step b Through the batch normalization layer 12, perform batch normalization processing on the feature values of the differential image information sequence, where the feature values of the differential image information sequence include the time-series feature values and the spatial feature values corresponding to each piece of differential image information.
  • Step c Through the linear rectification layer 13, perform linear rectification processing on the feature values after the batch normalization processing.
  • Step d Through the global pooling layer 14, perform average pooling on the linearly rectified feature values.
  • the fully connected layer module 20 obtains the feature value data after the average pooling and performs weighted summation, so as to reassemble the feature values extracted in each convolutional leaky integrate-and-fire module 10 into a complete feature map, and obtain the label value corresponding to the feature map as the recognition result of the video action.
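To make the structure of FIG. 4 concrete, the following PyTorch-style sketch stacks a convolutional leaky integrate-and-fire block (steps a to d) and a fully connected module; the channel sizes, threshold, decay factor and the zero reset potential are assumptions, and the hard spike threshold is written without the surrogate gradient that end-to-end training of the spiking layer would require.

```python
import torch
import torch.nn as nn

class ConvLIFBlock(nn.Module):
    """Sketch of module 10: per-time-step convolution with a leaky membrane state,
    followed by batch normalization, ReLU and global average pooling (steps a-d)."""

    def __init__(self, in_ch, out_ch, v_th=1.0, alpha=0.9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.v_th, self.alpha = v_th, alpha

    def forward(self, x):                                   # x: (batch, time, channels, H, W)
        v, outputs = None, []
        for t in range(x.shape[1]):
            i_t = self.conv(x[:, t])                        # step a: time-wise convolution (synaptic integration)
            v = i_t if v is None else self.alpha * v + i_t  # leaky membrane update
            spikes = (v >= self.v_th).float()               # fire where the threshold is reached
            v = v * (1.0 - spikes)                          # reset fired positions (V_reset = 0 assumed)
            y = torch.relu(self.bn(spikes))                 # steps b and c: batch normalization, then ReLU
            outputs.append(y.mean(dim=(2, 3)))              # step d: global average pooling
        return torch.stack(outputs, dim=1)                  # (batch, time, out_ch)

class VideoActionNet(nn.Module):
    """Sketch of FIG. 4: one ConvLIF block followed by a fully connected module
    that maps the pooled features to one label value per preset action."""

    def __init__(self, num_actions, in_ch=2, feat_ch=32):
        super().__init__()
        self.block = ConvLIFBlock(in_ch, feat_ch)
        self.fc = nn.Linear(feat_ch, num_actions)

    def forward(self, diff_seq):                            # diff_seq: (batch, time, 2, H, W)
        feats = self.block(diff_seq)                        # per-step pooled feature values
        return self.fc(feats.mean(dim=1))                   # weighted summation -> label values
```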
  • the above-mentioned convolutional leaky integrate-and-fire layer 11 performing time-series convolution processing and leaky integrate-and-fire processing on the differential image information sequence, so as to respectively extract the time-series feature values and spatial feature values corresponding to each piece of differential image information in the differential image information sequence, can be realized through the following process.
  • In the leaky integrate-and-fire (LIF) neuron model, the membrane potential V(t) can be described by τ·dV(t)/dt = −(V(t) − V_reset) + Σᵢ Wᵢ·Xᵢ(t), summed over the n neurons connected to the current neuron, where τ is the time factor of the neuron, V_reset is the reset potential, and Xᵢ(t) is the input signal (a spike or no signal) of the i-th connected neuron with weight Wᵢ. When V(t) reaches a certain threshold V_th, a pulse signal is emitted and V(t) is reset to its initial value V_reset. Synaptic integration can take the form of a full connection or a convolution, with Conv in the formulas denoting the convolutional form. F_t is the transmitted signal, and α and β represent the multiplicative attenuation coefficient and the additive attenuation coefficient of the membrane potential, respectively.
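As an illustrative discretisation of these dynamics (not taken from the application itself), the membrane update of a single LIF step could be sketched as follows; the time step, threshold and reset values are assumptions.

```python
import numpy as np

def lif_step(v, inputs, weights, tau=2.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """One discrete step of tau*dV/dt = -(V - V_reset) + sum_i W_i X_i(t):
    integrate the weighted inputs, fire where the threshold is reached, then reset."""
    i_syn = weights @ inputs                            # synaptic integration (fully connected form)
    v = v + (dt / tau) * (-(v - v_reset) + i_syn)       # leaky integration
    spikes = (v >= v_th).astype(np.float32)             # pulse signal where V(t) reaches V_th
    v = np.where(spikes > 0, v_reset, v)                # reset fired neurons to V_reset
    return v, spikes
```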
  • sequential convolution processing is performed on consecutive image frames through the ConvLIAF layer; the output feature values are normalized through the Batch Normalization layer to ensure the stability of the video action recognition network.
  • In the training process of the video action recognition network, this normalization can also effectively reduce the probability of overfitting; the ReLU layer increases the nonlinear relationship between the layers of the neural network; and the AvgPooling layer, on the one hand, prevents useless parameters from increasing the time complexity and, on the other hand, increases the degree of integration of the feature values.
  • the AvgPooling layer may use an AvgPooling 2D (two-dimensional) layer wrapped by a time distribution (also called Time Distributed) layer, or an AvgPooling 3D (three-dimensional) layer, which is not specifically limited here.
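The two options mentioned above (a per-frame AvgPooling 2D layer wrapped by a time-distributed layer, or an AvgPooling 3D layer) can be sketched roughly as follows for a 5-dimensional feature tensor; the tensor layout and kernel size are assumptions.

```python
import torch
import torch.nn.functional as F

# x is assumed to have shape (batch, time, channels, H, W).
def time_distributed_avgpool2d(x, k=2):
    b, t, c, h, w = x.shape
    y = F.avg_pool2d(x.reshape(b * t, c, h, w), k)        # 2-D pooling applied to each frame
    return y.reshape(b, t, c, h // k, w // k)

def avgpool3d(x, k=2):
    # 3-D pooling over (time, H, W) with the time dimension left unpooled.
    y = F.avg_pool3d(x.permute(0, 2, 1, 3, 4), (1, k, k))
    return y.permute(0, 2, 1, 3, 4)
```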
  • Apart from the above-mentioned convolutional leaky integrate-and-fire layer 11, the batch normalization layer 12, the linear rectification layer 13, the global pooling layer 14 and the fully connected layer module can all use artificial neural networks (Artificial Neural Network, ANN). By adopting a hybrid ANN and SNN network structure, the video action recognition network can achieve better processing capability for mixed applications in the spatiotemporal domain.
  • SNN has differentiated advantages in scenarios where the accuracy requirements are not high but the calculation speed requirements are high.
  • the error rate of SNN can be close to convergence in a very short time, and the traditional CNN method will take longer under the same circumstances.
  • the video action recognition network may include more or fewer network layers than the video action recognition network shown in FIG. 4, which is not specifically limited here.
  • the convolutional leaky integrate-and-fire layer 11 of the SNN model can be used to identify video actions with a limited number of labels (that is, the output of the video action recognition network may include label values corresponding one-to-one to a limited number of action labels, where each label value is used to indicate the similarity between the action in the video and the corresponding action label).
  • the video action recognition network fused with SNN can have both high efficiency and accuracy, and because it reduces the amount of calculation and storage in the process of video action recognition, it can also reduce the requirements of the video action recognition network on the storage and computing capabilities of its operating environment, thereby improving the applicability of the video action recognition network.
  • FIG. 6 is a schematic flowchart of a training method for a video action recognition network provided by an embodiment of the present application.
  • an embodiment of the present application provides a training method for a video action recognition network.
  • In the training process, in order to make the video action recognition network learn the action corresponding to each action label, the video action recognition network can be trained through the following process: steps 501 to 504.
  • Step 501 Acquire a training data set and a test data set.
  • the training data set includes training image frames of multiple training video clips and action labels corresponding to each training video clip
  • the test data set includes test image frames of multiple test video clips and action labels corresponding to each test video clip.
  • the training data set and the test data set can be obtained through the following steps 1-5:
  • Step 1 Shoot the preset action.
  • a plurality of objects can be photographed, and during the photographing process, each object performs the above-mentioned preset actions respectively, and the shooting time of each preset action can be set to a preset time length.
  • each person performs 10 preset actions (left arm rotation, right arm rotation, left hand bending, etc.), and each person's action is shot for 20 seconds.
  • Step 2 Video segmentation.
  • the above-mentioned video of each preset action is divided into multiple segments to increase the number of samples.
  • the above-mentioned 20s-long video is equally divided into four 5s-long video clips.
  • Alternatively, the shooting content of the first 2 s and the last 2 s of the above 20-second video can be discarded, and the remaining 16 s of video can be equally divided into four 4 s video clips.
  • In this way, the action labeling errors caused by action transitions can be effectively cropped out, ensuring that the samples are valid.
  • Step 3 Tag classification.
  • the 4 video clips of the i-th preset action can be marked as i×5, i×5+1, i×5+2 and i×5+3, where i can take any value from 0 to 9.
  • Step 4 Convert video to picture.
  • all video clips can be converted into image frames using a vision and machine learning software library (OpenCV).
  • Step 5 Divide training dataset and test dataset.
  • the picture frames in step 4 can be used as samples and divided into the training data set and the test data set according to a preset ratio, for example, 80% of the samples are used as the training data set, and 20% of the samples are used as the test data set.
  • the samples can also be divided into training data sets and test data sets according to other ratios, which are not specifically limited here.
  • the samples divided into the training data set are called training image frames, and the corresponding video clips are called training video clips; similarly, the samples divided into the test data set are called test image frames, and the corresponding video clips are called test video clips .
  • Step 502 Perform differential processing on the training image frames in the training video clips to obtain a training differential image information sequence; and perform differential processing on the test image frames in the test video clips to obtain a test differential image information sequence.
  • the training differential image information sequence includes training differential image information of at least one frame of training differential images, and the training differential image information includes pixel enhancement information and pixel attenuation information for training; the test differential image information sequence includes at least one frame of test differential image information.
  • the training image frames may be resized first to reduce and standardize the size of each training image frame; then the resized training image frames are subjected to grayscale processing to obtain grayscale-processed training image frames; finally, differential processing is performed on multiple consecutive (that is, adjacent) training image frames among the grayscale-processed training image frames to obtain training differential images, and the training differential image information is obtained according to the result of the differential processing, where the training differential image information includes pixel enhancement information and pixel weakening information for training.
  • Step 503 Use the training differential image information sequence corresponding to the training video segment to perform network training on the video action recognition network.
  • Step 504 Verify the video action recognition network obtained by training using the test differential image information sequence corresponding to the test video segment, so as to adjust the network parameters of the video action recognition network according to the verification result.
  • In steps 503 and 504, the pixel enhancement information and pixel reduction information of the training differential image information obtained after the differential processing are respectively input into the video action recognition network through the pixel enhancement channel and the pixel reduction channel for training, and the test differential image information sequence corresponding to the test video clips is used to verify the trained video action recognition network, so as to adjust the network parameters of the video action recognition network according to the verification result, until the accuracy of the trained video action recognition network meets a preset condition, or until all the above samples have been used for training.
  • the verification result is the error value between the action recognition result obtained by inputting the test differential image information sequence into the video action recognition network obtained by training and the action label of the pre-marked test video segment.
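A hedged sketch of this train-and-verify loop (steps 503 and 504) is given below, assuming a PyTorch model that outputs one value per action label and data loaders that yield (differential image sequence, action label) pairs; the optimiser, loss function and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_and_verify(model, train_loader, test_loader, epochs=10, lr=1e-3):
    """Steps 503-504: train on the training differential sequences, then measure
    the error on the test differential sequences so that the network parameters
    can be adjusted according to the verification result."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for diff_seq, action_label in train_loader:        # (batch, T, 2, H, W), (batch,)
            optimiser.zero_grad()
            loss = criterion(model(diff_seq), action_label)
            loss.backward()
            optimiser.step()

        model.eval()
        errors, total = 0, 0
        with torch.no_grad():
            for diff_seq, action_label in test_loader:
                prediction = model(diff_seq).argmax(dim=1)
                errors += (prediction != action_label).sum().item()
                total += action_label.numel()
        print(f"epoch {epoch}: test error rate {errors / max(total, 1):.3f}")
```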
  • a dropout layer may be added to the video action recognition network to drop neural network units from the network according to a preset probability.
  • the fully connected layer module 20 includes: a fully connected layer 21 and a dropout layer 22 .
  • the Dropout layer 22 temporarily discards the neural network unit from the video action recognition network according to a certain probability, which effectively prevents the video action recognition network from overfitting, and can improve the training speed of the video action recognition.
  • the fully connected layer 21 is used to perform a weighted summation process on the features output by the convolution leaky integral distribution module 10 to obtain an action label value.
  • the training process of the video action recognition network is based on the two-dimensional differential image information, and the ANN layers and the SNN layer do not need to be trained separately; therefore, the training process of the video action recognition network is simple and the training time is short.
  • the number of convolutional leaky integrate-and-fire modules 10 is at least two, and the at least two convolutional leaky integrate-and-fire modules 10 are connected in sequence to perform multi-level feature extraction on the differential image information sequence; the input end of the fully connected layer module 20 is connected to the output end of the last-stage convolutional leaky integrate-and-fire module 10 among the at least two convolutional leaky integrate-and-fire modules 10.
  • the number of fully connected layer modules 20 is at least two, and the at least two fully connected layer modules 20 are connected in sequence to perform multi-level linear processing on the feature values; the input end of the first-stage fully connected layer module 20 among the at least two fully connected layer modules 20 is connected to the output end of the convolutional leaky integrate-and-fire module 10.
  • That is, the video action recognition network may include a plurality of convolutional leaky integrate-and-fire modules 10 and one fully connected layer module 20, or one convolutional leaky integrate-and-fire module 10 and a plurality of fully connected layer modules 20, or a plurality of convolutional leaky integrate-and-fire modules 10 and a plurality of fully connected layer modules 20.
  • In this way, the video action recognition network adopts multiple cascaded groups of convolutional leaky integrate-and-fire modules 10 and fully connected layer modules 20 to perform deeper feature extraction and processing on the video images.
  • a target video clip is acquired; image frames in the target video clip are subjected to differential processing to obtain differential image information; the differential image information is input into a video action recognition network, and the video action recognition network extracts feature values of the differential image information and performs weighting processing on the feature values to determine the action recognition result of the target video segment.
  • the video action recognition network only needs to extract the feature value of the two-dimensional differential image information to obtain the difference feature between the image frames, and weight the difference feature to obtain the action recognition result of the target video segment. There is no need to process the three-dimensional data of the image frame, thereby reducing the calculation amount of the video action recognition network and improving the calculation speed in the video action recognition process.
  • the execution body may be a video action recognition device, or a control module in the video action recognition device for executing the video action recognition method.
  • the video action recognition device provided by the embodiment of the present application is described by taking the video action recognition device executing the video action recognition method as an example.
  • FIG. 7 is a schematic structural diagram of a video action recognition apparatus provided by an embodiment of the present application.
  • the video action recognition apparatus 700 may include: an acquisition module 701 , a difference module 702 , and an identification module 703 .
  • the acquisition module 701 is used to acquire the target video segment;
  • the difference module 702 is used to perform differential processing on the image frames in the target video segment to obtain a differential image information sequence, where the differential image information sequence includes at least one frame of differential image information;
  • the identification module 703 is configured to input the sequence of differential image information into a video action recognition network to determine the action recognition result of the target video segment.
  • the video action recognition network is constructed according to a spiking neural network, and the pixel values in the differential image information are binary data.
  • the video action recognition network includes a convolutional leaky integrate-and-fire module and a fully connected layer module, and the identification module 703 is specifically configured to: extract the feature values of the differential image information sequence through the convolutional leaky integrate-and-fire module, and perform weighting processing on the feature values through the fully connected layer module to determine the action recognition result of the target video segment.
  • the identification module 703 includes: a conversion unit, a differential processing unit, and a determination unit.
  • the conversion unit is used to convert the target video clips into image frames arranged in time series;
  • the differential processing unit is used to perform grayscale processing on the image frames respectively, and to perform differential processing on every L adjacent image frames in all the image frames after the grayscale processing, to obtain at least one frame of differential image, where L is an integer greater than or equal to 2; and the determining unit is configured to respectively generate the differential image information corresponding to each frame of differential image, so as to determine the differential image information sequence according to the at least one frame of differential image information, where the differential image information includes pixel enhancement information and pixel weakening information.
  • the differential image includes a plurality of differential values, wherein the determining unit includes: a first determining subunit, a first generating subunit, and a second generating subunit.
  • the first determination subunit is used to determine the pixel enhancement values and the pixel reduction values in the plurality of difference values; the first generation subunit is used to generate the pixel enhancement information according to the pixel enhancement values; and the second generation subunit is used to generate the pixel attenuation information according to the pixel attenuation values.
  • the difference image includes N difference values, the pixel enhancement information includes N pixel values corresponding to the N difference values respectively, and the pixel reduction information also includes N pixel values corresponding to the N difference values respectively, where N is an integer greater than 1.
  • the first generating subunit is configured to: determine the first pixel value corresponding to the pixel enhancement value as 1, and determine the pixel values other than the first pixel value among the N pixel values as 0, to obtain the pixel enhancement information; the second generating subunit is configured to: determine the second pixel value corresponding to the pixel weakening value as 1, and determine the pixel values other than the second pixel value among the N pixel values as 0, to obtain the pixel attenuation information.
  • the first determination subunit includes: a first determination subunit and a second determination subunit, where the first determination subunit is used to determine a difference value greater than or equal to the first threshold as the pixel enhancement value, and the second determination subunit is used to determine a difference value less than or equal to the second threshold as the pixel attenuation value.
  • the convolutional leaky integrate-and-fire module includes a convolutional leaky integrate-and-fire layer, a batch normalization layer, a linear rectification layer, and a global pooling layer.
  • the identification module 703 includes: a convolutional leaky integrate-and-fire unit, a batch normalization unit, a linear rectification unit, and a global pooling unit.
  • the convolutional leaky integrate-and-fire unit is configured to perform time-series convolution processing and leaky integrate-and-fire processing on the differential image information sequence through the convolutional leaky integrate-and-fire layer, so as to respectively extract the feature values corresponding to each piece of differential image information in the differential image information sequence, where the feature values corresponding to the differential image information include time-series feature values and spatial feature values, and the convolutional leaky integrate-and-fire layer adopts a spiking neural network model; the batch normalization unit is used to perform batch normalization processing on the feature values of the differential image information sequence through the batch normalization layer, where the feature values of the differential image information sequence include the time-series feature values and the spatial feature values corresponding to each piece of differential image information; the linear rectification unit is used to perform linear rectification processing on the batch-normalized feature values through the linear rectification layer; and the global pooling unit is used to perform average pooling on the linearly rectified feature values through the global pooling layer.
  • the fully connected layer module adopts an artificial neural network model.
  • the number of convolutional LIF modules is at least two, and the at least two convolutional LIF modules are connected in sequence to perform multi-stage feature extraction on the differential image information sequence; the input end of the fully connected layer module is connected to the output end of the last-stage convolutional LIF module among the at least two convolutional LIF modules.
  • the number of fully connected layer modules is at least two, and the at least two fully connected layer modules are connected in sequence to perform multi-stage linear processing on the feature values; the input end of the first-stage fully connected layer module among the at least two fully connected layer modules is connected to the output end of the convolutional LIF module.
  • the video action recognition apparatus provided by the embodiments of the present application has a simple model structure and processes a small amount of data during video action recognition, thereby reducing the amount of computation in the video action recognition process and improving computational efficiency.
  • the video action recognition device in this embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal.
  • the apparatus may be a mobile electronic device or a non-mobile electronic device.
  • the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, an in-vehicle electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), etc.
  • the non-mobile electronic device may be a network attached storage (NAS), a personal computer (PC), a television (TV), an automated teller machine, a self-service machine, or the like, which is not specifically limited in the embodiments of the present application.
  • the video action recognition device in the embodiment of the present application may be a device with an operating system.
  • the operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
  • the video action recognition apparatus provided in the embodiments of the present application can implement each process implemented by any of the foregoing video action recognition method embodiments; to avoid repetition, details are not described here again.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • an embodiment of the present application further provides an electronic device 800, including a processor 801 and a memory 802, where the memory 802 stores a program or instruction that can be run on the processor 801; when the program or instruction is executed by the processor 801, each process of any of the above video action recognition method embodiments, or each process of the above training method embodiments, is implemented, and the same technical effect can be achieved; to avoid repetition, details are not described here again.
  • the electronic devices in the embodiments of the present application include the aforementioned mobile electronic devices and non-mobile electronic devices.
  • Embodiments of the present application further provide a readable storage medium, where a program or instruction is stored on the readable storage medium; when the program or instruction is executed by a processor, each process of the foregoing video action recognition method embodiments or of the foregoing training method embodiments is implemented, and the same technical effect can be achieved; to avoid repetition, details are not described here again.
  • the processor is the processor in the electronic device described in the foregoing embodiments.
  • the readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.
  • An embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instruction to implement each process of the above video action recognition method embodiments.
  • the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, a system-on-a-chip, or the like.
  • the method of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or a CD-ROM) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the various embodiments of this application.
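
For illustration, the following minimal sketches show one way the processing described above could be realized. The first sketch covers the differential processing and the threshold-based pixel enhancement / pixel weakening information; the function names, threshold values, and the two-channel packing of the result are assumptions made for the example only, not details fixed by the embodiments.

```python
import numpy as np

def to_grayscale(frame):
    # frame: H x W x 3 uint8 RGB image; ITU-R BT.601 luma weights.
    return frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)

def differential_image_info(frames, t_enhance=10.0, t_weaken=-10.0):
    """Build a differential image information sequence from consecutive frames.

    frames: list of H x W x 3 uint8 image frames from the target video clip.
    Returns an array of shape (len(frames) - 1, 2, H, W): channel 0 is the
    pixel enhancement map, channel 1 is the pixel weakening map.
    The thresholds t_enhance (first threshold) and t_weaken (second threshold)
    are placeholder values chosen for the example.
    """
    gray = [to_grayscale(f) for f in frames]
    sequence = []
    for prev, curr in zip(gray[:-1], gray[1:]):
        diff = curr - prev                                # per-pixel difference values
        enhance = (diff >= t_enhance).astype(np.float32)  # 1 where brightness increased enough
        weaken = (diff <= t_weaken).astype(np.float32)    # 1 where brightness decreased enough
        sequence.append(np.stack([enhance, weaken], axis=0))
    return np.stack(sequence, axis=0)
```

The second sketch outlines one possible reading of the feature-extraction pipeline: a convolutional LIF layer unrolled over the time steps of the differential image information sequence, followed by batch normalization, linear rectification, global average pooling, and a fully connected layer. The membrane decay, firing threshold, channel counts, and class count are assumptions; a trainable spiking network would additionally need surrogate gradients, which are omitted here.

```python
import torch
import torch.nn as nn

class ConvLIFBlock(nn.Module):
    """Convolutional leaky integrate-and-fire (LIF) feature extraction:
    Conv2d -> LIF dynamics over time -> BatchNorm -> ReLU -> global average pooling."""

    def __init__(self, in_channels, out_channels, decay=0.5, v_threshold=1.0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.decay = decay
        self.v_threshold = v_threshold

    def forward(self, x):
        # x: (T, B, C, H, W) -- differential image information sequence over T time steps
        v, spikes = None, []
        for t in range(x.shape[0]):
            i_t = self.conv(x[t])                           # spatial convolution at step t
            v = i_t if v is None else self.decay * v + i_t  # leaky integration of the input
            spike = (v >= self.v_threshold).float()         # fire where the threshold is reached
            v = v * (1.0 - spike)                           # reset the neurons that fired
            spikes.append(spike)
        feats = torch.stack([torch.relu(self.bn(s)) for s in spikes], dim=0)  # BN + linear rectification
        return feats.mean(dim=(0, 3, 4))                    # global average pooling -> (B, out_channels)

class VideoActionRecognitionNetSketch(nn.Module):
    """One Conv-LIF block followed by a fully connected classification layer."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.block = ConvLIFBlock(in_channels=2, out_channels=32)
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.fc(self.block(x))

# Hypothetical usage: 16 two-channel differential frames of size 64x64, batch of 4.
# logits = VideoActionRecognitionNetSketch(num_classes=5)(torch.randn(16, 4, 2, 64, 64))
```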

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a video action recognition method and apparatus, an electronic device, and a storage medium, relating to the technical field of neural networks. The video action recognition method comprises the steps of: acquiring a target video clip; performing differential processing on image frames in the target video clip so as to obtain a differential image information sequence, the differential image information sequence comprising at least one frame of differential image information; and inputting the differential image information sequence into a video action recognition network so as to determine an action recognition result of the target video clip. According to the embodiments of the present application, the computation speed during video action recognition can be increased.
PCT/CN2021/132696 2020-11-26 2021-11-24 Procédé et appareil de reconnaissance d'action vidéo, dispositif électronique et support de stockage WO2022111506A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011351589.1 2020-11-26
CN202011351589.1A CN112464807A (zh) 2020-11-26 2020-11-26 视频动作识别方法、装置、电子设备和存储介质

Publications (1)

Publication Number Publication Date
WO2022111506A1 true WO2022111506A1 (fr) 2022-06-02

Family

ID=74808033

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/132696 WO2022111506A1 (fr) 2020-11-26 2021-11-24 Procédé et appareil de reconnaissance d'action vidéo, dispositif électronique et support de stockage

Country Status (2)

Country Link
CN (1) CN112464807A (fr)
WO (1) WO2022111506A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461468A (zh) * 2022-01-21 2022-05-10 电子科技大学 一种基于人工神经网络的微处理器应用场景识别方法
CN115171221A (zh) * 2022-09-06 2022-10-11 上海齐感电子信息科技有限公司 动作识别方法及动作识别系统
CN116311003A (zh) * 2023-05-23 2023-06-23 澳克多普有限公司 一种基于双通道加载机制的视频检测方法及系统
CN116614666A (zh) * 2023-07-17 2023-08-18 微网优联科技(成都)有限公司 一种基于ai摄像头特征提取系统及方法

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464807A (zh) * 2020-11-26 2021-03-09 北京灵汐科技有限公司 视频动作识别方法、装置、电子设备和存储介质
CN112818958B (zh) * 2021-03-24 2022-07-19 苏州科达科技股份有限公司 动作识别方法、装置及存储介质
CN113052091A (zh) * 2021-03-30 2021-06-29 中国北方车辆研究所 一种基于卷积神经网络的动作识别方法
CN113111842B (zh) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 一种动作识别方法、装置、设备及计算机可读存储介质
CN113239855B (zh) * 2021-05-27 2023-04-18 抖音视界有限公司 一种视频检测方法、装置、电子设备以及存储介质
CN113269264B (zh) * 2021-06-04 2024-07-26 北京灵汐科技有限公司 目标识别方法、电子设备和计算机可读介质
CN114333065A (zh) * 2021-12-31 2022-04-12 济南博观智能科技有限公司 一种应用于监控视频的行为识别方法、系统及相关装置
CN114466153B (zh) * 2022-04-13 2022-09-09 深圳时识科技有限公司 自适应脉冲生成方法、装置、类脑芯片和电子设备
CN114495178B (zh) * 2022-04-14 2022-06-21 深圳时识科技有限公司 脉冲序列随机化方法、装置、类脑芯片和电子设备
CN115379300A (zh) * 2022-07-27 2022-11-22 国能龙源环保有限公司 基于ai识别算法规范安装炸药包的辅助方法及辅助装置
CN115908954B (zh) * 2023-03-01 2023-07-28 四川省公路规划勘察设计研究院有限公司 基于人工智能的地质灾害隐患识别系统、方法及电子设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190197300A1 (en) * 2017-08-03 2019-06-27 Gyrfalcon Technology Inc. Motion Recognition Via A Two-dimensional Symbol Having Multiple Ideograms Contained Therein
CN110309720A (zh) * 2019-05-27 2019-10-08 北京奇艺世纪科技有限公司 视频检测方法、装置、电子设备和计算机可读介质
CN110503081A (zh) * 2019-08-30 2019-11-26 山东师范大学 基于帧间差分的暴力行为检测方法、系统、设备及介质
CN110555523A (zh) * 2019-07-23 2019-12-10 中建三局智能技术有限公司 一种基于脉冲神经网络的短程跟踪方法及系统
CN111539290A (zh) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 视频动作识别方法、装置、电子设备及存储介质
CN112464807A (zh) * 2020-11-26 2021-03-09 北京灵汐科技有限公司 视频动作识别方法、装置、电子设备和存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10387774B1 (en) * 2014-01-30 2019-08-20 Hrl Laboratories, Llc Method for neuromorphic implementation of convolutional neural networks
US11651199B2 (en) * 2017-10-09 2023-05-16 Intel Corporation Method, apparatus and system to perform action recognition with a spiking neural network
EP3789909A1 (fr) * 2019-09-06 2021-03-10 GrAl Matter Labs S.A.S. Classification d'images dans une séquence de trames


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461468A (zh) * 2022-01-21 2022-05-10 电子科技大学 一种基于人工神经网络的微处理器应用场景识别方法
CN115171221A (zh) * 2022-09-06 2022-10-11 上海齐感电子信息科技有限公司 动作识别方法及动作识别系统
CN115171221B (zh) * 2022-09-06 2022-12-06 上海齐感电子信息科技有限公司 动作识别方法及动作识别系统
CN116311003A (zh) * 2023-05-23 2023-06-23 澳克多普有限公司 一种基于双通道加载机制的视频检测方法及系统
CN116614666A (zh) * 2023-07-17 2023-08-18 微网优联科技(成都)有限公司 一种基于ai摄像头特征提取系统及方法
CN116614666B (zh) * 2023-07-17 2023-10-20 微网优联科技(成都)有限公司 一种基于ai摄像头特征提取系统及方法

Also Published As

Publication number Publication date
CN112464807A (zh) 2021-03-09

Similar Documents

Publication Publication Date Title
WO2022111506A1 (fr) Procédé et appareil de reconnaissance d'action vidéo, dispositif électronique et support de stockage
CN111639692B (zh) 一种基于注意力机制的阴影检测方法
US10599958B2 (en) Method and system for classifying an object-of-interest using an artificial neural network
CN113378600B (zh) 一种行为识别方法及系统
WO2021051545A1 (fr) Procédé et appareil de détermination d'action de chute sur la base d'un modèle d'identification de comportement, dispositif informatique et support d'informations
CN110717411A (zh) 一种基于深层特征融合的行人重识别方法
CN111062263B (zh) 手部姿态估计的方法、设备、计算机设备和存储介质
US20200285859A1 (en) Video summary generation method and apparatus, electronic device, and computer storage medium
CN110222718B (zh) 图像处理的方法及装置
Liu et al. Real-time facial expression recognition based on cnn
CN115240121A (zh) 一种用于增强行人局部特征的联合建模方法和装置
CN109977832B (zh) 一种图像处理方法、装置及存储介质
EP3874404A1 (fr) Reconnaissance vidéo à l'aide de modalités multiples
Gao et al. PSGCNet: A pyramidal scale and global context guided network for dense object counting in remote-sensing images
US12106541B2 (en) Systems and methods for contrastive pretraining with video tracking supervision
Niu et al. Boundary-aware RGBD salient object detection with cross-modal feature sampling
Hong et al. Characterizing subtle facial movements via Riemannian manifold
CN114333062A (zh) 基于异构双网络和特征一致性的行人重识别模型训练方法
CN117854160A (zh) 一种基于人工多模态和细粒度补丁的人脸活体检测方法及系统
CN116129228B (zh) 图像匹配模型的训练方法、图像匹配方法及其装置
Wang et al. Multi-scale multi-modal micro-expression recognition algorithm based on transformer
Wang et al. Fusion representation learning for foreground moving object detection
CN112487927B (zh) 一种基于物体关联注意力的室内场景识别实现方法及系统
CN115220574A (zh) 位姿确定方法及装置、计算机可读存储介质和电子设备
CN110378172B (zh) 信息生成方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21897007

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.09.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21897007

Country of ref document: EP

Kind code of ref document: A1