WO2021180030A1 - Behavior recognition method and system, electronic device, and computer-readable storage medium - Google Patents

Behavior recognition method and system, electronic device, and computer-readable storage medium

Info

Publication number
WO2021180030A1
Authority
WO
WIPO (PCT)
Prior art keywords
time
vector
layer
feature
series convolution
Prior art date
Application number
PCT/CN2021/079530
Other languages
English (en)
French (fr)
Inventor
吴臻志
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京灵汐科技有限公司 filed Critical 北京灵汐科技有限公司
Priority to US17/790,694 priority Critical patent/US20230042187A1/en
Publication of WO2021180030A1 publication Critical patent/WO2021180030A1/zh


Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/25 Fusion techniques
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06V10/40 Extraction of image or video features
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the technical field of behavior recognition, in particular to a behavior recognition method, a behavior recognition system, an electronic device and a computer-readable storage medium.
  • Behavior recognition based on video data is widely used in various fields.
  • However, existing behavior recognition for video data suffers from problems such as a large amount of calculation, a large number of weights, and low recognition accuracy.
  • The purpose of the present invention is to provide a behavior recognition method, a behavior recognition system, an electronic device and a computer-readable storage medium, which can achieve the convolution effect of an artificial neural network (ANN, Artificial Neural Network) while reducing the amount of calculation and the number of weights, and which can also link multiple pictures so as to process the timing information between the pictures, thereby improving recognition accuracy.
  • To this end, the present invention provides a behavior recognition method, which includes: cutting video data into multiple video segments; extracting frames from each video segment to obtain multiple frame images, and extracting optical flow from the frame images of each video segment to obtain the optical flow images of each video segment; performing feature extraction on the frame images and optical flow images of each video segment to obtain the feature map of the frame image and the feature map of the optical flow image of each video segment; performing spatio-temporal convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment to determine the spatial prediction result and the temporal prediction result of each video segment; fusing the spatial prediction results of all video segments to obtain a spatial fusion result, and fusing the temporal prediction results of all video segments to obtain a temporal fusion result; and performing dual-stream fusion on the spatial fusion result and the temporal fusion result to obtain a behavior recognition result.
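As a purely illustrative sketch of the overall flow described above, the following Python function strings the steps together. It is not the patented implementation: the per-step callables (segment splitting, frame sampling, optical-flow extraction, feature extraction, and per-segment prediction) are assumed to be supplied by the caller, and direct averaging plus a 0.6/0.4 weighting are used only as the example fusion choices mentioned later in this document.

```python
import numpy as np

def recognize_behavior(video, split_video, sample_frames, extract_optical_flow,
                       extract_features, predict, spatial_w=0.6, temporal_w=0.4):
    """Sketch of the dual-stream pipeline; all callables are caller-supplied stand-ins."""
    spatial_preds, temporal_preds = [], []
    for segment in split_video(video):                    # cut the video into segments
        frames = sample_frames(segment)                   # N1 frame images per segment
        flows = extract_optical_flow(frames)              # N1 optical-flow images
        spatial_preds.append(predict(extract_features(frames)))   # spatial prediction
        temporal_preds.append(predict(extract_features(flows)))   # temporal prediction
    spatial_fused = np.mean(spatial_preds, axis=0)        # fuse segment predictions (direct average)
    temporal_fused = np.mean(temporal_preds, axis=0)
    return spatial_w * spatial_fused + temporal_w * temporal_fused   # dual-stream fusion
```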
  • The spatio-temporal convolution processing is performed on the feature map of the frame image and the feature map of the optical flow image of each video segment to determine the spatial prediction result and the temporal prediction result of each video segment, which includes: performing n time-series feature extractions on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain a first feature vector, where n is a positive integer; performing matrix transformation processing on the first feature vector to obtain a second feature vector; performing time-series fully connected processing on the second feature vector to obtain a third feature vector; and determining the spatial prediction result and the temporal prediction result of each video segment according to the third feature vector.
  • When n=1, performing the time-series feature extraction on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; and determining the first intermediate feature vector as the first feature vector.
  • When n=2, performing the time-series feature extraction n times on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature maps to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; performing time-series convolution processing on the first intermediate feature vector to obtain a second time-series convolution vector; pooling the second time-series convolution vector to obtain a second intermediate feature vector; and determining the second intermediate feature vector as the first feature vector.
  • When n>2, performing the time-series feature extraction n times on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature maps to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; performing time-series convolution processing on the first intermediate feature vector to obtain a second time-series convolution vector; pooling the i-th time-series convolution vector to obtain an i-th intermediate feature vector; performing time-series convolution processing on the i-th intermediate feature vector to obtain an (i+1)-th time-series convolution vector; and pooling the (i+1)-th time-series convolution vector to obtain an (i+1)-th intermediate feature vector, where i is a positive integer taken sequentially from 2 to n-1, until an n-th intermediate feature vector is obtained; the n-th intermediate feature vector is determined as the first feature vector.
  • The frame extraction processing for each video segment includes: extracting frames from each video segment at a certain interval to obtain N1 frames of images, where the interval is the total number of frames of each video segment divided by N1, and N1 is an integer greater than 1.
  • Extracting optical flow from the multiple frame images of each video segment includes: computing optical flow between each pair of adjacent frames among the extracted N1 frame images to obtain N1-1 optical flows; and copying the optical flow between the second frame and the first frame as the first optical flow, which is merged with the N1-1 optical flows into N1 optical flows.
  • the spatiotemporal convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment is realized by a neural network, and the method further includes: training the neural network according to a training set.
  • The neural network includes: n Block blocks, a Reshape layer, a LIF layer, a fully connected layer, and a Softmax layer; the Block block includes a cascaded ConvLIF layer and a pooling layer, n is a positive integer and n ≥ 1, and when n > 1, the n Block blocks are cascaded.
  • Performing the spatio-temporal convolution processing on the feature maps of the frame image and the optical flow image of each video segment through the neural network includes: performing n time-series feature extractions on the frame image and the optical flow image of each video segment through the n Block blocks to obtain a first feature vector; performing matrix transformation processing on the first feature vector through the Reshape layer to obtain a second feature vector; performing time-series fully connected processing on the second feature vector through the LIF layer and the fully connected layer to obtain a third feature vector; and determining the spatial prediction result and the temporal prediction result of each video segment from the third feature vector through the Softmax layer.
  • When n=1, performing the time-series feature extraction on the frame image and the optical flow image of each video segment through the n Block blocks to obtain the first feature vector includes: performing time-series convolution processing on the feature maps of the frame image and the optical flow image through the ConvLIF layer to obtain a first time-series convolution vector; pooling the first time-series convolution vector through the pooling layer to obtain a first intermediate feature vector; and determining the first intermediate feature vector as the first feature vector.
  • When n>2, performing the time-series feature extraction n times on the feature map of the frame image and the feature map of the optical flow image of each video segment through the n Block blocks to obtain the first feature vector includes: performing time-series convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment through the ConvLIF layer to obtain a first time-series convolution vector; pooling the first time-series convolution vector through the pooling layer to obtain a first intermediate feature vector; performing time-series convolution processing on the first intermediate feature vector through the ConvLIF layer to obtain a second time-series convolution vector; pooling the i-th time-series convolution vector through the pooling layer to obtain an i-th intermediate feature vector; performing time-series convolution processing on the i-th intermediate feature vector through the ConvLIF layer to obtain an (i+1)-th time-series convolution vector; and pooling the (i+1)-th time-series convolution vector through the pooling layer to obtain an (i+1)-th intermediate feature vector, where i is taken sequentially from 2 to n-1, until an n-th intermediate feature vector is obtained; the n-th intermediate feature vector is determined as the first feature vector.
  • The Block block further includes a BN layer cascaded between the ConvLIF layer and the pooling layer.
  • Pooling the first time-series convolution vector through the pooling layer to obtain the first intermediate feature vector includes: normalizing the first time-series convolution vector through the BN layer; and pooling the normalized first time-series convolution vector through the pooling layer.
  • Pooling the second time-series convolution vector through the pooling layer to obtain the second intermediate feature vector includes: normalizing the second time-series convolution vector through the BN layer; and pooling the normalized second time-series convolution vector through the pooling layer.
  • Performing time-series convolution processing to obtain the second time-series convolution vector includes: normalizing the first time-series convolution vector through the BN layer; pooling the normalized first time-series convolution vector through the pooling layer to obtain the first intermediate feature vector; and performing time-series convolution processing on the first intermediate feature vector through the ConvLIF layer to obtain the second time-series convolution vector.
  • Performing time-series convolution processing on the i-th intermediate feature vector to obtain the (i+1)-th time-series convolution vector includes: normalizing the i-th time-series convolution vector through the BN layer; pooling the normalized i-th time-series convolution vector through the pooling layer to obtain the i-th intermediate feature vector; and performing time-series convolution processing on the i-th intermediate feature vector through the ConvLIF layer to obtain the (i+1)-th time-series convolution vector.
  • The LIF layer is used to: pass the input value X_t at time t through a fully connected calculation to obtain I_t; determine the membrane potential value at time t according to I_t and the biological voltage value at time t-1; determine the output value F_t at time t according to the membrane potential value at time t and the emission threshold V_th; determine whether to reset the membrane potential according to the output value F_t at time t, and determine the reset membrane potential value according to the reset voltage value V_reset; and determine the biological voltage value at time t according to the reset membrane potential value.
  • The output value F_t at time t is used as the input of the next layer cascaded with the LIF layer, and the biological voltage value at time t is used as an input for calculating the membrane potential value at time t+1.
  • The ConvLIF layer is used to: pass the input value X_t at time t through a convolution operation or a fully connected calculation to obtain I_t; determine the membrane potential value at time t according to I_t and the biological voltage value at time t-1; determine the output value F_t at time t according to the membrane potential value at time t and the emission threshold V_th; determine whether to reset the membrane potential according to the output value F_t at time t, and determine the reset membrane potential value according to the reset voltage value V_reset; and determine the biological voltage value at time t according to the reset membrane potential value.
  • The output value F_t at time t is used as the input of the next layer cascaded with the ConvLIF layer, and the biological voltage value at time t is used as an input for calculating the membrane potential value at time t+1.
  • Determining the output value at time t according to the membrane potential value at time t and the emission threshold V_th includes: if the membrane potential value at time t is greater than or equal to the emission threshold V_th, the output value at time t is determined to be 1; if the membrane potential value at time t is less than the emission threshold V_th, the output value at time t is determined to be 0.
  • Determining the biological voltage value at time t according to the reset membrane potential value includes: calculating the reset membrane potential value through a Leak activation function to determine the biological voltage value at time t.
  • When the spatial prediction results of all video segments are fused and when the temporal prediction results of all video segments are fused, one of direct averaging, linear weighting, directly taking the maximum, and TOP-K weighting is applied to the prediction results of all video segments.
  • The spatial fusion result and the temporal fusion result are fused by weighting.
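The following numpy sketch illustrates the segment-level fusion options listed above. It is an illustration rather than the patented implementation; in particular, the TOP-K branch reflects one plausible reading of "TOP-K weighting" (averaging the K most confident segment predictions).

```python
import numpy as np

def fuse_segment_predictions(preds, method="average", weights=None, k=2):
    """Fuse per-segment class-probability vectors of shape [num_segments, num_classes]."""
    preds = np.asarray(preds, dtype=np.float64)
    if method == "average":                  # direct averaging
        return preds.mean(axis=0)
    if method == "linear":                   # linear weighting over segments
        w = np.asarray(weights, dtype=np.float64).reshape(-1, 1)
        return (w * preds).sum(axis=0) / w.sum()
    if method == "max":                      # directly take the maximum per class
        return preds.max(axis=0)
    if method == "topk":                     # average the K most confident segments (assumed reading)
        confidence = preds.max(axis=1)
        return preds[np.argsort(confidence)[-k:]].mean(axis=0)
    raise ValueError(f"unknown fusion method: {method}")
```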
  • Correspondingly, the present invention also provides a behavior recognition system that adopts the above behavior recognition method and includes: a data preprocessing module, configured to cut video data into multiple video segments, extract frames from each video segment to obtain multiple frame images, and extract optical flow from the multiple frame images of each video segment to obtain multiple optical flow images of each video segment; a feature extraction module, configured to perform feature extraction on the frame images and the optical flow images of each video segment to obtain the feature map of the frame image and the feature map of the optical flow image of each video segment; a network recognition module, configured to perform spatio-temporal convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment to determine the spatial prediction result and the temporal prediction result of each video segment; a network fusion module, configured to fuse the spatial prediction results of all video segments to obtain a spatial fusion result, and to fuse the temporal prediction results of all video segments to obtain a temporal fusion result; and a dual-stream fusion module, configured to perform dual-stream fusion on the spatial fusion result and the temporal fusion result to obtain a behavior recognition result.
  • The network recognition module performs spatio-temporal convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment to determine the spatial prediction result and the temporal prediction result of each video segment, which includes: performing n time-series feature extractions on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain a first feature vector, where n is a positive integer; performing matrix transformation processing on the first feature vector to obtain a second feature vector; performing time-series fully connected processing on the second feature vector to obtain a third feature vector; and determining the spatial prediction result and the temporal prediction result of each video segment according to the third feature vector.
  • When n=1, the network recognition module performing the time-series feature extraction on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; and determining the first intermediate feature vector as the first feature vector.
  • When n=2, performing the time-series feature extraction n times on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature maps to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; performing time-series convolution processing on the first intermediate feature vector to obtain a second time-series convolution vector; pooling the second time-series convolution vector to obtain a second intermediate feature vector; and determining the second intermediate feature vector as the first feature vector.
  • When n>2, performing the time-series feature extraction n times on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature maps to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; performing time-series convolution processing on the first intermediate feature vector to obtain a second time-series convolution vector; pooling the i-th time-series convolution vector to obtain an i-th intermediate feature vector; performing time-series convolution processing on the i-th intermediate feature vector to obtain an (i+1)-th time-series convolution vector; and pooling the (i+1)-th time-series convolution vector to obtain an (i+1)-th intermediate feature vector, where i is a positive integer taken sequentially from 2 to n-1, until an n-th intermediate feature vector is obtained; the n-th intermediate feature vector is determined as the first feature vector.
  • The frame extraction processing performed by the data preprocessing module on each video segment includes: extracting frames from each video segment at a certain interval to obtain N1 frames of images, where the interval is the total number of frames of the video segment divided by N1, and N1 is an integer greater than 1.
  • The extraction of optical flow by the data preprocessing module from the multiple frame images of each video segment includes: computing optical flow between each pair of adjacent frames among the extracted N1 frame images to obtain N1-1 optical flows; and copying the optical flow between the second frame and the first frame as the first optical flow, which is merged with the N1-1 optical flows into N1 optical flows.
  • the network recognition module performs spatio-temporal convolution processing on the feature maps of the frame image and the optical flow image respectively through a neural network, and the system further includes: training the neural network according to a training set.
  • The neural network includes: n Block blocks, a Reshape layer, a LIF layer, a fully connected layer, and a Softmax layer; the Block block includes a cascaded ConvLIF layer and a pooling layer, n is a positive integer and n ≥ 1, and when n > 1, the n Block blocks are cascaded.
  • Performing the spatio-temporal convolution processing on the feature maps of the frame image and the optical flow image of each video segment through the neural network includes: performing n time-series feature extractions on the frame image and the optical flow image of each video segment through the n Block blocks to obtain a first feature vector; performing matrix transformation processing on the first feature vector through the Reshape layer to obtain a second feature vector; performing time-series fully connected processing on the second feature vector through the LIF layer and the fully connected layer to obtain a third feature vector; and determining the spatial prediction result and the temporal prediction result of each video segment from the third feature vector through the Softmax layer.
  • When n=1, performing the time-series feature extraction on the frame image and the optical flow image of each video segment through the n Block blocks to obtain the first feature vector includes: performing time-series convolution processing on the feature maps of the frame image and the optical flow image through the ConvLIF layer to obtain a first time-series convolution vector; pooling the first time-series convolution vector through the pooling layer to obtain a first intermediate feature vector; and determining the first intermediate feature vector as the first feature vector.
  • When n>2, performing the time-series feature extraction n times on the feature map of the frame image and the feature map of the optical flow image of each video segment through the n Block blocks to obtain the first feature vector includes: performing time-series convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment through the ConvLIF layer to obtain a first time-series convolution vector; pooling the first time-series convolution vector through the pooling layer to obtain a first intermediate feature vector; performing time-series convolution processing on the first intermediate feature vector through the ConvLIF layer to obtain a second time-series convolution vector; pooling the i-th time-series convolution vector through the pooling layer to obtain an i-th intermediate feature vector; performing time-series convolution processing on the i-th intermediate feature vector through the ConvLIF layer to obtain an (i+1)-th time-series convolution vector; and pooling the (i+1)-th time-series convolution vector through the pooling layer to obtain an (i+1)-th intermediate feature vector, where i is taken sequentially from 2 to n-1, until an n-th intermediate feature vector is obtained; the n-th intermediate feature vector is determined as the first feature vector.
  • The Block block further includes a BN layer cascaded between the ConvLIF layer and the pooling layer.
  • Pooling the first time-series convolution vector through the pooling layer to obtain the first intermediate feature vector includes: normalizing the first time-series convolution vector through the BN layer; and pooling the normalized first time-series convolution vector through the pooling layer.
  • Pooling the second time-series convolution vector through the pooling layer to obtain the second intermediate feature vector includes: normalizing the second time-series convolution vector through the BN layer; and pooling the normalized second time-series convolution vector through the pooling layer.
  • Performing time-series convolution processing to obtain the second time-series convolution vector includes: normalizing the first time-series convolution vector through the BN layer; pooling the normalized first time-series convolution vector through the pooling layer to obtain the first intermediate feature vector; and performing time-series convolution processing on the first intermediate feature vector through the ConvLIF layer to obtain the second time-series convolution vector.
  • Performing time-series convolution processing on the i-th intermediate feature vector to obtain the (i+1)-th time-series convolution vector includes: normalizing the i-th time-series convolution vector through the BN layer; pooling the normalized i-th time-series convolution vector through the pooling layer to obtain the i-th intermediate feature vector; and performing time-series convolution processing on the i-th intermediate feature vector through the ConvLIF layer to obtain the (i+1)-th time-series convolution vector.
  • The LIF layer is used to: pass the input value X_t at time t through a fully connected calculation to obtain I_t; determine the membrane potential value at time t according to I_t and the biological voltage value at time t-1; determine the output value F_t at time t according to the membrane potential value at time t and the emission threshold V_th; determine whether to reset the membrane potential according to the output value F_t at time t, and determine the reset membrane potential value according to the reset voltage value V_reset; and determine the biological voltage value at time t according to the reset membrane potential value.
  • The output value F_t at time t is used as the input of the next layer cascaded with the LIF layer, and the biological voltage value at time t is used as an input for calculating the membrane potential value at time t+1.
  • The ConvLIF layer is used to: pass the input value X_t at time t through a convolution operation or a fully connected calculation to obtain I_t; determine the membrane potential value at time t according to I_t and the biological voltage value at time t-1; determine the output value F_t at time t according to the membrane potential value at time t and the emission threshold V_th; determine whether to reset the membrane potential according to the output value F_t at time t, and determine the reset membrane potential value according to the reset voltage value V_reset; and determine the biological voltage value at time t according to the reset membrane potential value.
  • The output value F_t at time t is used as the input of the next layer cascaded with the ConvLIF layer, and the biological voltage value at time t is used as an input for calculating the membrane potential value at time t+1.
  • Determining the output value at time t according to the membrane potential value at time t and the emission threshold V_th includes: if the membrane potential value at time t is greater than or equal to the emission threshold V_th, the output value at time t is determined to be 1; if the membrane potential value at time t is less than the emission threshold V_th, the output value at time t is determined to be 0.
  • Determining the biological voltage value at time t according to the reset membrane potential value includes: calculating the reset membrane potential value through a Leak activation function to determine the biological voltage value at time t.
  • When the network fusion module fuses the spatial prediction results of all video segments and fuses the temporal prediction results of all video segments, one of direct averaging, linear weighting, directly taking the maximum, and TOP-K weighting is applied to the prediction results of all video segments.
  • When the dual-stream fusion module performs dual-stream fusion on the spatial fusion result and the temporal fusion result, weighted fusion is adopted for the spatial fusion result and the temporal fusion result.
  • the present invention also provides an electronic device, including a memory and a processor, the memory is used to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to realize the behavior recognition method.
  • the present invention also provides a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the behavior recognition method.
  • In this way, the convolution effect of an ANN can be achieved while the amount of calculation and the number of weights are reduced, which greatly lowers the requirements on computing equipment and correspondingly reduces the size of the network and the required storage space.
  • FIG. 1 is a schematic flowchart of a behavior recognition method according to an exemplary embodiment of the present disclosure
  • Fig. 2 is a flowchart of a behavior recognition method provided by an exemplary embodiment of the present disclosure
  • Fig. 3 is a structural diagram of a neural network according to an exemplary embodiment of the present disclosure.
  • Fig. 4 is a working flow chart of the ConvLIF layer and the LIF layer in the neural network according to an exemplary embodiment of the present disclosure
  • Fig. 5 is a block diagram of a behavior recognition system according to an exemplary embodiment of the present disclosure.
  • Any directional indication is only used to explain the relative positional relationship, movement, etc. of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
  • the terms used are for illustrative purposes only, and are not intended to limit the scope of the present disclosure.
  • The terms "comprising" and/or "including" are used to specify the presence of the described elements, steps, operations and/or components, but do not exclude the presence or addition of one or more other elements, steps, operations and/or components.
  • The terms "first", "second", etc. may be used to describe various elements; they do not represent an order and do not limit these elements, but are only used to distinguish one element from another.
  • "Plurality" means two or more.
  • In the behavior recognition method of the present disclosure, a series of short clips are sparsely sampled from the entire video, each video clip gives its own preliminary prediction of the behavior category, the predictions of these clips are fused to obtain a video-level prediction result, and the predictions of all modalities (spatial and temporal) are then fused to produce the final prediction result. As shown in Figure 1, the method includes:
  • The video data is equally divided into N video segments.
  • For example, it is equally divided into 4 segments.
  • Extracting frames from each video segment includes: extracting frames from each video segment at a certain interval to obtain N1 (for example, 40) frames of images of size [320, 240, 3], where the interval is the total number of frames of the video segment divided by N1 (for example, 40), rounding down by discarding the remainder. N1 is an integer greater than 1, and the present disclosure does not limit the value of N1.
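A minimal sketch of this fixed-interval sampling, assuming the segment is already available as a list or array of decoded frames; the function name and its default N1=40 simply mirror the example above.

```python
def sample_frames(segment_frames, n1=40):
    """Sample n1 frames at a fixed interval; interval = total frames // n1 (remainder dropped)."""
    interval = max(len(segment_frames) // n1, 1)   # guard against very short segments
    return segment_frames[::interval][:n1]         # keep exactly n1 frames
```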
  • Optical flow is a way of using the temporal changes of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, so as to calculate the motion information of objects between adjacent frames.
  • Extracting optical flow from the extracted frame images includes: computing optical flow between each pair of adjacent frames among the extracted N1 (for example, 40) frame images to obtain N1-1 (for example, 39) optical flows; and copying the optical flow between the second frame and the first frame as the first optical flow, which is merged with the N1-1 (for example, 39) optical flows into N1 (for example, 40) optical flows.
  • the Brox algorithm is used when calculating the optical flow.
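An illustrative implementation of this step is sketched below. The text above specifies the Brox algorithm; since Brox is not part of core OpenCV, Farneback optical flow is used here purely as a stand-in to show the bookkeeping (N1-1 flows plus a duplicated first flow), not as the patented choice.

```python
import cv2
import numpy as np

def extract_optical_flow(frames):
    """frames: list of N1 BGR uint8 images; returns N1 dense flow fields of shape (H, W, 2)."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = [cv2.calcOpticalFlowFarneback(grays[i], grays[i + 1], None,
                                          0.5, 3, 15, 3, 5, 1.2, 0)
             for i in range(len(grays) - 1)]        # N1 - 1 flows between adjacent frames
    flows.insert(0, flows[0].copy())                # duplicate the frame1->frame2 flow as the first flow
    return np.stack(flows)                          # N1 flow images in total
```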
  • S2 Perform feature extraction on the frame image and optical flow image of each video segment, respectively, to obtain the feature map of the frame image and the feature map of the optical flow image of each video segment.
  • Specifically, the Inception V3 model pre-trained on ImageNet is used to process the frame images and optical flow images and extract image features, so as to obtain the feature map of the frame image and the feature map of the optical flow image of each video segment.
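A hedged sketch of this feature-extraction step using the Keras Inception V3 application; dropping the classification head and the exact preprocessing are assumptions made for illustration, not details stated in the text.

```python
import numpy as np
import tensorflow as tf

# ImageNet-pretrained Inception V3 without its classification head, used as a per-image feature extractor.
backbone = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

def extract_features(images):
    """images: float array [N1, H, W, 3]; returns per-image feature maps [N1, h, w, 2048]."""
    x = tf.keras.applications.inception_v3.preprocess_input(
        np.asarray(images, dtype=np.float32))
    return backbone(x, training=False).numpy()
```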
  • S3 Perform spatio-temporal convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment, respectively, and determine the spatial prediction result (i.e., the class probability distribution of the spatial stream) and the temporal prediction result (i.e., the class probability distribution of the temporal stream) of each video segment.
  • performing spatio-temporal convolution processing on the feature maps of the frame image and the optical flow image respectively to determine the spatial prediction result and the temporal prediction result of each video segment includes:
  • the spatial prediction result and the temporal prediction result of each video segment are determined.
  • the time sequence feature extraction may refer to performing feature extraction processing with time sequence on the feature map.
  • Matrix transformation processing refers to flattening the last few dimensions of a matrix into one dimension.
  • Time-series fully connected processing refers to fully connected processing that carries timing information. In this way, multiple pictures can be processed at a time, which not only guarantees the feature extraction effect but also links multiple pictures so as to process the timing information between the pictures, thereby improving recognition accuracy.
  • The value of n is not specifically limited.
  • When n=1, performing the time-series feature extraction on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature maps to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; and determining the first intermediate feature vector as the first feature vector.
  • When n=2, performing the time-series feature extraction n times on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature maps to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; performing time-series convolution processing on the first intermediate feature vector to obtain a second time-series convolution vector; pooling the second time-series convolution vector to obtain a second intermediate feature vector; and determining the second intermediate feature vector as the first feature vector.
  • When n>2, performing the time-series feature extraction n times on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature maps to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; performing time-series convolution processing on the first intermediate feature vector to obtain a second time-series convolution vector; pooling the i-th time-series convolution vector to obtain an i-th intermediate feature vector; and performing time-series convolution processing on the i-th intermediate feature vector to obtain an (i+1)-th time-series convolution vector, where i is taken sequentially from 2 to n-1, until the n-th intermediate feature vector is obtained; the n-th intermediate feature vector is determined as the first feature vector.
  • the time-series convolution processing may refer to performing convolution processing on the feature map with timing information.
  • the feature map may be convolved through a convolution layer with timing information.
  • the time series convolution vector contains the time dimension, so the pooling layer needs to be encapsulated to enable the time series convolution vector to be pooled.
  • For example, when n=3, the time-series convolution and pooling are performed three times on the feature map of the frame image and the feature map of the optical flow image of each video segment, and the third intermediate feature vector is determined as the first feature vector.
  • the spatio-temporal convolution processing on the feature maps of the frame image and the optical flow image is implemented by a neural network, and the method further includes: training the neural network according to a training set.
  • The present disclosure can use, for example, the UCF101 data set, which contains 13,320 videos from 101 action categories, offers great diversity in actions, and has large variations in camera movement, object appearance and pose, object scale, viewpoint, cluttered background, lighting conditions, and so on.
  • The videos of each action category are divided into 25 groups, and each group can contain 4 to 7 action videos. Videos from the same group may share some common features, such as similar backgrounds, similar viewpoints, and so on.
  • Action categories can be divided into five types: 1) human-object interaction 2) only body movements 3) human-human interaction 4) playing musical instruments 5) sports.
  • The video data in the UCF101 data set is subjected to frame extraction processing, which includes: decomposing each video into frame images and saving the frame count in a csv file; selecting as samples the videos whose frame count is greater than N1 (for example, 40) and less than N2 (for example, 900); dividing the frames of each selected sample evenly into 4 parts; and extracting frames from each sample at a certain interval, where the interval is the total number of frames of the video segment divided by N1 (for example, 40), rounding down by discarding the remainder, so that N1 (for example, 40) frames of images of size [320, 240, 3] are obtained.
  • A sample segment obtained in this way contains only a small portion of the frames, which greatly reduces the computational overhead.
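A small sketch of this sample-selection step, under the assumption that the csv produced above holds one (video_name, frame_count) row per video; the helper name and row layout are illustrative only.

```python
import csv

def select_and_split(frame_counts_csv, n1=40, n2=900, num_segments=4):
    """Keep videos whose frame count lies in (n1, n2) and split their frame indices into equal parts."""
    selected = []
    with open(frame_counts_csv, newline="") as f:
        for video_name, frame_count in csv.reader(f):   # assumed row layout: name, count
            count = int(frame_count)
            if n1 < count < n2:
                part = count // num_segments
                ranges = [(i * part, (i + 1) * part) for i in range(num_segments)]
                selected.append((video_name, ranges))   # per-segment frame index ranges
    return selected
```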
  • the optical flow is extracted using the above-mentioned optical flow extraction method to obtain the data set required by the neural network.
  • the data set is divided into training set Train and test set Test according to ucfTrainTestlist.
  • the neural network is trained through the training set, and the trained neural network is used as a prediction model for obtaining the temporal prediction results and spatial prediction results of the video clips.
  • The feature maps of the frame images and the optical flow images are input into the trained neural network for processing, and the trained neural network outputs the spatial prediction result (that is, the class probability distribution of the spatial stream) and the temporal prediction result (that is, the class probability distribution of the temporal stream).
  • The neural network includes: n Block blocks (net Block in Fig. 3), a Reshape layer (Reshape Layer in Fig. 3), a LIF layer (LIF Layer in Fig. 3), a fully connected layer (FC Layer in Fig. 3), and a Softmax layer (Softmax Layer in Fig. 3).
  • the Block block includes a cascaded ConvLIF layer (ConvLIF2D Layer in Figure 3) and a pooling layer (Time Distribution MaxPooling2D Layer in Figure 3).
  • n is a positive integer and n ≥ 1. When n > 1, the n Block blocks are cascaded.
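To make the layer ordering concrete, the sketch below assembles a Keras model with the same structure. ConvLSTM2D and LSTM are used only as stand-ins for the custom ConvLIF and LIF layers described in this document (temporal convolution and temporal fully connected layers, respectively); the filter counts and input shape are illustrative assumptions, so this is a structural sketch rather than the patented network.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_sketch(time_steps=10, h=8, w=8, c=2048, num_classes=101, n_blocks=2):
    """Block blocks -> Reshape -> temporal fully connected -> FC -> Softmax, with stand-in layers."""
    x = inputs = layers.Input(shape=(time_steps, h, w, c))
    for _ in range(n_blocks):                                  # n cascaded "Block" blocks
        x = layers.ConvLSTM2D(64, 3, padding="same",
                              return_sequences=True)(x)        # stand-in for the ConvLIF2D layer
        x = layers.BatchNormalization()(x)                     # BN layer inside the Block
        x = layers.TimeDistributed(layers.MaxPooling2D())(x)   # time-distributed pooling layer
    x = layers.Reshape((time_steps, -1))(x)                    # Reshape layer: flatten each time step
    x = layers.LSTM(256)(x)                                    # stand-in for the LIF layer
    x = layers.Dense(num_classes)(x)                           # fully connected layer
    outputs = layers.Softmax()(x)                              # Softmax layer
    return models.Model(inputs, outputs)
```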
  • Performing spatio-temporal convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment through the neural network includes: performing n time-series feature extractions through the n Block blocks to obtain a first feature vector; performing matrix transformation processing on the first feature vector through the Reshape layer to obtain a second feature vector; performing time-series fully connected processing on the second feature vector through the LIF layer and the fully connected layer to obtain a third feature vector; and determining the spatial prediction result and the temporal prediction result of each video segment through the Softmax layer.
  • When n=1, performing the time-series feature extraction on the frame image and optical flow image of each video segment through the n Block blocks to obtain the first feature vector includes: performing time-series convolution processing on the feature maps of the frame image and the optical flow image through the ConvLIF layer to obtain a first time-series convolution vector; pooling the first time-series convolution vector through the pooling layer to obtain a first intermediate feature vector; and determining the first intermediate feature vector as the first feature vector.
  • n when n>2, perform n times sequential feature extraction on the feature map of the frame image and the feature map of the optical flow image of each video segment through the n Block blocks to obtain the first feature Vectors, including:
  • the nth intermediate feature vector is determined as the first feature vector.
  • For example, the ConvLIF layer of the first Block block performs time-series convolution processing on the feature maps of the frame image and the optical flow image to obtain the first time-series convolution vector, and the pooling layer of the first Block block pools the first time-series convolution vector to obtain the first intermediate feature vector.
  • The Block block further includes a BN (Batch Normalization) layer cascaded between the ConvLIF layer and the pooling layer; the time-series convolution vector is standardized through the BN layer, and the standardized time-series convolution vector is then pooled.
  • Pooling the first time-series convolution vector through the pooling layer to obtain the first intermediate feature vector includes: normalizing the first time-series convolution vector through the BN layer, and pooling the normalized first time-series convolution vector through the pooling layer.
  • Pooling the second time-series convolution vector through the pooling layer to obtain the second intermediate feature vector includes: normalizing the second time-series convolution vector through the BN layer, and pooling the normalized second time-series convolution vector through the pooling layer.
  • Likewise, when the (i+1)-th time-series convolution vector is obtained from the i-th intermediate feature vector, the i-th time-series convolution vector is first normalized through the BN layer and then pooled through the pooling layer, and the same normalization and pooling are applied to the (i+1)-th time-series convolution vector to obtain the (i+1)-th intermediate feature vector.
  • The Reshape layer is added to process the output data of the Block block, transforming the dimensions of the data to serve as the input of the LIF layer.
  • For example, if the output shape of the Block block is (10, 2, 2, 1024), the Reshape layer flattens the last three dimensions to obtain data with a shape of (10, 4096).
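As a quick numeric check of the Reshape step above (the array contents are dummy zeros used only to show the shape change):

```python
import numpy as np

block_output = np.zeros((10, 2, 2, 1024), dtype=np.float32)   # (time, height, width, channels)
lif_input = block_output.reshape(block_output.shape[0], -1)   # flatten the last three dimensions
print(lif_input.shape)                                        # -> (10, 4096)
```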
  • the BN (Batch Normalization) layer is cascaded between the ConvLIF layer and the pooling layer to standardize the data in batches, which can accelerate the network convergence speed and improve the stability of training.
  • For example, an FC layer is used as the fully connected layer, and a MaxPooling layer is used as the pooling layer.
  • The LIF layer is used to: pass the input value X_t at time t through a fully connected calculation to obtain I_t; determine the membrane potential value at time t according to I_t and the biological voltage value at time t-1; determine the output value F_t at time t according to the membrane potential value at time t and the emission threshold V_th; determine whether to reset the membrane potential according to the output value F_t at time t, and determine the reset membrane potential value according to the reset voltage value V_reset; and determine the biological voltage value at time t according to the reset membrane potential value.
  • The output value F_t at time t is used as the input of the next layer cascaded with the LIF layer, and the biological voltage value at time t is used as an input for calculating the membrane potential value at time t+1; the input value X_t is a discrete value.
  • The ConvLIF layer is used in the same way, except that the input value X_t at time t is passed through a convolution operation or a fully connected calculation to obtain I_t.
  • The output value F_t at time t is used as the input of the next layer cascaded with the ConvLIF layer, and the biological voltage value at time t is used as an input for calculating the membrane potential value at time t+1; the input value X_t is a discrete value.
  • Determining the output value at time t according to the membrane potential value at time t and the emission threshold V_th includes: if the membrane potential value at time t is greater than or equal to the emission threshold V_th, the output value at time t is 1; otherwise, the output value at time t is 0.
  • Determining the biological voltage value at time t according to the reset membrane potential value includes: calculating the reset membrane potential value through a Leak activation function to determine the biological voltage value at time t, where the leak coefficient and the bias take values between 0 and 1.
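The following numpy sketch shows one LIF update step consistent with the description above. The exact update equations are not spelled out in the text, so a common LIF formulation is assumed: the leak is modelled as V_t = alpha * U_reset + beta, with alpha and beta standing in for the leak coefficient and bias mentioned above.

```python
import numpy as np

def lif_step(x_t, v_prev, weight, v_th=1.0, v_reset=0.0, alpha=0.9, beta=0.0):
    """One LIF update (fully connected input); a ConvLIF layer would convolve x_t instead."""
    i_t = weight @ x_t                          # I_t: input after the fully connected calculation
    u_t = i_t + v_prev                          # membrane potential at time t
    f_t = (u_t >= v_th).astype(np.float32)      # output F_t: spike where the threshold is reached
    u_reset = np.where(f_t > 0, v_reset, u_t)   # reset the membrane potential where a spike fired
    v_t = alpha * u_reset + beta                # Leak activation -> biological voltage at time t
    return f_t, v_t                             # F_t feeds the next layer; v_t feeds the t+1 update
```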
  • The pooling layer needs to be encapsulated so that it can process the output result of the ConvLIF layer.
  • the TimeDistribution layer is used to encapsulate the pooling layer MaxPooling2D, so that the MaxPooling2D layer can process the output result of ConvLIF.
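A minimal Keras example of this encapsulation, using dummy data just to show the shapes involved:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Wrapping MaxPooling2D in TimeDistributed lets it pool every time step of a
# 5-D output of shape (batch, time, height, width, channels) independently.
pool = layers.TimeDistributed(layers.MaxPooling2D(pool_size=2))
x = tf.zeros((1, 10, 8, 8, 64))        # dummy ConvLIF-style output
print(pool(x).shape)                   # -> (1, 10, 4, 4, 64)
```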
  • The neural network described in the present disclosure is a fusion network of ANN and SNN, that is, it fuses the ConvLIF layer and the LIF layer with the normalization layer and the pooling layer.
  • The LIF layer is a fully connected layer with timing, which can process information with timing; its function is similar to that of an LSTM in an ANN, but its number of weights is significantly lower than that of an LSTM.
  • The ConvLIF layer is a convolutional layer with timing information, which can perform convolution with timing.
  • In the convolution of an ANN, only one picture can be processed at a time, without any relation to the preceding and following pictures, whereas the ConvLIF layer can process multiple pictures at a time; it can therefore achieve the convolution effect of the ANN while also linking multiple pictures to process the timing information between the pictures.
  • The number of weights of the ConvLIF layer is also significantly lower than that of a Conv3D layer (in the convolutional network of the present disclosure, the weights and calculation amount of the ConvLIF2D layer are only one half of those of a Conv3D layer), which further reduces the amount of calculation, lowers the requirements on computing equipment, reduces the size of the network, and saves storage space.
  • When the prediction results of the video segments are fused, a direct-averaging fusion method is adopted for both the spatial prediction results and the temporal prediction results. This fusion method can jointly model multiple video clips and capture visual information from the entire video, improving the recognition effect.
  • the behavior recognition method of the present disclosure does not limit the fusion method of the spatial prediction result and the temporal prediction result.
  • the spatial fusion result and the time fusion result adopt weighted fusion for dual-stream fusion, for example, the weight of the spatial stream fusion result is set to 0.6, and the weight of the time stream fusion result is set to 0.4.
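A one-line sketch of this weighted dual-stream fusion with the example weights 0.6 and 0.4; the predicted behavior class would then be the arg-max of the fused distribution.

```python
import numpy as np

def dual_stream_fuse(spatial_fused, temporal_fused, w_spatial=0.6, w_temporal=0.4):
    """Weighted fusion of the spatial and temporal class-probability vectors."""
    return w_spatial * np.asarray(spatial_fused) + w_temporal * np.asarray(temporal_fused)
```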
  • the behavior recognition method of the present disclosure does not limit the dual-stream fusion method.
  • the behavior recognition system described in the embodiment of the present disclosure adopts the aforementioned behavior recognition method.
  • The behavior recognition system includes a data preprocessing module 510, a feature extraction module 520, a network recognition module 530, a network fusion module 540, and a dual-stream fusion module 550.
  • The data preprocessing module 510 is used to cut video data into multiple video segments, extract frames from each video segment to obtain multiple frame images, and extract optical flow from the multiple frame images of each video segment to obtain multiple optical flow images of each video segment.
  • The data preprocessing module 510 divides the video data into N video segments; for example, it equally divides the video data into 4 segments.
  • When the data preprocessing module 510 extracts frames from each video segment, it extracts frames at a certain interval, where the interval is the total number of frames of the video segment divided by N1 (for example, 40), rounding down by discarding the remainder, so that N1 (for example, 40) frames of images of size [320, 240, 3] are obtained.
  • The data preprocessing module 510 extracts optical flow from the extracted frame images, which includes: computing the optical flow between each frame and the previous frame for the extracted N1 (for example, 40) frame images to obtain N1-1 (for example, 39) optical flows; and copying the optical flow between the second frame and the first frame as the first optical flow, which is merged with the N1-1 (for example, 39) optical flows into N1 (for example, 40) optical flows.
  • the Brox algorithm is used when calculating the optical flow.
  • The feature extraction module 520 is configured to perform feature extraction on the frame image and the optical flow image of each video segment to obtain the feature map of the frame image and the feature map of the optical flow image of each video segment.
  • The feature extraction module 520 uses the Inception V3 model pre-trained on ImageNet to classify the frame images and optical flow images and extract image features, obtaining the feature map of the frame image and the feature map of the optical flow image of each video segment.
  • The network recognition module 530 is used to perform spatio-temporal convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment, and to determine the spatial prediction result (that is, the class probability distribution of the spatial stream) and the temporal prediction result (that is, the class probability distribution of the temporal stream) of each video segment.
  • When the network recognition module 530 performs spatio-temporal convolution processing on the feature maps of the frame image and the optical flow image to determine the spatial prediction result and the temporal prediction result of each video segment, the processing includes: performing n time-series feature extractions on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain a first feature vector; performing matrix transformation processing on the first feature vector to obtain a second feature vector; performing time-series fully connected processing on the second feature vector to obtain a third feature vector; and determining the spatial prediction result and the temporal prediction result of each video segment according to the third feature vector.
  • the time sequence feature extraction may refer to performing feature extraction processing with time sequence on the feature map.
  • Matrix transformation processing refers to flattening the last few dimensions of a matrix into one dimension.
  • Time-series fully connected processing refers to fully connected processing that carries timing information. In this way, multiple pictures can be processed at a time, which not only guarantees the feature extraction effect but also links multiple pictures so as to process the timing information between the pictures, thereby improving recognition accuracy.
  • when n = 1, performing the n rounds of time-series feature extraction on the feature map of the frame image and the feature map of the optical flow image of each video segment to obtain the first feature vector includes: performing time-series convolution processing on the feature maps to obtain a first time-series convolution vector; pooling the first time-series convolution vector to obtain a first intermediate feature vector; and determining the first intermediate feature vector as the first feature vector.
  • when n > 1, each subsequent round performs time-series convolution processing on the previous intermediate feature vector and pools the result to obtain the next intermediate feature vector, and the n-th intermediate feature vector is determined as the first feature vector.
  • the time-series convolution processing may refer to performing convolution processing on the feature map with timing information.
  • the feature map may be convolved through a convolution layer with timing information.
  • the time series convolution vector contains the time dimension, so the pooling layer needs to be encapsulated to enable the time series convolution vector to be pooled.
  • the data preprocessing module extracts frames for each video segment, including: extracting frames for each video segment at a certain interval to obtain N 1 frames of images, where the interval is the total number of frames of the video segment divided by N 1 , N 1 is an integer greater than 1.
  • the data preprocessing module extracts the optical flow from the frame images of the multiple frames of each video segment, including:
  • N1-1 optical flows are obtained by computing the optical flow between every two adjacent frame images;
  • the optical flow between the second frame and the first frame is copied as the first optical flow and combined with the N1-1 optical flows into N1 optical flows.
  • the network recognition module 530 performs spatiotemporal convolution processing on the feature map of the frame image and the feature map of the optical flow image of each video segment, and is implemented by a neural network.
  • the system further includes: training the neural network according to a training set.
  • the present disclosure can use, for example, the UCF101 data set, which has 13,320 videos from 101 action categories, offers the greatest diversity in actions, and varies widely in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, and lighting conditions.
  • the videos of 101 action categories are divided into 25 groups, and each group can contain 4-7 action videos. Videos from the same group may have some common features, such as similar backgrounds, similar viewpoints, and so on.
  • Action categories can be divided into five types: 1) human-object interaction 2) only body movements 3) human-human interaction 4) playing musical instruments 5) sports.
  • frame extraction is performed on the video data in the UCF101 data set, including: decomposing each video into frame images and saving the frame count in a csv file; selecting samples whose frame count is greater than N1 (for example, 40) and smaller than N2 (for example, 900) from the decomposed frame images; dividing the frames of each selected sample evenly into 4 parts; and extracting frames from each part at a fixed interval, where the interval is the total number of frames of the video segment divided by N1 (for example, 40, discarding the remainder), so that N1 (for example, 40) images of size [320, 240, 3] are obtained.
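A small sketch of the interval-based sampling just described: keep every (total_frames // N1)-th frame so that exactly N1 frames remain. Decoding frames with OpenCV's VideoCapture and resizing to 320×240 are implementation choices for the sketch, not requirements stated in the patent.

```python
import cv2

def sample_frames(video_path, n1=40, size=(320, 240)):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))   # approx. [320, 240, 3] images
    cap.release()
    interval = len(frames) // n1                  # discard the remainder
    return [frames[i * interval] for i in range(n1)]
```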
  • a segment sampled in this way contains only a small fraction of the frames, so compared with methods that use densely sampled frames, this approach greatly reduces computational overhead.
  • the optical flow is extracted using the above-mentioned optical flow extraction method to obtain the data set required by the neural network.
  • the data set is divided into training set Train and test set Test according to ucfTrainTestlist.
  • the neural network is trained through the training set, and the trained neural network is used as a prediction model for obtaining the temporal prediction results and spatial prediction results of the video clips.
  • the feature maps of the frame images and optical flow images are input into the trained neural network for processing, and the trained neural network outputs the spatial prediction result (that is, the category probability distribution of the spatial stream) and the temporal prediction result (that is, the category probability distribution of the temporal stream) of each video segment.
  • the neural network includes: n Block blocks, a Reshape layer, a LIF layer, a fully connected layer, and a Softmax layer; wherein, the Block block includes: a cascaded ConvLIF layer and Pooling layer.
  • n is a positive integer and n ≥ 1; when n > 1, the n Block blocks are cascaded.
  • performing spatio-temporal convolution processing on the feature maps of the frame image and the optical flow image of each video segment through the neural network includes: performing n rounds of time-series feature extraction through the n Block blocks to obtain a first feature vector; performing matrix transformation processing on the first feature vector through the Reshape layer to obtain a second feature vector; performing time-series fully connected processing on the second feature vector through the LIF layer and the fully connected layer to obtain a third feature vector; and determining the spatial prediction result and the temporal prediction result of each video segment from the third feature vector through the Softmax layer.
  • when n = 1, performing the time-series feature extraction on the frame image and optical flow image of each video segment through the n Block blocks to obtain the first feature vector includes: performing time-series convolution processing on the feature maps through the ConvLIF layer to obtain a first time-series convolution vector; pooling it through the pooling layer to obtain a first intermediate feature vector; and determining the first intermediate feature vector as the first feature vector.
  • when n > 1, each Block block in turn performs time-series convolution processing on the previous intermediate feature vector through its ConvLIF layer and pools the result through its pooling layer, and the n-th intermediate feature vector is determined as the first feature vector.
  • for example, with two Block blocks, the ConvLIF layer of the first Block block performs time-series convolution processing on the feature maps of the frame image and the optical flow image to obtain a first time-series convolution vector, which is pooled through the pooling layer of the first Block block to obtain a first intermediate feature vector.
  • the ConvLIF layer of the second Block block then performs time-series convolution processing on the first intermediate feature vector to obtain a second time-series convolution vector, which is pooled through the pooling layer of the second Block block to obtain a second intermediate feature vector, and the second intermediate feature vector is determined as the first feature vector.
  • the Block block further includes: a BN (Batch Normalization) layer cascaded between the ConvLIF layer and the pooling layer, and the time-series convolution vector is standardized through the BN layer, And the normalized time series convolution vector is pooled.
  • pooling the first time-series convolution vector through the pooling layer to obtain the first intermediate feature vector includes: normalizing the first time-series convolution vector through the BN layer, and pooling the normalized vector through the pooling layer;
  • pooling the second time-series convolution vector through the pooling layer to obtain the second intermediate feature vector is done in the same way: the second time-series convolution vector is normalized through the BN layer and the normalized vector is pooled through the pooling layer;
  • likewise, for the i-th time-series convolution vector, the BN layer normalizes it and the pooling layer pools the normalized vector to obtain the intermediate feature vector used in the next round.
  • since the dimensions of the Block block output are not suitable as input to the LIF layer, a Reshape layer is added to process the output data of the Block block and unfold its trailing dimensions before it is used as the input of the LIF layer.
  • for example, if the output shape of the Block block is (10, 2, 2, 1024), the Reshape layer unfolds the last three dimensions to obtain data of shape (10, 4096).
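The (10, 2, 2, 1024) → (10, 4096) unfolding from this example can be expressed with a standard Keras Reshape layer; the shapes below are just the example values quoted above.

```python
import tensorflow as tf

x = tf.random.uniform((1, 10, 2, 2, 1024))           # (batch, time, H, W, C)
flatten_tail = tf.keras.layers.Reshape((10, 4096))   # keep the time axis, unfold H*W*C
print(flatten_tail(x).shape)                          # (1, 10, 4096)
```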
  • the BN (Batch Normalization) layer is cascaded between the ConvLIF layer and the pooling layer to standardize the data in batches, which can accelerate the network convergence speed and improve the stability of training.
  • the FC fully connected layer is used as the fully connected layer
  • the Max Pooling layer is used as the pooling layer
  • the LIF layer is used to: determine the membrane potential value at time t from the value I_t, obtained by applying a fully connected operation to the input value X_t at time t, together with the biological voltage value at time t-1; determine the output value F_t at time t from the membrane potential value at time t and the firing threshold V_th; decide whether to reset the membrane potential according to F_t and determine the reset membrane potential value from the reset voltage value V_reset; and determine the biological voltage value at time t from the reset membrane potential value. The output value F_t at time t is used as the input of the next layer cascaded with the LIF layer, the biological voltage value at time t is used as the input for calculating the membrane potential value at time t+1, and the input values are all discrete values.
  • the ConvLIF layer is used in the same way, except that I_t is obtained by applying a convolution operation (or a fully connected operation) to the input value X_t; its output value F_t at time t is used as the input of the next layer cascaded with the ConvLIF layer, the biological voltage value at time t is used as the input for calculating the membrane potential value at time t+1, and the input values are all discrete values.
  • determining the output value at time t according to the membrane potential value at time t and the firing threshold V_th includes: if the membrane potential value at time t is greater than or equal to V_th, the output value at time t is determined to be 1; otherwise it is determined to be 0.
  • determining the biological voltage value at time t from the reset membrane potential value includes: applying a Leak activation function to the reset membrane potential value to calculate the biological voltage value at time t, where α is the leak factor and β is a bias whose theoretical value lies between 0 and 1.
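The update rule described in the last few bullets can be summarized in a short NumPy sketch. The threshold/fire decision, the reset against V_reset, and the Leak activation with leak factor α and bias β follow the text above; how I_t is combined with the previous biological voltage, and the exact reset formula, are only given as figures in the original publication, so the additive combination and hard reset used here are assumptions.

```python
import numpy as np

def lif_step(x_t, v_b_prev, W, v_th=1.0, v_reset=0.0, alpha=0.9, beta=0.0):
    i_t = x_t @ W                                  # fully connected operation: I_t = X_t * W
    v_m = i_t + v_b_prev                           # membrane potential at time t (assumed additive)
    f_t = (v_m >= v_th).astype(x_t.dtype)          # fire 1 if potential >= V_th, else 0
    v_m_reset = np.where(f_t > 0, v_reset, v_m)    # reset the membrane potential where a spike fired
    v_b = alpha * v_m_reset + beta                 # Leak activation: leak factor alpha, bias beta
    return f_t, v_b                                # F_t feeds the next layer; v_b feeds time t+1
```

A ConvLIF step would replace the fully connected product x_t @ W with a convolution over the feature maps, leaving the rest of the dynamics unchanged.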
  • since the ConvLIF layer carries an extra time dimension, the pooling layer needs to be wrapped so that it can process the output of the ConvLIF layer.
  • the TimeDistribution layer is used to encapsulate the pooling layer MaxPooling2D, so that the MaxPooling2D layer can process the output result of ConvLIF.
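In tf.keras, the wrapper the text calls a "TimeDistribution layer" corresponds to TimeDistributed; wrapping MaxPooling2D with it lets the pooling run independently at every time step of the ConvLIF output. The tensor shape below is illustrative.

```python
import tensorflow as tf

pool = tf.keras.layers.TimeDistributed(tf.keras.layers.MaxPooling2D(pool_size=2))
x = tf.random.uniform((1, 10, 16, 16, 64))   # (batch, time, H, W, C), e.g. a ConvLIF output
print(pool(x).shape)                          # (1, 10, 8, 8, 64)
```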
  • the network fusion module 540 is used to fuse the spatial prediction results of all video segments to obtain a spatial fusion result, and fuse the temporal prediction results of all video segments to obtain a temporal fusion result.
  • both the spatial prediction results and the temporal prediction results are fused by direct averaging. This fusion method can jointly model multiple video segments and capture visual information from the entire video, improving the recognition effect.
  • the behavior recognition system of the present disclosure does not limit the fusion method of the spatial prediction result and the temporal prediction result.
  • the dual-stream fusion module 550 is used to perform dual-stream fusion of the spatial fusion result and the time fusion result to obtain the behavior recognition result.
  • the spatial fusion result and the time fusion result adopt weighted fusion for dual-stream fusion, for example, the weight of the spatial stream fusion result is set to 0.6, and the weight of the time stream fusion result is set to 0.4.
  • the behavior recognition system of the present disclosure does not limit the dual-stream fusion method.
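The two fusion stages can be sketched as follows: average the per-segment class distributions within each stream, then combine the two streams with the example weights of 0.6 (spatial) and 0.4 (temporal) mentioned above. Array shapes are illustrative.

```python
import numpy as np

def fuse(spatial_preds, temporal_preds, w_spatial=0.6, w_temporal=0.4):
    """spatial_preds, temporal_preds: (num_segments, num_classes) probability arrays."""
    spatial_fused = spatial_preds.mean(axis=0)      # direct average over all segments
    temporal_fused = temporal_preds.mean(axis=0)
    two_stream = w_spatial * spatial_fused + w_temporal * temporal_fused
    return int(np.argmax(two_stream))               # index of the recognized behavior class
```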
  • the present disclosure also relates to an electronic device, including a server, a terminal, and the like.
  • the electronic device includes: at least one processor; a memory communicatively connected with the at least one processor; and a communication component communicatively connected with the storage medium, the communication component receiving and sending data under the control of the processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to implement the behavior recognition method of the foregoing embodiments.
  • the memory as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules.
  • the processor executes various functional applications and data processing of the device by running non-volatile software programs, instructions, and modules stored in the memory, that is, realizing the above behavior identification method.
  • the memory may include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function; the data storage area can store a list of options and the like.
  • the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory may optionally include a memory remotely arranged with respect to the processor, and these remote memories may be connected to an external device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • One or more modules are stored in the memory, and when executed by one or more processors, the behavior identification method in any of the foregoing method embodiments is executed.
  • the above-mentioned products can execute the behavior recognition method provided in the embodiments of this application and have the corresponding functional modules and beneficial effects; for technical details not described in detail in this embodiment, refer to the behavior recognition method provided in the embodiments of this application.
  • the present disclosure also relates to a computer-readable storage medium for storing a computer-readable program, and the computer-readable program is used for a computer to execute some or all of the above-mentioned behavior recognition method embodiments.
  • the program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Abstract

一种行为识别方法和系统,将视频数据截取成多个视频片段,对每个视频片段抽帧处理得到帧图像,并对帧图像提取光流得到光流图像;分别对每个视频片段的帧图像和光流图像进行特征提取,得到每个视频片段的帧图像和光流图像的特征图;分别对帧图像和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果;对所有视频片段的空间预测结果进行融合,得到空间融合结果,并对所有视频片段的时间预测结果进行融合,得到时间融合结果;对空间融合结果和时间融合结果进行双流融合,得到行为识别结果。既能保证卷积效果,也能降低计算量和权重量,还可联系多张图片,处理图片之间的时序信息,提高识别准确率。

Description

行为识别方法及系统、电子设备和计算机可读存储介质 技术领域
本发明涉及行为识别技术领域,具体而言,涉及一种行为识别方法、一种行为识别系统、一种电子设备和一种计算机可读存储介质。
背景技术
基于视频数据的行为识别被广泛应用在各个领域。然而,相关技术中,针对视频数据的行为识别具有计算量较大,权重量也较大,识别准确率较低等问题。
发明内容
为解决上述问题,本发明的目的在于提供一种行为识别方法、一种行为识别系统、一种电子设备和一种计算机可读存储介质,既可以做到人工神经网络(ANN,Artificial Neural Network)中的卷积效果,也能降低计算量和权重量,还可以联系多张图片,处理图片之间的时序信息,提高了识别的准确率。
本发明提供了一种行为识别方法,包括:将视频数据截取成多个视频片段,对每个视频片段抽帧处理,得到多个帧图像,并对每个视频片段抽帧后的得到所述帧图像提取光流,得到每个视频片段的光流图像;分别对每个视频片段的帧图像和光流图像进行特征提取,得到每个视频片段的帧图像的特征图和光流图像的特征图;分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果;对所有视频片段的空间预测结果进行融合,得到空间融合结果,并对所有视频片段的时间预测结果进行融合,得到时间融合结果;对所述空间融合结果和所述时间融合结果进行双流融合,得到行为识别结果。
作为本发明进一步的改进,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果,包括:分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,其中,n其中,n为正整数;对所述第一特征向量进行矩阵变换处理,得到第二特征向量;对所述第二特征向量进行时序全连接处理,得到第三特征向量;根据所述第三特征向量,确定每个视频片段的空间预测结果和时间预测结果。
作为本发明进一步的改进,当n=1时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;将所述第一中间特征向量确定为第一特征向量;
当n=2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;对第一时序卷积向量进行时序卷积处理,得到第二时序卷积向量;对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;将所述第二中间特征向量确定为所述第一特征向量;
当n>2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特 征提取,得到第一特征向量,包括:分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;对所述第i时序卷积向量进行池化处理,得到第i中间特征向量;对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;将第n中间特征向量确定为第一特征向量。
作为本发明进一步的改进,所述对每个视频片段抽帧处理,包括:将所述每个视频片段按照一定间隔抽取帧,得到N 1帧图像,其中,间隔为每个视频片段的总帧数除以N 1,N 1为大于1的整数。
作为本发明进一步的改进,对每个所述视频片段的多个所述帧图像提取光流,包括:对抽取出的N 1帧图像,分别根据两两相邻的两帧图像提取光流计算得到N 1-1个光流;复制第二帧与第一帧的光流作为第一个光流,与所述N 1-1个光流合并为N 1个光流。
作为本发明进一步的改进,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理通过神经网络实现,所述方法还包括:根据训练集训练所述神经网络。
作为本发明进一步的改进,所述神经网络包括:n个Block块、Reshape层、LIF层、全连接层和Softmax层;其中,所述Block块包括级联的ConvLIF层和池化层,n为正整数,且n≥1,当n>1时,n个Block块级联。
作为本发明进一步的改进,通过所述神经网络分别对每个视频片段的帧图像和光流图像的特征图进行时空卷积处理,包括:通过所述n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量;通过所述Reshape层对所述第一特征向量进行矩阵变换处理,得到第二特征向量;通过LIF层和所述全连接层对所述第二特征向量进行时序全连接处理,得到第三特征向量;根据所述第三特征向量,通过所述Softmax层确定每个视频片段的空间预测结果和时间预测结果。
作为本发明进一步的改进,当n=1时,通过所述n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量,包括:通过所述ConvLIF层分别对帧图像和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;将所述第一中间特征向量确定为第一特征向量;
当n=2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;通过ConvLIF层对所述第一中间特征向量进行时序卷积处理,得到第二时序卷积向量,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;将所述第二中间特征向量作为所述第一特征向量;
当n>2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;通过所述池化层对所述第1时序卷积向量进行池化处理,得到第1中间特征向量;通过所述池化层对所述第1时序卷积 向量进行时序卷积处理,得到第2时序卷积向量;通过所述ConvLIF层对所述第i时序卷积向量进行池化处理,得到i中间特征向量;通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;通过所述ConvLIF层对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;将第n中间特征向量确定为第一特征向量。
作为本发明进一步的改进,所述Block块还包括级联于ConvLIF层和池化层之间的BN层。
当n=1或n=2时,通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量,包括:对所述第一时序卷积向量进行标准化处理;利用池化层将标准化处理后的第一时序卷积向量进行池化处理;
当n=2时,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量,包括:通过所述BN层对所述第二时序卷积向量进行标准化处理;利用池化层将标准化处理后的第二时序卷积向量进行池化处理;
当n>2时,通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量,包括:通过所述BN层对所述第1时序卷积向量进行标准化处理;利用池化层将标准化处理后的第1时序卷积向量进行池化处理,以得到第2时序卷积向量;
当n>2时,通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量,包括:通过所述BN层对所述第i时序卷积向量进行标准化处理;利用池化层将标准化处理后的第i时序卷积向量进行池化处理,以得到第i+1时序卷积向量。
作为本发明进一步的改进，所述LIF层用于：根据t时刻的输入值X_t经过全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值；根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；根据重置的膜电位值，确定t时刻的生物电压值；其中，所述t时刻的输出值F_t作为与所述LIF层级联的下一层的输入，所述t时刻的生物电压值作为计算t+1时刻的膜电位值的输入。
作为本发明进一步的改进，所述ConvLIF层用于：根据t时刻的输入值X_t经过卷积运算或全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值；根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；根据重置的膜电位值，确定t时刻的生物电压值；其中，所述t时刻的输出值F_t作为与所述ConvLIF层级联的下一层的输入，所述t时刻的生物电压值作为计算t+1时刻的膜电位值的输入。
作为本发明进一步的改进，所述根据t时刻的膜电位值和发射阈值V_th，确定时刻t的输出值，包括：若t时刻的膜电位值大于或等于发射阈值V_th，则确定所述t时刻的输出值为1；若t时刻的膜电位值小于发射阈值V_th，则确定所述t时刻的输出值为0。
作为本发明进一步的改进，所述根据重置的膜电位值确定t时刻的生物电压值，包括：通过Leak激活函数对所述重置的膜电位值进行计算，确定t时刻的生物电压值。
作为本发明进一步的改进,对所有视频片段的空间预测结果和所有视频片段的时间预测结果进 行融合时,对所有视频片段的预测结果采用直接平均、线性加权、直接取最大值和TOP-K加权中的一种。
作为本发明进一步的改进,在双流融合时,所述空间融合结果和所述时间融合结果在双流融合时,将所述空间融合结果和所述时间融合结果采用加权融合。
本发明还提供了一种行为识别系统,采用所述行为识别方法,包括:数据预处理模块,其用于将视频数据截取成多个视频片段,对每个视频片段抽帧处理,得到多个帧图像,并对每个视频片段的多个所述帧图像提取光流,分别得到每个视频片段的多个光流图像;特征提取模块,其用于分别对每个视频片段的帧图像和光流图像进行图像特征提取,得到每个视频片段的帧图像的特征图和光流图像的特征图;网络识别模块,其分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果;网络融合模块,其对所有视频片段的空间预测结果进行融合,得到空间融合结果,并对所有视频片段的时间预测结果进行融合,得到时间融合结果;双流融合模块,其用于对所述空间融合结果和所述时间融合结果进行双流融合,得到行为识别结果。
作为本发明进一步的改进,所述网络识别模块分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果,包括:分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,其中,n其中,且n为正整数;对所述第一特征向量进行矩阵变换处理,得到第二特征向量;对所述第二特征向量进行时序全连接处理,得到第三特征向量;根据所述第三特征向量,确定每个视频片段的空间预测结果和时间预测结果。
作为本发明进一步的改进,当n=1时,所述网络识别模块分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;将所述第一中间特征向量确定为第一特征向量;
当n=2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;对第一时序卷积向量进行时序卷积处理,得到第二时序卷积向量;对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;将所述第二中间特征向量确定为所述第一特征向量;
当n>2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;对所述第i时序卷积向量进行池化处理,得到第i中间特征向量;对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;将第n中间特征向量确定为第一特征向量。
作为本发明进一步的改进,所述数据预处理模块对每个视频片段抽帧处理,包括:将所述每个 视频片段按照一定间隔抽取帧,得到N 1帧图像,其中,间隔为视频片段的总帧数除以N 1,N 1为大于1的整数。
作为本发明进一步的改进,所述数据预处理模块对每个所述视频片段的多个抽帧后的帧图像提取光流,包括:对抽取出的N 1帧图像,分别根据两两相邻的两帧图像提取光流计算得到N 1-1个光流;复制第二帧与第一帧的光流作为第一个光流,与所述N 1-1个光流合并为N 1个光流。作为本发明进一步的改进,所述网络识别模块分别对帧图像和光流图像的特征图进行时空卷积处理通过神经网络实现,所述系统还包括:根据训练集训练所述神经网络。
作为本发明进一步的改进,所述神经网络包括:n个Block块、Reshape层、LIF层、全连接层和Softmax层;其中,所述Block块包括级联的ConvLIF层和池化层,n为正整数,且n≥1,当n>1时,n个Block块级联。
作为本发明进一步的改进,通过所述神经网络分别对每个视频片段的帧图像和光流图像的特征图进行时空卷积处理,包括:通过所述n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量;通过所述Reshape层对所述第一特征向量进行矩阵变换处理,得到第二特征向量;通过LIF层和所述全连接层对所述第二特征向量进行时序全连接处理,得到第三特征向量;根据所述第三特征向量,通过所述Softmax层确定每个视频片段的空间预测结果和时间预测结果。
作为本发明进一步的改进,当n=1时,通过所述n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量,包括:通过所述ConvLIF层分别对帧图像和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;将所述第一中间特征向量确定为第一特征向量;
当n=2是,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;通过ConvLIF层对所述第一中间特征向量进行时序卷积处理,得到第二时序卷积向量,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;将所述第二中间特征向量作为所述第一特征向量;
当n>2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;通过所述池化层对所述第1时序卷积向量进行池化处理,得到第1中间特征向量;通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;通过所述ConvLIF层对所述第i时序卷积向量进行池化处理,得到i中间特征向量;通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;通过所述ConvLIF层对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;将第n中间特征向量确定为第一特征向量。
作为本发明进一步的改进,所述Block块还包括级联于ConvLIF层和池化层之间的BN层。
当n=1或n=2时,通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量,包括:通过所述BN层对所述第一时序卷积向量进行标准化处理;利用池化层将标准化处理后的第一时序卷积向量进行池化处理;
当n=2时,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量,包括:通过所述BN层对所述第二时序卷积向量进行标准化处理;利用池化层将标准化处理后的第二时序卷积向量进行池化处理;
当n>2时,通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量,包括:通过所述BN层对所述第1时序卷积向量进行标准化处理;利用池化层将标准化处理后的第1时序卷积向量进行池化处理,以得到第2时序卷积向量;
当n>2时,通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量,包括:通过所述BN层对所述第i时序卷积向量进行标准化处理;利用池化层将标准化处理后的第i时序卷积向量进行池化处理,以得到第i+1时序卷积向量。
作为本发明进一步的改进，所述LIF层用于：根据t时刻的输入值X_t经过全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值；根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；根据重置的膜电位值，确定t时刻的生物电压值；其中，所述t时刻的输出值F_t作为与所述LIF层级联的下一层的输入，所述t时刻的生物电压值作为计算t+1时刻的膜电位值的输入。
作为本发明进一步的改进，所述ConvLIF层用于：根据t时刻的输入值X_t经过卷积运算或全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值；根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；根据重置的膜电位值，确定t时刻的生物电压值；其中，所述t时刻的输出值F_t作为与所述ConvLIF层级联的下一层的输入，所述t时刻的生物电压值作为计算t+1时刻的膜电位值的输入。
作为本发明进一步的改进，所述根据t时刻的膜电位值和发射阈值V_th，确定时刻t的输出值，包括：若t时刻的膜电位值大于或等于发射阈值V_th，则确定所述t时刻的输出值为1；若t时刻的膜电位值小于发射阈值V_th，则确定所述t时刻的输出值为0。
作为本发明进一步的改进，所述根据重置的膜电位值确定t时刻的生物电压值，包括：通过Leak激活函数对所述重置的膜电位值进行计算，确定t时刻的生物电压值。
作为本发明进一步的改进,所述网络融合模块对所有视频片段的空间预测结果和所有视频片段的时间预测结果进行融合时,对所有视频片段的预测结果采用直接平均、线性加权、直接取最大值和TOP-K加权中的一种。
作为本发明进一步的改进,所述双流融合模块对所述空间融合结果和所述时间融合结果进行双流融合时,将所述空间融合结果和所述时间融合结果采用加权融合。
本发明还提供了一种电子设备,包括存储器和处理器,所述存储器用于存储一条或多条计算机指令,其中,所述一条或多条计算机指令被处理器执行以实现所述的行为识别方法。
本发明还提供了一种计算机可读存储介质,其上存储有计算机程序,、所述计算机程序被处理器执行以实现所述的行为识别方法。
本发明的有益效果为:
即可以做到ANN中的卷积效果,也能降低计算量和权重量,大大降低了计算量,降低对计算设备的要求,也相应减小网络的大小,减少存储空间。还可以联系多张图片,处理图片之间的时序信息,提高了识别的准确率。
附图说明
为了更清楚地说明本公开实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍。显而易见地,下面描述中的附图仅仅是本公开的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本公开一示例性实施例所述的一种行为识别方法的流程示意图;
图2是本公开一示例性实施例所提供的行为识别方法的流程框图;
图3为本公开一示例性实施例所述的神经网络的结构图;
图4为本公开一示例性实施例所述的神经网络中ConvLIF层和LIF层的工作流程图;
图5是本公开一示例性实施例所述的行为识别系统的模块图。
具体实施方式
下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。基于本公开中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本公开保护的范围。
需要说明,若本公开实施例中有涉及方向性指示(诸如上、下、左、右、前、后……),则该方向性指示仅用于解释在某一特定姿态(如附图所示)下各部件之间的相对位置关系、运动情况等,如果该特定姿态发生改变时,则该方向性指示也相应地随之改变。
另外,在本公开的描述中,所用术语仅用于说明目的,并非旨在限制本公开的范围。术语“包括”和/或“包含”用于指定所述元件、步骤、操作和/或组件的存在,但并不排除存在或添加一个或多个其他元件、步骤、操作和/或组件的情况。术语“第一”、“第二”等可能用于描述各种元件,不代表顺序,且不对这些元件起限定作用。此外,在本公开的描述中,除非另有说明,“多个”的含义是两个及两个以上。这些术语仅用于区分一个元素和另一个元素。结合以下附图,这些和/或其他方面变得显而易见,并且,本领域普通技术人员更容易理解关于本公开所述实施例的说明。附图仅出于说明的目的用来描绘本公开所述实施例。本领域技术人员将很容易地从以下说明中认识到,在不背离本公开所述原理的情况下,可以采用本公开所示结构和方法的替代实施例。
本公开实施例的一种行为识别方法,从整个视频中稀疏地采样一系列短片段,每个视频片段都将给出其本身对于行为类别的初步预测,从这些片段的融合来得到视频级的预测结果,之后对所有模式(空间和时间)的预测融合产生最终的预测结果,如图1所示,包括:
S1,将视频数据截取成多个视频片段,对每个视频片段抽帧处理,得到多个帧图像,并对每个 视频片段抽帧后得到的多个所述帧图像提取光流,得到每个视频片段的光流图像。
在一种可选的实施方式中,如图2所示,将视频数据均分成N个视频片段。例如,平均分为4段。
在一种可选的实施方式中,对每个视频片段抽帧处理,包括:将每个视频片段按照一定间隔抽取帧,得到N 1(例如40)帧大小为[320,240,3]的图像,其中,间隔为视频片段的总帧数除以N 1(例如40,按照舍掉余数的方法)。其中,N 1为大于1的整数,本公开对N 1的取值不做限制。
光流是利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性来找到上一帧跟当前帧之间存在的对应关系,从而计算出相邻帧之间物体的运动信息的一种方法。在一种可选的实施方式中,对抽帧后的帧图像提取光流,包括:对抽取出的N 1(例如40)帧图像,分别根据两两相邻的两帧图像提取光流计算得到N 1-1(例如39)个光流;复制第二帧与第一帧的光流作为第一个光流,与N 1-1(例如39)个光流合并为N 1(例如40)个光流。在一种可选的实施方式中,在计算光流时,采用Brox算法。
S2,分别对每个视频片段的帧图像和光流图像进行特征提取,得到每个视频片段的帧图像的特征图和光流图像的特征图。
在一种可选的实施方式中,采用ImageNet训练的Inception V3模型对帧图像和光流图像进行图像分类,提取图像特征,得到每个视频片段的帧图像的特征图和光流图像的特征图。
S3,分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果(即空间流的类别概率分布)和时间预测结果(即时间流的类别概率分布)。
在一种可选的实施方式中,分别对帧图像和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,其中,n≥1,n为正整数;
对第一特征向量进行矩阵变换处理,得到第二特征向量;
对第二特征向量进行时序全连接处理,得到第三特征向量;
根据第三特征向量,确定每个视频片段的空间预测结果和时间预测结果。
其中,时序特征提取可以是指对特征图进行带时序的特征提取处理。矩阵变换处理是指将一个矩阵后几个维度展开的过程。时序全连接处理是指带时序处理的全连接处理。这样,一次可以处理多张图片,不仅可以保证特征提取效果,还可以联系多张图片,处理图片之间的时序信息,从而提高识别准确率。
在本公开中,对n的取值不做特殊的限定。
在一种可选的实施方式中,n=1,分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
对第一时序卷积向量进行池化处理,得到第一中间特征向量;
将所述第一中间特征向量确定为第一特征向量。
在一种实施方式中,n=2,相应地,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
对第一时序卷积向量进行时序卷积处理,得到第二时序卷积向量;
对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;
将所述第二中间特征向量确定为所述第一特征向量。
在一种实施方式中,n>2,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;
对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
对所述第i时序卷积向量进行池化处理,得到第i中间特征向量;
对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;
对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;
将第n中间特征向量确定为第一特征向量。
其中,时序卷积处理可以是指对特征图进行带时序信息的卷积处理,例如,可以通过带时序信息的卷积层对特征图进行卷积处理。这样,能够联系多张图片,处理图片之间的时序信息。时序卷积向量包含了时间维度,因此需要将池化层进行封装,以使能对时序卷积向量进行池化处理。
下面以n=3为例,对所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量的步骤进行简单介绍。
相应地,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;
对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
对所述第2时序卷积向量进行池化处理,得到第2中间特征向量;
对第2中间特征向量进行时序卷积处理,得到第3时序卷积向量;
对第3时序卷积向量进行池化处理,得到第3中间特征向量;
将第3中间特征向量确定为第一特征向量。
在一种可选的实施方式中,分别对帧图像和光流图像的特征图进行时空卷积处理通过神经网络实现,所述方法还包括:根据训练集训练所述神经网络。
本公开可以采用例如UCF101数据集,其拥有来自101个动作类别的13320个视频,在动作方面具有最大的多样性,并且在摄像机运动、物体外观和姿势、物体比例、视点、杂乱的背景、照明条件等方面存在很大的差异。101个动作类别的视频分为25个组,每个组可以包含4至7个动作的 视频。来自同一组的视频可能具有一些共同的特征,例如相似的背景、相似的视点等。动作类别可以分为五种类型:1)人与物体的互动2)仅身体动作3)人与人的互动4)演奏乐器5)运动。
将UCF101数据集中的视频数据进行抽帧处理,包括:将每个视频片段分解成帧图像并将帧数保存在csv文件中;从分解后的帧图像中选取多个帧数大于N 1(例如40)且小于N2(例如900)的样本;将选取的样本的帧数平均分为4份;将每份样本按照一定间隔抽取帧,其中,间隔为视频片段的总帧数除以N 1(例如40,按照舍掉余数的方法),得到N 1帧(例如40)大小为[320,240,3]的图像。这种方式的采样片段只包含一小部分帧,与使用密集采样帧的方法相比,这种方法大大降低计算开销。同样,UCF101数据集中的视频数据在抽帧后,采用上述提取光流的方式提取光流,得到神经网络所需要的数据集。数据集按照ucfTrainTestlist分为训练集Train和测试集Test。通过训练集对神经网络进行训练,训练后的神经网络作为获取视频片段的时间预测结果和空间预测结果的预测模型。例如,将帧图像和光流图像的特征图输入训练后的神经网络中进行处理,训练后的神经网络输出每个视频片段的空间预测结果(即空间流的类别概率分布)和时间预测结果(即时间流的类别概率分布)。
在一种可选的实施方式中,如图3所示,神经网络包括:n个Block块(图3中的net Block)、Reshape层(图3中的Reshape Layer)、LIF层(图3中的LIF Layer)、全连接层(图3中的FC Layer)和Softmax层(图3中的Softmax Layer)。其中,Block块包括级联的ConvLIF层(图3中的ConvLIF2D Layer)和池化层(图3中的Time Distribution MaxPooling2D Layer)。n为正整数,且n≥1,当n>1时,n个Block块级联。
在一种可选的实施方式中,通过神经网络分别对每个视频片段的帧图像和光流图像的特征图进行时空卷积处理,包括:
通过n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量;
通过Reshape层对第一特征向量进行矩阵变换处理,得到第二特征向量;
通过LIF层和全连接层对第二特征向量进行时序全连接处理,得到第三特征向量;
根据第三特征向量,通过Softmax层确定每个视频片段的空间预测结果和时间预测结果。
在本公开中,对n的具体数值不做特殊的限定。例如,在一种可选的实施方式中,n=1,通过n个Block块对每个视频片段的帧图像和光流图像进行至少一次时序特征提取,得到第一特征向量,包括:
通过ConvLIF层分别对帧图像和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
通过池化层对第一时序卷积向量进行池化处理,得到第一中间特征向量;
将第一中间特征向量确定为第一特征向量。
作为另一种实施方式,n=2,相应地,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;
通过ConvLIF层对所述第一中间特征向量进行时序卷积处理,得到第二时序卷积向量,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;
将所述第二中间特征向量作为所述第一特征向量。
作为另一种可选实施方式,当n>2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;
通过所述池化层对所述第1时序卷积向量进行池化处理,得到第1中间特征向量;
通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
通过所述ConvLIF层对所述第i时序卷积向量进行池化处理,得到i中间特征向量;
通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;
通过所述ConvLIF层对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;
将第n中间特征向量确定为第一特征向量。
举例来说,包括三个Block块,在进行三次时序特征提取时,可以通过第一个Block块的ConvLIF层分别对帧图像和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量,并通过第一个Block块的池化层对第一时序卷积向量进行池化处理,得到第一中间特征向量。通过第2个Block块的ConvLIF层对所述第一中间特征向量进行时序卷积处理,得到第2时序卷积向量,通过第2个Block块的池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量,通过第3个Block块的ConvLIF层对所述第2中间特征向量进行时序卷积处理,得到第3时序卷积向量,通过第3个Block块的池化层对所述第3时序卷积向量进行池化处理,得到第3中间特征向量,将第3中间特征向量确定为第一特征向量。本公开对Block块的数量不做限制。
在一种可选的实施方式中,Block块还包括:级联于ConvLIF层和池化层之间的BN(Batch Normalization)层,通过所述BN层对所述时序卷积向量进行标准化处理,并将标准化处理后的时序卷积向量进行池化处理。
具体地,当n=1或n=2时,通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量,包括:
通过所述BN层对所述第一时序卷积向量进行标准化处理;
利用池化层将标准化处理后的第一时序卷积向量进行池化处理;
当n=2时,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量,包括:
通过所述BN层对所述第二时序卷积向量进行标准化处理;
利用池化层将标准化处理后的第二时序卷积向量进行池化处理;
当n>2时,通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量,包括:
通过所述BN层对所述第1时序卷积向量进行标准化处理;
利用池化层将标准化处理后的第1时序卷积向量进行池化处理,以得到第2时序卷积向量;
当n>2时,通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量,包括:
通过所述BN层对所述第i时序卷积向量进行标准化处理;
利用池化层将标准化处理后的第i时序卷积向量进行池化处理,以得到第i+1时序卷积向量。
由于Block块输出数据的维度不适用于作为LIF层的输入,可以加入Reshape层对Block块的输出数据进行处理,将数据的维度展开后作为LIF层的输入。例如,Block块的输出shape为(10,2,2,1024),加入reshape层,对输出数据进行处理,将后面三个维度直接展开,得到shape为(10,4096)的数据。级联于ConvLIF层和池化层之间的BN(Batch Normalization)层,对数据进行批量标准化,可以加速网络收敛速度,提升训练的稳定性。
在一种可选的实施方式中,全连接层采用FC全连接层,池化层采用MaxPooling池化层。
在一种可选的实施方式中,如图4所示,LIF层用于:
根据t时刻的输入值X_t经过全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值，其中，I_t = X_t * W，W为输入值X_t的权重；
根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；
根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；
根据重置的膜电位值，确定t时刻的生物电压值；
其中，t时刻的输出值F_t作为与LIF层级联的下一层的输入，t时刻的生物电压值作为计算t+1时刻的膜电位值的输入，输入值X_t均为离散值。
在一种可选的实施方式中，如图4所示，ConvLIF层用于：
根据t时刻的输入值X_t经过卷积运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值，其中，I_t = Conv(X_t, W)，W为输入值X_t的权重；
根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；
根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；
根据重置的膜电位值，确定t时刻的生物电压值；
其中，t时刻的输出值F_t作为与ConvLIF层级联的下一层的输入，t时刻的生物电压值作为计算t+1时刻的膜电位值的输入，输入值X_t均为离散值。
在一种可选的实施方式中，根据t时刻的膜电位值和发射阈值V_th，确定时刻t的输出值，包括：若t时刻的膜电位值大于或等于发射阈值V_th，则确定t时刻的输出值为1；若t时刻的膜电位值小于发射阈值V_th，则确定t时刻的输出值为0。
在一种可选的实施方式中，根据重置的膜电位值确定t时刻的生物电压值，包括：通过Leak激活函数对重置的膜电位值进行计算，确定t时刻的生物电压值，其中，α为泄露机制，β为理论值在0-1之间的偏置。
在一种可选的实施方式中,由于ConvLIF层比Conv层多出时间维度,故在ConvLIF与池化层连接时,需要将池化层封装起来,使其能够处理ConvLIF的输出结果。例如,采用TimeDistribution 层将池化层MaxPooling2D进行封装,使MaxPooling2D层能处理ConvLIF的输出结果。
本公开所述的神经网络使用ANN和SNN融合的网络,即ConvLIF层和LIF层与归一化层和池化层的融合。其中LIF层是带有时序的全连接层,可以处理带有时序的信息,其作用类似于ANN中的LSTM,但权重量明显低于LSTM(本公开的卷积网络的LIF的计算量只有LSTM的四分之一,只有GRU的三分之一),大大降低了计算量,降低对计算设备的要求,也相应减小了网络的大小,减少了存储空间。ConvLIF层是带有时序信息的卷积层,可以处理带有时序的卷积,在ANN的卷积中,只能处理一张图片,且与前后的图片都没有关联,而ConvLIF层则一次可以处理多张图片,即可以做到ANN中的卷积效果,还可以联系多张图片,处理图片之间的时序信息,另外ConvLIF层的权重量也明显低于Conv3D层(本公开的卷积网络的ConvLIF2D层的权重量和计算量只有Conv3D层的二分之一),进一步降低了计算量,降低对计算设备的要求,也减小了网络的大小,减少了存储空间。
S4,对所有视频片段的空间预测结果进行融合,得到空间融合结果,并对所有视频片段的时间预测结果进行融合,得到时间融合结果。
在一种可选的实施方式中,对所有视频片段的空间预测结果和所有视频片段的时间预测结果进行融合时,对所有视频片段的空间预测结果采用直接平均、线性加权、直接取最大值和TOP-K加权中的一种,对所有视频片段的时间预测结果采用直接平均、线性加权、直接取最大值和TOP-K加权中的一种。在一种可选的实施方式中,空间预测结果和时间预测结果均采用直接平均的融合方法,这种融合方法可以联合建模多个视频片段,并从整个视频中捕获视觉信息,提高识别效果。本公开的行为识别方法对空间预测结果和时间预测结果的融合方法不作限制。
S5,对空间融合结果和时间融合结果进行双流融合,得到行为识别结果。
在一种可选的实施方式中,空间融合结果和时间融合结果采用加权融合进行双流融合,例如设置空间流融合结果的权重为0.6,时间流融合结果的权重为0.4。本公开的行为识别方法对双流融合的方法不作限制。
本公开实施方式所述的一种行为识别系统,采用前述的行为识别方法,如图5所示,所述行为识别系统包括数据预处理模块510、特征提取模块520、网络识别模块530、网络融合模块540、双流融合模块550。
数据预处理模块510用于将视频数据截取成多个视频片段,对每个视频片段抽帧处理,得到多个帧图像,并对每个视频片段的多个所述帧图像提取光流,分别得到每个视频片段的多个光流图像。
在一种可选的实施方式中,数据预处理模块510将视频数据均分成N个视频片段。例如,平均分为4段。
在一种可选的实施方式中,数据预处理模块510对每个视频片段抽帧处理时,包括:将每个视频片段按照一定间隔抽取帧,其中,间隔为视频片段的总帧数除以N 1(例如40,40,按照舍掉余数的方法),得到N 1(例如40)帧大小为[320,240,3]的图像。这种方式的采样片段只包含一小部分帧,与使用密集采样帧的方法相比,这种方法大大降低计算开销。本公开对N 1的取值不做限制。
在一种可选的实施方式中,数据预处理模块510对抽帧后的帧图像提取光流,包括:对抽取出的N 1(例如40)帧图像,将后一帧与前一帧提取光流计算得到N 1-1(例如39)个光流;复制第二 帧与第一帧的光流作为第一个光流,与N 1-1(例如39)个光流合并为N 1(例如40)个光流。在一种可选的实施方式中,在计算光流时,采用Brox算法。
特征提取模块520用于分别对每个视频片段的帧图像的特征图和光流图像进行特征提取,得到每个视频片段的帧图像和光流图像的特征图。
在一种可选的实施方式中,特征提取模块520采用ImageNet训练的Inception V3模型对帧图像和光流图像进行图像分类,提取图像特征,得到每个视频片段的帧图像的特征图和光流图像的特征图。
网络识别模块530用于分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果(即空间流的类别概率分布)和时间预测结果(即时间流的类别概率分布)。
在一种可选的实施方式中,网络识别模块530在分别对帧图像和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果时,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,其中,n≥1,且n为正整数;
对第一特征向量进行矩阵变换处理,得到第二特征向量;
对第二特征向量进行时序全连接处理,得到第三特征向量;
根据第三特征向量,确定每个视频片段的空间预测结果和时间预测结果。
其中,时序特征提取可以是指对特征图进行带时序的特征提取处理。矩阵变换处理是指将一个矩阵后几个维度展开的过程。时序全连接处理是指带时序处理的全连接处理。这样,一次可以处理多张图片,不仅可以保证特征提取效果,还可以联系多张图片,处理图片之间的时序信息,从而提高识别准确率。
在一种可选的实施方式中,n=1,网络识别模块530在分别对帧图像和光流图像的特征图进行n次时序特征提取,得到第一特征向量时,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
对第一时序卷积向量进行池化处理,得到第一中间特征向量;
将所述第一中间特征向量确定为第一特征向量。
当n=2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
对第一时序卷积向量进行时序卷积处理,得到第二时序卷积向量;
对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;
将所述第二中间特征向量确定为所述第一特征向量;
当n>2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;
对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
对所述第i时序卷积向量进行池化处理,得到第i中间特征向量;
对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;
对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;
将第n中间特征向量确定为第一特征向量。
其中,时序卷积处理可以是指对特征图进行带时序信息的卷积处理,例如,可以通过带时序信息的卷积层对特征图进行卷积处理。这样,能够联系多张图片,处理图片之间的时序信息。时序卷积向量包含了时间维度,因此需要将池化层进行封装,以使能对时序卷积向量进行池化处理。
所述数据预处理模块对每个视频片段抽帧处理,包括:将所述每个视频片段按照一定间隔抽取帧,得到N 1帧图像,其中,间隔为视频片段的总帧数除以N 1,N 1为大于1的整数。
所述数据预处理模块对每个所述视频片段的多个抽帧后的帧图像提取光流,包括:
对抽取出的N 1帧图像,分别根据两两相邻的两帧图像提取光流计算得到N 1-1个光流;
复制第二帧与第一帧的光流作为第一个光流,与所述N 1-1个光流合并为N 1个光流。
在一种可选的实施方式中,网络识别模块530分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理通过神经网络实现,所述系统还包括:根据训练集训练所述神经网络。
本公开可以采用例如UCF101数据集,其拥有来自101个动作类别的13320个视频,在动作方面具有最大的多样性,并且在摄像机运动,物体外观和姿势,物体比例,视点,杂乱的背景,照明条件等方面存在很大的差异。101个动作类别的视频分为25个组,每个组可以包含4-7个动作的视频。来自同一组的视频可能具有一些共同的特征,例如相似的背景,相似的视点等。动作类别可以分为五种类型:1)人与物体的互动2)仅身体动作3)人与人的互动4)演奏乐器5)运动。
将UCF101数据集中的视频数据进行抽帧处理,包括:将每个视频片段分解成帧图像并将帧数保存在csv文件中;从分解后的帧图像中选取多个帧数大于N 1(例如40)且小于N 2(例如900)的样本;将选取的样本的帧数平均分为4份;将每份样本按照一定间隔抽取帧,其中,间隔为视频片段的总帧数除以N 1(例如40,按照舍掉余数的方法),得到N 1帧(例如40)大小为[320,240,3]的图像。这种方式的采样片段只包含一小部分帧,与使用密集采样帧的方法相比,这种方法大大降低计算开销。同样,UCF101数据集中的视频数据在抽帧后,采用上述提取光流的方式提取光流,得到神经网络所需要的数据集。数据集按照ucfTrainTestlist分为训练集Train和测试集Test。通过训练集对神经网络进行训练,训练后的神经网络作为获取视频片段的时间预测结果和空间预测结果的预测模型。例如,将帧图像和光流图像的特征图输入训练后的神经网络中进行处理,训练后的神经网络输出每个视频片段的空间预测结果(即空间流的类别概率分布)和时间预测结果(即时间流的类别概率分布)。
在一种可选的实施方式中,如图3所示,神经网络包括:n个Block块、Reshape层、LIF层、全连接层和Softmax层;其中,Block块包括:级联的ConvLIF层和池化层。n为正整数,且n≥1, 当n>1时,n个Block块级联。
在一种可选的实施方式中,通过神经网络分别对每个视频片段的帧图像和光流图像的特征图进行时空卷积处理,包括:
通过n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量;
通过Reshape层对第一特征向量进行矩阵变换处理,得到第二特征向量;
通过LIF层和全连接层对第二特征向量进行时序全连接处理,得到第三特征向量;
根据第三特征向量,通过Softmax层确定每个视频片段的空间预测结果和时间预测结果。
在一种可选的实施方式中,当n=1时,通过n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量,包括:
通过ConvLIF层分别对帧图像和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
通过池化层对第一时序卷积向量进行池化处理,得到第一中间特征向量;
将第一中间特征向量确定为第一特征向量。
当n=2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;
通过ConvLIF层对所述第一中间特征向量进行时序卷积处理,得到第二时序卷积向量,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;
将所述第二中间特征向量作为所述第一特征向量;
当n>2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;
通过所述池化层对所述第1时序卷积向量进行池化处理,得到第1中间特征向量;
通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
通过所述ConvLIF层对所述第i时序卷积向量进行池化处理,得到i中间特征向量;
通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;
通过所述ConvLIF层对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;
将第n中间特征向量确定为第一特征向量。
举例来说,包括两个Block块,在进行两次时序特征提取时,可以通过第一个Block块的ConvLIF层分别对帧图像和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量,并通过第一个Block块的池化层对第一时序卷积向量进行池化处理,得到第一中间特征向量。通过第二个Block块的ConvLIF层对所述第一中间特征向量进行时序卷积处理,得到第二时序卷积向量,通过第二个 Block块的池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量,将第二中间特征向量确定为第一特征向量。
在一种可选的实施方式中,Block块还包括:级联于ConvLIF层和池化层之间的BN(Batch Normalization)层,通过所述BN层对所述时序卷积向量进行标准化处理,并将标准化处理后的时序卷积向量进行池化处理。
具体地,当n=1或n=2时,通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量,包括:
通过所述BN层对所述第一时序卷积向量进行标准化处理;
利用池化层将标准化处理后的第一时序卷积向量进行池化处理;
当n=2时,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量,包括:
通过所述BN层对所述第二时序卷积向量进行标准化处理;
利用池化层将标准化处理后的第二时序卷积向量进行池化处理;
当n>2时,通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量,包括:
通过所述BN层对所述第1时序卷积向量进行标准化处理;
利用池化层将标准化处理后的第1时序卷积向量进行池化处理,以得到第2时序卷积向量;
当n>2时,通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量,包括:
通过所述BN层对所述第i时序卷积向量进行标准化处理;
利用池化层将标准化处理后的第i时序卷积向量进行池化处理,以得到第i+1时序卷积向量。
由于Block块输出数据的维度不适用于作为LIF层的输入,加入Reshape层对Block块的输出数据进行处理,将数据的维度展开后作为LIF层的输入。例如,Block块的输出shape为(10,2,2,1024),加入reshape层,对输出数据进行处理,将后面三个维度直接展开,得到shape为(10,4096)的数据。级联于ConvLIF层和池化层之间的BN(Batch Normalization)层,对数据进行批量标准化,可以加速网络收敛速度,提升训练的稳定性。
在一种可选的实施方式中,全连接层采用FC全连接层,池化层采用MaxPooling池化层。
在一种可选的实施方式中,如图4所示,LIF层用于:
根据t时刻的输入值X_t经过全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值，其中，I_t = X_t * W，W为输入值X_t的权重；
根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；
根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；
根据重置的膜电位值，确定t时刻的生物电压值；
其中，t时刻的输出值F_t作为与LIF层级联的下一层的输入，t时刻的生物电压值作为计算t+1时刻的膜电位值的输入，输入值均为离散值。
在一种可选的实施方式中，如图4所示，ConvLIF层用于：
根据t时刻的输入值X_t经过卷积运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值，其中，I_t = Conv(X_t, W)，W为输入值X_t的权重；
根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；
根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；
根据重置的膜电位值，确定t时刻的生物电压值；
其中，t时刻的输出值F_t作为与ConvLIF层级联的下一层的输入，t时刻的生物电压值作为计算t+1时刻的膜电位值的输入，输入值均为离散值。
在一种可选的实施方式中，根据t时刻的膜电位值和发射阈值V_th，确定时刻t的输出值，包括：若t时刻的膜电位值大于或等于发射阈值V_th，则确定t时刻的输出值为1；若t时刻的膜电位值小于发射阈值V_th，则确定t时刻的输出值为0。
在一种可选的实施方式中，根据重置的膜电位值确定t时刻的生物电压值，包括：通过Leak激活函数对重置的膜电位值进行计算，确定t时刻的生物电压值，其中，α为泄露机制，β为理论值在0-1之间的偏置。
在一种可选的实施方式中,由于ConvLIF层比Conv层多出时间维度,故在ConvLIF与池化层连接时,需要将池化层封装起来,使其能够处理ConvLIF的输出结果。例如,采用TimeDistribution层将池化层MaxPooling2D进行封装,使MaxPooling2D层能处理ConvLIF的输出结果。
网络融合模块540其用于对所有视频片段的空间预测结果进行融合,得到空间融合结果,并对所有视频片段的时间预测结果进行融合,得到时间融合结果。
在一种可选的实施方式中,对所有视频片段的空间预测结果和所有视频片段的时间预测结果进行融合时,对所有视频片段的空间预测结果采用直接平均、线性加权、直接取最大值和TOP-K加权中的一种,对所有视频片段的时间预测结果采用直接平均、线性加权、直接取最大值和TOP-K加权中的一种。在一种可选的实施方式中,空间预测结果和时间预测结果均采用直接平均的融合方法,这种融合方法可以联合建模多个视频片段,并从整个视频中捕获视觉信息,提高识别效果。本公开的行为识别系统对空间预测结果和时间预测结果的融合方法不作限制。
双流融合模块550用于空间融合结果和时间融合结果进行双流融合,得到行为识别结果。
在一种可选的实施方式中,空间融合结果和时间融合结果采用加权融合进行双流融合,例如设置空间流融合结果的权重为0.6,时间流融合结果的权重为0.4。本公开的行为识别系统对双流融合的方法不作限制。
本公开还涉及一种电子设备,包括服务器、终端等。该电子设备包括:至少一个处理器;与至少一个处理器通信连接的存储器;以及与存储介质通信连接的通信组件,所述通信组件在处理器的控制下接收和发送数据;其中,存储器存储有可被至少一个处理器执行的指令,指令被至少一个处理器执行以实现上述实施例中的行为识别方法。
在一种可选的实施方式中,存储器作为一种非易失性计算机可读存储介质,可用于存储非易失性软件程序、非易失性计算机可执行程序以及模块。处理器通过运行存储在存储器中的非易失性软件程序、指令以及模块,从而执行设备的各种功能应用以及数据处理,即实现上述行为识别方法。
存储器可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储选项列表等。此外,存储器可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实施例中,存储器可选包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至外接设备。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
一个或者多个模块存储在存储器中,当被一个或者多个处理器执行时,执行上述任意方法实施例中的行为识别方法。
上述产品可执行本申请实施例所提供的行为识别方法,具备执行方法相应的功能模块和有益效果,未在本实施例中详尽描述的技术细节,可参见本申请实施例所提供的行为识别方法。
本公开还涉及一种计算机可读存储介质,用于存储计算机可读程序,所述计算机可读程序用于供计算机执行上述部分或全部的行为识别方法的实施例。
即,本领域技术人员可以理解,实现上述实施例方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
在此处所提供的说明书中,说明了大量具体细节。然而,能够理解,本公开的实施例可以在没有这些具体细节的情况下实践。在一些实例中,并未详细示出公知的方法、结构和技术,以便不模糊对本说明书的理解。
此外,本领域普通技术人员能够理解,尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征,但是不同实施例的特征的组合意味着处于本公开的范围之内并且形成不同的实施例。例如,在权利要求书中,所要求保护的实施例的任意之一都可以以任意的组合方式来使用。
本领域技术人员应理解,尽管已经参考示例性实施例描述了本公开,但是在不脱离本公开的范围的情况下,可进行各种改变并可用等同物替换其元件。另外,在不脱离本公开的实质范围的情况下,可进行许多修改以使特定情况或材料适应本公开的教导。因此,本公开不限于所公开的特定实施例,而是本公开将包括落入所附权利要求范围内的所有实施例。

Claims (34)

  1. 一种行为识别方法,其特征在于,包括:
    将视频数据截取成多个视频片段,对每个视频片段抽帧处理,得到多个帧图像,并对每个视频片段抽帧后得到的多个所述帧图像提取光流,得到每个视频片段的光流图像;
    分别对每个视频片段的帧图像和光流图像进行特征提取,得到每个视频片段的帧图像的特征图和光流图像的特征图;
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果;
    对所有视频片段的空间预测结果进行融合,得到空间融合结果,并对所有视频片段的时间预测结果进行融合,得到时间融合结果;
    对所述空间融合结果和所述时间融合结果进行双流融合,得到行为识别结果。
  2. 根据权利要求1所述的行为识别方法,其特征在于,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果,包括:
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,其中,n其中,n为正整数;
    对所述第一特征向量进行矩阵变换处理,得到第二特征向量;
    对所述第二特征向量进行时序全连接处理,得到第三特征向量;
    根据所述第三特征向量,确定每个视频片段的空间预测结果和时间预测结果。
  3. 根据权利要求2所述的行为识别方法,其特征在于,当n=1时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
    对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;
    将所述第一中间特征向量确定为第一特征向量;
    当n=2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
    对第一时序卷积向量进行时序卷积处理,得到第二时序卷积向量;
    对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;
    将所述第二中间特征向量确定为所述第一特征向量;
    当n>2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序 卷积向量;
    对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
    对所述第i时序卷积向量进行池化处理,得到第i中间特征向量;
    对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;
    对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;
    将第n中间特征向量确定为第一特征向量。
  4. 根据权利要求1所述的行为识别方法,其特征在于,所述对每个视频片段抽帧处理,包括:
    将所述每个视频片段按照一定间隔抽取帧,得到N 1帧图像,其中,间隔为每个视频片段的总帧数除以N 1,N 1为大于1的整数。
  5. 根据权利要求4所述的行为识别方法,其特征在于,对每个所述视频片段的多个所述帧图像提取光流,包括:
    对抽取出的N 1帧图像,分别根据两两相邻的两帧图像提取光流计算得到N 1-1个光流;
    复制第二帧与第一帧的光流作为第一个光流,与所述N 1-1个光流合并为N 1个光流。
  6. 根据权利要求1-5中任意一项所述的行为识别方法,其特征在于,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理通过神经网络实现,所述方法还包括:根据训练集训练所述神经网络。
  7. 根据权利要求6所述的行为识别方法,其特征在于,所述神经网络包括:
    n个Block块、Reshape层、LIF层、全连接层和Softmax层;其中,所述Block块包括级联的ConvLIF层和池化层,n为正整数,且n≥1,当n>1时,n个Block块级联。
  8. 根据权利要求7所述的行为识别方法,其特征在于,通过所述神经网络分别对每个视频片段的帧图像和光流图像的特征图进行时空卷积处理,包括:
    通过所述n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量;
    通过所述Reshape层对所述第一特征向量进行矩阵变换处理,得到第二特征向量;
    通过LIF层和所述全连接层对所述第二特征向量进行时序全连接处理,得到第三特征向量;
    根据所述第三特征向量,通过所述Softmax层确定每个视频片段的空间预测结果和时间预测结果。
  9. 根据权利要求8所述的行为识别方法,其特征在于,当n=1时,通过所述n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量,包括:
    通过所述ConvLIF层分别对帧图像和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
    通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;
    将所述第一中间特征向量确定为第一特征向量;
    当n=2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
    通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;
    通过ConvLIF层对所述第一中间特征向量进行时序卷积处理,得到第二时序卷积向量,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;
    将所述第二中间特征向量作为所述第一特征向量;
    当n>2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;
    通过所述池化层对所述第1时序卷积向量进行池化处理,得到第1中间特征向量;
    通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
    通过所述ConvLIF层对所述第i时序卷积向量进行池化处理,得到i中间特征向量;
    通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;
    通过所述ConvLIF层对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;
    将第n中间特征向量确定为第一特征向量。
  10. 根据权利要求9所述的行为识别方法,其特征在于,所述Block块还包括级联于ConvLIF层和池化层之间的BN层,
    当n=1或n=2时,通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量,包括:
    通过所述BN层对所述第一时序卷积向量进行标准化处理;
    利用池化层将标准化处理后的第一时序卷积向量进行池化处理;
    当n=2时,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量,包括:
    通过所述BN层对所述第二时序卷积向量进行标准化处理;
    利用池化层将标准化处理后的第二时序卷积向量进行池化处理;
    当n>2时,通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量,包括:
    通过所述BN层对所述第1时序卷积向量进行标准化处理;
    利用池化层将标准化处理后的第1时序卷积向量进行池化处理,以得到第2时序卷积向量;
    当n>2时,通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量,包括:
    通过所述BN层对所述第i时序卷积向量进行标准化处理;
    利用池化层将标准化处理后的第i时序卷积向量进行池化处理,以得到第i+1时序卷积向量。
  11. 根据权利要求8所述的行为识别方法,其特征在于,所述LIF层用于:
    根据t时刻的输入值X_t经过全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值；
    根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；
    根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；
    根据重置的膜电位值，确定t时刻的生物电压值；
    其中，所述t时刻的输出值F_t作为与所述LIF层级联的下一层的输入，所述t时刻的生物电压值作为计算t+1时刻的膜电位值的输入。
  12. 根据权利要求9所述的行为识别方法，其特征在于，所述ConvLIF层用于：
    根据t时刻的输入值X_t经过卷积运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值；
    根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；
    根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；
    根据重置的膜电位值，确定t时刻的生物电压值；
    其中，所述t时刻的输出值F_t作为与所述ConvLIF层级联的下一层的输入，所述t时刻的生物电压值作为计算t+1时刻的膜电位值的输入。
  13. 根据权利要求11或12所述的行为识别方法，其特征在于，所述根据t时刻的膜电位值与发射阈值V_th，确定时刻t的输出值，包括：
    若t时刻的膜电位值大于或等于发射阈值V_th，则确定所述t时刻的输出值为1；
    若t时刻的膜电位值小于发射阈值V_th，则确定所述t时刻的输出值为0。
  14. 根据权利要求11或12所述的行为识别方法，其特征在于，所述根据重置的膜电位值确定t时刻的生物电压值，包括：通过Leak激活函数对所述重置的膜电位值进行计算，确定t时刻的生物电压值。
  15. 根据权利要求1至5中任意一项所述的行为识别方法,其特征在于,对所有视频片段的空间预测结果和所有视频片段的时间预测结果进行融合时,对所有视频片段的预测结果采用直接平均、线性加权、直接取最大值和TOP-K加权中的一种。
  16. 根据权利要求1至5中任意一项所述的行为识别方法,其特征在于,所述空间融合结果和所述时间融合结果在双流融合时,将所述空间融合结果和所述时间融合结果采用加权融合。
  17. 一种行为识别系统,其特征在于,采用如权利要求1-16中任意一项所述的行为识别方法,包括:
    数据预处理模块,其用于将视频数据截取成多个视频片段,对每个视频片段抽帧处理,得到多个帧图像,并对每个视频片段的多个所述帧图像提取光流,分别得到每个视频片段的多个光流图像;
    特征提取模块,其用于分别对每个视频片段的帧图像和光流图像进行图像特征提取,得到每个视频片段的帧图像的特征图和光流图像的特征图;
    网络识别模块,其分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果;
    网络融合模块,其对所有视频片段的空间预测结果进行融合,得到空间融合结果,并对所有视频片段的时间预测结果进行融合,得到时间融合结果;
    双流融合模块,其用于对所述空间融合结果和所述时间融合结果进行双流融合,得到行为识别结果。
  18. 根据权利要求17所述的行为识别系统,其特征在于,所述网络识别模块分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理,确定每个视频片段的空间预测结果和时间预测结果,包括:
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,其中,n其中,且n为正整数;
    对所述第一特征向量进行矩阵变换处理,得到第二特征向量;
    对所述第二特征向量进行时序全连接处理,得到第三特征向量;
    根据所述第三特征向量,确定每个视频片段的空间预测结果和时间预测结果。
  19. 根据权利要求18所述的行为识别系统,其特征在于,当n=1时,所述网络识别模块分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
    对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;
    将所述第一中间特征向量确定为第一特征向量;
    当n=2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
    对第一时序卷积向量进行时序卷积处理,得到第二时序卷积向量;
    对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;
    将所述第二中间特征向量确定为所述第一特征向量;
    当n>2时,所述分别对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第1时序卷积向量;
    对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
    对所述第i时序卷积向量进行池化处理,得到第i中间特征向量;
    对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;
    对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;
    将第n中间特征向量确定为第一特征向量。
  20. 根据权利要求17所述的行为识别系统,其特征在于,所述数据预处理模块对每个视频片段抽帧处理,包括:将所述每个视频片段按照一定间隔抽取帧,得到N 1帧图像,其中,间隔为视频 片段的总帧数除以N 1,N 1为大于1的整数。
  21. 根据权利要求20所述的行为识别系统,其特征在于,所述数据预处理模块对每个所述视频片段的多个抽帧后的帧图像提取光流,包括:
    对抽取出的N 1帧图像,分别根据两两相邻的两帧图像提取光流计算得到N 1-1个光流;
    复制第二帧与第一帧的光流作为第一个光流,与所述N 1-1个光流合并为N 1个光流。
  22. 根据权利要求17-21中任意一项所述的行为识别系统,其特征在于,所述网络识别模块分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时空卷积处理通过神经网络实现,所述系统还包括:根据训练集训练所述神经网络。
  23. 根据权利要求22所述的行为识别系统,其特征在于,所述神经网络包括:n个Block块、Reshape层、LIF层、全连接层和Softmax层;其中,所述Block块包括级联的ConvLIF层和池化层,n为正整数,且n≥1,当n>1时,n个Block块级联。
  24. 根据权利要求23所述的行为识别系统,其特征在于,通过所述神经网络分别对每个视频片段的帧图像和光流图像的特征图进行时空卷积处理,包括:
    通过所述n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量;
    通过所述Reshape层对所述第一特征向量进行矩阵变换处理,得到第二特征向量;
    通过LIF层和所述全连接层对所述第二特征向量进行时序全连接处理,得到第三特征向量;
    根据所述第三特征向量,通过所述Softmax层确定每个视频片段的空间预测结果和时间预测结果。
  25. 根据权利要求24所述的行为识别系统,其特征在于,当n=1时,通过所述n个Block块对每个视频片段的帧图像和光流图像进行n次时序特征提取,得到第一特征向量,包括:
    通过所述ConvLIF层分别对帧图像和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
    通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;
    将所述第一中间特征向量确定为第一特征向量;
    当n=2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷积处理,得到第一时序卷积向量;
    通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量;
    通过ConvLIF层对所述第一中间特征向量进行时序卷积处理,得到第二时序卷积向量,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量;
    将所述第二中间特征向量作为所述第一特征向量;
    当n>2时,通过所述n个Block块对每个视频片段的帧图像的特征图和光流图像的特征图进行n次时序特征提取,得到第一特征向量,包括:
    通过所述ConvLIF层分别对每个视频片段的帧图像的特征图和光流图像的特征图进行时序卷 积处理,得到第1时序卷积向量;
    通过所述池化层对所述第1时序卷积向量进行池化处理,得到第1中间特征向量;
    通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量;
    通过所述ConvLIF层对所述第i时序卷积向量进行池化处理,得到i中间特征向量;
    通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量;
    通过所述ConvLIF层对第i+1时序卷积向量进行池化处理,得到第i+1中间特征向量,其中,i为依次取自2至n-1的正整数,直至得到第n中间特征量;
    将第n中间特征向量确定为第一特征向量。
  26. 根据权利要求25所述的行为识别系统,其特征在于,所述Block块还包括级联于ConvLIF层和池化层之间的BN层,
    当n=1或n=2时,通过所述池化层对所述第一时序卷积向量进行池化处理,得到第一中间特征向量,包括:
    通过所述BN层对所述第一时序卷积向量进行标准化处理;
    利用池化层将标准化处理后的第一时序卷积向量进行池化处理;
    当n=2时,通过池化层对所述第二时序卷积向量进行池化处理,得到第二中间特征向量,包括:
    通过所述BN层对所述第二时序卷积向量进行标准化处理;
    利用池化层将标准化处理后的第二时序卷积向量进行池化处理;
    当n>2时,通过所述池化层对所述第1时序卷积向量进行时序卷积处理,得到第2时序卷积向量,包括:
    通过所述BN层对所述第1时序卷积向量进行标准化处理;
    利用池化层将标准化处理后的第1时序卷积向量进行池化处理,以得到第2时序卷积向量;
    当n>2时,通过所述池化层对第i中间特征向量进行时序卷积处理,得到第i+1时序卷积向量,包括:
    通过所述BN层对所述第i时序卷积向量进行标准化处理;
    利用池化层将标准化处理后的第i时序卷积向量进行池化处理,以得到第i+1时序卷积向量。
  27. 根据权利要求24所述的行为识别系统,其特征在于,所述LIF层用于:
    根据t时刻的输入值X_t经过全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值；
    根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；
    根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；
    根据重置的膜电位值，确定t时刻的生物电压值；
    其中，所述t时刻的输出值F_t作为与所述LIF层级联的下一层的输入，所述t时刻的生物电压值作为计算t+1时刻的膜电位值的输入。
  28. 根据权利要求25所述的行为识别系统，其特征在于，所述ConvLIF层用于：
    根据t时刻的输入值X_t经过卷积运算或全连接运算后得到的值I_t，与t-1时刻的生物电压值，确定t时刻的膜电位值；
    根据t时刻的膜电位值与发射阈值V_th，确定t时刻的输出值F_t；
    根据t时刻的输出值F_t确定是否重置膜电位，并根据重置的电压值V_reset确定重置的膜电位值；
    根据重置的膜电位值，确定t时刻的生物电压值；
    其中，所述t时刻的输出值F_t作为与所述ConvLIF层级联的下一层的输入，所述t时刻的生物电压值作为计算t+1时刻的膜电位值的输入。
  29. 根据权利要求27或28所述的行为识别系统，其特征在于，所述根据t时刻的膜电位值与发射阈值V_th，确定时刻t的输出值，包括：
    若t时刻的膜电位值大于或等于发射阈值V_th，则确定所述t时刻的输出值为1；
    若t时刻的膜电位值小于发射阈值V_th，则确定所述t时刻的输出值为0。
  30. 根据权利要求27或28所述的行为识别系统，其特征在于，所述根据重置的膜电位值确定t时刻的生物电压值，包括：通过Leak激活函数对所述重置的膜电位值进行计算，确定t时刻的生物电压值。
  31. 根据权利要求17所述的行为识别系统,其特征在于,所述网络融合模块对所有视频片段的空间预测结果和所有视频片段的时间预测结果进行融合时,对所有视频片段的预测结果采用直接平均、线性加权、直接取最大值和TOP-K加权中的一种。
  32. 根据权利要求17所述的行为识别系统,其特征在于,所述双流融合模块对所述空间融合结果和所述时间融合结果进行双流融合时,将所述空间融合结果和所述时间融合结果采用加权融合。
  33. 一种电子设备,包括存储器和处理器,其特征在于,所述存储器用于存储一条或多条计算机指令,其中,所述一条或多条计算机指令被处理器执行以实现如权利要求1-16中任一项所述的行为识别方法。
  34. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行以实现如权利要求1-16中任一项所述的行为识别方法。
PCT/CN2021/079530 2020-03-09 2021-03-08 行为识别方法及系统、电子设备和计算机可读存储介质 WO2021180030A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/790,694 US20230042187A1 (en) 2020-03-09 2021-03-08 Behavior recognition method and system, electronic device and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010157538.9 2020-03-09
CN202010157538.9A CN113378600B (zh) 2020-03-09 2020-03-09 一种行为识别方法及系统

Publications (1)

Publication Number Publication Date
WO2021180030A1 true WO2021180030A1 (zh) 2021-09-16

Family

ID=77568439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/079530 WO2021180030A1 (zh) 2020-03-09 2021-03-08 行为识别方法及系统、电子设备和计算机可读存储介质

Country Status (3)

Country Link
US (1) US20230042187A1 (zh)
CN (1) CN113378600B (zh)
WO (1) WO2021180030A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339403A (zh) * 2021-12-31 2022-04-12 西安交通大学 一种视频动作片段生成方法、系统、设备及可读存储介质
CN114842554A (zh) * 2022-04-22 2022-08-02 北京昭衍新药研究中心股份有限公司 一种基于局部和全局时空特征的群体猴子动作识别方法
CN114973120A (zh) * 2022-04-14 2022-08-30 山东大学 一种基于多维传感数据与监控视频多模异构融合的行为识别方法及系统
CN115862151A (zh) * 2023-02-14 2023-03-28 福建中医药大学 基于游戏预测老年人反应能力的数据处理系统及方法
CN114677704B (zh) * 2022-02-23 2024-03-26 西北大学 一种基于三维卷积的时空特征多层次融合的行为识别方法

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332670A (zh) * 2021-10-15 2022-04-12 腾讯科技(深圳)有限公司 视频行为识别方法、装置、计算机设备和存储介质
CN115171221B (zh) * 2022-09-06 2022-12-06 上海齐感电子信息科技有限公司 动作识别方法及动作识别系统
CN117523669A (zh) * 2023-11-17 2024-02-06 中国科学院自动化研究所 手势识别方法、装置、电子设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108492319A (zh) * 2018-03-09 2018-09-04 西安电子科技大学 基于深度全卷积神经网络的运动目标检测方法
CN109377555A (zh) * 2018-11-14 2019-02-22 江苏科技大学 自主水下机器人前景视场三维重建目标特征提取识别方法
CN109711338A (zh) * 2018-12-26 2019-05-03 上海交通大学 利用光流指导特征融合的物体实例分割方法
CN110826447A (zh) * 2019-10-29 2020-02-21 北京工商大学 一种基于注意力机制的餐厅后厨人员行为识别方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170132785A1 (en) * 2015-11-09 2017-05-11 Xerox Corporation Method and system for evaluating the quality of a surgical procedure from in-vivo video

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108492319A (zh) * 2018-03-09 2018-09-04 西安电子科技大学 基于深度全卷积神经网络的运动目标检测方法
CN109377555A (zh) * 2018-11-14 2019-02-22 江苏科技大学 自主水下机器人前景视场三维重建目标特征提取识别方法
CN109711338A (zh) * 2018-12-26 2019-05-03 上海交通大学 利用光流指导特征融合的物体实例分割方法
CN110826447A (zh) * 2019-10-29 2020-02-21 北京工商大学 一种基于注意力机制的餐厅后厨人员行为识别方法

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339403A (zh) * 2021-12-31 2022-04-12 西安交通大学 一种视频动作片段生成方法、系统、设备及可读存储介质
CN114339403B (zh) * 2021-12-31 2023-03-28 西安交通大学 一种视频动作片段生成方法、系统、设备及可读存储介质
CN114677704B (zh) * 2022-02-23 2024-03-26 西北大学 一种基于三维卷积的时空特征多层次融合的行为识别方法
CN114973120A (zh) * 2022-04-14 2022-08-30 山东大学 一种基于多维传感数据与监控视频多模异构融合的行为识别方法及系统
CN114973120B (zh) * 2022-04-14 2024-03-12 山东大学 一种基于多维传感数据与监控视频多模异构融合的行为识别方法及系统
CN114842554A (zh) * 2022-04-22 2022-08-02 北京昭衍新药研究中心股份有限公司 一种基于局部和全局时空特征的群体猴子动作识别方法
CN115862151A (zh) * 2023-02-14 2023-03-28 福建中医药大学 基于游戏预测老年人反应能力的数据处理系统及方法

Also Published As

Publication number Publication date
CN113378600B (zh) 2023-12-29
US20230042187A1 (en) 2023-02-09
CN113378600A (zh) 2021-09-10

Similar Documents

Publication Publication Date Title
WO2021180030A1 (zh) 行为识别方法及系统、电子设备和计算机可读存储介质
Xiong et al. From open set to closed set: Counting objects by spatial divide-and-conquer
Liu et al. Teinet: Towards an efficient architecture for video recognition
Jia et al. Segment, magnify and reiterate: Detecting camouflaged objects the hard way
Li et al. Unsupervised learning of view-invariant action representations
Wan et al. Residual regression with semantic prior for crowd counting
WO2022111506A1 (zh) 视频动作识别方法、装置、电子设备和存储介质
CN110555387B (zh) 骨架序列中基于局部关节点轨迹时空卷的行为识别方法
CN110717411A (zh) 一种基于深层特征融合的行人重识别方法
CN111639564B (zh) 一种基于多注意力异构网络的视频行人重识别方法
CN113688723A (zh) 一种基于改进YOLOv5的红外图像行人目标检测方法
CN109902601B (zh) 一种结合卷积网络和递归网络的视频目标检测方法
CN112070044B (zh) 一种视频物体分类方法及装置
CN112149459A (zh) 一种基于交叉注意力机制的视频显著性物体检测模型及系统
CN114494981B (zh) 一种基于多层次运动建模的动作视频分类方法及系统
CN113239869B (zh) 基于关键帧序列和行为信息的两阶段行为识别方法及系统
CN111310609B (zh) 基于时序信息和局部特征相似性的视频目标检测方法
CN112801019B (zh) 基于合成数据消除无监督车辆再识别偏差的方法及系统
CN111079507B (zh) 一种行为识别方法及装置、计算机装置及可读存储介质
Liao et al. A deep ordinal distortion estimation approach for distortion rectification
Yang et al. Counting crowds using a scale-distribution-aware network and adaptive human-shaped kernel
Zhang et al. Modeling long-and short-term temporal context for video object detection
CN115311504A (zh) 一种基于注意力重定位的弱监督定位方法和装置
CN111553337A (zh) 一种基于改进锚框的高光谱多目标检测方法
CN115410030A (zh) 目标检测方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21768285

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21768285

Country of ref document: EP

Kind code of ref document: A1