WO2024120125A1 - Behavior recognition method, electronic device and computer-readable storage medium
- Publication number
- WO2024120125A1 (PCT/CN2023/131344)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frequency domain
- behavior recognition
- feature
- data
- features
- Prior art date
Classifications
- G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/08: Learning methods
- G06V10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806: Fusion of extracted features
- G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/40: Scenes; Scene-specific elements in video content
- G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/20: Movements or behaviour, e.g. gesture recognition
Definitions
- the present application relates to the field of computer vision technology, and in particular to a behavior recognition method, an electronic device, and a computer-readable storage medium.
- the video-based behavior recognition method mainly explores how to perceive the action changes of one or even multiple targets in a continuous video stream, and then moves from summarizing continuous actions to judging behavior.
- the video-based behavior recognition method has become an important technology in computer vision, which has widely promoted the development of artificial intelligence.
- a widely used method for behavior recognition is a dual-stream neural network, which divides the video into two parts, space and time, and feeds the RGB image and optical flow image into two neural networks respectively, and fuses them to obtain the final classification result.
- Another solution is a three-dimensional convolutional neural network, which optimizes the neural network into a three-dimensional convolutional neural network for video, so as to capture more time and space information.
- in the first solution, additional optical flow is added to the extracted RGB images. Although this can help improve the performance of video behavior recognition, the calculation cost of optical flow is high, which greatly reduces the efficiency of behavior recognition.
- although the three-dimensional convolutional neural network of the second solution can achieve relatively good recognition results, its amount of data calculation is much larger than that of a two-dimensional convolutional neural network, which greatly occupies computing resources and reduces the efficiency of behavior recognition. Therefore, how to improve the efficiency of behavior recognition and reduce the amount of data calculation while ensuring the accuracy of behavior recognition is still a major challenge in the field of computer vision.
- the present application provides a behavior recognition method, comprising: obtaining a compressed video stream, determining frequency domain data corresponding to the compressed video stream, and extracting frequency domain I frame data and frequency domain P frame data from the frequency domain data; inputting the frequency domain I frame data into a slow path SI of a two-dimensional convolutional neural network for static semantic processing to obtain sparse spatial features, and inputting the frequency domain P frame data into a fast path FP of the two-dimensional convolutional neural network for motion information processing to obtain fast motion features; extracting significant motion features from the fast motion features in the fast path FP, integrating the significant motion features into the slow path SI based on the temporal attention dimension, and performing horizontal feature fusion with the sparse spatial features to obtain a first spatiotemporal aggregation feature; and
- longitudinally fusing the first spatiotemporal aggregation feature with the fast motion features to obtain a second spatiotemporal aggregation feature, and performing behavior recognition of the compressed video stream based on the second spatiotemporal aggregation feature.
- the present application also provides an electronic device, which includes: a memory, a processor, and a behavior recognition program stored in the memory and executable on the processor, wherein the behavior recognition program implements the above-mentioned behavior recognition method when executed by the processor.
- the present application also provides a computer-readable storage medium on which a behavior recognition program is stored; when the behavior recognition program is executed by a processor, the behavior recognition method described above is implemented.
- FIG1 is a flow chart of a first embodiment of a behavior recognition method of the present application.
- FIG2 is a flow chart of a second embodiment of the behavior recognition method of the present application.
- FIG3 is a flow chart of a third embodiment of the behavior recognition method of the present application.
- FIG4 is a flow chart of a fourth embodiment of the behavior recognition method of the present application.
- FIG5 is a schematic diagram of a process for obtaining frequency domain data in an embodiment of the present application.
- FIG6 is a schematic diagram of a process of feature fusion performed by a TDACs module according to an embodiment of the present application.
- FIG7 is a flowchart of a behavior recognition method according to a specific embodiment of the present application.
- FIG8 is a structural diagram of a F2D-SIFP network according to an embodiment of the present application.
- FIG9 is a schematic diagram of a process of capturing long-term and short-term features by the AME module according to an embodiment of the present application.
- FIG10 is a schematic diagram of an interface of a behavior recognition result according to an embodiment of the present application.
- FIG11 is a comparison diagram of experimental data of F2D-SIFPNet and other compressed domain behavior recognition networks according to an embodiment of the present application.
- FIG12 is an experimental data diagram of an embodiment of the present application verifying the effectiveness of the TDACs fusion method.
- FIG13 is a comparison diagram of experimental data of the AME module and other time modeling modules in an embodiment of the present application.
- FIG14 is a GradCam diagram of the behavior recognition method according to an embodiment of the present application.
- FIG15 is an experimental data diagram of an embodiment of the present application verifying the effectiveness of F2D-SIFPNet in capturing motion information in long-duration videos;
- FIG. 16 is a schematic diagram of the hardware structure of the electronic device involved in the embodiment of the present application.
- the terms "connection" and "fixation" should be understood broadly: a connection can be a fixed connection, a detachable connection, or an integral connection; it can be a mechanical connection or an electrical connection; it can be a direct connection or an indirect connection through an intermediate medium; and it can be the internal connection of two elements or the interaction relationship between two elements, unless otherwise clearly defined.
- an embodiment of the present application provides a behavior recognition method, with reference to FIG1 , which is a flow chart of an embodiment of a behavior recognition method of the present application.
- the behavior recognition method includes:
- Step S10 obtaining a compressed video stream, determining frequency domain data corresponding to the compressed video stream, and extracting frequency domain I frame data and frequency domain P frame data from the frequency domain data;
- the frequency domain data corresponding to the compressed video stream includes frequency domain I frame data and frequency domain P frame data.
- the frequency domain I frame data includes the DCT coefficients of the I frame image.
- the frequency domain P frame data includes the inter-frame motion vector (MV, Motion Vectors), and the DCT residual coefficient (R, Residuals) of the P frame image.
- In step S10 of the above embodiment, the step of determining the frequency domain data corresponding to the compressed video stream includes:
- Step S11 entropy decoding the compressed video stream to obtain inter-frame motion vectors and intermediate compressed code stream data
- the extraction of inter-frame motion vectors is very important, as they are used to establish the connection between actions.
- Step S12 reordering and inverse quantization are performed on the intermediate compressed code stream data in sequence to obtain DCT coefficients of the I frame image and DCT residual coefficients of the P frame image;
- Step S13 connecting the inter-frame motion vector and the DCT residual coefficient to obtain frequency domain P frame data corresponding to the frequency domain data, and determining frequency domain I frame data corresponding to the frequency domain data according to the DCT coefficient;
- Step S14 Use the frequency domain I frame data and the frequency domain P frame data as frequency domain data corresponding to the compressed video stream.
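- As a minimal illustration of steps S13 and S14 above, the following Python sketch concatenates the inter-frame motion vector channels with the low-frequency residual DCT channels to form the frequency domain P frame input; the helper name and the 2- and 24-channel shapes (which match the dimensions described later in this document) are illustrative assumptions.

```python
import numpy as np

def build_p_frame_input(mv: np.ndarray, residual_dct: np.ndarray) -> np.ndarray:
    """Connect the (2, H/8, W/8) motion-vector channels with the (24, H/8, W/8)
    low-frequency DCT residual channels into a single (26, H/8, W/8) P-frame input."""
    return np.concatenate([mv, residual_dct], axis=0)

# Example shapes for a 224x224 frame: MV (2, 28, 28) + residual DCT (24, 28, 28) -> (26, 28, 28).
p_frame_input = build_p_frame_input(np.zeros((2, 28, 28)), np.zeros((24, 28, 28)))
```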
- FIG. 5 is a flow chart of obtaining frequency domain data in an embodiment of the present application.
- Video compression standards such as H.263, H.264 and MPEG-4 share the same encoding process, including DCT (Discrete Cosine Transform), quantization, zigzag ordering and entropy coding.
- the reverse semi-decoding process FPDec is composed of entropy decoding, zigzag reordering and inverse quantization; it decodes the compressed stream into frequency domain DCT coefficients (the DCT coefficients of I frame images and the DCT residual coefficients of P frame images) and motion vectors (i.e., inter-frame motion vectors), which serve as inputs to the network.
- FFmpeg can be used to achieve the acquisition of frequency domain I frame data, frequency domain residual R data, and motion vector MV data.
- entropy decoding converts the bit stream into a series of frequency domain data of the video sequence.
- zigzag reordering restores the order of frequency domain DCT coefficients by arranging the DC, low frequency and high frequency components of each block from top to bottom and from left to right.
- inverse quantization obtains the final frequency domain DCT coefficients by calculating the quantizer scale parameter.
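- For illustration, the following sketch shows the zigzag reordering and inverse quantization steps just described on a single 8×8 block; the uniform quantizer scale is a simplifying assumption, since real codecs use per-coefficient quantization matrices and codec-specific scaling rules.

```python
import numpy as np

# Zigzag visit order for an 8x8 block: DC first, then increasingly high frequencies.
ZIGZAG = sorted(range(64),
                key=lambda i: (i // 8 + i % 8,
                               (i // 8) if (i // 8 + i % 8) % 2 else -(i // 8)))

def zigzag_reorder(coeff_stream: np.ndarray) -> np.ndarray:
    """Restore a 64-element entropy-decoded coefficient run to its 8x8 block layout."""
    block = np.zeros(64, dtype=coeff_stream.dtype)
    block[ZIGZAG] = coeff_stream
    return block.reshape(8, 8)

def inverse_quantize(block: np.ndarray, quant_scale: float) -> np.ndarray:
    """Recover frequency domain DCT coefficients with a simplified uniform quantizer scale."""
    return block * quant_scale
```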
- step S20 is performed to input the frequency domain I frame data into the slow path SI of the two-dimensional convolutional neural network for static semantic processing to obtain sparse spatial features, and input the frequency domain P frame data into the fast path FP of the two-dimensional convolutional neural network for motion information processing to obtain fast motion features;
- the two-dimensional convolutional neural network is a two-dimensional convolutional neural network that is trained and converged through a large amount of frequency domain data.
- both the slow path SI and the fast path FP adjust the traditional two-dimensional convolutional neural network, that is, the convolution layer and pooling layer of the first node are removed, and a 1 ⁇ 1 2D convolution is added to adapt the dimension of the frequency domain input channel to the dimension of the res2 layer channel in the path.
- the two-dimensional convolutional neural network includes a slow path SI and a fast path FP, wherein the slow path SI receives I frame data at a low frame rate for static semantic processing to obtain sparse spatial features, and the fast path FP processes the motion information contained in the frequency domain P frame data at a high frame rate to obtain fast motion features.
- the frequency domain I frame data has sparse spatial features for sparse spatial representation, and the frequency domain I frame data can be used to represent the slow motion information of the video.
- the frequency domain P frame data has fast motion features for fast motion representation, and can be used to characterize the fast motion information of the video. Therefore, the input frequency domain I frame data can be processed with static semantics through a pre-trained, converged two-dimensional convolutional neural network to obtain the sparse spatial features represented by the frequency domain I frame data, and the input frequency domain P frame data can be processed with motion information to obtain the fast motion features represented by the frequency domain P frame data.
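- The following sketch illustrates one way such a pathway could be built in PyTorch, assuming a torchvision ResNet-50 backbone: the first convolution and pooling layers are dropped and a 1×1 2D convolution adapts the frequency domain input channels to the channel dimension expected by the res2 stage. The channel widths and class count are assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class FrequencyPathway(nn.Module):
    """Hedged sketch of one path (SI or FP): a 2D ResNet-50 whose first conv and
    pooling layers are removed and replaced by a 1x1 2D convolution that adapts
    the frequency domain input channels (e.g. 24 for I-frame DCT data, 26 for
    MV+residual P-frame data) to the channel dimension expected by res2."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.adapter = nn.Conv2d(in_channels, 64, kernel_size=1)  # 1x1 conv replacing conv1/maxpool
        self.stages = nn.Sequential(backbone.layer1, backbone.layer2,
                                    backbone.layer3, backbone.layer4)  # res2..res5
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(backbone.fc.in_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C_in, H/8, W/8) frequency domain frames, with frames folded into the batch.
        feat = self.stages(self.adapter(x))
        return self.fc(self.pool(feat).flatten(1))

# Example: slow path on 24-channel I-frame DCT inputs (101 classes as on UCF101).
si_path = FrequencyPathway(in_channels=24, num_classes=101)
logits = si_path(torch.randn(4, 24, 28, 28))
```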
- Step S30 extracting significant motion features from the fast motion features in the fast path FP, integrating the significant motion features into the slow path SI based on the temporal attention dimension, and performing horizontal feature fusion with the sparse spatial features to obtain a first spatiotemporal aggregation feature;
- this embodiment also proposes a lateral connection between the slow path SI and the fast path FP to integrate spatial and temporal frequency domain features, and by adding a temporal attention connection module (i.e., TDAC, Temporal-Dimension Attention Connections) connecting the two paths of the slow path SI and the fast path FP, it is possible to extract significant motion features from the fast motion features in the fast path FP based on the temporal attention dimension, integrate the significant motion features into the slow path SI and the sparse spatial features for lateral feature fusion, and obtain the first spatiotemporal aggregation features.
- the temporal attention connection module TDAC can integrate significant motion information in the fast path FP and introduce it into the slow path SI path, so that the slow path SI can connect the motion features of the P frame image while retaining the spatial semantic features of the I frame image, thereby realizing the aggregation of spatiotemporal semantic information.
- this embodiment proposes a frequency domain Slow-I-Fast-P network (Frequency Slow-I-Fast-P, FSIFP).
- FSIFP consists of a slow path SI and a fast path FP, as well as a new temporal attention connection (TDAC).
- Both the slow path SI and the fast path FP in FSIFP can be divided into 4 layers: res2 to res5. After res2, res3 and res4, TDAC fuses the features from the fast path FP to the SI path, and the fusion mode is cascade (as shown in Figure 8).
- Step S40 vertically fuse the first spatiotemporal aggregation feature with the fast motion feature to obtain a second spatiotemporal aggregation feature, and perform behavior recognition on the compressed video stream based on the second spatiotemporal aggregation feature.
- the second spatiotemporal aggregation feature can be obtained by longitudinally fusing the first spatiotemporal aggregation feature output by the slow path SI in the two-dimensional convolutional neural network with the fast motion feature output by the fast path FP. That is, in order to further strengthen the information sharing and collaborative training between the slow path SI and the fast path FP, a longitudinal connection between the slow path SI and the fast path FP is proposed to further integrate the spatial and temporal frequency domain features. Then, behavior recognition of the compressed video stream is performed based on the second spatiotemporal aggregation feature, so as to improve the efficiency of behavior recognition and reduce the amount of data calculation while ensuring the accuracy of behavior recognition.
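- A structural sketch of how the horizontal (TDAC cascade) fusion of step S30 and the vertical (pooled concatenation) fusion of step S40 could fit together; the stage modules, TDAC modules and feature dimensions are passed in as assumptions and this is not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class F2DSIFPHead(nn.Module):
    """Structural sketch: res2..res5 stages of the SI and FP paths are treated as
    black boxes over (N, C, T, H, W) tensors; after res2-res4 a TDAC module maps
    FP features to the SI frame rate and is concatenated (cascaded) onto the SI
    channels; the res5 outputs of both paths are pooled, concatenated vertically
    and classified. Stage modules, TDACs and feat_dim must be mutually compatible."""
    def __init__(self, si_stages, fp_stages, tdacs, feat_dim, num_classes):
        super().__init__()
        self.si_stages = nn.ModuleList(si_stages)  # res2..res5 of the slow path SI
        self.fp_stages = nn.ModuleList(fp_stages)  # res2..res5 of the fast path FP
        self.tdacs = nn.ModuleList(tdacs)          # TDAC after res2, res3 and res4
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, i_frame_feat, p_frame_feat):
        si, fp = i_frame_feat, p_frame_feat
        for idx, (si_stage, fp_stage) in enumerate(zip(self.si_stages, self.fp_stages)):
            si, fp = si_stage(si), fp_stage(fp)
            if idx < len(self.tdacs):
                si = torch.cat([si, self.tdacs[idx](fp)], dim=1)  # horizontal (cascade) fusion
        pooled = torch.cat([si.mean(dim=(2, 3, 4)), fp.mean(dim=(2, 3, 4))], dim=1)
        return self.fc(pooled)                                    # vertical fusion + classification
```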
- the present application proposes a behavior recognition method, an electronic device and a computer-readable storage medium.
- the technical solution of the embodiment of the present application is to obtain a compressed video stream, determine the frequency domain data corresponding to the compressed video stream, and extract the frequency domain I frame data and the frequency domain P frame data in the frequency domain data; input the frequency domain I frame data into the slow path SI of the two-dimensional convolutional neural network for static semantic processing to obtain sparse spatial features, and input the frequency domain P frame data into the fast path FP of the two-dimensional convolutional neural network for motion information processing to obtain fast motion features; extract the significant motion features in the fast motion features in the fast path FP, integrate the significant motion features into the slow path SI based on the temporal attention dimension and perform horizontal feature fusion with the sparse spatial features to obtain a first spatiotemporal aggregation feature; and perform vertical feature fusion of the first spatiotemporal aggregation feature and the fast motion features to obtain a second spatiotemporal aggregation feature, and perform behavior recognition of the compressed video stream based on the second spatiotemporal aggregation feature.
- obtaining RGB images requires decoding data from the frequency domain (DCT coefficients) to the spatial domain (RGB pixels), which increases the time spent on data preprocessing.
- although the three-dimensional convolutional neural network of the second solution can achieve relatively good recognition results, its amount of data calculation is much larger than that of a two-dimensional convolutional neural network, which greatly occupies computing resources and reduces the efficiency of behavior recognition.
- in contrast, the embodiment of the present application divides the two-dimensional convolutional neural network into a slow path SI and a fast path FP, and uses the two-dimensional convolutional neural network to directly and effectively model the frequency domain information. Compared with a three-dimensional convolutional neural network, the computing cost is reduced to a certain extent, the amount of data calculation is reduced, and the efficiency of behavior recognition is improved.
- the feature extraction is directly performed on the frequency domain information, thereby avoiding the process of decoding the data from the DCT (Discrete Cosine Transform) coefficients in the frequency domain into RGB (red, green, blue) data in the spatial domain.
- the significant motion information in the fast path FP can be integrated and introduced into the slow path SI path, so that the slow path SI can not only connect the motion features of the P frame image, but also retain the spatial semantic features of the I frame image, realize the aggregation of spatiotemporal semantic information, and through the feature-level fusion in the middle stage of the network, it can more effectively extract feature information, selectively share the key frame significant information contained in the fast path FP, and improve the accuracy and efficiency of behavior recognition.
- after the second spatiotemporal aggregation feature is obtained, the information sharing and collaborative training between the slow path SI and the fast path FP are strengthened, and the spatial and temporal frequency domain features are further integrated. Frequency domain data is obtained from the compressed video to significantly reduce data redundancy and fully describe the spatiotemporal discrimination information of human behavior; the frequency domain discrimination information and motion clues are then extracted by the new FSIFP, and the spatiotemporal complementary features are fully learned, so as to improve the efficiency of behavior recognition and reduce the amount of data calculation while ensuring the accuracy of behavior recognition.
- the behavior recognition method of this embodiment provides a frequency domain Slow-I-Fast-P network (Frequency Slow-I-Fast-P, FSIFP), which is divided into a slow path SI and a fast path FP, and uses a two-dimensional convolutional neural network to directly effectively model the frequency domain information.
- it can greatly reduce the computational cost.
- because it directly extracts features from the frequency domain information, it can avoid the process of decoding the data from the DCT coefficients in the frequency domain to the RGB data in the spatial domain, simplifying the data preprocessing process and improving the speed of behavior recognition.
- the information of the fast path FP and the slow path SI is fused, and the key frame salient information contained in the fast path FP is selectively shared to achieve feature-level fusion in the middle stage of the network, which can more effectively extract feature information and improve the accuracy of behavior recognition.
- the behavior recognition method of the embodiment of the present application can be implemented in a variety of monitoring environments that require early warning, and can quickly identify human actions and behaviors from compressed video frames collected by high-definition monitoring cameras to achieve the effect of real-time monitoring and early warning.
- the following is a brief introduction to several common compressed domain behavior recognition scenarios:
- embodiments of the present application can also be implemented in network supervision and online short video classification to identify MPEG4-encoded offline short videos uploaded by users on the network.
- In network supervision, when illegal actions are identified in a short video, the video can be deleted or the user can be prohibited from uploading, so as to achieve the goal of network supervision.
- the hardware environment of the behavior recognition method of the embodiment of the present application can be the configuration information shown in Table (1).
- the embodiment of the present application can obtain real-time behavior frequency domain data through a camera, input it into the compressed domain fast behavior recognition network proposed in the embodiment of the present application to perform behavior recognition, and finally output and display the behavior category. In some situations, this embodiment can not only greatly reduce the computing cost, but also improve the accuracy of behavior recognition to a certain extent. Since the motion vectors and residuals extracted by the embodiment of the present application and the decoded I frames are all processed in the compressed domain, a partial decoding operation is used to avoid complete decoding and reconstruction of the video, which can improve the processing efficiency of the system and facilitate real-time application.
- the embodiment of the present application utilizes the temporal correlation and spatial correlation of the motion vectors of the image frames in the compressed video, thereby improving the accuracy of the action recognition completed by the embodiment of the present application using several image frames.
- the time attention dimension includes a static feature branch and a time attention weight branch.
- the step of integrating the significant motion feature into the slow path SI based on the time attention dimension and performing horizontal feature fusion with the sparse spatial feature to obtain the first spatiotemporal aggregation feature includes:
- Step S31 integrating the significant motion features into the slow path SI
- Step S32 stimulating the first time dimension information corresponding to the significant motion feature through the time attention weight branch, and stimulating the second time dimension information corresponding to the sparse spatial feature in the slow path SI through the static feature branch;
- Step S33 performing horizontal feature fusion on the first time dimension information, the second time dimension information and the sparse spatial feature to obtain a first spatiotemporal aggregation feature.
- In order to help understand the present application, a specific embodiment is described below for illustration; please refer to FIG6:
- FIG6 is a flow chart of feature fusion performed by the TDACs module of an embodiment of the present application.
- the TDACs module is a temporal-dimensional attention connection (TDAC or TDACs) module.
- the TDAC module is an attention module based on the time dimension, which consists of a static feature branch S and a temporal attention weight branch A.
- the motion information of the key frame is extracted by stimulating time-sensitive features from the fast path FP.
- the static feature branch S effectively stimulates the time information through 3D convolution; the temporal attention weight branch A uses two 2D convolutions to compress and stimulate the features of the time dimension.
- the input of the TDAC module is a feature map of size C × T_FP × H × W, where C represents the number of input channels, T_FP is the number of input frames of the fast path FP, H is the input feature map height, and W is the input feature map width.
- the static feature branch S maps this input to a feature of size C × T_SI × H × W, where T_SI is the number of frames input to the SI path. This mapping is performed by K_4, a 3D convolution layer with a convolution kernel size of 7 × 1 × 1.
- Branch A generates a weight for each frame in the fast path FP according to motion saliency, which focuses more on temporal modeling rather than spatial features. Therefore, global average pooling is first used to compress the spatial information.
- ⁇ is a hyperparameter representing the squeezing of the time dimension.
- ⁇ is set to 2.
- the TDAC module can integrate the significant motion information in the fast path FP into the SI path, so that the SI path retains the spatial semantic features of the I frame while fusing the motion features of the P frame, thus realizing the aggregation of spatiotemporal semantic information.
- This embodiment integrates the significant motion features into the slow path SI, stimulates the first time dimension information corresponding to the significant motion features through the time attention weight branch, and stimulates the second time dimension information corresponding to the sparse spatial features in the slow path SI through the static feature branch. The first time dimension information, the second time dimension information and the sparse spatial features are then horizontally fused to obtain the first spatiotemporal aggregation feature. In this way, the significant motion information in the fast path FP is integrated into the SI path, so that the SI path retains the spatial semantic features of the I frame while fusing the motion features of the P frame, thereby realizing the aggregation of spatiotemporal semantic information and improving the accuracy of behavior recognition for compressed videos.
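- A minimal sketch of such a temporal-dimension attention connection, assuming sigmoid gating, per-frame channel weights for branch A, and a strided 7×1×1 3D convolution for branch S; these choices are assumptions consistent with the description above, not the patent's exact formulas.

```python
import torch
import torch.nn as nn

class TDAC(nn.Module):
    """Hedged sketch of a temporal-dimension attention connection.
    fp_feat: (N, C, T_fp, H, W) fast-path features. Branch A pools away space,
    squeezes the time dimension by beta and excites it back into per-frame weights;
    branch S is a 7x1x1 3D convolution that maps T_fp frames to T_si frames
    (assumes T_fp is divisible by T_si). The output is meant to be concatenated
    (cascaded) onto the SI-path features."""
    def __init__(self, channels, t_fp, t_si, beta=2):
        super().__init__()
        self.branch_s = nn.Conv3d(channels, channels, kernel_size=(7, 1, 1),
                                  stride=(t_fp // t_si, 1, 1), padding=(3, 0, 0))
        self.squeeze = nn.Conv2d(t_fp, t_fp // beta, kernel_size=1)   # branch A, first 2D conv
        self.excite = nn.Conv2d(t_fp // beta, t_fp, kernel_size=1)    # branch A, second 2D conv

    def forward(self, fp_feat):
        n, c, t, h, w = fp_feat.shape
        # Branch A: global average pooling over space, then squeeze/excite the time dimension.
        pooled = fp_feat.mean(dim=(3, 4)).permute(0, 2, 1).unsqueeze(-1)         # (N, T, C, 1)
        weights = torch.sigmoid(self.excite(torch.relu(self.squeeze(pooled))))   # (N, T, C, 1)
        weights = weights.squeeze(-1).permute(0, 2, 1).reshape(n, c, t, 1, 1)
        # Branch S: temporal 3D convolution on the motion-weighted features -> T_si frames.
        return self.branch_s(fp_feat * weights)
```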
- In step S40 of the above embodiment, the step of performing behavior recognition of the compressed video stream according to the second spatiotemporal aggregation feature includes:
- Step S41 generating an adaptive convolution kernel based on the long-term temporal features corresponding to the compressed video stream
- Step S42 according to the adaptive convolution kernel, capturing the long-term motion clues in the second spatiotemporal aggregation features to obtain the long-term motion features corresponding to the compressed video stream, and according to the adaptive convolution kernel, capturing the short-term motion clues in the second spatiotemporal aggregation features to obtain the short-term motion features corresponding to the compressed video stream;
- Step S43 performing behavior recognition on the compressed video stream based on the long-term motion features and the short-term motion features.
- this embodiment provides an Adaptive Motion Excitation (AME) module, which effectively captures long-term and short-term motion changes.
- the AME module adaptively captures motion changes and can be easily embedded in a standard ResNet block for effective temporal modeling. Specifically, the AME module is embedded in each ResNet block from res2 to res5 of the slow path SI and the fast path FP. It is specifically inserted into the residual path after the first 1 ⁇ 1 Conv2D layer. Finally, the fully connected layer outputs a classification score based on the cascaded res5 features to obtain the recognition result.
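- The following sketch shows one way such an excitation module could be inserted into a standard torchvision ResNet bottleneck right after the first 1×1 Conv2D, as described above; the wrapper class and its constructor argument are illustrative assumptions, and the excitation module itself (e.g. an AME block) is sketched further below.

```python
import torch.nn as nn
from torchvision.models.resnet import Bottleneck

class ExcitedBottleneck(Bottleneck):
    """Sketch: a standard torchvision ResNet Bottleneck with a temporal excitation
    module inserted into the residual path right after the first 1x1 Conv2D."""
    def __init__(self, *args, excitation: nn.Module = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.excitation = excitation if excitation is not None else nn.Identity()

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))  # first 1x1 Conv2D of the bottleneck
        out = self.excitation(out)                # motion excitation inserted here
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        return self.relu(out + identity)
```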
- the AME module is configured to: generate an adaptive convolution kernel based on the long-term temporal features corresponding to the compressed video stream; capture the long-term motion clues in the second spatiotemporal aggregation features according to the adaptive convolution kernel to obtain the long-term motion features corresponding to the compressed video stream; capture the short-term motion clues in the second spatiotemporal aggregation features according to the adaptive convolution kernel to obtain the short-term motion features corresponding to the compressed video stream; and perform behavior recognition of the compressed video stream based on the long-term motion features and the short-term motion features.
- In this way, the embodiment of the present application adds an AME module that generates an adaptive convolution kernel according to global information and extracts the short-term motion features after the channel shift, thereby improving the accuracy of behavior recognition by simultaneously capturing long-term and short-term motion changes.
- the step of performing behavior recognition on the compressed video stream according to the long-term motion feature and the short-term motion feature includes:
- Step A10 determining a long-term behavior feature according to the long-term motion feature, and determining a short-term behavior feature according to the short-term motion feature;
- Step A20 fusing and analyzing the long-term behavior features and the short-term behavior features to obtain key behavior features corresponding to the compressed video stream;
- Step A30 performing behavior recognition on the compressed video stream based on the key behavior features.
- the AME module proposed in this embodiment generates an adaptive convolution kernel based on long-term temporal features for dynamic aggregation of short motion features between frames.
- Figure 9 is a schematic diagram of the process of capturing long-term and short-term features by the AME module of an embodiment of the present application.
- the right branch ST performs temporal convolution in the dimension of time to capture short-term motion features from adjacent frames
- the left branch LT uses the same compression and excitation mechanism as TDAC to generate a dynamic convolution kernel for each video clip.
- the dynamic convolution kernel adaptively extracts the unique long-term motion information of each video, and the two branches, the right branch ST and the left branch LT, realize the fusion of long-term and short-term motion features through convolution.
- This embodiment completely learns significant frequency components through FSIFP and AME, where FSIFP effectively simulates slow spatial features and fast temporal changes at the same time, and AME generates adaptive convolution kernels for capturing long-term and short-term motion clues.
- the overall network, which combines FSIFP, TDACs and AME, is referred to as F2D-SIFPNet (Frequency 2D Slow-I-Fast-P Network).
- the AME module in this implementation is different from methods that only model short-term motion information, such as the ME (Motion Excitation) method, which fills the last time dimension of the network's motion features with zeros to match the dimension of the features.
- AME performs temporal convolution on the T (time) dimensional features to capture motion relationships.
- the AME module generates an adaptive convolution kernel based on long-term features for dynamic aggregation of inter-frame motion features.
- the right branch ST models short-term motion features from adjacent frames
- the left branch LT uses the same squeeze-excitation mechanism as TDAC to generate a dynamic convolution kernel for each video clip.
- the dynamic convolution kernel adaptively extracts long-term motion information unique to each video.
- the input of the AME module is a feature map of size C × T × H × W, where C represents the number of channels of the input feature map, T is the number of input frames, H is the height of the input feature map, and W is the width of the input feature map.
- this embodiment obtains an adaptive convolution kernel of size K from the long-term features through the squeeze-and-excitation operations of the left branch LT, where:
- K is a hyperparameter representing the convolution kernel size. In one embodiment, K is set to 3.
- ⁇ is a hyperparameter representing channel dimension squeezing. In one embodiment, ⁇ is set to 8.
- K_5 is a 2D convolutional layer. Then, after performing the operations of equations (3) and (2) on F_2, the corresponding motion feature is obtained; subsequently, the key vector M is obtained by a 1 × 1 2D convolution and a sigmoid activation.
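- A hedged sketch of an adaptive motion excitation block consistent with the description above: the right branch (ST) applies a short temporal convolution, the left branch (LT) squeezes the clip into a long-term descriptor and generates a K-tap adaptive kernel, and the aggregated motion features excite the input channel-wise. The softmax on the kernel, the sigmoid gating and the residual form are assumptions, not the patent's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AME(nn.Module):
    """Sketch of an adaptive motion excitation block.
    x: (N*T, C, H, W) features inside a 2D ResNet block, frames contiguous per clip."""
    def __init__(self, channels, n_frames, k=3, alpha=8):
        super().__init__()
        self.t, self.k = n_frames, k
        self.squeeze = nn.Conv2d(channels, channels // alpha, 1)        # channel squeeze by alpha
        self.short_conv = nn.Conv1d(channels // alpha, channels // alpha,
                                    kernel_size=3, padding=1)           # ST: short-term temporal conv
        self.kernel_fc = nn.Linear(channels // alpha, k)                 # LT: per-clip adaptive kernel
        self.expand = nn.Conv1d(channels // alpha, channels, 1)

    def forward(self, x):
        nt, c, h, w = x.shape
        n = nt // self.t
        r = self.squeeze(x).mean(dim=(2, 3)).view(n, self.t, -1)         # (N, T, C/a) pooled descriptors
        seq = r.transpose(1, 2)                                          # (N, C/a, T)
        short = self.short_conv(seq)                                     # short-term motion features
        kernel = torch.softmax(self.kernel_fc(r.mean(dim=1)), dim=-1)    # (N, K) long-term adaptive kernel
        agg = torch.cat([
            F.conv1d(short[i:i + 1],
                     kernel[i].view(1, 1, self.k).expand(short.shape[1], 1, self.k).contiguous(),
                     padding=self.k // 2, groups=short.shape[1])
            for i in range(n)], dim=0)                                   # dynamic aggregation over time
        attn = torch.sigmoid(self.expand(agg))                           # (N, C, T) excitation weights
        attn = attn.transpose(1, 2).reshape(nt, c, 1, 1)
        return x * attn + x                                              # residual excitation of the input
```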
- the behavior recognition results output by the compressed domain fast behavior recognition network of the embodiment of the present application are displayed, including real-time display of the action category corresponding to the network output, and the frame rate and number of frames of the behavior recognition, as shown in FIG10 .
- the embodiment of the present application provides a behavior recognition method F2D-SIFPNet based on compressed domain video, and a behavior recognition system based on compressed domain, which can optimize the existing video behavior recognition scheme.
- the embodiment of the present application provides a compressed domain fast behavior recognition network, which includes three parts: frequency domain Slow-I-Fast-P network (Frequency Slow-I-Fast-P, FSIFP), temporal-dimensional attention connections (Temporal-Dimension Attention Connections, TDACs) and adaptive motion excitation module (Adaptive Motion Excitation, AME).
- FSIFP is divided into a slow path SI and a fast path FP, and uses a two-dimensional convolutional neural network to directly model the frequency domain information effectively.
- TDACs fuses the information of the FP path and the SI path, selectively shares the key frame salient information contained in the FP path, and realizes the feature-level fusion in the middle stage of the network, which can more effectively extract feature information and improve the accuracy of behavior recognition.
- the embodiment of the present application designs an AME module, which can generate an adaptive convolution kernel based on global information and extract short-term motion features after channel shifting.
- the module can capture long-term and short-term motion changes at the same time and improve the accuracy of behavior recognition.
- the embodiment of the present application implements a compressed domain real-time behavior recognition system, which includes: obtaining real-time behavior frequency domain data through a camera, inputting it into the proposed compressed domain fast behavior recognition network for behavior recognition, and finally outputting and displaying the behavior category.
- in some situations, the embodiment of the present application can greatly reduce the computational cost and improve the accuracy of behavior recognition to a certain extent.
- the behavior recognition method based on frequency domain video in the embodiment of the present application is faster than the behavior recognition method based on RGB frames, while maintaining a competitive recognition accuracy. Its research results can be widely used in higher-level computer vision tasks such as human posture estimation, pedestrian re-identification, and time-series behavior detection, and have a driving effect on computer intelligent visual processing technology. With the continuous development of computer vision technology, behavior recognition in video plays an important role in tasks such as visual monitoring, video analysis, and video data mining. It will also have broader application prospects in multiple fields such as intelligent monitoring, medical assistance, animation production, smart home environment, and human-computer interaction.
- the step A30 of performing behavior recognition on the compressed video stream according to the key behavior features includes:
- Step B10 querying and obtaining behavior category information of the key behavior feature mapping from a preset behavior feature mapping table
- Step B20 Using the behavior category information as a recognition result of behavior recognition on the compressed video stream.
- the behavior category information may be falling, fighting, and asking for help, etc., which is not specifically limited in this embodiment. It can be understood by those skilled in the art that different behavior type information corresponds to different key behavior features.
- the behavior feature mapping table stores multiple behavior features and behavior category information mapped to each behavior feature.
- the one-to-one mapping relationship between behavior features and behavior category information can be calibrated and pre-stored in the system by a person skilled in the art through pre-learning and training.
- This embodiment obtains the behavior category information of the key behavior feature mapping by querying from a preset behavior feature mapping table, and uses the behavior category information as the recognition result of behavior recognition on the compressed video stream, thereby improving the efficiency and accuracy of behavior recognition on the compressed domain video.
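- In practice the mapping from key behavior features to categories is realized by the trained classification layer; the following sketch only illustrates the final lookup step, with hypothetical class indices and the example labels mentioned above.

```python
import torch

# Illustrative mapping table only: class indices and labels are hypothetical examples
# of the behavior categories mentioned in this document (falling, fighting, asking for help).
BEHAVIOR_MAP = {0: "falling", 1: "fighting", 2: "asking for help"}

def recognize_behavior(class_scores: torch.Tensor) -> str:
    """Look up the behavior category mapped to the highest-scoring class."""
    return BEHAVIOR_MAP[int(class_scores.argmax(dim=-1))]

print(recognize_behavior(torch.tensor([0.1, 0.7, 0.2])))  # -> "fighting"
```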
- the method further includes:
- Step C10 performing downsampling processing of the frequency domain channel on the frequency domain data to obtain frequency domain data of the low frequency channel
- the step of extracting frequency domain I frame data and frequency domain P frame data from the frequency domain data comprises:
- Step C20 extracting frequency domain I frame data and frequency domain P frame data from the frequency domain data of the low frequency channel.
- the step C10 of performing downsampling processing on the frequency domain data in the frequency domain channel to obtain the frequency domain data of the low frequency channel includes:
- Step D10 performing downsampling processing on the inter-frame motion vector in the frequency domain data in the frequency domain channel to obtain the inter-frame motion vector of the low-frequency channel;
- Step D20 performing frequency domain channel downsampling processing on the DCT residual coefficients of the P frame image in the frequency domain data to obtain DCT residual coefficients of the low frequency channel;
- Step D30 connecting the inter-frame motion vector of the low-frequency channel and the DCT residual coefficient of the low-frequency channel to obtain frequency domain P frame data of the low-frequency channel;
- Step D40 down-sampling the DCT coefficients of the I frame image in the frequency domain data in the frequency domain channel to obtain the frequency domain I frame data of the low frequency channel, and use the frequency domain I frame data of the low frequency channel and the frequency domain P frame data of the low frequency channel as the frequency domain data of the low frequency channel.
- the frequency domain data is downsampled by frequency domain channels to obtain frequency domain data of low-frequency channels, so as to select significant channels in the frequency domain data as the input of the two-dimensional convolutional neural network, or as the input of the FSIFP network, so as to reduce data redundancy.
- this embodiment performs downsampling on the frequency domain data before inputting it into the FSIFP network or the two-dimensional convolutional neural network, that is, by providing a frequency domain channel selection (FCS) module to remove low-discrimination channels, so as to achieve low-frequency feature screening preprocessing of the frequency domain data before the frequency domain data is sent to the FSIFP network or the two-dimensional convolutional neural network to enhance the significance of the input, so that the two-dimensional convolutional neural network or the FSIFP network has a higher recognition of the feature.
- the compressed video stream is first segmented, and the video is divided into T_SI parts of equal length.
- the DCT coefficients of the I frame of each video clip are reshaped and the channels are selected.
- the number of channels of the selected I frame is 24, the height of the I frame is H/8, and the width is W/8.
- T_SI × 24 × (H/8) × (W/8) is used as the network model input of the slow path.
- the number of channels of the motion vector of the P frame of the video segment is 2, the height H and width W of the motion vector of the P frame are downsampled to H/8 and W/8, and the input of the motion vector of the P frame of the video segment is T_FP × 2 × (H/8) × (W/8).
- the DCT coefficients of the residual image of the P frame in the video clip are reshaped and channel selected; the number of channels after selection is 24, the height of the residual is H/8, the width is W/8, and the video clip input is T_FP × 24 × (H/8) × (W/8).
- the FSIFP network takes the frequency domain I frame, the frequency domain residual R and the motion vector MV as input.
- their frequency domain representation (i.e., the DCT coefficients) is obtained through entropy decoding, zigzag reordering and inverse quantization.
- 24 low-frequency channels are selected by frequency domain channel selection, and these low-frequency channels contain almost all the appearance information in the spatial image.
- the MV is obtained directly by entropy decoding.
- the MV represents the displacement of the macroblock in the P frame.
- the MV is downsampled from H × W × 2 to H/8 × W/8 × 2 without losing the motion clues contained in the high-resolution video, so that the frequency domain data has a smaller size while preserving the motion information as much as possible, and the downsampled frequency domain data still carries the most discriminative information. Therefore, FCS reduces the computational complexity of the network by downsampling the frequency domain data and enhances the saliency of the input, making the two-dimensional convolutional neural network or FSIFP network more discriminative for the features; it also suppresses the background motion channels, thereby reducing background interference, making the discriminative spatiotemporal features more prominent, ensuring that the network learns useful low-frequency information, and further improving the accuracy of behavior recognition.
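- The following sketch illustrates one plausible realization of the frequency domain channel selection and MV downsampling described above: each 8×8 DCT block is reshaped into 64 frequency channels in zigzag order and the 24 lowest-frequency channels are kept, while the motion-vector field is reduced by block averaging. The zigzag-based selection and the averaging are assumptions rather than the patent's exact procedure.

```python
import numpy as np

# Zigzag visit order of the 64 coefficients in an 8x8 DCT block (DC first, then rising frequency).
ZIGZAG = sorted(range(64),
                key=lambda i: (i // 8 + i % 8,
                               (i // 8) if (i // 8 + i % 8) % 2 else -(i // 8)))

def select_low_freq_channels(dct_plane: np.ndarray, keep: int = 24) -> np.ndarray:
    """Reshape an (H, W) plane of 8x8 DCT blocks into frequency channels and keep
    the `keep` lowest-frequency channels, giving a (keep, H/8, W/8) array."""
    h, w = dct_plane.shape
    blocks = dct_plane.reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3)  # (H/8, W/8, 8, 8)
    channels = blocks.reshape(h // 8, w // 8, 64)[..., ZIGZAG]              # zigzag (frequency) order
    return channels[..., :keep].transpose(2, 0, 1)

def downsample_mv(mv: np.ndarray) -> np.ndarray:
    """Reduce an (H, W, 2) motion-vector field to (H/8, W/8, 2) by averaging each 8x8
    block; MVs vary little inside a macroblock, so most motion detail is preserved."""
    h, w, _ = mv.shape
    return mv.reshape(h // 8, 8, w // 8, 8, 2).mean(axis=(1, 3))
```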
- this embodiment provides a new frequency domain data enhancement technology: horizontal block flipping (HFB), which can generate different training samples to prevent severe overfitting under zero calculation and zero parameters.
- the frequency domain video frame contains multiple DCT blocks, and horizontal block flipping (HFB) flips the video frame in units of blocks and exchanges the positions of the blocks according to the horizontal symmetry axis.
- HFB preserves the frequency band distribution within the block, thereby avoiding destroying the spatial semantics of the video frame in the frequency domain.
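- A minimal sketch of horizontal block flipping on a frequency domain frame: 8×8 blocks swap positions left-to-right as whole units while the coefficients inside each block are left untouched, so each block's frequency band distribution is preserved; treating the flip as a left-right mirror of block positions is an assumption.

```python
import numpy as np

def horizontal_block_flip(freq_frame: np.ndarray, block: int = 8) -> np.ndarray:
    """Mirror a frequency domain frame in units of 8x8 blocks: block positions are
    exchanged left<->right, but the coefficients inside each block stay in place."""
    h, w = freq_frame.shape[:2]
    blocks = freq_frame.reshape(h // block, block, w // block, block, -1)
    return blocks[:, :, ::-1].reshape(freq_frame.shape)  # reverse only the block-column axis
```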
- the F2D-SIFPNet of this embodiment is a lightweight network that uses a two-dimensional convolutional neural network as the backbone network, and the computational cost is relatively small compared to a three-dimensional convolutional neural network.
- Directly using the DCT coefficients of frequency domain data can avoid the process of decoding data from the DCT coefficients in the frequency domain to the RGB information in the spatial domain, simplify the data preprocessing process, reduce the amount of network calculations, and increase the speed of behavior recognition.
- F2D-SIFPNet is better than all other compressed domain methods, such as CoViAR, DMC-Net, and MV2Flow.
- F2D-SIFPNet is also better than methods whose input data is frequency domain data: on UCF101, the accuracy of the F2D-SIFPNet method is 10.3% higher than Fast-CoViAR and 4.6% higher than Faster-FCoViAR. This shows that directly inputting frequency domain I frame data, MV, and R frequency domain data into the network cannot fully utilize the spatiotemporal correlation between I frames and P frames, while the F2D-SIFPNet method effectively integrates the static spatial features and dynamic motion information of I frames and P frames.
- F2D-SIFPNet has a higher accuracy than SIFP, and SIFP's GFLOPs are 2.73 times those of F2D-SIFPNet.
- F2D-SIFPNet increases the number of input frames and further improves the accuracy.
- F2D-SIFPNet outperforms many methods based on RGB frame data that require full decoding of compressed video, such as I3D, ECO, and TSM. It can be seen that only two methods (TEA and TDN) have slightly higher accuracy than F2D-SIFPNet on UCF101, but their GFLOPs are larger.
- This embodiment combines the temporal attention mechanism through the new TDAC to effectively and dynamically adjust the connection between the slow SI path and the fast path FP.
- the AME module can adaptively capture motion changes and can be easily embedded in the standard ResNet bottleneck block for effective temporal modeling. These two modules can improve the accuracy of behavior recognition.
- the effectiveness of the TDAC fusion method is shown in Figure 12. The table shows the results of the following different fusion methods: (1) SI-only (unfused), (2) FP-only (unfused), (3) late fusion (weighted average score of SI-only and FP-only), (4) T-Conv, and (5) TDAC.
- the effectiveness of the AME module is shown in Figure 13.
- the AME module is compared with a baseline that has no temporal modeling module and with a variant that uses the ME temporal modeling module.
- the results show that the AME module is 2.6% better than the baseline and 1.1% higher than ME, indicating that the adaptive motion excitation module AME improves behavior recognition more effectively than the static motion excitation module ME.
- the AME module enables the method of this embodiment to exploit long-term motion cues and short-term temporal information between frames at the same time, thereby improving recognition accuracy.
- the visualization in Figure 14 shows this improvement more intuitively: owing to the adaptive motion excitation of the AME module, F2D-SIFPNet can locate not only short-term motion regions but also long-term motion features.
- for example, in the "PlayingPiano" visualization, the F2D-SIFPNet of this embodiment attends to both the short-term motion of the fingers and the long-term swing of the arms, whereas the ME module only attends to the short-term motion of the fingers and the baseline focuses more on the background.
- FIG. 14 uses GradCAM to visualize the dynamic regions (top: baseline; middle: ME module; bottom: AME module).
- for this comparison, this embodiment trains a (4+32)-frame network with AME (FSIFP+TSACs+AME) or with ME (FSIFP+TSACs+ME).
- this embodiment displays the GradCAM maps only on the frequency-domain I frames. Because more MVs and video frames can be fed in at the same computational cost, more motion information is captured and behaviors in long videos are recognized more effectively.
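- heat maps like those in FIG. 14 can be produced with a standard Grad-CAM computation over the chosen convolutional feature maps; the sketch below is a generic PyTorch version and is not the exact visualisation code used for the figure (the model and target layer are placeholders).

```python
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, feat_layer: torch.nn.Module,
             x: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Minimal Grad-CAM: weight the chosen layer's feature maps by the spatially
    pooled gradients of the target class score, then ReLU and upsample."""
    feats, grads = [], []
    h1 = feat_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feat_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        score = model(x)[0, class_idx]        # assumes model(x) -> (1, num_classes)
        model.zero_grad()
        score.backward()
    finally:
        h1.remove(); h2.remove()
    fmap, grad = feats[0], grads[0]                    # both (1, C, h, w)
    weights = grad.mean(dim=(2, 3), keepdim=True)      # GAP of the gradients
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)                    # normalised heat map, (1, 1, H, W)
```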
- the accuracy of F2D-SIFPNet is also higher than that of most RGB-input methods, and it achieves better recognition accuracy at a similar computational complexity, which further verifies the effectiveness of F2D-SIFPNet in capturing motion information in long videos.
- the embodiment of the present application also provides an electronic device, which may be, for example, an edge router, a broadband remote access server (Broadband Remote Access Server, BRAS), a broadband network gateway (Broadband Network Gateway), a serving GPRS support node (Serving GPRS Support Node, SGSN), a gateway GPRS support node (Gateway GPRS Support Node, GGSN), a mobility management entity (Mobility Management Entity, MME), a serving gateway (Serving GateWay, S-GW), or the like.
- FIG. 16 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
- the electronic device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
- the communication bus 1002 is used to realize the connection and communication between these components.
- the user interface 1003 may include a display screen (Display), an input unit such as a keyboard (Keyboard), and the user interface 1003 may also include a standard wired interface and a wireless interface.
- the network interface 1004 may include a standard wired interface and a wireless interface (such as a wireless fidelity (Wireless-Fidelity, WI-FI) interface).
- the memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable non-volatile memory (Non-Volatile Memory, NVM), such as disk storage.
- the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
- the structure shown in FIG16 does not limit the electronic device and may include more or fewer components than shown, or combine certain components, or arrange components differently.
- the memory 1005 as a storage medium may include an operating system, a data storage module, a network communication module, a user interface module, and a behavior recognition program.
- the network interface 1004 is mainly used for data communication with other devices; the user interface 1003 is mainly used for data interaction with the user; the processor 1001 and the memory 1005 in this embodiment can be set in the communication device, and the communication device calls the behavior recognition program stored in the memory 1005 through the processor 1001, and executes the behavior recognition method provided in any of the above embodiments.
- the terminal proposed in this embodiment and the method for behavior recognition proposed in the above embodiment belong to the same inventive concept.
- the technical details not described in detail in this embodiment can be referred to any of the above embodiments, and this embodiment has the same beneficial effects as executing the behavior recognition method.
- an embodiment of the present application also proposes a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, on which a behavior recognition program is stored, and when the behavior recognition program is executed by a processor, the behavior recognition method of the present application as described above is implemented.
- the various embodiments of the electronic device and computer-readable storage medium of the present application can all refer to the various embodiments of the behavior recognition method of the present application, which will not be repeated here.
- the technical solution of the present application, in essence or in the part that contributes over some situations, can be embodied in the form of a software product. The software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes a number of instructions for causing an electronic device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
Abstract
本申请公开了一种行为识别方法、电子设备及计算机可读存储介质,属于计算机视觉技术领域。本申请通过提取频域数据中的频域I帧数据和频域P帧数据;将频域I帧数据输入至二维卷积神经网络的慢速路径SI得到稀疏空间特征,并将频域P帧数据输入至二维卷积神经网络的快速路径FP得到快速运动特征;提取快速运动特征中的显著运动特征,将显著运动特征与稀疏空间特征进行横向特征融合得到第一时空聚合特征;将第一时空聚合特征与快速运动特征进行纵向特征融合得到第二时空聚合特征以进行行为识别。
Description
相关申请
本申请要求于2022年12月8号申请的、申请号为202211580062.5的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及计算机视觉技术领域,尤其涉及行为识别方法、电子设备及计算机可读存储介质。
基于视频的行为识别方法主要探究如何感知在一段连续的视频流中某一目标乃至多个目标的动作变化,进而由对连续动作的总结上升为对行为的判断。伴随着人们对人工智能需求的不断增加,基于视频的行为识别方法已成为计算机视觉的重要技术,其广泛地推动了人工智能的发展。
目前,行为识别方法采用较多的一种方案是基于双流的神经网络,其将视频分成空间和时间两个部分,分别将RGB图像和光流图像送入两支神经网络,并融合得到最终分类结果。另外一种方案是基于三维卷积的神经网络,其针对视频将神经网络优化为三维卷积神经网络,以此来捕捉更多的时间和空间信息。
然而,上述行为识别方法均存在一些不可忽略的技术问题,第一种方案在提取的RGB图像中加入额外的光流,虽然能辅助提高视频行为识别的性能,但光流的计算成本较高,会大幅度降低行为识别的效率。而第二种方案的三维卷积神经网络虽然能取得比较好的识别效果,但其数据计算量远远大于二维卷积神经网络的数据计算量,因而极大地占用了计算资源,同时降低了行为识别的效率。因此,如何在保证行为识别精度的同时,提高行为识别的效率,降低数据计算量,仍是计算机视觉领域的一大挑战。
发明内容
本申请提供一种行为识别方法,包括:
获取压缩视频流,确定所述压缩视频流对应的频域数据,并提取所述频域数据中的频域I帧数据和频域P帧数据;
将所述频域I帧数据输入至二维卷积神经网络的慢速路径SI进行静态语义处理,得到稀疏空间特征,并将所述频域P帧数据输入至所述二维卷积神经网络的快速路径FP进行运动信息处理,得到快速运动特征;
在所述快速路径FP提取所述快速运动特征中的显著运动特征,基于时间注意力维度将所述显著运动特征整合至所述慢速路径SI,并与所述稀疏空间特征进行横向特征融合,得到第一时空聚合特征;
将所述第一时空聚合特征与所述快速运动特征进行纵向特征融合,得到第二时空聚合特征,并依据所述第二时空聚合特征进行压缩视频流的行为识别。
此外,本申请还提供一种电子设备,所述电子设备包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的行为识别程序,所述行为识别程序被所述处理器执行时实现如上述的行为识别方法。
此外,本申请还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有行为识别程序,所述行为识别程序被处理器执行时实现如上述的行为识别方法。
为了更清楚地说明本申请实施例或一些情形中的技术方案,下面将对实施例或一些情形描述中所需
要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图示出的结构获得其他的附图。
图1为本申请行为识别方法第一实施例的流程示意图;
图2为本申请行为识别方法第二实施例的流程示意图;
图3为本申请行为识别方法第三实施例的流程示意图;
图4为本申请行为识别方法第四实施例的流程示意图;
图5为本申请实施例中获取频域数据的流程示意图;
图6为本申请实施例的TDACs模块进行特征融合的流程示意图;
图7为本申请一具体实施例的行为识别方法的流程框图;
图8为本申请实施例的F2D-SIFP网络的结构图;
图9为本申请实施例的AME模块对长短期特征进行捕捉的流程示意图;
图10为本申请实施例的行为识别结果的界面示意图;
图11为本申请实施例的F2D-SIFPNet与其他压缩域行为识别网络的实验数据对照图;
图12为本申请实施例验证TDACs融合方式的有效性的实验数据图;
图13为本申请实施例AME模块与其他时间建模模块的实验数据对照图;
图14为本申请实施例的行为识别方法的GradCam图;
图15为本申请实施例验证F2D-SIFPNet捕捉长时视频中运动信息的有效性的实验数据图;
图16为本申请实施例方案涉及的电子设备的硬件结构示意图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请的一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
需要说明,本申请实施例中所有方向性指示(诸如上、下、左、右、前、后……)仅用于解释在某一特定姿态(如附图所示)下各部件之间的相对位置关系、运动情况等,如果该特定姿态发生改变时,则该方向性指示也相应地随之改变。
在本申请中,除非另有明确的规定和限定,术语“连接”、“固定”等应做广义理解,例如,“固定”可以是固定连接,也可以是可拆卸连接,或成一体;可以是机械连接,也可以是电连接;可以是直接相连,也可以通过中间媒介间接相连,可以是两个元件内部的连通或两个元件的相互作用关系,除非另有明确的限定。对于本领域的普通技术人员而言,可以根据具体情况理解上述术语在本申请中的具体含义。
另外,在本申请中如涉及“第一”、“第二”等的描述仅用于描述目的,而不能理解为指示或暗示其相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括至少一个该特征。另外,各个实施例之间的技术方案可以相互结合,但是必须是以本领域普通技术人员能够实现为基础,当技术方案的结合出现相互矛盾或无法实现时应当认为这种技术方案的结合不存在,也不在本申请要求的保护范围之内。
目前,行为识别方法采用较多的一种方案是基于双流的神经网络,其将视频分成空间和时间两个部分,分别将RGB图像和光流图像送入两支神经网络,并融合得到最终分类结果。另外一种方案是基于三维卷积的神经网络,其针对视频将神经网络优化为三维卷积神经网络,以此来捕捉更多的时间和空间信息。
然而,上述行为识别方法均存在一些不可忽略的技术问题,第一种方案在提取的RGB图像中加
入额外的光流,虽然能辅助提高视频行为识别的性能,但光流的计算成本较高,会大幅度降低行为识别的效率。而第二种方案的三维卷积神经网络虽然能取得比较好的识别效果,但其数据计算量远远大于二维卷积神经网络的数据计算量,因而极大地占用了计算资源,同时降低了行为识别的效率。因此,如何在保证行为识别精度的同时,提高行为识别的效率,降低数据计算量,仍是计算机视觉领域的一大挑战。
基于此,本申请实施例提供了一种行为识别方法,参照图1,图1为本申请一种行为识别方法一实施例的流程示意图。本实施例中,所述行为识别方法包括:
步骤S10,获取压缩视频流,确定所述压缩视频流对应的频域数据,并提取所述频域数据中的频域I帧数据和频域P帧数据;
在本实施例中,压缩视频流对应的频域数据包括频域I帧数据和频域P帧数据。其中,频域I帧数据包括I帧图像的DCT系数。频域P帧数据包括帧间运动矢量(MV,Motion Vectors),以及P帧图像的DCT残差系数(R,Residuals)。本领域技术人员可知的是,该帧间运动矢量表征像素的逐块移动,该DCT残差系数表征相对于参考I帧图像的像素偏移。
示例性地,请参照图2,在上述实施例步骤S10中,确定所述压缩视频流对应的频域数据的步骤包括:
步骤S11,将所述压缩视频流进行熵解码,得到帧间运动矢量和中间压缩码流数据;
本实施例在动作识别过程中,帧间动作矢量的提取是非常重要的,用于建立动作之间的联系。
步骤S12,将所述中间压缩码流数据依次进行重排序操作和逆量化处理,得到I帧图像的DCT系数,以及P帧图像的DCT残差系数;
步骤S13,将所述帧间运动矢量和所述DCT残差系数进行连接,得到所述频域数据对应的频域P帧数据,并根据所述DCT系数确定所述频域数据对应的频域I帧数据;
步骤S14,将所述频域I帧数据和所述频域P帧数据作为所述压缩视频流对应的频域数据。
参照图5，图5为本申请实施例中获取频域数据的流程示意图。视频压缩标准（如H263、H264、MPEG4）具有相同的编码过程，包括DCT（Discrete Cosine Transform，离散余弦变换）、量化、Z字形排序和熵编码。因此，反过来的半解码过程FPDec由熵解码、Z字形重排序和逆量化组成，将压缩后的视频解码转换为频域DCT系数（频域DCT系数包括I帧图像的DCT系数，以及P帧图像的DCT残差系数）和运动矢量（即帧间运动矢量），作为网络的输入。其中，可采用FFmpeg实现频域I帧数据、频域残差R数据、运动矢量MV数据的获取。通过修改FFmpeg中的库函数文件，以获得逆量化之后的频域DCT系数，即频域数据。在一实施例中，对于I帧和残差（即P帧图像的DCT残差系数），首先，熵解码将比特流转换成视频序列的一系列频域数据。其次，Z字形重排序通过从上到下和从左到右排列每个块的DC、低频和高频分量来恢复频域DCT系数的顺序。第三，逆量化通过计算量化器尺度参数得到最终的频域DCT系数。
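The reordering and inverse-quantization steps can be illustrated for a single 8×8 block with the short Python sketch below; it uses the standard JPEG/MPEG zig-zag scan, and a single quantizer scale stands in for the codec's per-coefficient quantization matrix (the real pipeline here is obtained by modifying FFmpeg, which is not shown).

```python
import numpy as np

def zigzag_indices(n: int = 8):
    """(row, col) visiting order of the standard JPEG/MPEG zig-zag scan."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def reorder_and_dequantize(scanned: np.ndarray, qscale: float) -> np.ndarray:
    """Put a zig-zag scanned run of 64 quantized coefficients back into an 8x8
    block (DC first, then low to high frequencies) and undo a uniform quantizer."""
    block = np.zeros((8, 8), dtype=np.float32)
    for value, (r, c) in zip(scanned, zigzag_indices()):
        block[r, c] = value * qscale          # inverse quantization
    return block

coeffs = reorder_and_dequantize(np.arange(64, dtype=np.float32), qscale=2.0)
print(coeffs[0, 0], coeffs[7, 7])   # DC term, highest-frequency term
```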
所述步骤S10之后,执行步骤S20,将所述频域I帧数据输入至二维卷积神经网络的慢速路径SI进行静态语义处理,得到稀疏空间特征,并将所述频域P帧数据输入至所述二维卷积神经网络的快速路径FP进行运动信息处理,得到快速运动特征;
在本实施例中,该二维卷积神经网络为通过大量的频域数据进行训练并完成收敛的二维卷积神经网络。为了实现空间维度匹配,慢速路径SI和快速路径FP都对传统的二维卷积神经网络进行了调整,即移除了第一个节点的卷积层和池化层,并添加了一个1×1的2D卷积使频域输入通道的维数适应于路径中res2层通道的维度。
该二维卷积神经网络包括慢速路径SI和快速路径FP,慢速路径SI以低帧速率接收I帧数据,用于静态语义处理,得到稀疏空间特征。快速路径FP以高帧速率寻址包含在频域P帧数据中的运动信息,以得到快速运动特征。
频域I帧数据中具有用于稀疏空间表示的稀疏空间特征,频域I帧数据可用于表征视频慢速运动信息。
频域P帧数据中具有用于快速运动表示的快速运动特征,频域P帧数据可用于表征视频慢速运动信息。因此可通过预先设置的收敛好的二维卷积神经网络,对输入的频域I帧数据进行静态语义处理,得到频域I帧数据所表示的稀疏空间特征,并对输入的频域P帧数据进行运动信息处理,得到频域P帧数据所表示的快速运动特征。
步骤S30,在所述快速路径FP提取所述快速运动特征中的显著运动特征,基于时间注意力维度将所述显著运动特征整合至所述慢速路径SI,并与所述稀疏空间特征进行横向特征融合,得到第一时空聚合特征;
在本实施例中,由于快速运动特征和稀疏空间特征在人类行为的表示中相辅相成。为了更好地同时学习频域I帧数据和频域P帧数据之间的时空互补特征,在二维卷积神经网络中提出了一条慢速路径SI和快速路径FP,在此基础上,由于慢速路径SI和快速路径FP之间的信息共享和协作训练至关重要。基于此,本实施例该还提出了一种慢速路径SI和快速路径FP之间的横向连接,以整合空间和时间频域特征,通过添加一个连接慢速路径SI和快速路径FP这两条路径的时间注意力连接模块(即TDAC,Temporal-Dimension Attention Connections),从而实现基于时间注意力维度,在快速路径FP提取快速运动特征中的显著运动特征,将显著运动特征整合至慢速路径SI与稀疏空间特征进行横向特征融合,得到第一时空聚合特征。也即,时间注意力连接模块TDAC可以在快速路径FP中集成显著运动信息引入到慢速路径SI路径中,使得慢速路径SI可以在连接P帧图像的运动特征的同时,还能保留I帧图像的空间语义特征,实现时空语义信息的聚合。
也即,本实施例提出了一种频域Slow-I-Fast-P网络(Frequency Slow-I-Fast-P,FSIFP)。FSIFP由慢速路径SI和快速路径FP,以及一种新的时间注意力连接(TDAC)组成。FSIFP中的慢速路径SI和快速路径FP都可以分为4层:res2到res5。在res2、res3和res4之后,TDAC将特征从快速路径FP融合到SI路径,融合模式为级联(如图8所示)。
步骤S40,将所述第一时空聚合特征与所述快速运动特征进行纵向特征融合,得到第二时空聚合特征,并依据所述第二时空聚合特征进行压缩视频流的行为识别。
在本实施例,可通过将二维卷积神经网络中的慢速路径SI输出的第一时空聚合特征,与快速路径FP输出的快速运动特征进行纵向特征融合,得到第二时空聚合特征。也即,为了进一步加强慢速路径SI和快速路径FP之间的信息共享和协作训练,提出了一种慢速路径SI和快速路径FP之间的纵向连接,以进一步整合空间和时间频域特征。然后依据该第二时空聚合特征进行压缩视频流的行为识别,实现在保证行为识别精度的同时,提高行为识别的效率,降低数据计算量。
本申请提出一种行为识别方法、电子设备及计算机可读存储介质,在行为识别方法中,本申请实施例的技术方案是通过获取压缩视频流,确定压缩视频流对应的频域数据,并提取频域数据中的频域I帧数据和频域P帧数据;将频域I帧数据输入至二维卷积神经网络的慢速路径SI进行静态语义处理,得到稀疏空间特征,并将频域P帧数据输入至二维卷积神经网络的快速路径FP进行运动信息处理,得到快速运动特征;在快速路径FP提取快速运动特征中的显著运动特征,基于时间注意力维度将显著运动特征整合至慢速路径SI与稀疏空间特征进行横向特征融合,得到第一时空聚合特征;将第一时空聚合特征与快速运动特征进行纵向特征融合,得到第二时空聚合特征,并依据第二时空聚合特征进行压缩视频流的行为识别,从而实现在保证行为识别精度的同时,提高行为识别的效率,降低数据计算量。
由于在提取的RGB图像中加入额外的光流,虽然能辅助提高视频行为识别的性能,但光流的计算成本较高,会大幅度降低行为识别的效率,并且RGB图像需要将数据从频域(DCT系数)解码到空域(RGB像素),这增加了数据预处理的时间。而第二种方案的三维卷积神经网络虽然能取得比较好的识别效果,但其数据计算量远远大于二维卷积神经网络的数据计算量,因而极大地占用了计算资源,同时降低了行为识别的效率。
相比于目前的该行为识别方法,本申请实施例通过将二维卷积神经网络分为慢速路径SI和快速路径FP,并利用二维卷积神经网络直接对频域信息进行有效建模,相较于三维卷积神经网络,可以大幅
度降低计算成本,降低数据计算量,提高行为识别的效率,同时通过提取频域数据中的频域I帧数据和频域P帧数据的步骤,直接对频域信息进行特征提取,因此可以避免将数据从频域的DCT(Discrete Cosine Transform,离散余弦变换)系数解码为空间域的RGB(red blue green,三原色)数据的过程,由于在提取的RGB图像中加入额外的光流需要将I帧和P帧解码到RGB域,且光流的计算成本和计算负载较高,而申请实施例可以直接利用视频部分解码的频域数据进行视频行为识别,减少数据预处理的时间,同时简化数据预处理流程,从而提高了行为识别的速度。并通过时间注意力连接模块TDAC可以在快速路径FP中集成显著运动信息引入到慢速路径SI路径中,使得慢速路径SI可以在连接P帧图像的运动特征的同时,还能保留I帧图像的空间语义特征,实现时空语义信息的聚合,通过网络中间阶段的特征级融合,可以更加有效提取特征信息,选择性地共享快速路径FP中含有的关键帧显著信息,提高行为识别的准确率和效率。然后,通过将通过将二维卷积神经网络中的慢速路径SI输出的第一时空聚合特征,与快速路径FP输出的快速运动特征进行纵向特征融合,得到第二时空聚合特征,加强慢速路径SI和快速路径FP之间的信息共享和协作训练,进一步整合空间和时间频域特征,并且通过从压缩视频中获取频率数据来显著减少数据冗余并充分描述人类行为的时空辨别信息,然后基于新的FSIFP提取频域判别信息和运动线索,同时充分学习时空互补特征,进而实现在保证行为识别精度的同时,提高行为识别的效率,降低数据计算量。
另外,本实施例的行为识别方法提供了一种频域Slow-I-Fast-P网络(Frequency Slow-I-Fast-P,FSIFP),分为慢速路径SI和快速路径FP,利用二维卷积神经网络直接对频域信息进行有效建模,相较于SIFP的三维卷积神经网络,可以大幅度降低计算成本,同时因为是直接对频域信息进行特征提取,可以避免将数据从频域的DCT系数解码为空间域的RGB数据的过程,简化数据预处理流程,提高行为识别的速度。然后,通过提供一种时间注意力连接模块(即TDAC,Temporal-Dimension Attention Connections),将快速路径FP和慢速路径SI的信息进行融合,选择性地共享快速路径FP中含有的关键帧显著信息,实现网络中间阶段的特征级融合,可以更加有效提取特征信息,提高行为识别的准确率。
本申请实施例的行为识别方法可以实施在多种需要预警的监控环境中,对高清监控摄像机采集到的压缩视频帧快速识别出人的动作行为,达到实时监控预警的效果。下面就几种常见的压缩域行为识别场景进行简单的介绍:
摔倒事件识别监测:
在老年公寓及医院等场所中,当有人发生晕倒或者不小心摔倒时可以实时预警,从而更快的提供治疗服务。
打架事件识别监测:
在监狱或者学校等场所中,当有人群众斗殴发生打架行为时可以实时预警,更快地制止闹事行为。
求救事件识别监测:
在游泳馆等场所中,当有人向监控摄像头摆手发出求救动作时可以实时预警,更高效的进行救助。
另外,本申请实施例还可以实施在网络监管及网络短视频分类中,对网络中用户上传的MPEG4编码离线小视频进行识别。
网络监管:当识别到小视频中有非法行为动作时可以删除视频或禁止用户上传来达到网络监管的目标。
网络短视频分类:对用户上传的小视频进行行为识别并分类,方便其他用户快速找到想要查询的相应动作类别视频。
其中,本申请实施例的行为识别方法的硬件环境可为表(一)所示的配置信息。
(一)
本申请实施例可通过摄像头获得实时行为频域数据,输入到本申请实施例所提出的压缩域快速行为识别网络中进行行为识别,最终输出并显示行为类别。本实施例可以在一些情形的基础上,既大幅度降低了计算成本,又在一定程度上提升了行为识别的准确率。由于本申请实施例提取的运动矢量和残差以及解码I帧都是在压缩域处理,采用的是部分解码操作,避免了视频完全解码和重构,这样可以提高系统的处理效率,便于实时应用。本申请实施例利用压缩视频中图像帧的运动矢量时间相关性和空间相关性,从而提升本申请实施例利用若干图像帧完成动作识别的精度
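A skeleton of the real-time recognition loop described above (capture a compressed clip from the camera, half-decode it into frequency-domain inputs, classify, and display the action class together with the frame rate). `capture_clip`, the model's call signature, and the class list are placeholders for this sketch, not interfaces defined by the application.

```python
import time
import torch

def run_realtime(model, capture_clip, class_names, num_clips=100, device="cuda"):
    """Real-time loop sketch: `capture_clip` is a placeholder callable that grabs
    one compressed clip from the camera and half-decodes it into frequency-domain
    I-frame and P-frame tensors (batch of one clip)."""
    model.eval().to(device)
    for _ in range(num_clips):
        start = time.time()
        i_frames, p_frames = capture_clip()                    # frequency-domain inputs
        with torch.no_grad():
            scores = model(i_frames.to(device), p_frames.to(device))
        label = class_names[int(scores.argmax(dim=1))]         # assumes batch size 1
        fps = 1.0 / max(time.time() - start, 1e-6)
        print(f"action: {label}  ({fps:.1f} clips/s)")
```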
基于上述本申请的第一实施例，提出本申请行为识别方法的第三实施例，请参照图3，在上述实施例步骤S30中，所述时间注意力维度包括静态特征分支和时间注意力权重分支，所述基于时间注意力维度将所述显著运动特征整合至所述慢速路径SI，并与所述稀疏空间特征进行横向特征融合，得到第一时空聚合特征的步骤包括：
步骤S31,将所述显著运动特征整合至所述慢速路径SI;
步骤S32,通过所述时间注意力权重分支激发所述显著运动特征对应的第一时间维度信息,并通过所述静态特征分支在所述慢速路径SI激发所述稀疏空间特征对应的第二时间维度信息;
步骤S33,将所述第一时间维度信息、所述第二时间维度信息和所述稀疏空间特征进行横向特征融合,得到第一时空聚合特征。
为了助于理解本申请,列举一具体实施例进行说明,请参照图6:
如图6所示,图6为本申请实施例的TDACs模块进行特征融合的流程示意图。TDACs模块为一种时间注意力连接(Temporal-Dimension Attention Connections,TDAC或称为TDACs)模块,TDAC模块是一个基于时间维度的注意模块,由静态特征分支S和时间注意力权重分支A组成。通过从快速路径FP中激发时间敏感特征来提取关键帧的运动信息。静态特征分支S通过3D卷积有效激发时间信息;时间注意力权重分支A采用两个2D卷积来压缩和激发时间维度的特征。
TDAC模块的输入为X∈R^{C×TFP×H×W}，其中C表示输入通道数，TFP为快速路径FP的输入帧数，H为输入特征图高度，W为输入特征图宽度。通过分支S得到向量F∈R^{C×TSI×H×W}，其中TSI为SI路径输入的帧数，该过程可以表示为：

F=S(X)=K4∗X,　(1)

其中K4为卷积核大小为7×1×1的3D卷积层，∗为卷积操作。分支A根据运动显著性为快速路径FP中的每一帧生成一个权值，它更侧重于时间建模而不是空间特征。因此，首先采用全局平均池化来压缩空间信息：

Xp=GAP(X),　(2)

其中Xp∈R^{C×TFP×1×1}。然后，使用1×1的2D卷积将帧数从TFP减少到TFP/η，其中η是代表时间维度挤压的超参数，在一实施例中，η设为2。

经过第二个2D卷积后，可以得到权重W∈R^{C×TSI×1×1}：

W=A(X)=K2∗(K1∗Xp),　(3)

其中K1和K2是卷积核大小为3×3的2D卷积层。最后，TDAC的公式为：

Y=S(X)⊙A(X)=F⊙W,　(4)
其中⊙代表元素乘法。TDAC模块可以将快速路径FP中的显著运动信息整合到SI路径中,使SI路径在融合P帧的运动特征的同时保留了I帧的空间语义特征,实现了时空语义信息的聚合。
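A PyTorch sketch of a TDAC-style connection following the description above: branch S is a 7×1×1 3D convolution mapping T_FP frames to T_SI, branch A is spatial global average pooling followed by two convolutions that squeeze the time axis by η and then excite it to T_SI, and the two branches are multiplied elementwise. The temporal stride of the S branch, the 1×1 kernels in branch A, and the sigmoid gate are assumptions where the text is not explicit.

```python
import torch
import torch.nn as nn

class TDAC(nn.Module):
    """Temporal-Dimension Attention Connection sketch: branch S reduces the fast
    path's frames with a 7x1x1 3D conv, branch A turns spatially pooled features
    into per-frame attention weights, and the two are multiplied elementwise
    (Y = S(X) * A(X), as in equation (4))."""

    def __init__(self, channels: int, t_fp: int, t_si: int, eta: int = 2):
        super().__init__()
        alpha = t_fp // t_si                 # temporal reduction factor (assumed stride)
        self.s_branch = nn.Conv3d(channels, channels, kernel_size=(7, 1, 1),
                                  stride=(alpha, 1, 1), padding=(3, 0, 0))
        # Treat the time axis as channels so plain 2D convs can squeeze/excite it.
        self.a_squeeze = nn.Conv2d(t_fp, t_fp // eta, kernel_size=1)
        self.a_excite = nn.Conv2d(t_fp // eta, t_si, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: (N, C, T_fp, H, W)
        f = self.s_branch(x)                                       # (N, C, T_si, H, W)
        a = x.mean(dim=(3, 4)).permute(0, 2, 1).unsqueeze(-1)      # (N, T_fp, C, 1)
        a = torch.sigmoid(self.a_excite(self.relu(self.a_squeeze(a))))   # (N, T_si, C, 1)
        w = a.permute(0, 2, 1, 3).unsqueeze(-1)                    # (N, C, T_si, 1, 1)
        return f * w                                               # Y = F * W

y = TDAC(channels=64, t_fp=16, t_si=4)(torch.randn(2, 64, 16, 8, 8))
print(y.shape)   # torch.Size([2, 64, 4, 8, 8])
```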
本实施例通过将显著运动特征整合至慢速路径SI,通过时间注意力权重分支激发显著运动特征对应的第一时间维度信息,并通过静态特征分支在所述慢速路径SI激发稀疏空间特征对应的第二时间维度信息,然后将第一时间维度信息、第二时间维度信息和稀疏空间特征进行横向特征融合,得到第一时空聚合特征,从而使得将快速路径FP中的显著运动信息整合到SI路径中,使SI路径在融合P帧的运动特征的同时保留了I帧的空间语义特征,实现了时空语义信息的聚合,进而提高了对压缩视频进行行为识别的精度。
基于上述本申请的第一实施例，提出本申请行为识别方法的第四实施例，请参照图4，在上述实施例步骤S40中，依据所述第二时空聚合特征进行压缩视频流的行为识别的步骤包括：
步骤S41,基于所述压缩视频流对应的长期时间特征生成自适应卷积核;
步骤S42,根据所述自适应卷积核,捕捉所述第二时空聚合特征中的长期运动线索,得到所述压缩视频流对应的长期运动特征,并根据所述自适应卷积核,捕捉所述第二时空聚合特征中的短期运动线索,得到所述压缩视频流对应的短期运动特征;
步骤S43,依据所述长期运动特征和所述短期运动特征进行压缩视频流的行为识别。
为了助于理解本申请实施例的技术构思或技术原理,列举一具体实施例进行说明,请参照图8:
如图8所示，本实施例提供一种自适应运动激励（AME，Adaptive Motion Excitation）模块，该AME模块有效地捕捉长期和短期运动变化，通过AME模块自适应地捕捉运动变化，并可以很容易地嵌入到标准的ResNet块中，以进行有效的时间建模。具体为AME模块嵌入到慢速路径SI和快速路径FP的res2至res5中的每个ResNet块中，它被专门插入到第一个1×1的Conv2D层之后的剩余路径中。最后，全连接层基于级联的res5特征输出分类分数，以获得识别结果。
其中,该AME模块设置为:基于所述压缩视频流对应的长期时间特征生成自适应卷积核;根据所述自适应卷积核,捕捉所述第二时空聚合特征中的长期运动线索,得到所述压缩视频流对应的长期运动特征,并根据所述自适应卷积核,捕捉所述第二时空聚合特征中的短期运动线索,得到所述压缩视频
流对应的短期运动特征;依据所述长期运动特征和所述短期运动特征进行压缩视频流的行为识别,从而使得本申请实施例通过添加一个AME模块,从而实现根据全局信息生成一个自适应卷积核,并提取信道移位后的短时运动特征,通过同时捕捉长期和短期的运动变化,提高了行为识别的准确率。
在一实施例中,依据所述长期运动特征和所述短期运动特征进行压缩视频流的行为识别的步骤包括:
步骤A10,根据所述长期运动特征确定长时序行为特征,并根据所述短期运动特征确定短时序行为特征;
步骤A20,将所述长时序行为特征和所述短时序行为特征进行融合分析,得到所述压缩视频流对应的关键行为特征;
步骤A30,依据所述关键行为特征进行压缩视频流的行为识别。
本实施例提出的该AME模块基于长期时间特征生成自适应卷积核,用于帧间短运动特征的动态聚合。如图9中的AME块所示,图9为本申请实施例的AME模块对长短期特征进行捕捉的流程示意图。其中,右分支ST在时间的维度上执行时间卷积以从相邻帧捕获短期运动特征,并且左分支LT使用与TDAC相同的压缩和激励机制,来为每个视频剪辑生成动态卷积核。动态卷积核自适应地提取每个视频的唯一长期运动信息,右分支ST和左分支LT这两个分支通过卷积实现了长期和短期运动特征的融合。
本实施例通过FSIFP和AME完全学习显著频率分量,其中FSIFP同时有效地模拟慢速空间特征和快速时间变化,并且AME生成用于捕获长期和短期运动线索的自适应卷积核。在GFLOPs、UCF101和HMDB51三个数据集上的大量实验表明,本申请实施例行为识别方法提出的F2D-SIFPNet(Frequency 2D Slow-I-Fast-P Network)比基于RGB原始视频的方法快1.72倍,同时在压缩域中实现了最先进的行为识别精度(如图11所示)。
本实施例的AME模块不同于仅建模短期运动信息的方法（如ME（Motion Excitation）方法）：ME在网络的运动特征的最后一个时间维度填入零来匹配特征的维度，而AME对T（time）维特征进行时间卷积来捕捉运动关系。
AME模块生成一种基于长时间特征的自适应卷积核,用于帧间运动特征的动态聚合。右边的分支ST对来自相邻帧的短期运动特征进行建模,左边的分支LT使用与TDAC相同的挤压-激励机制,为每个视频剪辑生成一个动态卷积核。该动态卷积核自适应地提取每个视频特有的长期运动信息。这两个分支通过卷积实现了长期和短期运动特征的融合。
给定的输入X∈R^{C×T×H×W}，其中，C表示输入特征图的通道数，T为输入帧数，H为输入特征图高度，W为输入特征图宽度。将X池化后，输入到左分支LT的第一个1D卷积中，将时间维度从T增加到λT，其中λ是代表时间维度激励的超参数，在一实施例中，λ设为2。

在第二个1D卷积之后，本实施例得到了一个自适应卷积核，其中K为超参数，代表卷积核大小，在一实施例中，K设为3。

在右边的ST分支中，首先挤压输入以获得特征F1∈R^{C/β×T×H×W}，其中β是表示通道维度挤压的超参数，在一实施例中，β设为8。

而运动特征F2可以由F1通过2D卷积层K5计算得到。然后，对F2执行式(3)和式(2)的运算操作后，可以得到特征F3；随后，关键向量M通过1×1的2D卷积和sigmoid激活获得。

然后，通过公式(7)得到输出特征图。
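The following PyTorch sketch shows one plausible reading of the AME block just described: the LT branch turns spatially pooled features into a clip-specific K-tap temporal kernel (time axis treated as channels, expanded by λ and reduced to K taps), the ST branch squeezes channels by β and uses adjacent-frame differences as short-term motion, and the adaptive kernel is convolved over time with the pooled motion signal before a sigmoid gate re-weights the input. Everything beyond the quoted hyper-parameters (λ=2, K=3, β=8) is an assumption rather than the application's exact formulas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AME(nn.Module):
    """Adaptive Motion Excitation sketch: a clip-specific temporal kernel generated
    from long-term pooled features (LT branch) is convolved with short-term
    frame-difference motion features (ST branch), and the result gates the input."""

    def __init__(self, channels: int, t: int, lam: int = 2, k: int = 3, beta: int = 8):
        super().__init__()
        self.k = k
        self.squeeze = nn.Conv2d(channels, channels // beta, kernel_size=1)   # ST: C -> C/beta
        self.motion_conv = nn.Conv2d(channels // beta, channels // beta, 3, padding=1)
        self.restore = nn.Conv1d(channels // beta, channels, kernel_size=1)
        self.lt1 = nn.Conv1d(t, lam * t, 3, padding=1)    # LT: time axis as channels, T -> lam*T
        self.lt2 = nn.Conv1d(lam * t, k, 3, padding=1)    # lam*T -> K taps

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        cp = self.squeeze.out_channels
        # LT branch: one adaptive K-tap temporal kernel per clip (shared over channels).
        pooled = x.mean(dim=(3, 4)).transpose(1, 2)        # (N, T, C)
        kern = torch.softmax(self.lt2(F.relu(self.lt1(pooled))).mean(-1), dim=-1)   # (N, K)
        # ST branch: squeeze channels, then adjacent-frame differences as motion.
        xs = self.squeeze(x.transpose(1, 2).reshape(n * t, c, h, w)).reshape(n, t, cp, h, w)
        motion = torch.zeros_like(xs)
        motion[:, :-1] = self.motion_conv(xs[:, 1:].reshape(n * (t - 1), cp, h, w)
                                          ).reshape(n, t - 1, cp, h, w) - xs[:, :-1]
        m = motion.mean(dim=(3, 4)).transpose(1, 2)        # (N, C/beta, T)
        # Fuse: depthwise temporal convolution of the motion signal with the clip kernel.
        weight = kern.repeat_interleave(cp, dim=0).unsqueeze(1)            # (N*C/beta, 1, K)
        m = F.conv1d(m.reshape(1, n * cp, t), weight,
                     padding=self.k // 2, groups=n * cp).reshape(n, cp, t)
        gate = torch.sigmoid(self.restore(m)).unsqueeze(-1).unsqueeze(-1)  # (N, C, T, 1, 1)
        return x + x * gate                                 # residual re-weighting

y = AME(channels=64, t=8)(torch.randn(2, 64, 8, 14, 14))
print(y.shape)   # torch.Size([2, 64, 8, 14, 14])
```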
最终,将本申请实施例的压缩域快速行为识别网络输出的行为识别结果进行显示。其中,包括将网络输出对应的动作类别,以及行为识别的帧率,帧数等进行实时显示,显示界面如图10所示。
需要说明的是,上述具体实施例仅用于帮助理解本申请实施例的技术构思或技术原理,并不构成对本申请的限定,基于所示出的实施例进行更多形式的简单变换,均应在本申请的保护范围内。
为了助于理解本申请实施例的技术原理,列举一具体实施例的行为识别方法,参照图7和图8:
本申请实施例提供了一种基于压缩域视频的行为识别方法F2D-SIFPNet,以及基于压缩域的行为识别系统,可以优化现有的视频行为识别方案。第一方面,本申请实施例提供了一种压缩域快速行为识别网络,该网络包括:频域Slow-I-Fast-P网络(Frequency Slow-I-Fast-P,FSIFP)、时间注意力连接(Temporal-Dimension Attention Connections,TDACs)和自适应运动激励模块(Adaptive Motion Excitation,AME)这三个部分。首先,FSIFP分为慢速路径SI和快速路径FP,利用二维卷积神经网络直接对频域信息进行有效建模。相较于SIFP的三维卷积神经网络,可以大幅度降低计算成本,同时因为是直接对频域信息进行特征提取,可以避免将数据从频域的DCT系数解码为空间域的RGB数据的过程,简化数据预处理流程,提高行为识别的速度。然后,TDACs将FP路径和SI路径的信息进行融合,选择性地共享FP路径中含有的关键帧显著信息,实现网络中间阶段的特征级融合,可以更加有效提取特征信息,提高行为识别的准确率。随后,本申请实施例设计了一个AME模块,该模块可以根据全局信息生成一个自适应卷积核,并提取信道移位后的短时运动特征,该模块可以同时捕捉长期和短期的运动变化,提高行为识别的准确率。第二方面,本申请实施例实现了一个压缩域实时行为识别系统,该系统包括:通过摄像头获得实时行为频域数据,输入到所提出的压缩域快速行为识别网络中进行行为识别,最终输出并显示行为类别。本申请实施例可以在一些情形的基础上,既大幅度降低了计算成本,又在一定程度上提升了行为识别的准确率。
本申请实施例的基于频域视频的行为识别方法相比基于RGB帧的行为识别方法有更快的速度,同时保持有竞争力的识别准确率,其研究成果可以被广泛应用于人体姿态估计、行人重识别、时序行为检测等更高层级的计算机视觉任务中,对计算机智能视觉处理技术有着推动作用。随着计算机视觉技术的不断发展,视频中的行为识别在视觉监控、视频分析和视频数据挖掘等任务中发挥着重要作用,在智能监控、医疗辅助、动画制作、智能家居环境和人机交互等多个领域中也将会有更广阔的应用前景。
需要说明的是,以上所揭露的仅为本申请一种可选实施例而已,当然不能以此来限定本申请的保护范围,本领域普通技术人员可以理解实现上述实施例的全部或部分流程,并依本申请权利要求所做的等同变化,仍属于本申请所涵盖的范围。
也就是说,上述具体实施例仅用于帮助理解本申请实施例的技术构思,并不构成对本申请的限定,基于该技术构思进行更多形式的简单变换,均应在本申请的保护范围内。
在一种可实施的方式中,所述步骤A30,依据所述关键行为特征进行压缩视频流的行为识别的步骤包括:
步骤B10,从预设的行为特征映射表中,查询得到所述关键行为特征映射的行为类别信息;
步骤B20,将所述行为类别信息,作为对压缩视频流进行行为识别的识别结果。
在本实施例中,该行为类别信息可为摔倒、打架和求救等,本实施例对此不作具体的限定。本领域技术人员可以理解的是,不同的行为类型信息对应不同的关键行为特征。
在本实施例中,该行为特征映射表中存储有多个行为特征,以及各行为特征映射的行为类别信息。其中,行为特征与行为类别信息之间一一映射的映射关系,本领域技术人员可通过预先进行学习训练的方式而完成标定并预存于系统中。
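A toy illustration of the mapping-table lookup in steps B10-B20; the feature identifiers and class labels below are made-up placeholders (in practice the mapping is calibrated through training, as noted above).

```python
# Hypothetical behavior-feature -> behavior-class mapping table (steps B10-B20).
BEHAVIOR_MAP = {
    "fall_like": "fall",              # e.g. someone falling down
    "fight_like": "fight",            # e.g. a brawl
    "wave_for_help": "call_for_help",
}

def lookup_behavior(key_feature_id: str, table: dict = BEHAVIOR_MAP) -> str:
    """Query the preset mapping table for the class mapped to a key behavior
    feature; unknown features fall back to an 'unknown' label."""
    return table.get(key_feature_id, "unknown")

print(lookup_behavior("fall_like"))   # -> fall
```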
本实施例通过从预设的行为特征映射表中,查询得到关键行为特征映射的行为类别信息,并通过将行为类别信息,作为对压缩视频流进行行为识别的识别结果,从而提高了对压缩域视频进行行为识别的效率和精度。
在一种可能的实施方式中,在所述确定所述压缩视频流对应的频域数据的步骤之后,所述方法还包括:
步骤C10,对所述频域数据进行频域信道的下采样处理,得到低频通道的频域数据;
所述提取所述频域数据中的频域I帧数据和频域P帧数据的步骤包括:
步骤C20,从所述低频通道的频域数据中提取频域I帧数据和频域P帧数据。
示例性地,所述步骤C10,对所述频域数据进行频域信道的下采样处理,得到低频通道的频域数据的步骤包括:
步骤D10,将频域数据中的帧间运动矢量进行频域信道的下采样处理,得到低频通道的帧间运动矢量;
步骤D20,将所述频域数据中P帧图像的DCT残差系数进行频域信道的下采样处理,得到低频通道的DCT残差系数;
步骤D30,将低频通道的帧间运动矢量,以及低频通道的DCT残差系数进行连接,得到低频通道的频域P帧数据;
步骤D40,将所述频域数据中I帧图像的DCT系数进行频域信道的下采样处理,得到低频通道的频域I帧数据,并将低频通道的频域I帧数据,以及低频通道的频域P帧数据,作为低频通道的频域数据。
本实施例通过对频域数据进行频域信道的下采样处理,得到低频通道的频域数据,从而选择出频域数据中的显著信道,作为二维卷积神经网络的输入,或者说作为FSIFP网络的输入,以减少数据冗余,由于高频数据包含非显著冗余信息,并且大多数高频分量为零,因此本实施例对频域数据输入至FSIFP网络或二维卷积神经网络之前进行了下采样处理,即通过提供一种频域信道选择(FCS)模块,来去除低判别信道,实现在频域数据送入FSIFP网络或二维卷积神经网络之前,对频域数据进行低频的特征筛选预处理,以增强输入的显著性,从而使得二维卷积神经网络或FSIFP网络对该特征的辨识度更高,本实施例通过频域信道选择(FCS)策略对频域数据(即频域DCT系数)和运动矢量(即帧间运动矢量MV)进行下采样,使区分的时空特征更加突出,确保网络学习有用的低频信息,进一步提
高了行为识别的准确率。
为了助于理解本申请的技术构思或技术原理,列举一具体实施例进行具体说明:
在该具体实施例中，首先将所述压缩视频流进行分段，可以将视频等时长分为TSI份，每个视频片段中随机选择一个I帧的DCT系数和α个P帧的频域数据，含运动矢量数据和残差图像的DCT系数，因此慢速网络的输入帧数为TSI，快速网络的输入帧数为TFP=α×TSI。
然后,将所述视频片段I帧的DCT系数进行重塑和通道选择,选择后的I帧的通道数为24,所述I帧的高度为H/8,宽度为W/8,将TSI×24×H/8×W/8作为慢速路径的网络模型输入
其中,所述视频片段P帧的运动矢量的通道数为2,将所述P帧的运动矢量的高度H和宽度W下采样为H/8和W/8,所述视频片段P帧的运动矢量的输入为TFP×2×(H/8)×(W/8)。
将所述视频片段中P帧的残差图像的DCT系数进行重塑和通道选择,选择后的通道数为24,所述残差的高度为H/8,宽度为W/8,所述视频片段输入为TFP×24×(H/8)×(W/8)。
最后,将所述视频片段的P帧的运动矢量输入和残差图像DCT系数输入进行连接,每个视频片段的快速网络的输入为TFP×26×(H/8)×(W/8)。
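Putting the above shapes together, a small numpy sketch of how the slow (SI) and fast (FP) inputs can be assembled: keep the 24 low-frequency channels of the I-frame and residual DCT tensors (taking the first 24 channels assumes a zig-zag, low-to-high channel ordering), downsample the motion vectors to H/8×W/8 (simple striding here; the application only states that the downsampling preserves the motion cues), and concatenate MV and residual channels into the 26-channel fast-path input.

```python
import numpy as np

def build_fsifp_inputs(i_dct, r_dct, mv, low_freq: int = 24):
    """Assemble SI/FP inputs with the shapes quoted above.
    i_dct: (T_si, 64, H/8, W/8)  I-frame DCT tensors
    r_dct: (T_fp, 64, H/8, W/8)  P-frame residual DCT tensors
    mv:    (T_fp, 2,  H,   W)    motion vectors at full resolution
    Returns (T_si, 24, H/8, W/8) and (T_fp, 26, H/8, W/8) arrays."""
    si = i_dct[:, :low_freq]                    # keep the 24 low-frequency channels
    r = r_dct[:, :low_freq]
    mv_ds = mv[:, :, ::8, ::8]                  # H x W -> H/8 x W/8
    fp = np.concatenate([mv_ds, r], axis=1)     # 2 + 24 = 26 channels
    return si, fp

si, fp = build_fsifp_inputs(np.zeros((4, 64, 28, 28), np.float32),
                            np.zeros((16, 64, 28, 28), np.float32),
                            np.zeros((16, 2, 224, 224), np.float32))
print(si.shape, fp.shape)   # (4, 24, 28, 28) (16, 26, 28, 28)
```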
在本实施例中,FSIFP网络以频域I帧、频域残差R和运动矢量MV作为输入。对于I帧和残差R,首先通过熵解码、Z型重排序和逆量化得到它们的频域表示,即DCT系数。然后,利用频域通道选择选择24个低频通道,这些低频通道包含了空间图像中几乎所有的外观信息。MV则直接通过熵解码得到。MV表示P帧中宏块的位移。本实施例将MV从H×W×2下采样到H/8×W/8×2,同时不丢失高分辨率视频中包含的运动线索,实现在尽量不丢失运动信息的情况下,使频域数据具有较小的尺寸,且下采样后的频域数据仍然是最具鉴别性的信息。因此,FCS降低了网络的计算复杂度,通过对频域数据进行下采样处理,以增强输入的显著性,从而使得二维卷积神经网络或FSIFP网络对该特征的辨识度更高,抑制背景运动通道,从而减少背景的干扰,提高行为识别准确率,使区分的时空特征更加突出,确保网络学习有用的低频信息,进一步提高了行为识别的准确率。
另外,本实施例提供了一种新的频域数据增强技术:块水平翻转(HFB),它可以生成不同的训练样本,以防止零计算和零参数下的严重过拟合。频域视频帧包含多个DCT块,水平逐块翻转(HFB)以块为单位翻转视频帧,并根据水平对称轴交换块的位置。HFB保留了块内的频带分布,从而避免了在频域中破坏视频帧的空间语义。
为了助于理解本申请的技术构思,以及展示相关实验数据而支持本申请的技术原理,列举一具体实施例:
本实施例的F2D-SIFPNet是一个轻量级网络,该网络使用二维卷积神经网络作为骨干网络,计算成本相对于三维卷积神经网络相对较小。直接使用频域数据的DCT系数可以避免将数据从频域的DCT系数解码到空间域的RGB信息的过程,简化数据预处理流程,降低网络的计算量,提高行为识别的速度。
如图11所示,F2D-SIFPNet的准确率优于其他所有压缩域的方法,例如CoViAR、DMC-Net和MV2Flow。F2D-SIFPNet也优于输入数据为频域数据的方法:在UCF101上,F2D-SIFPNet方法的准确率比Fast-CoViAR高10.3%,比Faster-FCoViAR高4.6%。这说明直接将频域I帧数据、MV、R的频域数据输入到网络中并不能充分利用I帧与P帧之间的时空相关性,而F2D-SIFPNet方法有效地融合了I帧和P帧的静态空间特征和动态运动信息。此外,F2D-SIFPNets准确率高于SIFP,而SIFP的GFLOPs是F2D-SIFPNets的2.73倍。在F2D-SIFPNets的基础上,F2D-SIFPNet增加了输入帧数,进一步提高了准确率。F2D-SIFPNet优于许多基于RGB帧数据、需要对压缩视频进行完全解码的方法,例如I3D,ECO和TSM。可以看到,只有两个方法(TEA和TDN)在UCF101上的准确率比F2D-SIFPNet略高,但它们的GFLOPs更大。
图11在UCF101和HMDB51数据集上,F2D-SIFPNet与SOTA方法进行比较。“R18”表示ResNet18,“R50”表示ResNet50,“R152”表示ResNet152。计算复杂度以单个crop作为输入的模型GFLOPs来衡量。
本实施例通过新的TDAC结合了时间注意力机制,有效和动态地调整慢速SI路径和快速路径FP的连接。AME模块可以自适应地捕捉运动变化,并且可以很容易地嵌入到标准ResNet瓶颈块中进行有效的时间建模,这两个模块可以提升行为识别的准确率。TDAC融合方式的有效性如图12所示。表中显示了如下几种不同的融合方式的结果:(1)SI-only(未融合),(2)FP-only(未融合),(3)后期融合(SI-only和FP-only的加权平均得分),(4)T-Conv,(5)TDAC。仅快速路径FP(未融合)的准确率只有79.3%,而SI路径融合后的准确率提高了4.7%(从TSACs的90.2%提高到最好的94.9%)。此外,不同的FSIFP融合模型都优于SI-only和FP-only,说明快速路径FP和SI路径有很大的互补性。
AME模块的有效性如图13所示。将AME模块与不加入时间建模模块、加入ME时间建模模块进行比较,结果表明,AME模块比基线方法好2.6%,比ME高1.1%,说明自适应的运动激励模块AME比静态的运动激励模块ME能更有效地提升行为识别的性能。AME模块使得本实施例的方法能够同时利用帧间的长期运动线索和短期时间信息,从而提高了识别的准确性。图14的可视化结果可以更直观地观察到这种改进:由于AME模块的自适应运动激励,F2D-SIFPNet不仅可以定位短期运动区域,还可以定位长期运动特征。例如,在“PlayingPiano”的可视化中,本实施例的F2D-SIFPNet既关注手指的短期运动,也可以关注手臂的长期摆动,而ME模块只能关注手指的短期运动,基线方法更多地关注背景。
在本实施例中,图14利用GradCam对动态区进行观察。上图:基线,中间:ME模块,底部:AME模块。在这个可视化的对照图像中,本实施例用AME(FSIFP+TSACs+AME)或ME(FSIFP+TSACs+ME)训练一个(4+32)帧网络。本实施例只在频率的I帧上显示GradCam图,从而在同等的计算量下可以输入更多的MV和视频帧,捕捉到更多运动信息,更有效的识别长时视频中的行为。
本申请捕捉长视频中的运动信息有效性如图15所示。本实施例在ActivityNet1.3这种视频时长较长的数据集上进行了实验,F2D-SIFPNet的准确率优于其他所有压缩域的方法,例如CoViAR、TSM和TAM(TSM和TAM为采用压缩域的I帧作为输入的结果),达到了81.2的准确率和86.2的mAP。可以看到随着输入帧数的增加,8-32的模型F2D-SIFPNet的结果较4-16的模型F2D-SIFPNets的结果提升较多,而将F2D-SIFPNet的clip从1增加到3,准确率也有较明显的提升。这说明在从剪辑(trimmed)视频数据到非剪辑(untrimmed)视频数据的迁移学习过程中,对于ActivityNet1.3这种视频时长较长的数据集,输入帧数可以增加网络学习整个视频的全局时空信息的能力。直接将I帧数据、MV、R的频域数据输入到网络中并不能充分利用I帧与P帧之间的时空相关性,本实施例的F2D-SIFPNet方法有效地融合了I帧和P帧的静态空间特征和动态运动信息。图15在ActivityNet1.3数据集上与SOTA方法进行比较。为了便于比较,本实施例只给出了没有光流的方法的结果。标有“our impl”的结果是在本实施例的设备上的进行复现实验的结果,其他方法的结果为在他们的论文中报道的结果。“R18”表示ResNet18,“R50”表示ResNet50,“R152”表示ResNet152。计算复杂度以单个crop作为输入的模型GFLOPs来衡量。
在本实施例中,F2D-SIFPNet的准确率也高于大部分基于RGB输入方法,在相似计算复杂度的情况下具有更好的识别准确率,更好的验证了F2D-SIFPNet对于长时视频中运动信息捕捉的有效性。
该具体实施例阐述的诸多细节仅助于理解本申请通过相关实验研究支持本申请的技术原理,并不构成对本申请的限定,基于该技术原理进行更多形式的简单变换,均在本申请的保护范围内。
此外,本申请实施例还提供一种电子设备,该电子设备例如可以是边缘路由器,还可以是宽带远程接入服务器(Broadband Remote Access Server,BRAS)、宽带网络网关(Broadband Network Gateway)、服务GPRS支持节点(Serving GPRS Support Node,SGSN)、网关GPRS支持节点(Gateway GPRS Support Node,GGSN)、移动管理实体(MobilityManagement Entity,MME)或服务网关(Serving GateWay,S-GW)
等。
参照图16,图16为本申请实施例提供的一种电子设备的硬件结构示意图。如图16所示,电子设备可以包括:处理器1001,例如中央处理器(Central Processing Unit,CPU),通信总线1002、用户接口1003,网络接口1004,存储器1005。其中,通信总线1002用于实现这些组件之间的连接通信。用户接口1003可以包括显示屏(Display)、输入单元比如键盘(Keyboard),用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可以包括标准的有线接口、无线接口(如无线保真(Wireless-Fidelity,WI-FI)接口)。存储器1005可以是高速的随机存取存储器(Random Access Memory,RAM),也可以是稳定的非易失性存储器(Non-Volatile Memory,NVM),例如磁盘存储器。存储器1005还可以是独立于前述处理器1001的存储设备。
本领域技术人员可以理解,图16中示出的结构并不构成对电子设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。如图16所示,作为一种存储介质的存储器1005中可以包括操作系统、数据存储模块、网络通信模块、用户接口模块以及行为识别程序。
在图16所示的电子设备中,网络接口1004主要用于与其他设备进行数据通信;用户接口1003主要用于与用户进行数据交互;本实施例中的处理器1001、存储器1005可以设置在通信设备中,通信设备通过处理器1001调用存储器1005中存储的行为识别程序,并执行上述任一实施例提供的应用于行为识别方法。
本实施例提出的终端与上述实施例提出的应用于行为识别方法属于同一发明构思,未在本实施例中详尽描述的技术细节可参见上述任意实施例,并且本实施例具备与执行行为识别方法相同的有益效果。
此外,本申请实施例还提出一种计算机可读存储介质,该计算机可读存储介质可以为非易失性计算机可读存储介质,该计算机可读存储介质上存储有行为识别程序,该行为识别程序被处理器执行时实现如上所述的本申请行为识别方法。
本申请电子设备和计算机可读存储介质的各实施例,均可参照本申请行为识别方法各个实施例,此处不再赘述。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对一些情形做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在如上所述的一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台电子设备(可以是手机、计算机、服务器、空调器、或者网络设备等)执行本申请各个实施例所述的方法。
以上仅为本申请的可选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。
Claims (10)
- 一种行为识别方法,其中,所述方法包括:获取压缩视频流,确定所述压缩视频流对应的频域数据,并提取所述频域数据中的频域I帧数据和频域P帧数据;将所述频域I帧数据输入至二维卷积神经网络的慢速路径SI进行静态语义处理,得到稀疏空间特征,并将所述频域P帧数据输入至所述二维卷积神经网络的快速路径FP进行运动信息处理,得到快速运动特征;在所述快速路径FP提取所述快速运动特征中的显著运动特征,基于时间注意力维度将所述显著运动特征整合至所述慢速路径SI,并与所述稀疏空间特征进行横向特征融合,得到第一时空聚合特征;将所述第一时空聚合特征与所述快速运动特征进行纵向特征融合,得到第二时空聚合特征,并依据所述第二时空聚合特征进行压缩视频流的行为识别。
- 如权利要求1所述的行为识别方法,其中,所述确定所述压缩视频流对应的频域数据的步骤包括:将所述压缩视频流进行熵解码,得到帧间运动矢量和中间压缩码流数据;将所述中间压缩码流数据依次进行重排序操作和逆量化处理,得到I帧图像的DCT系数,以及P帧图像的DCT残差系数;将所述帧间运动矢量和所述DCT残差系数进行连接,得到所述频域数据对应的频域P帧数据,并根据所述DCT系数确定所述频域数据对应的频域I帧数据;将所述频域I帧数据和所述频域P帧数据作为所述压缩视频流对应的频域数据。
- 如权利要求1所述的行为识别方法,其中,所述时间注意力维度包括静态特征分支和时间注意力权重分支,所述基于时间注意力维度将所述显著运动特征整合至所述慢速路径SI,并与所述稀疏空间特征进行横向特征融合,得到第一时空聚合特征的步骤包括:将所述显著运动特征整合至所述慢速路径SI;通过所述时间注意力权重分支激发所述显著运动特征对应的第一时间维度信息,并通过所述静态特征分支在所述慢速路径SI激发所述稀疏空间特征对应的第二时间维度信息;将所述第一时间维度信息、所述第二时间维度信息和所述稀疏空间特征进行横向特征融合,得到第一时空聚合特征。
- 如权利要求1所述的行为识别方法,其中,所述依据所述第二时空聚合特征进行压缩视频流的行为识别的步骤包括:基于所述压缩视频流对应的长期时间特征生成自适应卷积核;根据所述自适应卷积核,捕捉所述第二时空聚合特征中的长期运动线索,得到所述压缩视频流对应的长期运动特征,并根据所述自适应卷积核,捕捉所述第二时空聚合特征中的短期运动线索,得到所述压缩视频流对应的短期运动特征;依据所述长期运动特征和所述短期运动特征进行压缩视频流的行为识别。
- 如权利要求4所述的行为识别方法,其中,所述依据所述长期运动特征和所述短期运动特征进行压缩视频流的行为识别的步骤包括:根据所述长期运动特征确定长时序行为特征,并根据所述短期运动特征确定短时序行为特征;将所述长时序行为特征和所述短时序行为特征进行融合分析,得到所述压缩视频流对应的关键 行为特征;依据所述关键行为特征进行压缩视频流的行为识别。
- 如权利要求5所述的行为识别方法,其中,所述依据所述关键行为特征进行压缩视频流的行为识别的步骤包括:从预设的行为特征映射表中,查询得到所述关键行为特征映射的行为类别信息;将所述行为类别信息,作为对压缩视频流进行行为识别的识别结果。
- 如权利要求1所述的行为识别方法,其中,在所述确定所述压缩视频流对应的频域数据的步骤之后,所述方法还包括:对所述频域数据进行频域信道的下采样处理,得到低频通道的频域数据;所述提取所述频域数据中的频域I帧数据和频域P帧数据的步骤包括:从所述低频通道的频域数据中提取频域I帧数据和频域P帧数据。
- 如权利要求7所述的行为识别方法,其中,所述对所述频域数据进行频域信道的下采样处理,得到低频通道的频域数据的步骤包括:将频域数据中的帧间运动矢量进行频域信道的下采样处理,得到低频通道的帧间运动矢量;将所述频域数据中P帧图像的DCT残差系数进行频域信道的下采样处理,得到低频通道的DCT残差系数;将低频通道的帧间运动矢量,以及低频通道的DCT残差系数进行连接,得到低频通道的频域P帧数据;将所述频域数据中I帧图像的DCT系数进行频域信道的下采样处理,得到低频通道的频域I帧数据,并将低频通道的频域I帧数据,以及低频通道的频域P帧数据,作为低频通道的频域数据。
- 一种电子设备,其中,所述电子设备包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的行为识别程序,所述行为识别程序被所述处理器执行时实现如权利要求1至8中任一项所述的行为识别方法。
- 一种计算机可读存储介质,其中,所述计算机可读存储介质上存储有行为识别程序,所述行为识别程序被处理器执行时实现如权利要求1至8中任一项所述的行为识别方法。
Applications Claiming Priority (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211580062.5A (CN118212559A) | 2022-12-08 | 2022-12-08 | 行为识别方法、电子设备及计算机可读存储介质 |
| CN202211580062.5 | 2022-12-08 | | |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2024120125A1 (zh) | 2024-06-13 |

Family

ID=91378554

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/131344 (WO2024120125A1) | 行为识别方法、电子设备及计算机可读存储介质 | 2022-12-08 | 2023-11-13 |

Country Status (2)

| Country | Link |
|---|---|
| CN (1) | CN118212559A (zh) |
| WO (1) | WO2024120125A1 (zh) |
Patent Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107122712A (zh) * | 2017-03-27 | 2017-09-01 | 大连大学 | 基于卷积神经网络和双向局部特征聚合描述向量的掌纹图像识别方法 |
| CN113920581A (zh) * | 2021-09-29 | 2022-01-11 | 江西理工大学 | 一种时空卷积注意力网络用于视频中动作识别的方法 |
| CN114241598A (zh) * | 2021-11-18 | 2022-03-25 | 浙江工业大学 | 一种基于并联注意力和双流权重自适应的动作识别方法 |
| CN115019389A (zh) * | 2022-05-27 | 2022-09-06 | 成都云擎科技有限公司 | 一种基于运动显著性和SlowFast的行为识别方法 |

Non-Patent Citations (1)

LI JIAPENG; WEI PING; ZHANG YONGCHI; ZHENG NANNING: "A Slow-I-Fast-P Architecture for Compressed Video Action Recognition", ACM, New York, NY, USA, 12 October 2020, pages 2039-2047, XP059453342, ISBN: 978-1-4503-9157-3, DOI: 10.1145/3394171.3413641 *

Also Published As

| Publication number | Publication date |
|---|---|
| CN118212559A (zh) | 2024-06-18 |