WO2022174616A1 - Behavior recognition method and apparatus, and electronic device and storage medium - Google Patents

Behavior recognition method and apparatus, and electronic device and storage medium

Info

Publication number
WO2022174616A1
WO2022174616A1 · PCT/CN2021/127119 · CN2021127119W
Authority
WO
WIPO (PCT)
Prior art keywords
data
sub
channel
adjacent frames
convolution model
Prior art date
Application number
PCT/CN2021/127119
Other languages
French (fr)
Chinese (zh)
Inventor
苏海昇 (SU Haisheng)
Original Assignee
上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Publication of WO2022174616A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Definitions

  • the present disclosure relates to the technical field of computer vision, and in particular to a behavior recognition method and apparatus, an electronic device, and a storage medium.
  • embodiments of the present disclosure provide a method and apparatus for behavior recognition, an electronic device, and a storage medium.
  • a behavior recognition result is obtained based on the first data and the second data.
  • the acquiring differential information of data of adjacent frames includes:
  • performing feature alignment on the data of the adjacent frames includes:
  • a similarity matrix is used to perform feature alignment on the features of two adjacent frames after dimension reduction.
  • the obtaining the foreground data of the adjacent frames from the feature-aligned data based on the at least one ladder-structured convolution model includes:
  • the first frame data and the second frame data in the foreground data of the adjacent frames are each copied into N copies;
  • the two-dimensional convolution model is an N-order two-dimensional convolution model
  • the determining, based on the differential information, the first data representing motion information in the data includes:
  • the foreground data in the data is processed based on the channel weight to obtain the first data.
  • the processing of foreground data in the data based on the channel weight to obtain the first data includes:
  • the processing of the data grouped by each channel in the time series dimension to obtain the second data includes:
  • the processed data of all channel groups are fused to obtain the second data.
  • the use of the time series convolution model to process the data grouped by each channel includes:
  • the time-series convolution model is used to fuse data time-series information of different scales in the channel dimension.
  • the use of the time series convolution model to fuse data time series information of different scales in the channel dimension includes:
  • the data of each channel grouping is divided into N sub-data, the second sub-data in the N sub-data is input into the one-dimensional convolution model in the time series convolution model, and the second sub-series data is obtained;
  • the Kth sub-series data is fused with the (K+1)th sub-data, and preprocessing is performed on the fused data; wherein, K is greater than or equal to 2, and K is less than or equal to N-1;
  • after the first result and the second result are fused, they are input into the one-dimensional convolution model to obtain the (K+1)th sub-series data.
  • the fusing of the processed data of all channel groups to obtain the second data includes:
  • the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data are respectively subjected to time-series convolution processing, and the time-series convolved data are then concatenated to obtain the second data.
  • before the acquiring of the difference information of the data of the adjacent frames of the image, the method further comprises: using a three-dimensional convolution model to acquire the data of the adjacent frames of the image;
  • the method further includes: performing channel dimension reduction processing on the first data and the second data.
  • the behavior recognition method is implemented by a temporal motion model, and the temporal motion model includes an enhanced motion transformation module and a long-term temporal modeling module;
  • the enhanced motion transformation module is used to obtain the difference information of the data of adjacent frames of the image; based on the difference information, determine the first data representing the motion information in the data;
  • the long-time sequence modeling module is configured to perform channel grouping on the data of the adjacent frames based on the feature; and process the data of each channel grouping in the time sequence dimension to obtain second data.
  • the three-dimensional convolution model is used to obtain data of adjacent frames of the image
  • the one-dimensional convolution model is used to perform channel dimension reduction processing on the first data and the second data.
  • an acquisition unit configured to acquire differential information of data of adjacent frames
  • a determining unit configured to determine, based on the differential information, first data representing motion information in the data
  • a grouping unit configured to perform channel grouping on the data of the adjacent frames based on the feature
  • a processing unit configured to process the data grouped by each channel in the time sequence dimension to obtain second data
  • An identification unit configured to obtain a behavior identification result based on the first data and the second data.
  • the obtaining unit is further configured to perform feature alignment on the data of the adjacent frames; obtain the foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure; and perform differential processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.
  • the obtaining unit is further configured to obtain features of the data of two adjacent frames; perform dimension reduction processing on the feature of each frame in the channel dimension; and use a similarity matrix to perform feature alignment on the features of the two adjacent frames after dimension reduction.
  • the obtaining unit is further configured to copy each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies;
  • the first copy of the second frame data among the N copies is input into a two-dimensional convolution model, and the obtained first output result is differenced with the first copy of the first frame data to obtain a first difference result;
  • the two-dimensional convolution model is an N-order two-dimensional convolution model;
  • the first output result is added to the (M+1)th copy of the second frame data and then input into the two-dimensional convolution model, and the obtained (M+1)th output result is differenced with the Mth copy of the first frame data to obtain the (M+1)th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; after the obtained first to Nth difference results are concatenated, they are input into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
  • the determining unit is further configured to determine a channel weight based on the differential information; and process foreground data in the data based on the channel weight to obtain the first data.
  • the determining unit is further configured to add the foreground data to the product of the foreground data and the channel weight to obtain the first data.
  • the processing unit is further configured to use a time series convolution model to process the data grouped by each channel; and fuse the processed data of all channel groups to obtain the second data.
  • the processing unit is further configured to use the time-series convolution model to fuse data time-series information of different scales in the channel dimension.
  • the processing unit is further configured to divide the data of each channel group into N sub-data, and input the second sub-data into the one-dimensional convolution model in the time-series convolution model to obtain the second sub-series data; fuse the Kth sub-series data with the (K+1)th sub-data and perform preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; compute a first result by point-multiplying the value obtained by the preprocessing with the Kth sub-series data, and a second result by point-multiplying the difference between 1 and the value obtained by the preprocessing with the (K+1)th sub-data; and after fusing the first result and the second result, input them into the one-dimensional convolution model to obtain the (K+1)th sub-series data.
  • the processing unit is further configured to perform time-series convolution processing on the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data respectively, and then concatenate the time-series convolved data to obtain the second data.
  • the processing unit is further configured to use a three-dimensional convolution model to obtain data of adjacent frames of the image; and/or, after obtaining the second data, use a one-dimensional convolution model to perform a Channel dimension reduction processing is performed on the first data and the second data.
  • the computer program product provided by the embodiments of the present disclosure includes computer-executable instructions, and after the computer-executable instructions are executed, the above-mentioned behavior recognition method can be implemented.
  • Executable instructions are stored on the storage medium provided by the embodiments of the present disclosure, and when the executable instructions are executed by the processor, the above-mentioned behavior recognition method is implemented.
  • the electronic device includes a memory and a processor
  • the memory stores computer-executable instructions
  • the processor can implement the above-mentioned behavior recognition method when running the computer-executable instructions stored in the memory.
  • difference information of data of adjacent frames is obtained; based on the difference information, first data representing motion information in the data is determined; channel grouping is performed on the data of the adjacent frames based on features; the data of each channel group is processed in the time-series dimension to obtain second data; and a behavior recognition result is obtained based on the first data and the second data.
  • in this way, the difference information of the data of adjacent frames can be obtained and the background noise in the image can be eliminated, and processing each channel group in the time-series dimension realizes a larger temporal receptive field; therefore, the behavior recognition method provided by the embodiments of the present disclosure can improve the accuracy of behavior recognition in both the spatial and temporal dimensions.
  • FIG. 1 is a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an optional processing flow for an identification device according to an embodiment of the present disclosure to obtain differential information of data of adjacent frames;
  • FIG. 3 is a schematic diagram of an optional processing flow of feature alignment performed on the data of the adjacent frames by an identification device according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of performing dimension reduction processing on features of different time frames according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of cascading features using a step cascade structure according to an embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of a data processing flow for adding an EMT module and a TSS module according to an embodiment of the present disclosure
  • FIG. 7 is a schematic diagram of the feature maps of data after passing through the EMT module according to an embodiment of the present disclosure;
  • FIG. 8 is a schematic structural composition diagram of a behavior recognition device provided by an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural composition diagram of an electronic device according to an embodiment of the disclosure.
  • the volume of video data has exploded in recent years.
  • behavior recognition has a wide range of needs in scenarios such as image recognition, human-computer interaction, and personalized recommendation.
  • the process of video-based action recognition is to determine the action category based on a given video containing an action.
  • the accuracy of judging behavior categories is an important indicator of behavior recognition, and motion modeling and time series modeling based on video data are important factors that affect the results of behavior recognition.
  • motion modeling based on video data most commonly models motion information between adjacent frames based on optical flow features.
  • two-stream based action recognition methods extract motion features through optical flow modeling.
  • the extraction and modeling process of optical flow requires a large amount of data calculation, which is difficult to apply to scenarios with high real-time requirements.
  • an important role of optical flow is to highlight moving objects when describing the motion relationship between adjacent frames
  • an alternative solution is to use the difference of features between different frames to approximate optical flow; however, when optical flow is approximated in this way, the edge motion information of moving objects and of non-moving objects is obtained at the same time; since a non-moving object belongs to the background part of the image, its edge motion information is noise, and the edge motion information of the stationary parts of a moving object is also interference noise.
  • the first scheme uses the structure of a two-dimensional (2D) convolutional neural network (CNN) plus an inter-frame aggregator.
  • the aggregator generally uses average/max pooling, three-dimensional (3D) convolution, or recurrent neural network (RNN) operations; this scheme simply performs frame-level score fusion or frame-level high-level feature fusion, but does not consider the aggregation of timing information at the feature level.
  • the second solution is to use 3D convolution, which uses 3D convolution to aggregate time sequence relationships at the feature level.
  • since 3D convolution has many parameters and a large amount of computation, 3D convolution is decoupled into 2D+1D convolution: the 2D convolution is responsible for spatial information modeling, and the 1D convolution is only responsible for temporal relationship modeling; however, 3D/(2D+1D) convolution only models the timing relationship within a local window, and for long-range timing relationships, vertically stacked convolution blocks are used to model long-range timing; this vertical structure makes the shallow temporal convolutions difficult to optimize.
  • an embodiment of the present disclosure proposes a behavior recognition method.
  • Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless otherwise indicated.
  • the execution subject of the behavior recognition method provided by the embodiments of the present disclosure may be a behavior recognition device, such as a terminal device or a server, or any other device with data processing capability; this is not limited in the embodiments of the present disclosure.
  • a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure includes at least the following steps:
  • Step S101 acquiring difference information of data of adjacent frames.
  • the behavior recognition device obtains differential information of data of adjacent frames.
  • the adjacent frames may be two adjacent frames in the video data.
  • an optional processing flow for the identification device to obtain differential information of data of adjacent frames includes at least the following steps:
  • Step S1a performing feature alignment on the data of the adjacent frames.
  • an optional processing flow for the identification device to perform feature alignment on the data of the adjacent frames includes at least the following steps:
  • Step S1a1 acquiring data features of two adjacent frames.
  • the shape of the features of the data may be N×F×C×H×W, where N represents the batch size, F is the number of frames, C is the number of channels, H is the height of a single-frame image, and W is the width of a single-frame image.
  • Step S1a2 performing dimension reduction processing on the channel dimension of the feature of each of the two adjacent frames.
  • the recognition device performs frame-level separation on the input feature X to obtain the feature of each individual frame.
  • the recognition device performs dimensionality reduction on the feature of each frame in the channel dimension, using a 1×1 convolution to compress the feature channels, as shown in formulas (1) and (2): formula (1) is the result of the channel-dimension reduction for the feature of the frame at time t, and formula (2) is the result for the feature of the frame at time t+1, as shown in Fig. 4.
  • l is the channel compression ratio, and the value of l can be flexibly set according to the actual application scenario, for example, it is set to a value such as 16.
  • Step S1a3 using the similarity matrix to perform feature alignment on the features of the two adjacent frames after dimension reduction processing.
  • the recognition device uses the similarity matrix to warp and align the adjacent frames, as shown in the following formulas (3) and (4).
  • the r() function is used to transform the size and shape of the feature.
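  • to make this concrete, the following is a minimal PyTorch sketch of one plausible reading of the alignment step (the bodies of formulas (1)-(4) are not reproduced in this extract): a 1×1 convolution compresses each frame's channels by the ratio l, a similarity matrix is computed between the two compressed feature maps, and the softmax-normalized matrix warps the frame at time t+1 onto the frame at time t; the class and variable names, the softmax normalization, and the warping direction are all assumptions rather than the patent's definitive formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlign(nn.Module):
    """Hypothetical sketch of formulas (1)-(4): channel compression plus
    similarity-matrix warping. Per-frame features are assumed to have
    shape [N, C, H, W]; l is the channel compression ratio (e.g. 16)."""
    def __init__(self, channels: int, l: int = 16):
        super().__init__()
        # formulas (1)/(2): 1x1 convolution compresses C channels to C // l
        self.compress = nn.Conv2d(channels, channels // l, kernel_size=1)

    def forward(self, x_t: torch.Tensor, x_t1: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x_t.shape
        q = self.compress(x_t).flatten(2)         # [N, C//l, H*W]
        k = self.compress(x_t1).flatten(2)        # [N, C//l, H*W]
        # similarity matrix between all spatial positions of the two frames
        sim = torch.matmul(q.transpose(1, 2), k)  # [N, H*W, H*W]
        attn = F.softmax(sim, dim=-1)
        # warp frame t+1 onto frame t; r() corresponds to the final reshape
        v = x_t1.flatten(2)                       # [N, C, H*W]
        aligned = torch.matmul(v, attn.transpose(1, 2))
        return aligned.view(n, c, h, w)
```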
  • Step S1b obtaining the foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure.
  • the identification device utilizes a set of 2D convolutions in a ladder structure to extract data representing motion information.
  • the set of ladder-structured 2D convolutions may consist of 2, 4, or 6 2D convolutions; taking a set of 4 2D convolutions as an example, the feature is divided into 4 parts along the channel dimension, and formulas (5), (6), (7) and (8) are used to perform convolution calculation on the aligned features, obtaining the foreground data of the adjacent frames from the feature-aligned data so as to obtain multi-scale motion information.
  • foreground data of adjacent frames can be extracted, that is, motion information can be extracted, while background data in adjacent frames is filtered out.
  • the right part of Fig. 4 is the principle flow chart of the multi-scale difference module (MSFD, Multi Scale Feature Difference).
  • N is 4 in Figure 4.
  • the first copy of the second frame data among the N copies is input into the two-dimensional convolution model, and the obtained first output result is differenced with the first copy of the first frame data to obtain the first difference result;
  • the two-dimensional convolution model is an N-order two-dimensional convolution model;
  • the first output result is added to the (M+1)th copy of the second frame data and input into the two-dimensional convolution model, and the obtained (M+1)th output result is differenced with the Mth copy of the first frame data to obtain the (M+1)th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1.
  • different motion change information is captured through a set of 2D convolutions, so that the subsequent motion difference information can more accurately describe and identify behaviors.
  • Step S1c performing differential processing on the foreground data to obtain differential information of the foreground data of the adjacent frames.
  • the identification device performs differential processing on the foreground data by using the following formula (9) and formula (10) to obtain differential information of the foreground data of adjacent frames.
  • the background motion noise can be eliminated.
  • the result obtained after the 5th data is subjected to 2D convolution processing is fused with the 6th data, and the result of 2D convolution processing of the fused 5th and 6th data is then fused with the 7th data.
  • M_out represents the difference information of the foreground data of adjacent frames.
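  • the ladder-structured extraction and differencing described above can be illustrated with the following hedged PyTorch sketch; the 3×3 kernel size, the choice of N = 4, and fusing the concatenated differences with a 1×1 convolution (standing in for the one-dimensional convolution mentioned in the text) are assumptions, not the patent's exact formulas (5)-(10).

```python
import torch
import torch.nn as nn

class MultiScaleDiff(nn.Module):
    """Hypothetical sketch of the ladder-structured 2D convolutions (MSFD):
    N cascaded convolutions over frame t+1, each stage differenced against
    frame t, with the N difference maps concatenated and fused."""
    def __init__(self, channels: int, n: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(n)
        )
        # the concatenated difference results are fused back to C channels
        self.fuse = nn.Conv2d(channels * n, channels, kernel_size=1)

    def forward(self, f_t: torch.Tensor, f_t1: torch.Tensor) -> torch.Tensor:
        diffs, out = [], f_t1
        for conv in self.convs:
            # stage m: the previous output is added to a fresh copy of frame t+1
            out = conv(out if not diffs else out + f_t1)
            diffs.append(out - f_t)  # difference against frame t
        return self.fuse(torch.cat(diffs, dim=1))
```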
  • Step S102 Based on the difference information, determine first data representing motion information in the data.
  • the channel weight is determined based on the differential information of the foreground data, that is, the differential information of the foreground data is used as the channel weight, and the foreground data is processed by using the channel weight to obtain the first data representing motion information in the data.
  • using the channel weight to process the foreground data may be using the channel weight to enhance the foreground data to obtain the first data representing motion information in the data.
  • the difference information may also be directly used as the first data representing the motion information.
  • the channel weight is determined by the following formula (11); the foreground data is enhanced by the following formula (12), that is, the foreground data is multiplied by the channel weight to obtain a product result, and the product result is added to the foreground data to obtain the first data, as sketched below.
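  • a compact sketch of formulas (11) and (12) as described here; deriving the channel weight from the differential information via global average pooling and a sigmoid is an assumption, since only the residual enhancement itself is spelled out in this extract.

```python
import torch

def enhance_foreground(fg: torch.Tensor, diff: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: a channel weight is derived from the differential
    information (formula (11), assumed here to be pooling + sigmoid), and the
    foreground features are residually enhanced (formula (12))."""
    # [N, C, H, W] -> [N, C, 1, 1] channel descriptor of the motion difference
    weight = torch.sigmoid(diff.mean(dim=(2, 3), keepdim=True))
    # product of foreground and weight, added back to the foreground
    return fg + fg * weight
```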
  • the inter-frame similarity matrix is used to realize feature alignment between frames, eliminating as much as possible the interference caused by background jitter.
  • the embodiment of the present disclosure uses a set of stepped 2D convolutions to extract data of different scales, and then performs differential processing on the data of different scales, thereby eliminating background noise in the image and obtaining motion information of different scales; this processing is performed by the enhanced motion transformation (EMT) module.
  • Step S103 Perform channel grouping on the data of the adjacent frames based on the feature.
  • the identification device processes the data of each channel group by using a time series convolution model, and fuses the processed data of all channel groups to obtain the second data.
  • Step S104 Process the data grouped by each channel in the time sequence dimension to obtain second data.
  • the identification device uses the time-series convolution model to fuse data time-series information of different scales in the channel dimension; for example, the identification device can use 1D time-series convolution to process the data of each channel group, as shown in the following formulas (13) to (15):
  • the data of each channel grouping is divided into N sub-data, the second sub-data is input into the one-dimensional convolution model in the described time series convolution model, and the second sub-series data is obtained;
  • after the Kth sub-series data and the (K+1)th sub-data are fused, preprocessing is performed on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; a first result is computed by point-multiplying the value obtained by the preprocessing with the Kth sub-series data, and a second result by point-multiplying the difference between 1 and the value obtained by the preprocessing with the (K+1)th sub-data; after the first result and the second result are fused, they are input into the one-dimensional convolution model to obtain the (K+1)th sub-series data.
  • the preprocessing may be channel fusion processing performed on the fused data, such as spatial average pooling (SAP), a fully connected (FC) layer, and a normalized exponential function (Softmax).
  • after the identification device uses 1D time-series convolution to process the data of each channel group, it fuses the processed data of all channel groups to obtain the second data; for example, the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data are respectively subjected to time-series convolution processing, and the time-series convolved data are then concatenated to obtain the second data.
  • formula (20) represents the result obtained by fusing the data of two adjacent channel groups.
  • the second sub-data is input into the one-dimensional convolution model in the time-series convolution model to obtain the second sub-series data;
  • the third sub-data and the second sub-series data are fused and preprocessed, and the result obtained is input into the one-dimensional convolution model in the time-series convolution model to obtain the third sub-series data;
  • the fourth sub-data and the third sub-series data are fused and preprocessed
  • the obtained result is input into the one-dimensional convolution model in the time-series convolution model to obtain the fourth sub-series data.
  • the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data are respectively subjected to time-series convolution processing, and the time-series convolved data are then concatenated to obtain the second data.
  • formula (25) represents the time series information of the feature.
  • time-series information of different scales is obtained by cascaded 1D convolutions, and the time-series information of different scales is connected in a step-wise manner to obtain a 1D convolution with a large receptive field; that is, the long-term temporal modeling (TSS, Temporal Step-Structure) module is used to process the data, as sketched below.
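  • the step-structured temporal processing of one channel group might look like the following PyTorch sketch; the gate built from spatial average pooling, a fully connected layer, and softmax follows the preprocessing described above, while the kernel sizes, the number of sub-parts, and folding the final per-part temporal convolution into the cascade are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalStepStructure(nn.Module):
    """Hypothetical sketch of the TSS module for one channel group. The input
    is assumed to have shape [N, C, T] (batch, channels, frames)."""
    def __init__(self, channels: int, n: int = 4):
        super().__init__()
        c = channels // n
        self.n, self.c = n, c
        # cascaded 1D temporal convolutions (the formulas (13)-(15) region)
        self.tconvs = nn.ModuleList(
            nn.Conv1d(c, c, kernel_size=3, padding=1) for _ in range(n - 1)
        )
        # FC layers for the SAP + FC + softmax preprocessing gate
        self.gates = nn.ModuleList(nn.Linear(c, c) for _ in range(n - 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        subs = torch.split(x, self.c, dim=1)   # divide into N sub-data
        outs = [subs[0]]                       # the 1st sub-data is kept as-is
        prev = self.tconvs[0](subs[1])         # 2nd sub-series data
        outs.append(prev)
        for k in range(2, self.n):
            fused = prev + subs[k]             # fuse Kth series with (K+1)th sub-data
            # preprocessing: spatial average pooling, FC, softmax -> gate value
            g = F.softmax(self.gates[k - 2](fused.mean(dim=2)), dim=1).unsqueeze(-1)
            mixed = g * prev + (1.0 - g) * subs[k]  # first result + second result
            prev = self.tconvs[k - 1](mixed)   # (K+1)th sub-series data
            outs.append(prev)
        # concatenating all parts yields the second data (larger receptive field)
        return torch.cat(outs, dim=1)
```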
  • Step S105 Obtain a behavior recognition result based on the first data and the second data.
  • the first data is data from which background motion noise has been eliminated from the video data
  • the second data is long-sequence data
  • the identification device accurately performs behavior recognition on the input video data based on the first data and the second data, such as judging the behavior category.
  • steps S101 to S102 and steps S103 to S104 do not have a fixed order of execution; steps S101 and S102 may be executed first and then steps S103 and S104, or steps S103 and S104 may be executed first and then steps S101 and S102.
  • the first data may be acquired first, and then the second data may be acquired, or the second data may be acquired first, and then the first data may be acquired.
  • FIG. 6 is a schematic diagram of a data processing flow with an EMT module and a TSS module added, according to an embodiment of the present disclosure. As shown in FIG. 6, an EMT module and a TSS module are added on the basis of the existing behavior recognition method, and channel dimension reduction processing is performed on the data before the EMT module processes the data.
  • the added EMT module and TSS module constitute a temporal motion model (TMM, Temporal and Motion Module), and the TMM can be embedded into an existing behavior recognition model such as a two-dimensional residual network (2D ResNet) model; as shown in Figure 6, the TMM is embedded after the three-dimensional convolution model used to obtain the data of adjacent frames of the image, and before the one-dimensional convolution model that performs channel dimension reduction processing on the first data and the second data.
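  • the overall TMM pipeline of FIG. 6 can be summarized functionally as below; whether the first data and second data are fused by concatenation or by addition is not specified in this extract, so the concatenation and all callable names are assumptions standing in for the modules sketched earlier.

```python
import torch

def tmm_forward(x, conv3d, emt, tss, conv1x1):
    """Hypothetical TMM pipeline per FIG. 6: a 3D convolution acquires the
    adjacent-frame data, EMT and TSS produce the first and second data,
    and a final convolution reduces the channel dimension."""
    feats = conv3d(x)        # acquire adjacent-frame data
    first = emt(feats)       # first data: motion-enhanced features (EMT)
    second = tss(feats)      # second data: long-range temporal features (TSS)
    fused = torch.cat([first, second], dim=1)  # fuse the two branches (assumption)
    return conv1x1(fused)    # channel dimension reduction
```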
  • the feature maps of the data after passing through the EMT module are shown in Figure 7: the first row shows the original frames of the input data, and the second and third rows show the feature maps of the data output after passing through the EMT module; the feature response in the motion area is obvious.
  • the behavior recognition method provided by the embodiment of the present disclosure can be applied to at least scenarios such as video recommendation, image recognition, and human-computer interaction.
  • FIG. 8 is a schematic structural composition diagram of the behavior recognition apparatus 200 provided by the embodiment of the present disclosure, and the device includes:
  • an acquisition unit 201 configured to acquire differential information of data of adjacent frames
  • a determining unit 202 configured to determine, based on the difference information, first data representing motion information in the data
  • a grouping unit 203 configured to perform channel grouping on the data of the adjacent frames based on the feature
  • the processing unit 204 is configured to process the data grouped by each channel in the time series dimension to obtain second data;
  • the identification unit 205 is configured to obtain a behavior identification result based on the first data and the second data.
  • the obtaining unit 201 is further configured to perform feature alignment on the data of the adjacent frames;
  • the obtaining unit 201 is further configured to obtain features of data of two adjacent frames
  • a similarity matrix is used to perform feature alignment on the features of two adjacent frames after dimension reduction.
  • the obtaining unit 201 is further configured to copy the first frame data and the second frame data in the foreground data of the adjacent frames into N copies respectively;
  • the first copy of the second frame data among the N copies is input into the two-dimensional convolution model, and the obtained first output result is differenced with the first copy of the first frame data to obtain the first difference result;
  • the two-dimensional convolution model is an N-order two-dimensional convolution model;
  • the determining unit 202 is further configured to determine the channel weight based on the differential information
  • the foreground data in the data is processed based on the channel weight to obtain the first data.
  • the determining unit 202 is further configured to add the foreground data to the product of the foreground data and the channel weight to obtain the first data.
  • the processing unit 204 is further configured to process the data grouped by each channel by using a time series convolution model
  • the processed data of all channel groups are fused to obtain the second data.
  • the processing unit 204 is further configured to use the time series convolution model to fuse data time series information of different scales in the channel dimension.
  • the processing unit 204 is further configured to divide the data of each channel group into N sub-data, and input the second sub-data into the one-dimensional convolution model in the time-series convolution model to obtain the second sub-series data;
  • K is greater than or equal to 2
  • K is less than or equal to N-1
  • after the first result and the second result are fused, they are input into the one-dimensional convolution model to obtain the (K+1)th sub-series data.
  • the processing unit 204 is further configured to perform time-series convolution processing on the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data respectively, and then concatenate the time-series convolved data to obtain the second data.
  • the processing unit 204 is further configured to acquire data of adjacent frames of the image by using a three-dimensional convolution model
  • a one-dimensional convolution model is used to perform channel dimension reduction processing on the first data and the second data.
  • each unit in the behavior recognition apparatus shown in FIG. 8 can be understood with reference to the relevant description of the behavior recognition method described above.
  • the function of each unit in the behavior recognition device shown in FIG. 8 can be realized by a program running on a processor, or can be realized by a logic circuit.
  • if the above-mentioned behavior recognition device of the embodiment of the present disclosure is implemented in the form of a software function module and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solutions of the embodiments of the present disclosure, or the part that contributes to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for enabling an electronic device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or other media that can store program code.
  • embodiments of the present disclosure are not limited to any particular combination of hardware and software.
  • an embodiment of the present disclosure further provides a computer program product, in which computer-executable instructions are stored, and when the computer-executable instructions are executed, the above-mentioned behavior identification method of the embodiment of the present disclosure can be implemented.
  • An embodiment of the present disclosure further provides a storage medium, where executable instructions are stored on the storage medium, and when the executable instructions are executed by a processor, the above-mentioned behavior identification method is implemented.
  • FIG. 9 is a schematic structural composition diagram of the electronic device according to the embodiment of the present disclosure.
  • the electronic device 300 may include one or more processors 301 (only one is shown in the figure; the processor 301 may include, but is not limited to, a processing means such as a microcontroller unit (MCU) or a field programmable gate array (FPGA)), a memory 302 for storing data, and a transmission means 303 for communication functions.
  • FIG. 9 is only a schematic diagram, which does not limit the structure of the above-mentioned electronic device.
  • the electronic device 300 may also include more or fewer components than shown in FIG. 9 , or have a different configuration than that shown in FIG. 9 .
  • the memory 302 can be used to store software programs and modules of application software, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure, and the processor 301 executes various functional applications by running the software programs and modules stored in the memory 302 And data processing, that is, to realize the above method.
  • Memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • memory 302 may further include memory located remotely from processor 301, which may be connected to electronic device 300 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the memory 302 may be either volatile memory or non-volatile memory, and may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory.
  • the volatile memory may be a random access memory (RAM), which serves as an external cache; by way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
  • the memory 302 in the embodiment of the present disclosure is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program, such as an application, used to operate on a communication device. A program implementing the method of the embodiment of the present disclosure may be included in an application program.
  • Transmission means 303 is used to receive or transmit data via a network.
  • the aforementioned network may include a wireless network provided by a communication provider of the electronic device 300 .
  • the transmission device 303 includes a network adapter (NIC, Network Interface Controller), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 303 may be a radio frequency (RF, Radio Frequency) module, which is used to communicate with the Internet in a wireless manner.
  • the disclosed method and smart device may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
  • the unit described above as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place or distributed to multiple network units;
  • the purpose of the solution in this embodiment can be achieved by selecting any unit or all of the units according to actual needs.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may be used separately as a unit, or two or more units may be integrated into one unit;
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the embodiment of the present disclosure obtains difference information of the data of adjacent frames; determines, based on the difference information, the first data representing motion information in the data; performs channel grouping on the data of the adjacent frames based on features; processes the data of each channel group in the time-series dimension to obtain second data; and obtains a behavior recognition result based on the first data and the second data.
  • in this way, the difference information of the data of adjacent frames can be obtained and the background noise in the image can be eliminated, and processing each channel group in the time-series dimension realizes a larger temporal receptive field; therefore, the behavior recognition method provided by the embodiments of the present disclosure can improve the accuracy of behavior recognition in both the spatial and temporal dimensions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a behavior recognition method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring differential information of data of adjacent frames; on the basis of the differential information, determining first data in the data that represents motion information; performing channel grouping on the data of the adjacent frames on the basis of a feature; processing data of each channel group in the dimension of a time sequence, so as to obtain second data; and obtaining a behavior recognition result on the basis of the first data and the second data.

Description

Behavior Recognition Method and Device, Electronic Device, and Storage Medium
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure is based on, and claims priority to, the Chinese patent application with application number 202110198255.3, filed on February 22, 2021 and entitled "Behavior Recognition Method and Device, Electronic Equipment and Storage Medium"; the entire contents of the above-mentioned Chinese patent application are hereby incorporated into the present disclosure by reference.
TECHNICAL FIELD
The present disclosure relates to the technical field of computer vision, and in particular to a behavior recognition method and device, an electronic device, and a storage medium.
BACKGROUND
In the field of computer vision technology, behavior recognition based on video data plays an extremely important role in many fields such as video recommendation, image recognition, and human-computer interaction; therefore, the accuracy of recognizing behaviors in images is particularly important, and high-accuracy behavior recognition is a constant goal of computer vision technology.
SUMMARY OF THE INVENTION
In order to solve the above technical problems, embodiments of the present disclosure provide a behavior recognition method and apparatus, an electronic device, and a storage medium.
The behavior recognition method provided by the embodiments of the present disclosure includes:
acquiring difference information of data of adjacent frames;
determining, based on the difference information, first data representing motion information in the data;
performing channel grouping on the data of the adjacent frames based on features;
processing the data of each channel group in the time-series dimension to obtain second data; and
obtaining a behavior recognition result based on the first data and the second data.
In some embodiments, the acquiring difference information of data of adjacent frames includes:
performing feature alignment on the data of the adjacent frames;
obtaining foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure; and
performing differential processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.
In some embodiments, the performing feature alignment on the data of the adjacent frames includes:
acquiring features of the data of two adjacent frames;
performing dimension reduction processing in the channel dimension on the feature of each of the two adjacent frames; and
using a similarity matrix to perform feature alignment on the features of the two adjacent frames after dimension reduction.
In some embodiments, the obtaining foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure includes:
copying each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies;
inputting the first copy of the second frame data among the N copies into a two-dimensional convolution model to obtain a first output result, and differencing the first output result with the first copy of the first frame data to obtain a first difference result, the two-dimensional convolution model being an N-order two-dimensional convolution model;
adding the first output result to the (M+1)th copy of the second frame data, inputting the sum into the two-dimensional convolution model to obtain an (M+1)th output result, and differencing the (M+1)th output result with the Mth copy of the first frame data to obtain an (M+1)th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; and
concatenating the first to Nth difference results and inputting them into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
In some embodiments, the determining, based on the difference information, first data representing motion information in the data includes:
determining a channel weight based on the difference information; and
processing foreground data in the data based on the channel weight to obtain the first data.
In some embodiments, the processing foreground data in the data based on the channel weight to obtain the first data includes:
multiplying the foreground data by the channel weight to obtain a product result, and adding the product result to the foreground data to obtain the first data.
In some embodiments, the processing the data of each channel group in the time-series dimension to obtain second data includes:
processing the data of each channel group by using a time-series convolution model; and
fusing the processed data of all channel groups to obtain the second data.
In some embodiments, the processing the data of each channel group by using a time-series convolution model includes:
using the time-series convolution model to fuse data time-series information of different scales in the channel dimension.
In some embodiments, the using the time-series convolution model to fuse data time-series information of different scales in the channel dimension includes:
dividing the data of each channel group into N sub-data, and inputting the second sub-data among the N sub-data into a one-dimensional convolution model in the time-series convolution model to obtain second sub-series data;
fusing the Kth sub-series data with the (K+1)th sub-data, and performing preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1;
computing a first result by point-multiplying the value obtained by the preprocessing with the Kth sub-series data, and computing a second result by point-multiplying the difference between 1 and the value obtained by the preprocessing with the (K+1)th sub-data; and
after fusing the first result and the second result, inputting them into the one-dimensional convolution model to obtain (K+1)th sub-series data.
In some embodiments, the fusing the processed data of all channel groups to obtain the second data includes:
performing time-series convolution processing on the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data respectively, and then concatenating the time-series convolved data to obtain the second data.
In some embodiments, before the acquiring difference information of the data of adjacent frames of the image, the method further includes: acquiring the data of the adjacent frames of the image by using a three-dimensional convolution model;
and/or, after the second data is obtained, the method further includes: performing channel dimension reduction processing on the first data and the second data.
In some embodiments, the behavior recognition method is implemented by a temporal motion model, and the temporal motion model includes an enhanced motion transformation module and a long-term temporal modeling module;
the enhanced motion transformation module is configured to acquire the difference information of the data of adjacent frames of the image, and determine, based on the difference information, the first data representing motion information in the data;
the long-term temporal modeling module is configured to perform channel grouping on the data of the adjacent frames based on features, and process the data of each channel group in the time-series dimension to obtain the second data.
In some embodiments, the temporal motion model is embedded after a three-dimensional convolution model, and the three-dimensional convolution model is used to acquire the data of the adjacent frames of the image;
and/or, the temporal motion model is embedded before a one-dimensional convolution model, and the one-dimensional convolution model is used to perform channel dimension reduction processing on the first data and the second data.
The behavior recognition device provided by the embodiments of the present disclosure includes:
an acquisition unit configured to acquire difference information of data of adjacent frames;
a determining unit configured to determine, based on the difference information, first data representing motion information in the data;
a grouping unit configured to perform channel grouping on the data of the adjacent frames based on features;
a processing unit configured to process the data of each channel group in the time-series dimension to obtain second data; and
an identification unit configured to obtain a behavior recognition result based on the first data and the second data.
In some embodiments, the acquisition unit is further configured to: perform feature alignment on the data of the adjacent frames; obtain foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure; and perform difference processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.

In some embodiments, the acquisition unit is further configured to: acquire features of the data of two adjacent frames; perform dimension reduction in the channel dimension on the features of each frame; and use a similarity matrix to align the dimension-reduced features of the two adjacent frames.

In some embodiments, the acquisition unit is further configured to: copy each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies; input the first copy of the second frame data among the N copies into a two-dimensional convolution model, and take the difference between the resulting first output result and the first copy of the first frame data to obtain a first difference result, the two-dimensional convolution model being an N-order two-dimensional convolution model; add the first output result to the (M+1)-th copy of the second frame data, input the sum into the two-dimensional convolution model, and take the difference between the resulting (M+1)-th output result and the M-th copy of the first frame data to obtain an (M+1)-th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; and concatenate the first difference result through the N-th difference result and input the concatenation into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.

In some embodiments, the determination unit is further configured to: determine channel weights based on the difference information; and process the foreground data in the data based on the channel weights to obtain the first data.

In some embodiments, the determination unit is further configured to add the foreground data to the product of the foreground data and the channel weights to obtain the first data.
In some embodiments, the processing unit is further configured to: process the data of each channel group using a temporal convolution model; and fuse the processed data of all channel groups to obtain the second data.

In some embodiments, the processing unit is further configured to use the temporal convolution model to fuse temporal information of data at different scales in the channel dimension.

In some embodiments, the processing unit is further configured to: divide the data of each channel group into N pieces of sub-data, and input the second piece of sub-data into a one-dimensional convolution model in the temporal convolution model to obtain second temporal sub-data; fuse the K-th temporal sub-data with the (K+1)-th piece of sub-data and perform preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; compute a first result by element-wise multiplication of the preprocessed value with the K-th temporal sub-data, and a second result by element-wise multiplication of the difference between 1 and the preprocessed value with the (K+1)-th piece of sub-data; and fuse the first result with the second result and input the fusion into the one-dimensional convolution model to obtain the (K+1)-th temporal sub-data.

In some embodiments, the processing unit is further configured to apply temporal convolution to the first piece of sub-data, the second temporal sub-data, the third temporal sub-data and so on up to the N-th temporal sub-data, and then concatenate the temporally convolved data to obtain the second data.

In some embodiments, the processing unit is further configured to acquire the data of the adjacent frames of the image using a three-dimensional convolution model; and/or, after the second data is obtained, perform channel dimension reduction on the first data and the second data using a one-dimensional convolution model.
A computer program product provided by an embodiment of the present disclosure includes computer-executable instructions which, when executed, implement the behavior recognition method described above.

A storage medium provided by an embodiment of the present disclosure stores executable instructions which, when executed by a processor, implement the behavior recognition method described above.

An electronic device provided by an embodiment of the present disclosure includes a memory and a processor, where the memory stores computer-executable instructions, and the processor implements the behavior recognition method described above when running the computer-executable instructions stored on the memory.
In the behavior recognition method provided by the embodiments of the present disclosure, difference information of data of adjacent frames is acquired; based on the difference information, first data representing motion information in the data is determined; the data of the adjacent frames is grouped by channel based on features; the data of each channel group is processed in the temporal dimension to obtain second data; and a behavior recognition result is obtained based on the first data and the second data. In this way, acquiring the difference information of the data of adjacent frames eliminates background noise in the image, and processing each channel group in the temporal dimension achieves a larger temporal receptive field; the behavior recognition method provided by the embodiments of the present disclosure can therefore improve the accuracy of behavior recognition in both the spatial and temporal dimensions.
To make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of the Drawings

FIG. 1 is a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart of an optional process by which the recognition apparatus acquires difference information of data of adjacent frames according to an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of an optional process by which the recognition apparatus performs feature alignment on the data of the adjacent frames according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of dimension reduction performed on features of frames at different times according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of cascading features using a step-cascade structure according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a data processing flow with an EMT module and a TSS module added according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of feature maps of data after passing through the EMT module according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a behavior recognition apparatus provided by an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Before describing the embodiments of the present disclosure in detail, behavior recognition is briefly explained.

With the advancement of foundational technologies such as cloud computing and 5G (5th Generation Mobile Communication Technology), video data has grown explosively in recent years. To fully exploit the value of video data, more and more researchers have entered the field of video understanding. As a fundamental task of video understanding, behavior recognition is widely needed in scenarios such as image recognition, human-computer interaction and personalized recommendation. Video-based behavior recognition determines the behavior category of a given video containing an action. The accuracy of this judgment is an important indicator of behavior recognition, and motion modeling and temporal modeling based on video data are key factors affecting the recognition result.

Motion modeling based on video data most commonly models the motion information between adjacent frames using optical flow features. Typically, two-stream action recognition methods extract motion features through optical flow modeling. However, the extraction and modeling of optical flow are computationally expensive and difficult to apply in scenarios with high real-time requirements. Since an important role of optical flow is to highlight moving objects when describing the motion relationship between adjacent frames, an alternative is to approximate optical flow using the feature differences between frames. However, this approximation yields edge motion information for both moving and non-moving objects; since non-moving objects belong to the background of the image, their edge motion information is noise, and the edge motion information of the stationary parts of a moving object is likewise interference noise.

There are two main schemes for temporal modeling based on video data. The first adopts a two-dimensional (2D) convolutional neural network (CNN, Convolutional Neural Networks) followed by an inter-frame aggregator, which generally uses operations such as avg/max pooling, three-dimensional (3D) convolution or a recurrent neural network (RNN, Recurrent Neural Network). This scheme simply performs frame-level score fusion or frame-level high-level feature fusion, without considering the aggregation of temporal information at the feature level. The second scheme adopts 3D convolution to aggregate temporal relationships at the feature level; because 3D convolution has many parameters and a large computational cost, it is decoupled into 2D+1D convolution, where the 2D convolution models spatial information and the 1D convolution is responsible only for temporal modeling. However, both 3D and (2D+1D) convolutions model only the temporal relationships within a local window; long-range temporal relationships rely on vertically stacked convolution blocks, and such a vertical structure is difficult to optimize for shallow temporal convolutions.
Based on this, the embodiments of the present disclosure propose a behavior recognition method. Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present application.

Meanwhile, it should be understood that, for convenience of description, the dimensions of the various parts shown in the accompanying drawings are not drawn to actual scale.

The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or uses.

Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods and devices should be considered part of the specification.

It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be discussed further in subsequent figures.

A behavior recognition method involved in the embodiments of the present disclosure is described in detail below. The execution subject of the behavior recognition method provided by the embodiments of the present disclosure may be a behavior recognition apparatus, such as a terminal device or a server, or any other processing device with data processing capability, which is not limited in the embodiments of the present disclosure.
Referring to FIG. 1, a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure includes at least the following steps:

Step S101: acquire difference information of data of adjacent frames.

In some embodiments, the behavior recognition apparatus (hereinafter referred to as the recognition apparatus) acquires difference information of data of adjacent frames, where the adjacent frames may be two adjacent frames in video data.

In some embodiments, an optional process by which the recognition apparatus acquires the difference information of the data of adjacent frames, as shown in FIG. 2, includes at least the following steps:

Step S1a: perform feature alignment on the data of the adjacent frames.

In some embodiments, an optional process by which the recognition apparatus performs feature alignment on the data of the adjacent frames, as shown in FIG. 3, includes at least the following steps:

Step S1a1: acquire features of the data of two adjacent frames.
In some embodiments, the shape of the features of the data may be $X \in \mathbb{R}^{N \times F \times C \times H \times W}$, where N is the batch size, F is the number of frames, C is the number of channels, H is the height of a single-frame image, and W is the width of a single-frame image.
Step S1a2: perform dimension reduction in the channel dimension on the features of each of the two adjacent frames.
In some embodiments, the recognition apparatus separates the input feature X at the frame level to obtain per-frame features $X_t \in \mathbb{R}^{[C,\,H,\,W]}$. To reduce the amount of computation, the recognition apparatus performs dimension reduction in the channel dimension on the features of each frame, compressing the feature channels with a 1×1 convolution, as shown in formulas (1) and (2) below. Formula (1) gives the result of channel-dimension reduction for the features of the frame at time t shown in FIG. 4, and formula (2) gives the result for the frame at time t+1.
$x_t = \mathrm{Conv1D}(X_t),\quad x_t \in \mathbb{R}^{[C/l,\,H,\,W]}$    (1)

$x_{t+1} = \mathrm{Conv1D}(X_{t+1}),\quad x_{t+1} \in \mathbb{R}^{[C/l,\,H,\,W]}$    (2)
where l is the channel compression rate; the value of l can be set flexibly according to the actual application scenario, for example to 16.
Step S1a3: use the similarity matrix to align the dimension-reduced features of the two adjacent frames.

In some embodiments, the recognition apparatus uses the similarity matrix to warp and align the adjacent frames, as shown in formulas (3) and (4) below.
[Formulas (3) and (4), given as images in the source, construct the inter-frame similarity matrix from $x_t$ and $x_{t+1}$ and apply it to obtain the aligned feature $A(x_{t+1})$.]
where the r() function is used to transform the size and shape of features.
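For illustration only, the following is a minimal PyTorch-style sketch of this alignment step. Since formulas (3) and (4) are not reproduced above, the exact form of the similarity matrix is an assumption: this sketch uses a softmax-normalized affinity over spatial positions, and the module name FrameAlign, the 3×3-free 1×1 compression and the default compression rate l=16 are likewise illustrative, not fixed by the source.

```python
import torch
import torch.nn as nn


class FrameAlign(nn.Module):
    # Sketch: compress channels with a 1x1 convolution (formulas (1)-(2)),
    # then warp frame t+1 toward frame t with a similarity matrix
    # (the assumed reading of formulas (3)-(4)).
    def __init__(self, channels: int, l: int = 16):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // l, kernel_size=1)

    def forward(self, x_t: torch.Tensor, x_t1: torch.Tensor) -> torch.Tensor:
        # x_t, x_t1: [N, C, H, W] features of two adjacent frames
        a = self.reduce(x_t)                 # [N, C/l, H, W]
        b = self.reduce(x_t1)                # [N, C/l, H, W]
        n, c, h, w = a.shape
        a = a.flatten(2)                     # r(): reshape to [N, C/l, H*W]
        b = b.flatten(2)
        # Assumed similarity matrix: pairwise affinity over spatial positions.
        sim = torch.softmax(a.transpose(1, 2) @ b, dim=-1)    # [N, H*W, H*W]
        aligned = (b @ sim.transpose(1, 2)).view(n, c, h, w)  # A(x_{t+1})
        return aligned
```

In this sketch the r() reshaping is realized by flatten/view, and the returned tensor plays the role of the aligned feature $A(x_{t+1})$ used in formulas (5) to (8) below.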
Step S1b: obtain foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure.

In some embodiments, the recognition apparatus uses a group of 2D convolutions in a ladder structure to extract the data representing motion information. The group may contain, for example, 2, 4 or 6 2D convolutions. Taking a group of 4 2D convolutions as an example below, the features are divided into 4 parts by channel, and the aligned features are convolved using formulas (5) to (8) below to obtain the foreground data of the adjacent frames from the feature-aligned data, yielding multi-scale motion information.
$m_{s=0} = \mathrm{Conv2D}(r(A(x_{t+1})))$    (5)

$m_{s=1} = \mathrm{Conv2D}(m_{s=0} + r(A(x_{t+1})))$    (6)

$m_{s=2} = \mathrm{Conv2D}(m_{s=1} + r(A(x_{t+1})))$    (7)

$m_{s=3} = \mathrm{Conv2D}(m_{s=2} + r(A(x_{t+1})))$    (8)
Through formulas (5) to (8), the foreground data of the adjacent frames, i.e., the motion information, can be extracted, and the background data in the adjacent frames can be filtered out and removed.

For example, the right part of FIG. 4 is a schematic flowchart of the multi-scale feature difference module (MSFD, Multi Scale Feature Difference). As shown in FIG. 4, each of the first frame data and the second frame data in the foreground data of the adjacent frames is copied (expanded) into N copies (N is 4 in FIG. 4); the first copy of the second frame data among the N copies is input into a two-dimensional convolution model, and the difference between the resulting first output result and the first copy of the first frame data is taken to obtain a first difference result; the two-dimensional convolution model is an N-order two-dimensional convolution model; the first output result is added to the (M+1)-th copy of the second frame data and the sum is input into the two-dimensional convolution model, and the difference between the resulting (M+1)-th output result and the M-th copy of the first frame data is taken to obtain the (M+1)-th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; the first difference result through the N-th difference result are concatenated and summed, and then input into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.

In the embodiments of the present disclosure, a group of 2D convolutions captures different motion change information, so that the subsequent motion difference information characterizes and identifies behaviors more accurately.
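The ladder of formulas (5) to (8) can be sketched as follows; this is an illustrative reading, and the 3×3 kernel size, the padding and the module name are assumptions not fixed by the source.

```python
import torch
import torch.nn as nn


class LadderConv2D(nn.Module):
    # Sketch of the ladder of N 2D convolutions in formulas (5)-(8):
    # each stage re-injects the aligned feature r(A(x_{t+1})) before convolving.
    def __init__(self, channels: int, n_stages: int = 4):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(n_stages)]
        )

    def forward(self, aligned: torch.Tensor) -> list:
        # aligned: r(A(x_{t+1})), shape [N, C, H, W]
        outs = [self.stages[0](aligned)]          # m_{s=0}, formula (5)
        for conv in self.stages[1:]:
            outs.append(conv(outs[-1] + aligned)) # m_{s=k}, formulas (6)-(8)
        return outs                               # multi-scale motion features
```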
Step S1c: perform difference processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.

In some embodiments, the recognition apparatus performs difference processing on the foreground data using formulas (9) and (10) below to obtain the difference information of the foreground data of the adjacent frames.

In the embodiments of the present disclosure, acquiring the difference information of the foreground data of adjacent frames eliminates background motion noise.

Taking FIG. 4 as an example, the data at time t=i is copied into 4 copies, namely the 1st, 2nd, 3rd and 4th copies, and the data at time t=i+1 is copied into 4 copies, namely the 5th, 6th, 7th and 8th copies. The result of applying 2D convolution to the 5th copy is fused with the 6th copy; the result of applying 2D convolution to the fusion of the 5th and 6th copies is fused with the 7th copy; the result of applying 2D convolution to that fusion is in turn fused with the 8th copy. Each fusion result is differenced against the 1st through 4th copies respectively, and the four difference results constitute the difference information.
$M_{diff} = (m_{s=0} - x_t) + (m_{s=1} - x_t) + (m_{s=2} - x_t) + (m_{s=3} - x_t)$    (9)

$M_{out} = \mathrm{Conv1D}(M_{diff}),\quad M_{out} \in \mathbb{R}^{[C,\,H,\,W]}$    (10)
where $M_{out}$ denotes the difference information of the foreground data of the adjacent frames.
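A minimal sketch of formulas (9) and (10) follows. The source writes the channel-restoring operation as Conv1D; realizing it as a 1×1 2D convolution over the [C/l, H, W] feature map is an assumption of this sketch, as is the helper name.

```python
import torch
import torch.nn as nn


def motion_difference(ms: list, x_t: torch.Tensor, restore: nn.Module) -> torch.Tensor:
    # ms: ladder outputs m_{s=0..3}; x_t: compressed frame-t feature [N, C/l, H, W]
    # Formula (9): M_diff = sum over stages of (m_s - x_t)
    m_diff = sum(m - x_t for m in ms)
    # Formula (10): restore the channel dimension to C ("Conv1D" in the source;
    # assumed here to be a 1x1 convolution)
    return restore(m_diff)  # M_out in R^[C, H, W]


# Example wiring (hypothetical sizes): restore = nn.Conv2d(C // l, C, kernel_size=1)
```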
Step S102: based on the difference information, determine the first data representing motion information in the data.

In some embodiments, channel weights are determined based on the difference information of the foreground data; that is, the difference information of the foreground data serves as the channel weights, and the foreground data is processed using the channel weights to obtain the first data representing motion information in the data.

Here, processing the foreground data using the channel weights may consist of enhancing the foreground data with the channel weights to obtain the first data representing motion information in the data.

In other embodiments, after the difference information of the foreground data of the adjacent frames is acquired, the difference information may also be used directly as the first data representing motion information.

In some embodiments, the channel weights are determined by formula (11) below, and the foreground data is enhanced by formula (12) below; that is, the foreground data is multiplied by the channel weights to obtain a product result, and the product result is added to the foreground data to obtain the first data.
$W = \mathrm{sigmoid}(\mathrm{AvgPooling}(M_{out})) \in \mathbb{R}^{[C,\,1,\,1]}$    (11)

$\mathrm{Enhanced}(X_t) = X_t + X_t \odot W \in \mathbb{R}^{[C,\,H,\,W]}$    (12)
Thus, in the embodiments of the present disclosure, the inter-frame similarity matrix is used to align features between frames, eliminating as far as possible the interference caused by background jitter. Meanwhile, considering the diversity of motion information, the embodiments of the present disclosure use a group of ladder 2D convolutions to extract data at different scales and then perform difference processing on the data at different scales, which eliminates the background noise in the image and yields motion saliency information at different scales; finally, the motion saliency information is used to enhance the motion change regions. That is, an enhanced motion transformer (EMT, Enhanced Motion Transformer) module is added to process the data.
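Formulas (11) and (12) amount to a channel-attention style enhancement, which can be sketched as follows; realizing AvgPooling as adaptive average pooling is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F


def enhance(x_t: torch.Tensor, m_out: torch.Tensor) -> torch.Tensor:
    # Formula (11): channel weights from global average pooling + sigmoid.
    w = torch.sigmoid(F.adaptive_avg_pool2d(m_out, 1))  # [N, C, 1, 1]
    # Formula (12): residual enhancement of the motion regions.
    return x_t + x_t * w
```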
Step S103: perform channel grouping on the data of the adjacent frames based on features.

In some embodiments, the recognition apparatus processes the data of each channel group using a temporal convolution model, and fuses the processed data of all channel groups to obtain the second data.

In some embodiments, to reduce the amount of computation, the recognition apparatus performs channel grouping on the input features $X \in \mathbb{R}^{NF \times C \times H \times W}$ to obtain $X_{g=i} \in \mathbb{R}^{[NF,\,C/4,\,H,\,W]}$.

Step S104: process the data of each channel group in the temporal dimension to obtain second data.

In some embodiments, the recognition apparatus uses the temporal convolution model to fuse temporal information of data at different scales in the channel dimension; for example, the recognition apparatus may process the data of each channel group with a 1D temporal convolution, as shown in formulas (13) to (15) below:
$r\_out = \mathrm{reshape}(X) \in \mathbb{R}^{[NHW,\,C,\,F]}$    (13)

$r\_out = \mathrm{Conv1D}(r\_out) \in \mathbb{R}^{[NHW,\,C,\,F]}$    (14)

$out = \mathrm{reshape}(r\_out) \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (15)
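The reshape-convolve-reshape scheme of formulas (13) to (15) can be sketched as follows; the helper name, the explicit n/f arguments and the kernel size of the 1D convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn


def temporal_conv(x: torch.Tensor, n: int, f: int, conv1d: nn.Conv1d) -> torch.Tensor:
    # x: [N*F, C, H, W] -> formula (13): reshape to [N*H*W, C, F]
    nf, c, h, w = x.shape
    r = x.view(n, f, c, h, w).permute(0, 3, 4, 2, 1).reshape(n * h * w, c, f)
    r = conv1d(r)  # formula (14): 1D convolution over the F (time) axis
    # formula (15): reshape back to [N*F, C, H, W]
    return r.view(n, h, w, c, f).permute(0, 4, 3, 1, 2).reshape(nf, c, h, w)


# Example wiring (hypothetical sizes): conv1d = nn.Conv1d(C, C, kernel_size=3, padding=1)
```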
The data of each channel group is divided into N pieces of sub-data, and the second piece of sub-data is input into the one-dimensional convolution model in the temporal convolution model to obtain second temporal sub-data. The K-th temporal sub-data is fused with the (K+1)-th piece of sub-data, and preprocessing is performed on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1. A first result is computed as the element-wise product of the preprocessed value and the K-th temporal sub-data, and a second result is computed as the element-wise product of the difference between 1 and the preprocessed value and the (K+1)-th piece of sub-data. The first result is fused with the second result and input into the one-dimensional convolution model to obtain the (K+1)-th temporal sub-data. The preprocessing may be channel fusion processing performed on the fused data, such as spatial average pooling (SAP, Spatial Average Pooling), a fully connected layer (FC, Fully Connected) and a normalized exponential function (Softmax).

After the recognition apparatus processes the data of each channel group with the 1D temporal convolution, it fuses the processed data of all channel groups to obtain the second data. For example, temporal convolution is applied to the first piece of sub-data, the second temporal sub-data, the third temporal sub-data and so on up to the N-th temporal sub-data, and the temporally convolved data is then concatenated to obtain the second data.

For example, if the results of processing the data of two adjacent channel groups with the 1D temporal convolution are A and B, the fusion of the channel group data proceeds as shown in formulas (16) to (20) below:
$A \in \mathbb{R}^{[NF,\,C,\,H,\,W]},\quad B \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (16)

$C = A \oplus B \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (17)

$C = \mathrm{AvgPooling}(C) \in \mathbb{R}^{[NF,\,C,\,1,\,1]}$    (18)

$C_a = \mathrm{Softmax}(\mathrm{FC}(C)) \in \mathbb{R}^{[NF,\,C,\,1,\,1]}$    (19)

$out = C_a \odot A \oplus (1 - C_a) \odot B \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (20)
where formula (20) gives the result of fusing the data of two adjacent channel groups. The right part of FIG. 5 is a schematic flowchart of the channel selection module (CS, Channel Selection). As shown in FIG. 5, the data corresponding to $X_{g=0}$ is the first piece of sub-data and the data corresponding to $X_{g=1}$ is the second piece; the second piece is input into the one-dimensional convolution model in the temporal convolution model to obtain the second temporal sub-data. The data corresponding to $X_{g=2}$ is the third piece of sub-data; the third piece and the second temporal sub-data are fused and preprocessed, and the result is input into the one-dimensional convolution model to obtain the third temporal sub-data. The data corresponding to $X_{g=3}$ is the fourth piece of sub-data; the fourth piece and the third temporal sub-data are fused and preprocessed, and the result is input into the one-dimensional convolution model to obtain the fourth temporal sub-data.
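A minimal sketch of the CS module of formulas (16) to (20) follows; realizing FC as a single linear layer and AvgPooling as adaptive average pooling are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelSelect(nn.Module):
    # Sketch of the CS module, formulas (16)-(20): fuse two group features A and B
    # with learned channel attention weights C_a.
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        c = a + b                                   # formula (17)
        c = F.adaptive_avg_pool2d(c, 1).flatten(1)  # formula (18): [NF, C]
        ca = torch.softmax(self.fc(c), dim=1)       # formula (19)
        ca = ca.unsqueeze(-1).unsqueeze(-1)         # [NF, C, 1, 1]
        return ca * a + (1 - ca) * b                # formula (20)
```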
The recognition apparatus then uses a step-cascade structure to cascade the features $X_{g=0}$, $X_{g=1}$, $X_{g=2}$ and $X_{g=3}$ shown in FIG. 5 to obtain the second data; for example, temporal convolution is applied to the first piece of sub-data, the second temporal sub-data, the third temporal sub-data and so on up to the N-th temporal sub-data, and the temporally convolved data is then concatenated to obtain the second data.

The cascading process is shown in formulas (21) to (25) below:
$O_0 = X_{g=0}$    (21)

$O_1 = \mathrm{TemporalConv}(X_{g=1})$    (22)

$O_2 = \mathrm{TemporalConv}(\mathrm{CS}(O_1,\,X_{g=2}))$    (23)

$O_3 = \mathrm{TemporalConv}(\mathrm{CS}(O_2,\,X_{g=3}))$    (24)

$out = \mathrm{cat}[O_0,\,O_1,\,O_2,\,O_3],\quad out \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (25)
where formula (25) characterizes the temporal information of the features.

In the embodiments of the present disclosure, temporal information at different scales is obtained through step-cascaded 1D convolutions, and the different-scale temporal information is connected in steps to obtain a 1D convolution with a large receptive field; that is, a long-range temporal modeling (TSS, Temporal Step-Structure) module is added to process the data.
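The step-cascade of formulas (21) to (25) can be sketched as follows, reusing the temporal convolution of formulas (13) to (15) and the CS module of formulas (16) to (20); the function and argument names are illustrative.

```python
import torch


def tss_forward(groups: list, temporal_convs: list, cs_modules: list) -> torch.Tensor:
    # groups: [X_{g=0..3}], each [NF, C/4, H, W]; sketch of formulas (21)-(25).
    o0 = groups[0]                                         # formula (21)
    o1 = temporal_convs[0](groups[1])                      # formula (22)
    o2 = temporal_convs[1](cs_modules[0](o1, groups[2]))   # formula (23)
    o3 = temporal_convs[2](cs_modules[1](o2, groups[3]))   # formula (24)
    return torch.cat([o0, o1, o2, o3], dim=1)              # formula (25)
```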
Step S105: obtain a behavior recognition result based on the first data and the second data.

In some embodiments, the first data is video data from which background motion noise has been eliminated, and the second data is long-range temporal data; based on the first data and the second data, the recognition apparatus accurately performs behavior recognition on the input video data, for example judging the behavior category.

In the embodiments of the present disclosure, there is no fixed execution order between steps S101-S102 and steps S103-S104: steps S101 and S102 may be executed first, followed by steps S103 and S104, or steps S103 and S104 may be executed first, followed by steps S101 and S102. In other words, the embodiments of the present disclosure may acquire the first data first and then the second data, or acquire the second data first and then the first data.
FIG. 6 is a schematic diagram of the data processing flow with the EMT module and the TSS module added according to an embodiment of the present disclosure. As shown in FIG. 6, the EMT module and the TSS module are added on the basis of an existing behavior recognition method. For example, before the data processing of the EMT module, a three-dimensional convolution model is used to acquire the data of the adjacent frames of the image; after the data processing of the TSS module, a one-dimensional convolution model is used to perform channel dimension reduction on the first data and the second data. In the embodiments of the present disclosure, the added EMT module and TSS module constitute a temporal and motion module (TMM, Temporal and Motion Module). The TMM can be embedded into an existing behavior recognition model such as a two-dimensional residual network (2D ResNet, 2 Dimension Residual Network) model; as shown in FIG. 6, the TMM is embedded after the three-dimensional convolution model used to acquire the data of the adjacent frames of the image and before the one-dimensional convolution model that performs channel dimension reduction on the first data and the second data. Through the TMM, motion saliency enhancement and long-range temporal modeling of the data can be achieved. FIG. 7 shows the feature maps of data after passing through the EMT module: the first row is the original frames of the input data, and the second and third rows are the feature maps of the data output by the EMT module; after the EMT module, the feature maps of the motion regions are distinct.
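As a rough illustration of how the TMM might sit inside a residual block, consider the following sketch; the surrounding 1×1 convolutions, the channel sizes and the residual wiring are hypothetical and do not claim to reproduce the exact integration shown in FIG. 6.

```python
import torch
import torch.nn as nn


class TMMBlock(nn.Module):
    # Hypothetical sketch: a pre-convolution reduces channels, the EMT module
    # enhances motion saliency, the TSS module models long-range timing, and a
    # 1x1 convolution performs the channel dimension reduction mentioned above.
    def __init__(self, in_ch: int, mid_ch: int, emt: nn.Module, tss: nn.Module):
        super().__init__()
        self.pre = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.emt, self.tss = emt, tss
        self.post = nn.Conv2d(mid_ch, in_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.tss(self.emt(self.pre(x)))
        return x + self.post(y)  # residual connection, as in a 2D ResNet block
```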
Based on the above description of the behavior recognition method provided by the embodiments of the present disclosure, the method can be applied at least to scenarios such as video recommendation, image recognition and human-computer interaction.
To implement the above behavior recognition method provided by the embodiments of the present disclosure, an embodiment of the present disclosure further provides a behavior recognition apparatus. FIG. 8 is a schematic structural diagram of the behavior recognition apparatus 200 provided by an embodiment of the present disclosure; the apparatus includes:

an acquisition unit 201, configured to acquire difference information of data of adjacent frames;

a determination unit 202, configured to determine, based on the difference information, first data representing motion information in the data;

a grouping unit 203, configured to perform channel grouping on the data of the adjacent frames based on features;

a processing unit 204, configured to process the data of each channel group in the temporal dimension to obtain second data;

a recognition unit 205, configured to obtain a behavior recognition result based on the first data and the second data.
In some optional implementations, the acquisition unit 201 is further configured to: perform feature alignment on the data of the adjacent frames; obtain foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure; and perform difference processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.

In some optional implementations, the acquisition unit 201 is further configured to: acquire features of the data of two adjacent frames; perform dimension reduction in the channel dimension on the features of each frame; and use a similarity matrix to align the dimension-reduced features of the two adjacent frames.

In some optional implementations, the acquisition unit 201 is further configured to: copy each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies; input the first copy of the second frame data among the N copies into a two-dimensional convolution model, and take the difference between the resulting first output result and the first copy of the first frame data to obtain a first difference result, the two-dimensional convolution model being an N-order two-dimensional convolution model; add the first output result to the (M+1)-th copy of the second frame data, input the sum into the two-dimensional convolution model, and take the difference between the resulting (M+1)-th output result and the M-th copy of the first frame data to obtain an (M+1)-th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; and concatenate the first difference result through the N-th difference result and input the concatenation into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.

In some optional implementations, the determination unit 202 is further configured to: determine channel weights based on the difference information; and process the foreground data in the data based on the channel weights to obtain the first data.

In some optional implementations, the determination unit 202 is further configured to add the foreground data to the product of the foreground data and the channel weights to obtain the first data.
In some optional implementations, the processing unit 204 is further configured to: process the data of each channel group using a temporal convolution model; and fuse the processed data of all channel groups to obtain the second data.

In some optional implementations, the processing unit 204 is further configured to use the temporal convolution model to fuse temporal information of data at different scales in the channel dimension.

In some optional implementations, the processing unit 204 is further configured to: divide the data of each channel group into N pieces of sub-data, and input the second piece of sub-data into a one-dimensional convolution model in the temporal convolution model to obtain second temporal sub-data; fuse the K-th temporal sub-data with the (K+1)-th piece of sub-data and perform preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; compute a first result by element-wise multiplication of the preprocessed value with the K-th temporal sub-data, and a second result by element-wise multiplication of the difference between 1 and the preprocessed value with the (K+1)-th piece of sub-data; and fuse the first result with the second result and input the fusion into the one-dimensional convolution model to obtain the (K+1)-th temporal sub-data.

In some optional implementations, the processing unit 204 is further configured to apply temporal convolution to the first piece of sub-data, the second temporal sub-data, the third temporal sub-data and so on up to the N-th temporal sub-data, and then concatenate the temporally convolved data to obtain the second data.

In some optional implementations, the processing unit 204 is further configured to acquire the data of the adjacent frames of the image using a three-dimensional convolution model; and/or, after the second data is obtained, perform channel dimension reduction on the first data and the second data using a one-dimensional convolution model.
Those skilled in the art should understand that the functions implemented by the units of the behavior recognition apparatus shown in FIG. 8 can be understood with reference to the foregoing description of the behavior recognition method. The functions of the units of the behavior recognition apparatus shown in FIG. 8 may be implemented by a program running on a processor, or by a logic circuit.

If the above behavior recognition apparatus of the embodiments of the present disclosure is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read Only Memory), a magnetic disk or an optical disc. Thus, the embodiments of the present disclosure are not limited to any specific combination of hardware and software.

Correspondingly, an embodiment of the present disclosure further provides a computer program product storing computer-executable instructions which, when executed, implement the above behavior recognition method of the embodiments of the present disclosure.

An embodiment of the present disclosure further provides a storage medium storing executable instructions which, when executed by a processor, implement the above behavior recognition method.
To implement the above behavior recognition method provided by the embodiments of the present disclosure, an embodiment of the present disclosure further provides an electronic device. FIG. 9 is a schematic structural diagram of the electronic device according to an embodiment of the present disclosure. As shown in FIG. 9, the electronic device 300 may include one or more processors 301 (only one is shown in the figure; the processor 301 may include, but is not limited to, a processing device such as a microcontroller unit (MCU, Micro Controller Unit) or a field programmable gate array (FPGA, Field Programmable Gate Array)), a memory 302 for storing data, and a transmission device 303 for communication functions. Those of ordinary skill in the art can understand that the structure shown in FIG. 9 is merely illustrative and does not limit the structure of the above electronic device. For example, the electronic device 300 may also include more or fewer components than shown in FIG. 9, or have a configuration different from that shown in FIG. 9.

The memory 302 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure. By running the software programs and modules stored in the memory 302, the processor 301 executes various functional applications and data processing, i.e., implements the above method. The memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 302 may further include memory located remotely from the processor 301, which may be connected to the electronic device 300 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

It can be understood that the memory 302 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories. The non-volatile memory may be a ROM, a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferromagnetic random access memory (FRAM, ferromagnetic random access memory), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a random access memory (RAM, Random Access Memory), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), synclink dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The memory 302 described in the embodiments of the present disclosure is intended to include, but is not limited to, these and any other suitable types of memory.

The memory 302 in the embodiments of the present disclosure is used to store various types of data to support the operation of the electronic device. Examples of such data include any computer program used to operate on the communication device, such as an application program. A program implementing the method of the embodiments of the present disclosure may be included in an application program.

The transmission device 303 is used to receive or send data via a network. The aforementioned network may include a wireless network provided by a communication provider of the electronic device 300. In one example, the transmission device 303 includes a network interface controller (NIC, Network Interface Controller), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 303 may be a radio frequency (RF, Radio Frequency) module, which is used to communicate with the Internet wirelessly.
The technical solutions described in the embodiments of the present disclosure may be combined arbitrarily where no conflict arises. In the several embodiments provided in the present disclosure, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.

The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present disclosure may all be integrated into one second processing unit, or each unit may serve separately as one unit, or two or more units may be integrated into one unit; the above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, which should all be covered within the protection scope of the present disclosure.
Industrial Applicability

The embodiments of the present disclosure acquire difference information of data of adjacent frames; determine, based on the difference information, first data representing motion information in the data; perform channel grouping on the data of the adjacent frames based on features; process the data of each channel group in the temporal dimension to obtain second data; and obtain a behavior recognition result based on the first data and the second data. In this way, acquiring the difference information of the data of adjacent frames eliminates background noise in the image, and processing each channel group in the temporal dimension achieves a larger temporal receptive field; the behavior recognition method provided by the embodiments of the present disclosure can therefore improve the accuracy of behavior recognition in both the spatial and temporal dimensions.

Claims (27)

1. A behavior recognition method, the method comprising:

acquiring difference information of data of adjacent frames of an image;

determining, based on the difference information, first data representing motion information in the data;

performing channel grouping on the data of the adjacent frames based on features;

processing the data of each channel group in the temporal dimension to obtain second data;

obtaining a behavior recognition result based on the first data and the second data.
  2. The method according to claim 1, wherein the acquiring differential information of data of adjacent frames of the image comprises:
    performing feature alignment on the data of the adjacent frames;
    acquiring foreground data of the adjacent frames from the feature-aligned data based on at least one ladder-structured convolution model; and
    performing differential processing on the foreground data to acquire differential information of the foreground data of the adjacent frames.
  3. The method according to claim 2, wherein the performing feature alignment on the data of the adjacent frames comprises:
    acquiring features of the data of two adjacent frames;
    performing dimension reduction processing in the channel dimension on the features of each of the two adjacent frames; and
    performing feature alignment on the dimension-reduced features of the two adjacent frames by using a similarity matrix.
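A minimal sketch of one possible reading of claim 3 follows, assuming PyTorch: the channel dimension reduction is modeled as a shared 1x1 convolution (reduce_conv, an assumed module), the similarity matrix is computed between spatial positions of the reduced features, and the second frame is warped onto the first; the claim itself leaves the exact form of the alignment open.

```python
# Illustrative sketch of claim 3; reduce_conv (a 1x1 Conv2d) is an assumed module.
import torch

def align_adjacent(feat_a, feat_b, reduce_conv):
    # feat_a, feat_b: (B, C, H, W) features of the data of two adjacent frames
    a = reduce_conv(feat_a).flatten(2)                  # (B, C', H*W) after channel reduction
    b = reduce_conv(feat_b).flatten(2)                  # (B, C', H*W)
    sim = torch.softmax(a.transpose(1, 2) @ b, dim=-1)  # (B, H*W, H*W) similarity matrix
    b_full = feat_b.flatten(2)                          # (B, C, H*W)
    aligned_b = (b_full @ sim.transpose(1, 2)).view_as(feat_b)  # warp frame b onto frame a
    return feat_a, aligned_b
```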
  4. The method according to claim 2 or 3, wherein the acquiring foreground data of the adjacent frames from the feature-aligned data based on the at least one ladder-structured convolution model comprises:
    copying each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies;
    inputting the first copy of the second frame data among the N copies into a two-dimensional convolution model to obtain a first output result, and differencing the first output result with the first copy of the first frame data to obtain a first differential result, the two-dimensional convolution model being an N-order two-dimensional convolution model;
    adding the first output result to the (M+1)-th copy of the second frame data and inputting the sum into the two-dimensional convolution model to obtain an (M+1)-th output result, and differencing the (M+1)-th output result with the M-th copy of the first frame data to obtain an (M+1)-th differential result, where M is greater than or equal to 1 and M is less than or equal to N-1; and
    splicing the first differential result to the N-th differential result and inputting the spliced results into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
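Read literally, claim 4 feeds the first output result into every subsequent rung of the ladder. A minimal sketch under that reading follows, assuming PyTorch; modeling the N-order two-dimensional convolution model as N separate Conv2d modules (conv2d_stack) and the one-dimensional convolution model as a 1x1 Conv1d over N·C channels (conv1d) are assumptions, and identical copies of the frame data are not materialized.

```python
# Illustrative sketch of the ladder structure in claim 4; modules are assumed, not recited.
import torch

def ladder_foreground(frame1, frame2, conv2d_stack, conv1d, n):
    # frame1, frame2: (B, C, H, W) first/second frame data of the foreground data
    diffs = []
    first_out = conv2d_stack[0](frame2)            # first output result
    diffs.append(first_out - frame1)               # first differential result
    for m in range(1, n):                          # M = 1 .. N-1
        out = conv2d_stack[m](first_out + frame2)  # add the (M+1)-th copy of frame 2
        diffs.append(out - frame1)                 # (M+1)-th differential result (copies identical)
    spliced = torch.cat(diffs, dim=1).flatten(2)   # (B, N*C, H*W) spliced differential results
    return conv1d(spliced).view_as(frame1)         # 1-D convolution -> foreground data
```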
  5. The method according to claim 4, wherein the determining, based on the differential information, first data representing motion information in the data comprises:
    determining a channel weight based on the differential information; and
    processing the foreground data in the data based on the channel weight to obtain the first data.
  6. The method according to claim 5, wherein the processing the foreground data in the data based on the channel weight to obtain the first data comprises:
    multiplying the foreground data by the channel weight to obtain a product result, and adding the product result to the foreground data to obtain the first data.
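Claims 5 and 6 amount to a channel-attention gate with a residual connection. A minimal sketch follows, assuming PyTorch; deriving the channel weight via global average pooling and a sigmoid is an assumption, since the claims only require that the weight be determined from the differential information.

```python
# Illustrative sketch of claims 5-6; the pooling/sigmoid weighting is an assumed design.
import torch
import torch.nn.functional as F

def motion_excite(foreground, diff_info):
    # foreground, diff_info: (B, C, H, W)
    weight = torch.sigmoid(F.adaptive_avg_pool2d(diff_info, 1))  # (B, C, 1, 1) channel weight
    return foreground * weight + foreground  # product result added back to the foreground
```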
  7. The method according to any one of claims 1 to 6, wherein the processing the data of each channel group in the time series dimension to obtain second data comprises:
    processing the data of each channel group by using a time series convolution model; and
    fusing the processed data of all the channel groups to obtain the second data.
  8. The method according to claim 7, wherein the processing the data of each channel group by using a time series convolution model comprises:
    fusing data time series information of different scales in the channel dimension by using the time series convolution model.
  9. The method according to claim 8, wherein the fusing data time series information of different scales in the channel dimension by using the time series convolution model comprises:
    dividing the data of each channel group into N pieces of sub-data, and inputting the second piece of sub-data among the N pieces into a one-dimensional convolution model in the time series convolution model to obtain second sub-time-series data;
    fusing the K-th sub-time-series data with the (K+1)-th piece of sub-data, and performing preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1;
    computing a first result of a point-wise multiplication of the value obtained by the preprocessing and the K-th sub-time-series data, and computing a second result of a point-wise multiplication of the difference between 1 and the value obtained by the preprocessing and the (K+1)-th piece of sub-data; and
    fusing the first result with the second result and inputting the fused result into the one-dimensional convolution model to obtain (K+1)-th sub-time-series data.
  10. The method according to claim 9, wherein the fusing the processed data of all the channel groups to obtain the second data comprises:
    performing time series convolution processing on each of the first sub-data, the second sub-time-series data, the third sub-time-series data, up to the N-th sub-time-series data, and then cascading the data after the time series convolution processing to obtain the second data.
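Claims 9 and 10 describe a gated, multi-scale temporal ladder over channel sub-groups. A minimal sketch follows, assuming PyTorch; modeling the "fusion" as addition, the "preprocessing" as a sigmoid, and the two one-dimensional convolutions (conv1d_in, conv1d_out) as channel-preserving modules are all assumptions made for the sketch.

```python
# Illustrative sketch of claims 9-10; the fusion/preprocessing choices are assumptions.
import torch

def long_temporal(group, conv1d_in, conv1d_out, n):
    # group: (B, C, T) data of one channel group; C is assumed divisible by n
    subs = torch.chunk(group, n, dim=1)               # N pieces of sub-data
    series = [subs[0]]                                # first sub-data is kept as-is
    ts = conv1d_in(subs[1])                           # second sub-time-series data
    series.append(ts)
    for k in range(1, n - 1):                         # K = 2 .. N-1 in the claim's numbering
        gate = torch.sigmoid(ts + subs[k + 1])        # preprocess the fused data
        mixed = gate * ts + (1 - gate) * subs[k + 1]  # first result fused with second result
        ts = conv1d_in(mixed)                         # (K+1)-th sub-time-series data
        series.append(ts)
    # time series convolution on every piece, then cascade -> second data
    return torch.cat([conv1d_out(s) for s in series], dim=1)
```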
  11. The method according to claim 1, wherein before the acquiring differential information of data of adjacent frames of the image, the method further comprises: acquiring the data of the adjacent frames of the image by using a three-dimensional convolution model;
    and/or, after the obtaining second data, the method further comprises: performing channel dimension reduction processing on the first data and the second data by using a one-dimensional convolution model.
  12. The method according to any one of claims 1 to 11, wherein the method is implemented by a time series motion model, the time series motion model comprising an enhanced motion transformation module and a long time series modeling module;
    wherein the enhanced motion transformation module is configured to acquire the differential information of the data of the adjacent frames of the image, and determine, based on the differential information, the first data representing motion information in the data; and
    the long time series modeling module is configured to perform channel grouping on the data of the adjacent frames based on features, and process the data of each channel group in the time series dimension to obtain the second data.
  13. The method according to claim 12, wherein the time series motion model is embedded after a three-dimensional convolution model, the three-dimensional convolution model being used to acquire the data of the adjacent frames of the image;
    and/or, the time series motion model is embedded before a one-dimensional convolution model, the one-dimensional convolution model being used to perform channel dimension reduction processing on the first data and the second data.
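Claims 12 and 13 place the time series motion model between a three-dimensional convolution (which supplies the adjacent-frame data) and a one-dimensional convolution (which reduces the channel dimension). A minimal structural sketch follows, assuming PyTorch; the constituent modules are injected as parameters, and combining the first and second data by element-wise addition is an assumption.

```python
# Illustrative sketch of the module placement in claims 12-13; all modules are assumed.
import torch.nn as nn

class TemporalMotionBlock(nn.Module):
    def __init__(self, conv3d, motion_module, temporal_module, reduce_module):
        super().__init__()
        self.conv3d = conv3d              # acquires data of adjacent frames (claim 13)
        self.motion = motion_module       # enhanced motion transformation -> first data
        self.temporal = temporal_module   # long time series modeling -> second data
        self.reduce = reduce_module       # 1-D convolution for channel dimension reduction

    def forward(self, x):
        x = self.conv3d(x)
        first, second = self.motion(x), self.temporal(x)
        return self.reduce(first + second)
```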
  14. A behavior recognition apparatus, the apparatus comprising:
    an acquisition unit, configured to acquire differential information of data of adjacent frames;
    a determination unit, configured to determine, based on the differential information, first data representing motion information in the data;
    a grouping unit, configured to perform channel grouping on the data of the adjacent frames based on features;
    a processing unit, configured to process the data of each channel group in the time series dimension to obtain second data; and
    a recognition unit, configured to obtain a behavior recognition result based on the first data and the second data.
  15. The apparatus according to claim 14, wherein
    the acquisition unit is further configured to: perform feature alignment on the data of the adjacent frames; acquire foreground data of the adjacent frames from the feature-aligned data based on at least one ladder-structured convolution model; and perform differential processing on the foreground data to acquire differential information of the foreground data of the adjacent frames.
  16. The apparatus according to claim 15, wherein
    the acquisition unit is further configured to: acquire features of the data of two adjacent frames; perform dimension reduction processing in the channel dimension on the features of each frame; and perform feature alignment on the dimension-reduced features of the two adjacent frames by using a similarity matrix.
  17. The apparatus according to claim 15 or 16, wherein
    the acquisition unit is further configured to: copy each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies; input the first copy of the second frame data among the N copies into a two-dimensional convolution model, and difference the obtained first output result with the first copy of the first frame data to obtain a first differential result, the two-dimensional convolution model being an N-order two-dimensional convolution model; add the first output result to the (M+1)-th copy of the second frame data and input the sum into the two-dimensional convolution model, and difference the obtained (M+1)-th output result with the M-th copy of the first frame data to obtain an (M+1)-th differential result, where M is greater than or equal to 1 and M is less than or equal to N-1; and splice the obtained first differential result to the N-th differential result and input the spliced results into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
  18. The apparatus according to claim 17, wherein
    the determination unit is further configured to: determine a channel weight based on the differential information; and process the foreground data in the data based on the channel weight to obtain the first data.
  19. The apparatus according to claim 18, wherein
    the determination unit is further configured to add the foreground data to the product of the foreground data and the channel weight to obtain the first data.
  20. The apparatus according to any one of claims 14 to 19, wherein
    the processing unit is further configured to: process the data of each channel group by using a time series convolution model; and fuse the processed data of all the channel groups to obtain the second data.
  21. The apparatus according to claim 20, wherein
    the processing unit is further configured to fuse data time series information of different scales in the channel dimension by using the time series convolution model.
  22. The apparatus according to claim 21, wherein
    the processing unit is further configured to: divide the data of each channel group into N pieces of sub-data, and input the second piece of sub-data into a one-dimensional convolution model in the time series convolution model to obtain second sub-time-series data; fuse the K-th sub-time-series data with the (K+1)-th piece of sub-data and perform preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; compute a first result of a point-wise multiplication of the value obtained by the preprocessing and the K-th sub-time-series data, and compute a second result of a point-wise multiplication of the difference between 1 and the value obtained by the preprocessing and the (K+1)-th piece of sub-data; and fuse the first result with the second result and input the fused result into the one-dimensional convolution model to obtain (K+1)-th sub-time-series data.
  23. The apparatus according to claim 22, wherein
    the processing unit is further configured to perform time series convolution processing on each of the first sub-data, the second sub-time-series data, the third sub-time-series data, up to the N-th sub-time-series data, and then cascade the data after the time series convolution processing to obtain the second data.
  24. The apparatus according to claim 14, wherein
    the processing unit is further configured to: acquire the data of the adjacent frames of the image by using a three-dimensional convolution model; and/or, after the second data is obtained, perform channel dimension reduction processing on the first data and the second data by using a one-dimensional convolution model.
  25. A computer program product, comprising computer-executable instructions which, when executed, implement the method steps of any one of claims 1 to 13.
  26. A storage medium having executable instructions stored thereon, wherein the executable instructions, when executed by a processor, implement the method steps of any one of claims 1 to 13.
  27. An electronic device, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor, when running the computer-executable instructions on the memory, implements the method steps of any one of claims 1 to 13.
PCT/CN2021/127119 2021-02-22 2021-10-28 Behavior recognition method and apparatus, and electronic device and storage medium WO2022174616A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110198255.3A CN112926436A (en) 2021-02-22 2021-02-22 Behavior recognition method and apparatus, electronic device, and storage medium
CN202110198255.3 2021-02-22

Publications (1)

Publication Number Publication Date
WO2022174616A1 true WO2022174616A1 (en) 2022-08-25

Family

ID=76170053

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127119 WO2022174616A1 (en) 2021-02-22 2021-10-28 Behavior recognition method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112926436A (en)
WO (1) WO2022174616A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926436A (en) * 2021-02-22 2021-06-08 上海商汤智能科技有限公司 Behavior recognition method and apparatus, electronic device, and storage medium
CN114938349B (en) * 2022-05-20 2023-07-25 远景智能国际私人投资有限公司 Internet of things data processing method and device, computer equipment and storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886172A (en) * 2019-02-01 2019-06-14 深圳市商汤科技有限公司 Video behavior recognition methods and device, electronic equipment, storage medium, product
CN112241673B (en) * 2019-07-19 2022-11-22 浙江商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN111652910B (en) * 2020-05-22 2023-04-11 重庆理工大学 Target tracking algorithm based on object space relationship

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN110008867A (en) * 2019-03-25 2019-07-12 五邑大学 A kind of method for early warning based on personage's abnormal behaviour, device and storage medium
CN111242068A (en) * 2020-01-17 2020-06-05 科大讯飞(苏州)科技有限公司 Behavior recognition method and device based on video, electronic equipment and storage medium
CN111539290A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112926436A (en) * 2021-02-22 2021-06-08 上海商汤智能科技有限公司 Behavior recognition method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN112926436A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
WO2022174616A1 (en) Behavior recognition method and apparatus, and electronic device and storage medium
US11625433B2 (en) Method and apparatus for searching video segment, device, and medium
WO2016101628A1 (en) Data processing method and device in data modeling
JP7417759B2 (en) Methods, apparatus, electronic equipment, storage media and computer programs for training video recognition models
US10394907B2 (en) Filtering data objects
CN113570030B (en) Data processing method, device, equipment and storage medium
CN110941978B (en) Face clustering method and device for unidentified personnel and storage medium
CN110457524B (en) Model generation method, video classification method and device
CN111368850B (en) Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal
JP2020525908A (en) Image search method, device, device and readable storage medium
WO2021159787A1 (en) Content processing method and apparatus, computer-readable storage medium and computer device
Hintermüller et al. Robust principal component pursuit via inexact alternating minimization on matrix manifolds
CN112241789A (en) Structured pruning method, device, medium and equipment for lightweight neural network
US20230066703A1 (en) Method for estimating structural vibration in real time
Zhao et al. 3D target detection using dual domain attention and SIFT operator in indoor scenes
CN114492755A (en) Target detection model compression method based on knowledge distillation
WO2023050649A1 (en) Esg index determination method based on data complementing, and related product
CN110968835A (en) Approximate quantile calculation method and device
CN116451081A (en) Data drift detection method, device, terminal and storage medium
CN116229095A (en) Model training method, visual task processing method, device and equipment
CN115331081A (en) Image target detection method and device
Zhang et al. A novel target tracking method based on OSELM
CN114743150A (en) Target tracking method and device, electronic equipment and storage medium
CN113658320A (en) Three-dimensional reconstruction method, human face three-dimensional reconstruction method and related device
Shang et al. Regularization parameter selection for the low rank matrix recovery

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21926330

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE