CN112926436A - Behavior recognition method and apparatus, electronic device, and storage medium - Google Patents

Behavior recognition method and apparatus, electronic device, and storage medium

Info

Publication number
CN112926436A
CN112926436A (application number CN202110198255.3A)
Authority
CN
China
Prior art keywords
data, time sequence, adjacent frames, result, channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110198255.3A
Other languages
Chinese (zh)
Inventor
苏海昇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202110198255.3A
Publication of CN112926436A
Priority to PCT/CN2021/127119 (published as WO2022174616A1)
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

The application discloses a behavior recognition method and apparatus, an electronic device, and a storage medium. The method includes: acquiring difference information of data of adjacent frames; determining, based on the difference information, first data representing motion information in the data; performing channel grouping on the data of the adjacent frames based on features; processing the data of each channel group in the time sequence dimension to obtain second data; and obtaining a behavior recognition result based on the first data and the second data. With the behavior recognition method and apparatus, the accuracy of behavior recognition can be improved in both the spatial and temporal dimensions.

Description

Behavior recognition method and apparatus, electronic device, and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a behavior recognition method and apparatus, an electronic device, and a storage medium.
Background
In the technical field of computer vision, behavior recognition based on video data plays an extremely important role in numerous fields such as video recommendation, security monitoring and human-computer interaction; therefore, the accuracy of behavior recognition on images is important, and performing behavior recognition with high accuracy is a goal that computer vision technology has consistently pursued.
Summary of the Application
In order to solve the technical problem, embodiments of the present application provide a behavior recognition method and apparatus, an electronic device, and a storage medium.
The behavior identification method provided by the embodiment of the application comprises the following steps:
acquiring differential information of data of adjacent frames;
determining first data representing motion information in the data based on the difference information;
performing channel grouping on the data of the adjacent frames based on features;
processing the data of each channel group in a time sequence dimension to obtain second data;
obtaining a behavior recognition result based on the first data and the second data.
In an optional embodiment of the present application, the acquiring differential information of data of adjacent frames includes:
performing feature alignment on the data of the adjacent frames;
acquiring foreground data of the adjacent frames from the data after feature alignment based on at least one convolution model with a ladder structure;
and carrying out differential processing on the foreground data to obtain differential information of the foreground data of the adjacent frames.
In an optional embodiment of the present application, the performing feature alignment on the data of the adjacent frames includes:
acquiring the characteristics of the data of two adjacent frames;
carrying out dimension reduction processing on the features of each frame on the channel dimension;
and performing feature alignment on the features of the two adjacent frames after the dimension reduction processing by using the similarity matrix.
In an optional embodiment of the present application, the obtaining foreground data of the adjacent frame from the data after feature alignment based on the convolution model with at least one ladder structure includes:
respectively copying the first frame data and the second frame data in the foreground data of the adjacent frames into N parts;
inputting first second frame data in the N second frame data into a two-dimensional convolution model, and performing difference between the obtained first output result and the first frame data to obtain a first difference result; the two-dimensional convolution model is an N-order two-dimensional convolution model;
adding the first output result and the (M +1) th second frame data, inputting the result into a two-dimensional convolution model, and carrying out difference on the obtained (M +1) th output result and the Mth first frame data to obtain an (M +1) th difference result, wherein M is greater than or equal to 1, and M is less than or equal to N-1;
and splicing the obtained first difference result to the Nth difference result, and inputting the result to a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
In an optional embodiment of the present application, the determining, based on the difference information, first data characterizing motion information in the data includes:
determining channel weights based on the difference information;
and processing foreground data in the data based on the channel weight to obtain the first data.
In an optional embodiment of the present application, the processing foreground data in the data based on the channel weight to obtain the first data includes:
and adding the foreground data and the product of the foreground data and the channel weight to obtain the first data.
In an optional embodiment of the present application, the processing the data of each channel packet in the time sequence dimension to obtain the second data includes:
processing the data grouped by each channel by utilizing a time sequence convolution model;
and fusing the processed data of all channel groups to obtain the second data.
In an optional embodiment of the present application, the processing the data of each channel packet by using the time-series convolution model includes:
and fusing data time sequence information of different scales in channel dimensions by using the time sequence convolution model.
In an optional embodiment of the present application, the fusing, in a channel dimension, data timing information of different scales by using the timing convolution model includes:
dividing data of each channel group into N parts of sub data, and inputting second part of sub data into a one-dimensional convolution model in the time sequence convolution model to obtain second sub time sequence data;
fusing the Kth sub time sequence data with the (K +1) th sub data, and then preprocessing the fused data, wherein K is more than or equal to 2, and K is less than or equal to N-1;
calculating a first result of dot multiplication of the preprocessed value and the Kth sub-time sequence data, and calculating a second result of dot multiplication of the difference between 1 and the preprocessed value and the (K +1) th sub-time sequence data;
and after the first result and the second result are fused, inputting the fused first result and the fused second result into the one-dimensional convolution model to obtain (K +1) th sub time sequence data.
In an optional embodiment of the present application, the fusing the processed data of all channel packets to obtain the second data includes:
and after the first sub data, the second sub time sequence data and the third sub time sequence data are subjected to time sequence convolution processing respectively till the Nth sub time sequence data, cascading the data subjected to the time sequence convolution processing to obtain the second data.
In an optional embodiment of the present application, before obtaining the differential information of the data of the adjacent frames of the image, the method further includes: acquiring data of adjacent frames of the image by using a three-dimensional convolution model;
and/or after obtaining the second data, the method further comprises: and executing channel dimension reduction processing on the first data and the second data.
The behavior recognition device provided by the embodiment of the application comprises:
an acquisition unit configured to acquire difference information of data of adjacent frames;
a determining unit, configured to determine, based on the difference information, first data representing motion information in the data;
a grouping unit for performing channel grouping on the data of the adjacent frames based on the characteristics;
the processing unit is used for processing the data of each channel group in the time sequence dimension to obtain second data;
and the identification unit is used for obtaining a behavior identification result based on the first data and the second data.
The computer program product provided by the embodiment of the application comprises computer executable instructions, and after the computer executable instructions are executed, the behavior recognition method can be realized.
The storage medium provided by the embodiment of the application stores executable instructions, and the executable instructions are executed by the processor to realize the behavior recognition method.
The electronic device provided by the embodiment of the application is characterized by comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor can realize the behavior recognition method when the processor runs the computer-executable instructions on the memory.
In the behavior recognition method provided by the embodiment of the present application, difference information of data of adjacent frames is acquired; first data representing motion information in the data is determined based on the difference information; channel grouping is performed on the data of the adjacent frames based on features; the data of each channel group is processed in the time sequence dimension to obtain second data; and a behavior recognition result is obtained based on the first data and the second data. In this way, obtaining the difference information of the data of adjacent frames can eliminate background noise in the image, and processing each channel group in the time sequence dimension achieves a larger temporal receptive field; therefore, the behavior recognition method provided by the embodiment of the present application can improve the accuracy of behavior recognition in both the spatial and temporal dimensions.
In order to make the aforementioned and other objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
Fig. 1 is a schematic flowchart of a behavior recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative processing flow of acquiring differential information of data of adjacent frames by an identification apparatus according to an embodiment of the present application;
fig. 3 is a schematic view of an alternative processing flow of feature alignment performed on the data of the adjacent frames by the recognition device according to the embodiment of the present application;
FIG. 4 is a diagram illustrating a dimension reduction process performed on features of frames at different time instances according to an embodiment of the present application;
FIG. 5 is a diagram illustrating cascading features using a staircase cascade structure according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a data processing flow for adding EMT and TSS according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a feature pattern of data of the embodiment of the present application after passing through an EMT module;
fig. 8 is a schematic structural component diagram of a behavior recognition device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Before describing embodiments of the present application in detail, behavior recognition is briefly described.
With the improvement of basic technologies such as cloud computing and 5G, video data has grown explosively in recent years, and more and more researchers are engaging in the field of video understanding in order to fully exploit the value of video data. Behavior recognition serves as a basic task of video understanding and is widely needed in scenes such as security, human-computer interaction, and personalized recommendation. The process of video-based behavior recognition is to determine the category of a behavior based on a given video containing an action. The accuracy of judging the behavior category is an important index of behavior recognition, and motion modeling and time sequence modeling based on video data are important factors influencing the judgment result of behavior recognition.
Motion modeling based on video data most often models the motion information between adjacent frames from optical flow features. In general, dual-stream behavior recognition methods extract motion features through optical flow modeling. However, extracting optical flow and modeling with it involve a large amount of computation, so such methods are difficult to apply in scenes requiring high real-time performance. Since an important role of optical flow in describing the motion relationship between adjacent frames is to highlight moving objects, an alternative is to approximate optical flow by the difference of features between different frames. However, when optical flow is approximated by feature differences between frames, edge motion information of moving objects and of non-moving objects is obtained at the same time; the edge motion information of non-moving objects belongs to noise, because non-moving objects are the background part of the image, and likewise the edge motion information of the stationary parts of a moving object belongs to interference noise.
There are mainly two schemes for time sequence modeling based on video data. The first scheme adopts a structure of a 2D CNN plus an inter-frame aggregator, where the aggregator generally adopts operations such as average/max pooling, 3D convolution, or an RNN; this scheme simply performs frame-level score fusion or frame-level high-level feature fusion, and does not consider feature-level timing information aggregation. The second scheme adopts 3D convolution and aggregates the timing relationship at the feature level; because 3D convolution has many parameters and a large computational cost, the 3D convolution is often decoupled into a 2D + 1D convolution, with the 2D convolution modeling spatial information and the 1D convolution responsible only for modeling the timing relationship. However, 3D or 2D + 1D convolution only models the timing relationship within a local window, and modeling long-range timing relies on vertically stacked convolution blocks; such vertical structures make the shallow time sequence convolutions difficult to optimize.
Based on this, the application provides a behavior recognition method. Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present application unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
A behavior recognition method according to an embodiment of the present application is described in detail below. The execution subject of the behavior recognition method may be a behavior recognition device, such as a terminal device or a server, or may be another processing device with data processing capability; this is not limited in the embodiments of the present application.
Referring to fig. 1, a flow chart of a behavior recognition method provided in the embodiment of the present application is schematically illustrated, and the method at least includes the following steps:
step S101, difference information of data of adjacent frames is acquired.
In some embodiments, the behavior recognizing means (hereinafter, simply referred to as recognizing means) acquires differential information of data of adjacent frames. Wherein the adjacent frames may be two adjacent frames in the video data.
In some embodiments, an optional process flow of the identification device obtaining the difference information of the data of the adjacent frames, as shown in fig. 2, includes at least the following steps:
and step S1a, performing feature alignment on the data of the adjacent frames.
In some embodiments, an optional process flow of feature alignment of the data of the adjacent frames by the identification device, as shown in fig. 3, includes at least the following steps:
step S1a1, acquiring characteristics of data of two adjacent frames.
In some embodiments, the shape of the features of the data may be X ∈ R^[N, F, C, H, W], where N denotes the data amount (batch size), F is the number of frames, C is the number of channels, H is the height of a single frame image, and W is the width of a single frame image.
And step S1a2, performing dimensionality reduction on the channel dimension of the features of each frame.
In some embodiments, the recognition device performs frame-level separation on the input features X to obtain X_t ∈ R^[C×H×W]. In order to reduce the amount of data calculation, the recognition device performs dimensionality reduction on the features of each frame in the channel dimension and compresses the feature channels using a 1×1 convolution, as shown in the following formula (1) and formula (2), where formula (1) is the result of dimensionality reduction of the features of the frame at time t shown in fig. 4, and formula (2) is the result of dimensionality reduction in the channel dimension of the features of the frame at time t+1 shown in fig. 4.
x_t = Conv1D(X_t), x_t ∈ R^[C/l, H, W]    (1)
x_{t+1} = Conv1D(X_{t+1}), x_{t+1} ∈ R^[C/l, H, W]    (2)
Wherein l is the channel compression rate, and the value of l can be flexibly set according to the actual application scene, such as set to a value of 16.
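As an illustration only, the channel compression of formulas (1) and (2) could be sketched in PyTorch as follows; the class name, the use of a 1×1 nn.Conv2d (the text labels the operator Conv1D but describes a 1×1 convolution applied to each frame), and the default compression rate l = 16 are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class ChannelCompress(nn.Module):
    """Minimal sketch of the 1x1 channel compression in formulas (1)-(2)."""
    def __init__(self, channels: int, l: int = 16):
        super().__init__()
        # compress C channels down to C/l channels with a 1x1 convolution
        self.conv = nn.Conv2d(channels, channels // l, kernel_size=1)

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        # x_t: [B, C, H, W] features of a single frame -> [B, C/l, H, W]
        return self.conv(x_t)
```

With C = 256 and l = 16, for example, the compressed features have 16 channels, which keeps the subsequent similarity computation inexpensive.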
And step S1a3, performing feature alignment on the features of the two adjacent frames after the dimension reduction processing by using the similarity matrix.
In some embodiments, the identifying means performs warp alignment on adjacent frames using a similarity matrix, as shown in equations (3) and (4) below.
[Equation (3): construction of the inter-frame similarity matrix from x_t and x_{t+1}; the formula image is not reproduced in this text.]
[Equation (4): warp alignment of x_{t+1} using the similarity matrix, yielding the aligned feature denoted A(x_{t+1}); the formula image is not reproduced in this text.]
where the r() function is used to transform the size/shape of the features.
And step S1b, acquiring foreground data of the adjacent frames from the data after feature alignment based on at least one convolution model with a ladder structure.
In some embodiments, the identification means extracts the data characterizing the motion information using a set of staircase-structured 2D convolutions. The set may contain, for example, 2, 4 or 6 2D convolutions. Taking a set of 4 staircase-structured 2D convolutions as an example, the features are divided into 4 parts by channel, convolution calculation is performed on the aligned features using the following formula (5), formula (6), formula (7) and formula (8), and foreground data of the adjacent frames is acquired from the feature-aligned data to obtain multi-scale motion information.
m_{s=0} = Conv2D(r(A(x_{t+1})))    (5)
m_{s=1} = Conv2D(m_{s=0} + r(A(x_{t+1})))    (6)
m_{s=2} = Conv2D(m_{s=1} + r(A(x_{t+1})))    (7)
m_{s=3} = Conv2D(m_{s=2} + r(A(x_{t+1})))    (8)
Through the formulas (5) to (8), foreground data of adjacent frames can be extracted, namely motion information is extracted, and background data in the adjacent frames are screened and deleted.
Specifically, as shown in the right part of fig. 4, the first frame data and the second frame data in the foreground data of the adjacent frames are respectively copied into N (N in fig. 4 is 4); inputting first second frame data in the N second frame data into a two-dimensional convolution model, and performing difference between the obtained first output result and the first frame data to obtain a first difference result; the two-dimensional convolution model is an N-order two-dimensional convolution model; adding the first output result and the (M +1) th second frame data, inputting the result into a two-dimensional convolution model, and carrying out difference on the obtained (M +1) th output result and the Mth first frame data to obtain an (M +1) th difference result, wherein M is greater than or equal to 1, and M is less than or equal to N-1; and splicing the obtained first difference result to the Nth difference result, and inputting the result to a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
In the embodiment of the application, different motion change information is captured through a group of 2D convolutions, so that the subsequent motion difference information can more accurately depict the motion for behavior recognition.
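A compact sketch of the staircase 2D convolutions of formulas (5) to (8) might look as follows; the 3×3 kernel size, the number of stages N = 4 and the absence of normalisation layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LadderConv(nn.Module):
    """Sketch of the staircase-structured 2D convolutions, formulas (5)-(8)."""
    def __init__(self, channels: int, n_stages: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(n_stages)]
        )

    def forward(self, aligned_next: torch.Tensor) -> list:
        # aligned_next: r(A(x_{t+1})) with shape [B, C/l, H, W]
        outs, prev = [], None
        for conv in self.convs:
            inp = aligned_next if prev is None else prev + aligned_next
            prev = conv(inp)               # m_{s=0}, m_{s=1}, m_{s=2}, m_{s=3}
            outs.append(prev)
        return outs
```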
Step S1c, performing difference processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.
In some embodiments, the identification device performs difference processing on the foreground data by using the following formula (9) and formula (10) to obtain difference information of the foreground data of the adjacent frames.
In the embodiment of the application, background motion noise can be eliminated by acquiring the difference information of the foreground data of the adjacent frames.
Taking fig. 4 as an example, the data at time t = i is copied into 4 copies, namely the 1st, 2nd, 3rd and 4th data, and the data at time t+1 is copied into 4 copies, namely the 5th, 6th, 7th and 8th data. The result obtained by applying two-dimensional convolution to the 5th data is fused with the 6th data; the result obtained by applying two-dimensional convolution to that fused data is fused with the 7th data; and the result obtained by applying two-dimensional convolution to this further fused data is fused with the 8th data. Difference processing is then performed between the convolution result at each stage and the 1st to 4th data respectively, yielding 4 difference results, namely the difference information.
M_diff = (m_{s=0} − x_t) + (m_{s=1} − x_t) + (m_{s=2} − x_t) + (m_{s=3} − x_t)    (9)
M_out = Conv1D(M_diff), M_out ∈ R^[C, H, W]    (10)
where M_out represents the difference information of the foreground data of the adjacent frames.
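Formulas (9) and (10) can be illustrated by the short helper below; the 1×1 projection proj that maps the C/l-channel difference back to C channels is an assumption suggested by M_out ∈ R^[C, H, W].

```python
import torch
import torch.nn as nn

def motion_difference(ms: list, x_t: torch.Tensor, proj: nn.Module) -> torch.Tensor:
    """Sketch of formulas (9)-(10): multi-scale differences against x_t."""
    m_diff = sum(m - x_t for m in ms)      # formula (9): sum of (m_s - x_t) over s
    return proj(m_diff)                    # formula (10): project back to C channels

# e.g. proj = nn.Conv2d(channels // 16, channels, kernel_size=1)  # assumed projection
```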
And S102, determining first data representing motion information in the data based on the difference information.
In some embodiments, the channel weight is determined based on the difference information of the foreground data, that is, the difference information of the foreground data is used as the channel weight, and the foreground data is processed by using the channel weight, so as to obtain the first data representing the motion information in the data.
The foreground data is processed by using the channel weight, which may be by enhancing the foreground data by using the channel weight, to obtain first data representing motion information in the data.
In some embodiments, the channel weights are determined by equation (11) below; foreground data is enhanced by adding the foreground data and the product of the foreground data and the channel weight to obtain the first data according to the following formula (12).
W = sigmoid(AvgPooling(M_out)) ∈ R^[C, 1, 1]    (11)
Enhanced(X_t) = X_t + X_t ⊙ W ∈ R^[C, H, W]    (12)
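Formulas (11) and (12) amount to a channel-attention style enhancement and can be transcribed almost directly; only the batch dimension in the shapes is an assumption.

```python
import torch
import torch.nn.functional as F

def enhance(x_t: torch.Tensor, m_out: torch.Tensor) -> torch.Tensor:
    """Sketch of formulas (11)-(12): channel weighting and residual enhancement."""
    # formula (11): W = sigmoid(AvgPooling(M_out)), one weight per channel
    w = torch.sigmoid(F.adaptive_avg_pool2d(m_out, 1))   # [B, C, 1, 1]
    # formula (12): Enhanced(X_t) = X_t + X_t * W
    return x_t + x_t * w
```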
Therefore, in the embodiment of the application, the inter-frame similarity matrix is used to realize feature alignment between frames, eliminating as much as possible the interference caused by background jitter. Meanwhile, in consideration of the diversity of motion information, the embodiment of the application extracts data of different scales by using a group of staircase-structured 2D convolutions, performs differential processing on the data of different scales, eliminates background noise in the image, obtains motion saliency information of different scales, and finally uses the motion saliency information to enhance the motion change area; that is, an enhanced motion transformation (EMT) module is added to process the data.
And S103, grouping the channels of the data of the adjacent frames based on the characteristics.
In some embodiments, the identification device processes the data of each channel packet by using a time-series convolution model, and fuses the processed data of all channel packets to obtain the second data.
In some embodiments, in order to reduce the amount of data calculation, the recognition device performs channel grouping on the input features X ∈ R^[NF×C×H×W] to obtain X_{g=i} ∈ R^[NF, C/4, H, W].
And step S104, processing the data of each channel group in the time sequence dimension to obtain second data.
In some embodiments, the identification device utilizes the time sequence convolution model to fuse data time sequence information of different scales in a channel dimension; specifically, the identification means may process the data of each channel packet by using 1D time series convolution as shown in the following equations (13) to (15):
r_out = reshape(X) ∈ R^[NHW, C, F]    (13)
r_out = Conv1D(r_out) ∈ R^[NHW, C, F]    (14)
out = reshape(r_out) ∈ R^[NF, C, H, W]    (15)
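Formulas (13) to (15) describe running a 1D convolution along the frame axis for every spatial position. A sketch under the assumptions of a known number of frames per clip and a kernel size of 3 follows.

```python
import torch
import torch.nn as nn

def temporal_conv(x: torch.Tensor, conv1d: nn.Conv1d, frames: int) -> torch.Tensor:
    """Sketch of formulas (13)-(15): 1D convolution over the frame dimension."""
    nf, c, h, w = x.shape
    n = nf // frames
    r = (x.reshape(n, frames, c, h, w)
          .permute(0, 3, 4, 2, 1)
          .reshape(n * h * w, c, frames))          # formula (13)
    r = conv1d(r)                                  # formula (14)
    return (r.reshape(n, h, w, c, frames)
             .permute(0, 4, 3, 1, 2)
             .reshape(nf, c, h, w))                # formula (15)

# e.g. conv1d = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # assumed
```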
dividing data of each channel group into N parts of sub data, and inputting second part of sub data into a one-dimensional convolution model in the time sequence convolution model to obtain second sub time sequence data; fusing the Kth sub time sequence data with the (K +1) th sub data, and then preprocessing the fused data, wherein K is more than or equal to 2, and K is less than or equal to N-1; calculating a first result of dot multiplication of the preprocessed value and the Kth sub-time sequence data, and calculating a second result of dot multiplication of the difference between 1 and the preprocessed value and the (K +1) th sub-time sequence data; and after the first result and the second result are fused, inputting the fused first result and the fused second result into the one-dimensional convolution model to obtain (K +1) th sub time sequence data. Wherein the preprocessing may be to perform channel fusion processing on the fused data, such as SAP, FC, and Softmax.
And the identification device processes the data of each channel group by using 1D time sequence convolution, and then fuses the processed data of all the channel groups to obtain second data. Specifically, the first sub data, the second sub time sequence data, and the third sub time sequence data are respectively subjected to time sequence convolution processing until the nth sub time sequence data, and then the data subjected to the time sequence convolution processing are cascaded to obtain the second data.
For example, if the results obtained by processing the data of two adjacent channel groups by 1D time series convolution are a and B, the processing procedure for fusing the channel group data is as shown in the following equations (16) to (20):
A ∈ R^[NF, C, H, W], B ∈ R^[NF, C, H, W]    (16)
[Equation (17): fusion of the adjacent channel-group outputs A and B into C; the formula image is not reproduced in this text.]
C = AvgPooling(C) ∈ R^[NF, C, 1, 1]    (18)
C_a = Softmax(FC(C)) ∈ R^[NF, C, 1, 1]    (19)
[Equation (20): the fused result; per the description above, A is weighted by C_a and B by (1 − C_a); the formula image is not reproduced in this text.]
Equation (20) represents the result obtained by fusing the data of two adjacent channel groups. As shown in the right part of fig. 5, the data corresponding to X_{g=0} is the first sub-data; the data corresponding to X_{g=1} is the second sub-data, which is input into the one-dimensional convolution model in the time sequence convolution model to obtain the second sub-time-sequence data; the data corresponding to X_{g=2} is taken as the third sub-data, fused with the second sub-time-sequence data and preprocessed, and the obtained result is input into the one-dimensional convolution model in the time sequence convolution model to obtain the third sub-time-sequence data; and the data corresponding to X_{g=3} is taken as the fourth sub-data, fused with the third sub-time-sequence data and preprocessed, and the obtained result is input into the one-dimensional convolution model in the time sequence convolution model to obtain the fourth sub-time-sequence data.
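Because formulas (17) and (20) are not reproduced above, the gated fusion of two adjacent channel-group outputs is sketched below under the assumptions of additive fusion for (17) and a convex combination weighted by C_a for (20), following the dot-product description in the preceding paragraphs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Hypothetical sketch of formulas (16)-(20) for two adjacent channel groups."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)                 # FC in formula (19)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: [NF, C, H, W] outputs of adjacent channel groups, formula (16)
        c = a + b                                                # assumed fusion, (17)
        c = F.adaptive_avg_pool2d(c, 1).flatten(1)               # AvgPooling, (18)
        ca = torch.softmax(self.fc(c), dim=1)[..., None, None]   # C_a, formula (19)
        return ca * a + (1.0 - ca) * b                           # assumed form of (20)
```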
The recognition means then uses a staircase cascade structure to cascade the features X_{g=0}, X_{g=1}, X_{g=2} and X_{g=3} shown in FIG. 5 to obtain the second data; specifically, the first sub-data, the second sub-time-sequence data, the third sub-time-sequence data, and so on up to the Nth sub-time-sequence data are each subjected to time sequence convolution processing, and the data after the time sequence convolution processing are then cascaded to obtain the second data.
The cascaded process is as shown in equations (21) to (24):
O_0 = X_{g=0}    (21)
O_1 = TemporalConv(X_{g=1})    (22)
O_2 = TemporalConv(CS(O_1, X_{g=2}))    (23)
O_3 = TemporalConv(CS(O_2, X_{g=3}))    (24)
out = cat[O_0, O_1, O_2, O_3], out ∈ R^[NF, C, H, W]    (25)
where equation (25) characterizes the timing information of the feature.
In the embodiment of the application, time sequence information of different scales is obtained through cascaded 1D convolutions, and cascading the time sequence information of different scales yields a 1D convolution with a large temporal receptive field; that is, a long-term time sequence modeling (TSS) module is added to process the data.
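Putting the pieces together, the staircase cascade of formulas (21) to (25) can be sketched as follows, where cs can be a GatedFusion instance and temporal_conv any callable applying the 1D temporal convolution (for instance the helper above with its extra arguments bound); CS() is assumed to be the gated fusion.

```python
import torch

def tss_cascade(xg: list, temporal_conv, cs) -> torch.Tensor:
    """Sketch of formulas (21)-(25): cascaded temporal modelling over channel groups."""
    o = [xg[0]]                              # O_0 = X_{g=0}, formula (21)
    prev = temporal_conv(xg[1])              # O_1, formula (22)
    o.append(prev)
    for x in xg[2:]:
        prev = temporal_conv(cs(prev, x))    # O_2, O_3, formulas (23)-(24)
        o.append(prev)
    return torch.cat(o, dim=1)               # formula (25): concatenate on channels
```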
And step S105, obtaining a behavior recognition result based on the first data and the second data.
In some embodiments, the first data is data from which background motion noise is removed from the video data, and the second data is long-time-series data; the identification device accurately identifies the behavior of the input video data based on the first data and the second data, such as judging the behavior category.
In the embodiment of the present application, there is no fixed execution order between steps S101 to S102 and steps S103 to S104: steps S101 and S102 may be executed first and then steps S103 and S104, or steps S103 and S104 may be executed first and then steps S101 and S102. It can also be understood that, in the embodiment of the present application, the first data may be obtained first and then the second data, or the second data may be obtained first and then the first data.
In the embodiment of the present application, a schematic diagram of a data processing flow for adding the EMT and the TSS is shown in fig. 6, and the EMT and the TSS are added based on an existing behavior identification method. Specifically, before data processing of the EMT, data of adjacent frames of the image are obtained by using a three-dimensional convolution model; after data processing of the TSS, channel dimension reduction processing is performed on the first data and the second data using a one-dimensional convolution model. In the embodiment of the present application, the added EMT and TSS constitute a TMM module, and the TMM module may be embedded into an existing behavior recognition model, such as a 2D ResNet model, as shown in fig. 6, after a three-dimensional convolution model for acquiring data of adjacent frames of the image and before a one-dimensional convolution model for performing channel dimension reduction processing on the first data and the second data. Through the TMM module, the motion significance information enhancement and long-term time sequence modeling of the data can be realized. As shown in fig. 7, the first line is an original frame of the input data, and the second line and the third line are respectively the feature patterns of the output data after the EMT; the characteristic pattern of the motion region after EMT is apparent.
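As a rough illustration of how the TMM could sit inside a 2D ResNet stage, the two modules might be wired as below; the class name, the strictly sequential ordering and the absence of residual connections are assumptions, since fig. 6 is not reproduced here.

```python
import torch.nn as nn

class TMMBlock(nn.Module):
    """Hypothetical composition: 3D conv -> EMT -> TSS -> channel reduction."""
    def __init__(self, conv3d: nn.Module, emt: nn.Module, tss: nn.Module, reduce: nn.Module):
        super().__init__()
        self.conv3d = conv3d   # acquires data of adjacent frames
        self.emt = emt         # motion-saliency enhancement (yields the first data)
        self.tss = tss         # long-term temporal modelling (yields the second data)
        self.reduce = reduce   # channel dimension reduction (one-dimensional convolution)

    def forward(self, x):
        x = self.conv3d(x)
        x = self.emt(x)
        x = self.tss(x)
        return self.reduce(x)
```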
Based on the above description of the behavior recognition method provided by the embodiment of the present application, the behavior recognition method provided by the embodiment of the present application can be applied at least to scenes such as video recommendation, security monitoring, and human-computer interaction.
In order to implement the foregoing behavior identification method provided in the embodiment of the present application, an embodiment of the present application further provides a behavior identification device, and fig. 8 is a schematic structural diagram of the behavior identification device 200 provided in the embodiment of the present application, where the apparatus includes:
an acquisition unit 201 for acquiring difference information of data of adjacent frames;
a determining unit 202, configured to determine, based on the difference information, first data representing motion information in the data;
a grouping unit 203, configured to perform channel grouping on the data of the adjacent frames based on the features;
a processing unit 204, configured to process the data of each channel packet in a time sequence dimension, so as to obtain second data;
an identifying unit 205, configured to obtain a behavior identification result based on the first data and the second data.
In some optional embodiments, the obtaining unit 201 is configured to perform feature alignment on the data of the adjacent frames;
acquiring foreground data of the adjacent frames from the data after feature alignment based on at least one convolution model with a ladder structure;
and carrying out differential processing on the foreground data to obtain differential information of the foreground data of the adjacent frames.
In some optional embodiments, the obtaining unit 201 is configured to obtain characteristics of data of two adjacent frames;
carrying out dimension reduction processing on the features of each frame on the channel dimension;
and performing feature alignment on the features of the two adjacent frames after the dimension reduction processing by using the similarity matrix.
In some optional embodiments, the obtaining unit 201 is configured to copy the first frame data and the second frame data in the foreground data of the adjacent frames into N copies respectively;
inputting first second frame data in the N second frame data into a two-dimensional convolution model, and performing difference between the obtained first output result and the first frame data to obtain a first difference result; the two-dimensional convolution model is an N-order two-dimensional convolution model;
adding the first output result and the (M +1) th second frame data, inputting the result into a two-dimensional convolution model, and carrying out difference on the obtained (M +1) th output result and the Mth first frame data to obtain an (M +1) th difference result, wherein M is greater than or equal to 1, and M is less than or equal to N-1;
and splicing the obtained first difference result to the Nth difference result, and inputting the result to a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
In some optional embodiments, the determining unit 202 is configured to determine channel weights based on the difference information;
and processing foreground data in the data based on the channel weight to obtain the first data.
In some optional embodiments, the determining unit 202 is configured to add the foreground data and the product of the foreground data and the channel weight to obtain the first data.
In some optional embodiments, the processing unit 204 is configured to process the data of each channel packet by using a time-series convolution model;
and fusing the processed data of all channel groups to obtain the second data.
In some optional embodiments, the processing unit 204 is configured to fuse data timing information of different scales in a channel dimension by using the timing convolution model.
In some optional embodiments, the processing unit 204 is configured to divide the data of each channel packet into N parts of sub-data, and input a second part of sub-data to a one-dimensional convolution model in the time sequence convolution model to obtain second sub-time sequence data;
fusing the Kth sub time sequence data with the (K +1) th sub data, and then preprocessing the fused data, wherein K is more than or equal to 2, and K is less than or equal to N-1;
calculating a first result of dot multiplication of the preprocessed value and the Kth sub-time sequence data, and calculating a second result of dot multiplication of the difference between 1 and the preprocessed value and the (K +1) th sub-time sequence data;
and after the first result and the second result are fused, inputting the fused first result and the fused second result into the one-dimensional convolution model to obtain (K +1) th sub time sequence data.
In some optional embodiments, the processing unit 204 is configured to perform time-series convolution on the first sub data, the second sub time-series data, and the third sub time-series data until the nth sub time-series data, and then cascade the time-series convolution processed data to obtain the second data.
In some optional embodiments, the processing unit 204 is further configured to acquire data of adjacent frames of the image by using a three-dimensional convolution model;
and/or after the second data is obtained, performing channel dimension reduction processing on the first data and the second data by using a one-dimensional convolution model.
It should be understood by those skilled in the art that the implementation functions of the units in the behavior recognition device shown in fig. 8 can be understood by referring to the related description of the behavior recognition method. The functions of the units in the behavior recognizing apparatus shown in fig. 8 may be realized by a program running on a processor, or may be realized by specific logic circuits.
The behavior recognition device according to the embodiment of the present application may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, the present application also provides a computer program product, in which computer-executable instructions are stored, and when executed, the computer-executable instructions can implement the behavior recognition method of the present application.
An embodiment of the present application further provides a storage medium, where the storage medium stores executable instructions, and the executable instructions, when executed by a processor, implement the behavior recognition method described above.
In order to implement the foregoing behavior recognition method provided in the embodiment of the present application, an embodiment of the present application further provides an electronic device. Fig. 9 is a schematic structural composition diagram of the electronic device according to the embodiment of the present application. As shown in fig. 9, the electronic device 50 may include one or more processors 502 (only one is shown in the figure; the processor 502 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a programmable logic device such as an FPGA), a memory 504 for storing data, and a transmission device 506 for communication functions. It will be understood by those skilled in the art that the structure shown in fig. 9 is only an illustration and is not intended to limit the structure of the electronic device. For example, the electronic device 50 may also include more or fewer components than shown in FIG. 9, or have a different configuration than shown in FIG. 9.
The memory 504 can be used for storing software programs and modules of application software, such as program instructions/modules corresponding to the methods in the embodiments of the present application, and the processor 502 executes various functional applications and data processing by executing the software programs and modules stored in the memory 504, so as to implement the methods described above. The memory 504 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 504 may further include memory located remotely from the processor 502, which may be connected to the electronic device 50 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be appreciated that the memory 504 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface storage may be disk storage or tape storage. The volatile memory can be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 504 described in the embodiments herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 504 in the present embodiment is used to store various types of data to support the operation of the electronic device 50. Examples of such data include any computer program intended to operate on the electronic device 50, such as an application program. The program for implementing the method of the embodiment of the present application may be included in an application program.
The transmission device 506 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the electronic device 50. In one example, the transmission device 506 includes a Network adapter (NIC) that can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 506 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The technical solutions described in the embodiments of the present application can be arbitrarily combined without conflict.
In the several embodiments provided in the present application, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one second processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (15)

1. A method of behavior recognition, the method comprising:
acquiring difference information of data of adjacent frames of the image;
determining first data representing motion information in the data based on the difference information;
channel grouping data of the adjacent frames based on the features;
processing the data of each channel group in a time sequence dimension to obtain second data;
obtaining a behavior recognition result based on the first data and the second data.
2. The method of claim 1, wherein the obtaining differential information of data of adjacent frames of the image comprises:
performing feature alignment on the data of the adjacent frames;
acquiring foreground data of the adjacent frames from the data after feature alignment based on at least one convolution model with a ladder structure;
and carrying out differential processing on the foreground data to obtain differential information of the foreground data of the adjacent frames.
3. The method of claim 2, wherein the feature aligning the data of the adjacent frames comprises:
acquiring the characteristics of the data of two adjacent frames;
carrying out dimension reduction processing on the features of each frame on the channel dimension;
and performing feature alignment on the features of the two adjacent frames after the dimension reduction processing by using the similarity matrix.
4. The method of claim 2 or 3, wherein the obtaining foreground data of the adjacent frame from the feature-aligned data based on the convolution model of at least one staircase structure comprises:
respectively copying the first frame data and the second frame data in the foreground data of the adjacent frames into N parts;
inputting first second frame data in the N second frame data into a two-dimensional convolution model, and performing difference between the obtained first output result and the first frame data to obtain a first difference result; the two-dimensional convolution model is an N-order two-dimensional convolution model;
adding the first output result and the (M +1) th second frame data, inputting the result into a two-dimensional convolution model, and carrying out difference on the obtained (M +1) th output result and the Mth first frame data to obtain an (M +1) th difference result, wherein M is greater than or equal to 1, and M is less than or equal to N-1;
and splicing the obtained first difference result to the Nth difference result, and inputting the result to a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
5. The method of claim 4, wherein determining the first data characterizing motion information in the data based on the difference information comprises:
determining channel weights based on the difference information;
and processing foreground data in the data based on the channel weight to obtain the first data.
6. The method of claim 5, wherein the processing foreground data of the data based on the channel weights to obtain the first data comprises:
and adding the foreground data and the product of the foreground data and the channel weight to obtain the first data.
7. The method of any one of claims 1 to 6, wherein the processing the data of each lane packet in the timing dimension to obtain second data comprises:
processing the data grouped by each channel by utilizing a time sequence convolution model;
and fusing the processed data of all channel groups to obtain the second data.
8. The method of claim 7, wherein processing the data for each lane packet using a time-series convolution model comprises:
and fusing data time sequence information of different scales in channel dimensions by using the time sequence convolution model.
9. The method of claim 8, wherein the fusing the data timing information of different scales in the channel dimension by using the timing convolution model comprises:
dividing data of each channel group into N parts of sub data, and inputting second part of sub data into a one-dimensional convolution model in the time sequence convolution model to obtain second sub time sequence data;
fusing the Kth sub time sequence data with the (K +1) th sub data, and then preprocessing the fused data, wherein K is more than or equal to 2, and K is less than or equal to N-1;
calculating a first result of dot multiplication of the preprocessed value and the Kth sub-time sequence data, and calculating a second result of dot multiplication of the difference between 1 and the preprocessed value and the (K +1) th sub-time sequence data;
and after the first result and the second result are fused, inputting the fused first result and the fused second result into the one-dimensional convolution model to obtain (K +1) th sub time sequence data.
10. The method according to claim 9, wherein the fusing the processed data of all the channel packets to obtain the second data comprises:
and after the first sub data, the second sub time sequence data and the third sub time sequence data are subjected to time sequence convolution processing respectively till the Nth sub time sequence data, cascading the data subjected to the time sequence convolution processing to obtain the second data.
11. The method of claim 1, wherein prior to obtaining differential information for data of adjacent frames of an image, the method further comprises: acquiring data of adjacent frames of the image by using a three-dimensional convolution model;
and/or after obtaining the second data, the method further comprises: and performing channel dimension reduction processing on the first data and the second data by using a one-dimensional convolution model.
12. An apparatus for behavior recognition, the apparatus comprising:
an acquisition unit configured to acquire difference information of data of adjacent frames;
a determining unit, configured to determine, based on the difference information, first data representing motion information in the data;
a grouping unit for performing channel grouping on the data of the adjacent frames based on the characteristics;
the processing unit is used for processing the data of each channel group in the time sequence dimension to obtain second data;
and the identification unit is used for obtaining a behavior identification result based on the first data and the second data.
13. A computer program product, characterized in that it comprises computer-executable instructions capable, when executed, of implementing the method steps of any one of claims 1 to 11.
14. A storage medium having stored thereon executable instructions which, when executed by a processor, carry out the method steps of any one of claims 1 to 11.
15. An electronic device, comprising a memory having computer-executable instructions stored thereon and a processor, wherein the processor, when executing the computer-executable instructions on the memory, is configured to perform the method steps of any of claims 1 to 11.
CN202110198255.3A 2021-02-22 2021-02-22 Behavior recognition method and apparatus, electronic device, and storage medium Pending CN112926436A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110198255.3A CN112926436A (en) 2021-02-22 2021-02-22 Behavior recognition method and apparatus, electronic device, and storage medium
PCT/CN2021/127119 WO2022174616A1 (en) 2021-02-22 2021-10-28 Behavior recognition method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110198255.3A CN112926436A (en) 2021-02-22 2021-02-22 Behavior recognition method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN112926436A true CN112926436A (en) 2021-06-08

Family

ID=76170053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110198255.3A Pending CN112926436A (en) 2021-02-22 2021-02-22 Behavior recognition method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112926436A (en)
WO (1) WO2022174616A1 (en)

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN114938349A (en) * 2022-05-20 2022-08-23 远景智能国际私人投资有限公司 Internet of things data processing method and device, computer equipment and storage medium
WO2022174616A1 (en) * 2021-02-22 2022-08-25 上海商汤智能科技有限公司 Behavior recognition method and apparatus, and electronic device and storage medium

Citations (7)

Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN109886172A (en) * 2019-02-01 2019-06-14 深圳市商汤科技有限公司 Video behavior recognition methods and device, electronic equipment, storage medium, product
CN110008867A (en) * 2019-03-25 2019-07-12 五邑大学 A kind of method for early warning based on personage's abnormal behaviour, device and storage medium
CN111242068A (en) * 2020-01-17 2020-06-05 科大讯飞(苏州)科技有限公司 Behavior recognition method and device based on video, electronic equipment and storage medium
CN111539290A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN111652910A (en) * 2020-05-22 2020-09-11 重庆理工大学 Target tracking algorithm based on object space relationship
CN112241673A (en) * 2019-07-19 2021-01-19 浙江商汤科技开发有限公司 Video method and device, electronic equipment and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN112926436A (en) * 2021-02-22 2021-06-08 上海商汤智能科技有限公司 Behavior recognition method and apparatus, electronic device, and storage medium

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN109886172A (en) * 2019-02-01 2019-06-14 深圳市商汤科技有限公司 Video behavior recognition methods and device, electronic equipment, storage medium, product
CN110008867A (en) * 2019-03-25 2019-07-12 五邑大学 A kind of method for early warning based on personage's abnormal behaviour, device and storage medium
CN112241673A (en) * 2019-07-19 2021-01-19 浙江商汤科技开发有限公司 Video method and device, electronic equipment and storage medium
CN111242068A (en) * 2020-01-17 2020-06-05 科大讯飞(苏州)科技有限公司 Behavior recognition method and device based on video, electronic equipment and storage medium
CN111539290A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN111652910A (en) * 2020-05-22 2020-09-11 重庆理工大学 Target tracking algorithm based on object space relationship

Cited By (3)

Publication number Priority date Publication date Assignee Title
WO2022174616A1 (en) * 2021-02-22 2022-08-25 上海商汤智能科技有限公司 Behavior recognition method and apparatus, and electronic device and storage medium
CN114938349A (en) * 2022-05-20 2022-08-23 远景智能国际私人投资有限公司 Internet of things data processing method and device, computer equipment and storage medium
CN114938349B (en) * 2022-05-20 2023-07-25 远景智能国际私人投资有限公司 Internet of things data processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2022174616A1 (en) 2022-08-25

Similar Documents

Publication Publication Date Title
WO2016101628A1 (en) Data processing method and device in data modeling
CN111275107A (en) Multi-label scene image classification method and device based on transfer learning
CN112926436A (en) Behavior recognition method and apparatus, electronic device, and storage medium
CN113095370B (en) Image recognition method, device, electronic equipment and storage medium
CN111859023A (en) Video classification method, device, equipment and computer readable storage medium
CN110853672B (en) Data expansion method and device for audio scene classification
CN111027507A (en) Training data set generation method and device based on video data identification
CN116189172B (en) 3D target detection method, device, storage medium and chip
CN110941978A (en) Face clustering method and device for unidentified personnel and storage medium
CN108053424A (en) Method for tracking target, device, electronic equipment and storage medium
CN112241789A (en) Structured pruning method, device, medium and equipment for lightweight neural network
CN113556442A (en) Video denoising method and device, electronic equipment and computer readable storage medium
CN110852224B (en) Expression recognition method and related device
CN114387417B (en) Three-dimensional building modeling method and device and three-dimensional building group modeling method
CN102722732B (en) Image set matching method based on data second order static modeling
CN113674317A (en) Vehicle tracking method and device of high-order video
CN112613496A (en) Pedestrian re-identification method and device, electronic equipment and storage medium
CN112257689A (en) Training and recognition method of face recognition model, storage medium and related equipment
Martı́nez Carrillo et al. A compact and recursive Riemannian motion descriptor for untrimmed activity recognition
CN114913330A (en) Point cloud component segmentation method and device, electronic equipment and storage medium
CN112257666B (en) Target image content aggregation method, device, equipment and readable storage medium
CN113742525A (en) Self-supervision video hash learning method, system, electronic equipment and storage medium
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
Lin et al. Realtime Vehicle Tracking Method Based on YOLOv5+ DeepSORT
CN113887518A (en) Behavior detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051716

Country of ref document: HK