WO2022174616A1 - Behavior recognition method and apparatus, and electronic device and storage medium - Google Patents

Behavior recognition method and apparatus, and electronic device and storage medium

Info

Publication number
WO2022174616A1
WO2022174616A1 · PCT/CN2021/127119 · CN2021127119W
Authority
WO
WIPO (PCT)
Prior art keywords
data
sub
channel
adjacent frames
convolution model
Prior art date
Application number
PCT/CN2021/127119
Other languages
French (fr)
Chinese (zh)
Inventor
苏海昇 (SU Haisheng)
Original Assignee
上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Publication of WO2022174616A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Definitions

  • the present disclosure relates to the technical field of computer vision, and in particular to a behavior recognition method and apparatus, an electronic device, and a storage medium.
  • embodiments of the present disclosure provide a method and apparatus for behavior recognition, an electronic device, and a storage medium.
  • a behavior recognition result is obtained based on the first data and the second data.
  • the acquiring differential information of data of adjacent frames includes:
  • performing feature alignment on the data of the adjacent frames includes:
  • a similarity matrix is used to perform feature alignment on the features of two adjacent frames after dimension reduction.
  • the obtaining the foreground data of the adjacent frames from the feature-aligned data based on the at least one ladder-structured convolution model includes:
  • the first frame data and the second frame data in the foreground data of the adjacent frames are each copied into N copies;
  • the two-dimensional convolution model is an N-order two-dimensional convolution model
  • the determining, based on the differential information, the first data representing motion information in the data includes:
  • the foreground data in the data is processed based on the channel weight to obtain the first data.
  • the processing of foreground data in the data based on the channel weight to obtain the first data includes:
  • the processing of the data grouped by each channel in the time series dimension to obtain the second data includes:
  • the processed data of all channel groups are fused to obtain the second data.
  • the use of the time series convolution model to process the data grouped by each channel includes:
  • the time-series convolution model is used to fuse data time-series information of different scales in the channel dimension.
  • the use of the time series convolution model to fuse data time series information of different scales in the channel dimension includes:
  • the data of each channel grouping is divided into N sub-data, the second sub-data in the N sub-data is input into the one-dimensional convolution model in the time series convolution model, and the second sub-series data is obtained;
  • the Kth sub-series data is fused with the (K+1)th sub-data, and preprocessing is performed on the fused data; wherein, K is greater than or equal to 2, and K is less than or equal to N-1;
  • after the first result and the second result are fused, they are input into the one-dimensional convolution model to obtain the (K+1)th sub-series data.
  • the fusing of the processed data of all channel groups to obtain the second data includes:
  • the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data are respectively subjected to time-series convolution processing, and the time-series convolved data are then concatenated to obtain the second data.
  • before the acquiring of the difference information of the data of the adjacent frames of the image, the method further comprises: using a three-dimensional convolution model to acquire the data of the adjacent frames of the image;
  • the method further includes: performing channel dimension reduction processing on the first data and the second data.
  • the behavior recognition method is implemented by a temporal motion model, and the temporal motion model includes an enhanced motion transformation module and a long-term temporal modeling module;
  • the enhanced motion transformation module is used to obtain the difference information of the data of adjacent frames of the image; based on the difference information, determine the first data representing the motion information in the data;
  • the long-time sequence modeling module is configured to perform channel grouping on the data of the adjacent frames based on the feature; and process the data of each channel grouping in the time sequence dimension to obtain second data.
  • the three-dimensional convolution model is used to obtain data of adjacent frames of the image
  • the one-dimensional convolution model is used to perform channel dimension reduction processing on the first data and the second data.
  • an acquisition unit configured to acquire differential information of data of adjacent frames
  • a determining unit configured to determine, based on the differential information, first data representing motion information in the data
  • a grouping unit configured to perform channel grouping on the data of the adjacent frames based on the feature
  • a processing unit configured to process the data grouped by each channel in the time sequence dimension to obtain second data
  • An identification unit configured to obtain a behavior identification result based on the first data and the second data.
  • the obtaining unit is further configured to perform feature alignment on the data of the adjacent frames; obtain the foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure; and perform differential processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.
  • the obtaining unit is further configured to obtain features of the data of two adjacent frames; perform dimension reduction processing on the feature of each frame in the channel dimension; and use a similarity matrix to perform feature alignment on the features of the two adjacent frames after dimension reduction.
  • the obtaining unit is further configured to copy each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies;
  • the first copy of the second frame data among the N copies is input into a two-dimensional convolution model, and the obtained first output result is differenced with the first copy of the first frame data to obtain a first difference result;
  • the two-dimensional convolution model is an N-order two-dimensional convolution model;
  • the first output result is added to the (M+1)th copy of the second frame data and then input into the two-dimensional convolution model, and the obtained (M+1)th output result is differenced with the Mth copy of the first frame data to obtain the (M+1)th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; after the obtained first to Nth difference results are concatenated, they are input into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
  • the determining unit is further configured to determine a channel weight based on the differential information; and process foreground data in the data based on the channel weight to obtain the first data.
  • the determining unit is further configured to add the foreground data to the product of the foreground data and the channel weight to obtain the first data.
  • the processing unit is further configured to use a time series convolution model to process the data grouped by each channel; and fuse the processed data of all channel groups to obtain the second data.
  • the processing unit is further configured to use the time-series convolution model to fuse data time-series information of different scales in the channel dimension.
  • the processing unit is further configured to divide the data of each channel group into N sub-data, and input the second sub-data into the one-dimensional convolution model in the time-series convolution model to obtain the second sub-series data; fuse the Kth sub-series data with the (K+1)th sub-data and perform preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; compute a first result by point-multiplying the value obtained by the preprocessing with the Kth sub-series data, and a second result by point-multiplying the difference between 1 and the value obtained by the preprocessing with the (K+1)th sub-data; and after fusing the first result and the second result, input them into the one-dimensional convolution model to obtain the (K+1)th sub-series data.
  • the processing unit is further configured to perform time-series convolution processing on the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data respectively, and then concatenate the time-series convolved data to obtain the second data.
  • the processing unit is further configured to use a three-dimensional convolution model to obtain data of adjacent frames of the image; and/or, after obtaining the second data, use a one-dimensional convolution model to perform a Channel dimension reduction processing is performed on the first data and the second data.
  • the computer program product provided by the embodiments of the present disclosure includes computer-executable instructions, and after the computer-executable instructions are executed, the above-mentioned behavior recognition method can be implemented.
  • Executable instructions are stored on the storage medium provided by the embodiments of the present disclosure, and when the executable instructions are executed by the processor, the above-mentioned behavior recognition method is implemented.
  • the electronic device includes a memory and a processor
  • the memory stores computer-executable instructions
  • the processor can implement the above-mentioned behavior recognition method when running the computer-executable instructions stored in the memory.
  • difference information of data of adjacent frames is obtained; based on the difference information, first data representing motion information in the data is determined; channel grouping is performed on the data of the adjacent frames based on features; the data of each channel group is processed in the time-series dimension to obtain second data; and a behavior recognition result is obtained based on the first data and the second data.
  • in this way, the difference information of the data of adjacent frames can be obtained and the background noise in the image can be eliminated, and processing each channel group in the time-series dimension realizes a larger temporal receptive field; therefore, the behavior recognition method provided by the embodiments of the present disclosure can improve the accuracy of behavior recognition in both the spatial and temporal dimensions.
  • FIG. 1 is a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of an optional processing flow for an identification device according to an embodiment of the present disclosure to obtain differential information of data of adjacent frames;
  • FIG. 3 is a schematic diagram of an optional processing flow of feature alignment performed on the data of the adjacent frames by an identification device according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of performing dimension reduction processing on features of different time frames according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of cascading features using a step cascade structure according to an embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of a data processing flow for adding an EMT module and a TSS module according to an embodiment of the present disclosure
  • FIG. 7 is a schematic diagram of the feature maps of data after passing through the EMT module according to an embodiment of the present disclosure;
  • FIG. 8 is a schematic structural composition diagram of a behavior recognition device provided by an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural composition diagram of an electronic device according to an embodiment of the disclosure.
  • the volume of video data has exploded in recent years.
  • behavior recognition has a wide range of needs in scenarios such as image recognition, human-computer interaction, and personalized recommendation.
  • the process of video-based action recognition is to determine the action category based on a given video containing an action.
  • the accuracy of judging behavior categories is an important indicator of behavior recognition, and motion modeling and time series modeling based on video data are important factors that affect the results of behavior recognition.
  • motion modeling based on video data most commonly models motion information between adjacent frames based on optical flow features.
  • two-stream based action recognition methods extract motion features through optical flow modeling.
  • the extraction and modeling process of optical flow requires a large amount of data calculation, which is difficult to apply to scenarios with high real-time requirements.
  • an important role of optical flow is to highlight moving objects when describing the motion relationship between adjacent frames
  • an alternative solution is to use the difference of features between different frames to approximate optical flow; however, when optical flow is approximated in this way, the edge motion information of moving objects and of non-moving objects is obtained at the same time; since a non-moving object belongs to the background part of the image, its edge motion information is noise, and the edge motion information of the stationary parts of a moving object is also interference noise.
  • the first scheme uses the structure of a two-dimensional (2D) convolutional neural network (CNN) plus an inter-frame aggregator.
  • the aggregator generally uses average/max pooling, three-dimensional (3D) convolution, or recurrent neural network (RNN) operations; this scheme simply performs frame-level score fusion or frame-level high-level feature fusion, but does not consider the aggregation of timing information at the feature level.
  • the second solution is to use 3D convolution, which uses 3D convolution to aggregate time sequence relationships at the feature level.
  • since 3D convolution has many parameters and a large amount of computation, 3D convolution is decoupled into 2D+1D convolution: the 2D convolution is responsible for spatial information modeling, and the 1D convolution is only responsible for temporal relationship modeling; however, 3D/(2D+1D) convolution only models the timing relationship within a local window, and for long-range timing relationships, vertically stacked convolution blocks are used to model long-range timing; this vertical structure makes the shallow temporal convolutions difficult to optimize.
  • an embodiment of the present disclosure proposes a behavior recognition method.
  • Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless otherwise indicated.
  • the execution subject of the behavior recognition method provided by the embodiments of the present disclosure may be a behavior recognition device, such as a terminal device or a server, or any other device with data processing capability; this is not limited in the embodiments of the present disclosure.
  • a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure includes at least the following steps:
  • Step S101 acquiring difference information of data of adjacent frames.
  • the behavior recognition device obtains differential information of data of adjacent frames.
  • the adjacent frames may be two adjacent frames in the video data.
  • an optional processing flow for the identification device to obtain differential information of data of adjacent frames includes at least the following steps:
  • Step S1a performing feature alignment on the data of the adjacent frames.
  • an optional processing flow for the identification device to perform feature alignment on the data of the adjacent frames includes at least the following steps:
  • Step S1a1 acquiring data features of two adjacent frames.
  • the shape of the features of the data may be N×F×C×H×W, where N represents the batch size, F is the number of frames, C is the number of channels, H is the height of a single-frame image, and W is the width of a single-frame image.
  • Step S1a2 performing dimension reduction processing on the channel dimension of the feature of each of the two adjacent frames.
  • the recognition device performs frame-level separation on the input feature X to obtain the feature of each individual frame.
  • the recognition device performs dimensionality reduction on the feature of each frame in the channel dimension, using a 1×1 convolution to compress the feature channels, as shown in formulas (1) and (2): formula (1) is the result of the channel-dimension reduction for the feature of the frame at time t, and formula (2) is the result for the feature of the frame at time t+1, as shown in Fig. 4.
  • l is the channel compression ratio, and the value of l can be flexibly set according to the actual application scenario, for example, it is set to a value such as 16.
  • Step S1a3 using the similarity matrix to perform feature alignment on the features of the two adjacent frames after dimension reduction processing.
  • the recognition device uses the similarity matrix to warp and align the adjacent frames, as shown in the following formulas (3) and (4).
  • the r() function is used to transform the size and shape of the feature.
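  • to make this concrete, the following is a minimal PyTorch sketch of one plausible reading of the alignment step (the bodies of formulas (1)-(4) are not reproduced in this extract): a 1×1 convolution compresses each frame's channels by the ratio l, a similarity matrix is computed between the two compressed feature maps, and the softmax-normalized matrix warps the frame at time t+1 onto the frame at time t; the class and variable names, the softmax normalization, and the warping direction are all assumptions rather than the patent's definitive formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlign(nn.Module):
    """Hypothetical sketch of formulas (1)-(4): channel compression plus
    similarity-matrix warping. Per-frame features are assumed to have
    shape [N, C, H, W]; l is the channel compression ratio (e.g. 16)."""
    def __init__(self, channels: int, l: int = 16):
        super().__init__()
        # formulas (1)/(2): 1x1 convolution compresses C channels to C // l
        self.compress = nn.Conv2d(channels, channels // l, kernel_size=1)

    def forward(self, x_t: torch.Tensor, x_t1: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x_t.shape
        q = self.compress(x_t).flatten(2)         # [N, C//l, H*W]
        k = self.compress(x_t1).flatten(2)        # [N, C//l, H*W]
        # similarity matrix between all spatial positions of the two frames
        sim = torch.matmul(q.transpose(1, 2), k)  # [N, H*W, H*W]
        attn = F.softmax(sim, dim=-1)
        # warp frame t+1 onto frame t; r() corresponds to the final reshape
        v = x_t1.flatten(2)                       # [N, C, H*W]
        aligned = torch.matmul(v, attn.transpose(1, 2))
        return aligned.view(n, c, h, w)
```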
  • Step S1b obtaining the foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure.
  • the identification device utilizes a set of 2D convolutions in a ladder structure to extract data representing motion information.
  • the set of ladder-structured 2D convolutions may consist of 2, 4, or 6 2D convolutions; taking a set of 4 2D convolutions as an example, the feature is divided into 4 parts along the channel dimension, and formulas (5), (6), (7) and (8) are used to perform convolution calculation on the aligned features, obtaining the foreground data of the adjacent frames from the feature-aligned data so as to obtain multi-scale motion information.
  • foreground data of adjacent frames can be extracted, that is, motion information can be extracted, while background data in adjacent frames is filtered out.
  • the right part of Fig. 4 is the principle flow chart of the multi-scale difference module (MSFD, Multi Scale Feature Difference).
  • N is 4 in Figure 4.
  • the first copy of the second frame data among the N copies is input into the two-dimensional convolution model, and the obtained first output result is differenced with the first copy of the first frame data to obtain the first difference result;
  • the two-dimensional convolution model is an N-order two-dimensional convolution model;
  • the first output result is added to the (M+1)th copy of the second frame data and input into the two-dimensional convolution model, and the obtained (M+1)th output result is differenced with the Mth copy of the first frame data to obtain the (M+1)th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1.
  • different motion change information is captured through a set of 2D convolutions, so that the subsequent motion difference information can more accurately describe and identify behaviors.
  • Step S1c performing differential processing on the foreground data to obtain differential information of the foreground data of the adjacent frames.
  • the identification device performs differential processing on the foreground data by using the following formula (9) and formula (10) to obtain differential information of the foreground data of adjacent frames.
  • the background motion noise can be eliminated.
  • the result obtained after the 5th data is subjected to 2D convolution processing is fused with the 6th data, and the result of 2D convolution processing of the fused 5th and 6th data is then fused with the 7th data.
  • M_out represents the difference information of the foreground data of adjacent frames.
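  • the ladder-structured extraction and differencing described above can be illustrated with the following hedged PyTorch sketch; the 3×3 kernel size, the choice of N = 4, and fusing the concatenated differences with a 1×1 convolution (standing in for the one-dimensional convolution mentioned in the text) are assumptions, not the patent's exact formulas (5)-(10).

```python
import torch
import torch.nn as nn

class MultiScaleDiff(nn.Module):
    """Hypothetical sketch of the ladder-structured 2D convolutions (MSFD):
    N cascaded convolutions over frame t+1, each stage differenced against
    frame t, with the N difference maps concatenated and fused."""
    def __init__(self, channels: int, n: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(n)
        )
        # the concatenated difference results are fused back to C channels
        self.fuse = nn.Conv2d(channels * n, channels, kernel_size=1)

    def forward(self, f_t: torch.Tensor, f_t1: torch.Tensor) -> torch.Tensor:
        diffs, out = [], f_t1
        for conv in self.convs:
            # stage m: the previous output is added to a fresh copy of frame t+1
            out = conv(out if not diffs else out + f_t1)
            diffs.append(out - f_t)  # difference against frame t
        return self.fuse(torch.cat(diffs, dim=1))
```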
  • Step S102 Based on the difference information, determine first data representing motion information in the data.
  • the channel weight is determined based on the differential information of the foreground data, that is, the differential information of the foreground data is used as the channel weight, and the foreground data is processed by using the channel weight to obtain the first data representing motion information in the data.
  • using the channel weight to process the foreground data may be using the channel weight to enhance the foreground data to obtain the first data representing motion information in the data.
  • the difference information may also be directly used as the first data representing the motion information.
  • the channel weight is determined by the following formula (11); the foreground data is enhanced by the following formula (12), that is, the foreground data is multiplied by the channel weight to obtain a product result, and the product result is added to the foreground data to obtain the first data, as sketched below.
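  • a compact sketch of formulas (11) and (12) as described here; deriving the channel weight from the differential information via global average pooling and a sigmoid is an assumption, since only the residual enhancement itself is spelled out in this extract.

```python
import torch

def enhance_foreground(fg: torch.Tensor, diff: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: a channel weight is derived from the differential
    information (formula (11), assumed here to be pooling + sigmoid), and the
    foreground features are residually enhanced (formula (12))."""
    # [N, C, H, W] -> [N, C, 1, 1] channel descriptor of the motion difference
    weight = torch.sigmoid(diff.mean(dim=(2, 3), keepdim=True))
    # product of foreground and weight, added back to the foreground
    return fg + fg * weight
```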
  • the inter-frame similarity matrix is used to realize feature alignment between frames, eliminating as much as possible the interference caused by background jitter.
  • the embodiment of the present disclosure uses a set of stepped 2D convolutions to extract data of different scales, and then performs differential processing on the data of different scales, thereby eliminating background noise in the image and obtaining motion information of different scales; this processing is performed by the enhanced motion transformation (EMT) module.
  • Step S103 Perform channel grouping on the data of the adjacent frames based on the feature.
  • the identification device processes the data of each channel group by using a time series convolution model, and fuses the processed data of all channel groups to obtain the second data.
  • Step S104 Process the data grouped by each channel in the time sequence dimension to obtain second data.
  • the identification device uses the time-series convolution model to fuse data time-series information of different scales in the channel dimension; for example, the identification device can use 1D time-series convolution to process the data of each channel group, as shown in the following formulas (13) to (15):
  • the data of each channel grouping is divided into N sub-data, the second sub-data is input into the one-dimensional convolution model in the described time series convolution model, and the second sub-series data is obtained;
  • after the Kth sub-series data and the (K+1)th sub-data are fused, preprocessing is performed on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; a first result is computed by point-multiplying the value obtained by the preprocessing with the Kth sub-series data, and a second result by point-multiplying the difference between 1 and the value obtained by the preprocessing with the (K+1)th sub-data; after the first result and the second result are fused, they are input into the one-dimensional convolution model to obtain the (K+1)th sub-series data.
  • the preprocessing may be channel fusion processing performed on the fused data, such as spatial average pooling (SAP), a fully connected (FC) layer, and a normalized exponential function (Softmax).
  • after the identification device uses 1D time-series convolution to process the data of each channel group, it fuses the processed data of all channel groups to obtain the second data; for example, the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data are respectively subjected to time-series convolution processing, and the time-series convolved data are then concatenated to obtain the second data.
  • formula (20) represents the result obtained by fusing the data of two adjacent channel groups.
  • the second sub-data is input into the one-dimensional convolution model in the time-series convolution model to obtain the second sub-series data;
  • the third sub-data and the second sub-series data are fused and preprocessed, and the result obtained is input into the one-dimensional convolution model in the time-series convolution model to obtain the third sub-series data;
  • the fourth sub-data and the third sub-series data are fused and preprocessed
  • the obtained result is input into the one-dimensional convolution model in the time-series convolution model to obtain the fourth sub-series data.
  • the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data are respectively subjected to time-series convolution processing, and the time-series convolved data are then concatenated to obtain the second data.
  • formula (25) represents the time series information of the feature.
  • time-series information of different scales is obtained by cascaded 1D convolutions, and the time-series information of different scales is connected in a step-wise manner to obtain a 1D convolution with a large receptive field; that is, the long-term temporal modeling (TSS, Temporal Step-Structure) module is used to process the data, as sketched below.
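  • the step-structured temporal processing of one channel group might look like the following PyTorch sketch; the gate built from spatial average pooling, a fully connected layer, and softmax follows the preprocessing described above, while the kernel sizes, the number of sub-parts, and folding the final per-part temporal convolution into the cascade are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalStepStructure(nn.Module):
    """Hypothetical sketch of the TSS module for one channel group. The input
    is assumed to have shape [N, C, T] (batch, channels, frames)."""
    def __init__(self, channels: int, n: int = 4):
        super().__init__()
        c = channels // n
        self.n, self.c = n, c
        # cascaded 1D temporal convolutions (the formulas (13)-(15) region)
        self.tconvs = nn.ModuleList(
            nn.Conv1d(c, c, kernel_size=3, padding=1) for _ in range(n - 1)
        )
        # FC layers for the SAP + FC + softmax preprocessing gate
        self.gates = nn.ModuleList(nn.Linear(c, c) for _ in range(n - 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        subs = torch.split(x, self.c, dim=1)   # divide into N sub-data
        outs = [subs[0]]                       # the 1st sub-data is kept as-is
        prev = self.tconvs[0](subs[1])         # 2nd sub-series data
        outs.append(prev)
        for k in range(2, self.n):
            fused = prev + subs[k]             # fuse Kth series with (K+1)th sub-data
            # preprocessing: spatial average pooling, FC, softmax -> gate value
            g = F.softmax(self.gates[k - 2](fused.mean(dim=2)), dim=1).unsqueeze(-1)
            mixed = g * prev + (1.0 - g) * subs[k]  # first result + second result
            prev = self.tconvs[k - 1](mixed)   # (K+1)th sub-series data
            outs.append(prev)
        # concatenating all parts yields the second data (larger receptive field)
        return torch.cat(outs, dim=1)
```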
  • Step S105 Obtain a behavior recognition result based on the first data and the second data.
  • the first data is data from which background motion noise has been eliminated from the video data
  • the second data is long-sequence data
  • the identification device accurately performs behavior recognition on the input video data based on the first data and the second data, such as judging the behavior category.
  • steps S101 to S102 and steps S103 to S104 do not have a fixed order of execution; steps S101 and S102 may be executed first and then steps S103 and S104, or steps S103 and S104 may be executed first and then steps S101 and S102.
  • the first data may be acquired first, and then the second data may be acquired, or the second data may be acquired first, and then the first data may be acquired.
  • FIG. 6 is a schematic diagram of a data processing flow with an EMT module and a TSS module added, according to an embodiment of the present disclosure. As shown in FIG. 6, an EMT module and a TSS module are added on the basis of the existing behavior recognition method, and channel dimension reduction processing is performed on the data before the EMT module processes the data.
  • the added EMT module and TSS module constitute a temporal motion model (TMM, Temporal and Motion Module), and the TMM can be embedded into an existing behavior recognition model such as a two-dimensional residual network (2D ResNet) model; as shown in Figure 6, the TMM is embedded after the three-dimensional convolution model used to obtain the data of adjacent frames of the image, and before the one-dimensional convolution model that performs channel dimension reduction processing on the first data and the second data.
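  • the overall TMM pipeline of FIG. 6 can be summarized functionally as below; whether the first data and second data are fused by concatenation or by addition is not specified in this extract, so the concatenation and all callable names are assumptions standing in for the modules sketched earlier.

```python
import torch

def tmm_forward(x, conv3d, emt, tss, conv1x1):
    """Hypothetical TMM pipeline per FIG. 6: a 3D convolution acquires the
    adjacent-frame data, EMT and TSS produce the first and second data,
    and a final convolution reduces the channel dimension."""
    feats = conv3d(x)        # acquire adjacent-frame data
    first = emt(feats)       # first data: motion-enhanced features (EMT)
    second = tss(feats)      # second data: long-range temporal features (TSS)
    fused = torch.cat([first, second], dim=1)  # fuse the two branches (assumption)
    return conv1x1(fused)    # channel dimension reduction
```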
  • the feature maps of the data after passing through the EMT module are shown in Figure 7: the first row shows the original frames of the input data, and the second and third rows show the feature maps of the data output after passing through the EMT module; the feature response in the motion area is obvious.
  • the behavior recognition method provided by the embodiment of the present disclosure can be applied to at least scenarios such as video recommendation, image recognition, and human-computer interaction.
  • FIG. 8 is a schematic structural composition diagram of the behavior recognition apparatus 200 provided by the embodiment of the present disclosure, and the device includes:
  • an acquisition unit 201 configured to acquire differential information of data of adjacent frames
  • a determining unit 202 configured to determine, based on the difference information, first data representing motion information in the data
  • a grouping unit 203 configured to perform channel grouping on the data of the adjacent frames based on the feature
  • the processing unit 204 is configured to process the data grouped by each channel in the time series dimension to obtain second data;
  • the identification unit 205 is configured to obtain a behavior identification result based on the first data and the second data.
  • the obtaining unit 201 is further configured to perform feature alignment on the data of the adjacent frames;
  • the obtaining unit 201 is further configured to obtain features of data of two adjacent frames
  • a similarity matrix is used to perform feature alignment on the features of two adjacent frames after dimension reduction.
  • the obtaining unit 201 is further configured to copy the first frame data and the second frame data in the foreground data of the adjacent frames into N copies respectively;
  • the first copy of the second frame data among the N copies is input into the two-dimensional convolution model, and the obtained first output result is differenced with the first copy of the first frame data to obtain the first difference result;
  • the two-dimensional convolution model is an N-order two-dimensional convolution model;
  • the determining unit 202 is further configured to determine the channel weight based on the differential information
  • the foreground data in the data is processed based on the channel weight to obtain the first data.
  • the determining unit 202 is further configured to add the foreground data to the product of the foreground data and the channel weight to obtain the first data.
  • the processing unit 204 is further configured to process the data grouped by each channel by using a time series convolution model
  • the processed data of all channel groups are fused to obtain the second data.
  • the processing unit 204 is further configured to use the time series convolution model to fuse data time series information of different scales in the channel dimension.
  • the processing unit 204 is further configured to divide the data of each channel group into N sub-data, and input the second sub-data into the one-dimensional convolution model in the time-series convolution model to obtain the second sub-series data;
  • K is greater than or equal to 2
  • K is less than or equal to N-1
  • after the first result and the second result are fused, they are input into the one-dimensional convolution model to obtain the (K+1)th sub-series data.
  • the processing unit 204 is further configured to perform time-series convolution processing on the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data respectively, and then concatenate the time-series convolved data to obtain the second data.
  • the processing unit 204 is further configured to acquire data of adjacent frames of the image by using a three-dimensional convolution model
  • a one-dimensional convolution model is used to perform channel dimension reduction processing on the first data and the second data.
  • each unit in the behavior recognition apparatus shown in FIG. 8 can be understood with reference to the relevant description of the behavior recognition method described above.
  • the function of each unit in the behavior recognition device shown in FIG. 8 can be realized by a program running on a processor, or can be realized by a logic circuit.
  • if the above-mentioned behavior recognition device of the embodiment of the present disclosure is implemented in the form of a software function module and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solutions of the embodiments of the present disclosure, or the part that contributes to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for enabling an electronic device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or other media that can store program code.
  • embodiments of the present disclosure are not limited to any particular combination of hardware and software.
  • an embodiment of the present disclosure further provides a computer program product, in which computer-executable instructions are stored, and when the computer-executable instructions are executed, the above-mentioned behavior identification method of the embodiment of the present disclosure can be implemented.
  • An embodiment of the present disclosure further provides a storage medium, where executable instructions are stored on the storage medium, and when the executable instructions are executed by a processor, the above-mentioned behavior identification method is implemented.
  • FIG. 9 is a schematic structural composition diagram of the electronic device according to the embodiment of the present disclosure.
  • the electronic device 300 may include one or more processors 301 (only one is shown in the figure; the processor 301 may include, but is not limited to, a processing means such as a microcontroller unit (MCU) or a field programmable gate array (FPGA)), a memory 302 for storing data, and a transmission means 303 for communication functions.
  • FIG. 9 is only a schematic diagram, which does not limit the structure of the above-mentioned electronic device.
  • the electronic device 300 may also include more or fewer components than shown in FIG. 9 , or have a different configuration than that shown in FIG. 9 .
  • the memory 302 can be used to store software programs and modules of application software, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure, and the processor 301 executes various functional applications by running the software programs and modules stored in the memory 302 And data processing, that is, to realize the above method.
  • Memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • memory 302 may further include memory located remotely from processor 301, which may be connected to electronic device 300 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the memory 302 may be either volatile memory or non-volatile memory, and may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory.
  • the volatile memory may be a random access memory (RAM), which serves as an external cache; by way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDRSDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
  • the memory 302 in the embodiment of the present disclosure is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program, such as an application, used to operate on a communication device. A program implementing the method of the embodiment of the present disclosure may be included in an application program.
  • Transmission means 303 is used to receive or transmit data via a network.
  • the aforementioned network may include a wireless network provided by a communication provider of the electronic device 300 .
  • the transmission device 303 includes a network adapter (NIC, Network Interface Controller), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 303 may be a radio frequency (RF, Radio Frequency) module, which is used to communicate with the Internet in a wireless manner.
  • the disclosed method and smart device may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
  • the unit described above as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place or distributed to multiple network units;
  • the purpose of the solution in this embodiment can be achieved by selecting any unit or all of the units according to actual needs.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may be used separately as a unit, or two or more units may be integrated into one unit;
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the embodiment of the present disclosure obtains difference information of the data of adjacent frames; determines, based on the difference information, the first data representing motion information in the data; performs channel grouping on the data of the adjacent frames based on features; processes the data of each channel group in the time-series dimension to obtain second data; and obtains a behavior recognition result based on the first data and the second data.
  • in this way, the difference information of the data of adjacent frames can be obtained and the background noise in the image can be eliminated, and processing each channel group in the time-series dimension realizes a larger temporal receptive field; therefore, the behavior recognition method provided by the embodiments of the present disclosure can improve the accuracy of behavior recognition in both the spatial and temporal dimensions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a behavior recognition method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring differential information of data of adjacent frames; on the basis of the differential information, determining first data in the data that represents motion information; performing channel grouping on the data of the adjacent frames on the basis of a feature; processing data of each channel group in the dimension of a time sequence, so as to obtain second data; and obtaining a behavior recognition result on the basis of the first data and the second data.

Description

Behavior Recognition Method and Device, Electronic Device, and Storage Medium
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure is based on, and claims priority to, the Chinese patent application with application number 202110198255.3, filed on February 22, 2021 and entitled "Behavior Recognition Method and Device, Electronic Equipment and Storage Medium"; the entire contents of the above-mentioned Chinese patent application are hereby incorporated into the present disclosure by reference.
TECHNICAL FIELD
The present disclosure relates to the technical field of computer vision, and in particular to a behavior recognition method and device, an electronic device, and a storage medium.
BACKGROUND
In the field of computer vision technology, behavior recognition based on video data plays an extremely important role in many fields such as video recommendation, image recognition, and human-computer interaction; therefore, the accuracy of recognizing behaviors in images is particularly important, and high-accuracy behavior recognition is a constant goal of computer vision technology.
SUMMARY OF THE INVENTION
In order to solve the above technical problems, embodiments of the present disclosure provide a behavior recognition method and apparatus, an electronic device, and a storage medium.
The behavior recognition method provided by the embodiments of the present disclosure includes:
acquiring difference information of data of adjacent frames;
determining, based on the difference information, first data representing motion information in the data;
performing channel grouping on the data of the adjacent frames based on features;
processing the data of each channel group in the time-series dimension to obtain second data; and
obtaining a behavior recognition result based on the first data and the second data.
In some embodiments, the acquiring difference information of data of adjacent frames includes:
performing feature alignment on the data of the adjacent frames;
obtaining foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure; and
performing differential processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.
In some embodiments, the performing feature alignment on the data of the adjacent frames includes:
acquiring features of the data of two adjacent frames;
performing dimension reduction processing in the channel dimension on the feature of each of the two adjacent frames; and
using a similarity matrix to perform feature alignment on the features of the two adjacent frames after dimension reduction.
In some embodiments, the obtaining foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure includes:
copying each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies;
inputting the first copy of the second frame data among the N copies into a two-dimensional convolution model to obtain a first output result, and differencing the first output result with the first copy of the first frame data to obtain a first difference result, the two-dimensional convolution model being an N-order two-dimensional convolution model;
adding the first output result to the (M+1)th copy of the second frame data, inputting the sum into the two-dimensional convolution model to obtain an (M+1)th output result, and differencing the (M+1)th output result with the Mth copy of the first frame data to obtain an (M+1)th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; and
concatenating the first to Nth difference results and inputting them into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
In some embodiments, the determining, based on the difference information, first data representing motion information in the data includes:
determining a channel weight based on the difference information; and
processing foreground data in the data based on the channel weight to obtain the first data.
In some embodiments, the processing foreground data in the data based on the channel weight to obtain the first data includes:
multiplying the foreground data by the channel weight to obtain a product result, and adding the product result to the foreground data to obtain the first data.
In some embodiments, the processing the data of each channel group in the time-series dimension to obtain second data includes:
processing the data of each channel group by using a time-series convolution model; and
fusing the processed data of all channel groups to obtain the second data.
In some embodiments, the processing the data of each channel group by using a time-series convolution model includes:
using the time-series convolution model to fuse data time-series information of different scales in the channel dimension.
In some embodiments, the using the time-series convolution model to fuse data time-series information of different scales in the channel dimension includes:
dividing the data of each channel group into N sub-data, and inputting the second sub-data among the N sub-data into a one-dimensional convolution model in the time-series convolution model to obtain second sub-series data;
fusing the Kth sub-series data with the (K+1)th sub-data, and performing preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1;
computing a first result by point-multiplying the value obtained by the preprocessing with the Kth sub-series data, and computing a second result by point-multiplying the difference between 1 and the value obtained by the preprocessing with the (K+1)th sub-data; and
after fusing the first result and the second result, inputting them into the one-dimensional convolution model to obtain (K+1)th sub-series data.
In some embodiments, the fusing the processed data of all channel groups to obtain the second data includes:
performing time-series convolution processing on the first sub-data, the second sub-series data, the third sub-series data, up to the Nth sub-series data respectively, and then concatenating the time-series convolved data to obtain the second data.
In some embodiments, before the acquiring difference information of the data of adjacent frames of the image, the method further includes: acquiring the data of the adjacent frames of the image by using a three-dimensional convolution model;
and/or, after the second data is obtained, the method further includes: performing channel dimension reduction processing on the first data and the second data.
In some embodiments, the behavior recognition method is implemented by a temporal motion model, and the temporal motion model includes an enhanced motion transformation module and a long-term temporal modeling module;
the enhanced motion transformation module is configured to acquire the difference information of the data of adjacent frames of the image, and determine, based on the difference information, the first data representing motion information in the data;
the long-term temporal modeling module is configured to perform channel grouping on the data of the adjacent frames based on features, and process the data of each channel group in the time-series dimension to obtain the second data.
In some embodiments, the temporal motion model is embedded after a three-dimensional convolution model, and the three-dimensional convolution model is used to acquire the data of the adjacent frames of the image;
and/or, the temporal motion model is embedded before a one-dimensional convolution model, and the one-dimensional convolution model is used to perform channel dimension reduction processing on the first data and the second data.
The behavior recognition device provided by the embodiments of the present disclosure includes:
an acquisition unit configured to acquire difference information of data of adjacent frames;
a determining unit configured to determine, based on the difference information, first data representing motion information in the data;
a grouping unit configured to perform channel grouping on the data of the adjacent frames based on features;
a processing unit configured to process the data of each channel group in the time-series dimension to obtain second data; and
an identification unit configured to obtain a behavior recognition result based on the first data and the second data.
In some embodiments, the acquisition unit is further configured to: perform feature alignment on the data of the adjacent frames; obtain foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure; and perform difference processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.

In some embodiments, the acquisition unit is further configured to: acquire features of the data of two adjacent frames; perform dimension reduction in the channel dimension on the features of each frame; and use a similarity matrix to align the dimension-reduced features of the two adjacent frames.

In some embodiments, the acquisition unit is further configured to: copy each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies; input the first copy of the second frame data among the N copies into a two-dimensional convolution model, and take the difference between the resulting first output result and the first copy of the first frame data to obtain a first difference result, the two-dimensional convolution model being an N-order two-dimensional convolution model; add the first output result to the (M+1)-th copy of the second frame data, input the sum into the two-dimensional convolution model, and take the difference between the resulting (M+1)-th output result and the M-th copy of the first frame data to obtain an (M+1)-th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; and concatenate the first difference result through the N-th difference result and input the concatenation into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.

In some embodiments, the determination unit is further configured to: determine channel weights based on the difference information; and process the foreground data in the data based on the channel weights to obtain the first data.

In some embodiments, the determination unit is further configured to add the foreground data to the product of the foreground data and the channel weights to obtain the first data.
In some embodiments, the processing unit is further configured to: process the data of each channel group using a temporal convolution model; and fuse the processed data of all channel groups to obtain the second data.

In some embodiments, the processing unit is further configured to use the temporal convolution model to fuse temporal information of data at different scales in the channel dimension.

In some embodiments, the processing unit is further configured to: divide the data of each channel group into N pieces of sub-data, and input the second piece of sub-data into a one-dimensional convolution model in the temporal convolution model to obtain second temporal sub-data; fuse the K-th temporal sub-data with the (K+1)-th piece of sub-data and perform preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; compute a first result by element-wise multiplication of the preprocessed value with the K-th temporal sub-data, and a second result by element-wise multiplication of the difference between 1 and the preprocessed value with the (K+1)-th piece of sub-data; and fuse the first result with the second result and input the fusion into the one-dimensional convolution model to obtain the (K+1)-th temporal sub-data.

In some embodiments, the processing unit is further configured to apply temporal convolution to the first piece of sub-data, the second temporal sub-data, the third temporal sub-data and so on up to the N-th temporal sub-data, and then concatenate the temporally convolved data to obtain the second data.

In some embodiments, the processing unit is further configured to acquire the data of the adjacent frames of the image using a three-dimensional convolution model; and/or, after the second data is obtained, perform channel dimension reduction on the first data and the second data using a one-dimensional convolution model.
A computer program product provided by an embodiment of the present disclosure includes computer-executable instructions which, when executed, implement the behavior recognition method described above.

A storage medium provided by an embodiment of the present disclosure stores executable instructions which, when executed by a processor, implement the behavior recognition method described above.

An electronic device provided by an embodiment of the present disclosure includes a memory and a processor, where the memory stores computer-executable instructions, and the processor implements the behavior recognition method described above when running the computer-executable instructions stored on the memory.
In the behavior recognition method provided by the embodiments of the present disclosure, difference information of data of adjacent frames is acquired; based on the difference information, first data representing motion information in the data is determined; the data of the adjacent frames is grouped by channel based on features; the data of each channel group is processed in the temporal dimension to obtain second data; and a behavior recognition result is obtained based on the first data and the second data. In this way, acquiring the difference information of the data of adjacent frames eliminates background noise in the image, and processing each channel group in the temporal dimension achieves a larger temporal receptive field; the behavior recognition method provided by the embodiments of the present disclosure can therefore improve the accuracy of behavior recognition in both the spatial and temporal dimensions.
To make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of the Drawings

FIG. 1 is a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart of an optional process by which the recognition apparatus acquires difference information of data of adjacent frames according to an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of an optional process by which the recognition apparatus performs feature alignment on the data of the adjacent frames according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of dimension reduction performed on features of frames at different times according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of cascading features using a step-cascade structure according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a data processing flow with an EMT module and a TSS module added according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of feature maps of data after passing through the EMT module according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a behavior recognition apparatus provided by an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Before describing the embodiments of the present disclosure in detail, behavior recognition is briefly explained.

With the advancement of foundational technologies such as cloud computing and 5G (5th Generation Mobile Communication Technology), video data has grown explosively in recent years. To fully exploit the value of video data, more and more researchers have entered the field of video understanding. As a fundamental task of video understanding, behavior recognition is widely needed in scenarios such as image recognition, human-computer interaction and personalized recommendation. Video-based behavior recognition determines the behavior category of a given video containing an action. The accuracy of this judgment is an important indicator of behavior recognition, and motion modeling and temporal modeling based on video data are key factors affecting the recognition result.

Motion modeling based on video data most commonly models the motion information between adjacent frames using optical flow features. Typically, two-stream action recognition methods extract motion features through optical flow modeling. However, the extraction and modeling of optical flow are computationally expensive and difficult to apply in scenarios with high real-time requirements. Since an important role of optical flow is to highlight moving objects when describing the motion relationship between adjacent frames, an alternative is to approximate optical flow using the feature differences between frames. However, this approximation yields edge motion information for both moving and non-moving objects; since non-moving objects belong to the background of the image, their edge motion information is noise, and the edge motion information of the stationary parts of a moving object is likewise interference noise.

There are two main schemes for temporal modeling based on video data. The first adopts a two-dimensional (2D) convolutional neural network (CNN, Convolutional Neural Networks) followed by an inter-frame aggregator, which generally uses operations such as avg/max pooling, three-dimensional (3D) convolution or a recurrent neural network (RNN, Recurrent Neural Network). This scheme simply performs frame-level score fusion or frame-level high-level feature fusion, without considering the aggregation of temporal information at the feature level. The second scheme adopts 3D convolution to aggregate temporal relationships at the feature level; because 3D convolution has many parameters and a large computational cost, it is decoupled into 2D+1D convolution, where the 2D convolution models spatial information and the 1D convolution is responsible only for temporal modeling. However, both 3D and (2D+1D) convolutions model only the temporal relationships within a local window; long-range temporal relationships rely on vertically stacked convolution blocks, and such a vertical structure is difficult to optimize for shallow temporal convolutions.
Based on this, the embodiments of the present disclosure propose a behavior recognition method. Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present application.

Meanwhile, it should be understood that, for convenience of description, the dimensions of the various parts shown in the accompanying drawings are not drawn to actual scale.

The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or uses.

Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods and devices should be considered part of the specification.

It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be discussed further in subsequent figures.

A behavior recognition method involved in the embodiments of the present disclosure is described in detail below. The execution subject of the behavior recognition method provided by the embodiments of the present disclosure may be a behavior recognition apparatus, such as a terminal device or a server, or any other processing device with data processing capability, which is not limited in the embodiments of the present disclosure.
Referring to FIG. 1, a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure includes at least the following steps:

Step S101: acquire difference information of data of adjacent frames.

In some embodiments, the behavior recognition apparatus (hereinafter referred to as the recognition apparatus) acquires difference information of data of adjacent frames, where the adjacent frames may be two adjacent frames in video data.

In some embodiments, an optional process by which the recognition apparatus acquires the difference information of the data of adjacent frames, as shown in FIG. 2, includes at least the following steps:

Step S1a: perform feature alignment on the data of the adjacent frames.

In some embodiments, an optional process by which the recognition apparatus performs feature alignment on the data of the adjacent frames, as shown in FIG. 3, includes at least the following steps:

Step S1a1: acquire features of the data of two adjacent frames.
In some embodiments, the shape of the features of the data may be $X \in \mathbb{R}^{N \times F \times C \times H \times W}$, where N is the batch size, F is the number of frames, C is the number of channels, H is the height of a single-frame image, and W is the width of a single-frame image.
Step S1a2: perform dimension reduction in the channel dimension on the features of each of the two adjacent frames.
In some embodiments, the recognition apparatus separates the input feature X at the frame level to obtain per-frame features $X_t \in \mathbb{R}^{[C,\,H,\,W]}$. To reduce the amount of computation, the recognition apparatus performs dimension reduction in the channel dimension on the features of each frame, compressing the feature channels with a 1×1 convolution, as shown in formulas (1) and (2) below. Formula (1) gives the result of channel-dimension reduction for the features of the frame at time t shown in FIG. 4, and formula (2) gives the result for the frame at time t+1.
$x_t = \mathrm{Conv1D}(X_t),\quad x_t \in \mathbb{R}^{[C/l,\,H,\,W]}$    (1)

$x_{t+1} = \mathrm{Conv1D}(X_{t+1}),\quad x_{t+1} \in \mathbb{R}^{[C/l,\,H,\,W]}$    (2)
where l is the channel compression rate; the value of l can be set flexibly according to the actual application scenario, for example to 16.
Step S1a3: use the similarity matrix to align the dimension-reduced features of the two adjacent frames.

In some embodiments, the recognition apparatus uses the similarity matrix to warp and align the adjacent frames, as shown in formulas (3) and (4) below.
[Formulas (3) and (4), given as images in the source, construct the inter-frame similarity matrix from $x_t$ and $x_{t+1}$ and apply it to obtain the aligned feature $A(x_{t+1})$.]
where the r() function is used to transform the size and shape of features.
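For illustration only, the following is a minimal PyTorch-style sketch of this alignment step. Since formulas (3) and (4) are not reproduced above, the exact form of the similarity matrix is an assumption: this sketch uses a softmax-normalized affinity over spatial positions, and the module name FrameAlign, the 3×3-free 1×1 compression and the default compression rate l=16 are likewise illustrative, not fixed by the source.

```python
import torch
import torch.nn as nn


class FrameAlign(nn.Module):
    # Sketch: compress channels with a 1x1 convolution (formulas (1)-(2)),
    # then warp frame t+1 toward frame t with a similarity matrix
    # (the assumed reading of formulas (3)-(4)).
    def __init__(self, channels: int, l: int = 16):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // l, kernel_size=1)

    def forward(self, x_t: torch.Tensor, x_t1: torch.Tensor) -> torch.Tensor:
        # x_t, x_t1: [N, C, H, W] features of two adjacent frames
        a = self.reduce(x_t)                 # [N, C/l, H, W]
        b = self.reduce(x_t1)                # [N, C/l, H, W]
        n, c, h, w = a.shape
        a = a.flatten(2)                     # r(): reshape to [N, C/l, H*W]
        b = b.flatten(2)
        # Assumed similarity matrix: pairwise affinity over spatial positions.
        sim = torch.softmax(a.transpose(1, 2) @ b, dim=-1)    # [N, H*W, H*W]
        aligned = (b @ sim.transpose(1, 2)).view(n, c, h, w)  # A(x_{t+1})
        return aligned
```

In this sketch the r() reshaping is realized by flatten/view, and the returned tensor plays the role of the aligned feature $A(x_{t+1})$ used in formulas (5) to (8) below.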
Step S1b: obtain foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure.

In some embodiments, the recognition apparatus uses a group of 2D convolutions in a ladder structure to extract the data representing motion information. The group may contain, for example, 2, 4 or 6 2D convolutions. Taking a group of 4 2D convolutions as an example below, the features are divided into 4 parts by channel, and the aligned features are convolved using formulas (5) to (8) below to obtain the foreground data of the adjacent frames from the feature-aligned data, yielding multi-scale motion information.
$m_{s=0} = \mathrm{Conv2D}(r(A(x_{t+1})))$    (5)

$m_{s=1} = \mathrm{Conv2D}(m_{s=0} + r(A(x_{t+1})))$    (6)

$m_{s=2} = \mathrm{Conv2D}(m_{s=1} + r(A(x_{t+1})))$    (7)

$m_{s=3} = \mathrm{Conv2D}(m_{s=2} + r(A(x_{t+1})))$    (8)
Through formulas (5) to (8), the foreground data of the adjacent frames, i.e., the motion information, can be extracted, and the background data in the adjacent frames can be filtered out and removed.

For example, the right part of FIG. 4 is a schematic flowchart of the multi-scale feature difference module (MSFD, Multi Scale Feature Difference). As shown in FIG. 4, each of the first frame data and the second frame data in the foreground data of the adjacent frames is copied (expanded) into N copies (N is 4 in FIG. 4); the first copy of the second frame data among the N copies is input into a two-dimensional convolution model, and the difference between the resulting first output result and the first copy of the first frame data is taken to obtain a first difference result; the two-dimensional convolution model is an N-order two-dimensional convolution model; the first output result is added to the (M+1)-th copy of the second frame data and the sum is input into the two-dimensional convolution model, and the difference between the resulting (M+1)-th output result and the M-th copy of the first frame data is taken to obtain the (M+1)-th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; the first difference result through the N-th difference result are concatenated and summed, and then input into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.

In the embodiments of the present disclosure, a group of 2D convolutions captures different motion change information, so that the subsequent motion difference information characterizes and identifies behaviors more accurately.
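The ladder of formulas (5) to (8) can be sketched as follows; this is an illustrative reading, and the 3×3 kernel size, the padding and the module name are assumptions not fixed by the source.

```python
import torch
import torch.nn as nn


class LadderConv2D(nn.Module):
    # Sketch of the ladder of N 2D convolutions in formulas (5)-(8):
    # each stage re-injects the aligned feature r(A(x_{t+1})) before convolving.
    def __init__(self, channels: int, n_stages: int = 4):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(n_stages)]
        )

    def forward(self, aligned: torch.Tensor) -> list:
        # aligned: r(A(x_{t+1})), shape [N, C, H, W]
        outs = [self.stages[0](aligned)]          # m_{s=0}, formula (5)
        for conv in self.stages[1:]:
            outs.append(conv(outs[-1] + aligned)) # m_{s=k}, formulas (6)-(8)
        return outs                               # multi-scale motion features
```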
Step S1c: perform difference processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.

In some embodiments, the recognition apparatus performs difference processing on the foreground data using formulas (9) and (10) below to obtain the difference information of the foreground data of the adjacent frames.

In the embodiments of the present disclosure, acquiring the difference information of the foreground data of adjacent frames eliminates background motion noise.

Taking FIG. 4 as an example, the data at time t=i is copied into 4 copies, namely the 1st, 2nd, 3rd and 4th copies, and the data at time t=i+1 is copied into 4 copies, namely the 5th, 6th, 7th and 8th copies. The result of applying 2D convolution to the 5th copy is fused with the 6th copy; the result of applying 2D convolution to the fusion of the 5th and 6th copies is fused with the 7th copy; the result of applying 2D convolution to that fusion is in turn fused with the 8th copy. Each fusion result is differenced against the 1st through 4th copies respectively, and the four difference results constitute the difference information.
$M_{diff} = (m_{s=0} - x_t) + (m_{s=1} - x_t) + (m_{s=2} - x_t) + (m_{s=3} - x_t)$    (9)

$M_{out} = \mathrm{Conv1D}(M_{diff}),\quad M_{out} \in \mathbb{R}^{[C,\,H,\,W]}$    (10)
where $M_{out}$ denotes the difference information of the foreground data of the adjacent frames.
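A minimal sketch of formulas (9) and (10) follows. The source writes the channel-restoring operation as Conv1D; realizing it as a 1×1 2D convolution over the [C/l, H, W] feature map is an assumption of this sketch, as is the helper name.

```python
import torch
import torch.nn as nn


def motion_difference(ms: list, x_t: torch.Tensor, restore: nn.Module) -> torch.Tensor:
    # ms: ladder outputs m_{s=0..3}; x_t: compressed frame-t feature [N, C/l, H, W]
    # Formula (9): M_diff = sum over stages of (m_s - x_t)
    m_diff = sum(m - x_t for m in ms)
    # Formula (10): restore the channel dimension to C ("Conv1D" in the source;
    # assumed here to be a 1x1 convolution)
    return restore(m_diff)  # M_out in R^[C, H, W]


# Example wiring (hypothetical sizes): restore = nn.Conv2d(C // l, C, kernel_size=1)
```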
Step S102: based on the difference information, determine the first data representing motion information in the data.

In some embodiments, channel weights are determined based on the difference information of the foreground data; that is, the difference information of the foreground data serves as the channel weights, and the foreground data is processed using the channel weights to obtain the first data representing motion information in the data.

Here, processing the foreground data using the channel weights may consist of enhancing the foreground data with the channel weights to obtain the first data representing motion information in the data.

In other embodiments, after the difference information of the foreground data of the adjacent frames is acquired, the difference information may also be used directly as the first data representing motion information.

In some embodiments, the channel weights are determined by formula (11) below, and the foreground data is enhanced by formula (12) below; that is, the foreground data is multiplied by the channel weights to obtain a product result, and the product result is added to the foreground data to obtain the first data.
$W = \mathrm{sigmoid}(\mathrm{AvgPooling}(M_{out})) \in \mathbb{R}^{[C,\,1,\,1]}$    (11)

$\mathrm{Enhanced}(X_t) = X_t + X_t \odot W \in \mathbb{R}^{[C,\,H,\,W]}$    (12)
Thus, in the embodiments of the present disclosure, the inter-frame similarity matrix is used to align features between frames, eliminating as far as possible the interference caused by background jitter. Meanwhile, considering the diversity of motion information, the embodiments of the present disclosure use a group of ladder 2D convolutions to extract data at different scales and then perform difference processing on the data at different scales, which eliminates the background noise in the image and yields motion saliency information at different scales; finally, the motion saliency information is used to enhance the motion change regions. That is, an enhanced motion transformer (EMT, Enhanced Motion Transformer) module is added to process the data.
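Formulas (11) and (12) amount to a channel-attention style enhancement, which can be sketched as follows; realizing AvgPooling as adaptive average pooling is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F


def enhance(x_t: torch.Tensor, m_out: torch.Tensor) -> torch.Tensor:
    # Formula (11): channel weights from global average pooling + sigmoid.
    w = torch.sigmoid(F.adaptive_avg_pool2d(m_out, 1))  # [N, C, 1, 1]
    # Formula (12): residual enhancement of the motion regions.
    return x_t + x_t * w
```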
Step S103: perform channel grouping on the data of the adjacent frames based on features.

In some embodiments, the recognition apparatus processes the data of each channel group using a temporal convolution model, and fuses the processed data of all channel groups to obtain the second data.

In some embodiments, to reduce the amount of computation, the recognition apparatus performs channel grouping on the input features $X \in \mathbb{R}^{NF \times C \times H \times W}$ to obtain $X_{g=i} \in \mathbb{R}^{[NF,\,C/4,\,H,\,W]}$.

Step S104: process the data of each channel group in the temporal dimension to obtain second data.

In some embodiments, the recognition apparatus uses the temporal convolution model to fuse temporal information of data at different scales in the channel dimension; for example, the recognition apparatus may process the data of each channel group with a 1D temporal convolution, as shown in formulas (13) to (15) below:
$r\_out = \mathrm{reshape}(X) \in \mathbb{R}^{[NHW,\,C,\,F]}$    (13)

$r\_out = \mathrm{Conv1D}(r\_out) \in \mathbb{R}^{[NHW,\,C,\,F]}$    (14)

$out = \mathrm{reshape}(r\_out) \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (15)
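The reshape-convolve-reshape scheme of formulas (13) to (15) can be sketched as follows; the helper name, the explicit n/f arguments and the kernel size of the 1D convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn


def temporal_conv(x: torch.Tensor, n: int, f: int, conv1d: nn.Conv1d) -> torch.Tensor:
    # x: [N*F, C, H, W] -> formula (13): reshape to [N*H*W, C, F]
    nf, c, h, w = x.shape
    r = x.view(n, f, c, h, w).permute(0, 3, 4, 2, 1).reshape(n * h * w, c, f)
    r = conv1d(r)  # formula (14): 1D convolution over the F (time) axis
    # formula (15): reshape back to [N*F, C, H, W]
    return r.view(n, h, w, c, f).permute(0, 4, 3, 1, 2).reshape(nf, c, h, w)


# Example wiring (hypothetical sizes): conv1d = nn.Conv1d(C, C, kernel_size=3, padding=1)
```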
The data of each channel group is divided into N pieces of sub-data, and the second piece of sub-data is input into the one-dimensional convolution model in the temporal convolution model to obtain second temporal sub-data. The K-th temporal sub-data is fused with the (K+1)-th piece of sub-data, and preprocessing is performed on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1. A first result is computed as the element-wise product of the preprocessed value and the K-th temporal sub-data, and a second result is computed as the element-wise product of the difference between 1 and the preprocessed value and the (K+1)-th piece of sub-data. The first result is fused with the second result and input into the one-dimensional convolution model to obtain the (K+1)-th temporal sub-data. The preprocessing may be channel fusion processing performed on the fused data, such as spatial average pooling (SAP, Spatial Average Pooling), a fully connected layer (FC, Fully Connected) and a normalized exponential function (Softmax).

After the recognition apparatus processes the data of each channel group with the 1D temporal convolution, it fuses the processed data of all channel groups to obtain the second data. For example, temporal convolution is applied to the first piece of sub-data, the second temporal sub-data, the third temporal sub-data and so on up to the N-th temporal sub-data, and the temporally convolved data is then concatenated to obtain the second data.

For example, if the results of processing the data of two adjacent channel groups with the 1D temporal convolution are A and B, the fusion of the channel group data proceeds as shown in formulas (16) to (20) below:
$A \in \mathbb{R}^{[NF,\,C,\,H,\,W]},\quad B \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (16)

$C = A \oplus B \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (17)

$C = \mathrm{AvgPooling}(C) \in \mathbb{R}^{[NF,\,C,\,1,\,1]}$    (18)

$C_a = \mathrm{Softmax}(\mathrm{FC}(C)) \in \mathbb{R}^{[NF,\,C,\,1,\,1]}$    (19)

$out = C_a \odot A \oplus (1 - C_a) \odot B \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (20)
where formula (20) gives the result of fusing the data of two adjacent channel groups. The right part of FIG. 5 is a schematic flowchart of the channel selection module (CS, Channel Selection). As shown in FIG. 5, the data corresponding to $X_{g=0}$ is the first piece of sub-data and the data corresponding to $X_{g=1}$ is the second piece; the second piece is input into the one-dimensional convolution model in the temporal convolution model to obtain the second temporal sub-data. The data corresponding to $X_{g=2}$ is the third piece of sub-data; the third piece and the second temporal sub-data are fused and preprocessed, and the result is input into the one-dimensional convolution model to obtain the third temporal sub-data. The data corresponding to $X_{g=3}$ is the fourth piece of sub-data; the fourth piece and the third temporal sub-data are fused and preprocessed, and the result is input into the one-dimensional convolution model to obtain the fourth temporal sub-data.
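A minimal sketch of the CS module of formulas (16) to (20) follows; realizing FC as a single linear layer and AvgPooling as adaptive average pooling are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelSelect(nn.Module):
    # Sketch of the CS module, formulas (16)-(20): fuse two group features A and B
    # with learned channel attention weights C_a.
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        c = a + b                                   # formula (17)
        c = F.adaptive_avg_pool2d(c, 1).flatten(1)  # formula (18): [NF, C]
        ca = torch.softmax(self.fc(c), dim=1)       # formula (19)
        ca = ca.unsqueeze(-1).unsqueeze(-1)         # [NF, C, 1, 1]
        return ca * a + (1 - ca) * b                # formula (20)
```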
The recognition apparatus then uses a step-cascade structure to cascade the features $X_{g=0}$, $X_{g=1}$, $X_{g=2}$ and $X_{g=3}$ shown in FIG. 5 to obtain the second data; for example, temporal convolution is applied to the first piece of sub-data, the second temporal sub-data, the third temporal sub-data and so on up to the N-th temporal sub-data, and the temporally convolved data is then concatenated to obtain the second data.

The cascading process is shown in formulas (21) to (25) below:
$O_0 = X_{g=0}$    (21)

$O_1 = \mathrm{TemporalConv}(X_{g=1})$    (22)

$O_2 = \mathrm{TemporalConv}(\mathrm{CS}(O_1,\,X_{g=2}))$    (23)

$O_3 = \mathrm{TemporalConv}(\mathrm{CS}(O_2,\,X_{g=3}))$    (24)

$out = \mathrm{cat}[O_0,\,O_1,\,O_2,\,O_3],\quad out \in \mathbb{R}^{[NF,\,C,\,H,\,W]}$    (25)
where formula (25) characterizes the temporal information of the features.

In the embodiments of the present disclosure, temporal information at different scales is obtained through step-cascaded 1D convolutions, and the different-scale temporal information is connected in steps to obtain a 1D convolution with a large receptive field; that is, a long-range temporal modeling (TSS, Temporal Step-Structure) module is added to process the data.
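The step-cascade of formulas (21) to (25) can be sketched as follows, reusing the temporal convolution of formulas (13) to (15) and the CS module of formulas (16) to (20); the function and argument names are illustrative.

```python
import torch


def tss_forward(groups: list, temporal_convs: list, cs_modules: list) -> torch.Tensor:
    # groups: [X_{g=0..3}], each [NF, C/4, H, W]; sketch of formulas (21)-(25).
    o0 = groups[0]                                         # formula (21)
    o1 = temporal_convs[0](groups[1])                      # formula (22)
    o2 = temporal_convs[1](cs_modules[0](o1, groups[2]))   # formula (23)
    o3 = temporal_convs[2](cs_modules[1](o2, groups[3]))   # formula (24)
    return torch.cat([o0, o1, o2, o3], dim=1)              # formula (25)
```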
Step S105: obtain a behavior recognition result based on the first data and the second data.

In some embodiments, the first data is video data from which background motion noise has been eliminated, and the second data is long-range temporal data; based on the first data and the second data, the recognition apparatus accurately performs behavior recognition on the input video data, for example judging the behavior category.

In the embodiments of the present disclosure, there is no fixed execution order between steps S101-S102 and steps S103-S104: steps S101 and S102 may be executed first, followed by steps S103 and S104, or steps S103 and S104 may be executed first, followed by steps S101 and S102. In other words, the embodiments of the present disclosure may acquire the first data first and then the second data, or acquire the second data first and then the first data.
FIG. 6 is a schematic diagram of the data processing flow with the EMT module and the TSS module added according to an embodiment of the present disclosure. As shown in FIG. 6, the EMT module and the TSS module are added on the basis of an existing behavior recognition method. For example, before the data processing of the EMT module, a three-dimensional convolution model is used to acquire the data of the adjacent frames of the image; after the data processing of the TSS module, a one-dimensional convolution model is used to perform channel dimension reduction on the first data and the second data. In the embodiments of the present disclosure, the added EMT module and TSS module constitute a temporal and motion module (TMM, Temporal and Motion Module). The TMM can be embedded into an existing behavior recognition model such as a two-dimensional residual network (2D ResNet, 2 Dimension Residual Network) model; as shown in FIG. 6, the TMM is embedded after the three-dimensional convolution model used to acquire the data of the adjacent frames of the image and before the one-dimensional convolution model that performs channel dimension reduction on the first data and the second data. Through the TMM, motion saliency enhancement and long-range temporal modeling of the data can be achieved. FIG. 7 shows the feature maps of data after passing through the EMT module: the first row is the original frames of the input data, and the second and third rows are the feature maps of the data output by the EMT module; after the EMT module, the feature maps of the motion regions are distinct.
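As a rough illustration of how the TMM might sit inside a residual block, consider the following sketch; the surrounding 1×1 convolutions, the channel sizes and the residual wiring are hypothetical and do not claim to reproduce the exact integration shown in FIG. 6.

```python
import torch
import torch.nn as nn


class TMMBlock(nn.Module):
    # Hypothetical sketch: a pre-convolution reduces channels, the EMT module
    # enhances motion saliency, the TSS module models long-range timing, and a
    # 1x1 convolution performs the channel dimension reduction mentioned above.
    def __init__(self, in_ch: int, mid_ch: int, emt: nn.Module, tss: nn.Module):
        super().__init__()
        self.pre = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.emt, self.tss = emt, tss
        self.post = nn.Conv2d(mid_ch, in_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.tss(self.emt(self.pre(x)))
        return x + self.post(y)  # residual connection, as in a 2D ResNet block
```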
Based on the above description of the behavior recognition method provided by the embodiments of the present disclosure, the method can be applied at least to scenarios such as video recommendation, image recognition and human-computer interaction.
To implement the above behavior recognition method provided by the embodiments of the present disclosure, an embodiment of the present disclosure further provides a behavior recognition apparatus. FIG. 8 is a schematic structural diagram of the behavior recognition apparatus 200 provided by an embodiment of the present disclosure; the apparatus includes:

an acquisition unit 201, configured to acquire difference information of data of adjacent frames;

a determination unit 202, configured to determine, based on the difference information, first data representing motion information in the data;

a grouping unit 203, configured to perform channel grouping on the data of the adjacent frames based on features;

a processing unit 204, configured to process the data of each channel group in the temporal dimension to obtain second data;

a recognition unit 205, configured to obtain a behavior recognition result based on the first data and the second data.
In some optional implementations, the acquisition unit 201 is further configured to: perform feature alignment on the data of the adjacent frames; obtain foreground data of the adjacent frames from the feature-aligned data based on at least one convolution model with a ladder structure; and perform difference processing on the foreground data to obtain difference information of the foreground data of the adjacent frames.

In some optional implementations, the acquisition unit 201 is further configured to: acquire features of the data of two adjacent frames; perform dimension reduction in the channel dimension on the features of each frame; and use a similarity matrix to align the dimension-reduced features of the two adjacent frames.

In some optional implementations, the acquisition unit 201 is further configured to: copy each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies; input the first copy of the second frame data among the N copies into a two-dimensional convolution model, and take the difference between the resulting first output result and the first copy of the first frame data to obtain a first difference result, the two-dimensional convolution model being an N-order two-dimensional convolution model; add the first output result to the (M+1)-th copy of the second frame data, input the sum into the two-dimensional convolution model, and take the difference between the resulting (M+1)-th output result and the M-th copy of the first frame data to obtain an (M+1)-th difference result, where M is greater than or equal to 1 and M is less than or equal to N-1; and concatenate the first difference result through the N-th difference result and input the concatenation into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.

In some optional implementations, the determination unit 202 is further configured to: determine channel weights based on the difference information; and process the foreground data in the data based on the channel weights to obtain the first data.

In some optional implementations, the determination unit 202 is further configured to add the foreground data to the product of the foreground data and the channel weights to obtain the first data.
In some optional implementations, the processing unit 204 is further configured to: process the data of each channel group using a temporal convolution model; and fuse the processed data of all channel groups to obtain the second data.

In some optional implementations, the processing unit 204 is further configured to use the temporal convolution model to fuse temporal information of data at different scales in the channel dimension.

In some optional implementations, the processing unit 204 is further configured to: divide the data of each channel group into N pieces of sub-data, and input the second piece of sub-data into a one-dimensional convolution model in the temporal convolution model to obtain second temporal sub-data; fuse the K-th temporal sub-data with the (K+1)-th piece of sub-data and perform preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; compute a first result by element-wise multiplication of the preprocessed value with the K-th temporal sub-data, and a second result by element-wise multiplication of the difference between 1 and the preprocessed value with the (K+1)-th piece of sub-data; and fuse the first result with the second result and input the fusion into the one-dimensional convolution model to obtain the (K+1)-th temporal sub-data.

In some optional implementations, the processing unit 204 is further configured to apply temporal convolution to the first piece of sub-data, the second temporal sub-data, the third temporal sub-data and so on up to the N-th temporal sub-data, and then concatenate the temporally convolved data to obtain the second data.

In some optional implementations, the processing unit 204 is further configured to acquire the data of the adjacent frames of the image using a three-dimensional convolution model; and/or, after the second data is obtained, perform channel dimension reduction on the first data and the second data using a one-dimensional convolution model.
Those skilled in the art should understand that the functions implemented by the units of the behavior recognition apparatus shown in FIG. 8 can be understood with reference to the foregoing description of the behavior recognition method. The functions of the units of the behavior recognition apparatus shown in FIG. 8 may be implemented by a program running on a processor, or by a logic circuit.

If the above behavior recognition apparatus of the embodiments of the present disclosure is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read Only Memory), a magnetic disk or an optical disc. Thus, the embodiments of the present disclosure are not limited to any specific combination of hardware and software.

Correspondingly, an embodiment of the present disclosure further provides a computer program product storing computer-executable instructions which, when executed, implement the above behavior recognition method of the embodiments of the present disclosure.

An embodiment of the present disclosure further provides a storage medium storing executable instructions which, when executed by a processor, implement the above behavior recognition method.
To implement the above behavior recognition method provided by the embodiments of the present disclosure, an embodiment of the present disclosure further provides an electronic device. FIG. 9 is a schematic structural diagram of the electronic device according to an embodiment of the present disclosure. As shown in FIG. 9, the electronic device 300 may include one or more processors 301 (only one is shown in the figure; the processor 301 may include, but is not limited to, a processing device such as a microcontroller unit (MCU, Micro Controller Unit) or a field programmable gate array (FPGA, Field Programmable Gate Array)), a memory 302 for storing data, and a transmission device 303 for communication functions. Those of ordinary skill in the art can understand that the structure shown in FIG. 9 is merely illustrative and does not limit the structure of the above electronic device. For example, the electronic device 300 may also include more or fewer components than shown in FIG. 9, or have a configuration different from that shown in FIG. 9.

The memory 302 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the methods in the embodiments of the present disclosure. By running the software programs and modules stored in the memory 302, the processor 301 executes various functional applications and data processing, i.e., implements the above method. The memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 302 may further include memory located remotely from the processor 301, which may be connected to the electronic device 300 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

It can be understood that the memory 302 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories. The non-volatile memory may be a ROM, a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferromagnetic random access memory (FRAM, ferromagnetic random access memory), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a random access memory (RAM, Random Access Memory), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), synclink dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The memory 302 described in the embodiments of the present disclosure is intended to include, but is not limited to, these and any other suitable types of memory.

The memory 302 in the embodiments of the present disclosure is used to store various types of data to support the operation of the electronic device. Examples of such data include any computer program used to operate on the communication device, such as an application program. A program implementing the method of the embodiments of the present disclosure may be included in an application program.

The transmission device 303 is used to receive or send data via a network. The aforementioned network may include a wireless network provided by a communication provider of the electronic device 300. In one example, the transmission device 303 includes a network interface controller (NIC, Network Interface Controller), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 303 may be a radio frequency (RF, Radio Frequency) module, which is used to communicate with the Internet wirelessly.
The technical solutions described in the embodiments of the present disclosure may be combined arbitrarily where no conflict arises. In the several embodiments provided in the present disclosure, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.

The units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present disclosure may all be integrated into one second processing unit, or each unit may serve separately as one unit, or two or more units may be integrated into one unit; the above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, which should all be covered within the protection scope of the present disclosure.
Industrial Applicability

The embodiments of the present disclosure acquire difference information of data of adjacent frames; determine, based on the difference information, first data representing motion information in the data; perform channel grouping on the data of the adjacent frames based on features; process the data of each channel group in the temporal dimension to obtain second data; and obtain a behavior recognition result based on the first data and the second data. In this way, acquiring the difference information of the data of adjacent frames eliminates background noise in the image, and processing each channel group in the temporal dimension achieves a larger temporal receptive field; the behavior recognition method provided by the embodiments of the present disclosure can therefore improve the accuracy of behavior recognition in both the spatial and temporal dimensions.

Claims (27)

1. A behavior recognition method, the method comprising:

acquiring difference information of data of adjacent frames of an image;

determining, based on the difference information, first data representing motion information in the data;

performing channel grouping on the data of the adjacent frames based on features;

processing the data of each channel group in the temporal dimension to obtain second data;

obtaining a behavior recognition result based on the first data and the second data.
  2. The method according to claim 1, wherein the acquiring differential information of data of adjacent frames of the image comprises:
    performing feature alignment on the data of the adjacent frames;
    acquiring foreground data of the adjacent frames from the feature-aligned data based on at least one ladder-structured convolution model; and
    performing differential processing on the foreground data to acquire differential information of the foreground data of the adjacent frames.
  3. The method according to claim 2, wherein the performing feature alignment on the data of the adjacent frames comprises:
    acquiring features of the data of two adjacent frames;
    performing dimension reduction processing in the channel dimension on the features of each of the two adjacent frames; and
    performing feature alignment on the dimension-reduced features of the two adjacent frames by using a similarity matrix.
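A minimal sketch of one possible reading of claim 3 follows, assuming PyTorch: the channel dimension reduction is modeled as a shared 1x1 convolution (reduce_conv, an assumed module), the similarity matrix is computed between spatial positions of the reduced features, and the second frame is warped onto the first; the claim itself leaves the exact form of the alignment open.

```python
# Illustrative sketch of claim 3; reduce_conv (a 1x1 Conv2d) is an assumed module.
import torch

def align_adjacent(feat_a, feat_b, reduce_conv):
    # feat_a, feat_b: (B, C, H, W) features of the data of two adjacent frames
    a = reduce_conv(feat_a).flatten(2)                  # (B, C', H*W) after channel reduction
    b = reduce_conv(feat_b).flatten(2)                  # (B, C', H*W)
    sim = torch.softmax(a.transpose(1, 2) @ b, dim=-1)  # (B, H*W, H*W) similarity matrix
    b_full = feat_b.flatten(2)                          # (B, C, H*W)
    aligned_b = (b_full @ sim.transpose(1, 2)).view_as(feat_b)  # warp frame b onto frame a
    return feat_a, aligned_b
```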
  4. The method according to claim 2 or 3, wherein the acquiring foreground data of the adjacent frames from the feature-aligned data based on the at least one ladder-structured convolution model comprises:
    copying each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies;
    inputting the first copy of the second frame data among the N copies into a two-dimensional convolution model to obtain a first output result, and differencing the first output result with the first copy of the first frame data to obtain a first differential result, the two-dimensional convolution model being an N-order two-dimensional convolution model;
    adding the first output result to the (M+1)-th copy of the second frame data and inputting the sum into the two-dimensional convolution model to obtain an (M+1)-th output result, and differencing the (M+1)-th output result with the M-th copy of the first frame data to obtain an (M+1)-th differential result, where M is greater than or equal to 1 and M is less than or equal to N-1; and
    splicing the first differential result to the N-th differential result and inputting the spliced results into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
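Read literally, claim 4 feeds the first output result into every subsequent rung of the ladder. A minimal sketch under that reading follows, assuming PyTorch; modeling the N-order two-dimensional convolution model as N separate Conv2d modules (conv2d_stack) and the one-dimensional convolution model as a 1x1 Conv1d over N·C channels (conv1d) are assumptions, and identical copies of the frame data are not materialized.

```python
# Illustrative sketch of the ladder structure in claim 4; modules are assumed, not recited.
import torch

def ladder_foreground(frame1, frame2, conv2d_stack, conv1d, n):
    # frame1, frame2: (B, C, H, W) first/second frame data of the foreground data
    diffs = []
    first_out = conv2d_stack[0](frame2)            # first output result
    diffs.append(first_out - frame1)               # first differential result
    for m in range(1, n):                          # M = 1 .. N-1
        out = conv2d_stack[m](first_out + frame2)  # add the (M+1)-th copy of frame 2
        diffs.append(out - frame1)                 # (M+1)-th differential result (copies identical)
    spliced = torch.cat(diffs, dim=1).flatten(2)   # (B, N*C, H*W) spliced differential results
    return conv1d(spliced).view_as(frame1)         # 1-D convolution -> foreground data
```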
  5. The method according to claim 4, wherein the determining, based on the differential information, first data representing motion information in the data comprises:
    determining a channel weight based on the differential information; and
    processing the foreground data in the data based on the channel weight to obtain the first data.
  6. The method according to claim 5, wherein the processing the foreground data in the data based on the channel weight to obtain the first data comprises:
    multiplying the foreground data by the channel weight to obtain a product result, and adding the product result to the foreground data to obtain the first data.
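Claims 5 and 6 amount to a channel-attention gate with a residual connection. A minimal sketch follows, assuming PyTorch; deriving the channel weight via global average pooling and a sigmoid is an assumption, since the claims only require that the weight be determined from the differential information.

```python
# Illustrative sketch of claims 5-6; the pooling/sigmoid weighting is an assumed design.
import torch
import torch.nn.functional as F

def motion_excite(foreground, diff_info):
    # foreground, diff_info: (B, C, H, W)
    weight = torch.sigmoid(F.adaptive_avg_pool2d(diff_info, 1))  # (B, C, 1, 1) channel weight
    return foreground * weight + foreground  # product result added back to the foreground
```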
  7. The method according to any one of claims 1 to 6, wherein the processing the data of each channel group in the time series dimension to obtain second data comprises:
    processing the data of each channel group by using a time series convolution model; and
    fusing the processed data of all the channel groups to obtain the second data.
  8. The method according to claim 7, wherein the processing the data of each channel group by using a time series convolution model comprises:
    fusing data time series information of different scales in the channel dimension by using the time series convolution model.
  9. The method according to claim 8, wherein the fusing data time series information of different scales in the channel dimension by using the time series convolution model comprises:
    dividing the data of each channel group into N pieces of sub-data, and inputting the second piece of sub-data among the N pieces into a one-dimensional convolution model in the time series convolution model to obtain second sub-time-series data;
    fusing the K-th sub-time-series data with the (K+1)-th piece of sub-data, and performing preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1;
    computing a first result of a point-wise multiplication of the value obtained by the preprocessing and the K-th sub-time-series data, and computing a second result of a point-wise multiplication of the difference between 1 and the value obtained by the preprocessing and the (K+1)-th piece of sub-data; and
    fusing the first result with the second result and inputting the fused result into the one-dimensional convolution model to obtain (K+1)-th sub-time-series data.
  10. The method according to claim 9, wherein the fusing the processed data of all the channel groups to obtain the second data comprises:
    performing time series convolution processing on each of the first sub-data, the second sub-time-series data, the third sub-time-series data, up to the N-th sub-time-series data, and then cascading the data after the time series convolution processing to obtain the second data.
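Claims 9 and 10 describe a gated, multi-scale temporal ladder over channel sub-groups. A minimal sketch follows, assuming PyTorch; modeling the "fusion" as addition, the "preprocessing" as a sigmoid, and the two one-dimensional convolutions (conv1d_in, conv1d_out) as channel-preserving modules are all assumptions made for the sketch.

```python
# Illustrative sketch of claims 9-10; the fusion/preprocessing choices are assumptions.
import torch

def long_temporal(group, conv1d_in, conv1d_out, n):
    # group: (B, C, T) data of one channel group; C is assumed divisible by n
    subs = torch.chunk(group, n, dim=1)               # N pieces of sub-data
    series = [subs[0]]                                # first sub-data is kept as-is
    ts = conv1d_in(subs[1])                           # second sub-time-series data
    series.append(ts)
    for k in range(1, n - 1):                         # K = 2 .. N-1 in the claim's numbering
        gate = torch.sigmoid(ts + subs[k + 1])        # preprocess the fused data
        mixed = gate * ts + (1 - gate) * subs[k + 1]  # first result fused with second result
        ts = conv1d_in(mixed)                         # (K+1)-th sub-time-series data
        series.append(ts)
    # time series convolution on every piece, then cascade -> second data
    return torch.cat([conv1d_out(s) for s in series], dim=1)
```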
  11. The method according to claim 1, wherein before the acquiring differential information of data of adjacent frames of the image, the method further comprises: acquiring the data of the adjacent frames of the image by using a three-dimensional convolution model;
    and/or, after the obtaining second data, the method further comprises: performing channel dimension reduction processing on the first data and the second data by using a one-dimensional convolution model.
  12. The method according to any one of claims 1 to 11, wherein the method is implemented by a time series motion model, the time series motion model comprising an enhanced motion transformation module and a long time series modeling module;
    wherein the enhanced motion transformation module is configured to acquire the differential information of the data of the adjacent frames of the image, and determine, based on the differential information, the first data representing motion information in the data; and
    the long time series modeling module is configured to perform channel grouping on the data of the adjacent frames based on features, and process the data of each channel group in the time series dimension to obtain the second data.
  13. The method according to claim 12, wherein the time series motion model is embedded after a three-dimensional convolution model, the three-dimensional convolution model being used to acquire the data of the adjacent frames of the image;
    and/or, the time series motion model is embedded before a one-dimensional convolution model, the one-dimensional convolution model being used to perform channel dimension reduction processing on the first data and the second data.
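Claims 12 and 13 place the time series motion model between a three-dimensional convolution (which supplies the adjacent-frame data) and a one-dimensional convolution (which reduces the channel dimension). A minimal structural sketch follows, assuming PyTorch; the constituent modules are injected as parameters, and combining the first and second data by element-wise addition is an assumption.

```python
# Illustrative sketch of the module placement in claims 12-13; all modules are assumed.
import torch.nn as nn

class TemporalMotionBlock(nn.Module):
    def __init__(self, conv3d, motion_module, temporal_module, reduce_module):
        super().__init__()
        self.conv3d = conv3d              # acquires data of adjacent frames (claim 13)
        self.motion = motion_module       # enhanced motion transformation -> first data
        self.temporal = temporal_module   # long time series modeling -> second data
        self.reduce = reduce_module       # 1-D convolution for channel dimension reduction

    def forward(self, x):
        x = self.conv3d(x)
        first, second = self.motion(x), self.temporal(x)
        return self.reduce(first + second)
```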
  14. A behavior recognition apparatus, the apparatus comprising:
    an acquisition unit, configured to acquire differential information of data of adjacent frames;
    a determination unit, configured to determine, based on the differential information, first data representing motion information in the data;
    a grouping unit, configured to perform channel grouping on the data of the adjacent frames based on features;
    a processing unit, configured to process the data of each channel group in the time series dimension to obtain second data; and
    a recognition unit, configured to obtain a behavior recognition result based on the first data and the second data.
  15. The apparatus according to claim 14, wherein
    the acquisition unit is further configured to: perform feature alignment on the data of the adjacent frames; acquire foreground data of the adjacent frames from the feature-aligned data based on at least one ladder-structured convolution model; and perform differential processing on the foreground data to acquire differential information of the foreground data of the adjacent frames.
  16. The apparatus according to claim 15, wherein
    the acquisition unit is further configured to: acquire features of the data of two adjacent frames; perform dimension reduction processing in the channel dimension on the features of each frame; and perform feature alignment on the dimension-reduced features of the two adjacent frames by using a similarity matrix.
  17. The apparatus according to claim 15 or 16, wherein
    the acquisition unit is further configured to: copy each of the first frame data and the second frame data in the foreground data of the adjacent frames into N copies; input the first copy of the second frame data among the N copies into a two-dimensional convolution model, and difference the obtained first output result with the first copy of the first frame data to obtain a first differential result, the two-dimensional convolution model being an N-order two-dimensional convolution model; add the first output result to the (M+1)-th copy of the second frame data and input the sum into the two-dimensional convolution model, and difference the obtained (M+1)-th output result with the M-th copy of the first frame data to obtain an (M+1)-th differential result, where M is greater than or equal to 1 and M is less than or equal to N-1; and splice the obtained first differential result to the N-th differential result and input the spliced results into a one-dimensional convolution model to obtain the foreground data of the adjacent frames.
  18. The apparatus according to claim 17, wherein
    the determination unit is further configured to: determine a channel weight based on the differential information; and process the foreground data in the data based on the channel weight to obtain the first data.
  19. The apparatus according to claim 18, wherein
    the determination unit is further configured to add the foreground data to the product of the foreground data and the channel weight to obtain the first data.
  20. The apparatus according to any one of claims 14 to 19, wherein
    the processing unit is further configured to: process the data of each channel group by using a time series convolution model; and fuse the processed data of all the channel groups to obtain the second data.
  21. The apparatus according to claim 20, wherein
    the processing unit is further configured to fuse data time series information of different scales in the channel dimension by using the time series convolution model.
  22. The apparatus according to claim 21, wherein
    the processing unit is further configured to: divide the data of each channel group into N pieces of sub-data, and input the second piece of sub-data into a one-dimensional convolution model in the time series convolution model to obtain second sub-time-series data; fuse the K-th sub-time-series data with the (K+1)-th piece of sub-data and perform preprocessing on the fused data, where K is greater than or equal to 2 and K is less than or equal to N-1; compute a first result of a point-wise multiplication of the value obtained by the preprocessing and the K-th sub-time-series data, and compute a second result of a point-wise multiplication of the difference between 1 and the value obtained by the preprocessing and the (K+1)-th piece of sub-data; and fuse the first result with the second result and input the fused result into the one-dimensional convolution model to obtain (K+1)-th sub-time-series data.
  23. The apparatus according to claim 22, wherein
    the processing unit is further configured to perform time series convolution processing on each of the first sub-data, the second sub-time-series data, the third sub-time-series data, up to the N-th sub-time-series data, and then cascade the data after the time series convolution processing to obtain the second data.
  24. The apparatus according to claim 14, wherein
    the processing unit is further configured to: acquire the data of the adjacent frames of the image by using a three-dimensional convolution model; and/or, after the second data is obtained, perform channel dimension reduction processing on the first data and the second data by using a one-dimensional convolution model.
  25. A computer program product, comprising computer-executable instructions which, when executed, implement the method steps of any one of claims 1 to 13.
  26. A storage medium having executable instructions stored thereon, wherein the executable instructions, when executed by a processor, implement the method steps of any one of claims 1 to 13.
  27. An electronic device, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor, when running the computer-executable instructions on the memory, implements the method steps of any one of claims 1 to 13.
PCT/CN2021/127119 2021-02-22 2021-10-28 Behavior recognition method and apparatus, and electronic device and storage medium WO2022174616A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110198255.3A CN112926436A (en) 2021-02-22 2021-02-22 Behavior recognition method and apparatus, electronic device, and storage medium
CN202110198255.3 2021-02-22

Publications (1)

Publication Number Publication Date
WO2022174616A1 true WO2022174616A1 (en) 2022-08-25

Family

ID=76170053

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127119 WO2022174616A1 (en) 2021-02-22 2021-10-28 Behavior recognition method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112926436A (en)
WO (1) WO2022174616A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926436A (en) * 2021-02-22 2021-06-08 上海商汤智能科技有限公司 Behavior recognition method and apparatus, electronic device, and storage medium
CN114938349B (en) * 2022-05-20 2023-07-25 远景智能国际私人投资有限公司 Internet of things data processing method and device, computer equipment and storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886172A (en) * 2019-02-01 2019-06-14 深圳市商汤科技有限公司 Video behavior recognition methods and device, electronic equipment, storage medium, product
CN112241673B (en) * 2019-07-19 2022-11-22 浙江商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN111652910B (en) * 2020-05-22 2023-04-11 重庆理工大学 Target tracking algorithm based on object space relationship

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN110008867A (en) * 2019-03-25 2019-07-12 五邑大学 A kind of method for early warning based on personage's abnormal behaviour, device and storage medium
CN111242068A (en) * 2020-01-17 2020-06-05 科大讯飞(苏州)科技有限公司 Behavior recognition method and device based on video, electronic equipment and storage medium
CN111539290A (en) * 2020-04-16 2020-08-14 咪咕文化科技有限公司 Video motion recognition method and device, electronic equipment and storage medium
CN112926436A (en) * 2021-02-22 2021-06-08 上海商汤智能科技有限公司 Behavior recognition method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN112926436A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
WO2022174616A1 (en) Behavior recognition method and apparatus, and electronic device and storage medium
US11625433B2 (en) Method and apparatus for searching video segment, device, and medium
WO2016101628A1 (en) Data processing method and device in data modeling
JP7417759B2 (en) Methods, apparatus, electronic equipment, storage media and computer programs for training video recognition models
US10394907B2 (en) Filtering data objects
CN113570030B (en) Data processing method, device, equipment and storage medium
CN110941978B (en) Face clustering method and device for unidentified personnel and storage medium
CN110457524B (en) Model generation method, video classification method and device
CN111368850B (en) Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal
JP2020525908A (en) Image search method, device, device and readable storage medium
WO2021159787A1 (en) Content processing method and apparatus, computer-readable storage medium and computer device
Hintermüller et al. Robust principal component pursuit via inexact alternating minimization on matrix manifolds
CN112241789A (en) Structured pruning method, device, medium and equipment for lightweight neural network
US20230066703A1 (en) Method for estimating structural vibration in real time
Zhao et al. 3D target detection using dual domain attention and SIFT operator in indoor scenes
CN114492755A (en) Target detection model compression method based on knowledge distillation
WO2023050649A1 (en) Esg index determination method based on data complementing, and related product
CN110968835A (en) Approximate quantile calculation method and device
CN116451081A (en) Data drift detection method, device, terminal and storage medium
CN116229095A (en) Model training method, visual task processing method, device and equipment
CN115331081A (en) Image target detection method and device
Zhang et al. A novel target tracking method based on OSELM
CN114743150A (en) Target tracking method and device, electronic equipment and storage medium
CN113658320A (en) Three-dimensional reconstruction method, human face three-dimensional reconstruction method and related device
Shang et al. Regularization parameter selection for the low rank matrix recovery

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21926330

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE