CN112597824A - Behavior recognition method and device, electronic equipment and storage medium - Google Patents

Behavior recognition method and device, electronic equipment and storage medium

Info

Publication number
CN112597824A
CN112597824A (application number CN202011438565.XA)
Authority
CN
China
Prior art keywords
behavior
video
original
network model
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011438565.XA
Other languages
Chinese (zh)
Inventor
陈海波
罗志鹏
张治广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyan Technology Beijing Co ltd
Original Assignee
Shenyan Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyan Technology Beijing Co ltd filed Critical Shenyan Technology Beijing Co ltd
Priority to CN202011438565.XA priority Critical patent/CN112597824A/en
Publication of CN112597824A publication Critical patent/CN112597824A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25: Fusion techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application relate to the technical field of computer vision, and provide a behavior recognition method and device, an electronic device, and a storage medium. The method comprises: inputting an original behavior video into a data processing module for data preprocessing to obtain a behavior video set to be recognized; inputting the behavior video set to be recognized into a SlowFast network model to obtain a first behavior recognition result; inputting the behavior video set to be recognized into a TSM network model to obtain a second behavior recognition result; and obtaining the recognition result of the original behavior video based on the first and second behavior recognition results. By fusing the recognition results of the SlowFast network and the TSM network, the method takes both spatial and time-domain information into account and improves the precision of the behavior recognition result.

Description

Behavior recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a behavior recognition method and apparatus, an electronic device, and a storage medium.
Background
Currently, human behavior recognition is mainly applied in the fields of human-computer interaction, motion analysis, intelligent monitoring, and virtual reality. Owing to the complexity of human motion and the variability of external environments, behavior recognition and detection remain challenging.
A prior-art behavior recognition method completes the recognition of human behavior in video through two stages, library establishment and recognition. The library establishment stage comprises: collecting a video, calculating a histogram vector for the person image in each frame of the video, normalizing the histogram vectors, computing the entropy of each normalized histogram vector, combining the resulting entropies into an entropy vector, and finally labeling the entropy vector to complete a template library of human action features. The recognition stage comprises: collecting a video, calculating its entropy vector by the same method used to build the library, matching that entropy vector against the template library, and finding the best-matching entropy vector in the library, whose label is the action type of the video. The action recognition precision of this method is low.
Disclosure of Invention
The application provides a behavior recognition method and device, an electronic device, and a storage medium, so as to obtain high-precision behavior recognition results.
The application provides a behavior recognition method, which comprises the following steps:
performing data preprocessing on the original behavior video to obtain a behavior video set to be identified;
inputting the behavior video set to be recognized into a Slowfast network model to obtain a first behavior recognition result;
inputting the behavior video set to be recognized into a TSM network model, and acquiring a second behavior recognition result;
acquiring the recognition result of the original behavior video based on the first behavior recognition result and the second behavior recognition result;
the Slowfast network model and the TSM network model are obtained by training behavior recognition results based on a to-be-recognized sample behavior video set and a to-be-recognized sample behavior video set, and the to-be-recognized sample behavior video set is obtained by performing data preprocessing on an original sample behavior video.
According to the behavior recognition method provided by the application, data preprocessing is carried out on an original behavior video to obtain a behavior video set to be recognized, and the method comprises the following steps: sequentially carrying out video length processing, video mode processing and data enhancement processing on the original behavior video;
the video length processing includes: if the length of the original behavior video is judged to be larger than a preset value, sampling the original behavior video by taking the preset value as the length; if the length of the original behavior video is judged to be smaller than a preset value, the length of the original behavior video is filled to the preset value based on video interpolation;
the video mode processing includes: respectively acquiring an RGB video and a frame difference video of an original behavior video after the video length processing;
the data enhancement processing includes: and respectively performing data enhancement on the RGB video and the frame difference video, wherein the data enhancement comprises one or more of mirror image turning, video reverse playing, video cutting and video splicing.
According to the behavior recognition method provided by the application, the behavior video set to be recognized is input into a Slowfast network model, and a first behavior recognition result is obtained, wherein the method comprises the following steps:
and inputting the behavior video set to be recognized into a Slowfast network model, acquiring a plurality of groups of Slowfast network recognition results, and taking the average value of the plurality of groups of Slowfast network recognition results as the first behavior recognition result.
According to the behavior recognition method provided by the application, the SlowFast network model comprises a Non-local module and a spatio-temporal attention module, and uses an ELU function as its activation function.
According to the behavior recognition method provided by the application, the behavior video set to be recognized is input into a TSM network model, and a second behavior recognition result is obtained, wherein the method comprises the following steps:
and inputting the behavior video set to be recognized into a TSM network model, acquiring a plurality of groups of TSM network recognition results, and taking the average value of the plurality of groups of TSM network recognition results as the second behavior recognition result.
According to the behavior recognition method provided by the application, the recognition result of the original behavior video is obtained based on the first behavior recognition result and the second behavior recognition result, and the method comprises the following steps:
and taking the average value of the first behavior recognition result and the second behavior recognition result as the recognition result of the original behavior video.
According to the behavior recognition method provided by the application, after the SlowFast network model and the TSM network model are trained based on a sample behavior video set to be recognized, the method further comprises:
performing test verification on the SlowFast network model and the TSM network model based on an original test behavior video, specifically comprising:
inputting an original test video into a data processing module for data preprocessing to obtain a test behavior video set;
inputting the test behavior video set into the Slowfast network model to obtain a first test result;
inputting the test behavior video set into the TSM network model to obtain a second test result;
and acquiring the test result of the original test video based on the first test result and the second test result.
The present application further provides a behavior recognition device, including:
the acquiring unit is used for carrying out data preprocessing on the original behavior video and acquiring a behavior video set to be identified;
the first identification unit is used for inputting the behavior video set to be identified into a Slowfast network model and acquiring a first behavior identification result;
the second identification unit is used for inputting the behavior video set to be identified into a TSM network model and acquiring a second behavior identification result;
a third identification unit, configured to obtain an identification result of the original behavior video based on the first behavior identification result and the second behavior identification result;
the Slowfast network model and the TSM network model are obtained by training behavior recognition results based on a to-be-recognized sample behavior video set and a to-be-recognized sample behavior video set, and the to-be-recognized sample behavior video set is obtained by performing data preprocessing on an original sample behavior video.
According to the behavior recognition device provided by the application, the obtaining unit is configured to perform data preprocessing on an original behavior video and obtain a behavior video set to be recognized, and the obtaining unit includes: sequentially carrying out video length processing, video mode processing and data enhancement processing on the original behavior video; the acquisition unit includes:
the video length processing unit is used for carrying out video length processing on the original behavior video, and comprises: if the length of the original behavior video is judged to be larger than a preset value, sampling the original behavior video by taking the preset value as the length; if the length of the original behavior video is judged to be smaller than a preset value, the length of the original behavior video is filled to the preset value based on video interpolation;
the video mode processing unit is used for carrying out video mode processing on the original behavior video and comprises the following steps: respectively acquiring an RGB video and a frame difference video of an original behavior video after the video length processing;
the data enhancement processing unit is used for performing data enhancement processing on the original behavior video and comprises: and respectively performing data enhancement on the RGB video and the frame difference video, wherein the data enhancement comprises one or more of mirror image turning, video reverse playing, video cutting and video splicing.
According to the behavior recognition device provided by the application, the first recognition unit is configured to:
and inputting the behavior video set to be recognized into a Slowfast network model, acquiring a plurality of groups of Slowfast network recognition results, and taking the average value of the plurality of groups of Slowfast network recognition results as the first behavior recognition result.
According to the behavior recognition device provided by the application, the Slowfast network model comprises a Non-local module and a spatio-temporal attention module, and the Slowfast network model takes an ELU function as an activation function.
According to the behavior recognition device provided by the application, the second recognition unit is configured to:
and inputting the behavior video set to be recognized into a TSM network model, acquiring a plurality of groups of TSM network recognition results, and taking the average value of the plurality of groups of TSM network recognition results as the second behavior recognition result.
According to the behavior recognition device provided by the application, the third recognition unit is configured to:
and taking the average value of the first behavior recognition result and the second behavior recognition result as the recognition result of the original behavior video.
According to the behavior recognition device provided by the application, the device further comprises a testing unit, configured to test and verify the SlowFast network model and the TSM network model based on an original test behavior video after the two models have been trained based on a sample behavior video set to be recognized, the testing unit comprising:
the test acquisition unit is used for inputting the original test video into the data processing module for data preprocessing to acquire a test behavior video set;
the first testing subunit is used for inputting the testing behavior video set into the Slowfast network model to obtain a first testing result;
the second testing subunit is used for inputting the testing behavior video set into the TSM network model to obtain a second testing result;
and the third testing subunit is used for acquiring the testing result of the original testing video based on the first testing result and the second testing result.
The present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any of the behavior recognition methods described above when executing the computer program.
The present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the behavior recognition method as described in any of the above.
According to the behavior recognition method and device, the electronic device, and the storage medium provided by the present application, an original behavior video is input into a data processing module for data preprocessing to obtain a behavior video set to be recognized; the behavior video set to be recognized is input into a SlowFast network model to obtain a first behavior recognition result; the behavior video set to be recognized is input into a TSM network model to obtain a second behavior recognition result; and the recognition result of the original behavior video is obtained based on the first and second behavior recognition results. The SlowFast network uses a slow pathway operating at a low frame rate to capture spatial semantics and a fast pathway operating at a high frame rate to capture motion at fine temporal resolution, so the behavior recognition result can be obtained accurately; meanwhile, the TSM (Temporal Shift Module) network describes time-domain information well, so the obtained behavior recognition result better represents temporal features.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart of a behavior recognition method provided herein;
fig. 2 is a schematic structural diagram of a Slowfast network model provided in the present application;
fig. 3 is a schematic diagram of an example of a Slowfast network model provided in the present application;
FIG. 4 is a schematic diagram of a TSM network model provided herein;
FIG. 5 is a schematic diagram of a convolution module in a TSM network model provided in the present application;
FIG. 6 is a schematic diagram of a test flow provided herein;
FIG. 7 is a second flowchart of the behavior recognition method provided in the present application;
fig. 8 is a schematic structural diagram of a behavior recognition device provided in the present application;
FIG. 9 is a schematic structural diagram of an acquisition unit provided in the present application;
fig. 10 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, human behavior recognition is mainly applied in the fields of human-computer interaction, motion analysis, intelligent monitoring, and virtual reality. Owing to the complexity of human motion and the variability of external environments, behavior recognition and detection remain challenging.
A prior-art behavior recognition method completes the recognition of human behavior in video through two stages, library establishment and recognition. The library establishment stage comprises: collecting a video, calculating a histogram vector for the person image in each frame of the video, normalizing the histogram vectors, computing the entropy of each normalized histogram vector, combining the resulting entropies into an entropy vector, and finally labeling the entropy vector to complete a template library of human action features. The recognition stage comprises: collecting a video, calculating its entropy vector by the same method used to build the library, matching that entropy vector against the template library, and finding the best-matching entropy vector in the library, whose label is the action type of the video. The action recognition precision of this method is low.
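As an illustration only (not part of the claimed method), the prior-art library-building feature described above, a per-frame histogram that is normalized and reduced to an entropy, can be sketched roughly as follows; the bin count and the use of NumPy grayscale arrays are assumptions of this sketch rather than details stated in the prior art.

```python
import numpy as np

def entropy_feature(gray_frame, bins=16):
    # Histogram of one grayscale frame, normalized to probabilities,
    # reduced to a single entropy value.
    hist, _ = np.histogram(gray_frame, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # skip empty bins (0 * log 0 = 0)
    return float(-(p * np.log2(p)).sum())

def entropy_vector(frames, bins=16):
    # One entropy per frame, combined into the entropy vector that is
    # matched against the template library.
    return np.array([entropy_feature(f, bins) for f in frames])
```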
In view of the above, the present application provides a behavior recognition method. Fig. 1 is a schematic flow chart of a behavior recognition method provided in the present application, and as shown in fig. 1, the method includes the following steps:
step 110, performing data preprocessing on the original behavior video to obtain a behavior video set to be identified;
step 120, inputting the behavior video set to be recognized into a Slowfast network model, and acquiring a first behavior recognition result;
step 130, inputting the behavior video set to be recognized into a TSM network model, and acquiring a second behavior recognition result;
step 140, acquiring an identification result of the original behavior video based on the first behavior identification result and the second behavior identification result;
the Slowfast network model and the TSM network model are obtained by training behavior recognition results of a behavior video set of a sample to be recognized and a behavior video set of the sample to be recognized, and the behavior video set of the sample to be recognized is obtained by performing data preprocessing on an original sample behavior video.
In this embodiment, it should be noted that behavior recognition obtains the actions of the relevant people in a video, such as riding a bicycle, fighting, or climbing a mountain, by analyzing a given video to be recognized. The original behavior video may be obtained from the behavior recognition data set UCF-101 or from the Kinetics-400 video data set; this embodiment does not specifically limit the source.
In this embodiment, in order to improve the accuracy of behavior recognition, data preprocessing is performed on the obtained original behavior video to produce multiple behavior videos to be recognized, which form a behavior video set to be recognized, so that the resulting video set can reflect the details of the behavior in the original video from multiple different angles. The data preprocessing may be data enhancement of the original behavior video, or unification of its video length; this embodiment does not specifically limit it.
After data preprocessing is performed on the original behavior video, this embodiment inputs the behavior video set to be recognized into a SlowFast network model and obtains a first behavior recognition result. The core of the SlowFast network is to apply two parallel convolutional neural networks (CNNs), a Slow channel and a Fast channel, to the same video segment. For example, a video of an airplane taking off may contain a relatively static airport and a fast-moving dynamic object (the airplane); similarly, in daily life, when two people meet, the handshake is usually relatively fast while the rest of the scene is relatively static. SlowFast therefore uses a slow, high-resolution CNN (the Slow channel) to analyze the static content of the video and a fast, low-resolution CNN (the Fast channel) to analyze the dynamic content, so that the action details in the video can be analyzed accurately.
Meanwhile, this embodiment inputs the behavior video set to be recognized into the TSM network model and obtains a second behavior recognition result. It should be noted that because a 3D network has a large computation cost while a plain 2D network does not exploit timing information, the TSM (Temporal Shift Module) network models time with a 2D network: part of the channels of the current frame's feature map are replaced with the corresponding channels of the previous or next frame. This reduces the computation cost of the network while still allowing accurate behavior recognition.
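A minimal sketch of the temporal shift operation described above, assuming NumPy feature maps with layout [T, C, H, W] and the commonly used fraction of 1/8 of channels shifted in each direction; the patent does not fix these details, so they are assumptions of this sketch.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    # x: feature maps of shape [T, C, H, W].
    # The first C/fold_div channels take their values from the previous
    # frame, the next C/fold_div from the following frame, and the rest
    # are copied through unchanged; temporal boundaries are zero-padded.
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                   # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]   # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]              # unshifted channels
    return out
```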
On this basis, the recognition result of the original behavior video is obtained from the first and second behavior recognition results as the final behavior recognition result; for example, the final result may be obtained by averaging the two. By fusing the recognition results of the SlowFast network and the TSM network, this embodiment further improves the precision of the behavior recognition model and avoids the instability that would result if either network alone produced a recognition error.
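The averaging fusion described above can be sketched as follows; representing each network's output as a per-class probability vector is an assumption of this sketch.

```python
import numpy as np

def fuse_predictions(slowfast_probs, tsm_probs):
    # Average the per-class scores of the two networks and pick the
    # highest-scoring class as the final recognition result.
    fused = (np.asarray(slowfast_probs, dtype=float)
             + np.asarray(tsm_probs, dtype=float)) / 2.0
    return fused, int(np.argmax(fused))
```

If one network misranks the true class, the average with the other network's scores can still recover the correct prediction, which is the stability benefit the text describes.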
According to the behavior recognition method, an original behavior video is input into a data processing module for data preprocessing to obtain a behavior video set to be recognized; the behavior video set to be recognized is input into a SlowFast network model to obtain a first behavior recognition result; the behavior video set to be recognized is input into a TSM network model to obtain a second behavior recognition result; and the recognition result of the original behavior video is obtained based on the first and second behavior recognition results. The SlowFast network uses a slow pathway operating at a low frame rate to capture spatial semantics and a fast pathway operating at a high frame rate to capture motion at fine temporal resolution, so the behavior recognition result can be obtained accurately; meanwhile, the TSM (Temporal Shift Module) network describes time-domain information well, so the obtained behavior recognition result better represents temporal features.
Based on the above embodiment, step 110 includes: sequentially carrying out video length processing, video mode processing and data enhancement processing on the original behavior video;
the video length processing comprises the following steps: if the length of the original behavior video is judged to be larger than a preset value, sampling the original behavior video by taking the preset value as the length; if the length of the original behavior video is judged to be smaller than the preset value, the length of the original behavior video is filled to the preset value based on video interpolation;
the video mode processing includes: respectively acquiring an RGB video and a frame difference video of an original behavior video after video length processing;
the data enhancement processing comprises the following steps: and respectively carrying out data enhancement on the RGB video and the frame difference video, wherein the data enhancement comprises one or more of mirror image turning, video reverse playing, video cutting and video splicing.
In this embodiment, because the lengths of the obtained original behavior videos are not uniform, which would affect the input used to train the SlowFast and TSM network models, an original behavior video longer than a preset value is sampled down to that preset length, and one shorter than the preset value is padded up to it by video interpolation, so that all original behavior videos are unified to the preset length. For example, according to the model input requirement, this embodiment may sample 64 frames as input from an original behavior video longer than 64 frames, and pad an original behavior video shorter than 64 frames to 64 frames by video interpolation.
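A rough sketch of this length unification, assuming frames are NumPy arrays; the use of linear interpolation between neighboring frames for padding is an assumption, as the patent does not specify the interpolation scheme.

```python
import numpy as np

def normalize_length(frames, target=64):
    # Uniformly sample when the clip is longer than `target`; pad by
    # linear interpolation between neighboring frames when shorter.
    t = len(frames)
    if t == target:
        return list(frames)
    if t > target:
        idx = np.linspace(0, t - 1, target).round().astype(int)
        return [frames[i] for i in idx]
    out = []
    for i in np.linspace(0, t - 1, target):
        lo, hi = int(np.floor(i)), int(np.ceil(i))
        w = i - lo
        out.append((1 - w) * frames[lo] + w * frames[hi])
    return out
```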
In addition, after the video length processing, in order to further improve the behavior recognition accuracy, this embodiment converts the original behavior video into two modes and inputs them into the SlowFast and TSM network models respectively for training: the RGB video corresponding to the original behavior video and the frame difference video corresponding to the original behavior video, where the frame difference video is obtained by taking the difference between adjacent frames.
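The frame difference mode described above amounts to subtracting each frame from its successor; a minimal sketch, assuming the frames are stacked into a NumPy array:

```python
import numpy as np

def frame_difference(frames):
    # frames: array of shape [T, H, W] (or [T, H, W, C]); the result
    # has T-1 frames, each the difference between a frame and its
    # predecessor, highlighting the moving content.
    frames = np.asarray(frames, dtype=float)
    return frames[1:] - frames[:-1]
```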
In order for the original behavior video to be better expressed in the models, this embodiment performs data enhancement on the RGB video and the frame difference video respectively, so that a large number of data set samples can be obtained and input into the models for training, improving the training effect. The data enhancement comprises: mirror-flipping the whole video; playing the whole video in reverse; randomly cropping a partial image from each frame of the whole video; and splicing the forward and reverse videos and then sampling frames.
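The four enhancement operations listed above might be sketched as follows; the [T, H, W, C] clip layout, the crop size parameter, and the sampling step are illustrative assumptions, not values given in the patent.

```python
import numpy as np

def mirror_flip(clip):
    # clip: [T, H, W, C]; flip every frame horizontally.
    return clip[:, :, ::-1]

def reverse_play(clip):
    # Reverse the temporal order of the frames.
    return clip[::-1]

def random_crop(clip, size, rng=None):
    # Crop the same random size x size window from every frame.
    if rng is None:
        rng = np.random.default_rng()
    _, h, w = clip.shape[:3]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return clip[:, top:top + size, left:left + size]

def splice_and_sample(clip, step=2):
    # Concatenate the forward and reversed clip, then sample frames.
    spliced = np.concatenate([clip, clip[::-1]], axis=0)
    return spliced[::step]
```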
Based on the above embodiment, step 120 inputs the behavior video set to be recognized into the Slowfast network model, and obtains the first behavior recognition result, including:
and inputting the behavior video set to be recognized into a Slowfast network model, acquiring a plurality of groups of Slowfast network recognition results, and taking the average value of the plurality of groups of Slowfast network recognition results as a first behavior recognition result.
In this embodiment, since the behavior video set to be recognized contains multiple videos, each video input into the SlowFast network model yields one group of SlowFast network recognition results; on the basis of obtaining multiple groups of results, this embodiment takes their average to obtain the first behavior recognition result. Averaging over multiple groups of SlowFast network recognition results makes the obtained first behavior recognition result more stable.
Based on the above embodiment, the Slowfast network model includes a Non-local module and a spatio-temporal attention module, and the Slowfast network model uses an ELU function as an activation function.
In this embodiment, the Slowfast network model includes a Slow channel and a Fast channel, and the step of identifying the behavior video set to be identified by the Slowfast network model includes: the RGB video is input into a Fast channel to obtain an RGB video identification result, the frame difference video is input into the Fast channel to obtain a frame difference video identification result, and the average value of the RGB video identification result and the frame difference video identification result is obtained, so that the Fast channel identification result is obtained. Similarly, the RGB video is input into the Slow channel to obtain the RGB video identification result, the frame difference video is input into the Slow channel to obtain the frame difference video identification result, the average value of the RGB video identification result and the frame difference video identification result is obtained to obtain the Slow channel identification result, and finally the identification result of the double-current network module is obtained according to the Slow channel identification result and the Fast channel identification result.
As shown in FIG. 2, both the Slow channel (Slow pathway) and the Fast channel (Fast pathway) use a 3D ResNet model, running a 3D convolution operation immediately after capturing a number of frames. The Slow channel uses a large temporal stride τ (i.e., the number of frames skipped between two sampled frames), typically set to 16, meaning that about 2 frames are acquired per second. The Fast channel uses a small temporal stride τ/α, where α is typically set to 8, so that about 15 frames can be acquired per second. Although the Fast channel samples at a higher temporal frequency, its calculation amount is about 4 times smaller than that of the Slow channel, because the Fast channel is kept lightweight by using a smaller convolution width (the number of filters used), usually set to 1/8 of the Slow channel convolution width (this value is denoted as β).
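The two sampling rates can be illustrated by listing the frame indices each pathway reads; τ = 16 and α = 8 follow the values above, and the 64-frame clip length is an assumption made only for this demonstration.

```python
def pathway_indices(num_frames, stride):
    """Frame indices read by a pathway sampling at the given temporal stride."""
    return list(range(0, num_frames, stride))

tau, alpha = 16, 8
slow = pathway_indices(64, tau)           # coarse sampling for spatial semantics
fast = pathway_indices(64, tau // alpha)  # alpha-times denser sampling for motion
```

The Fast pathway reads α times as many frames as the Slow pathway over the same clip, which is exactly the speed ratio shown in FIG. 3.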
As shown in FIG. 3, the size of a convolution kernel is denoted as {T×S², C}, where T, S and C represent the temporal, spatial and channel dimensions, respectively. The stride is denoted as {temporal stride, spatial stride²}. The speed ratio α is 8, the channel ratio β is 1/8, and τ is 16. In order to make the model achieve a better effect, the Slowfast network model in this embodiment includes:
(1) A Non-local module is added to the SlowFast network, which solves the problem that a purely local module cannot acquire global information, so that later layers can obtain richer information. The Non-local operation is formulated as follows:
$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$
where i and j each represent a spatial position of the input x, x_i represents the vector at position i (its dimension equal to the number of channels of x), f is a function computing the similarity of any two positions, g is a mapping function (mapping a position to a vector, which can be regarded as computing the feature of that position), and C(x) is a normalization factor.
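The Non-local operation above can be sketched for scalar features; f and g here are placeholder callables supplied by the caller, and C(x) is taken as the sum of the similarity weights for each output position (a common normalization choice, assumed rather than stated in the text):

```python
def non_local(x, f, g):
    """y_i = (1/C(x)) * sum_j f(x_i, x_j) * g(x_j), with C(x) = sum_j f(x_i, x_j)."""
    y = []
    for xi in x:
        weights = [f(xi, xj) for xj in x]  # similarity of position i to every j
        c = sum(weights)                   # normalization factor C(x)
        y.append(sum(w * g(xj) for w, xj in zip(weights, x)) / c)
    return y

# With a constant similarity and identity mapping, every output equals the
# global mean -- each position aggregates information from all positions.
y = non_local([1.0, 2.0, 3.0], f=lambda a, b: 1.0, g=lambda v: v)
```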
(2) The ReLU activation function is replaced with an ELU function. The linear part on the right of the ELU activation function mitigates gradient vanishing, while the left part makes the unit more robust to input variation or noise. The average output of the ELU is close to 0, so convergence is faster, the information of each layer of the SlowFast network can be captured more comprehensively, and the accuracy of the recognition result is improved. The formula of the ELU is:
$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\left(e^{x} - 1\right), & x \le 0 \end{cases}$$
where α is an adjustable parameter that controls when the negative part of the ELU saturates.
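A direct transcription of the ELU formula, with the same adjustable α:

```python
import math

def elu(x, alpha=1.0):
    """ELU activation: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

Unlike ReLU, the negative side saturates smoothly at −α instead of clipping to zero, which is what keeps the mean activation near zero.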
(3) A spatio-temporal attention module (Spatio-temporal attention) is added to the 3D convolutional layer.
In this embodiment, the Fast channel identification result is input to the Slow channel via a lateral connection; feeding the identification data of the Fast channel into the Slow channel through lateral connections allows the Slow channel to obtain the processing results of the Fast channel. Since the shape of a single data sample differs between the two channels (the Fast channel is {αT, S², βC} while the Slow channel is {T, S², αβC}), the result of the Fast channel must first be transformed before being fused into the Slow channel, finally yielding the Slow channel identification result. At the end of the Fast and Slow channels, SlowFast performs a dimensionality reduction, namely global average pooling; the results of the two channels are then combined and fed into a fully-connected classification layer, which uses the softmax function to identify the action occurring in the behavior video, finally producing the identification result of the dual-stream network module.
Based on the above embodiment, step 130 inputs the behavior video set to be recognized into the TSM network model, and obtains a second behavior recognition result, including:
and inputting the behavior video set to be identified into the TSM network model, acquiring a plurality of groups of TSM network identification results, and taking the average value of the plurality of groups of TSM network identification results as a second behavior identification result.
In this embodiment, since the behavior video set to be recognized includes a plurality of videos, each video input into the TSM network model yields one group of TSM network recognition results; the average value over the plurality of groups of TSM network recognition results is then taken as the second behavior recognition result. Acquiring multiple groups of TSM network recognition results in this way makes the second behavior recognition result more stable.
In this embodiment, it should be noted that the TSM network mainly constructs a temporal shift module that can be inserted into any two-dimensional CNN to implement temporal modeling. As shown in fig. 4, for each model input, the video is divided into N segments, one frame is sampled from each segment, and the N frames are input to the network model. The sampled frames thus span the entire video, supporting long-range temporal relationship modeling. In the TSM network, part of the channels are shifted forward by one step along the temporal dimension, part of the channels are shifted backward by one step, and the vacated positions are filled with zeros. In this way, context interaction along the temporal dimension is introduced into the feature map, improving the modeling capability in the time dimension.
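The channel shift can be sketched on a [T][C] feature map (one channel vector per time step); `fold`, the number of channels shifted in each direction, is an illustrative parameter not fixed by the text:

```python
def temporal_shift(x, fold):
    """Shift the first `fold` channels forward one time step and the next
    `fold` channels backward one step, filling vacated slots with zeros.
    `x` is a [T][C] nested list: one channel vector per frame."""
    T, C = len(x), len(x[0])
    out = [[0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold and t - 1 >= 0:                # forward: frame t sees t-1
                out[t][c] = x[t - 1][c]
            elif fold <= c < 2 * fold and t + 1 < T:   # backward: frame t sees t+1
                out[t][c] = x[t + 1][c]
            elif c >= 2 * fold:                        # remaining channels unchanged
                out[t][c] = x[t][c]
    return out

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # T=3 frames, C=4 channels
shifted = temporal_shift(x, fold=1)
```

After the shift, each time step mixes information from its temporal neighbors, which is the context interaction the paragraph above describes.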
As shown in fig. 5, a (2+1)D convolution module is added to the TSM, in which a 2D convolution followed by a 1D convolution approximates a 3D convolution while keeping the number of parameters the same. Compared with the 3D module in a), although the parameter count is unchanged, the R(2+1)D structure in b) contains more ReLU activation layers, so the expressive capability of the model is stronger, the calculation amount of the model is reduced, the training efficiency is improved, and the model is easier to train and optimize.
Based on the above embodiment, acquiring the recognition result of the original behavior video based on the first behavior recognition result and the second behavior recognition result includes:
and taking the average value of the first behavior recognition result and the second behavior recognition result as the recognition result of the original behavior video.
In this embodiment, it should be noted that, because the Slow path of the SlowFast network operates at a low frame rate to capture spatial semantics while the Fast path operates at a high frame rate with fine temporal resolution to capture motion, the behavior recognition result can be accurately obtained; meanwhile, the TSM network better describes temporal-domain information characteristics, so that the obtained behavior recognition result better represents those characteristics.
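Taking the average of the two recognition results then reduces to an element-wise mean, after which the predicted behavior is the arg-max class (the scores below are illustrative):

```python
def fuse_results(slowfast_scores, tsm_scores):
    """Recognition result of the original video: mean of the two models' scores."""
    return [(a + b) / 2.0 for a, b in zip(slowfast_scores, tsm_scores)]

fused = fuse_results([0.8, 0.1, 0.1], [0.6, 0.3, 0.1])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```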
Based on the above embodiment, as shown in fig. 6, after training the Slowfast network model and the TSM network model based on the behavior video set of the sample to be recognized, the method further includes:
the method for testing and verifying the Slowfast network model and the TSM network model based on the original test behavior video specifically comprises the following steps:
step 610, inputting an original test video into a data processing module for data preprocessing to obtain a test behavior video set;
step 620, inputting the test behavior video set into a Slowfast network model to obtain a first test result;
step 630, inputting the test behavior video set into the TSM network model to obtain a second test result;
and step 640, acquiring a test result of the original test video based on the first test result and the second test result.
In this embodiment, an original test video is obtained from a public video data set, a test behavior video set is obtained based on data preprocessing on the original test video, the test behavior video set is respectively input into a Slowfast network model and a TSM network model, multiple groups of first test results and multiple groups of second test results are obtained, an average value of the multiple groups of first test results and the multiple groups of second test results is obtained, and a test result of the original test video is obtained.
The data preprocessing comprises the steps of obtaining a test RGB video and a test frame difference video corresponding to an original test video, and then respectively carrying out data enhancement on the test RGB video and the test frame difference video.
Based on the foregoing embodiment, as shown in fig. 7, the behavior recognition method provided by this embodiment includes:
the method comprises the steps of obtaining an original behavior video to be recognized, firstly, conducting data preprocessing on the behavior video, then inputting the behavior video after data preprocessing into a Slowfast network and a TSM network respectively, obtaining recognition results of the Slowfast network and the TSM network, fusing the two recognition results (obtaining an average value), and obtaining a final recognition result.
The following describes the behavior recognition device provided in the present application, and the behavior recognition device described below and the behavior recognition method described above may be referred to in correspondence with each other.
Based on the above-described embodiment, as shown in fig. 8, the present application provides a behavior recognition apparatus, including:
the acquiring unit 810 is configured to perform data preprocessing on an original behavior video, and acquire a behavior video set to be identified;
a first identification unit 820, configured to input the behavior video set to be identified into a Slowfast network model, and obtain a first behavior identification result;
the second identification unit 830 is configured to input the behavior video set to be identified into the TSM network model, and obtain a second behavior identification result;
a third identifying unit 840 configured to obtain an identification result of the original behavior video based on the first behavior identification result and the second behavior identification result;
the Slowfast network model and the TSM network model are obtained by training behavior recognition results of a behavior video set of a sample to be recognized and a behavior video set of the sample to be recognized, and the behavior video set of the sample to be recognized is obtained by performing data preprocessing on an original sample behavior video.
Based on the above embodiment, as shown in fig. 9, the obtaining unit 810 is configured to perform data preprocessing on an original behavior video, and obtain a behavior video set to be identified, where the acquiring unit includes: sequentially carrying out video length processing, video mode processing and data enhancement processing on the original behavior video; the acquisition unit 810 includes:
the video length processing unit 811 is configured to perform video length processing on the original behavior video, and includes: if the length of the original behavior video is judged to be larger than a preset value, sampling the original behavior video by taking the preset value as the length; if the length of the original behavior video is judged to be smaller than the preset value, the length of the original behavior video is filled to the preset value based on video interpolation;
a video mode processing unit 812, configured to perform video mode processing on the raw behavior video, including: respectively acquiring an RGB video and a frame difference video of an original behavior video after video length processing;
the data enhancement processing unit 813 is configured to perform data enhancement processing on the original behavior video, and includes: and respectively carrying out data enhancement on the RGB video and the frame difference video, wherein the data enhancement comprises one or more of mirror image turning, video reverse playing, video cutting and video splicing.
Based on the above embodiment, the first identifying unit 820 is configured to:
and inputting the behavior video set to be recognized into a Slowfast network model, acquiring a plurality of groups of Slowfast network recognition results, and taking the average value of the plurality of groups of Slowfast network recognition results as the first behavior recognition result.
Based on the above embodiment, the Slowfast network model includes a Non-local module and a spatio-temporal attention module, and the Slowfast network model takes an ELU function as an activation function.
Based on the above embodiment, the second identifying unit 830 is configured to:
and inputting the behavior video set to be identified into the TSM network model, acquiring a plurality of groups of TSM network identification results, and taking the average value of the plurality of groups of TSM network identification results as a second behavior identification result.
Based on the above embodiment, the third identifying unit 840 is configured to:
and taking the average value of the first behavior recognition result and the second behavior recognition result as the recognition result of the original behavior video.
Based on the above embodiment, the method further includes a testing unit, configured to perform test verification on the Slowfast network model and the TSM network model based on an original test behavior video after training the Slowfast network model and the TSM network model based on a to-be-identified sample behavior video set, where the testing unit includes:
the test acquisition unit is used for inputting the original test video into the data processing module for data preprocessing to acquire a test behavior video set;
the first testing subunit is used for inputting a testing behavior video set into the Slowfast network model to obtain a first testing result;
the second testing subunit is used for inputting the testing behavior video set into the TSM network model to obtain a second testing result;
and the third testing subunit is used for acquiring the testing result of the original testing video based on the first testing result and the second testing result.
The behavior recognition device provided in the embodiment of the present application is used for executing the behavior recognition method, and the implementation manner of the behavior recognition device is consistent with that of the behavior recognition method provided in the present application, and the same beneficial effects can be achieved, and details are not repeated here.
Fig. 10 illustrates a physical structure diagram of an electronic device, and as shown in fig. 10, the electronic device may include: a processor (processor)1010, a communication Interface (Communications Interface)1020, a memory (memory)1030, and a communication bus 1040, wherein the processor 1010, the communication Interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. Processor 1010 may call logic instructions in memory 1030 to perform a behavior recognition method comprising: performing data preprocessing on the original behavior video to obtain a behavior video set to be identified; inputting the behavior video set to be recognized into a Slowfast network model to obtain a first behavior recognition result; inputting the behavior video set to be recognized into a TSM network model, and acquiring a second behavior recognition result; acquiring the recognition result of the original behavior video based on the first behavior recognition result and the second behavior recognition result; the Slowfast network model and the TSM network model are obtained by training behavior recognition results based on a to-be-recognized sample behavior video set and a to-be-recognized sample behavior video set, and the to-be-recognized sample behavior video set is obtained by performing data preprocessing on an original sample behavior video.
Furthermore, the logic instructions in the memory 1030 can be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The processor 1010 in the electronic device provided in the embodiment of the present application may call the logic instruction in the memory 1030 to implement the behavior recognition method, and an implementation manner of the behavior recognition method is consistent with that of the behavior recognition method provided in the present application, and the same beneficial effects may be achieved, which is not described herein again.
On the other hand, the present application further provides a computer program product, which is described below, and the computer program product described below and the behavior recognition method described above may be referred to correspondingly.
The computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the behavior recognition method provided by the above methods, the method comprising: performing data preprocessing on the original behavior video to obtain a behavior video set to be identified; inputting the behavior video set to be recognized into a Slowfast network model to obtain a first behavior recognition result; inputting the behavior video set to be recognized into a TSM network model, and acquiring a second behavior recognition result; acquiring the recognition result of the original behavior video based on the first behavior recognition result and the second behavior recognition result; the Slowfast network model and the TSM network model are obtained by training behavior recognition results based on a to-be-recognized sample behavior video set and a to-be-recognized sample behavior video set, and the to-be-recognized sample behavior video set is obtained by performing data preprocessing on an original sample behavior video.
When the computer program product provided by the embodiment of the present application is executed, the behavior recognition method is implemented, and an implementation manner of the behavior recognition method is consistent with that of the behavior recognition method provided by the present application, and the same beneficial effects can be achieved, and details are not described here.
In yet another aspect, the present application further provides a non-transitory computer-readable storage medium, which is described below, and the non-transitory computer-readable storage medium described below and the behavior recognition method described above are referred to in correspondence with each other.
The present application also provides a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processor, is implemented to perform the behavior recognition methods provided above, the method comprising: performing data preprocessing on the original behavior video to obtain a behavior video set to be identified; inputting the behavior video set to be recognized into a Slowfast network model to obtain a first behavior recognition result; inputting the behavior video set to be recognized into a TSM network model, and acquiring a second behavior recognition result; acquiring the recognition result of the original behavior video based on the first behavior recognition result and the second behavior recognition result; the Slowfast network model and the TSM network model are obtained by training behavior recognition results based on a to-be-recognized sample behavior video set and a to-be-recognized sample behavior video set, and the to-be-recognized sample behavior video set is obtained by performing data preprocessing on an original sample behavior video.
When the computer program stored on the non-transitory computer readable storage medium provided in the embodiment of the present application is executed, the behavior recognition method is implemented, and an implementation manner of the behavior recognition method is consistent with that of the behavior recognition method provided in the present application, and the same beneficial effects can be achieved, and details are not repeated here.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A method of behavior recognition, comprising:
performing data preprocessing on the original behavior video to obtain a behavior video set to be identified;
inputting the behavior video set to be recognized into a Slowfast network model to obtain a first behavior recognition result;
inputting the behavior video set to be recognized into a TSM network model, and acquiring a second behavior recognition result;
acquiring the recognition result of the original behavior video based on the first behavior recognition result and the second behavior recognition result;
the Slowfast network model and the TSM network model are obtained by training behavior recognition results based on a to-be-recognized sample behavior video set and a to-be-recognized sample behavior video set, and the to-be-recognized sample behavior video set is obtained by performing data preprocessing on an original sample behavior video.
2. The behavior recognition method according to claim 1, wherein the data preprocessing is performed on the original behavior video to obtain a behavior video set to be recognized, and the method comprises the following steps: sequentially carrying out video length processing, video mode processing and data enhancement processing on the original behavior video;
the video length processing includes: if the length of the original behavior video is judged to be larger than a preset value, sampling the original behavior video by taking the preset value as the length; if the length of the original behavior video is judged to be smaller than a preset value, the length of the original behavior video is filled to the preset value based on video interpolation;
the video mode processing includes: respectively acquiring an RGB video and a frame difference video of an original behavior video after the video length processing;
the data enhancement processing includes: and respectively performing data enhancement on the RGB video and the frame difference video, wherein the data enhancement comprises one or more of mirror image turning, video reverse playing, video cutting and video splicing.
3. The behavior recognition method according to claim 1, wherein inputting the video set of behaviors to be recognized into a Slowfast network model, and obtaining a first behavior recognition result comprises:
and inputting the behavior video set to be recognized into a Slowfast network model, acquiring a plurality of groups of Slowfast network recognition results, and taking the average value of the plurality of groups of Slowfast network recognition results as the first behavior recognition result.
4. The behavior recognition method according to claim 3, wherein the Slowfast network model comprises a Non-local module and a spatio-temporal attention module, and the Slowfast network model has an ELU function as an activation function.
5. The behavior recognition method according to claim 1, wherein inputting the video set of behaviors to be recognized into a TSM network model to obtain a second behavior recognition result comprises:
and inputting the behavior video set to be recognized into a TSM network model, acquiring a plurality of groups of TSM network recognition results, and taking the average value of the plurality of groups of TSM network recognition results as the second behavior recognition result.
6. The behavior recognition method according to claim 1, wherein obtaining the recognition result of the original behavior video based on the first behavior recognition result and the second behavior recognition result comprises:
and taking the average value of the first behavior recognition result and the second behavior recognition result as the recognition result of the original behavior video.
7. The behavior recognition method according to any one of claims 1 to 6, wherein after training the Slowfast network model and the TSM network model based on a to-be-recognized sample behavior video set, the method further comprises:
performing test verification on the Slowfast network model and the TSM network model based on an original test behavior video, specifically comprising:
inputting an original test video into a data processing module for data preprocessing to obtain a test behavior video set;
inputting the test behavior video set into the Slowfast network model to obtain a first test result;
inputting the test behavior video set into the TSM network model to obtain a second test result;
and acquiring the test result of the original test video based on the first test result and the second test result.
8. A behavior recognition apparatus, comprising:
the acquiring unit is used for carrying out data preprocessing on the original behavior video and acquiring a behavior video set to be identified;
the first identification unit is used for inputting the behavior video set to be identified into a Slowfast network model and acquiring a first behavior identification result;
the second identification unit is used for inputting the behavior video set to be identified into a TSM network model and acquiring a second behavior identification result;
a third identification unit, configured to obtain an identification result of the original behavior video based on the first behavior identification result and the second behavior identification result;
the Slowfast network model and the TSM network model are obtained by training behavior recognition results based on a to-be-recognized sample behavior video set and a to-be-recognized sample behavior video set, and the to-be-recognized sample behavior video set is obtained by performing data preprocessing on an original sample behavior video.
9. The behavior recognition device according to claim 8, wherein the obtaining unit is configured to perform data preprocessing on an original behavior video and obtain a behavior video set to be recognized, and includes: sequentially carrying out video length processing, video mode processing and data enhancement processing on the original behavior video; the acquisition unit includes:
the video length processing unit is used for carrying out video length processing on the original behavior video, and comprises: if the length of the original behavior video is judged to be larger than a preset value, sampling the original behavior video by taking the preset value as the length; if the length of the original behavior video is judged to be smaller than a preset value, the length of the original behavior video is filled to the preset value based on video interpolation;
the video mode processing unit is used for carrying out video mode processing on the original behavior video and comprises the following steps: respectively acquiring an RGB video and a frame difference video of an original behavior video after the video length processing;
the data enhancement processing unit is used for performing data enhancement processing on the original behavior video and comprises: and respectively performing data enhancement on the RGB video and the frame difference video, wherein the data enhancement comprises one or more of mirror image turning, video reverse playing, video cutting and video splicing.
10. The behavior recognition device according to claim 8, wherein the first recognition unit is configured to:
and inputting the behavior video set to be recognized into a Slowfast network model, acquiring a plurality of groups of Slowfast network recognition results, and taking the average value of the plurality of groups of Slowfast network recognition results as the first behavior recognition result.
11. The behavior recognition device according to claim 8, wherein the second recognition unit is configured to:
inputting the behavior video set to be recognized into a TSM network model, obtaining a plurality of groups of TSM network recognition results, and taking the average of the plurality of groups of TSM network recognition results as the second behavior recognition result.
12. The behavior recognition device according to claim 8, wherein the third recognition unit is configured to:
taking the average of the first behavior recognition result and the second behavior recognition result as the recognition result of the original behavior video.
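The score-fusion scheme of claims 10 to 12 — averaging several groups of per-model recognition results, then averaging the two models' outputs — can be illustrated with a minimal sketch (the function name and the score values are hypothetical, with each group of results modeled as a list of per-class scores):

```python
def average_scores(score_groups):
    """Element-wise mean over several groups of class scores."""
    n = len(score_groups)
    return [sum(col) / n for col in zip(*score_groups)]

# Multi-group averaging within each model (claims 10 and 11),
# e.g. scores from several sampled or augmented clips of one video.
slowfast_result = average_scores([[0.8, 0.2], [0.6, 0.4]])
tsm_result      = average_scores([[0.5, 0.5], [0.9, 0.1]])

# Final fusion across the two models (claim 12).
final_result = average_scores([slowfast_result, tsm_result])
```

The predicted behavior class would then be the index of the largest value in `final_result`.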
13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the behavior recognition method according to any one of claims 1 to 7.
14. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the behavior recognition method according to any one of claims 1 to 7.
CN202011438565.XA 2020-12-07 2020-12-07 Behavior recognition method and device, electronic equipment and storage medium Pending CN112597824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011438565.XA CN112597824A (en) 2020-12-07 2020-12-07 Behavior recognition method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112597824A true CN112597824A (en) 2021-04-02

Family

ID=75192312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011438565.XA Pending CN112597824A (en) 2020-12-07 2020-12-07 Behavior recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112597824A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321761A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 Behavior recognition method, terminal device and computer-readable storage medium
CN111079507A (en) * 2019-10-18 2020-04-28 深兰科技(重庆)有限公司 Behavior recognition method and device, computer device and readable storage medium
CN111291707A (en) * 2020-02-24 2020-06-16 南京甄视智能科技有限公司 Abnormal behavior identification method and device, storage medium and server
CN111462183A (en) * 2020-03-31 2020-07-28 山东大学 Behavior identification method and system based on attention mechanism double-current network
CN111523566A (en) * 2020-03-31 2020-08-11 易视腾科技股份有限公司 Target video clip positioning method and device
CN111859023A (en) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 Video classification method, device, equipment and computer readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Sheng~: "[Behavior Recognition] TSN/TRN/TSM/SlowFast/Non-local", pages 1 - 8, Retrieved from the Internet <URL:https://blog.csdn.net/qq_43257640/article/details/106304506> *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990116A (en) * 2021-04-21 2021-06-18 四川翼飞视科技有限公司 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium
CN113158937A (en) * 2021-04-28 2021-07-23 合肥移瑞通信技术有限公司 Sleep monitoring method, device, equipment and readable storage medium
CN113743306A (en) * 2021-09-06 2021-12-03 浙江广厦建设职业技术大学 Method for analyzing abnormal behaviors of real-time intelligent video monitoring based on slowfast double-frame rate
CN114067436A (en) * 2021-11-17 2022-02-18 山东大学 Fall detection method and system based on wearable sensor and video monitoring
CN114067436B (en) * 2021-11-17 2024-03-05 山东大学 Fall detection method and system based on wearable sensor and video monitoring
CN114359791A (en) * 2021-12-16 2022-04-15 北京信智文科技有限公司 Group macaque appetite detection method based on Yolo v5 network and SlowFast network
CN114359791B (en) * 2021-12-16 2023-08-01 北京信智文科技有限公司 Group macaque appetite detection method based on Yolo v5 network and SlowFast network

Similar Documents

Publication Publication Date Title
CN110147711B (en) Video scene recognition method and device, storage medium and electronic device
CN110532996B (en) Video classification method, information processing method and server
Zhao et al. Learning to forecast and refine residual motion for image-to-video generation
CN109214343B (en) Method and device for generating face key point detection model
CN112597824A (en) Behavior recognition method and device, electronic equipment and storage medium
WO2019242222A1 (en) Method and device for use in generating information
CN112149459A (en) Video salient object detection model and system based on cross attention mechanism
CN116686017A (en) Time bottleneck attention architecture for video action recognition
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN111488932B (en) Self-supervision video time-space characterization learning method based on frame rate perception
CN111539290A (en) Video motion recognition method and device, electronic equipment and storage medium
CN109948721A (en) Video scene classification method based on video description
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN112836602B (en) Behavior recognition method, device, equipment and medium based on space-time feature fusion
Wang et al. A survey of deep face restoration: Denoise, super-resolution, deblur, artifact removal
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN112560827A (en) Model training method, model training device, model prediction method, electronic device, and medium
CN111242068B (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN113177450A (en) Behavior recognition method and device, electronic equipment and storage medium
CN112818904A (en) Crowd density estimation method and device based on attention mechanism
CN112818958B (en) Action recognition method, device and storage medium
CN112380395B (en) Method and system for obtaining emotion of graph convolution network based on double-flow architecture and storage medium
Li et al. Multi-Scale correlation module for video-based facial expression recognition in the wild
CN110633630B (en) Behavior identification method and device and terminal equipment
CN112383824A (en) Video advertisement filtering method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination