CN115512267A - Video behavior identification method, system, device and equipment - Google Patents

Video behavior identification method, system, device and equipment

Info

Publication number
CN115512267A
Authority
CN
China
Prior art keywords
features
correlation
layer
action
video
Prior art date
Legal status
Pending
Application number
CN202211185729.1A
Other languages
Chinese (zh)
Inventor
钱一琛
孙修宇
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211185729.1A
Publication of CN115512267A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a video behavior identification method, system, device and equipment. The method samples a plurality of video frames and inputs them into a behavior recognition model, which extracts multi-layer action features of different spatial scales; correlation analysis is performed separately on the multi-layer action features to obtain multi-layer correlation features, where the correlation features of different layers have different spatial scales; the multi-layer correlation features are fused to obtain multi-scale fused correlation features, yielding a dense correlation feature field that enables the behavior recognition model to capture finer action features, such as the movement of small objects and fast motions; and behavior classification and recognition is performed on the fusion features obtained by fusing the multi-scale fused correlation features with the extracted action features, producing a behavior recognition result. This improves the effect and performance of video behavior recognition and the robustness of the behavior recognition model in practical application scenarios.

Description

Video behavior identification method, system, device and equipment
Technical Field
The present application relates to computer technologies, and in particular, to a method, a system, an apparatus, and a device for identifying video behaviors.
Background
Video behavior recognition technology is widely used in many fields, such as intelligent monitoring, human-computer interaction, video sequence understanding, medical health, and intelligent education.
Traditional video behavior recognition models model action features based on the posture key points of target objects in videos, and these features are usually designed manually; deep-learning-based video behavior recognition models model action features through convolutional neural networks. However, neither approach explicitly models spatio-temporal movement information, and both lack the ability to model fine action features such as small objects and fast motions, so their recognition performance is poor in practical application scenarios.
Disclosure of Invention
The application provides a video behavior identification method, system, device and equipment, to solve the problem of poor recognition performance in existing behavior recognition methods.
In one aspect, the present application provides a video behavior identification method, including:
acquiring video data to be identified, wherein the video data comprises a plurality of video frames;
inputting the video frames into a behavior recognition model, and extracting multi-layer action characteristics of the video frames through the behavior recognition model, wherein the action characteristics of different layers have different spatial scales;
respectively performing correlation analysis on the multi-layer action features to obtain multi-layer correlation features, wherein the correlation features of different layers have different spatial scales; fusing the multi-layer correlation features to obtain multi-scale fused correlation features, and fusing the multi-scale fused correlation features into the extracted action features to obtain fusion features;
and performing behavior classification and identification according to the fusion characteristics to obtain a behavior identification result of the target in the video data.
In another aspect, the present application provides a video behavior recognition system, including:
an end-side device and a cloud-side device, wherein the end-side device is used for acquiring a video to be recognized and sampling the video to obtain a plurality of video frames;
the cloud-side device is used for inputting the video frames into a behavior recognition model and extracting multi-layer action features of the video frames through the behavior recognition model, wherein the action features of different layers have different spatial scales; respectively performing correlation analysis on the multi-layer action features to obtain multi-layer correlation features, wherein the correlation features of different layers have different spatial scales; fusing the multi-layer correlation features to obtain multi-scale fused correlation features, and fusing the multi-scale fused correlation features into the extracted action features to obtain fusion features; and performing behavior classification and recognition according to the fusion features to obtain a behavior recognition result of the target in the video data;
the cloud side device is further used for sending the behavior recognition result to the end side device;
and the end-side equipment is also used for carrying out post-processing according to the behavior recognition result and outputting a post-processing result.
In another aspect, the present application provides a video behavior recognition apparatus, including:
the video data acquisition module is used for acquiring video data to be identified, and the video data comprises a plurality of video frames;
the characteristic extraction module is used for inputting the video frames into a behavior recognition model, and extracting multi-layer action characteristics of the video frames through the behavior recognition model, wherein the action characteristics of different layers have different spatial scales;
the correlation module is used for respectively performing correlation analysis on the multi-layer action features to obtain multi-layer correlation features, wherein the correlation features of different layers have different spatial scales; fusing the multi-layer correlation features to obtain multi-scale fused correlation features, and fusing the multi-scale fused correlation features into the extracted action features to obtain fusion features;
and the classification identification module is used for performing behavior classification identification according to the fusion characteristics to obtain a behavior identification result of the target in the video data.
In another aspect, the present application provides an electronic device comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the video behavior recognition method of any one of the above aspects.
In another aspect, the present application provides a computer-readable storage medium, in which computer-executable instructions are stored, and when executed by a processor, the computer-executable instructions are used to implement the video behavior recognition method according to any one of the above aspects.
In another aspect, the present application provides a computer program product comprising a computer program, which when executed by a processor, implements the video behavior recognition method according to any of the above aspects.
According to the video behavior recognition method, system, device and equipment provided by the application, action feature extraction is performed by inputting the video frames into the feature extraction module of a behavior recognition model, where the feature extraction module comprises a plurality of feature extraction layers that extract action features of different spatial scales. Correlation analysis is performed separately on the action features extracted by the plurality of feature extraction layers to obtain multi-layer correlation features, where the correlation features of different layers have different spatial scales and the correlation features of each layer capture the movement information of the target object. The multi-layer correlation features are fused to obtain multi-scale fused correlation features, yielding a dense correlation feature field, so that the behavior recognition model can capture finer action features, such as the movement of small objects and fast motions. The multi-scale fused correlation features are then fused into the action features extracted by the feature extraction module, so that those action features contain finer information such as the movement information of small objects and of fast actions; behavior classification and recognition is performed on the fusion features to obtain the behavior recognition result of the target in the video data. This improves the effect and performance of video behavior recognition and the robustness of the behavior recognition model in practical application scenarios.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of an exemplary network architecture to which the present application is applicable;
FIG. 2 is a schematic diagram of another example network architecture to which the present application is applicable;
FIG. 3 is a flowchart of a video behavior recognition method according to an exemplary embodiment of the present application;
FIG. 4 is an exemplary diagram of an overall framework for video behavior recognition provided in an exemplary embodiment of the present application;
FIG. 5 is an exemplary diagram of an overall framework for video behavior recognition provided by another exemplary embodiment of the present application;
FIG. 6 is an exemplary diagram of an overall framework for video behavior recognition provided by yet another exemplary embodiment of the present application;
FIG. 7 is a flowchart of a video behavior recognition method according to another exemplary embodiment of the present application;
FIG. 8 is a flowchart of a behavior recognition model training method provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a video behavior recognition system provided in an exemplary embodiment of the present application;
FIG. 10 is a flowchart of a method for video behavior recognition according to an exemplary embodiment of the present application;
FIG. 11 is a schematic structural diagram of a video behavior recognition apparatus according to an exemplary embodiment of the present application;
FIG. 12 is a schematic structural diagram of a video behavior recognition apparatus according to another exemplary embodiment of the present application;
FIG. 13 is a schematic structural diagram of an electronic device according to an example embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
The terms referred to in the present application are explained first:
video behavior identification: refers to identifying one or more motion behaviors of a target object from video data.
And (3) action modeling: the moving information of objects in the video is analyzed and learned.
Visual relevance: the similarity between image textures is generally calculated in units of pixels.
Spatial scale: refers to the spatial resolution of the feature.
Traditional video behavior recognition models model action features based on the posture key points of target objects in a video, and these features are usually designed manually; deep-learning-based video behavior recognition models model action features through convolutional neural networks.
However, neither approach explicitly models spatio-temporal movement information, and both lack the ability to model fine action features such as small objects and fast motions, so their recognition accuracy is low and their recognition effect is poor in practical application scenarios.
A deep-learning-based video behavior recognition method applies visual correlation analysis: correlation analysis is performed at the image level to learn the movement information of objects in a video. However, during the subsequent feature extraction the resolution of these features (the movement information) is repeatedly reduced by multiple downsampling operations, so the position information of small objects disappears in the deep layers and the spatial features no longer match the semantic features. The method therefore still lacks the ability to model the actions of small objects and fast-moving objects, and its recognition performance is poor in practical application scenarios.
To solve these technical problems, the application provides a video behavior recognition method that performs video behavior recognition based on a trained video behavior recognition model. The model comprises a plurality of feature extraction layers that extract action features of different spatial scales: different feature extraction layers are configured with different spatial scales and therefore extract action features of different spatial scales. The method performs correlation analysis separately on the action features extracted by the plurality of feature extraction layers to obtain multi-layer correlation features; the correlation features of any layer have the same spatial scale as that layer's action features, and the correlation features of different layers have different spatial scales. The multi-layer correlation features are then fused to obtain multi-scale fused correlation features, yielding a dense correlation feature field through which the video behavior recognition model can capture finer motion features such as small-object movement and fast movement. Finally, the multi-scale fused correlation features are fused into the action features extracted by the feature extraction module, and behavior classification and recognition is performed on the fusion features to obtain the behavior recognition result of the target in the video data, which improves the robustness of the model in practical application scenarios and the effect and performance of video behavior recognition.
The video behavior identification method provided by the embodiment can be applied to a plurality of fields such as security, human-computer interaction, video understanding, medical health, intelligent education, intelligent transportation and the like, and has very wide application.
Illustratively, an example application scenario in the security field is intelligent monitoring: based on the monitored video, a plurality of video frames are obtained by sampling, the video frames are input into a behavior recognition model, and the behavior actions of target objects appearing in the video are recognized through the behavior recognition model to obtain a behavior recognition result. Further, based on the behavior recognition result, it can be analyzed and determined whether a target object in the video performs or is subjected to a dangerous behavior (such as theft, fighting, or peeping), so that dangerous behaviors can be recognized automatically and early-warning processing can be performed.
An exemplary application scenario when applied to the field of human-computer interaction is as follows: based on an input video containing an interactive object, a plurality of video frames are obtained through sampling, the video frames are input into a behavior recognition model, and behavior actions (such as gestures, body posture actions and the like) made by the interactive object in the video are recognized through the behavior recognition model to obtain a behavior recognition result. Further, based on the behavior recognition result, the action intention of the interactive object can be determined, and corresponding feedback information can be generated for the action intention of the interactive object and can be fed back to the interactive object.
An exemplary application scenario in the field of intelligent education is as follows: based on online or offline videos of a teacher giving lectures, a plurality of video frames are obtained by sampling, the video frames are input into a behavior recognition model, and the behavior actions of the teacher in the video are recognized through the behavior recognition model to obtain a behavior recognition result. Further, based on the behavior recognition result, it can be analyzed and determined whether the teacher performs a preset behavior (such as a behavior that does not meet the behavior specification), so that an online warning can be issued or the case can be provided to relevant personnel for offline processing, in order to standardize the teacher's classroom behavior.
An exemplary application scenario when applied to the intelligent transportation domain is as follows: the method comprises the steps of sampling to obtain a plurality of video frames based on videos, collected by a road side unit and/or a vehicle-mounted camera, of a driver driving a vehicle, inputting the video frames into a behavior recognition model, recognizing behavior actions of the driver in the videos through the behavior recognition model, and obtaining a behavior recognition result. Further, based on the behavior recognition result, whether the driver performs preset unsafe driving behaviors such as fatigue driving, smoking, calling and the like in the driving process can be analyzed and determined, so that the driver can be warned on line or provided for relevant personnel to perform offline processing, the safe driving behaviors of the driver are standardized, and the driving safety is improved.
In addition, when the method is applied to different application scenes/fields, the rule for sampling the video data can be flexibly configured, and the behavior recognition model is trained and determined by using the data set of the specific application scene/field.
Fig. 1 is a schematic diagram of an example network architecture to which the present application is applicable. As shown in fig. 1, the network architecture includes a first electronic device responsible for performing video behavior recognition, and a second electronic device responsible for capturing and providing video data to the first electronic device. The second electronic device is also responsible for performing post-processing of the behavior recognition result.
The first electronic device may be a server cluster deployed in the cloud, or a local device with computing capability, or an Internet of Things (IoT) device. The first electronic device is stored with a behavior recognition model which is trained by a training set of a specific application scene/field. Through the operation logic preset in the first electronic device, the first electronic device can realize the behavior recognition of a plurality of video frames obtained by sampling in the video data by using the behavior recognition model, and a behavior recognition result is obtained. Based on the behavior recognition result, the first electronic device may feed back to the second electronic device.
The second electronic device may specifically be a hardware device having a network communication function, an operation function, and an information display function, and includes, but is not limited to, a terminal such as a smart phone, a tablet computer, and a desktop computer, a server, an internet of things device, and the like.
Through communicative interaction with the first electronic device, a user may submit captured video data to the first electronic device through the second electronic device. The first electronic device can obtain a plurality of video frames from the video data by sampling and input the video frames into the feature extraction module of the behavior recognition model for action feature extraction, where the feature extraction module comprises a plurality of feature extraction layers that extract action features of different spatial scales; perform correlation analysis separately on the action features extracted by the feature extraction layers to obtain multi-layer correlation features, wherein the correlation features of different layers have different spatial scales; fuse the multi-layer correlation features to obtain multi-scale fused correlation features, and fuse the multi-scale fused correlation features into the action features extracted by the feature extraction module to obtain fusion features; and perform behavior classification and recognition according to the fusion features to obtain the behavior recognition result of the target in the video data. The first electronic device feeds the behavior recognition result back to the second electronic device, and the second electronic device can post-process the behavior recognition result based on post-processing rules and output the final result obtained after the post-processing.
Illustratively, taking the exemplary application scenario in the field of intelligent education as an example, the second electronic device may be a device that collects and stores videos of a teacher's lectures; when performing online behavior recognition, the second electronic device transmits the video data of the teacher's lecture to the first electronic device. The first electronic device obtains the video data of the teacher's lecture, samples it to obtain a plurality of video frames, inputs the video frames into a behavior recognition model, recognizes the behavior actions of the teacher in the video through the behavior recognition model to obtain a behavior recognition result, and feeds the behavior recognition result back to the second electronic device. Further, the second electronic device analyzes and determines, according to the behavior recognition result, whether the teacher performs a preset behavior (such as a behavior that does not meet the behavior specification), so that an online warning can be issued or the case can be provided to relevant personnel for offline processing, in order to standardize the teacher's classroom behavior.
Fig. 2 is a schematic diagram of another example network architecture to which the present application is applicable. As shown in fig. 2, the network architecture includes a first electronic device responsible for performing video behavior recognition, and a second electronic device responsible for capturing and providing video data to the first electronic device. The first electronic device is also responsible for performing post-processing of the behavior recognition result.
The first electronic device may be a server cluster deployed in the cloud, or a device having local computing capability, or an Internet of Things (IoT) device. The first electronic device is stored with a behavior recognition model which is trained by a training set of a specific application scene/field. Through the operation logic preset in the first electronic device, the first electronic device can realize the behavior recognition of a plurality of video frames obtained by sampling in the video data by using the behavior recognition model, and a behavior recognition result is obtained. Further, the first electronic device performs post-processing on the behavior recognition result based on the post-processing rule, and feeds back a final result obtained by the post-processing to the second electronic device.
The second electronic device may specifically be a hardware device having network communication, computing, and information display functions, including but not limited to terminals such as smart phones, tablet computers and desktop computers, servers, and Internet of Things devices. Through communicative interaction with the first electronic device, a user may submit captured video data to the first electronic device through the second electronic device. The first electronic device can obtain a plurality of video frames from the video data by sampling and input the video frames into the feature extraction module of the behavior recognition model for action feature extraction, where the feature extraction module comprises a plurality of feature extraction layers that extract action features of different spatial scales; perform correlation analysis separately on the action features extracted by the plurality of feature extraction layers to obtain multi-layer correlation features, wherein the correlation features of different layers have different spatial scales; fuse the multi-layer correlation features to obtain multi-scale fused correlation features, and fuse the multi-scale fused correlation features into the action features extracted by the feature extraction module to obtain fusion features; and perform behavior classification and recognition according to the fusion features to obtain the behavior recognition result of the target in the video data. Further, the first electronic device post-processes the behavior recognition result based on post-processing rules and feeds the final result obtained by the post-processing back to the second electronic device.
Illustratively, taking the example application scenario in the field of human-computer interaction as an example, the second electronic device may be a terminal device used by the interactive object, which collects video data of the interactive object and sends the video data to the first electronic device. The first electronic device receives the video data of the interactive object, samples it to obtain a plurality of video frames, inputs the video frames into a behavior recognition model, and recognizes the behavior actions (such as gestures and body actions) made by the interactive object in the video through the behavior recognition model to obtain a behavior recognition result. It then performs the following post-processing based on the behavior recognition result: determining the action intention of the interactive object and generating corresponding feedback information for that intention. Finally, the first electronic device sends the feedback information to the second electronic device, so that the second electronic device makes feedback to the interactive object based on the feedback information.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 3 is a flowchart of a video behavior recognition method according to an exemplary embodiment of the present application, where an execution subject of the present embodiment is the first electronic device shown in fig. 1. As shown in fig. 3, the method comprises the following specific steps:
step S301, video data to be identified is obtained, and the video data comprises a plurality of video frames.
The video data to be identified comprises a plurality of video frames, and the video frames are sampled from the video to be identified.
In an actual application scene, a video to be subjected to behavior recognition can be acquired, and the video contains a target object of behavior recognition. The method comprises the steps of sampling a video through a preset frame sampling rule to obtain a plurality of video frames, and then performing behavior recognition on the plurality of video frames obtained based on sampling.
The frame sampling rule may adopt the sampling rule used in any existing video behavior recognition method and can be configured and adjusted according to the needs of the actual application scenario, which is not described in detail here.
Optionally, in this step, the first electronic device may receive video data to be identified, which includes a plurality of video frames and is sent by the second electronic device, where the video data is obtained by performing frame sampling on an original video by the second electronic device.
Optionally, in this step, the first electronic device may further receive an original video sent by the second electronic device, the first electronic device performs frame sampling on the original video to obtain a plurality of video frames, and the plurality of sampled video frames constitute video data to be identified.
Step S302, inputting a plurality of video frames into a behavior recognition model, and extracting multi-layer action characteristics of the plurality of video frames through the behavior recognition model, wherein the action characteristics of different layers have different spatial scales.
In this embodiment, a trained end-to-end behavior recognition model is used to perform behavior recognition based on the plurality of input video frames, so as to obtain a behavior recognition result.
The behavior recognition model usually includes a feature extraction module for performing motion feature extraction. The feature extraction module generally comprises a plurality of feature extraction layers, wherein the plurality of feature extraction layers are stacked from top to bottom, the output of the upper layer is used as the input of the lower layer, and the output of the lowest layer is used as the action feature extracted by the feature extraction module.
The spatial scale (i.e., spatial resolution) of different feature extraction layer configurations is different, and thus the spatial scale of the extracted motion features is different.
In the step, a plurality of video frames are input into the uppermost feature extraction layer in the behavior recognition model, and the uppermost feature extraction layer performs feature extraction on each frame of video frame respectively to obtain action features corresponding to each frame of video frame. Then, the multi-frame action features are processed by a plurality of layers of feature extraction layers to obtain action features extracted by each layer, which are also called action features of each layer. In any layer, respectively extracting the characteristics of each input frame to obtain corresponding action characteristics of one frame, thereby obtaining the action characteristics of the current layer, wherein the action characteristics comprise multi-frame action characteristics, the frame number of the action characteristics is the same as that of the video frames, and the multi-frame action characteristics are in one-to-one correspondence with the video frames.
In addition, the number of layers of the feature extraction layer included in the behavior recognition model and the size of the spatial scale of different layers may be set according to an empirical value, which is not specifically limited herein.
Exemplarily, taking an example in which the feature extraction module includes 3 feature extraction layers, the 3 feature extraction layers extract action features of 3 different spatial scales, where the spatial scales may be spatial resolutions; from the upper layer to the lower layer, the 3 spatial resolutions are, in order, 56, 28, and 14. The action features of different layers have the same number of frames, equal to the number of input video frames, and within each layer the action features have the same length and width.
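For concreteness, the following is a minimal sketch (in PyTorch-style Python; not the claimed implementation) of such a stacked feature extraction module whose 3 layers produce action features at the assumed spatial resolutions 56, 28 and 14 for an assumed 112x112 input; the channel widths and the internal structure of each layer are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Three stacked feature extraction layers; each halves the spatial
    resolution (e.g. 112 -> 56 -> 28 -> 14) and is applied frame by frame."""
    def __init__(self, in_ch=3, chs=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, c, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True)))
            prev = c

    def forward(self, frames):              # frames: (T, 3, 112, 112), T sampled frames
        feats, x = [], frames
        for stage in self.stages:
            x = stage(x)                    # (T, C_l, S_l, S_l) action features of layer l
            feats.append(x)
        return feats                        # per-layer maps at scales 56, 28, 14
```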
Step S303, respectively performing correlation analysis on the multi-layer action features to obtain multi-layer correlation features, wherein the correlation features of different layers have different spatial scales; fusing the multi-layer correlation features to obtain multi-scale fused correlation features, and fusing the multi-scale fused correlation features into the extracted action features to obtain fusion features.
In this embodiment, a correlation module is added to the behavior recognition model. It performs correlation analysis on the action features extracted by the multiple feature extraction layers to obtain multi-layer correlation features, where the correlation features of different layers have different spatial scales; it fuses the multi-layer correlation features to obtain multi-scale fused correlation features; and it fuses the multi-scale fused correlation features into the action features extracted by the feature extraction module, so that the features finally output by the feature extraction module incorporate the multi-scale fused correlation features and contain finer action features, such as the movement information of small objects and of fast motions.
The correlation module is attached to a plurality of feature extraction layers; in each such layer it performs correlation analysis on the multi-frame action features extracted by the current layer along the time dimension to obtain the correlation features of the current layer, which capture the movement information of the target object well.
The correlation features of any layer have the same spatial scale as that layer's action features, and the correlation features of different layers have different spatial scales. By fusing the multi-layer correlation features, correlation features of several different spatial scales are combined, which better captures finer action features such as the movement information of small objects and of fast motions. The multi-scale fused correlation features obtained in this way are then fused into the extracted action features, so that the output features of the feature extraction module contain these finer action features; performing behavior classification and recognition based on the resulting fusion features improves the robustness and effect of the behavior recognition model.
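To illustrate what such a per-layer correlation analysis along the time dimension can look like, the sketch below computes a local visual correlation between each frame's features and the next frame's features; the window size, the frame pairing, and the normalization are assumptions rather than the claimed operator. The result keeps the spatial scale of the layer, as described above, and has one channel per position in the matching window.

```python
import torch
import torch.nn.functional as F

def layer_correlation(feat, window=7):
    """feat: (T, C, H, W) action features of one layer.
    Returns (T, window*window, H, W) correlation features at the same scale."""
    T, C, H, W = feat.shape
    pad = window // 2
    nxt = torch.cat([feat[1:], feat[-1:]], dim=0)             # pair frame t with frame t+1
    patches = F.unfold(nxt, kernel_size=window, padding=pad)  # (T, C*window*window, H*W)
    patches = patches.view(T, C, window * window, H, W)
    # dot-product similarity between each position at time t and its
    # (window x window) neighbourhood at time t+1
    corr = (feat.unsqueeze(2) * patches).sum(dim=1)
    return corr / (C ** 0.5)
```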
Optionally, when performing the correlation analysis, several layers may be selected from the feature extraction layers of the behavior recognition model for correlation analysis, so as to obtain the correlation features of the corresponding layers. In addition, in order to better capture finer action features, such as the movement information of small objects and of fast motions, the span of spatial scales covered by the selected feature extraction layers may be appropriately enlarged, so that the span of spatial scales of the obtained multi-layer correlation features is larger.
Optionally, when performing the correlation analysis, the correlation analysis may be performed on every feature extraction layer in the behavior recognition model to obtain the correlation features of each layer.
Optionally, when the multi-scale fused correlation features are merged into the action features extracted by the feature extraction module, they may be merged into the action features extracted by one or more feature extraction layers, and the multi-scale fused correlation features merged into different layers may be the same or different.
Optionally, the multi-scale fused correlation features may be fused into the action features extracted by the feature extraction module by adding the two, which yields the fusion features.
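A minimal sketch of this addition-based fusion; the 1x1 convolution used to match the channel counts of the multi-scale fused correlation features and the action features is an assumption, since the embodiment only specifies that the two are added.

```python
import torch.nn as nn

class AddFusion(nn.Module):
    """Fuse multi-scale fused correlation features into action features by addition."""
    def __init__(self, corr_ch, act_ch):
        super().__init__()
        self.proj = nn.Conv2d(corr_ch, act_ch, kernel_size=1)  # assumed channel matching

    def forward(self, corr, act):       # corr, act: (T, *, H, W) at the same spatial scale
        return act + self.proj(corr)    # fusion features
```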
Step S304, performing behavior classification and recognition according to the fusion features to obtain a behavior recognition result of the target in the video data.
In this embodiment, in each of the plurality of feature extraction layers, correlation analysis is performed along the time dimension on the multi-frame action features extracted by the current layer to obtain the correlation features of the current layer, so that the movement information of the target object can be captured well. The correlation features of any layer have the same spatial scale as that layer's action features, and the correlation features of different layers have different spatial scales. By fusing the multi-layer correlation features, correlation features of several different spatial scales are combined, which better captures the movement information of small objects and of fast actions. The multi-scale fused correlation features obtained in this way are then fused into the extracted action features, so that the fusion features output by the feature extraction module contain finer action features, such as the movement information of small objects and of fast actions; performing behavior classification and recognition based on the fusion features improves the robustness and effect of the behavior recognition model.
In an optional embodiment, the step S303 of fusing the multi-layer correlation features to obtain the multi-scale fused correlation feature may specifically be implemented by the following method:
taking the correlation characteristic with the minimum spatial dimension in the multilayer correlation characteristics to be fused as a target characteristic; performing down-sampling on the correlation features to be fused except the target features, wherein the down-sampled features have the same spatial scale as the target features; and fusing the down-sampled features and the target features to obtain multi-scale fused correlation features.
In this embodiment, since the spatial scales of the correlation features of different layers are different, in order to fuse the correlation features of different layers, the correlation feature of a larger spatial scale may be downsampled according to the minimum spatial scale of the multi-layer correlation features, so that the spatial scale of the downsampled feature is equal to the minimum spatial scale, and then the downsampled feature and the minimum spatial scale correlation feature are fused to obtain the multi-scale fused correlation feature.
Optionally, fusing the down-sampled features and the target features may be implemented by splicing them sequentially in the order of the corresponding video frames, or by summing the down-sampled features with the target features.
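The sketch below illustrates this fusion scheme under stated assumptions: average pooling is used for the down-sampling, and "splicing" is realized as a per-frame concatenation along the channel axis; neither choice is fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def fuse_to_smallest_scale(corr_feats, mode="splice"):
    """corr_feats: list of per-layer correlation features, each (T, C, H_l, W_l)
    with the same frame count T. The smallest-scale layer is the target; the
    other layers are down-sampled to that scale and then spliced or summed."""
    target = min(corr_feats, key=lambda f: f.shape[-2] * f.shape[-1])
    h, w = target.shape[-2:]
    resized = [f if f.shape[-2:] == (h, w) else F.adaptive_avg_pool2d(f, (h, w))
               for f in corr_feats]
    if mode == "splice":
        return torch.cat(resized, dim=1)           # frame-wise concatenation (assumed axis)
    return torch.stack(resized, dim=0).sum(dim=0)  # summing assumes equal channel counts
```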
In another optional embodiment, the step S303 of fusing the multi-layer correlation features to obtain a multi-scale fused correlation feature may further be implemented as follows:
configuring a preset spatial scale, where the preset spatial scale is smaller than or equal to the minimum spatial scale of the multi-layer correlation features to be fused; down-sampling the multi-layer correlation features to be fused to the preset spatial scale, and fusing the down-sampled features to obtain the multi-scale fused correlation features.
Optionally, fusing the down-sampled features may be implemented by splicing them sequentially in the order of the corresponding video frames, or by summing the down-sampled features.
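A corresponding sketch of this alternative, under the same down-sampling assumption: every layer's correlation features are brought to a configured preset scale that is no larger than the smallest layer scale, and then spliced or summed.

```python
import torch
import torch.nn.functional as F

def fuse_to_preset_scale(corr_feats, preset_hw=(7, 7), mode="sum"):
    """Down-sample all per-layer correlation features (each (T, C, H_l, W_l))
    to a configured preset spatial scale and fuse them. Illustrative only."""
    resized = [F.adaptive_avg_pool2d(f, preset_hw) for f in corr_feats]
    if mode == "splice":
        return torch.cat(resized, dim=1)           # frame-wise concatenation (assumed axis)
    return torch.stack(resized, dim=0).sum(dim=0)  # equal channel counts assumed
```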
In addition, in any of the above embodiments, the correlation features of each layer contain multiple frames of features corresponding one to one to the multiple video frames. Fusing the multi-layer correlation features means fusing, across the correlation features of different layers, the features that correspond to the same video frame; the resulting multi-scale fused correlation features still contain multiple frames of features corresponding one to one to the video frames.
In an optional embodiment, in the step S303, when the multi-scale fused correlation features are fused into the extracted action features, the multi-scale fused correlation features and the output features of the lowest layer may be fused to obtain fused features, where the fused features are output features of the feature extraction module and are used for performing subsequent behavior classification and identification.
Exemplarily, fig. 4 is an exemplary diagram of an overall framework of video behavior recognition provided by an embodiment of the present application. As shown in fig. 4, taking as an example a feature extraction module of the behavior recognition model that includes 3 feature extraction layers of different spatial scales, action features of 3 different spatial scales are extracted (represented by graphics of different sizes), and correlation analysis is performed on the action features extracted by each feature extraction layer to obtain correlation features of 3 different spatial scales. According to the spatial scale of the correlation features of the lowest layer, the correlation features of the upper 2 layers are down-sampled so that the down-sampled features have the same spatial scale as the correlation features of the lowest layer; the down-sampled features are fused with the correlation features of the lowest layer to obtain the multi-scale fused correlation features, which form a dense correlation feature field through which finer action features, such as the movement of small objects and fast motions, can be captured. Further, the multi-scale fused correlation features are fused with the output features of the lowest layer to obtain the fusion features. The fusion features are input into the behavior classification and recognition module, which outputs the behavior recognition result; this improves the robustness of the model in practical application scenarios and the effect and performance of video behavior recognition.
It should be noted that, in fig. 4, taking the number of frames of the input video frame as 5 as an example, each layer of feature extraction layer extracts 5 frame action features respectively, and performs correlation analysis with other action features for the 5 frame action features respectively to obtain 5 frame correlation features corresponding to the 5 frame action features respectively, where the 5 frame correlation features respectively correspond to the 5 frame video frames one to one.
In fig. 4, the relationship between the features (motion features and correlation features) in different spatial scales is represented by only graphs of different sizes, but the size ratio of the graph representing the features does not represent the size ratio of the spatial scales, and the size of the graph does not represent a specific numerical value of the spatial scales.
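Putting the pieces together, the sketch below shows how a forward pass corresponding to the fig. 4 arrangement could be assembled from the components sketched earlier (FeatureExtractor, layer_correlation, fuse_to_smallest_scale); the 1x1 projection, the pooling over frames and space, and the linear classifier head are assumptions added to make the example runnable, not parts of the claimed model.

```python
import torch
import torch.nn as nn

class BehaviorRecognizer(nn.Module):
    def __init__(self, num_classes, act_ch=256, corr_ch=3 * 49):
        super().__init__()
        self.backbone = FeatureExtractor()                     # 3 layers at scales 56/28/14
        self.proj = nn.Conv2d(corr_ch, act_ch, kernel_size=1)  # assumed channel matching
        self.head = nn.Linear(act_ch, num_classes)             # assumed classifier head

    def forward(self, frames):                        # frames: (T, 3, 112, 112)
        feats = self.backbone(frames)                 # per-layer action features
        corrs = [layer_correlation(f) for f in feats] # per-layer correlation features
        dense = fuse_to_smallest_scale(corrs)         # multi-scale fused correlation features
        fused = feats[-1] + self.proj(dense)          # fuse into the lowest-layer output
        pooled = fused.mean(dim=(0, 2, 3))            # pool over frames and space
        return self.head(pooled)                      # behavior class scores
```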
Based on this embodiment, in other optional embodiments, as an alternative to fusing the multi-scale fused correlation features with the output features of the lowest layer, the multi-scale fused correlation features may be fused with the motion features output by the lowest layer of the multiple layers that have been subjected to the correlation analysis processing, so that the multi-scale fused correlation features may also be fused with the motion features extracted by the feature extraction module.
In this embodiment, a correlation module is added on top of the network of the behavior recognition model. The correlation module performs correlation analysis based on the action features extracted by the plurality of feature extraction layers to obtain correlation features, and fuses the multi-layer correlation features, so that correlation features of several different spatial scales are combined and fused with the action features output by the feature extraction module. As a result, the deep-layer action features also contain better spatial and temporal information, and the behavior recognition model can capture finer action features, such as the movement information of small objects and of fast motions, which improves the robustness of the behavior recognition model in practical application scenarios.
In another optional embodiment, in step S303, in a layer subjected to the correlation analysis processing, the correlation features of the current layer may be fused with its action features to obtain the output features of the current layer, and the output features of the current layer are used as the input of the next layer. In addition, the multi-layer correlation features are fused to obtain multi-scale fused correlation features, the multi-scale fused correlation features are fused with the output features of the lowest layer to obtain the fusion features, and the fusion features are the output features of the feature extraction module, used for the subsequent behavior classification and recognition.
Optionally, in each layer subjected to the correlation analysis processing, the correlation characteristic of the current layer and the action characteristic are fused to obtain an output characteristic of the current layer, and the output characteristic of the current layer is used as an input of a next layer.
Alternatively, one or more layers may be selected from the layers subjected to the correlation analysis processing, the correlation characteristics of each layer may be fused with the motion characteristics in the selected layer, and the output characteristics of each layer may be used as the input of the next layer.
For example, if the lowest layer in the feature extraction module is subjected to the correlation analysis processing, the correlation features of each layer may be fused with the motion features to obtain the output features of each layer, and the output features of each layer may be used as the input of the next layer.
Exemplarily, fig. 5 is an exemplary diagram of an overall framework of video behavior recognition provided by an embodiment of the present application. As shown in fig. 5, taking as an example a feature extraction module of the behavior recognition model that includes 3 feature extraction layers of different spatial scales, action features of 3 different spatial scales are extracted (represented by graphics of different sizes), and correlation analysis is performed on the action features extracted by each feature extraction layer to obtain correlation features of 3 different spatial scales. In the upper 2 layers, the correlation features of the current layer are fused with the action features extracted by the current layer, and the fused features are used as the input of the next layer, so that the correlation features are merged into the action features extracted by each layer and the fused features of the upper 2 layers contain better movement information. According to the spatial scale of the correlation features of the lowest layer, the correlation features of the upper 2 layers are down-sampled so that the down-sampled features have the same spatial scale as the correlation features of the lowest layer; the down-sampled features are fused with the correlation features of the lowest layer to obtain the multi-scale fused correlation features, which form a dense correlation feature field through which finer action features, such as the movement of small objects and fast motions, can be captured. Further, the multi-scale fused correlation features are fused with the output features of the lowest layer to obtain the fusion features. The fusion features are input into the behavior classification and recognition module, which outputs the behavior recognition result; this improves the robustness of the model in practical application scenarios and the effect and performance of video behavior recognition.
It should be noted that, in fig. 5, for example, the number of frames of the input video frame is 5, each layer of feature extraction layer respectively extracts 5 frame action features, and performs correlation analysis on the 5 frame action features and other action features respectively to obtain 5 frame correlation features corresponding to the 5 frame action features respectively, where the 5 frame correlation features respectively correspond to the 5 frame video frames one to one.
In fig. 5, the relationship between the features (motion features and correlation features) in different spatial scales is represented by only graphs of different sizes, but the size ratio of the graph representing the features does not represent the size ratio of the spatial scales, and the size of the graph does not represent a specific numerical value of the spatial scales.
In this embodiment, a correlation module is added to the network of the behavior recognition model. The correlation module performs correlation analysis on the action features extracted by the multiple feature extraction layers to obtain correlation features; in one or more of the layers subjected to the correlation analysis, the correlation features of the current layer are fused with the action features to obtain the output features of the current layer, which are used as the input of the next layer, so that the action features of each layer contain rich movement information. Moreover, by fusing correlation features of multiple layers, correlation features of different spatial scales are fused together and further fused with the action features output by the feature extraction module, so that the deep action features also contain better spatial and temporal information. The behavior recognition model can thus capture finer action features, such as the movement and rapid movement of small objects, which improves the robustness of the behavior recognition model in actual application scenarios.
In another optional embodiment, in the layers subjected to the correlation analysis processing, the correlation features of the current layer may be fused with the correlation features of at least one upper layer to obtain the multi-scale fused correlation features of the current layer, and the multi-scale fused correlation features of the current layer are then fused with the action features to obtain the output features of the current layer. By adding correlation analysis processing to each feature extraction layer, the correlation features of the action features are obtained, so that the movement information of the target in different video frames can be captured. Moreover, because correlation features of different spatial scales from the upper layers are fused at each layer, pyramid-shaped multi-scale fused correlation features can be constructed and fused into the extracted action features to obtain a dense correlation feature field, so that the deep features also contain better spatial and temporal information. The behavior recognition model can therefore capture finer action features, such as the movement and rapid movement of small objects, which improves the robustness of the behavior recognition model in actual application scenarios.
If a layer subjected to the correlation analysis processing has no upper-layer correlation features, the correlation features of the current layer are fused with the action features of the current layer to obtain the output features of the current layer.
For example, the uppermost feature extraction layer in the feature extraction module has no feature extraction layer above it, so no upper-layer correlation features exist; its correlation features are therefore fused with its own action features to obtain its output features.
Optionally, in each layer subjected to the correlation analysis processing, the correlation feature of the current layer and the correlation feature of at least one upper layer are fused to obtain a multi-scale fused correlation feature of the current layer, and the multi-scale fused correlation feature of the current layer and the action feature are fused to obtain an output feature of the current layer, and the output feature of the current layer is used as an input of a next layer.
Optionally, one or more layers may be selected from the layers subjected to the correlation analysis processing, the correlation feature of the current layer is fused with the correlation feature of at least one upper layer in the selected layer to obtain a multi-scale fused correlation feature of the current layer, the multi-scale fused correlation feature of the current layer is fused with the action feature to obtain an output feature of the current layer, and the output feature of the current layer is used as an input of a next layer.
In a preferred embodiment, in each layer subjected to the correlation analysis processing, the correlation features of the current layer are fused with the correlation features of all upper layers to obtain the multi-scale fused correlation features of the current layer, and the multi-scale fused correlation features of the current layer are fused with the action features to obtain the output features of the current layer. In this way, pyramid-shaped multi-scale fused correlation features are constructed at every layer and fused into the extracted action features to obtain a dense correlation feature field, so that the deep features also contain better spatial and temporal information, and the behavior recognition model can capture finer action features, such as the movement and rapid movement of small objects, which improves the robustness of the behavior recognition model in actual application scenarios.
Exemplarily, fig. 6 is a diagram of an overall framework for video behavior recognition provided by an embodiment of the present application. As shown in fig. 6, the feature extraction module of the behavior recognition model is again taken to include 3 feature extraction layers with different spatial scales: action features of 3 different spatial scales (represented by graphics of different sizes) are extracted, and correlation analysis is performed on the action features extracted by each feature extraction layer to obtain correlation features of 3 different spatial scales. At layer 1 there is no upper feature extraction layer, so the correlation features of the current layer are fused with its action features to obtain the output features of layer 1, which are used as the input of layer 2. At layer 2, the correlation features of layer 1 are fused with the correlation features of the current layer (layer 2) to obtain the multi-scale fused correlation features of layer 2, which are fused with the action features to obtain the output features of layer 2, used as the input of layer 3. At layer 3, the correlation features of layer 1, the correlation features of layer 2 and the correlation features of the current layer (layer 3) are fused to obtain the multi-scale fused correlation features of layer 3, which are fused with the action features to obtain the output features of layer 3, namely the fusion features output by the feature extraction module.
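Again purely as a sketch of the per-layer fusion rule shown in fig. 6 (not the disclosed implementation), one layer's fusion could look as follows; the pooling and concatenation choices are assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_layer(action_feat, corr_feat, upper_corrs):
    """Per-layer fusion as in fig. 6 (illustrative sketch).

    action_feat: action features of the current layer, [T, C, H, W].
    corr_feat:   correlation features of the current layer, [T, Cc, H, W].
    upper_corrs: correlation features of all upper layers (possibly empty),
                 each with a larger spatial size than the current layer.
    """
    h, w = corr_feat.shape[-2:]
    pooled = [F.adaptive_avg_pool2d(c, (h, w)) for c in upper_corrs]
    multi_scale = torch.cat(pooled + [corr_feat], dim=1)   # multi-scale fused correlation
    return torch.cat([action_feat, multi_scale], dim=1)    # output features of this layer

# In fig. 6: layer 1 passes an empty upper_corrs list, layer 2 passes layer 1's
# correlation features, and layer 3 passes the correlation features of layers 1 and 2.
```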
It should be noted that fig. 6 likewise takes 5 input video frames as an example: each feature extraction layer extracts 5 frames of action features, and correlation analysis is performed between each frame of action features and the other action features, yielding 5 frames of correlation features that correspond one-to-one to the 5 video frames.
In fig. 6, features of different spatial scales (action features and correlation features) are distinguished only by graphics of different sizes; the size ratio of the graphics does not represent the ratio of the spatial scales, nor does the size of a graphic indicate a specific spatial-scale value.
In this embodiment, correlation analysis processing is added to each feature extraction layer to obtain the correlation features of the action features, so that the movement information of the target object in the video data can be captured. Moreover, because correlation features of different spatial scales from the upper layers are fused at each layer, pyramid-shaped multi-scale fused correlation features can be constructed and fused into the extracted action features to obtain a dense correlation feature field, so that the deep features also contain better spatial and temporal information. The behavior recognition model can therefore capture finer action features, such as the movement and rapid movement of small objects, which improves the robustness of the behavior recognition model in actual application scenarios.
On the basis of any of the above embodiments, the correlation analysis is performed on the action features of any layer to obtain the correlation features of the current layer, which may specifically be implemented in the following manner:
the motion features of any layer include a frame motion feature corresponding to each video frame. And extracting each layer to obtain multi-frame action characteristics, wherein the frame number of the action characteristics is equal to that of the input video frame. When the correlation analysis is carried out, the action characteristics of each frame are respectively used as a target frame, and the correlation between the target frame and at least one adjacent frame is analyzed.
Specifically, for any one target frame in the action characteristics of any one layer, at least one adjacent frame of the target frame is determined in the action characteristics of the current layer; performing correlation calculation on the target frame and each adjacent frame to obtain correlation characteristics between the target frame and each adjacent frame; and fusing the correlation characteristics between the target frame and at least one adjacent frame to obtain the correlation characteristics corresponding to the target frame of the current layer.
By performing correlation analysis between each frame of action features and the action features of at least one other adjacent frame, the movement information of the target object in the time dimension can be captured. By performing correlation analysis on the action features extracted by multiple feature extraction layers, correlation features of different spatial scales can be extracted, so that the movement information of small objects can be captured at enlarged spatial scales and fused into the extracted action features. The behavior recognition model can thus capture finer action features, such as the movement and rapid movement of small objects, which improves the robustness of the behavior recognition model in actual application scenarios.
Optionally, when determining an adjacent frame for correlation analysis with the target frame in the action features of any layer, the action features of the frame following the target frame (according to the time-sequence information of the corresponding video frames) may be taken as the adjacent frame, and correlation analysis is performed between the target frame and that adjacent frame to obtain the correlation features of the target frame.
Optionally, when determining at least one adjacent frame of the target frame in the action features of any layer, a preset number of video frames adjacent to the video frame corresponding to the target frame may be determined according to a configured preset number, where the preset number is greater than 1; the action features corresponding to those adjacent video frames are then taken as the adjacent frames of the target frame.
By performing correlation analysis between the target frame and a preset number of adjacent frames, long-term correlation features can be obtained and movement information over a longer time period can be captured, so that rapid movement information can be captured and the behavior of the target object can be better characterized. The behavior recognition model can thus capture finer action features, such as the movement and rapid movement of small objects, which improves the robustness of the behavior recognition model in actual application scenarios.
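A minimal sketch of one possible neighbour-selection policy is given below; the patent only requires a configured preset number greater than 1, so the forward-looking, clamped selection used here is an assumption.

```python
def adjacent_frame_indices(target_idx, num_frames, preset_number=2):
    """Pick a preset number of adjacent frames for a target frame (sketch).

    Neighbours are taken from the frames following the target frame and clamped
    at the last frame; other selection policies are equally compatible with the text.
    """
    return [min(target_idx + offset, num_frames - 1)
            for offset in range(1, preset_number + 1)]

# Example with 5 sampled video frames: the target frame with index 1
# is compared against frames 2 and 3.
print(adjacent_frame_indices(1, 5, preset_number=2))   # [2, 3]
```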
Further, performing correlation calculation on the target frame and each adjacent frame to obtain correlation characteristics between the target frame and each adjacent frame, which may specifically be implemented by the following method:
determining a plurality of first feature blocks in a target frame according to the configured feature block sizes, and determining second feature blocks matched with the first feature blocks in adjacent frames; and calculating the similarity of each first feature block and the matched second feature block, and determining the correlation feature between the target frame and each adjacent frame according to the similarity of each first feature block and the matched second feature block.
The size of the feature block may be set according to an actual application scenario and an empirical value, and is not specifically limited herein.
Specifically, when determining the plurality of first feature blocks in the target frame, for each pixel point, a region centered on that pixel point and having the configured feature block size is taken as a feature block, so that each pixel point determines exactly one feature block.
When determining the second feature block that matches a first feature block in an adjacent frame, the matching pixel point in the adjacent frame corresponding to the center pixel point of the first feature block is determined, and a region in the adjacent frame centered on that matching pixel point, with the configured feature block size, is taken as the second feature block matching the first feature block.
Additionally, in other embodiments, the first feature block and the matching second feature block may have different sizes. In that case, when determining the second feature block matching a first feature block, a region of a configured specified size (different from the feature block size described above) centered on the matching pixel point in the adjacent frame is taken as the matching second feature block.
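The block extraction described above can be sketched with a sliding-window unfold; the block size, the zero padding at the borders, and the assumption that matching blocks share the same centre position are illustrative choices, not requirements of the patent.

```python
import torch
import torch.nn.functional as F

def extract_blocks(frame_feat, block_size=3):
    """Take one feature block centred on every pixel point (illustrative sketch).

    frame_feat: action features of one frame, shape [C, H, W].
    Returns [H*W, C*block_size*block_size], one flattened block per pixel point;
    zero padding keeps exactly one block per pixel at the borders.
    """
    pad = block_size // 2
    blocks = F.unfold(frame_feat.unsqueeze(0), kernel_size=block_size, padding=pad)
    return blocks.squeeze(0).transpose(0, 1)

# First feature blocks come from the target frame and second feature blocks from an
# adjacent frame; here matching blocks are assumed to share the same centre pixel.
```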
Alternatively, when calculating the similarity between each first feature block and the matching second feature block, the inner product of each first feature block and the matching second feature block may be calculated as the similarity.
Optionally, when calculating the similarity between each first feature block and the matching second feature block, the cosine similarity between each first feature block and the matching second feature block is calculated.
Further, the correlation features between the target frame and each adjacent frame can be determined from the similarity of each first feature block and its matching second feature block: the value at each pixel point in the correlation features between the target frame and an adjacent frame is the similarity of the first feature block corresponding to that pixel point and its matching second feature block; alternatively, the similarities of the first feature blocks and their matching second feature blocks are first normalized, and the value at each pixel point is the normalized similarity of the corresponding pair of feature blocks.
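Combining the two similarity options and the optional normalization, a per-pixel correlation map could be sketched as follows; the standard-score normalization is an assumption, since the patent does not fix a normalization method.

```python
import torch
import torch.nn.functional as F

def correlation_map(target_blocks, neighbour_blocks, h, w,
                    use_cosine=False, normalize=False):
    """Per-pixel similarity of matching feature blocks (illustrative sketch).

    target_blocks, neighbour_blocks: [H*W, D] tensors from extract_blocks above.
    Returns a single-channel correlation feature of shape [1, H, W].
    """
    if use_cosine:
        sim = F.cosine_similarity(target_blocks, neighbour_blocks, dim=1)
    else:
        sim = (target_blocks * neighbour_blocks).sum(dim=1)    # inner product
    if normalize:
        sim = (sim - sim.mean()) / (sim.std() + 1e-6)          # assumed normalization
    return sim.view(1, h, w)
```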
Further, when fusing the correlation features between the target frame and a plurality of adjacent frames, the correlation features between the target frame and the plurality of adjacent frames may be concatenated to obtain the correlation features corresponding to the target frame; alternatively, the correlation features between the target frame and the plurality of adjacent frames may be summed to obtain the correlation features corresponding to the target frame.
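Both fusion options reduce to a single tensor operation, as sketched below; the channel layout is an assumption.

```python
import torch

def fuse_neighbour_correlations(corr_maps, mode="concat"):
    """Fuse the correlation features between a target frame and several adjacent
    frames (illustrative sketch of the two options described above).

    corr_maps: list of per-neighbour correlation features, each of shape [Cc, H, W].
    """
    if mode == "concat":                                  # concatenate along channels
        return torch.cat(corr_maps, dim=0)
    if mode == "sum":                                     # element-wise summation
        return torch.stack(corr_maps, dim=0).sum(dim=0)
    raise ValueError("mode must be 'concat' or 'sum'")
```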
On the basis of any of the above embodiments, before performing correlation analysis on the action features of any layer to obtain the correlation features of the current layer, the number of channels of the action features can be reduced by performing a first convolution operation on the action features extracted from the current layer, so that the calculation amount of the correlation analysis can be reduced, and the efficiency of video behavior identification can be improved.
On the basis of any of the above embodiments, before the correlation features and the action features are fused, the correlation features may be first subjected to a second convolution operation to convert the correlation features into action information, and then the action information and the action features are fused, so that a better fusion effect can be achieved.
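The two convolution operations can be sketched as a small module around the correlation analysis; the 1x1 kernels, the channel counts and the final fusion by element-wise addition are assumptions, since the patent only states that the convolutions exist.

```python
import torch
import torch.nn as nn

class CorrelationBranch(nn.Module):
    """Illustrative sketch of the first and second convolution operations."""

    def __init__(self, in_channels, reduced_channels, corr_channels, corr_fn):
        super().__init__()
        self.corr_fn = corr_fn                                     # correlation analysis
        self.reduce = nn.Conv2d(in_channels, reduced_channels, 1)  # first convolution
        self.to_motion = nn.Conv2d(corr_channels, in_channels, 1)  # second convolution

    def forward(self, action_feat):
        reduced = self.reduce(action_feat)   # fewer channels -> cheaper correlation analysis
        corr = self.corr_fn(reduced)         # correlation features of this layer
        motion = self.to_motion(corr)        # convert correlation features to motion information
        return action_feat + motion          # fuse the motion information into the action features
```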
Fig. 7 is a flowchart of a video behavior recognition method according to another exemplary embodiment of the present application, where an execution subject of the present embodiment is the first electronic device shown in fig. 2. As shown in fig. 7, the method comprises the following specific steps:
step S701, a video to be identified is obtained, and the video is sampled to obtain a plurality of video frames.
The step is similar to the implementation manner of step S301, and for details, refer to relevant description of step S301, and is not described herein again.
Step S702, inputting the plurality of video frames into an end-to-end behavior recognition model, and performing behavior recognition through the behavior recognition model to obtain a behavior recognition result. The behavior recognition model is used for extracting multi-layer action features of the plurality of video frames, where the action features of different layers have different spatial scales; performing correlation analysis on the multi-layer action features respectively to obtain multi-layer correlation features, where the correlation features of different layers have different spatial scales; fusing the multi-layer correlation features to obtain multi-scale fused correlation features, and fusing the multi-scale fused correlation features into the extracted action features to obtain fusion features; and performing behavior classification recognition according to the fusion features to obtain the behavior recognition result.
The step is similar to the implementation manner of the steps S302 to S304, and reference is specifically made to relevant contents in the foregoing embodiments, which are not described herein again.
And step S703, performing post-processing according to the behavior recognition result, and outputting a post-processing result.
In practical application, different post-processing rules can be configured according to different practical application scenes, and after the behavior recognition result is obtained, post-processing is performed according to the configured post-processing rules to obtain a final output result.
The output mode of the post-processing result can be configured according to the actual application scene, and the mode of outputting the final processing result in different application scenes can be different.
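As a toy example only, the post-processing rules could be a simple mapping from recognized behavior categories to actions; the labels, thresholds and actions below are hypothetical and scenario-specific.

```python
# Hypothetical post-processing rules configured per application scenario.
POST_PROCESSING_RULES = {
    "phone_use_while_driving": {"min_score": 0.8, "action": "raise_alert"},
    "normal_driving":          {"min_score": 0.0, "action": "ignore"},
}

def post_process(recognition_result):
    """recognition_result: dict with 'label' and 'score' from the behavior recognition model."""
    rule = POST_PROCESSING_RULES.get(recognition_result["label"],
                                     {"min_score": 1.0, "action": "ignore"})
    return rule["action"] if recognition_result["score"] >= rule["min_score"] else "ignore"
```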
In this embodiment, based on the network architecture shown in fig. 2, the first electronic device is deployed with a trained behavior recognition model, and provides a service based on video behavior recognition to the outside based on the behavior recognition model, and a final result of the service is determined by performing post-processing on a behavior recognition result.
The video behavior identification method provided by the embodiment can be applied to numerous fields such as intelligent monitoring, man-machine interaction, video sequence understanding, medical health and intelligent education, and has very wide application.
The behavior recognition model used in any of the above embodiments is an end-to-end model trained based on a training set of a specific application scenario/domain. Fig. 8 is a flowchart of a behavior recognition model training method provided in an exemplary embodiment of the present application, and as shown in fig. 8, a behavior recognition model used in any of the embodiments may be obtained by training in the following manner:
step S801, a training set of the current application scene is obtained, wherein the training set comprises audio sample data and behavior categories corresponding to the audio sample data, and the audio sample data comprises a plurality of video sample frames.
Step S802, inputting a plurality of video sample frames into a behavior recognition model, and extracting multi-layer action characteristics of the plurality of video frames through the behavior recognition model, wherein the action characteristics of different layers have different spatial scales.
Step S803, performing correlation analysis on the multi-layer action features respectively to obtain multi-layer correlation features, where the correlation features of different layers have different spatial scales; fusing the multi-layer correlation features to obtain multi-scale fused correlation features, and fusing the multi-scale fused correlation features into the action features extracted by the feature extraction module to obtain fusion features.
And step S804, performing behavior classification and identification according to the fusion characteristics to obtain a behavior identification result of the target in the video sample data.
The specific implementation manner of the steps S802 to S804 is similar to the implementation manner of the steps S302 to S304, and refer to the relevant description in the above embodiments specifically, which is not described herein again.
And S805, updating parameters of the behavior recognition model according to the behavior recognition result of the target in the video sample data and the corresponding behavior category.
In this embodiment, the specific implementation manner for updating the model parameters based on the behavior recognition result and the labeled behavior category is similar to the implementation manner for updating the model parameters when the video behavior recognition model is trained based on the machine learning manner in the prior art, and is not described here again.
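A conventional supervised update is sketched below; the patent does not fix the loss function or optimizer, so cross-entropy and a generic optimizer are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, video_sample_frames, behavior_labels):
    """One parameter update of the behavior recognition model (illustrative sketch).

    video_sample_frames: [B, T, C, H, W] batch of sampled video sample frames.
    behavior_labels:     [B] annotated behavior categories.
    """
    optimizer.zero_grad()
    logits = model(video_sample_frames)                    # end-to-end behavior recognition
    loss = F.cross_entropy(logits, behavior_labels)        # compare with labeled categories
    loss.backward()
    optimizer.step()                                       # update model parameters
    return loss.item()
```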
The trained behavior recognition model may be deployed based on the network structure shown in fig. 1 or fig. 2.
By the method, the behavior recognition model with the ability of capturing more detailed motion characteristics, such as the movement information of small objects and rapid movement information, can be trained, and the robustness and effect of the behavior recognition model in an actual application scene can be improved.
Fig. 9 is a schematic diagram of a video behavior recognition system according to an exemplary embodiment of the present application. As shown in fig. 9, the video behavior recognition system 900 provided in this embodiment includes: an end-side device 901, and a cloud-side device 902 communicatively connected to the end-side device 901.
The end-side device 901 acquires video data to be recognized, the video data containing a plurality of video frames. The process is similar to the implementation manner of step S301, and the details of the process are related to step S301, and are not described herein again.
The cloud-side device 902 inputs the plurality of video frames into the behavior recognition model and extracts multi-layer action features of the video frames through the behavior recognition model, where the action features of different layers have different spatial scales; performs correlation analysis on the multi-layer action features respectively to obtain multi-layer correlation features, where the correlation features of different layers have different spatial scales; fuses the multi-layer correlation features to obtain multi-scale fused correlation features, and fuses the multi-scale fused correlation features into the extracted action features to obtain fusion features; performs behavior classification recognition according to the fusion features to obtain a behavior recognition result of the target in the video data; and issues the behavior recognition result to the end-side device 901. The specific implementation manner of this process is similar to that of steps S302-S304; refer to the relevant description in the above embodiments, which is not repeated here.
The end-side device 901 performs post-processing according to the behavior recognition result, and outputs the post-processing result.
In this embodiment, the end-side device 901 may be an edge cloud device in which various network platforms are deployed at the edge of a network, and is responsible for collecting various types of data generated by terminal devices within the coverage range of the end-side device. The end-side device 901 may be a server-side device such as a conventional server, a cloud server, or a server array. The terminal device includes, but is not limited to, a desktop computer, a notebook computer, or a smartphone, and various types of data generated by the terminal device include, but are not limited to, collected videos, such as a user-side video collected in human-computer interaction, a teacher lecture video collected in intelligent education, and the like. Network platforms include, but are not limited to, e-commerce platforms, short video platforms, news information platforms, educational training platforms, and the like.
Illustratively, fig. 10 is a flowchart of a method for video behavior recognition according to an exemplary embodiment of the present application. In this embodiment, the end-side device may be a road-side or vehicle-mounted shooting device for capturing a video of a driver during driving of the vehicle. As shown in fig. 10, the method comprises the following specific steps:
s1, the end-side equipment collects video data of a driver in the driving process of the vehicle.
Optionally, the collected video data of the driver may be sampled to obtain a plurality of video frames, and the plurality of video frames are used as the video data to be identified to perform the processing of the subsequent steps.
And S2, the end-side equipment sends the video data of the driver to the cloud-side equipment.
S3, the cloud-side device inputs the plurality of video frames into the behavior recognition model and extracts multi-layer action features of the video frames through the behavior recognition model, where the action features of different layers have different spatial scales; performs correlation analysis on the multi-layer action features respectively to obtain multi-layer correlation features, where the correlation features of different layers have different spatial scales; fuses the multi-layer correlation features to obtain multi-scale fused correlation features, and fuses the multi-scale fused correlation features into the extracted action features to obtain fusion features; and performs behavior classification recognition according to the fusion features to obtain a behavior recognition result of the target in the video data.
The specific implementation manner of this step is similar to the implementation manner of steps S302-S304, which specifically refers to the relevant description in the foregoing embodiments, and this embodiment is not described herein again.
And S4, the cloud side equipment sends a behavior recognition result to the end side equipment.
And S5, the end-side equipment outputs warning information to the driver through an output device when the driver is determined to have the preset unsafe driving behavior according to the behavior recognition result.
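A minimal sketch of the end-side warning logic in step S5 is shown below; the unsafe-behavior list and the output-device interface are hypothetical.

```python
# Hypothetical set of preset unsafe driving behaviors.
UNSAFE_BEHAVIORS = {"phone_use_while_driving", "eyes_off_road", "hands_off_wheel"}

def handle_recognition_result(behavior_label, output_device):
    """behavior_label: behavior recognition result returned by the cloud-side device."""
    if behavior_label in UNSAFE_BEHAVIORS:
        output_device.warn("Unsafe driving behavior detected: " + behavior_label)
```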
In this embodiment, a flow of implementing video behavior recognition by linking end-side equipment and cloud-side equipment is described, taking a driver behavior detection scene applied in the field of intelligent transportation as an example.
Fig. 11 is a schematic structural diagram of a video behavior recognition apparatus according to an example embodiment of the present application. The apparatus provided in this embodiment is applied to perform the video behavior recognition method shown in fig. 3. As shown in fig. 11, the video behavior recognition apparatus 110 includes: a video data acquisition module 111, a feature extraction module 112, a correlation module 113 and a classification identification module 114.
The video data obtaining module 111 is configured to obtain video data to be identified, where the video data includes a plurality of video frames.
The feature extraction module 112 is configured to input the multiple video frames into the behavior recognition model, and extract multi-layer motion features of the multiple video frames through the behavior recognition model, where the motion features of different layers have different spatial scales.
The correlation module 113 is configured to perform correlation analysis on the multiple layers of action features to obtain multiple layers of correlation features, where the spatial scales of the correlation features of different layers are different, fuse the multiple layers of correlation features to obtain multiple-scale fusion correlation features, and fuse the multiple-scale fusion correlation features into the extracted action features to obtain fusion features.
The classification recognition module 114 is configured to perform behavior classification recognition according to the fusion features, so as to obtain a behavior recognition result of the target in the video data.
In an optional embodiment, the behavior recognition model includes a feature extraction module, the feature extraction module includes a plurality of feature extraction layers stacked in sequence from top to bottom, the output of an upper layer is used as the input of the lower layer, and the plurality of video frames are input into the uppermost feature extraction layer. When performing correlation analysis on the multi-layer action features respectively to obtain multi-layer correlation features (where the correlation features of different layers have different spatial scales), fusing the multi-layer correlation features to obtain multi-scale fused correlation features, and fusing the multi-scale fused correlation features into the extracted action features to obtain the fusion features, the correlation module 113 is further configured to:
in a multi-layer feature extraction layer, performing correlation analysis on the action features extracted from the current layer to obtain correlation features of the current layer; fusing the correlation characteristic of the current layer with the correlation characteristic of at least one upper layer to obtain the multi-scale fused correlation characteristic of the current layer; fusing the multi-scale fused correlation characteristic and the action characteristic of the current layer to obtain the output characteristic of the current layer; and the output characteristic of the lowest characteristic extraction layer is the fusion characteristic output by the characteristic extraction module.
In an optional embodiment, the behavior recognition model includes a feature extraction module, the feature extraction module includes a plurality of feature extraction layers stacked in sequence from top to bottom, the output of an upper layer is used as the input of the lower layer, and the plurality of video frames are input into the uppermost feature extraction layer. When performing correlation analysis on the multi-layer action features respectively to obtain multi-layer correlation features (where the correlation features of different layers have different spatial scales), fusing the multi-layer correlation features to obtain multi-scale fused correlation features, and fusing the multi-scale fused correlation features into the extracted action features to obtain the fusion features, the correlation module 113 is further configured to:
in a multi-layer feature extraction layer, performing correlation analysis on the action features extracted from the current layer to obtain correlation features of the current layer, and fusing the correlation features of the current layer with the action features to obtain output features of the current layer; and fusing the multi-level correlation characteristics to obtain multi-scale fusion correlation characteristics, and fusing the multi-scale fusion correlation characteristics with the output characteristics of the lowest layer to obtain fusion characteristics.
In an optional embodiment, when performing a correlation analysis on the action features of any layer to obtain a correlation feature of a current layer, the correlation module 113 is further configured to:
the action characteristics of any layer comprise a frame action characteristic corresponding to each video frame, each frame action characteristic is respectively used as a target frame, and at least one adjacent frame of the target frame is determined in the action characteristics of the current layer; performing correlation calculation on the target frame and each adjacent frame to obtain correlation characteristics between the target frame and each adjacent frame; and fusing the correlation characteristics between the target frame and at least one adjacent frame to obtain the correlation characteristics corresponding to the target frame of the current layer.
In an alternative embodiment, in implementing the determining at least one neighboring frame of the target frame in the action feature of the current layer, the correlation module 113 is further configured to:
determining a preset number of adjacent video frames of the video frames corresponding to the target frame in the action characteristics of the current layer according to the configured preset number, wherein the preset number is more than 1; and taking the action characteristics corresponding to the adjacent video frames as the adjacent frames of the target frame.
In an optional embodiment, when performing a correlation calculation on the target frame and each of the adjacent frames to obtain a correlation characteristic between the target frame and each of the adjacent frames, the correlation module 113 is further configured to:
determining a plurality of first feature blocks in a target frame according to the configured feature block sizes, and determining second feature blocks matched with the first feature blocks in adjacent frames; and calculating the similarity of each first feature block and the matched second feature block, and determining the correlation feature between the target frame and each adjacent frame according to the similarity of each first feature block and the matched second feature block.
In an alternative embodiment, in implementing the calculating the similarity between each first feature block and the matching second feature block, the correlation module 113 is further configured to:
calculating the inner product of each first feature block and the matched second feature block;
alternatively, the first and second electrodes may be,
and calculating the cosine similarity of each first characteristic block and the matched second characteristic block.
In an optional embodiment, when the multi-scale fused correlation feature is obtained by fusing the multi-layer correlation features, the correlation module 113 is further configured to:
taking the correlation characteristic with the minimum spatial scale in the multilayer correlation characteristics to be fused as a target characteristic; performing down-sampling on the correlation features to be fused except the target features, wherein the down-sampled features have the same spatial scale as the target features; and fusing the downsampled features and the target features to obtain the multi-scale fused correlation features.
In an optional embodiment, when acquiring the video data to be identified, the video data obtaining module 111 is further configured to:
the method comprises the steps of obtaining a video to be identified, sampling the video to be identified to obtain a plurality of video frames, and forming video data to be identified by the plurality of video frames.
The apparatus provided in this embodiment may be specifically configured to execute the method provided in any embodiment related to the video behavior identification method shown in fig. 1, and specific functions and technical effects that can be achieved are not described herein again.
Fig. 12 is a schematic structural diagram of a video behavior recognition apparatus according to another exemplary embodiment of the present application. The apparatus provided by the present embodiment is applied to perform the video behavior recognition method shown in fig. 7. As shown in fig. 12, the video behavior recognition device 120 includes: a video frame acquisition module 1201, a behavior recognition module 1202 and a post-processing module 1203.
The video frame acquiring module 1201 is configured to acquire a video to be identified, and sample the video to obtain a plurality of video frames.
The behavior recognition module 1202 is configured to input a plurality of video frames into an end-to-end behavior recognition model, and perform behavior recognition through the behavior recognition model to obtain a behavior recognition result, where the behavior recognition model includes a feature extraction module and a correlation module, and the feature extraction module includes a plurality of layers of feature extraction layers with different spatial scales and is configured to extract a plurality of layers of action features according to the input plurality of video frames; the correlation module is used for respectively carrying out correlation analysis on the multi-layer action characteristics to obtain multi-layer correlation characteristics, the spatial scales of the correlation characteristics of different layers are different, the multi-layer correlation characteristics are fused to obtain multi-scale fusion correlation characteristics, and the multi-scale fusion correlation characteristics are fused into the action characteristics extracted by the characteristic extraction module to obtain fusion characteristics; the behavior recognition model further comprises a classification recognition module, and the classification recognition module is used for performing behavior classification recognition according to the fusion characteristics to obtain a behavior recognition result of the target in the video data.
The post-processing module 1203 is configured to perform post-processing according to the behavior recognition result, and output a post-processing result.
The apparatus provided in this embodiment may be specifically configured to execute the video behavior recognition method based on the foregoing fig. 7, and specific functions and technical effects that can be achieved are not described herein again.
Fig. 13 is a schematic structural diagram of an electronic device according to an example embodiment of the present application. As shown in fig. 13, the electronic device 130 includes: a processor 1301, and a memory 1302 communicatively coupled to the processor 1301, the memory 1302 storing computer-executable instructions.
The processor executes the computer execution instructions stored in the memory to implement the solutions provided in any of the above method embodiments, and the specific functions and technical effects that can be implemented are not described herein again.
The embodiments of the present application further provide a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when the computer-executable instructions are executed by a processor, the computer-executable instructions are used to implement the solutions provided in any of the above method embodiments, and specific functions and technical effects that can be implemented are not described herein again.
An embodiment of the present application further provides a computer program product, where the computer program product includes: the computer program is stored in the readable storage medium, at least one processor of the electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program, so that the electronic device executes the scheme provided by any one of the above method embodiments, and specific functions and technical effects that can be achieved are not described herein again.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of order or in parallel as they appear in the present document, and only for distinguishing between the various operations, and the sequence number itself does not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor do they limit the types of "first" and "second". The meaning of "a plurality" is two or more unless specifically limited otherwise.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. A video behavior recognition method, comprising:
acquiring video data to be identified, wherein the video data comprises a plurality of video frames;
inputting the video frames into a behavior recognition model, and extracting multi-layer action characteristics of the video frames through the behavior recognition model, wherein the action characteristics of different layers have different spatial scales;
respectively carrying out correlation analysis on the multi-layer action characteristics to obtain multi-layer correlation characteristics, wherein the spatial scales of the correlation characteristics of different layers are different, fusing the multi-layer correlation characteristics to obtain multi-scale fused correlation characteristics, and fusing the multi-scale fused correlation characteristics into the extracted action characteristics to obtain fused characteristics;
and performing behavior classification and identification according to the fusion characteristics to obtain a behavior identification result of the target in the video data.
2. The method according to claim 1, wherein the behavior recognition model comprises a feature extraction module, the feature extraction module comprises a plurality of feature extraction layers stacked in sequence from top to bottom, an output of an upper layer is used as an input of a lower layer, the plurality of video frames are input into an uppermost feature extraction layer,
the step of respectively performing relevance analysis on the multi-layer action features to obtain multi-layer relevance features, wherein the multi-layer relevance features are fused to obtain multi-scale fusion relevance features when the spatial scales of the multi-layer relevance features are different, and the multi-scale fusion relevance features are fused into the extracted action features to obtain fusion features, comprises the steps of:
in a multi-layer feature extraction layer, performing correlation analysis on the action features extracted from the current layer to obtain correlation features of the current layer;
fusing the correlation characteristic of the current layer with the correlation characteristic of at least one upper layer to obtain the multi-scale fused correlation characteristic of the current layer;
fusing the multi-scale fused correlation characteristic and the action characteristic of the current layer to obtain the output characteristic of the current layer;
and the output characteristic of the lowest characteristic extraction layer is the fusion characteristic output by the characteristic extraction module.
3. The method according to claim 1, wherein the behavior recognition model includes a feature extraction module, the feature extraction module includes a plurality of feature extraction layers stacked in sequence from top to bottom, an output of an upper layer is used as an input of a lower layer, the plurality of video frames are input into an uppermost feature extraction layer,
the step of respectively performing relevance analysis on the multi-layer action features to obtain multi-layer relevance features, wherein the multi-layer relevance features are fused to obtain multi-scale fusion relevance features when the spatial scales of the multi-layer relevance features are different, and the multi-scale fusion relevance features are fused into the extracted action features to obtain fusion features, comprises the steps of:
in a multi-layer feature extraction layer, performing relevance analysis on the action features extracted from the current layer to obtain the relevance features of the current layer, and fusing the relevance features of the current layer with the action features to obtain the output features of the current layer;
and fusing the multi-layer correlation characteristics to obtain multi-scale fusion correlation characteristics, and fusing the multi-scale fusion correlation characteristics and the output characteristics of the lowest layer to obtain fusion characteristics.
4. The method according to any one of claims 1-3, wherein performing a correlation analysis on the action features of any layer to obtain the correlation features of the current layer comprises:
the action characteristics of any layer comprise a frame action characteristic corresponding to each video frame, each frame action characteristic is respectively used as a target frame, and at least one adjacent frame of the target frame is determined in the action characteristics of the current layer;
performing correlation calculation on the target frame and each adjacent frame to obtain correlation characteristics between the target frame and each adjacent frame;
and fusing the correlation characteristics between the target frame and the at least one adjacent frame to obtain the correlation characteristics corresponding to the target frame of the current layer.
5. The method of claim 4, wherein determining at least one neighboring frame of the target frame among the action features of the current layer comprises:
determining a preset number of adjacent video frames of the video frames corresponding to the target frame in the action characteristics of the current layer according to the configured preset number, wherein the preset number is greater than 1;
and taking the action characteristics corresponding to the adjacent video frames as the adjacent frames of the target frame.
6. The method according to claim 4, wherein said performing a correlation calculation on the target frame and each of the neighboring frames to obtain a correlation characteristic between the target frame and each of the neighboring frames comprises:
determining a plurality of first feature blocks in the target frame according to the configured feature block sizes, and determining second feature blocks matched with the first feature blocks in the adjacent frames;
and calculating the similarity of each first feature block and the matched second feature block, and determining the correlation feature between the target frame and each adjacent frame according to the similarity of each first feature block and the matched second feature block.
7. The method of claim 1, wherein the fusing the multi-layered relevance features results in a multi-scale fused relevance feature, comprising:
taking the correlation characteristic with the minimum spatial scale in the multilayer correlation characteristics to be fused as a target characteristic;
performing down-sampling on the correlation features to be fused except the target features, wherein the down-sampled features and the target features have the same spatial scale;
and fusing the downsampled features with the target features to obtain multi-scale fused correlation features.
8. The method according to claim 1, wherein the obtaining the video data to be identified comprises:
the method comprises the steps of obtaining a video to be identified, and sampling from the video to be identified to obtain a plurality of video frames, wherein the plurality of video frames form video data to be identified.
9. A video behavior recognition system, comprising:
the terminal side equipment is used for acquiring video data to be identified, and the video data comprises a plurality of video frames;
the cloud-side equipment is used for inputting the video frames into a behavior recognition model, extracting multi-layer action characteristics of the video frames through the behavior recognition model, wherein the action characteristics of different layers have different spatial scales; respectively carrying out correlation analysis on the multi-layer action characteristics to obtain multi-layer correlation characteristics, wherein the spatial scales of the correlation characteristics of different layers are different, fusing the multi-layer correlation characteristics to obtain multi-scale fusion correlation characteristics, and fusing the multi-scale fusion correlation characteristics into the extracted action characteristics to obtain fusion characteristics; performing behavior classification recognition according to the fusion characteristics to obtain a behavior recognition result of the target in the video data;
the cloud side device is further used for sending the behavior recognition result to the end side device;
and the end-side equipment is also used for carrying out post-processing according to the behavior recognition result and outputting a post-processing result.
10. The system of claim 9, wherein the peer-to-peer device obtains the video to be identified, comprising:
collecting video data of a driver in the driving process of a vehicle;
the end-side device performs post-processing according to the behavior recognition result and outputs a post-processing result, including:
and according to the behavior recognition result, when the driver is determined to have the preset unsafe driving behavior, outputting warning information to the driver through an output device.
11. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer execution instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any of claims 1-8.
12. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, are configured to implement the method of any one of claims 1-8.
13. A computer program product, characterized in that it comprises a computer program which, when being executed by a processor, carries out the method of any one of claims 1 to 8.
CN202211185729.1A 2022-09-27 2022-09-27 Video behavior identification method, system, device and equipment Pending CN115512267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211185729.1A CN115512267A (en) 2022-09-27 2022-09-27 Video behavior identification method, system, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211185729.1A CN115512267A (en) 2022-09-27 2022-09-27 Video behavior identification method, system, device and equipment

Publications (1)

Publication Number Publication Date
CN115512267A true CN115512267A (en) 2022-12-23

Family

ID=84506294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211185729.1A Pending CN115512267A (en) 2022-09-27 2022-09-27 Video behavior identification method, system, device and equipment

Country Status (1)

Country Link
CN (1) CN115512267A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690000A (en) * 2023-09-27 2024-03-12 中科迈航信息技术有限公司 Internet of things data interaction method and system based on artificial intelligence


Similar Documents

Publication Publication Date Title
CN108229478B (en) Image semantic segmentation and training method and device, electronic device, storage medium, and program
Li et al. Simultaneously detecting and counting dense vehicles from drone images
US20180114071A1 (en) Method for analysing media content
Hoang Ngan Le et al. Robust hand detection and classification in vehicles and in the wild
US20180018503A1 (en) Method, terminal, and storage medium for tracking facial critical area
CN111709409A (en) Face living body detection method, device, equipment and medium
CN110929622A (en) Video classification method, model training method, device, equipment and storage medium
CN110852256B (en) Method, device and equipment for generating time sequence action nomination and storage medium
CN110852222A (en) Campus corridor scene intelligent monitoring method based on target detection
CN107992937B (en) Unstructured data judgment method and device based on deep learning
CN116311214B (en) License plate recognition method and device
CN112699758A (en) Sign language translation method and device based on dynamic gesture recognition, computer equipment and storage medium
CN111400550A (en) Target motion trajectory construction method and device and computer storage medium
Koli et al. Human action recognition using deep neural networks
CN115512267A (en) Video behavior identification method, system, device and equipment
CN116189063B (en) Key frame optimization method and device for intelligent video monitoring
CN112766176A (en) Training method of lightweight convolutional neural network and face attribute recognition method
Ghali et al. CT-Fire: a CNN-Transformer for wildfire classification on ground and aerial images
CN116152747A (en) Human behavior intention recognition method based on appearance recognition and action modeling
Bibi et al. Human interaction anticipation by combining deep features and transformed optical flow components
Chowdhury et al. YOLO-based enhancement of public safety on roads and transportation in Bangladesh
CN113542866B (en) Video processing method, device, equipment and computer readable storage medium
US20240212392A1 (en) Determining inconsistency of local motion to detect edited video
Zhang et al. Using visual cropping to enhance fine-detail question answering of blip-family models
Hossain et al. Effectiveness of deep learning in real time object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination