CN116883774A - Training method of video behavior recognition model, video behavior recognition method and device

Info

Publication number: CN116883774A
Application number: CN202310681043.XA
Authority: CN (China)
Prior art keywords: behavior recognition, recognition model, domain, feature, image sequence
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 董帅, 李文生, 熊坤坤, 邹昆, 冯子钜, 叶润源
Assignees: Zhongshan Xidao Technology Co ltd; University of Electronic Science and Technology of China Zhongshan Institute
Application filed by Zhongshan Xidao Technology Co ltd and University of Electronic Science and Technology of China Zhongshan Institute
Priority to CN202310681043.XA
Publication of CN116883774A
Classifications

    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/048: Activation functions
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06N3/09: Supervised learning
    • G06N3/096: Transfer learning
    • G06V10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • Y02T10/40: Engine management systems

Abstract

The application provides a training method of a video behavior recognition model, a video behavior recognition method and a device. A specific implementation of the method includes: inputting a sample image sequence corresponding to a sample video stream into an initial behavior recognition model, where the sample video stream includes a source domain sample video stream and the initial behavior recognition model includes a classifier, a feature extractor and a decoder; extracting overall image features of the sample image sequence with the feature extractor; segmenting character image features from the overall image features with the decoder; and training the initial behavior recognition model based on the classification result output by the classifier and the character image features output by the decoder. The method reduces interference from the image background and improves the recognition performance of the model.

Description

Training method of video behavior recognition model, video behavior recognition method and device
Technical Field
The application relates to the field of information processing, in particular to a training method of a video behavior recognition model, a video behavior recognition method and a video behavior recognition device.
Background
The video behavior recognition model is a model for recognizing the behavior of people in a video. Typically, the model can be trained on collected sample video data: the sample video data are input into an initial behavior recognition model, and a classification result reflecting the person's behavior is output by the classifier of the initial behavior recognition model. Once the model converges, it can be used to recognize character behavior in actual scenes.
In the related art, the video behavior recognition model is easily disturbed by the image background when extracting image features, which degrades the performance of the model.
Disclosure of Invention
The embodiments of the present application aim to provide a training method of a video behavior recognition model, a video behavior recognition method and a video behavior recognition device, which reduce interference from the image background and improve the recognition performance of the model.
In a first aspect, an embodiment of the present application provides a training method for a video behavior recognition model, where the video behavior recognition model is constructed based on the MiCT-Net network framework. The method includes: inputting a sample image sequence corresponding to a sample video stream into an initial behavior recognition model, where the sample video stream includes a source domain sample video stream and the initial behavior recognition model includes a classifier, a feature extractor and a decoder; extracting overall image features of the sample image sequence with the feature extractor; segmenting character image features from the overall image features with the decoder; and training the initial behavior recognition model based on the classification result output by the classifier and the character image features output by the decoder.
In this way, the initial behavior recognition model can be regarded as a multi-task framework: the MiCT-Net network framework, consisting of the feature extractor and the classifier, performs the main task of recognizing character behavior, while the feature extractor also serves as an encoder so that, together with the decoder, it forms an auxiliary task framework that extracts character image features. The decoder therefore pushes the feature extractor to focus on the character image features and ignore the image background, which reduces interference from the background and effectively improves the recognition performance of the model.
In addition, the decoder, as part of the auxiliary task framework, only exists while the initial behavior recognition model is being trained and can afterwards be detached from the target behavior recognition model, so the processing speed of the target behavior recognition model in actual application scenarios is not reduced. The converged target behavior recognition model therefore balances processing speed and recognition performance and can meet the requirements of actual application scenarios.
Optionally, the sample image sequence is labeled with a segmentation label and a classification label, and training the initial behavior recognition model based on the classification result output by the classifier and the character image features output by the decoder includes: calculating a classification loss between the classification result and the classification label; calculating a segmentation loss between the character image features and the segmentation label; and back-propagating the classification loss and the segmentation loss to update the model parameters of the initial behavior recognition model. By calculating the segmentation loss and the classification loss separately, the model parameters of the initial behavior recognition model can be updated in a more targeted way, so that the model gradually converges toward the optimization objective.
Optionally, the segmentation label includes a binary mask label or a level set label; the binary mask label is obtained by segmenting the source domain sample image sequence corresponding to the source domain sample video stream with the image segmentation model Mask R-CNN; the level set label describes the human motion contour in the source domain sample image sequence with gray values. In this way, segmentation labels obtained with Mask R-CNN balance labeling efficiency and segmentation quality, and the level set label provides rich supervision information that directly supervises the character behaviors in the source domain sample image sequence.
Optionally, the decoder includes a 3D deconvolution layer, and segmenting the character image features from the overall image features with the decoder includes: for the last layer of downsampled feature map extracted by the feature extractor, performing upsampling processing on that feature map; and, for each layer of upsampled feature map produced during upsampling, performing feature fusion processing on that upsampled feature map and the corresponding downsampled feature map according to a preset feature fusion function, where the character image features are the features after feature fusion processing. In this way, the upsampled and downsampled feature maps are combined in the feature fusion processing to reduce the semantic information lost while the sample image sequence is upsampled, yielding character image features with relatively complete semantic information.
Optionally, training the initial behavior recognition model based on the classification result output by the classifier and the character image features output by the decoder includes: training the initial behavior recognition model based on the classification result output by the classifier for the last layer of downsampled feature map and the character image features output by the decoder for the last layer of downsampled feature map. Since the input of both the classifier and the decoder is the last layer of downsampled feature map extracted by the feature extractor, which carries the most complete semantic information, the classification and segmentation results are more accurate, which strengthens the recognition performance of the target behavior recognition model to a certain extent.
Optionally, the sample video stream further includes a target domain sample video stream, the initial behavior recognition model further includes a plurality of domain discriminators, and the method further includes: aligning features of the sample image sequences on different scales with the plurality of domain discriminators to obtain domain discrimination results; and training the initial behavior recognition model based on the classification result, the character image features and the domain discrimination results. In this way, the initial behavior recognition model is trained with the outputs of the decoder, the domain discriminators and the classifier, the features of the target domain sample video data can be aligned with those of the source domain sample video data, interference caused by the image background can be reduced, and the recognition performance of the target behavior recognition model is effectively improved.
Optionally, training the initial behavior recognition model based on the classification result, the character image features and the domain discrimination results includes: calculating a source domain predicted value corresponding to the source domain sample image sequence with a domain discriminator, where the source domain sample image sequence corresponds to the source domain sample video stream; calculating a target domain predicted value corresponding to the target domain sample image sequence with a domain discriminator, where the target domain sample image sequence corresponds to the target domain sample video stream; calculating the domain loss of the plurality of domain discriminators based on the source domain predicted value and the target domain predicted value; and back-propagating the classification loss, the segmentation loss and the domain loss to update the model parameters of the initial behavior recognition model. By calculating the domain loss, the segmentation loss and the classification loss separately, the model parameters of the initial behavior recognition model can be updated in a more targeted way, so that the model gradually converges toward the optimization objective.
In a second aspect, an embodiment of the present application provides a video behavior recognition method, where the method includes: extracting a to-be-processed image sequence corresponding to the to-be-processed video stream data; inputting the image sequence to be processed into a target behavior recognition model so as to output a behavior recognition result by using the target behavior recognition model; the target behavior recognition model is trained based on the method according to the first aspect. Thus, the behavior of the person in the video can be accurately identified.
In a third aspect, an embodiment of the present application provides a training apparatus for a video behavior recognition model, where the video behavior recognition model is constructed based on the MiCT-Net network framework. The apparatus includes: an input module for inputting a sample image sequence corresponding to a sample video stream into an initial behavior recognition model, where the sample video stream includes a source domain sample video stream and the initial behavior recognition model includes a classifier, a feature extractor and a decoder; a feature extraction module for extracting overall image features of the sample image sequence with the feature extractor; a segmentation module for segmenting character image features from the overall image features with the decoder; and a training module for training the initial behavior recognition model based on the classification result output by the classifier and the character image features output by the decoder.
In a fourth aspect, an embodiment of the present application provides a video behavior recognition apparatus, including: the extraction module is used for extracting an image sequence to be processed corresponding to the video stream data to be processed; the identification module is used for inputting the image sequence to be processed into a target behavior identification model so as to output a behavior identification result by utilizing the target behavior identification model; the target behavior recognition model is trained based on the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device comprising a processor and a memory storing computer readable instructions which, when executed by the processor, perform the steps of the method as provided in the first or second aspects above.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method as provided in the first or second aspects above.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a training method of a video behavior recognition model according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a decoder according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a network framework of a video behavior recognition model according to an embodiment of the present application;
FIG. 4 is a flowchart of a video behavior recognition method according to an embodiment of the present application;
FIG. 5 is a block diagram of a training device for a video behavior recognition model according to an embodiment of the present application;
fig. 6 is a block diagram of a video behavior recognition device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device for executing a training method of a video behavior recognition model or a video behavior recognition method according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
It should be noted that embodiments of the present application or technical features of embodiments may be combined without conflict.
In the related art, the video behavior recognition model is easily disturbed by the image background when extracting image features, which leads to poor model performance. To solve this problem, the present application provides a training method of a video behavior recognition model in which the image foreground is extracted by a decoder, so that the initial behavior recognition model focuses on the human body region in the foreground, interference from the image background is reduced, and the recognition performance of the model is improved.
In some application scenarios, the training method of the video behavior recognition model can be executed on a server or a cloud platform running a Ubuntu system, so as to meet the computing power and memory requirements of training. The following description takes the server as an example.
The drawbacks of the above related-art solutions are findings obtained by the inventor after practice and careful study; therefore, both the discovery of the above problems and the solutions proposed below for them should be regarded as contributions made by the inventor in the course of the present application.
Referring to fig. 1, a flowchart of a training method of a video behavior recognition model according to an embodiment of the present application is shown. The video behavior recognition model is constructed based on the MiCT-Net network framework. MiCT-Net adopts 3D/2D mixed convolution, on the basis of which image features in the video stream can be extracted. In addition, MiCT-Net has a small number of parameters and simple operators, which makes it easy to deploy on different terminals (such as development boards of different models).
As shown in fig. 1, the training method of the video behavior recognition model includes the following steps 101 to 104.
Step 101, inputting a sample image sequence corresponding to a sample video stream into an initial behavior recognition model; the sample video stream comprises a source domain sample video stream; the initial behavior recognition model comprises a classifier, a feature extractor and a decoder;
The source domain sample video stream is an existing sample video stream that can be learned from, for example a video stream provided by the UCF-101 or HMDB51 dataset.
In some application scenarios, the server may decode each source domain sample video stream into a number of image frames and then sample those frames at a fixed sampling interval. A fixed number of frames starting at a randomly chosen sampling point can then be selected to obtain the source domain sample image sequence corresponding to the source domain sample video stream. Image sequences with different sampling intervals or different numbers of frames may also be used, and the present application is not limited in this respect.
In these application scenarios, the sample image sequence corresponds to a plurality of sample image frames, which can then be subjected to augmentation. That is, the server may apply augmentation such as center cropping, random flipping and brightness enhancement to each sample image frame to enrich the sample data and thereby improve the generalization ability of the initial behavior recognition model.
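To make the augmentation step concrete, the following is a minimal sketch of such a per-frame augmentation pipeline using torchvision; the crop size, flip probability and brightness range are illustrative assumptions rather than values specified in this application.

```python
import torchvision.transforms as T

# Illustrative per-frame augmentation: center cropping, random flipping and brightness
# enhancement, as described above. All numeric values are assumptions for illustration.
frame_transform = T.Compose([
    T.CenterCrop(224),                 # center cropping
    T.RandomHorizontalFlip(p=0.5),     # random flipping
    T.ColorJitter(brightness=0.2),     # brightness enhancement
    T.ToTensor(),
])
```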
The server may then input the sample image sequence into an initial behavior recognition model. In some application scenarios, the initial behavior recognition model may include a classifier, a feature extractor, and a decoder.
The classifier is used for judging the behavior of the person; which may include a fully connected layer and a global average pooling layer.
The feature extractor is used to extract the overall image features of the sample image sequence; it may include, for example, a convolutional neural network (CNN) or a recurrent neural network (RNN).
The decoder may comprise 3D deconvolution layers. In some application scenarios, the decoder architecture may be determined with reference to a 3D U-Net image segmentation network, for example as shown in fig. 2, where the three-dimensional convolution parameters (Conv3d parameters) include the number of channels and the width and height of the convolution kernel. For example, for Conv3d 4×3×3 in fig. 2, the value 4 can be regarded as the number of channels of the convolution kernel, the first 3 as the width of the convolution kernel, and the last 3 as the height of the convolution kernel. The dimensions of the convolution layers may include 512, 256, 128, 64, 32, and so on.
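As a rough illustration of a decoder stage built from 3D deconvolution layers, the following PyTorch sketch shows one upsampling block; the channel widths, kernel shapes and strides are assumptions chosen for illustration rather than the exact configuration of fig. 2.

```python
import torch.nn as nn

class Deconv3dBlock(nn.Module):
    """One decoder stage: a 3D deconvolution (ConvTranspose3d) that doubles the spatial
    resolution, followed by a 3D convolution and ReLU. Channel sizes are illustrative."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_channels, out_channels,
                                     kernel_size=(1, 2, 2), stride=(1, 2, 2))
        self.conv = nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x has shape (batch, channels, frames, height, width)
        return self.act(self.conv(self.up(x)))
```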
Step 102, extracting overall image features of the sample image sequence with the feature extractor;
After the server inputs the sample image sequence into the initial behavior recognition model, the overall image features of the sample image sequence can be extracted with the feature extractor it contains. The overall image features include both the image background features and the character image features of the sample image sequence. Here, each sample image sequence can be regarded as the same sample image frame on different channels, with each sample image frame treated as one channel.
Step 103, segmenting the character image features from the overall image features with the decoder;
in some application scenarios, the feature extractor may be considered an encoder, which in turn may segment the person image features from the overall image features via a decoder.
And step 104, training the initial behavior recognition model based on the classification result output by the classifier and the character image characteristics output by the decoder.
After inputting the sample image sequence into the initial behavior recognition model, its classifier may output a classification result for the source domain sample video stream. And then combining the classification result and the character image characteristics to train the initial behavior recognition model to converge to the target behavior recognition model.
In this embodiment, the initial behavior recognition model may be regarded as a multi-tasking framework, i.e., a MiCT-Net network framework may be regarded as consisting of feature extractors and classifiers, which may perform the main task of recognizing human behavior; the feature extractor may then be used as an encoder to use the feature extractor and decoder as an auxiliary task framework that may perform auxiliary tasks for extracting features of the character image. Therefore, the decoder can prompt the feature extractor to extract only the character image features and ignore the image background, so that the interference condition of the image background is improved, and the recognition performance of the model is effectively improved.
In addition, the decoder, as part of the auxiliary task framework, may only exist in the process of training the initial behavior recognition model, and then may be split from the target behavior recognition model, so that the processing speed of the target behavior recognition model in the actual application scene is not attenuated. Therefore, the converged target behavior recognition model gives consideration to the processing speed and recognition performance, and can meet the requirements of actual application scenes.
In some alternative implementations, the sample image sequence is labeled with a segmentation tag and a classification tag. Here, since the source domain sample video stream is known, it may have a corresponding class label for tagging the person behavior. The classification labels may be obtained by manual labeling, for example.
And, the source domain sample video stream may have a corresponding split tag for marking the cut out character features. The above-mentioned division label can be obtained by, for example, manual labeling, threshold division, edge detection, or the like. The threshold segmentation method can segment the sample image sequence into an object part and a background part according to the pixel value, and then a corresponding character image can be obtained. The edge detection mode can be used for generating a mask image by detecting edges between different areas in the sample image sequence, and then a corresponding character image can be obtained.
In some alternative implementations, the split tag includes a binary mask tag or a level set tag;
the binary Mask tag is obtained by dividing a source domain sample image sequence corresponding to the source domain sample video stream through an image division model Mask R-CNN;
the algorithm of the image segmentation model Mask R-CNN (Region-Based Convolutional Networks, R-CNN) can be regarded as a combination algorithm of a target detection algorithm Faster-RCNN (Towards Real-Time Object Detection with Region Proposal Networks, faster R-CNN) and a semantic segmentation algorithm FCN (Fully Convolutional Networks for Semantic Segmentation, FCN), so that the result of semantic segmentation can be obtained while target detection is completed.
In some application scenarios, if the number of sample video streams is large, the labeling efficiency is low due to manual labeling, and the labeling efficiency can be improved by using threshold segmentation or edge detection, but the segmentation quality is poor.
In the implementation mode, the segmentation labels can be obtained by segmenting the source domain sample image sequence through the image segmentation model Mask R-CNN, so that the labeling efficiency and the segmentation quality are both considered.
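As an illustration of generating binary mask labels with an off-the-shelf Mask R-CNN, the sketch below uses torchvision's COCO-pretrained maskrcnn_resnet50_fpn; the score threshold, the use of COCO class id 1 for "person", and the union over instances are assumptions for illustration, not details taken from this application.

```python
import torch
import torchvision

# COCO-pretrained Mask R-CNN from torchvision (assumed available in the environment).
mask_rcnn = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def person_mask(image, score_thresh=0.5):
    """Return a binary person mask for one image tensor of shape (3, H, W), values in [0, 1]."""
    out = mask_rcnn([image])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)   # COCO class 1 = person
    if keep.sum() == 0:
        return torch.zeros(image.shape[-2:], dtype=torch.bool)
    masks = out["masks"][keep, 0] > 0.5     # (N, H, W) boolean instance masks
    return masks.any(dim=0)                 # union of all detected person instances
```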
The level set icon describes the human body action outline in the source domain sample image sequence through gray values.
The level set label can be obtained by Level Set Model (LSM) annotation.
In this implementation, the level set label represents the human motion contour with gray values and can therefore provide relatively rich supervision information to directly supervise the character behaviors in the source domain sample image sequence.
In some application scenarios, the training the initial behavior recognition model based on the classification result output by the classifier and the character image feature output by the decoder in the step 104 includes the following sub-steps:
a substep 1041 of calculating a classification loss between the classification result and the classification label;
in some application scenarios, the classification result output by the classifier and the classification label may be subjected to loss calculation to obtain classification loss. In these application scenarios, the classification loss may be calculated, for example, by a cross entropy loss function.
A substep 1042 of calculating a segmentation loss between the character image feature and the segmentation label;
In some application scenarios, the loss calculation may be performed on the character image features output by the decoder and the segmentation labels to obtain the segmentation loss. In these application scenarios, the segmentation loss may be calculated, for example, with a Dice loss function, or with a loss function that includes L1 or L2 regularization.
Sub-step 1043, back-propagating the classification loss and the segmentation loss to update model parameters of the initial behavior recognition model.
After the above classification loss and segmentation loss are obtained, their sum can be taken as the loss of the initial behavior recognition model. For example, the loss of the initial behavior recognition model can be characterized as $L_{total} = L_{cls} + L_{seg}$, where $L_{total}$ denotes the loss of the initial behavior recognition model, $L_{cls}$ the classification loss, and $L_{seg}$ the segmentation loss.
Then, the optimization objective of the initial behavior recognition model can be considered as: classification loss takes the minimum value and segmentation loss takes the minimum value. That is, when the classification loss is minimum and the segmentation loss is minimum, the loss of the initial behavior recognition model is minimum.
In some application scenarios, to achieve the optimization objective of the initial behavior recognition model, the segmentation loss and the classification loss can be back-propagated and the network parameters updated with a gradient descent algorithm until the initial behavior recognition model converges to the target behavior recognition model.
In this implementation, by calculating the segmentation loss and the classification loss separately, the model parameters of the initial behavior recognition model can be updated in a more targeted way, so that the initial behavior recognition model gradually converges toward the optimization objective.
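The following PyTorch sketch illustrates one such training step, with cross entropy for the classification loss and a soft Dice loss for the segmentation loss; the model interface (returning classification logits and mask logits) is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def dice_loss(mask_logits, mask_label, eps=1e-6):
    # Soft Dice loss between the predicted person-region probabilities and the segmentation label.
    prob = torch.sigmoid(mask_logits).flatten(1)
    target = mask_label.flatten(1)
    inter = (prob * target).sum(dim=1)
    union = prob.sum(dim=1) + target.sum(dim=1)
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def train_step(model, optimizer, frames, cls_label, seg_label):
    """One iteration: L_total = L_cls + L_seg, back-propagated and applied by the optimizer."""
    optimizer.zero_grad()
    cls_logits, mask_logits = model(frames)          # assumed model outputs
    l_cls = F.cross_entropy(cls_logits, cls_label)   # classification loss L_cls
    l_seg = dice_loss(mask_logits, seg_label)        # segmentation loss L_seg
    loss = l_cls + l_seg                             # L_total
    loss.backward()
    optimizer.step()
    return loss.item()
```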
It should be noted that the number of sample video streams may be plural, each sample video stream may correspond to a sample image sequence, and then the initial behavior recognition model may be iteratively trained through the plurality of sample image sequences until the initial behavior recognition model converges to the target behavior recognition model.
In some alternative implementations, the decoder includes 3D deconvolution layers, and segmenting the character image features from the overall image features with the decoder in step 103 may include the following sub-steps:
Sub-step 1031: for the last layer of downsampled feature map extracted by the feature extractor, perform upsampling processing on that feature map;
The feature extractor can extract feature maps of the sample image sequence at different scales, and when it serves as the encoder paired with a decoder composed of 3D deconvolution layers, the features extracted at each scale are integrated into one layer of downsampled feature map. Among these, the last layer of downsampled feature map carries the richest semantic information, so it can be input into the decoder to segment more accurate character image features.
After the decoder receives the last layer of downsampling feature map, upsampling processing can be performed on the layer of downsampling feature map to realize a deconvolution process, so as to obtain a multi-layer upsampling feature map.
Sub-step 1032, for each layer of up-sampling feature map in the up-sampling process, performing feature fusion processing on the layer of up-sampling feature map and the corresponding down-sampling feature map according to a preset feature fusion function; wherein the character image features are features after feature fusion processing.
When the decoder performs upsampling processing starting from the last layer of downsampled feature map, feature fusion processing can be carried out for each layer of upsampled feature map. Specifically, for each layer of upsampled feature map, the decoder obtains the downsampled feature map of the corresponding scale and fuses the two according to a preset feature fusion function. For example, if the feature extractor sequentially extracts downsampled feature maps at four scales, 64×28×28, 128×14×14, 256×7×7 and 512×4×4, then the decoder upsamples the last downsampled feature map (i.e. the 512×4×4 one) to obtain, in sequence, upsampled feature maps of 512×4×4, 256×7×7, 128×14×14 and 64×28×28; the upsampled feature map of scale 256×7×7 is then fused with the downsampled feature map of the same 256×7×7 scale. The preset feature fusion function may be, for example, the matrix concatenation function torch.cat, or a combination of the element-wise multiplication function torch.mul and the matrix addition function torch.add.
And the decoder performs feature fusion processing on the up-sampling feature map to output the character image features.
In this implementation, the upsampled and downsampled feature maps are combined in the feature fusion processing to reduce the semantic information lost while the sample image sequence is upsampled, yielding character image features with relatively complete semantic information.
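A minimal sketch of the two fusion variants mentioned above (torch.cat, and torch.mul combined with torch.add) is shown below; it assumes the upsampled and downsampled feature maps already have matching shapes.

```python
import torch

def fuse_features(up_feat, down_feat, mode="cat"):
    """Fuse an upsampled decoder feature map with the encoder's downsampled feature map
    of the same scale. Shapes are assumed to match; `mode` selects the fusion variant."""
    if mode == "cat":
        # channel-wise concatenation (matrix stitching)
        return torch.cat([up_feat, down_feat], dim=1)
    # element-wise variant: point-multiply, then add the skip connection back
    return torch.add(torch.mul(up_feat, down_feat), down_feat)
```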
In some optional implementations, training the initial behavior recognition model based on the classification result output by the classifier and the character image feature output by the decoder in step 104 includes: and training the initial behavior recognition model based on the classification result output by the classifier for the last layer of downsampled feature images and the character image features output by the decoder for the last layer of downsampled feature images.
In some application scenarios, the last layer of downsampled feature map may be used as input to the classifier and decoder, respectively. Thus, the classifier may output a corresponding classification result for the last layer of downsampled feature map, and the decoder may output a corresponding segmentation result (i.e., a person image feature) for the last layer of downsampled feature map.
In this implementation, the input of both the classifier and the decoder is the last layer of downsampled feature map extracted by the feature extractor, which carries the most complete semantic information; the classification and segmentation results are therefore more accurate, which strengthens the recognition performance of the target behavior recognition model to a certain extent.
In some alternative implementations, as shown in fig. 3, the sample video stream further includes a target domain sample video stream, and the initial behavior recognition model further includes a plurality of domain discriminators.
The target domain sample video stream is a sample video stream to be learned or migrated to; the number of such sample video streams can be relatively small, so that a target behavior recognition model suitable for the whole target domain can be trained with only a few target domain video streams.
In some application scenarios, the target domain sample video stream may include, for example, a Smoke UCF101 dataset and a Smoke HMDB51 dataset. Wherein the Smoke UCF101 dataset can be synthesized based on the UCF-101 dataset, which is used for simulating actions under the condition of dense fog; the Smoke HMDB51 dataset may be synthesized based on the HMDB51 dataset, which is used to simulate actions in a heavy fog situation.
The domain discriminator is used for discriminating whether the sample image sequence belongs to the source domain or the target domain. The number of the domain discriminators may be the same as or half the number of the scale feature maps of the sample image sequence, for example, and the present invention is not limited thereto.
Further, each domain discriminator may consist of one gradient reversal layer (GRL) and three 1×1 convolution layers. Each convolution layer can be followed by an activation function layer (ReLU), and the domain discrimination result can be determined with a Sigmoid function.
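The sketch below shows one possible form of such a domain discriminator (a gradient reversal layer followed by three 1×1 convolutions with ReLU and a Sigmoid output); the use of 2D convolutions and the hidden width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (GRL): identity in the forward pass,
    gradient multiplied by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainDiscriminator(nn.Module):
    """GRL followed by three 1x1 convolutions with ReLU, ending in a Sigmoid domain score."""
    def __init__(self, in_channels, hidden=64, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, feat):
        feat = GradReverse.apply(feat, self.lambd)
        return torch.sigmoid(self.net(feat))   # per-location source/target probability
```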
In some application scenarios, the server may input the sample image sequence corresponding to each sample video stream into the initial behavior recognition model. For example, the source domain and target domain sample image sequences may be mixed and loaded in random order, which enriches the sample image sequences and thereby improves the generalization ability of the initial behavior recognition model.
In some application scenarios, when the server inputs the sample image sequences into the initial behavior recognition model, the source domain sample image sequence and the target domain sample image sequence may, for example, be packed together so that they are fed into the initial behavior recognition model jointly, avoiding the data imbalance that would arise if they were not input together. In these application scenarios, the number of packed sample image sequences may be determined, for example, by the memory capacity of the graphics card.
Thus, the training method of the video behavior recognition model further comprises the following steps:
step 105, aligning the features of the sample image sequence on different scales by using the multiple domain discriminators to obtain a domain discrimination result;
In some application scenarios, after the sample image sequence is input into the initial behavior recognition model, it can be processed by the domain discriminators. In particular, the domain discriminators can align the features of the sample image sequence on different scales.
Further, for one of the scale feature maps, the scale feature map may be used as an input of one of the domain discriminators, and then the domain discriminators may output a domain discrimination result of whether the scale feature map is a source domain or a target domain.
And step 106, training the initial behavior recognition model based on the classification result, the character image characteristics and the domain discrimination result.
In some application scenarios, the source domain sample video stream has a classification tag, while the target domain sample video stream has no classification tag, so the target domain sample video stream does not need to determine its classification result. Then, only the source domain sample image sequence corresponding to the source domain sample video stream may be input into the classifier to obtain a corresponding classification result.
Then, the initial behavior recognition model can be trained to converge to the target behavior recognition model by combining the classification result, the character image features and the domain discrimination results output by the plurality of domain discriminators.
In the related art, the inconsistent feature distributions of the target domain sample video data and the source domain sample video data reduce the performance of the model.
In this embodiment, when the initial behavior recognition model is trained, the plurality of domain discriminators align the features of the source domain sample video data with those of the target domain sample video data and predict the domain results; because of the GRL inside each domain discriminator, the discrimination loss is reversed, which forms an adversarial training scheme between the feature extractor and the domain discriminators and improves the recognition performance of the target behavior recognition model to a certain extent.
In addition, the domain discriminators only exist during the training of the initial behavior recognition model, so training does not affect the processing speed of the target behavior recognition model in actual application scenarios; the converged target behavior recognition model therefore balances processing speed and recognition performance and can meet the requirements of actual application scenarios.
In this implementation, the initial behavior recognition model is trained with the outputs of the decoder, the domain discriminators and the classifier; the features of the target domain sample video data can be aligned with those of the source domain sample video data, interference caused by the image background can be reduced, and the recognition performance of the target behavior recognition model is effectively improved.
It should be noted that in these application scenarios, the input of the classifier and the decoder may be the last layer of downsampled feature map extracted by the feature extractor, so that more accurate classification loss and segmentation loss may be obtained, which improves the recognition performance of the target behavior recognition model to a certain extent.
In some application scenarios, before the characteristics of the sample image sequences on different scales are aligned by using the multiple domain discriminators in the step 105 to obtain a domain discrimination result, the training method of the video behavior recognition model may further include: extracting image features of the sample image sequence by using the feature extractor to obtain a shallow feature map and a deep feature map; the feature extractor extracts the shallow feature map and the deep feature map based on a MiCT-Net network framework;
in some application scenarios, a feature extractor may extract image features of a sample image sequence. In these application scenarios, the feature extractor may extract shallow feature maps as well as deep feature maps through the MiCT-Net network framework. The MiCT-Net network framework adopts a 3D/2D mixed convolution mode, and shallow layer characteristics or deep layer characteristics can be distinguished based on the 3D/2D mixed convolution mode.
The shallow features may include, for example, local features such as color features and brightness features of an image; the deep features may include global features such as facial contours, limb contours, etc. of the human body.
In these application scenarios, the MiCT-Net network framework may, for example, adopt the network architecture shown in fig. 3. If the MiCT-Net network framework uses four modules to extract four layers of feature maps, the first three layers can be taken as shallow feature maps and the fourth layer as the deep feature map.
Thus, aligning the features of the sample image sequence on different scales with the plurality of domain discriminators as described in step 105 to obtain domain discrimination results includes: aligning the shallow features and the deep features corresponding to the sample image sequence with the domain discriminators, respectively.
For the shallow feature maps and the deep feature map extracted by the feature extractor, feature alignment can be performed with a plurality of domain discriminators. For example, for the shallow feature map extracted by the first module in fig. 3, the domain discriminator D1 can identify whether it belongs to the source domain or the target domain; for the deep feature map extracted by the last module, the domain discriminator D4 can identify whether it belongs to the source domain or the target domain. It should be noted that the number of domain discriminators may be 2, 3 or 4, which is not limited here.
In these application scenarios, the domain discriminators align the shallow features and the deep features respectively, so the feature extractor is trained to extract image information comprehensively, which improves the feature extraction performance of the target behavior recognition model.
In some application scenarios, the multiple domain discriminators are in one-to-one correspondence with the multi-layer feature map extracted by the feature extractor. That is, the number of domain discriminators is the same as the number of layers of the feature map extracted by the feature extractor. For example, in fig. 3, when the feature extractor extracts 4 layers of feature graphs, there may be 4 domain discriminators (i.e., D1, D2, D3, D4 shown in fig. 3), and each domain discriminator processes a corresponding feature graph.
Thus, the aligning the shallow features and the deep features corresponding to the sample image sequence by using the domain discriminators includes:
Step 1, for any domain discriminator corresponding to a shallow feature map, take the shallow feature map corresponding to that discriminator as input, so that the discriminator outputs a domain discrimination result for the shallow feature map.
After the feature extractor extracts the shallow feature maps, each shallow feature map can be input into its corresponding domain discriminator, so that the discriminator outputs the domain discrimination result for that shallow feature map.
Step 2, for any domain discriminator corresponding to a deep feature map, take the deep feature map corresponding to that discriminator as input, so that the discriminator outputs a domain discrimination result for the deep features.
After the feature extractor extracts the deep feature maps, each deep feature map can be input into its corresponding domain discriminator, so that the discriminator outputs the domain discrimination result for that deep feature map.
In these application scenarios, when the feature extractor extracts features based on the MiCT-Net network framework, multiple shallow feature maps and one or more deep feature maps may be extracted, so for each feature map, a domain discriminator may be set to align the features of each feature map, so as to obtain more comprehensive domain information.
In some application scenarios, the loss functions of the shallow domain discriminators and the deep domain discriminator are different: the loss function of the shallow domain discriminators includes a weighted mean square loss function, and the loss function of the deep domain discriminator includes a focal loss function.
In some application scenarios, the shallow features may be aligned in a strong alignment manner and the deep features in a weak alignment manner, for two reasons. First, the categories of the global features corresponding to the source domain sample video stream and the target domain sample video stream may be completely different; for example, the global features of the source domain sample video stream may characterize a person playing ball while those of the target domain sample video stream characterize a person jumping. With strong alignment, the playing behavior might be forcibly identified as the jumping behavior, which would reduce the recognition performance of the model. Second, if the features contained in a feature map are too prominent (for example, a feature map showing a person playing basketball in the daytime versus dancing at night, where the brightness feature can be regarded as such an overly prominent feature), the domain discriminator treats them as simple features and predicts the domain discrimination result too easily; its adversarial interaction with the feature extractor then becomes weak, which can also degrade the recognition performance of the model. Therefore, the deep and shallow domain discriminators can use different loss functions so that each feature map is aligned in an appropriate manner.
In these application scenarios, the shallow domain discriminators may use a weighted mean square error loss (WMSE Loss) and the deep domain discriminator may use a focal loss (Focal Loss).
The weighted mean square loss function is used for two reasons. First, the domain discriminator determines whether the input data belong to the source domain or the target domain, i.e. it essentially performs a classification task, and the mean square error loss fits the function by measuring the difference between the predicted value and the true value, so inconsistencies between predicted and true values receive sufficient attention during training. Second, weighting is used because the semantic information contained in feature maps of different scales is not completely consistent, and different weight values help stabilize model training.
The focal loss function is used because it lets the model pay differing degrees of attention to hard samples, which better fits the weak alignment manner adopted here.
In these application scenarios, the shallow features and deep features can be aligned in a targeted way by domain discriminators with different loss functions, which improves the recognition performance of the model to a certain extent.
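For illustration, the two loss variants described above can be sketched as follows; the domain label convention (0 for source, 1 for target), the weight parameter and the focusing parameter gamma are assumptions rather than values specified in this application.

```python
import torch
import torch.nn.functional as F

def weighted_mse_loss(domain_pred, domain_target, weight=1.0):
    # Strong alignment for shallow features: weighted mean square error between the
    # predicted domain score and the domain label (0 = source, 1 = target).
    return weight * F.mse_loss(domain_pred, domain_target)

def focal_loss(domain_pred, domain_target, gamma=2.0, eps=1e-6):
    # Weak alignment for deep features: the focal term (1 - p_t)^gamma down-weights
    # easy samples so the discriminator focuses on hard ones.
    p = domain_pred.clamp(eps, 1.0 - eps)
    p_t = torch.where(domain_target > 0.5, p, 1.0 - p)
    return (-(1.0 - p_t) ** gamma * torch.log(p_t)).mean()
```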
In some application scenarios, the shallow domain identifier and the deep domain identifier have the same loss function; the loss function comprises a weighted mean square loss function.
In some application scenarios, there may be no situation that the categories of global features corresponding to the source domain sample video stream and the target domain sample video stream are completely different or the features are significant, so the loss functions of the multiple domain discriminators may be the same. Preferably, to avoid the domain arbiter from reducing the attention to simple features, the features may all be aligned using a strongly aligned weighted mean square loss function to align the extracted individual features.
It should be noted that the weighted mean square error loss function and the focal loss function are both commonly used in the art. On the basis of knowing whether the features to be aligned in the present application require strong alignment or weak alignment, a person skilled in the art can make an adaptive selection according to actual needs, which is not limited herein.
In some optional implementations, training the initial behavior recognition model based on the classification result, the character image feature, and the domain discrimination result described in step 106 above includes:
Step 1061, calculating a source domain predicted value corresponding to the source domain sample image sequence by using a domain discriminator; the source domain sample image sequence corresponds to the source domain sample video stream;
After the source domain sample image sequence is input into the domain discriminator, the domain discriminator can calculate the source domain predicted value corresponding to the source domain sample image sequence. Specifically, the domain discriminator may calculate the source domain predicted value corresponding to the source domain sample image sequence using a Sigmoid function.
Step 1062, calculating a target domain predicted value corresponding to the target domain sample image sequence by using a domain discriminator; the target domain sample image sequence corresponds to the target domain sample video stream.
After the target domain sample image sequence is input into the domain discriminator, the domain discriminator can calculate the target domain predicted value corresponding to the target domain sample image sequence. Specifically, the domain discriminator may likewise calculate the target domain predicted value using the Sigmoid function.
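Purely as an illustration (the module name DomainDiscriminator and the layer sizes are assumptions, not structures specified by this application), a discriminator head that maps a feature map to a Sigmoid domain prediction might be sketched as:

```python
import torch
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    # Minimal sketch: pools a 5-D feature map (N, C, T, H, W) to a vector and
    # predicts the probability that the input comes from the target domain.
    def __init__(self, in_channels, hidden=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(
            nn.Linear(in_channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat):
        x = self.pool(feat).flatten(1)
        return torch.sigmoid(self.fc(x))   # predicted value in (0, 1)

# Usage sketch: the same discriminator scores both domains.
# src_pred = discriminator(src_feat)   # source domain predicted value
# tgt_pred = discriminator(tgt_feat)   # target domain predicted value
```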
Step 1063, calculating a domain loss of a plurality of the domain discriminators based on the source domain prediction value and the target domain prediction value;
After the source domain predicted value and the target domain predicted value are obtained from the domain discriminators, the domain loss corresponding to each domain discriminator can be calculated.
In some application scenarios, the accumulated sum of the domain losses corresponding to the respective domain discriminators may be determined as the domain loss of the plurality of domain discriminators. In these application scenarios, if there are shallow features and deep features, the loss of the multiple domain discriminators can be characterized by the following formula: L_MUDA = Σ_{i=1}^{3} λ_i·L_{D_i} + λ_4·L_{D_4}. Here the MiCT-Net network architecture currently extracts four layers of feature maps; L_MUDA represents the loss of the multiple domain discriminators; Σ_{i=1}^{3} λ_i·L_{D_i} represents the loss of the shallow domain discriminators (i.e., the domain discriminators corresponding to the first three layers); λ_4·L_{D_4} represents the loss of the deep domain discriminator (i.e., the domain discriminator corresponding to the fourth layer); and λ represents the hyperparameters of the domain discriminators, used to adjust the loss weight of each domain discriminator so as to control the degree of feature alignment.
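A minimal sketch of that accumulation (the list of per-layer losses and the lambda values are placeholders; the split into three shallow layers plus one deep layer follows the four-layer feature maps described above):

```python
def multi_domain_loss(shallow_losses, deep_loss, lambdas):
    # shallow_losses: [L_D1, L_D2, L_D3] from the shallow domain discriminators (weighted MSE)
    # deep_loss:      L_D4 from the deep domain discriminator (focal loss)
    # lambdas:        [lambda1, ..., lambda4] hyperparameters controlling the degree of alignment
    l_muda = sum(lam * loss for lam, loss in zip(lambdas[:3], shallow_losses))
    return l_muda + lambdas[3] * deep_loss
```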
Step 1064, back-propagating the classification loss, the segmentation loss, and the domain loss to update model parameters of the initial behavior recognition model.
In some application scenarios, since the feature extractor extracts features based on the MiCT-Net network framework, the initial behavior recognition model can be regarded as comprising a MiCT-Net portion, a decoder portion, and a domain discriminator portion. The model loss of the initial behavior recognition model then contains the losses of the MiCT-Net portion, the decoder portion, and the domain discriminator portion.
Then, after the above domain loss, segmentation loss, and classification loss are obtained, the sum of the three can be determined as the loss of the initial behavior recognition model. For example, the loss of the initial behavior recognition model may be characterized by the following formula: L_total = L_cls + L_MUDA + L_seg, where L_total represents the loss of the initial behavior recognition model, L_cls represents the classification loss, L_MUDA represents the domain loss of the multiple domain discriminators, and L_seg represents the segmentation loss. It should be noted that, since a gradient reversal layer exists in the domain discriminator, the larger the domain loss, the smaller the loss of the initial behavior recognition model.
Then, the optimization objective of the initial behavior recognition model can be considered as: classification loss takes minimum value, domain loss takes maximum value, and segmentation loss takes minimum value. That is, when the classification loss and the segmentation loss are minimum and the domain loss is maximum, the loss of the initial behavior recognition model is minimum.
In some application scenarios, in order to achieve the optimization objective of the initial behavior recognition model, the domain loss, the segmentation loss, and the classification loss may be back-propagated, and the network parameters may be updated using a gradient descent algorithm until the initial behavior recognition model converges to the target behavior recognition model.
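As a hedged illustration of how a gradient reversal layer lets a single backward pass realize "minimize classification and segmentation loss while maximizing domain loss" (the class name GradReverse and the coefficient alpha are assumptions):

```python
import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies gradients by -alpha in the backward pass,
    # so minimizing the domain loss downstream maximizes it with respect to the feature extractor.
    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

# Training-step sketch (names are placeholders):
# total_loss = cls_loss + seg_loss + muda_loss   # muda_loss computed on GradReverse-d features
# optimizer.zero_grad()
# total_loss.backward()
# optimizer.step()                               # gradient descent update of the network parameters
```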
In this implementation manner, by calculating the domain loss, the segmentation loss, and the classification loss separately, the model parameters of the initial behavior recognition model can be updated in a clearer, better-targeted manner, so that the initial behavior recognition model gradually converges toward the optimization objective.
In some application scenarios, after the initial behavior recognition model converges to obtain the target behavior recognition model, a test may be performed using the target domain dataset. In these application scenarios, the recognition performance of the target behavior recognition model can be determined by calculating the classification accuracy. Further, in these application scenarios, for example, part of the data in the target domain data set may be used for training and the rest of the data may be used for testing.
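For instance (a sketch only; the data split and variable names are arbitrary), the classification accuracy on held-out target-domain data could be computed as:

```python
import torch

@torch.no_grad()
def target_domain_accuracy(model, test_loader, device="cuda"):
    # test_loader yields (clip, label) pairs from the held-out part of the target domain dataset.
    model.eval()
    correct, total = 0, 0
    for clips, labels in test_loader:
        logits = model(clips.to(device))   # classifier output of the target behavior recognition model
        correct += (logits.argmax(dim=1) == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / max(total, 1)
```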
Referring to fig. 4, a flowchart of a video behavior recognition method according to an embodiment of the present application is shown, and the method includes the following steps 401 to 402:
step 401, extracting a to-be-processed image sequence corresponding to the to-be-processed video stream data;
After receiving the video stream data to be processed, the server may extract the image sequence to be processed corresponding to the video stream data to be processed. The process of extracting the image sequence to be processed may be the same as or similar to the process of extracting the sample image sequence in step 101, and is not repeated here.
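One possible frame-extraction sketch using OpenCV (the sampling stride and frame size are illustrative choices, not requirements of this application):

```python
import cv2

def extract_image_sequence(video_path, stride=4, size=(224, 224)):
    # Samples one frame every `stride` frames and resizes it, producing the image sequence to be processed.
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(cv2.resize(frame, size))
        idx += 1
    cap.release()
    return frames
```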
Step 402, inputting the image sequence to be processed into a target behavior recognition model to output a behavior recognition result by using the target behavior recognition model; the target behavior recognition model is trained based on the method described in the embodiment shown in fig. 1.
After the server extracts the image sequence to be processed, the image sequence to be processed can be input into the target behavior recognition model, and the target behavior recognition model can then output a classification result by using its classifier, so as to recognize the character behavior corresponding to the video stream data to be processed.
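An inference sketch, assuming the converged model exposes a standard forward pass that returns class logits (the tensor layout and function name are assumptions):

```python
import numpy as np
import torch

@torch.no_grad()
def recognize_behavior(model, image_sequence, device="cuda"):
    # image_sequence: list of frames shaped (H, W, C), e.g. produced by the extraction step above.
    model.eval()
    clip = torch.as_tensor(np.stack(image_sequence), dtype=torch.float32)  # (T, H, W, C)
    clip = clip.permute(3, 0, 1, 2).unsqueeze(0).to(device)                # -> (1, C, T, H, W)
    logits = model(clip)                                                   # classifier output
    return int(logits.argmax(dim=1).item())                                # predicted behavior class index
```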
In this embodiment, the behavior of the person in the video stream to be processed can be recognized using the target behavior recognition model. During training, the initial behavior recognition model is regarded as a multi-task framework. That is, the MiCT-Net network framework is regarded as being composed of a feature extractor and a classifier, which execute the main task of recognizing character behavior; the feature extractor is additionally used as an encoder, so that the feature extractor and the decoder together form an auxiliary-task framework that executes the auxiliary task of extracting character image features. The decoder therefore prompts the feature extractor to extract only the character image features and ignore the image background, which reduces interference from the image background and effectively improves the recognition performance of the model.
In addition, the decoder, as part of the auxiliary-task framework, only needs to exist while the initial behavior recognition model is being trained; afterwards it can be removed from the target behavior recognition model, so the processing speed of the target behavior recognition model in an actual application scenario is not reduced. The converged target behavior recognition model therefore balances processing speed and recognition performance, and can meet the requirements of actual application scenarios.
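One way such a removal could be realized (a sketch; the attribute name `decoder` is an assumption about how the model might be organized):

```python
def strip_decoder(model):
    # After convergence, drop the auxiliary decoder so that deployment keeps only the
    # feature extractor + classifier path; inference speed is therefore unaffected.
    if hasattr(model, "decoder"):
        model.decoder = None
    return model
```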
It will be appreciated by those skilled in the art that, in the method of the above specific embodiment, the written order of the steps does not imply a strict order of execution; the actual execution order should be determined by the function and possible inherent logic of the steps.
Referring to fig. 5, a block diagram of a training apparatus for a video behavior recognition model according to an embodiment of the present application is shown, where the training apparatus for a video behavior recognition model may be a module, a program segment, or a code on an electronic device. It should be understood that the apparatus corresponds to the embodiment of the method of fig. 1 described above, and is capable of performing the steps involved in the embodiment of the method of fig. 1, and specific functions of the apparatus may be referred to in the foregoing description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy.
Optionally, the training apparatus of the video behavior recognition model includes an input module 501, a feature extraction module 502, a segmentation module 503, and a training module 504; the video behavior recognition model is constructed based on the MiCT-Net network framework. The input module 501 is configured to input a sample image sequence corresponding to a sample video stream into an initial behavior recognition model; the sample video stream comprises a source domain sample video stream; the initial behavior recognition model comprises a classifier, a feature extractor, and a decoder. The feature extraction module 502 is configured to extract overall image features of the sample image sequence using the feature extractor, and the segmentation module 503 is configured to segment character image features from the overall image features using the decoder. The training module 504 is configured to train the initial behavior recognition model based on the classification result output by the classifier and the character image features output by the decoder.
Optionally, the sample image sequence is labeled with a segmentation label and a classification label, and the training module 504 is further configured to: calculating a classification loss between the classification result and the classification label; calculating segmentation loss between the character image features and the segmentation labels; and back-propagating the classification loss and the segmentation loss to update model parameters of the initial behavior recognition model.
Optionally, the segmentation label comprises a binary mask label or a level set label; the binary mask label is obtained by segmenting the source domain sample image sequence corresponding to the source domain sample video stream with the image segmentation model Mask R-CNN; the level set label describes the human body action contour in the source domain sample image sequence through gray values.
Optionally, the decoder comprises a 3D deconvolution layer, and the segmentation module 503 is further configured to: performing up-sampling processing on the last layer of down-sampled feature map extracted by the feature extractor; and, for each layer of up-sampled feature map in the up-sampling process, performing feature fusion processing on that layer of up-sampled feature map and the corresponding down-sampled feature map according to a preset feature fusion function; wherein the character image features are the features after the feature fusion processing.
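A highly simplified decoder-block sketch (channel counts and the element-wise-sum fusion stand in for the preset feature fusion function; shapes of the fused maps are assumed to match):

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    # Up-samples a feature map with 3-D deconvolution and fuses it with the
    # down-sampled feature map of the corresponding scale (skip connection).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.refine = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = self.deconv(x)   # up-sampling processing
        x = x + skip         # assumed fusion: element-wise sum with the corresponding down-sampled map
        return self.refine(x)
```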
Optionally, the training module 504 is further configured to: training the initial behavior recognition model based on the classification result output by the classifier for the last layer of down-sampled feature map and the character image features output by the decoder for the last layer of down-sampled feature map.
Optionally, the sample video stream further includes a target domain sample video stream, the initial behavior recognition model further includes a plurality of domain discriminators, and the apparatus further includes a feature alignment module and a discrimination training module. The feature alignment module is configured to: aligning the features of the sample image sequences on different scales by utilizing the multiple domain discriminators to obtain domain discrimination results; and the discrimination training module is configured to: training the initial behavior recognition model based on the classification result, the character image features, and the domain discrimination result.
Optionally, the discrimination training module is further configured to: calculating a source domain predicted value corresponding to the source domain sample image sequence by using a domain discriminator, the source domain sample image sequence corresponding to the source domain sample video stream; calculating a target domain predicted value corresponding to the target domain sample image sequence by using a domain discriminator, the target domain sample image sequence corresponding to the target domain sample video stream; calculating the domain losses of the plurality of domain discriminators based on the source domain predicted value and the target domain predicted value; and back-propagating the classification loss, the segmentation loss, and the domain loss to update the model parameters of the initial behavior recognition model.
Referring to fig. 6, a block diagram of a video behavior recognition apparatus, which may be a module, a program segment, or a code on an electronic device, is shown in an embodiment of the present application. It should be understood that the apparatus corresponds to the embodiment of the method of fig. 4 described above, and is capable of performing the steps involved in the embodiment of the method of fig. 4, and specific functions of the apparatus may be referred to in the foregoing description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy.
Optionally, the video behavior recognition apparatus includes an extraction module and an identification module. The extraction module is used for extracting an image sequence to be processed corresponding to the video stream data to be processed; the identification module is used for inputting the image sequence to be processed into a target behavior recognition model so as to output a behavior recognition result by utilizing the target behavior recognition model; the target behavior recognition model is trained based on the method of the embodiment shown in fig. 1.
It should be noted that, for convenience and brevity of description, for the specific working process of the apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, and the description is not repeated here.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device for executing a training method of a video behavior recognition model or a video behavior recognition method according to an embodiment of the present application. The electronic device may include: at least one processor 701, such as a CPU, at least one communication interface 702, at least one memory 703, and at least one communication bus 704, where the communication bus 704 is used to enable direct connection communication among these components. The communication interface 702 of the device in the embodiment of the present application is used for signaling or data communication with other node devices. The memory 703 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 703 may optionally also be at least one storage device located remotely from the aforementioned processor. The memory 703 stores computer readable instructions which, when executed by the processor 701, cause the electronic device to perform the method process described above in fig. 1 or fig. 4.
It will be appreciated that the configuration shown in fig. 7 is merely illustrative, and that the electronic device may also include more or fewer components than those shown in fig. 7, or have a different configuration than that shown in fig. 7. The components shown in fig. 7 may be implemented in hardware, software, or a combination thereof.
Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, is capable of performing a method procedure performed by an electronic device in an embodiment of the method as shown in fig. 1 or fig. 4.
Embodiments of the present application provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable execution of the methods provided by the method embodiments described above. For example, the method may comprise: inputting a sample image sequence corresponding to the sample video stream into an initial behavior recognition model; the sample video stream comprises a source domain sample video stream; the initial behavior recognition model comprises a classifier, a feature extractor, and a decoder; extracting overall image features of the sample image sequence by using the feature extractor; segmenting, with the decoder, character image features from the overall image features; and training the initial behavior recognition model based on the classification result output by the classifier and the character image features output by the decoder.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A training method of a video behavior recognition model, characterized in that the video behavior recognition model is constructed based on a MiCT-Net network framework, and the method comprises the following steps:
inputting a sample image sequence corresponding to the sample video stream into an initial behavior recognition model; the sample video stream comprises a source domain sample video stream; the initial behavior recognition model comprises a classifier, a feature extractor and a decoder;
extracting overall image features of the sample image sequence by using the feature extractor;
segmenting, with the decoder, character image features from the overall image features;
and training the initial behavior recognition model based on a classification result output by the classifier and the character image features output by the decoder.
2. The method of claim 1, wherein the sample image sequence is labeled with a segmentation label and a classification label, and
the training the initial behavior recognition model based on the classification result output by the classifier and the character image feature output by the decoder comprises the following steps:
calculating a classification loss between the classification result and the classification label;
calculating segmentation loss between the character image features and the segmentation labels;
and back-propagating the classification loss and the segmentation loss to update model parameters of the initial behavior recognition model.
3. The method of claim 2, wherein the segmentation label comprises a binary mask label or a level set label;
the binary mask label is obtained by segmenting a source domain sample image sequence corresponding to the source domain sample video stream through the image segmentation model Mask R-CNN;
and the level set label describes the human body action contour in the source domain sample image sequence through gray values.
4. A method according to any of claims 1-3, wherein the decoder comprises a 3D deconvolution layer, and
the segmenting, with the decoder, the character image feature from the overall image feature, comprising:
for the last layer of down-sampled feature map extracted by the feature extractor, performing up-sampling processing on that layer of down-sampled feature map;
for each layer of up-sampled feature map in the up-sampling process, performing feature fusion processing on that layer of up-sampled feature map and the corresponding down-sampled feature map according to a preset feature fusion function;
wherein the character image features are the features after the feature fusion processing.
5. The method of claim 4, wherein the training the initial behavior recognition model based on the classification result output by the classifier and the character image features output by the decoder comprises:
and training the initial behavior recognition model based on the classification result output by the classifier for the last layer of down-sampled feature map and the character image features output by the decoder for the last layer of down-sampled feature map.
6. A method according to claim 2 or 3, wherein the sample video stream further comprises a target domain sample video stream, the initial behavior recognition model further comprises a plurality of domain discriminators, and the method further comprises:
aligning the features of the sample image sequences on different scales by utilizing the multiple domain discriminators to obtain domain discrimination results; and
training the initial behavior recognition model based on the classification result, the character image features, and the domain discrimination result.
7. The method of claim 6, wherein the training the initial behavior recognition model based on the classification result, the character image feature, and the domain discrimination result comprises:
calculating a source domain predicted value corresponding to the source domain sample image sequence by using a domain discriminator; the source domain sample image sequence corresponds to the source domain sample video stream;
calculating a target domain predicted value corresponding to the target domain sample image sequence by using a domain discriminator; the target domain sample image sequence corresponds to the target domain sample video stream;
calculating the domain losses of a plurality of domain discriminators based on the source domain predicted value and the target domain predicted value;
and back-propagating the classification loss, the segmentation loss, and the domain loss to update model parameters of the initial behavior recognition model.
8. A method for identifying video behavior, comprising:
extracting a to-be-processed image sequence corresponding to the to-be-processed video stream data;
inputting the image sequence to be processed into a target behavior recognition model so as to output a behavior recognition result by using the target behavior recognition model; the target behavior recognition model is trained based on the method of any one of claims 1-7.
9. A training device for a video behavior recognition model, wherein the video behavior recognition model is constructed based on a MiCT-Net network framework, the device comprising:
the input module is used for inputting a sample image sequence corresponding to the sample video stream into the initial behavior recognition model; the sample video stream comprises a source domain sample video stream; the initial behavior recognition model comprises a classifier, a feature extractor and a decoder;
a feature extraction module for extracting overall image features of the sample image sequence by using the feature extractor;
a segmentation module for segmenting the character image features from the overall image features using the decoder;
and the training module is used for training the initial behavior recognition model based on the classification result output by the classifier and the character image characteristics output by the decoder.
10. A video behavior recognition apparatus, comprising:
the extraction module is used for extracting an image sequence to be processed corresponding to the video stream data to be processed;
the identification module is used for inputting the image sequence to be processed into a target behavior identification model so as to output a behavior identification result by utilizing the target behavior identification model; the target behavior recognition model is trained based on the method of any one of claims 1-7.
11. An electronic device comprising a processor and a memory storing computer readable instructions that, when executed by the processor, perform the method of any of claims 1-7 or 8.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the method according to any of claims 1-7 or 8.