CN115705706A - Video processing method, video processing device, computer equipment and storage medium - Google Patents

Video processing method, video processing device, computer equipment and storage medium

Info

Publication number
CN115705706A
Authority
CN
China
Prior art keywords
video
domain
features
target
source domain
Prior art date
Legal status
Pending
Application number
CN202110928921.4A
Other languages
Chinese (zh)
Inventor
马锦华
高远
陈培鹏
Current Assignee
Tencent Technology Shenzhen Co Ltd
Sun Yat Sen University
Original Assignee
Tencent Technology Shenzhen Co Ltd
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd and Sun Yat Sen University
Priority claimed from CN202110928921.4A
Publication of CN115705706A

Abstract

The application relates to a video processing method, a video processing device, a computer device and a storage medium. The method comprises the following steps: extracting the depth features of video samples under a source domain and a target domain respectively through a recognition model to be trained; performing multi-time-scale feature extraction on the depth features through a domain adaptation trainer to obtain multi-time-scale video features under the source domain and the target domain respectively; grouping and aligning the video features under the source domain and the target domain according to the time nodes and time scale weights corresponding to the video features, the time scale weight being positively correlated with the amount of information expressed by the corresponding video feature; and adjusting the model parameters of the recognition model and continuing the adversarial training according to the adversarial loss between the video features of the source domain and the target domain within the same group and the category loss between the prediction category of the video sample under the source domain and the corresponding sample label, until a training stop condition is met and training ends. With this method, the accuracy of video recognition can be effectively improved.

Description

Video processing method, video processing device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a video processing method and apparatus, a computer device, and a storage medium.
Background
With the rapid development of image processing and artificial intelligence technologies, video recognition technologies have emerged, for example, detecting and recognizing the behavior of objects in video content so as to automatically identify the category of a video.
In the related art, a network model is usually trained with a large amount of labeled sample data, and each static frame of a video in the labeled domain is classified and recognized by the trained network model. However, this approach is only suitable for classifying videos in the labeled domain; videos in other domains cannot be recognized accurately, resulting in low recognition accuracy for those domains.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video processing method, an apparatus, a computer device and a storage medium capable of effectively improving the accuracy of video recognition.
A method of video processing, the method comprising:
respectively extracting the depth features of the video samples under the source domain and the target domain through the recognition model to be trained; the video sample under the source domain carries a sample label;
performing multi-time scale feature extraction on the depth features through a domain adaptation trainer to respectively obtain multi-time scale video features under a source domain and a target domain;
according to the time nodes and time scale weights corresponding to the video features, video features under a source domain and a target domain are grouped and aligned; the time scale weight is positively correlated with the information quantity expressed by the corresponding video feature;
determining the adversarial loss according to the difference between the video features under the source domain and the target domain within the same group;
determining a category loss based on a difference between a prediction category of a video sample under a source domain and a corresponding sample label; the prediction category is obtained by classifying based on the video characteristics of the video sample in the source domain;
and adjusting the model parameters of the recognition model according to the adversarial loss and the category loss, and continuing the adversarial training until the training stop condition is met.
A video processing device, the device comprising:
the feature extraction module is used for respectively extracting the depth features of the video samples in the source domain and the target domain through the recognition model to be trained; the video sample under the source domain carries a sample label;
the domain adaptation training module is used for extracting the multi-time scale features of the depth features through a domain adaptation trainer to respectively obtain the multi-time scale video features under a source domain and a target domain; according to the time nodes and time scale weights corresponding to the video features, video features under a source domain and a target domain are grouped and aligned; the time scale weight is positively correlated with the information quantity expressed by the corresponding video feature;
the loss determining module is used for determining the adversarial loss according to the difference between the video features under the source domain and the target domain within the same group; and determining a category loss based on a difference between a prediction category of a video sample under the source domain and the corresponding sample label; the prediction category is obtained by classification based on the video features of the video sample in the source domain;
and the parameter adjusting module is used for adjusting the model parameters of the recognition model according to the adversarial loss and the category loss and continuing the adversarial training until the training stop condition is met.
In one embodiment, the domain adaptation training module is further configured to perform multi-time scale convolution processing on the depth features through a domain adaptation trainer, respectively, to obtain convolution results corresponding to the depth features; and respectively obtaining the video features of multiple time scales under the source domain and the target domain according to the time node weight corresponding to the depth feature and the corresponding convolution result.
In one embodiment, the domain adaptation trainer performs the multi-time-scale convolution processing through convolutional layers; the domain adaptation trainer further comprises a time node attention layer; the domain adaptation training module is further used for respectively allocating, through the time node attention layer, corresponding time node weights to the time nodes corresponding to the depth features according to the amount of information expressed by the depth features under each time node; the time node weight is positively correlated with the amount of information expressed by the depth features under the corresponding time node.
In one embodiment, the domain adaptation training module is further configured to determine, through a time scale attention layer of the domain adaptation trainer, information entropies corresponding to video features of respective time scales in a source domain and a target domain; the information entropy represents the information quantity expressed by the corresponding video characteristics; and respectively distributing corresponding time scale weights to the video features of each time scale according to the information entropy.
In one embodiment, the domain adaptation training module is further configured to determine, by the domain adaptation trainer, video features to be aligned in a source domain and a target domain according to a time node and a time scale weight corresponding to each of the video features; dividing the video features to be aligned into one group to obtain multiple groups of aligned video features; the video features within each group include video features at the source domain and the target domain at the same time scale.
In one embodiment, the domain adaptation training module is further configured to determine temporal node weights of video features in the source domain and the target domain at respective temporal nodes; and determining video features matched with time node weights and time scale weights in different domains from the video features in the source domain and the target domain, and taking the video features as video features to be aligned in the source domain and the target domain.
In one embodiment, the feature extraction module is further configured to extract initial features of the video samples in the source domain and the target domain respectively through an initial feature extractor in the recognition model to be trained; and to perform feature extraction on the initial features of the video samples in the source domain and the target domain respectively through a target feature extractor in the recognition model, so as to obtain the depth features of the video samples in the source domain and the target domain.
In one embodiment, the video processing apparatus further includes a classification module, configured to perform classification based on the video features of the video sample in the source domain through a classifier of the recognition model, so as to obtain a prediction category of the video sample in the source domain; and the parameter adjusting module is further used for adjusting the model parameters of the target feature extractor and the classifier in the recognition model according to the adversarial loss and the category loss and continuing the adversarial training, so that the difference between the video features under the source domain and the target domain within the same group is reduced during the iterative training of the recognition model, and training ends once the training stop condition is met.
In one embodiment, the loss determination module is further configured to determine a cross-entropy loss based on a difference between the prediction category of the video sample in the target domain and the prediction category of the video sample in the source domain; and the parameter adjusting module is further used for adjusting the model parameters of the recognition model according to the adversarial loss, the category loss and the cross-entropy loss and continuing the adversarial training, so that the difference between the video features under the source domain and the target domain within the same group is reduced during the iterative training of the recognition model, and training ends once the training stop condition is met.
In one embodiment, after training ends when the training stop condition is met, the video processing apparatus further includes a video recognition module, configured to perform depth feature extraction on a video to be processed in the target domain through the trained recognition model, so as to obtain target video features aligned across the source domain and the target domain; and to classify and recognize the video to be processed based on the target video features through the recognition model.
In one embodiment, the video recognition module is further configured to perform preliminary feature extraction on the video to be processed through a trained recognition model to obtain an initial feature of the video to be processed; and performing depth feature extraction on the initial features based on the time nodes and the time scale weights corresponding to the initial features to obtain target video features aligned in a source domain and a target domain.
In one embodiment, the target video features are target video behavior features; the video recognition module is further used for performing behavior recognition on the object in the video to be processed based on the target video behavior features through the recognition model.
A computer device comprising a memory storing a computer program and a processor implementing the steps in the video processing method of the embodiments of the application when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the video processing method of the embodiments of the present application.
A computer program product or computer program comprising computer instructions stored in a computer readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and when the processor executes the computer instructions, the steps in the video processing method of the embodiments of the present application are implemented.
According to the video processing method, the video processing device, the computer device and the storage medium, the depth features of the video samples under the source domain and the target domain are respectively extracted through the recognition model to be trained, and multi-time-scale feature extraction is performed on the depth features through the domain adaptation trainer to obtain multi-time-scale video features under the source domain and the target domain respectively. The video features under the source domain and the target domain are then grouped and aligned according to the time nodes and time scale weights corresponding to the video features. Because the time scale weight is positively correlated with the amount of information expressed by the corresponding video feature, the extracted features can be grouped based on the time nodes and time scale weights, and each group of features can be subjected to domain-adversarial training separately, so that the distributions of the video features under the source domain and the target domain can be aligned more accurately. The model parameters of the recognition model are then adjusted, and the adversarial training is continued, according to the adversarial loss between the video features of the source domain and the target domain within the same group and the category loss between the prediction category of the video sample under the source domain and the sample label it carries. In this way the distributions of the video features under the source domain and the target domain are aligned more accurately, the trained recognition model gains the ability to extract video features aligned across the source domain and the target domain, and the accuracy of classifying videos under the target domain can be effectively improved.
Drawings
FIG. 1 is a diagram of an exemplary video processing application;
FIG. 2 is a flow diagram of a video processing method in one embodiment;
FIG. 3 is a block diagram of a domain adaptation trainer in one embodiment;
FIG. 4 is a block diagram of the convolutional layer in the domain adaptation trainer in one embodiment;
FIG. 5 is a flow chart illustrating a video processing method according to another embodiment;
FIG. 6 is an overall architecture diagram of a recognition model during training in one embodiment;
FIG. 7 is an overall architecture diagram of a recognition model during training in a particular embodiment;
FIG. 8 is a flowchart illustrating the steps of a video identification process in one embodiment;
FIG. 9 is a block diagram showing the structure of a video processing apparatus according to one embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to illustrate the present application and are not intended to limit it.
The video processing method can be applied to computer equipment. The computer device may be a terminal or a server. It can be understood that the video processing method provided by the present application may be applied to a terminal, may also be applied to a server, may also be applied to a system including a terminal and a server, and is implemented through interaction between the terminal and the server.
The video processing method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may first obtain, from the server 104, video samples under the source domain and the target domain for training the recognition model. The terminal 102 then extracts the depth features of the video samples under the source domain and the target domain respectively through the recognition model to be trained; performs multi-time-scale feature extraction on the depth features through a domain adaptation trainer to obtain multi-time-scale video features under the source domain and the target domain respectively; groups and aligns the video features under the source domain and the target domain according to the time nodes and time scale weights corresponding to the video features; determines the adversarial loss according to the difference between the video features under the source domain and the target domain within the same group; determines a category loss based on the difference between the prediction category of a video sample under the source domain and the corresponding sample label; and adjusts the model parameters of the recognition model according to the adversarial loss and the category loss and continues the adversarial training until the training stop condition is met.
The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal 102 may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
Cloud computing refers to a mode of delivery and use of IT (Internet Technology) infrastructure, namely obtaining required resources through a network in an on-demand, easily extensible manner; in the broader sense, cloud computing refers to a delivery and use mode of services, namely obtaining required services through a network in an on-demand, easily extensible manner. Such services may be IT and software services, internet-related services, or other services. Cloud computing is a product of the development and fusion of traditional computing and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing. With the diversification of the internet, real-time data streams and connected devices, and driven by demands such as search services, social networks, mobile commerce and open collaboration, cloud computing has developed rapidly. Unlike earlier parallel and distributed computing, the emergence of cloud computing promotes a revolutionary change of the whole internet model and the enterprise management model in concept.
It can be understood that the video processing method in the embodiments of the present application adopts a machine learning technique in an artificial intelligence technique, and can train an identification model capable of accurately classifying and identifying videos. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction. It can be understood that the recognition model in some embodiments of the present application is trained by using machine learning techniques, and a recognition model trained in this way can achieve high accuracy in classifying and recognizing videos under the target domain.
In one embodiment, as shown in fig. 2, a video processing method is provided, which is described by taking the method as an example applied to a computer device, where the computer device may be a terminal or a server in fig. 1, and includes the following steps:
step S202, extracting the depth characteristics of the video samples in the source domain and the target domain respectively through the recognition model to be trained.
It is understood that a video consists of consecutive image frames, i.e., a plurality of temporally ordered frames; specifically, it may be a series of consecutive still images produced by photographing a dynamic scene.
The recognition model to be trained is an initial neural network model that needs to be trained; it is trained with video samples under the source domain and the target domain so that the trained recognition model has the ability to accurately classify videos under the target domain.
A domain refers to a particular range or attribute. Both the source domain and the target domain are domains of video, i.e., attribute ranges of videos. Data from different domains are data in which objects of the same category differ relatively greatly in appearance, and videos in different domains often differ in appearance. For example, the video domains may include various scene domains and various style domains. The scene domains correspond to videos generated under different environmental scenes: a video captured in a daytime scene and a video captured in a night scene, or a video captured in an indoor scene and a video captured in an outdoor scene, are videos in different scene domains.
It is understood that a video under the source domain is a video belonging to one domain, and a video under the target domain is a video belonging to another domain. In transfer learning, the domain that already possesses knowledge is the source domain, and the domain in which new knowledge is to be learned is the target domain. The goal of transfer learning is to extract useful knowledge from one or more source-domain tasks and apply it to a new target-domain task, which is particularly suitable for cross-domain video recognition tasks. When the data distributions of the source domain and the target domain are different but the tasks corresponding to the two are the same, this special kind of transfer learning is called domain adaptation.
The video samples comprise video samples under a source domain and video samples under a target domain. Wherein the video sample under the source domain carries the sample label. The sample label is a label marked for the category of the video under the source domain, and is used for performing difference comparison with the prediction category output by the identification model so as to adjust the model parameters of the identification model. The sample label may be generated by manual labeling.
The computer device first obtains video samples under the source domain and the target domain, and then extracts the depth features of the video samples under the source domain and the target domain respectively through the recognition model to be trained. Specifically, the depth feature of a video sample is a deep feature representation of the video sample. The depth features of a video sample may include frame-level features and may also include segment-level features. A frame-level feature is the feature corresponding to one frame of the video sample, and a segment-level feature is the feature corresponding to one segment of the video sample.
In some embodiments, in each round of training, the computer device inputs a video sample under the source domain and a video sample under the target domain into the recognition model to be trained, and extracts the depth features of the video samples under the source domain and the target domain through the recognition model to be trained, so as to perform adversarial training.
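For illustration only, the per-round feature extraction described above can be sketched in PyTorch roughly as follows. The class name FrameFeatureExtractor, the backbone architecture and the tensor shapes are assumptions made for this sketch and are not specified by the patent.

```python
import torch
import torch.nn as nn

class FrameFeatureExtractor(nn.Module):
    """Hypothetical frame-level depth feature extractor (a small 2D CNN applied
    to every frame). Shapes: (batch, frames, C, H, W) -> (batch, frames, dim)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = video.shape
        frames = video.reshape(b * t, c, h, w)
        feats = self.backbone(frames).flatten(1)      # (b*t, 64)
        return self.fc(feats).reshape(b, t, -1)       # (b, t, feat_dim)

# In each training round, source and target batches pass through the same extractor,
# so the later adaptation stages compare features from a shared feature space.
extractor = FrameFeatureExtractor()
x_src = torch.randn(4, 16, 3, 112, 112)   # labelled source-domain clips
x_tgt = torch.randn(4, 16, 3, 112, 112)   # unlabelled target-domain clips
f_src, f_tgt = extractor(x_src), extractor(x_tgt)   # frame-level depth features
```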
Step S204, performing multi-time-scale feature extraction on the depth features through a domain adaptation trainer to obtain multi-time-scale video features under the source domain and the target domain respectively.
The domain adaptation trainer is a neural network used for extracting the characteristics of the video samples in the source domain and the target domain, aligning the characteristics of the video samples in the source domain with the characteristics of the video samples in the target domain, and performing countermeasure training on the recognition model.
It is understood that the time scale refers to a measure of the time range. The time scale of the video may refer to the length of a video segment in a video sample. Specifically, the time scale may be determined according to the duration of the video segment, or according to the number of frames included in the video segment. For example, the time scale may be a video segment duration, such as may be 3 seconds, 4 seconds, 5 seconds, etc.; the time scale may also be the number of video segment frames, such as may be 3 frames, 4 frames, 5 frames, etc.
The computer equipment acquires video samples under a source domain and a target domain, and performs depth feature extraction on the video samples under the source domain and the target domain respectively to obtain the depth features of the video samples under the source domain and the target domain. The computer equipment further inputs the depth characteristics of the video samples under the source domain and the target domain into a domain adaptation trainer, and further performs multi-time scale characteristic extraction on the depth characteristics of the video samples under the source domain and the target domain through the domain adaptation trainer to respectively obtain the multi-time scale video characteristics under the source domain and the target domain.
It can be understood that the video features of multiple time scales are video features corresponding to video frames of various time scales.
Step S206, grouping and aligning the video features under the source domain and the target domain according to the time nodes and time scale weights corresponding to the video features.
It is understood that a time node of a video refers to a time point or a time period in the video, and specifically may refer to a time sequence position of one or more frames in the video in the whole video. The time node corresponding to the video feature is the time sequence position of the corresponding video frame in the corresponding video sample.
The time scale weight is positively correlated with the information quantity expressed by the corresponding video feature, that is, the more the information quantity expressed by the video feature is, the higher the time scale weight of the video feature is. It is understood that the time scale weights of video features refer to the weights of the video features themselves at various time scales, not the weights of the time scales themselves.
Video data is characterized by the introduction of a time dimension: the content of a video can be analyzed accurately only by combining information from preceding and following frames. Therefore, for a video recognition task, the video features need to be modeled as reasonably as possible in combination with the temporal information during domain adaptation training, and the modeled features need to be aligned efficiently.
After extracting the video features of multiple time scales under the source domain and the target domain respectively through the domain adaptation trainer, the computer device further performs grouping alignment on the video features under the source domain and the video features under the target domain through the domain adaptation trainer.
Specifically, the computer device may first perform preliminary alignment on the video sample in the source domain and the video sample in the target domain according to corresponding time nodes through the domain adaptation trainer, and then perform grouping alignment on the video features in the source domain and the video features in the target domain matching the time scale weights according to time scale weights corresponding to the video features in various time scales in the source domain and the target domain, so as to obtain multiple groups of aligned video features.
It is understood that each group of aligned video features includes both a video feature under the source domain and a video feature under the target domain. The time nodes of the source-domain and target-domain video features within a group may or may not correspond to each other.
Step S208, determining the adversarial loss according to the difference between the video features under the source domain and the target domain within the same group.
The adversarial loss refers to the difference in distribution between the video features under the source domain and the video features under the target domain within the same group.
After grouping and aligning the video features under the source domain and the target domain to obtain multiple groups of aligned video features, the computer device calculates, for each group, the difference between the video features under the source domain and the target domain within that group to obtain the corresponding adversarial loss.
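One common way to realize such a per-group adversarial loss (assumed here for illustration; the patent does not fix the exact formulation) is a small per-group domain discriminator trained with binary cross-entropy on domain labels, combined with a gradient reversal layer so that reducing the discriminator's accuracy also reduces the source/target distribution difference:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class GroupDomainDiscriminator(nn.Module):
    """One discriminator per feature group; predicts source (0) vs. target (1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feat):
        return self.net(GradReverse.apply(feat))

def group_adversarial_loss(discriminator, feat_src, feat_tgt):
    """Adversarial loss for one aligned group of source/target video features."""
    logits_src = discriminator(feat_src)
    logits_tgt = discriminator(feat_tgt)
    loss_src = F.binary_cross_entropy_with_logits(
        logits_src, torch.zeros_like(logits_src))
    loss_tgt = F.binary_cross_entropy_with_logits(
        logits_tgt, torch.ones_like(logits_tgt))
    return 0.5 * (loss_src + loss_tgt)
```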
Step S210, determining a category loss based on a difference between a prediction category of the video sample in the source domain and a corresponding sample label.
It can be understood that the prediction category refers to a category corresponding to an object in a video obtained by classifying the video. The prediction type of the video sample in the source domain is obtained by classifying based on the video characteristics of the video sample in the source domain.
Specifically, after extracting the multi-time-scale video features under the source domain and the target domain through the domain adaptation trainer, the computer device further classifies the video samples under the source domain based on their video features through the recognition model to be trained, so as to obtain the prediction categories of the video samples under the source domain.
Then, the computer device compares the prediction category of the video sample in the source domain with the corresponding sample label carried by the video sample in the source domain, and calculates the category loss between the prediction category of the video sample in the source domain and the corresponding sample label.
Step S212, adjusting the model parameters of the recognition model according to the adversarial loss and the category loss and continuing the adversarial training, until the training stop condition is met and training ends.
It can be understood that training the recognition model requires multiple rounds of iterative training, the current round being the round of model training currently in progress. In each round, the model parameters of the recognition model are adjusted so that the recognition model gradually converges, yielding the final recognition model.
The training stopping condition is that a condition for ending model training is satisfied, for example, the training stopping condition may be that a preset iteration number is reached, or that a classification performance index of the recognition model after the parameter is adjusted reaches a preset index.
Specifically, after the computer device obtains the adversarial loss between the video features under the source domain and the target domain within the same group and the category loss between the prediction category of the video sample under the source domain and the corresponding sample label, it adjusts the model parameters of the recognition model in the direction that reduces these losses according to the adversarial loss and the category loss.
When the current round does not satisfy the training stop condition, the computer device returns to the step of extracting the depth features of the video samples under the source domain and the target domain through the recognition model to be trained, so as to enter the next round. The computer device continues to perform multi-time-scale feature extraction and grouping on the depth features of another set of video samples under the source domain and the target domain, so as to carry out iterative adversarial training. When the training stop condition is met, the iterative training stops, and the trained recognition model is obtained.
The trained recognition model is a machine learning model that can extract, from a video under the target domain, domain-invariant video features aligned across the source domain and the target domain and accurately classify the video under the target domain; it can directly perform depth feature extraction and classification on a video under the target domain so as to accurately recognize the category of the object in the video.
In one embodiment, the difference between the prediction category of a video sample under the source domain and the corresponding sample label can be measured by a loss function; for example, a function such as cross entropy or mean squared error can be chosen as the loss function. In the process of iteratively training the recognition model, a back-propagation algorithm can be adopted to update the parameters in the direction of the descending gradient, adjusting the weights and biases so as to minimize the overall error, and the parameters of the recognition model are adjusted gradually to train it iteratively. For example, training may be terminated when the value of the loss function falls below a preset value, so as to obtain a recognition model capable of accurately and effectively classifying videos under the target domain.
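Putting the pieces together, a minimal training-loop sketch might look as follows; recognition_model.extract, recognition_model.classify, domain_trainer and the weighting factor lambda_adv are hypothetical names introduced only for this illustration, not APIs defined by the patent.

```python
import torch
import torch.nn.functional as F

def train(recognition_model, domain_trainer, optimizer,
          loader_src, loader_tgt, max_epochs=50, lambda_adv=1.0):
    """Illustrative adversarial training loop (not the patent's exact procedure)."""
    for epoch in range(max_epochs):                     # training-stop condition: epoch budget
        for (x_src, y_src), (x_tgt, _) in zip(loader_src, loader_tgt):
            # 1. depth features for source- and target-domain samples
            f_src = recognition_model.extract(x_src)
            f_tgt = recognition_model.extract(x_tgt)

            # 2. multi-time-scale features, grouped and aligned by the trainer;
            #    `groups` is a list of (source_features, target_features) pairs
            groups, v_src = domain_trainer(f_src, f_tgt)

            # 3. adversarial loss accumulated over the aligned groups
            adv_loss = sum(domain_trainer.adversarial_loss(g_s, g_t)
                           for g_s, g_t in groups) / max(len(groups), 1)

            # 4. category loss on the labelled source-domain samples only
            logits = recognition_model.classify(v_src)
            cls_loss = F.cross_entropy(logits, y_src)

            # 5. joint update in the direction that reduces both losses
            loss = cls_loss + lambda_adv * adv_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```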
In the conventional video domain adaptation approach, only the feature weight of a single frame is usually modeled; that is, only single-frame temporal features are recognized and assigned importance weights, and then the features of all time nodes are either aligned uniformly or aligned frame by frame in time-node order. Due to the complexity of video, the information expressed at different time nodes and different time scales, and its importance for the domain adaptation task, may differ across video features. Although the conventional approach improves the video domain adaptation task to a certain extent, it does not mine the temporal information in the video task sufficiently, so the accuracy of video recognition remains low.
In the scheme of this embodiment, the importance of different time nodes can be computed at different time scales; that is, the importance of the video features under the corresponding time nodes is genuinely computed according to the temporal order of the video. In addition, the scheme can assign larger weights to important time scales within the video features, so that by introducing the time dimension and combining time node information with time scale information, the content of the video is analyzed accurately and the distributions of the video features under the source domain and the target domain can be aligned more precisely.
In the video processing method, after the depth features of the video samples under the source domain and the target domain are extracted through the recognition model to be trained, multi-time-scale feature extraction is performed on the depth features through the domain adaptation trainer to obtain multi-time-scale video features under the source domain and the target domain respectively. The video features under the source domain and the target domain are then grouped and aligned according to the time nodes and time scale weights corresponding to the video features. Because the time scale weight is positively correlated with the amount of information expressed by the corresponding video feature, the extracted video features can be grouped according to the time nodes and time scale weights, and the features within each group can be subjected to domain-adversarial training separately, so that the distributions of the video features under the source domain and the target domain can be aligned more accurately. The model parameters of the recognition model are adjusted, and the adversarial training is continued, according to the adversarial loss between the video features of the source domain and the target domain within the same group and the category loss between the prediction category of the video sample under the source domain and the corresponding sample label, so that the distributions of the video features under the source domain and the target domain are aligned more accurately, the trained recognition model gains the ability to extract video features aligned across the source domain and the target domain, and the accuracy of classifying videos under the target domain can be effectively improved.
In one embodiment, the step of performing multi-time scale feature extraction on the depth features through a domain adaptation trainer to obtain multi-time scale video features under a source domain and a target domain respectively includes: respectively carrying out multi-time scale convolution processing on the depth features through a domain adaptation trainer to obtain convolution results corresponding to the depth features; and respectively obtaining the video characteristics of multiple time scales under the source domain and the target domain according to the time node weight corresponding to the depth characteristics and the corresponding convolution result.
It can be understood that, in the process of extracting features of a video, specifically, feature extraction may be performed on image frames in the video. The depth features of the video samples in the source domain and the target domain are respectively extracted through the recognition model to be trained, specifically, the depth features can be video features of depth at a frame level, that is, the depth features corresponding to each frame image in the video samples.
In image processing, a convolution operation applies a convolution kernel, i.e., a convolution template, to every pixel of an image: the kernel slides over the image, the gray values of the pixels covered by the kernel are multiplied by the corresponding kernel values, and the products are summed to give the gray value of the pixel corresponding to the kernel center, until the kernel has slid over the entire image.
It is understood that the temporal node weight refers to the weight of the video feature under the corresponding temporal node, that is, the weight of the video feature itself under the corresponding temporal node, and not the weight of the temporal node itself.
After extracting the depth features of the video samples under the source domain and the target domain through the recognition model to be trained, the computer device further inputs these depth features into the domain adaptation trainer, so that the domain adaptation trainer performs multi-time-scale convolution processing on the depth features under the source domain and the target domain respectively to obtain convolution results corresponding to the depth features.
Specifically, the domain adaptation trainer may first perform multi-scale modeling on the input depth features, using multi-layer dilated convolution to extract the multi-time-scale video features corresponding to the depth features under the source domain and the multi-time-scale video features corresponding to the depth features under the target domain respectively, so as to obtain the corresponding convolution results.
The computer device also determines time node weights of the depth features under the source domain and the target domain under corresponding time nodes respectively through the domain adaptation trainer. And then the computer equipment respectively obtains the video characteristics of multiple time scales under the source domain and the target domain according to the time node weight corresponding to the depth characteristic and the corresponding convolution result.
In this embodiment, the domain adaptation trainer is used to perform multi-time scale convolution processing on the depth features in the source domain and the target domain, and according to the time node weights corresponding to the depth features and the corresponding convolution results, the multi-time scale video features in the source domain and the target domain can be accurately and effectively extracted according to the time sequence information of the video sample.
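As a rough sketch of the multi-time-scale convolution processing, assuming 1-D dilated temporal convolutions stacked so that each layer covers a larger receptive field (i.e., one coarser time scale per layer); the layer count and dilation schedule are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Stack of dilated 1-D temporal convolutions over frame-level features.
    Each layer enlarges the temporal receptive field, so its output can be read
    as video features at a coarser time scale."""
    def __init__(self, dim: int, num_scales: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=2, dilation=2 ** s, padding=2 ** s)
            for s in range(num_scales)
        ])

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, frames, dim) -> Conv1d expects (batch, dim, frames)
        x = frame_feats.transpose(1, 2)
        scales = []
        for conv in self.layers:
            x = torch.relu(conv(x))[..., : frame_feats.size(1)]  # trim padding
            scales.append(x.transpose(1, 2))   # back to (batch, frames, dim)
        return scales                          # one feature map per time scale
```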
In one embodiment, the domain adaptation trainer performs the multi-time-scale convolution processing through convolutional layers; the domain adaptation trainer further comprises a time node attention layer; and the video processing method further comprises: respectively allocating, through the time node attention layer, corresponding time node weights to the time nodes corresponding to the depth features according to the amount of information expressed by the depth features under each time node.
It is understood that the network structure of the domain adaptation trainer includes a domain attention layer, and the domain attention layer further includes a convolutional layer and a time node attention layer. The convolution layer is used for performing multi-time scale convolution processing on the depth features under the source domain and the target domain respectively. The time node attention layer is used for allocating corresponding time node weight to the time node corresponding to the depth feature.
For the domain adaptation task of the video, the importance of the video characteristics of different time nodes in the video is different, and the accuracy of the domain adaptation can be effectively improved by aligning the time nodes with larger information amount.
Specifically, the depth features of the video samples in the source domain and the target domain extracted by the recognition model to be trained may be video features with a lower time scale. And then, the computer equipment inputs the depth characteristics of the video samples under the source domain and the target domain into a domain adaptation trainer, and performs multi-time scale convolution processing on the depth characteristics under the source domain and the target domain respectively according to the time sequence information corresponding to the video samples under the source domain and the target domain through convolution layers in the domain adaptation trainer to obtain corresponding convolution results.
Meanwhile, the computer equipment also allocates corresponding time node weights to the time nodes corresponding to the depth features respectively according to the information quantity expressed by the depth features under each time node through a time node attention layer in the domain adaptation trainer. Specifically, the time node weight is positively correlated with the information amount expressed by the depth feature under the corresponding time node, that is, the more the information amount expressed by the depth feature under each time node is, the higher the time node weight is allocated to the depth feature under the corresponding time node.
And then, the computer equipment fuses the time node weight corresponding to the depth feature and the corresponding convolution result to respectively obtain the video features of multiple time scales under the source domain and the target domain. The video features of multiple time scales under the source domain and the target domain extracted by the domain adaptation trainer are video features with higher time scales.
FIG. 3 is a block diagram that illustrates the structure of a domain adaptation trainer in one embodiment. Referring to fig. 3, the domain adaptation trainer includes a domain attention layer, a time scale attention layer and a temporal domain training module. The domain attention layer may be a domain-attention dilated residual layer, which further includes a time node attention layer and multi-layer convolutional layers. The convolutional layers may be multi-layer dilated convolutional layers used to perform multi-time-scale feature extraction on the input depth features. The time node attention layer is used to allocate corresponding time node weights to the time nodes corresponding to the depth features under the source domain and the target domain; after the multi-layer dilated convolutional layers perform multi-time-scale convolution processing on the depth features, the multi-time-scale video features under the source domain and the target domain are obtained respectively according to the time node weights corresponding to the depth features and the corresponding convolution results. The time scale attention layer is used to determine the time scale weights corresponding to the multi-time-scale video features. The temporal domain training module is used to group the video features under the source domain and the target domain according to the time nodes and time scale weights, so that domain-adversarial training can be performed on the features within each group separately.
FIG. 4 is a block diagram of a convolutional layer in a domain adaptation trainer in one embodiment. The convolutional layers in the domain adaptation trainer comprise a time node attention layer and multi-layer dilated convolutional layers. The time node attention layer may further include a domain discriminator and a domain attention pooling layer. The domain discriminator judges whether an input depth feature comes from a video sample under the source domain or a video sample under the target domain, and the entropy of the domain discriminator's output is used as a measure of the amount of information expressed by the depth feature.
It can be understood that the larger the entropy of the domain discriminator's output, the harder it is for the discriminator to tell which domain the depth feature of the input video sample comes from, which means the input depth feature contains more domain information and expresses a larger amount of information.
The domain attention pooling layer in the time node attention layer allocates corresponding time node weights to the time nodes corresponding to the depth features through an entropy-based domain attention mechanism, so that depth features under time nodes with a large amount of information are given greater weight.
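A minimal sketch of such entropy-based time node attention is given below; the "1 + entropy" weighting and the discriminator architecture are assumptions made for illustration, not a formulation stated in the patent.

```python
import torch
import torch.nn as nn

class TimeNodeAttention(nn.Module):
    """Assigns each time node a weight that grows with the entropy of a
    frame-level domain discriminator, i.e. with how domain-ambiguous
    (information-rich for adaptation) the node's feature is."""
    def __init__(self, dim: int):
        super().__init__()
        self.domain_discriminator = nn.Sequential(
            nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time_nodes, dim)
        p = torch.sigmoid(self.domain_discriminator(feats)).clamp(1e-6, 1 - 1e-6)
        entropy = -(p * p.log() + (1 - p) * (1 - p).log())   # binary entropy per node
        weights = 1.0 + entropy                               # assumed weighting scheme
        return weights * feats                                # node-weighted features
```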
The multi-layer dilated convolutional layers of the convolutional layer in the domain adaptation trainer may specifically be a convolutional network, for example a Conv convolutional network. For example, the Conv convolutional network may specifically include a dilated convolutional layer with a dilation rate of 2 and a convolution kernel size of 2, a ReLU activation function, and a 1 × 1 bottleneck convolutional layer.
For example, if the video features input to the convolutional layer of the domain adaptation trainer are f_{s-1,t}, f_{s-1,t+1}, f_{s-1,t+2}, f_{s-1,t+3} and f_{s-1,t+4}, the time node attention layer determines the time node weight of the video feature under each time node, the multi-layer dilated convolutional layers perform the convolution operation on the input video features, and the convolution result is added to the node-weighted video features to obtain the multi-time-scale video feature f_{s,t}.
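A sketch of one such block is given below; the channel sizes and the exact way the node weights enter the residual connection are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Dilated conv (rate 2, kernel 2) -> ReLU -> 1x1 bottleneck; the result is
    added to the node-attention-weighted input, producing the next-scale feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.dilated = nn.Conv1d(dim, dim, kernel_size=2, dilation=2, padding=2)
        self.relu = nn.ReLU()
        self.bottleneck = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, f_prev: torch.Tensor, node_weights: torch.Tensor) -> torch.Tensor:
        # f_prev: (batch, dim, time); node_weights: (batch, 1, time)
        conv_out = self.bottleneck(self.relu(self.dilated(f_prev)))
        conv_out = conv_out[..., : f_prev.size(-1)]       # trim extra padded steps
        return conv_out + node_weights * f_prev           # f_{s,t}
```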
In this embodiment, the domain adaptation trainer allocates corresponding time node weights to the time nodes corresponding to the depth features according to the information amount expressed by the depth features under each time node, and then fuses the time node weights corresponding to the depth features and the corresponding convolution results to obtain the video features of multiple time scales under the source domain and the target domain, so that the video features of multiple time scales rich in more time sequence information and higher in time scale can be extracted more accurately.
In one embodiment, before aligning the video feature groups in the source domain and the target domain according to the time nodes and the time scale weights corresponding to the video features, the video processing method further includes: determining information entropy corresponding to video features of each time scale under a source domain and a target domain through a time scale attention layer of a domain adaptation trainer; and respectively distributing corresponding time scale weights to the video features of each time scale according to the information entropy.
Information entropy is a measure of the expected amount of information, i.e., the uncertainty, of a distribution. The information entropy of a video feature is a measure of the amount of information the feature expresses: the larger the entropy, the more information the corresponding video feature expresses. The information entropy may specifically be a cross entropy, which can be used to measure the difference between two probability distributions.
The domain adaptation trainer comprises a time scale attention layer. The computer equipment carries out multi-time scale feature extraction on the depth features through a domain adaptation trainer, and further respectively calculates information entropies corresponding to the video features of all time scales under the source domain and the target domain through a time scale attention layer after respectively obtaining the video features of all time scales under the source domain and the target domain.
Specifically, the computer device may calculate a cross entropy between a probability that the current video feature belongs to the source domain and a probability that the current video feature belongs to the target domain through the domain adaptation trainer, where a larger cross entropy indicates a larger amount of information expressed by the video feature, and vice versa. The computer device further assigns corresponding time scale weights to the video features according to the cross entropy, wherein the larger the cross entropy is, the higher the time scale weights are assigned to the video features, and the lower the time scale weights are assigned to the video features otherwise.
In this embodiment, different importance weights are assigned to video features of different time scales by determining the cross entropy of the video features and using the cross entropy as a measurement standard of the information amount expressed by the video features, so that the value of the video features of each time scale can be accurately measured, and matched importance weights can be accurately assigned to the video features of each time scale.
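A minimal sketch of the entropy-driven time scale weighting described above is shown below; the softmax normalisation over scales and the head architecture are assumptions added for the sketch.

```python
import torch
import torch.nn as nn
from typing import List

class TimeScaleAttention(nn.Module):
    """Weights the video features of each time scale by the entropy of a
    scale-level domain prediction: more domain-ambiguous scales carry more
    transferable information and receive larger weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.domain_head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, scale_feats: List[torch.Tensor]):
        # scale_feats: list of (batch, dim) pooled features, one per time scale
        stacked = torch.stack(scale_feats, dim=1)                  # (batch, scales, dim)
        p = torch.sigmoid(self.domain_head(stacked)).clamp(1e-6, 1 - 1e-6)
        entropy = -(p * p.log() + (1 - p) * (1 - p).log())         # (batch, scales, 1)
        weights = torch.softmax(entropy, dim=1)                    # assumed normalisation
        return weights, weights * stacked      # per-scale weights and weighted features
```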
In one embodiment, the step of grouping and aligning the video features under the source domain and the target domain according to the time nodes and the time scale weights corresponding to the video features comprises: determining video features to be aligned under a source domain and a target domain according to time nodes and time scale weights corresponding to the video features through a domain adaptation trainer; and dividing the video features to be aligned into one group to obtain a plurality of groups of aligned video features.
It is to be understood that the grouping alignment means that the video features in the source domain and the video features in the target domain are respectively grouped and aligned in the source domain and the target domain, so that the video features in each group after alignment include the video features in the source domain and the video features in the target domain at the corresponding time scale.
When the computer equipment aligns the video features under the source domain and the target domain in a grouping mode through the domain adaptation trainer, the domain adaptation trainer aligns the depth features of the video samples under the source domain and the target domain on the whole according to the time node of the video sample under the source domain and the time node of the video sample under the target domain.
The domain adaptation trainer further extracts the multi-time scale features of the depth features, and respectively determines time scale weights corresponding to the multi-time scale video features of the source domain and the target domain after the multi-time scale video features of the source domain and the target domain are respectively obtained.
And then, the domain adaptation trainer respectively determines the video features to be aligned in the source domain and the target domain according to the time nodes and the time scale weights corresponding to the video features in the source domain and the target domain. Specifically, the computer device may determine, according to the same time scale, video features corresponding to video frames or video segments of the same time scale, where time nodes and time scale weights under the source domain and the target domain are matched, as video features to be aligned under the source domain and the target domain. The time node matching may include at least one of a weight of the time node, a sequence of the time node, a length of the time node, and the like.
In this embodiment, the video features in the source domain and the target domain are grouped and aligned according to the time nodes and the time scale weights, so that individual domain confrontation training can be performed on the video features in the source domain and the target domain in each group more accurately.
In one embodiment, the step of determining the video features to be aligned in the source domain and the target domain according to the time nodes and the time scale weights corresponding to the video features by using a domain adaptation trainer comprises: determining time node weights of video features under a source domain and a target domain under corresponding time nodes; and determining video features matched with time node weights and time scale weights in different domains from the video features in the source domain and the target domain to serve as video features to be aligned in the source domain and the target domain.
The video features in different domains refer to video features in a source domain and video features in a target domain.
When the computer equipment aligns the video features under the source domain and the target domain in a grouping way through the domain adaptive trainer, the computer equipment further allocates corresponding time node weights to the time nodes corresponding to the depth features respectively according to the information quantity expressed by the depth features under all the time nodes. And the domain adaptation trainer performs multi-time scale feature extraction on the depth features, and respectively determines time scale weights corresponding to the video features of each time scale in the source domain and the target domain after respectively obtaining the video features of the multiple time scales in the source domain and the target domain.
And the domain adaptation trainer further determines the video features corresponding to the video frames or video clips with the same time scale, which are matched with the time node weights and the time scale weights under the source domain and the target domain, as the video features to be aligned under the source domain and the target domain according to the same time scale.
For example, suppose the frame sequence of the video sample in the source domain is {s1, s2, s3, ..., s20} and the frame sequence of the video sample in the target domain is {t1, t2, t3, ..., t20}. The time nodes of the video sample in the source domain then include 1-20, and the time nodes of the video sample in the target domain also include 1-20. If the time node weight and the time scale weight of the video features corresponding to the video clip with frame sequence {s5-s10} in the source-domain video sample match the time node weight and the time scale weight of the video features corresponding to the video clip with frame sequence {t15-t20} in the target-domain video sample, the video features corresponding to the clip {s5-s10} and the video features corresponding to the clip {t15-t20} are determined to be a group of video features to be aligned, and the video features corresponding to the two video clips are divided into one group.
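The grouping in the example above can be sketched roughly as follows; the matching criterion (weights equal within a tolerance) and the data layout are assumptions made purely for illustration and are not prescribed by the filing.

def group_aligned_features(src_feats, tgt_feats, tol=0.1):
    """Pair source/target features of the same time scale whose time node
    weight and time scale weight are close, as a simple stand-in for the
    matching criterion described above.

    src_feats / tgt_feats: lists of dicts like
        {"scale": 2, "node_weight": 0.7, "scale_weight": 0.4, "feat": tensor}
    Returns a list of (source_feature, target_feature) groups.
    """
    groups = []
    for s in src_feats:
        for t in tgt_feats:
            same_scale = s["scale"] == t["scale"]
            close_node = abs(s["node_weight"] - t["node_weight"]) < tol
            close_scale = abs(s["scale_weight"] - t["scale_weight"]) < tol
            if same_scale and close_node and close_scale:
                groups.append((s["feat"], t["feat"]))
                break  # one partner per source feature in this sketch
    return groups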
In the embodiment, the video features under the source domain and the target domain are grouped and aligned according to the time node weight and the time scale weight, so that the video features with the matched importance degrees under the source domain and the target domain can be grouped and aligned more accurately, and the grouped matching degree and the anti-training precision can be effectively improved.
In one embodiment, the step of extracting the depth features of the video samples in the source domain and the target domain respectively through the recognition model to be trained includes: respectively extracting initial features of the video samples in the source domain and the target domain through an initial feature extractor in the recognition model to be trained; and respectively performing feature extraction on the initial features of the video samples in the source domain and the target domain through a target feature extractor in the recognition model to obtain the depth features of the video samples in the source domain and the target domain.
It is understood that the recognition model to be trained includes an initial feature extractor and a target feature extractor. The initial feature extractor is used for performing initial feature extraction on the video samples in the source domain and the target domain. And the target feature extractor is used for extracting depth features according to the video samples in the source domain and the target domain.
After the computer equipment acquires the video samples under the source domain and the target domain, the video samples under the source domain and the target domain are input into the recognition model to be trained. Firstly, performing preliminary feature extraction on video samples in a source domain and a target domain respectively through an initial feature extractor in a recognition model. Specifically, the initial feature extractor may perform shallow learning to perform key point detection on each image frame in the video sample to extract key point information, and then obtain initial features corresponding to each image frame according to the key point information to extract initial features of the video sample in the source domain and the target domain, respectively.
And then inputting the initial characteristics of the video samples in the source domain and the target domain into a target characteristic extractor in the recognition model, and performing depth characteristic extraction on the initial characteristics of the video samples in the source domain and the target domain through the target characteristic extractor respectively. Specifically, the depth feature expression of the initial features can be obtained after the initial features are subjected to operations such as convolution, pooling and back propagation through the target feature extractor, and the original video samples can be identified and classified based on the depth feature expression.
It will be appreciated that the initial feature extractor in the recognition model may be a trained feature extractor and the target feature extractor in the recognition model may be a feature extractor to be trained. In the process of performing countermeasure training on video samples in a source domain and a target domain, model parameters of a target feature extractor are continuously adjusted, so that the target feature extractor has the capability of extracting video features which are irrelevant to the field and are aligned in the source domain and the target domain from a video to be processed in the target domain.
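A minimal sketch of such a two-stage extractor follows, assuming a ResNet-50 backbone from a recent torchvision release as the fixed initial extractor and a small fully connected head as the trainable target extractor; both choices are assumptions for illustration, not the architecture prescribed by the filing.

import torch
import torch.nn as nn
import torchvision.models as models

class TwoStageExtractor(nn.Module):
    """Fixed pre-trained backbone followed by a trainable feature head."""

    def __init__(self, feat_dim=256):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        # drop the final fc layer; keep conv stages + global average pooling
        self.initial = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.initial.parameters():
            p.requires_grad = False  # the initial extractor stays fixed
        self.target = nn.Sequential(  # trainable target feature extractor
            nn.Flatten(),
            nn.Linear(2048, feat_dim),
            nn.ReLU(),
        )

    def forward(self, frames):          # frames: (batch*frames, 3, H, W)
        with torch.no_grad():
            x = self.initial(frames)    # frozen initial features
        return self.target(x)           # trainable depth features

Only the parameters of self.target (and, later, the classifier) would be adjusted during the adversarial training described above.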
In an embodiment, as shown in fig. 5, another video processing method is provided, which specifically includes the following steps:
step S502, the initial characteristics of the video samples in the source domain and the target domain are respectively extracted through the initial characteristic extractor in the recognition model to be trained.
Step S504, respectively extracting the initial features of the video samples in the source domain and the target domain through the target feature extractor in the identification model to obtain the depth features of the video samples in the source domain and the target domain.
And S506, performing multi-time scale feature extraction on the depth features through the domain adaptation trainer to respectively obtain multi-time scale video features under the source domain and the target domain.
And step S508, aligning the video feature groups under the source domain and the target domain according to the time nodes and the time scale weights corresponding to the video features.
Step S510, determining the countermeasure loss according to the difference between the video features in the source domain and the target domain in the same group.
And S512, classifying the video sample based on the video characteristics of the video sample in the source domain through a classifier of the identification model to obtain the prediction category of the video sample in the source domain.
Step S514, based on the difference between the prediction category of the video sample in the source domain and the corresponding sample label, determines the category loss.
And step S516, according to the countermeasure loss and the category loss, adjusting model parameters of a target feature extractor and a classifier in the recognition model and continuing countermeasure training so that the difference between the video features of the source domain and the target domain in the same group is reduced in the iterative training process of the recognition model, and the training is finished until the training stopping condition is met.
It is understood that the network structure of the recognition model to be trained includes an initial feature extractor, a target feature extractor, and a classifier. The classifier is used for classifying the video samples according to the video characteristics of the video samples.
The computer equipment respectively extracts the depth features under the source domain and the target domain through a domain adaptation trainer to obtain the video features under the source domain and the target domain in multiple time scales, aligns the video features under the source domain and the target domain in groups, and calculates the confrontation loss between the video features under the source domain and the target domain in the same group. The computer equipment further classifies the video samples under the source domain according to the video characteristics of the video samples under the source domain through a classifier in the recognition model to be trained to obtain the prediction category of the video samples under the source domain.
In one embodiment, the computer device classifies the video samples in the source domain according to the depth features corresponding to the video samples in the source domain extracted by the target feature extractor through a classifier of the recognition model, so as to obtain the prediction category of the video samples in the source domain.
In another embodiment, the computer device may further classify, by the classifier of the recognition model, the video sample in the source domain according to the multi-time scale corresponding to the video sample in the source domain extracted by the domain adaptation trainer, so as to obtain the prediction category of the video sample in the source domain.
The computer device then determines a class loss based on the difference between the predicted class of the video sample at the source domain and the corresponding sample label. And further adjusting the model parameters of the target feature extractor in the recognition model according to the confrontation loss and the class loss, and adjusting the model parameters of the classifier according to the class loss, and continuing the confrontation training in the direction that the difference between the video features under the source domain and the target domain in the same group is reduced and the difference between the predicted class of the video sample under the source domain and the corresponding sample label is reduced. In the iterative training process of the recognition model, the difference between the video characteristics of the source domain and the target domain in the same group is reduced, and the training is finished until the training stopping condition is met, so that the trained recognition model can classify the videos in the target domain more accurately.
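The category loss described here is, in essence, a supervised classification loss on the labelled source-domain samples; a minimal sketch, assuming a standard cross-entropy formulation (the filing does not fix the exact form), might look like this:

import torch.nn.functional as F

def class_loss(classifier, src_video_feats, src_labels):
    """Category loss on labelled source-domain videos: cross entropy between
    the classifier's predictions and the sample labels."""
    logits = classifier(src_video_feats)   # (batch, num_classes)
    return F.cross_entropy(logits, src_labels)

In the training loop this loss would be combined with the confrontation loss to update the target feature extractor, while the classifier is updated with the category loss alone, as described above.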
In one embodiment, the video processing method further includes: a cross entropy loss is determined based on a difference between a prediction category of the video sample in the target domain and a prediction category of the video sample in the source domain.
According to the countermeasure loss and the category loss, adjusting model parameters of the recognition model and continuing the countermeasure training until the training stopping condition is met, and the method comprises the following steps: and adjusting the model parameters of the recognition model according to the confrontation loss, the category loss and the cross entropy loss, and continuing the confrontation training so as to reduce the difference between the video characteristics of the source domain and the target domain in the same group in the iterative training process of the recognition model until the training stopping condition is met.
The cross entropy loss can be used for measuring the similarity of the class probability distribution corresponding to the video samples in the source domain and the target domain, and can be further used for capturing the difference between the classification prediction effect on the video samples in the source domain and the classification prediction effect on the video samples in the target domain.
After the computer equipment respectively extracts the depth features corresponding to the video samples in the source domain and the target domain through the target feature extractor of the recognition model, the multi-time scale feature extraction is respectively carried out on the depth features in the source domain and the target domain through the domain adaptation trainer, and the multi-time scale video features in the source domain and the target domain are obtained.
And the computer equipment further classifies the video samples under the source domain according to the video characteristics of the video samples under the source domain through the classifier in the recognition model to be trained to obtain the prediction categories of the video samples under the source domain. And classifying the video samples under the target domain according to the video characteristics of the video samples under the target domain to obtain the prediction category of the video samples under the target domain. And the computer equipment calculates the difference between the prediction category of the video sample under the target domain and the prediction category of the video sample under the source domain to obtain the corresponding cross entropy loss.
Then, the computer device adjusts model parameters of a target feature extractor in the recognition model according to the confrontation loss, the category loss and the cross entropy loss, adjusts model parameters of a classifier in the recognition model according to the category loss and the cross entropy loss, and continues the confrontation training in the direction that the difference between the video features under the source domain and the target domain in the same group is reduced and the difference between the predicted category of the video sample under the source domain and the corresponding sample label is reduced. And reducing the difference between the video characteristics of the source domain and the target domain in the same group in the iterative training process of the recognition model until the training stop condition is met, and finishing the training.
In a specific embodiment, as shown in fig. 6, an overall architecture diagram of the recognition model during training in one embodiment is shown. Referring to fig. 6, the overall architecture of the recognition model to be trained includes a fixed pre-trained feature extractor, a trainable feature extractor, a domain adaptation module, and a behavior recognition module.
The fixed pre-trained feature extractor, that is, the initial feature extractor, may be a fixed, already-trained backbone network in the recognition model, and may specifically be a feature extraction network constructed based on networks such as ResNet101 (a deep residual convolutional network), ResNet50, Inception-V3 and AlexNet. In addition to a 2D network, a 3D network such as an I3D network may also be used as the pre-trained feature extraction network for feature extraction.
A trainable feature extractor, i.e., the target feature extractor, is a trainable feature extraction network provided in the recognition model and used for extracting the depth features of the video samples under the source domain and the target domain. The target feature extractor may be formed by a simple fully connected layer, or may be a feature extraction network constructed based on networks such as DenseNet or CNN (Convolutional Neural Networks).
The behavior recognition module, namely the classifier, is a classification network to be trained that is provided in the recognition model; the classification network may be composed of a pooling layer and a multilayer perceptron. For example, a simple pooling layer combined with a multilayer perceptron may be adopted; other network structures, such as a Temporal Relation Network combined with a multilayer perceptron, may also be used as the classifier, or a classifier may be constructed based on an ImageNet (image recognition) network, and the like.
The domain adaptive module, namely the domain adaptive trainer, is a non-fixed domain adaptive trainer in the recognition model to be trained, and is used for transferring the knowledge learned from the video in the source domain, namely aligning the distribution of the video features in the source domain and the target domain extracted by the feature extractor.
Specifically, the computer device may first perform preliminary single-frame feature extraction on the video samples in the source domain and the target domain through a pre-trained feature extractor, and then perform depth feature extraction on the input single-frame initial features through a trainable feature extractor, so as to obtain corresponding depth features. And then, inputting the depth characteristics corresponding to the video samples in the source domain and the target domain into a behavior identification module and a domain adaptation module respectively. And classifying the behaviors of the objects in the video according to the video characteristics corresponding to the video samples in the source domain and the target domain through a behavior identification module. And simultaneously, transferring the knowledge learned from the video samples in the source domain through a domain adaptation module, namely aligning the distribution of the video features of multiple time scales in the source domain and the target domain. Therefore, the recognition model trained on the source domain can accurately perform behavior recognition on the video under the target domain.
FIG. 7 is a diagram illustrating the overall architecture of the recognition model during training in one embodiment. Referring to the figure, the overall architecture of the recognition model to be trained includes a fixed pre-trained feature extractor, a trainable feature extractor F, a domain adaptation module and a classification module C. A specific network structure of the domain adaptation module is shown in fig. 7. Referring to fig. 7, the domain adaptation module includes a domain discriminator, a gradient blocking layer, a domain attention hole residual layer, and a gradient inversion layer.
The training data includes a source domain data set and a target domain data set, each of which may be any data set, provided that the sample label categories of the source domain data set and the target domain data set are consistent. The source domain data set includes the video samples under the source domain, and the target domain data set includes the video samples under the target domain.
For example, when the time scale level is s, the video feature at time node t is denoted as f_{s,t}. Taking the feature extracted by the feature extractor as the reference and regarding it as time scale 1, the feature extracted by the feature extractor can be expressed as f_{1,t}, and f_{1,t} is used as the input to the domain adaptation module.
The domain discriminator in the domain adaptation module is used to judge whether an input feature comes from a source domain sample or a target domain sample. The domain discriminator corresponding to the video feature f_{s,t} is denoted D_{s,t}. The larger the entropy of the domain discriminator's output, the harder it is for the discriminator to distinguish which domain the input video sample comes from, which indicates that the input video sample carries a larger amount of information. The domain attention hole residual layer H_{s,t} then performs multi-time scale feature extraction on the input depth features to obtain the multi-time scale video features under the source domain and the target domain respectively. The formula by which the domain attention hole residual layer H_{s,t} generates the video features f_{s,t} at multiple time scales may be:
f_{s,t} = H_{s,t}(f_{s-1,t}, f_{s-1,t+d}, f_{s-1,t+2d})    (1)
[Expression (2): formula image in the original filing, not reproduced here. Per the description below, it obtains f_{s,t} by adding the output of the domain attention pooling, weighted by the cross entropy of the corresponding domain discriminator outputs, as a residual to the result of the convolution over the input features.]
where s represents the time scale, t represents the time node, f_{s,t} represents the video feature at time node t at time scale level s, E denotes the cross entropy, H_{s,t} represents the domain attention hole residual layer, d represents the hole rate (dilation rate) in the hole convolution, D_{s,ti} is the domain discriminator at time scale s and time node ti, and D_{s,ti}(f_{s,t}) is the output of that domain discriminator for the feature, i.e., the probability that the feature belongs to the source domain. The activation function ReLU is contained in the convolution layer Conv(f_{s,t1}, f_{s,t2}, f_{s,t3}). By directly adding the output of the domain attention pooling as a residual to the result of the convolution, the model is made to focus on the time nodes that carry a large amount of information, and model training is stabilized. Accordingly, when video features of different scales are extracted, the hole rate is gradually increased, so that high-level features can be extracted quickly and the number of network parameters is reduced.
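A rough sketch of such a domain attention hole residual layer is given below. A 1D convolution with kernel size 3 and dilation d touches exactly the time nodes t, t+d and t+2d of expression (1); the way the attention term enters the residual, and the layer sizes, are assumptions, since expression (2) is only available as an image in the filing.

import torch
import torch.nn as nn

class DomainAttentionHoleResidual(nn.Module):
    """Sketch: features at three time nodes spaced by the hole (dilation)
    rate d are convolved, and an attention-weighted copy of the inputs is
    added back as a residual to stabilize training."""

    def __init__(self, feat_dim, dilation):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.ReLU(),
        )

    def forward(self, feats, attn):
        # feats: (batch, feat_dim, T) lower-scale features over time nodes
        # attn:  (batch, 1, T) attention derived from discriminator entropy
        conv_out = self.conv(feats)
        residual = feats * attn        # domain-attention weighted input
        return conv_out + residual     # residual keeps informative nodes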
The part of the domain adaptation module consisting of the domain discriminator and the gradient blocking layer is the time scale attention model N_s. The gradient blocking layer is used to block gradients in the back propagation stage of the network training, so that the time scale attention model N_s can be trained independently of the other parts of the network, avoiding any influence on the domain adaptation trainer caused by training the time scale attention model N_s. The time scale attention model N_s calculates the time scale weight corresponding to the video features at each time scale.
The specific expression may be: w' s =E(N s (pooling([f s,1 ,f s,2 ,...,f s,ts ]))) (3);
where w'_s represents the original importance of the time scale s, pooling represents the pooling operation over the features of all time nodes at time scale s, and E represents the cross entropy computation.
The relative importance of the time scale s can then be defined as:
[Expression (4): formula image in the original filing, not reproduced here; it normalizes the original importance w'_s over all time scales to obtain the relative importance w_s.]

where i ranges over the time scales, w'_i is the original importance measure of time scale i and is used in computing the relative importance, and w_s represents the relative importance of the video features at time scale s.
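The time scale attention computation of expressions (3) and (4) might be sketched as follows; the softmax normalization stands in for expression (4), which is only available as an image, and the output format of the per-scale attention networks is an assumption.

import torch
import torch.nn.functional as F

def timescale_weights(scale_feats, attn_nets, eps=1e-8):
    """Pool the features of all time nodes at each scale s, score them with
    that scale's attention model N_s, take the entropy of the result as the
    raw importance w'_s, and normalize across scales to get w_s."""
    raw = []
    for feats, net in zip(scale_feats, attn_nets):   # one entry per scale s
        pooled = feats.mean(dim=1)                   # pool time nodes: (B, C)
        p = torch.sigmoid(net(pooled))               # P(source) per sample
        ent = -(p * (p + eps).log() + (1 - p) * ((1 - p) + eps).log())
        raw.append(ent.mean())                       # raw importance w'_s
    return F.softmax(torch.stack(raw), dim=0)        # relative weights w_s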
In order to avoid negative effects caused by alignment of video features under different time nodes, the video features under different time nodes are subjected to grouping alignment through a domain adaptation trainer so as to carry out individual alignment.
Aligning the video features under the source domain and the target domain, wherein the expression may be:
[Expression (5): formula image in the original filing, not reproduced here; it is the min-max adversarial objective, over the trainable feature extractor F, the domain discriminators D, the time scale attention model N and the domain attention hole residual layers H, defined on the confrontation loss L_adv.]

where F is the trainable backbone feature extractor, D represents all domain discriminators, N represents the time scale attention model, H represents the domain attention hole residual layer, and L_adv represents the confrontation loss between the video features under the source domain and the target domain.
And grouping the preliminarily aligned video features according to time nodes and time scales through a domain adaptation trainer, and performing independent domain confrontation training in each group of features.
Wherein the penalty function L is resisted adv May be:
Figure BDA0003210392910000243
where D_src represents the source domain data set and D_tar represents the target domain data set, x represents the video data in the data set, and y represents the sample label corresponding to the video data, which may specifically be a behavior recognition classification label, for example.
In order to train the recognition model in an end-to-end manner, a gradient reversal layer is added after the target feature extractor F to invert the gradient during back propagation, thereby achieving the adversarial training effect of expression (5). Alongside the domain adaptation adversarial training, the target feature extractor F also needs to be trained together with the classifier so that the video samples under the source domain can be correctly classified.
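The gradient reversal layer mentioned here is a standard construction; a common PyTorch-style implementation (an assumption as to the exact form used in the filing) is:

import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, sign-flipped
    (and optionally scaled) gradient in the backward pass, which turns the
    discriminator's objective into adversarial training of the extractor."""

    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)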
Specifically, the expression of the classification loss function L_cls may be:

[Expression (7): formula image in the original filing, not reproduced here; it is the classification loss computed on the labelled source-domain video samples.]
the final training goal may be:
[Expression (8): formula image in the original filing, not reproduced here; it combines the classification loss L_cls, the confrontation loss L_adv and the cross entropy loss L_ce into the overall training objective.]
where L_cls is the classification loss function, B represents the pre-trained initial feature extractor, C represents the classifier, and L_ce is a cross entropy loss function used to determine the cross entropy loss between the prediction category of the video samples in the target domain and the prediction category of the video samples in the source domain.
In the training process, according to the confrontation loss, the category loss and the cross entropy loss, the model parameters are adjusted towards the direction of reducing the difference between the video characteristics of the source domain and the target domain in the same group and the direction of reducing the difference between the prediction category of the video sample under the source domain and the corresponding sample label, and iterative confrontation training is continued, so that the video characteristic distribution under the source domain and the target domain can be aligned more accurately, the recognition model has the capability of extracting the video characteristics aligned under the source domain and the target domain, and the accuracy of video classification can be effectively improved.
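A hedged sketch of one such training step is given below; the loss weights, the batch-level form of the cross entropy term and the module interfaces are all assumptions, since the filing only states which losses drive the parameter adjustment, not how they are weighted or batched.

import torch

def train_step(batch_src, batch_tgt, labels_src, extractor, adapter,
               classifier, optimizer, w_adv=1.0, w_ce=0.1):
    """One illustrative adversarial training step (not the filing's exact
    procedure)."""
    f_src = extractor(batch_src)            # depth features, source domain
    f_tgt = extractor(batch_tgt)            # depth features, target domain

    # confrontation loss over grouped multi-time-scale features; the
    # gradient reversal layer inside `adapter` flips its gradient
    loss_adv = adapter(f_src, f_tgt)

    logits_src = classifier(f_src)
    logits_tgt = classifier(f_tgt)
    loss_cls = torch.nn.functional.cross_entropy(logits_src, labels_src)

    # cross entropy between the two prediction distributions (assumption:
    # compared at the level of batch-averaged class probabilities)
    p_src = logits_src.softmax(dim=1).mean(dim=0)
    p_tgt = logits_tgt.softmax(dim=1).mean(dim=0)
    loss_ce = -(p_src.detach() * (p_tgt + 1e-8).log()).sum()

    loss = loss_cls + w_adv * loss_adv + w_ce * loss_ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()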
In one embodiment, after the training is finished until the training stop condition is satisfied, the video processing method further includes a video processing step, where the video processing step specifically includes the following: performing depth feature extraction on a video to be processed under a target domain through a trained recognition model to obtain target video features aligned under a source domain and the target domain; and classifying, identifying and processing the video to be processed based on the characteristics of the target video through the identification model.
It can be understood that the trained recognition model has the capability of extracting the domain-independent video features aligned under the source domain and the target domain from the video under the target domain, and accurately classifying the video under the target domain. The video to be processed is the video to be classified.
After the trained recognition model is obtained, the computer equipment can utilize the recognition model to perform classification recognition on the video in the target domain. Specifically, after the computer device obtains the video to be processed in the target domain, the video to be processed is input into the recognition model, and the recognition model firstly carries out depth feature extraction on the video to be processed.
The trained recognition model extracts the video features which are irrelevant to the field and aligned under the source domain and the target domain from the video under the target domain, so that the target video features which are irrelevant to the field and aligned under the source domain and the target domain can be directly extracted from the video to be processed by extracting the depth features of the video to be processed through the recognition model.
And then the recognition model can classify, recognize and process the video to be processed according to the characteristics of the target video, so that the category of the object in the video to be processed can be accurately recognized.
In the embodiment, the trained recognition model is used for recognizing the video in the target domain, so that the target video features which are irrelevant to the field and aligned in the source domain and the target domain can be accurately extracted from the video to be processed, the video to be processed in the target domain can be accurately classified and recognized based on the target video features, and the accuracy of the classification and recognition of the video in the target domain can be effectively improved.
In one embodiment, as shown in fig. 8, after the training is finished until the training stop condition is satisfied, the video processing method further includes a video recognition processing step, where the video recognition processing step specifically includes the following steps:
and S802, performing primary feature extraction on the video to be processed through the trained recognition model to obtain initial features of the video to be processed.
Step S804, based on the time node and the time scale weight corresponding to the initial feature, depth feature extraction is carried out on the initial feature, and target video features aligned under a source domain and a target domain are obtained.
And step S806, classifying, identifying and processing the video to be processed based on the target video characteristics through the identification model.
It is understood that the trained recognition model includes an initial feature extractor, a target feature extractor, and a classifier. The target feature extractor is trained to extract the video features which are irrelevant to the field and aligned under the source domain and the target domain from the video under the target domain, and the classifier is trained to accurately classify the video under the source domain.
Specifically, the computer equipment inputs a video to be processed into an identification model, and the identification model firstly carries out primary feature extraction on the video to be processed through an initial feature extractor to obtain initial features of the video to be processed. The initial features are then further subjected to depth feature extraction by a target feature extractor in the recognition model. Because the target feature extractor in the trained recognition model extracts the video features which are irrelevant to the field and aligned under the source domain and the target domain from the video under the target domain, the target video features which are irrelevant to the field and aligned under the source domain and the target domain can be directly extracted from the video to be processed by performing depth feature extraction on the video to be processed through the recognition model.
And then, the recognition model can classify, recognize and process the video to be processed according to the target video characteristics, and the classifier is trained to have the capability of accurately classifying the video in the source domain, so that the class of the object in the video to be processed can be accurately recognized according to the target video characteristics aligned under the source domain and the target domain.
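An illustrative inference sketch for the trained model follows; the sub-module names initial, target and classifier, and the frame-pooling step, are assumptions made for illustration only.

import torch

@torch.no_grad()
def classify_video(video_frames, recognizer):
    """Inference path: preliminary features -> domain-aligned target video
    features -> predicted category."""
    recognizer.eval()
    x = recognizer.initial(video_frames)    # preliminary per-frame features
    x = recognizer.target(x)                # domain-aligned target features
    logits = recognizer.classifier(x.mean(dim=0, keepdim=True))  # pool frames
    return logits.argmax(dim=1).item()      # predicted category of the video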
In one embodiment, the step of performing classification recognition processing on the video to be processed based on the target video features through a recognition model includes: and performing behavior recognition on the object in the video to be processed based on the behavior characteristics of the target video through the recognition model.
The target video characteristics in this embodiment are specifically target video behavior characteristics. The classification identification task for the video data may specifically be a behavior identification task. The trained recognition model has the capability of extracting the video features which are irrelevant to the field and aligned under the source domain and the target domain from the video under the target domain, and accurately classifying and recognizing the behaviors of the objects in the video under the target domain.
Specifically, after the computer device obtains the video to be processed in the target domain, the depth feature extraction is firstly carried out on the video to be processed through the recognition model, the target video features which are irrelevant to the field and are aligned in the source domain and the target domain are extracted from the video to be processed, and then the behaviors of the object in the video to be processed are classified and recognized according to the target video features, so that the behavior category corresponding to the video to be processed can be accurately recognized.
In a specific test embodiment, the video processing method provided in the embodiments of the present application is compared with conventional related methods. Specifically, the present solution is compared with JAN (Joint Adaptation Networks, a video domain adaptation approach based on still pictures), DANN (Domain-Adversarial Neural Networks), and TA3N (Temporal Attentive Adversarial Adaptation Network, a video domain adaptation network that uses temporal information), respectively adopting a pooling network and a multilayer perceptron as classifiers, performing video classification processing on several data sets and comparing the corresponding test results. Specifically, two adaptation tasks are included on the UCF-HMDB data set, two tasks on the UCF-Olympic data set, and one task on the Kinetics-Gameplay data set.
The test results are shown in Table 1 below.

[Table 1: reproduced as an image in the original filing and not recoverable from the text; it compares the classification accuracy of the present solution with JAN, DANN and TA3N on the tasks described above, for each choice of classifier.]
U → H in the UCF-HMDB dataset indicates that the UCF dataset is the source domain, the HMDB dataset is the target domain, and H → U is the opposite. U → O in the UCF-Olympic dataset indicates that the UCF dataset is the source domain and the Olympic dataset is the target domain, and O → U vice versa. The Kinetics-Gameplay dataset does not distinguish between the source domain and the target domain.
As can be seen from the test results of the various processing methods shown in Table 1 above, the recognition model trained by the present solution classifies and recognizes videos in the target domain more accurately than conventional video domain adaptation network models.
In one application scenario, the video processing method described above can be applied to video behavior recognition. Specifically, the computer equipment respectively extracts the depth characteristics of the video samples in the source domain and the target domain through the identification model to be trained; performing multi-time scale feature extraction on the depth features through a domain adaptation trainer to respectively obtain multi-time scale video features under a source domain and a target domain; according to the time nodes and time scale weights corresponding to the video features, video features under a source domain and a target domain are grouped and aligned; and according to the confrontation loss between the video characteristics of the source domain and the target domain in the same group, regulating the model parameters of the recognition model according to the class loss between the prediction class of the video sample under the source domain and the corresponding sample label, and continuing the confrontation training until meeting the training stop condition to finish the training to obtain the trained recognition model.
And then, the computer equipment carries out real-time video behavior recognition processing on the objects in the video under the target domain by using the trained recognition model.
More specifically, the video to be processed in the target domain may be a surveillance video. For example, the video in the source domain may be video of an indoor scene and the video in the target domain may be video of an outdoor scene, or vice versa. After the computer device acquires the collected video under the target domain, it performs depth feature extraction on the surveillance video under the target domain through the trained recognition model to obtain target video features aligned under the source domain and the target domain. The recognition model then classifies and recognizes the behaviors of the objects in the surveillance video according to the target video features to obtain a behavior recognition result corresponding to the surveillance video. The computer device can further monitor video behavior in the surveillance video according to the behavior recognition result.
In one application scenario, the video processing method described above can be applied to a video search scenario of a target category. Specifically, the computer device performs countermeasure training on the recognition model to be trained through the video samples in the source domain and the target domain to obtain the trained recognition model.
Then, the computer device can obtain candidate videos in the target domain that meet the conditions according to the category keywords corresponding to the target category, and then perform real-time video behavior recognition processing on the objects in the candidate videos in the target domain through the trained recognition model. Specifically, depth feature extraction is performed on the candidate videos under the target domain through the trained recognition model to obtain target video features aligned under the source domain and the target domain. The recognition model then classifies and recognizes the objects in the candidate videos according to the target video features to obtain corresponding recognition results. Videos whose recognition results match the target category are then extracted from the candidate videos as video search results.
It should be understood that although the steps in the flowcharts of fig. 2, 5 and 8 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 2, 5 and 8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a video processing apparatus 900, which may be a part of a computer device using software modules or hardware modules, or a combination of the two, the apparatus specifically includes: a feature extraction module 902, a domain adaptation training module 904, a loss determination module 906, and a parameter adjustment module 908, wherein:
the feature extraction module 902 is configured to extract depth features of the video samples in the source domain and the target domain respectively through the recognition model to be trained; the video samples under the source domain carry the sample labels.
A domain adaptation training module 904, configured to perform multi-time scale feature extraction on the depth features through a domain adaptation trainer, so as to obtain multi-time scale video features in a source domain and a target domain respectively; according to the time nodes and time scale weights corresponding to the video features, video features under a source domain and a target domain are grouped and aligned; the time scale weights are positively correlated with the amount of information expressed by the corresponding video features.
A loss determination module 906 for determining a countermeasure loss according to a difference between video features under a source domain and a target domain within the same group; determining a category loss based on a difference between a prediction category of a video sample under a source domain and a corresponding sample label; the prediction category is obtained by classifying based on the video characteristics of the video sample in the source domain.
And a parameter adjusting module 908 for adjusting the model parameters of the recognition model according to the confrontation loss and the category loss and continuing the confrontation training until the training stop condition is met.
In an embodiment, the domain adaptation training module 904 is further configured to perform multi-time scale convolution processing on the depth features through the domain adaptation trainer, respectively, to obtain convolution results corresponding to the depth features; and respectively obtaining the video features of multiple time scales under the source domain and the target domain according to the time node weight corresponding to the depth feature and the corresponding convolution result.
In one embodiment, the domain adaptation trainer is a multi-time scale convolution process through convolution layers; the domain adaptation trainer also comprises a time node attention layer; the domain adaptation training module 904 is further configured to, through the time node attention layer, allocate corresponding time node weights to time nodes corresponding to the depth features, respectively, according to information amounts expressed by the depth features under the respective time nodes; the time node weight is positively correlated with the information quantity expressed by the depth characteristics under the corresponding time node.
In one embodiment, the domain adaptation training module 904 is further configured to determine information entropies corresponding to the video features of the source domain and the target domain at each time scale through a time scale attention layer of the domain adaptation trainer; the information entropy represents the information quantity expressed by the corresponding video characteristics; and respectively distributing corresponding time scale weights to the video features of each time scale according to the information entropy.
In one embodiment, the domain adaptation training module 904 is further configured to determine, through the domain adaptation trainer, video features to be aligned in the source domain and the target domain according to time nodes and time scale weights corresponding to the video features; dividing the video features to be aligned into one group to obtain a plurality of groups of aligned video features; the video features within each group include video features under the source domain and the target domain at the same time scale.
In one embodiment, the domain adaptation training module 904 is further configured to determine temporal node weights of the video features under the source domain and the target domain at respective temporal nodes; and determining video features matched with time node weights and time scale weights in different domains from the video features in the source domain and the target domain to serve as video features to be aligned in the source domain and the target domain.
In one embodiment, the feature extraction module 902 is further configured to extract initial features of the video samples in the source domain and the target domain respectively through an initial feature extractor in the recognition model to be trained; and respectively perform feature extraction on the initial features of the video samples in the source domain and the target domain through a target feature extractor in the recognition model to obtain the depth features of the video samples in the source domain and the target domain.
In an embodiment, the video processing apparatus further includes a classification module, configured to perform classification based on video features of the video sample in the source domain by using a classifier of the recognition model, so as to obtain a prediction category of the video sample in the source domain; the parameter adjusting module 908 is further configured to adjust model parameters of the target feature extractor and the classifier in the recognition model according to the countermeasure loss and the category loss, and continue the countermeasure training, so that the recognition model reduces a difference between video features in the source domain and the target domain in the same group during the iterative training until a training stop condition is met, and then ends the training.
In one embodiment, the loss determination module 906 is further configured to determine a cross-entropy loss based on a difference between a prediction category of the video sample in the target domain and a prediction category of the video sample in the source domain; the parameter adjusting module 908 is further configured to adjust model parameters of the recognition model according to the confrontation loss, the category loss, and the cross entropy loss, and continue the confrontation training, so that the recognition model reduces a difference between video features in the source domain and the target domain in the same group during the iterative training process until the training stop condition is met, and then the training is ended.
In an embodiment, after the training is finished until the training stop condition is met, the video processing apparatus further includes a video recognition module, configured to perform depth feature extraction on a video to be processed in a target domain through a trained recognition model, so as to obtain target video features aligned in a source domain and the target domain; and classifying, identifying and processing the video to be processed based on the target video characteristics through the identification model.
In one embodiment, the video identification module is further configured to perform preliminary feature extraction on the video to be processed through the trained identification model to obtain an initial feature of the video to be processed; and performing depth feature extraction on the initial features based on the time nodes and the time scale weights corresponding to the initial features to obtain target video features aligned under the source domain and the target domain.
In one embodiment, the target video features are target video behavior features; the video identification module is also used for identifying the behavior of the object in the video to be processed based on the behavior characteristics of the target video through the identification model.
For specific limitations of the video processing apparatus, reference may be made to the above limitations of the video processing method, which is not described herein again. The various modules in the video processing apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal or a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), for example.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims (15)

1. A method of video processing, the method comprising:
respectively extracting the depth characteristics of the video samples under the source domain and the target domain through the identification model to be trained; the video sample under the source domain carries a sample label;
performing multi-time scale feature extraction on the depth features through a domain adaptation trainer to respectively obtain multi-time scale video features under a source domain and a target domain;
according to the time nodes and time scale weights corresponding to the video features, video features under a source domain and a target domain are grouped and aligned; the time scale weight is positively correlated with the information quantity expressed by the corresponding video feature;
determining the confrontation loss according to the difference between the video characteristics under the source domain and the target domain in the same group;
determining a category loss based on a difference between a prediction category of a video sample under a source domain and a corresponding sample label; the prediction category is obtained by classifying based on the video characteristics of the video samples in the source domain;
and adjusting the model parameters of the recognition model according to the countermeasure loss and the category loss, and continuing the countermeasure training until the training stopping condition is met.
2. The method of claim 1, wherein performing multi-time scale feature extraction on the depth features through a domain adaptation trainer to obtain multi-time scale video features in a source domain and a target domain respectively comprises:
performing multi-time scale convolution processing on the depth features through a domain adaptation trainer to obtain convolution results corresponding to the depth features;
and respectively obtaining the video features of multiple time scales under the source domain and the target domain according to the time node weight corresponding to the depth feature and the corresponding convolution result.
3. The method of claim 2, wherein the domain adaptation trainer is a multi-time scale convolution process with convolution layers; the domain adaptation trainer further comprises a time node attention layer;
the method further comprises the following steps:
respectively distributing corresponding time node weights to the time nodes corresponding to the depth characteristics according to the information quantity expressed by the depth characteristics under each time node through the time node attention layer; the time node weight is positively correlated with the information quantity expressed by the depth features under the corresponding time node.
4. The method of claim 1, wherein before aligning the video feature groups under the source domain and the target domain according to the time nodes and the time scale weights corresponding to the video features, the method further comprises:
determining information entropies corresponding to the video features of each time scale in the source domain and the target domain through the time scale attention layer of the domain adaptive trainer; the information entropy represents the information quantity expressed by the corresponding video features;
and respectively distributing corresponding time scale weights to the video features of each time scale according to the information entropy.
5. The method of claim 1, wherein grouping and aligning the video features under the source domain and the target domain according to the time nodes and the time scale weights corresponding to the video features comprises:
determining, through the domain adaptation trainer, the video features to be aligned under the source domain and the target domain according to the time node and the time scale weight corresponding to each video feature;
and dividing each set of video features to be aligned into one group to obtain multiple groups of aligned video features; wherein the video features within each group include video features under the source domain and the target domain at the same time scale.
6. The method of claim 5, wherein the determining, by the domain adaptation trainer, the video features to be aligned in the source domain and the target domain according to the time node and the time scale weight corresponding to each of the video features comprises:
determining the time node weights of the video features under the source domain and the target domain at their corresponding time nodes;
and determining, from the video features under the source domain and the target domain, video features whose time node weights and time scale weights match across the two domains, and taking these as the video features to be aligned under the source domain and the target domain.
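The matching of features across domains could, for example, be implemented as a nearest-neighbour search in the joint (time node weight, time scale weight) space; the distance measure below is an assumption of this sketch rather than the claimed matching rule.

```python
import torch

def group_aligned_features(src_feats, src_w, tgt_feats, tgt_w):
    """Pairs source and target features whose node/scale weights match most closely.

    src_feats, tgt_feats: (N, C) features, one row per (time node, time scale) entry.
    src_w, tgt_w:         (N, 2) tensors holding [node_weight, scale_weight] per row.
    Returns a list of (source_feature, target_feature) groups to be aligned.
    """
    groups = []
    for i in range(src_feats.size(0)):
        dist = (tgt_w - src_w[i]).abs().sum(dim=-1)   # L1 distance in weight space
        j = int(torch.argmin(dist))
        groups.append((src_feats[i], tgt_feats[j]))
    return groups
```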
7. The method of claim 1, wherein extracting the depth features of the video sample in the source domain and the target domain respectively through the recognition model to be trained comprises:
respectively extracting initial features of video samples in a source domain and a target domain through an initial feature extractor in a recognition model to be trained;
and respectively performing feature extraction on the initial features of the video samples under the source domain and the target domain through a target feature extractor in the recognition model to obtain the depth features of the video samples under the source domain and the target domain.
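The two-stage extraction could be organised as a shallow per-frame initial extractor followed by a trainable target extractor that produces the per-time-node depth features; the layer shapes below are illustrative assumptions, not the claimed network.

```python
import torch
import torch.nn as nn

class RecognitionBackbone(nn.Module):
    """Initial feature extractor followed by a target (depth) feature extractor."""

    def __init__(self, in_channels=3, init_dim=64, depth_dim=256):
        super().__init__()
        # Initial feature extractor, applied frame by frame (illustrative shallow CNN).
        self.initial = nn.Sequential(
            nn.Conv2d(in_channels, init_dim, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Target feature extractor refining the initial features into depth features.
        self.target = nn.Sequential(
            nn.Linear(init_dim, depth_dim),
            nn.ReLU(inplace=True),
            nn.Linear(depth_dim, depth_dim),
        )

    def forward(self, video: torch.Tensor):
        # video: (batch, time_nodes, channels, height, width)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                 # (batch * time_nodes, C, H, W)
        init = self.initial(frames).flatten(1)       # (batch * time_nodes, init_dim)
        depth = self.target(init)                    # (batch * time_nodes, depth_dim)
        return depth.view(b, t, -1)                  # per-time-node depth features
```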
8. The method of claim 7, further comprising:
classifying, through a classifier of the recognition model, the video sample under the source domain based on its video features to obtain a prediction category of the video sample under the source domain;
and the adjusting the model parameters of the recognition model according to the adversarial loss and the category loss and continuing the adversarial training until the training stop condition is met comprises:
adjusting the model parameters of the target feature extractor and the classifier in the recognition model according to the adversarial loss and the category loss, and continuing the adversarial training so that the difference between the video features of the source domain and the target domain in the same group is reduced during the iterative training of the recognition model, until the training stop condition is met and the training ends.
9. The method of claim 1, further comprising:
determining a cross entropy loss based on a difference between a prediction category of the video sample under the target domain and a prediction category of the video sample under the source domain;
and the adjusting the model parameters of the recognition model according to the adversarial loss and the category loss and continuing the adversarial training until the training stop condition is met comprises:
adjusting the model parameters of the recognition model according to the adversarial loss, the category loss and the cross entropy loss, and continuing the adversarial training so as to reduce the difference between the video features of the source domain and the target domain in the same group during the iterative training of the recognition model, until the training stop condition is met and the training ends.
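One way to combine the three loss terms is shown below; treating the source-domain prediction as a soft target for the target-domain prediction, and the weighting factors, are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def total_loss(adv_loss, cls_loss, src_logits, tgt_logits,
               lambda_adv=1.0, lambda_ce=0.1):
    """Combines the adversarial, category and cross-domain cross entropy terms."""
    src_prob = F.softmax(src_logits, dim=-1).detach()        # soft target from the source domain
    tgt_log_prob = F.log_softmax(tgt_logits, dim=-1)
    cross_domain_ce = -(src_prob * tgt_log_prob).sum(dim=-1).mean()
    return cls_loss + lambda_adv * adv_loss + lambda_ce * cross_domain_ce
```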
10. The method according to any one of claims 1 to 9, wherein after the training ends upon the training stop condition being satisfied, the method further comprises:
performing depth feature extraction on a video to be processed under the target domain through the trained recognition model to obtain target video features aligned under the source domain and the target domain;
and performing classification and recognition on the video to be processed based on the target video features through the recognition model.
11. The method according to claim 10, wherein performing depth feature extraction on the video to be processed under the target domain through the trained recognition model to obtain target video features aligned under the source domain and the target domain comprises:
performing initial feature extraction on the video to be processed through the trained recognition model to obtain initial features of the video to be processed;
and performing depth feature extraction on the initial features based on the time nodes and the time scale weights corresponding to the initial features to obtain target video features aligned under the source domain and the target domain.
12. The method of claim 10, wherein the target video features are target video behavior features;
the performing, through the recognition model, classification and recognition on the video to be processed based on the target video features comprises:
performing, through the recognition model, behavior recognition on an object in the video to be processed based on the target video behavior features.
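At inference time, behavior recognition on a target-domain video could look like the sketch below; the single forward() interface and the class-name lookup are assumptions of this sketch.

```python
import torch

@torch.no_grad()
def recognize_behavior(model, video: torch.Tensor, class_names):
    """Classifies the behavior in one target-domain video with the trained model."""
    model.eval()
    logits = model(video.unsqueeze(0))          # (1, num_classes)
    prob = torch.softmax(logits, dim=-1)[0]
    idx = int(prob.argmax())
    return class_names[idx], float(prob[idx])   # predicted behavior and its confidence
```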
13. A video processing apparatus, characterized in that the apparatus comprises:
the feature extraction module is used for respectively extracting the depth features of the video samples under the source domain and the target domain through the recognition model to be trained; wherein the video sample under the source domain carries a sample label;
the domain adaptation training module is used for performing multi-time scale feature extraction on the depth features through a domain adaptation trainer to respectively obtain multi-time scale video features under the source domain and the target domain, and for grouping and aligning the video features under the source domain and the target domain according to the time nodes and the time scale weights corresponding to the video features; wherein the time scale weight is positively correlated with the amount of information expressed by the corresponding video feature;
the loss determining module is used for determining an adversarial loss according to the difference between the video features under the source domain and the target domain in the same group, and for determining a category loss based on a difference between a prediction category of a video sample under the source domain and a corresponding sample label; wherein the prediction category is obtained by classification based on the video features of the video sample under the source domain;
and the parameter adjusting module is used for adjusting the model parameters of the recognition model according to the adversarial loss and the category loss and continuing the adversarial training until the training stop condition is met.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 12.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 12.
CN202110928921.4A 2021-08-13 2021-08-13 Video processing method, video processing device, computer equipment and storage medium Pending CN115705706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110928921.4A CN115705706A (en) 2021-08-13 2021-08-13 Video processing method, video processing device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110928921.4A CN115705706A (en) 2021-08-13 2021-08-13 Video processing method, video processing device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115705706A true CN115705706A (en) 2023-02-17

Family

ID=85181101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110928921.4A Pending CN115705706A (en) 2021-08-13 2021-08-13 Video processing method, video processing device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115705706A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630630A (en) * 2023-07-24 2023-08-22 深圳思谋信息科技有限公司 Semantic segmentation method, semantic segmentation device, computer equipment and computer readable storage medium
CN116630630B (en) * 2023-07-24 2023-12-15 深圳思谋信息科技有限公司 Semantic segmentation method, semantic segmentation device, computer equipment and computer readable storage medium
CN117095317A (en) * 2023-10-19 2023-11-21 深圳市森歌数据技术有限公司 Unmanned aerial vehicle three-dimensional image entity identification and time positioning method

Similar Documents

Publication Publication Date Title
WO2020221278A1 (en) Video classification method and model training method and apparatus thereof, and electronic device
CN109891897B (en) Method for analyzing media content
CN110175580B (en) Video behavior identification method based on time sequence causal convolutional network
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN110807437B (en) Video granularity characteristic determination method and device and computer-readable storage medium
CN110795657A (en) Article pushing and model training method and device, storage medium and computer equipment
CN113591527A (en) Object track identification method and device, electronic equipment and storage medium
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN115705706A (en) Video processing method, video processing device, computer equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN115062709A (en) Model optimization method, device, equipment, storage medium and program product
CN113657272B (en) Micro video classification method and system based on missing data completion
CN115062779A (en) Event prediction method and device based on dynamic knowledge graph
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN113297936A (en) Volleyball group behavior identification method based on local graph convolution network
CN117095460A (en) Self-supervision group behavior recognition method and system based on long-short time relation predictive coding
CN111783688A (en) Remote sensing image scene classification method based on convolutional neural network
CN113076963B (en) Image recognition method and device and computer readable storage medium
CN113824989B (en) Video processing method, device and computer readable storage medium
CN115705756A (en) Motion detection method, motion detection device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination