CN115205739A - Low-illumination video behavior identification method and system based on semi-supervised learning - Google Patents

Low-illumination video behavior identification method and system based on semi-supervised learning

Info

Publication number
CN115205739A
Authority
CN
China
Prior art keywords
video
video sequence
low
illumination source
source video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210788338.2A
Other languages
Chinese (zh)
Other versions
CN115205739B (en)
Inventor
金枝
罗经周
王晨曦
伍鸿俊
齐浩然
黎以宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Original Assignee
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University, Sun Yat Sen University Shenzhen Campus
Priority to CN202210788338.2A priority Critical patent/CN115205739B/en
Publication of CN115205739A publication Critical patent/CN115205739A/en
Application granted granted Critical
Publication of CN115205739B publication Critical patent/CN115205739B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/34 Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a low-illumination video behavior identification method and system based on semi-supervised learning. The method comprises the following steps: acquiring a normal illumination source video and a low illumination source video; carrying out pixel level domain adaptation on the former, and weak augmentation and frame mixing processing on the latter, to obtain a first video sequence and a second video sequence respectively; taking the first video sequence and the second video sequence as model input, and using a real label to constrain the model prediction result of the first video sequence; carrying out feature level domain adaptation on features extracted from the normal illumination source video and features extracted from the low illumination source video; carrying out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model; and inputting the video sequence to be recognized into the target model to generate a prediction label result corresponding to the video sequence to be recognized. The method has high identification accuracy, can effectively improve the robust generalization capability between normal illumination video and low illumination video, and can be widely applied in the technical field of artificial intelligence.

Description

Low-illumination video behavior identification method and system based on semi-supervised learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a low-illumination video behavior identification method and system based on semi-supervised learning.
Background
In the related art of video behavior recognition, three types of methods are generally adopted: 1. fully supervised video behavior recognition methods; 2. domain adaptation methods; 3. consistency training methods. However, each of these methods has corresponding drawbacks.
Fully supervised video behavior recognition is relatively mature, but in low-illumination scenes video behavior labels are difficult to acquire, and the vast majority of publicly available behavior recognition data sets are video data sets captured under normal illumination. Therefore, if a fully supervised method is adopted for low-illumination video behavior recognition, it is easily limited by the available data sets, which restricts the application scenarios of low-illumination video behavior recognition.
Domain adaptation aims at transferring knowledge learned from a labeled source domain to an unlabeled target domain. With the development of generative adversarial networks (GANs), the related art has proposed the conditional domain adversarial network (CDAN), which computes a domain transfer loss conditioned on the prediction associated with a feature. However, normal illumination videos mostly contain little noise, whereas low illumination videos usually suffer not only from low contrast but also from a large amount of noise; when the normal illumination domain and the low illumination domain differ this much, the CDAN technique alone cannot perform domain adaptation effectively.
The consistency training method is mainly used to regularize model predictions so that they are not affected by small perturbations applied to the input samples or to the hidden feature layers. In recent years, the related art has employed consistency constraints to mitigate the reliance on pseudo-label quality. However, such methods work mainly between similar domains; because they depend strongly on the quality of the initially collected pseudo labels, the effect of consistency training between substantially different domains is extremely limited.
Disclosure of Invention
In view of this, embodiments of the present invention provide a low-illumination video behavior identification method and system based on semi-supervised learning, which have high identification accuracy and can effectively improve the robust generalization capability between normal illumination video and low illumination video.
One aspect of the embodiments of the present invention provides a low-light video behavior identification method based on semi-supervised learning, including:
acquiring a normal illumination source video and a low illumination source video;
performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
using the first video sequence and the second video sequence as model input, and using a real label to constrain a model prediction result of the first video sequence, so as to perform supervised training under a normal illumination video; carrying out feature level domain adaptation on features extracted from the normal illumination source video and features extracted from the low illumination source video; carrying out consistency training on the prediction result of the video sequence after the weak augmentation and frame mixing processing to obtain a target model;
and inputting the video sequence to be identified into the target model, and generating a prediction label result corresponding to the video sequence to be identified.
Optionally, the acquiring a normal illumination source video and a low illumination source video includes:
randomly sampling the normal illumination source video to obtain a 64-frame continuous video sequence;
randomly sampling the low illumination source video to obtain a 64-frame continuous video sequence;
when a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted to sample the source video to obtain the corresponding video sequence.
Optionally, the performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence includes:
performing degradation processing on the normal illumination source video in terms of both contrast and noise by adopting pixel-level adaptation;
wherein the expression of the degradation process is:
N_deg = D_eg(N_in) + δ
wherein N_in is a normal illumination frame; D_eg(·) is the dimming operation applied to N_in; δ is random Gaussian noise; and N_deg is the degraded video frame.
Optionally, the performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence includes:
continuously sampling a plurality of frames of the low illumination source video to obtain a third video sequence;
carrying out weak augmentation processing and strong augmentation processing on the third video sequence to obtain a first processing result, wherein the first processing result comprises the weakly augmented video sequence and the strongly augmented video sequence;
setting the ratio of weakly augmented frames in a mixed frame sequence, and randomly acquiring the start position of the weakly augmented frames according to the ratio;
synthesizing the mixed frame sequence according to the start position of the weakly augmented frames;
wherein the weak augmentation processing comprises at least one of: multi-scale cropping and random horizontal flipping;
the strong augmentation processing comprises at least one of: random flipping, random rotation, random Gaussian noise, random blur, random distortion, random brightness, and random contrast.
Optionally, the performing feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video includes:
performing domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video by using CDAN, wherein the features extracted from the normal illumination source video and the low illumination source video and the corresponding prediction results are used as the input of the CDAN domain adaptation.
Optionally, the performing consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing includes:
predicting the weakly augmented video sequence, and taking a prediction result with output confidence higher than 0.8 as a pseudo label of the source video of the sequence;
using the pseudo label as a real label to constrain the model's prediction output for the frame-mixed video sequence.
In another aspect, an embodiment of the present invention further provides a low-light video behavior recognition system based on semi-supervised learning, including:
a first module for acquiring a normal illumination source video and a low illumination source video;
a second module for performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
a third module for taking the first video sequence and the second video sequence as model input, and using a real label to constrain the model prediction result of the first video sequence, so as to perform supervised training under normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; and carrying out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model;
and a fourth module for inputting the video sequence to be identified into the target model and generating a prediction label result corresponding to the video sequence to be identified.
Another aspect of the embodiments of the present invention further provides an electronic device, which includes a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores a program, which is executed by a processor to implement the method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
The embodiment of the invention acquires a normal illumination source video and a low illumination source video; performs pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performs weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence; takes the first video sequence and the second video sequence as model input and uses a real label to constrain the model prediction result of the first video sequence, so as to perform supervised training under normal illumination video; carries out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; carries out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model; and inputs the video sequence to be identified into the target model to generate a prediction label result corresponding to the video sequence to be identified. The method has high identification accuracy and can effectively improve the robust generalization capability between normal illumination video and low illumination video.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating the overall steps provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall structure of V-DixMatch;
fig. 3 is a schematic diagram of a classifier structure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
In view of the problems in the prior art, an aspect of the embodiments of the present invention provides a low-light video behavior identification method based on semi-supervised learning, including:
acquiring a normal illumination source video and a low illumination source video;
performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
using the first video sequence and the second video sequence as model input, and using a real label to constrain a model prediction result of the first video sequence, so as to perform supervised training under a normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; carrying out consistency training on the prediction result of the video sequence after the weak augmentation and frame mixing processing to obtain a target model;
and inputting the video sequence to be identified into the target model, and generating a prediction label result corresponding to the video sequence to be identified.
Optionally, the acquiring a normal illumination source video and a low illumination source video includes:
randomly sampling the normal illumination source video to obtain a 64-frame continuous video sequence;
randomly sampling the low illumination source video to obtain a 64-frame continuous video sequence;
when a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted to sample the source video to obtain the corresponding video sequence.
Optionally, the performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence includes:
performing degradation processing on the normal illumination source video in terms of both contrast and noise by adopting pixel-level adaptation;
wherein the expression of the degradation process is:
N_deg = D_eg(N_in) + δ
wherein N_in is a normal illumination frame; D_eg(·) is the dimming operation applied to N_in; δ is random Gaussian noise; and N_deg is the degraded video frame.
Optionally, the performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence includes:
continuously sampling a plurality of frames of the low illumination source video to obtain a third video sequence;
carrying out weak augmentation processing and strong augmentation processing on the third video sequence to obtain a first processing result, wherein the first processing result comprises the weakly augmented video sequence and the strongly augmented video sequence;
setting the ratio of weakly augmented frames in a mixed frame sequence, and randomly acquiring the start position of the weakly augmented frames according to the ratio;
synthesizing the mixed frame sequence according to the start position of the weakly augmented frames;
wherein the weak augmentation processing comprises at least one of: multi-scale cropping and random horizontal flipping;
the strong augmentation processing comprises at least one of: random flipping, random rotation, random Gaussian noise, random blur, random distortion, random brightness, and random contrast.
Optionally, the performing feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video includes:
performing domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video by using CDAN, wherein the features extracted from the normal illumination source video and the low illumination source video and the corresponding prediction results are used as the input of the CDAN domain adaptation.
Optionally, the performing consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing includes:
predicting the weakly augmented video sequence, and taking a prediction result with output confidence higher than 0.8 as a pseudo label of the source video of the sequence;
using the pseudo label as a real label to constrain the model's prediction output for the frame-mixed video sequence.
In another aspect, an embodiment of the present invention further provides a low-light video behavior recognition system based on semi-supervised learning, including:
a first module for acquiring a normal illumination source video and a low illumination source video;
a second module for performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
a third module for taking the first video sequence and the second video sequence as model input, and using a real label to constrain the model prediction result of the first video sequence, so as to perform supervised training under normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; and carrying out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model;
and a fourth module for inputting the video sequence to be recognized into the target model and generating a prediction label result corresponding to the video sequence to be recognized.
Another aspect of the embodiments of the present invention further provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores a program, which is executed by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
The following detailed description of the embodiments of the present invention is made with reference to the accompanying drawings:
Aiming at the problems in the prior art, the invention provides a low-illumination video behavior recognition method and system, V-DixMatch, based on semi-supervised learning, which uses labeled normal illumination videos and unlabeled low illumination videos as the data sets to improve the accuracy of the model system in recognizing behaviors in low illumination videos. Experiments showed that using the domain adaptation technique alone easily makes model training unstable because of the large domain gap, and that using only the existing consistency training technique suffers from poor initial pseudo-label collection quality caused by the same large domain gap, so a high-accuracy model cannot be trained. The invention therefore proposes V-DixMatch, which mainly comprises three parts: joint pixel level domain adaptation, feature level domain adaptation, and consistency training based on a frame mixing technique, thereby effectively improving the robust generalization capability of the model system between normal illumination videos and low illumination videos.
As shown in fig. 1, the method of the present invention comprises three training stages overall, specifically:
Stage one:
Step 1, randomly sampling a 64-frame continuous video sequence from the normal illumination source video (if the source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted).
Step 2, applying pixel level domain adaptation to the normal illumination video sequence.
Step 3, taking the degraded video sequence as model input, and using the real label to constrain the model output, so as to perform supervised training under normal illumination video.
Stage two:
Step 1, randomly sampling a 64-frame continuous video sequence from each of the normal illumination source video and the low illumination source video (if a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted).
Step 2, applying pixel level domain adaptation to the normal illumination video sequence, and applying weak augmentation to the low illumination video sequence for data perturbation.
Step 3, taking the degraded video sequence and the weakly augmented video sequence as model input, and using the real label to constrain the prediction result of the degraded video sequence; meanwhile, carrying out feature level domain adaptation on the normal illumination video features and the low illumination video features.
Stage three:
Step 1, randomly sampling a 64-frame continuous video sequence from each of the normal illumination source video and the low illumination source video (if a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted).
Step 2, applying pixel level domain adaptation to the normal illumination video sequence, and applying weak augmentation and the frame mixing technique to the low illumination video sequence for data perturbation.
Step 3, taking the degraded video sequence, the weakly augmented video sequence and the frame-mixed video sequence as model input, and using the real label to constrain the prediction result of the degraded video sequence; meanwhile, carrying out feature level domain adaptation on the normal illumination video features and the low illumination video features; and computing a consistency loss between the prediction result of the weakly augmented video sequence and the prediction result of the frame-mixed video sequence to perform consistency training.
Step 4, after training is finished, only one model is obtained because the model parameters are shared across the stages. A 64-frame continuous video sequence is sampled from the low illumination source video to be inferred (if the source video has fewer than 64 frames, the cyclic frame-taking strategy is adopted) and used directly as the input of the model, so as to predict the behavior recognition label of that low illumination video.
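The clip sampling used in each stage above can be illustrated with a short sketch. The following Python example of random 64-frame clip sampling with the cyclic frame-taking strategy is a minimal illustration; the function and parameter names are assumptions rather than part of the patent.

import numpy as np

def sample_clip(video_frames, clip_len=64):
    """Randomly sample a contiguous clip of clip_len frames.
    If the video is shorter than clip_len, loop over its frames
    (cyclic frame-taking) until the clip is full."""
    n = len(video_frames)
    if n >= clip_len:
        start = np.random.randint(0, n - clip_len + 1)
        indices = range(start, start + clip_len)
    else:
        # cyclic frame-taking strategy: index modulo the video length
        indices = [i % n for i in range(clip_len)]
    return [video_frames[i] for i in indices]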
As shown in fig. 2, the complete method of this embodiment is composed of three parts: pixel level domain adaptation, feature level domain adaptation, and consistency training between cross-domain videos. First, this embodiment implements pixel level domain adaptation by performing degradation processing on the normal illumination video, to narrow the domain gap between the normal illumination video and the low illumination video. Then, feature level domain adaptation is performed between the low illumination video and the normal illumination video. To take full advantage of the labeled normal illumination video data, this embodiment uses the labeled normal illumination video to provide basic constraints for the model to learn knowledge about the recognized actions. Finally, this embodiment uses a novel frame mixing technique (BFA) to increase the input perturbation of the model during consistency training, thereby further improving the stability of the model.
As shown in fig. 3, this embodiment selects R(2+1)D BERT as the classifier, where FC is a fully connected layer that predicts the final result from the extracted feature vector. The following sections describe the details of each part of V-DixMatch.
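As a rough illustration of this classifier structure (not the exact architecture of the patent), the sketch below wraps an abstract video backbone, standing in for the R(2+1)D + BERT feature extractor, with a fully connected layer; all names are illustrative assumptions.

import torch.nn as nn

class BehaviorClassifier(nn.Module):
    """Sketch of the classifier in Fig. 3: a video backbone producing a
    pooled feature vector, followed by an FC layer giving class logits."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone      # e.g. an R(2+1)D + BERT feature extractor
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):
        feat = self.backbone(clip)    # (batch, feat_dim) pooled video feature
        logits = self.fc(feat)        # (batch, num_classes) action logits
        return logits, feat           # the feature is reused for domain adaptation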
1. Pixel level domain adaptation
The embodiment first adopts pixel-level adaptation to perform degradation processing on a normally-illuminated video from two aspects, namely contrast and noise. Inspired by LEDNet, the degradation process can be defined as:
N_deg = D_eg(N_in) + δ
wherein N_in is a normal illumination frame, D_eg(·) is the dimming operation applied to N_in, δ is random Gaussian noise, and N_deg is the degraded video frame. The dimming operation is consistent with the operation used in LEDNet.
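A minimal sketch of this degradation step is given below; the gamma-style darkening only stands in for the LEDNet dimming operation D_eg, and the parameter ranges are assumed for illustration.

import numpy as np

def degrade_frame(frame, gamma_range=(1.5, 3.0), noise_sigma=0.03):
    """Approximate N_deg = D_eg(N_in) + delta on one uint8 frame:
    darken the frame (contrast/brightness reduction) and add Gaussian noise."""
    img = frame.astype(np.float32) / 255.0
    gamma = np.random.uniform(*gamma_range)
    dimmed = np.power(img, gamma)                            # stand-in for D_eg
    noise = np.random.normal(0.0, noise_sigma, img.shape)    # delta
    degraded = np.clip(dimmed + noise, 0.0, 1.0)
    return (degraded * 255.0).astype(np.uint8)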
2. Feature level domain adaptation
As shown in fig. 1, this embodiment uses CDAN for domain adaptation, where the features and prediction results extracted from the low illumination and normal illumination video frames are used as the input of the CDAN domain adaptation, so that the features the model extracts from normal illumination video and from low illumination video are progressively brought as close as possible, thereby improving the robustness of the model in behavior recognition on low illumination video.
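The conditional adversarial adaptation can be sketched as follows. This is an illustrative PyTorch-style rendering of the CDAN idea (features conditioned on predictions fed to a domain discriminator), not the patent's exact network; the gradient-reversal scheduling used in practice is omitted and all names are assumptions.

import torch
import torch.nn as nn

class ConditionalDomainDiscriminator(nn.Module):
    """Discriminator over the outer product of features and class predictions
    (CDAN-style multilinear conditioning)."""
    def __init__(self, feat_dim, num_classes, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, feat, prob):
        # outer product of predictions and features, flattened per sample
        cond = torch.bmm(prob.unsqueeze(2), feat.unsqueeze(1)).flatten(1)
        return self.net(cond)

def cdan_loss(disc, feat_normal, logits_normal, feat_low, logits_low):
    """Binary domain loss: normal illumination domain = 1, low illumination = 0."""
    bce = nn.BCEWithLogitsLoss()
    d_n = disc(feat_normal, logits_normal.softmax(dim=1).detach())
    d_l = disc(feat_low, logits_low.softmax(dim=1).detach())
    return bce(d_n, torch.ones_like(d_n)) + bce(d_l, torch.zeros_like(d_l))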
3. Frame-blending based consistency training
3.1 frame hybrid data augmentation
In order to make maximum use of the collected pseudo-label samples during consistency training, this embodiment proposes a frame-mix data augmentation technique (BFA). The specific details of BFA are as follows:
1) Continuously sample N frames of a video sample, and call this segment's sequence F:
F = (f_1, f_2, ..., f_N)
2) Apply weak augmentation (multi-scale cropping and random horizontal flipping) and strong augmentation (random flipping, random rotation, random Gaussian noise, random blur, random distortion, random brightness and random contrast) to F:
F_weak = (f_w1, f_w2, ..., f_wN)
F_strong = (f_s1, f_s2, ..., f_sN)
3) Set the ratio α of weakly augmented frames in the mixed clip; α is set to 0.3 in this embodiment. Then randomly acquire the start position P of the weakly augmented frames as follows:
λ ~ Uniform(α, 1 - α)
P ~ Uniform(1, (1 - λ)N + 1)
4) Synthesize the mixed clip F_BFA, in which frames 1 to P-1 and frames P+λN to N are strongly augmented frames, and frames P to P+λN-1 are weakly augmented frames. The mixed clip F_BFA can be represented by the following set:
F_BFA = (f_s1, ..., f_s(P-1), f_wP, ..., f_w(P+λN-1), f_s(P+λN), ..., f_sN)
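A minimal sketch of BFA under these rules is shown below; the function name and the 0-based indexing are assumptions made for illustration.

import numpy as np

def frame_mix(frames_weak, frames_strong, alpha=0.3):
    """Splice a contiguous run of weakly augmented frames into the strongly
    augmented clip: lambda ~ U(alpha, 1-alpha) sets the run length, and the
    start position is drawn uniformly (0-based here; the text uses 1-based)."""
    N = len(frames_weak)
    lam = np.random.uniform(alpha, 1.0 - alpha)
    run = int(round(lam * N))                    # number of weak frames
    start = np.random.randint(0, N - run + 1)    # start position of the weak run
    mixed = list(frames_strong)
    mixed[start:start + run] = frames_weak[start:start + run]
    return mixed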
3.2 loss of consistency
The model predicts on the weakly augmented video sequence, takes a prediction result whose output confidence is higher than 0.8 as the pseudo label of the source video of that sequence, and uses the pseudo label as a real label to constrain the model's prediction output for the frame-mixed video sequence (the classification loss is computed with the Focal loss).
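One way to implement this pseudo-label consistency loss is sketched below; the focal parameter gamma = 2.0 is an assumed value that is not stated in the text.

import torch
import torch.nn.functional as F

def consistency_loss(logits_weak, logits_mixed, threshold=0.8, gamma=2.0):
    """Pseudo labels come from the weakly augmented clip; only predictions
    above the 0.8 confidence threshold supervise the frame-mixed clip,
    using a focal-style classification loss."""
    with torch.no_grad():
        probs = logits_weak.softmax(dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf.ge(threshold).float()        # keep only confident samples
    log_p = F.log_softmax(logits_mixed, dim=1)
    log_pt = log_p.gather(1, pseudo.unsqueeze(1)).squeeze(1)
    focal = -((1.0 - log_pt.exp()) ** gamma) * log_pt
    return (focal * mask).sum() / mask.sum().clamp_min(1.0)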
3.3 three-stage training strategy
The model of this embodiment is first pre-trained in stage one for 50 epochs using the normal illumination video together with pixel level domain adaptation; then stage-two training is carried out on the basis of stage one by adding feature level domain adaptation, for 30 epochs; finally, stage-three training is carried out on the basis of stage two by adding the consistency training based on the BFA technique, for another 30 epochs.
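The three-stage schedule can be summarized as a small configuration; the loss names and the callback interface below are assumptions, since the text specifies only which components are active in each stage and the epoch counts.

STAGES = [
    {"epochs": 50, "losses": ("supervised",)},                       # stage one
    {"epochs": 30, "losses": ("supervised", "cdan")},                # stage two
    {"epochs": 30, "losses": ("supervised", "cdan", "consistency")}  # stage three
]

def run_schedule(train_one_epoch):
    """train_one_epoch(active_losses) is a user-supplied training callback."""
    for stage in STAGES:
        for _ in range(stage["epochs"]):
            train_one_epoch(stage["losses"])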
In conclusion, the invention constructs a low-illumination video behavior recognition method and system, V-DixMatch, based on semi-supervised learning, which uses labeled normal illumination videos and unlabeled low illumination videos as the data sets to improve the accuracy of the model system in recognizing behaviors in low illumination videos. The method comprises three main parts, namely pixel level adaptation, feature level adaptation and consistency training based on frame mixing; within the consistency training process, a novel frame-mix data augmentation technique is proposed, which further improves the feature extraction capability and robustness of the model in low illumination scenes. Experimental comparison shows that the accuracy of this embodiment is about 4% higher than that of the current state-of-the-art method (whose top-1 accuracy is about 80%), reaching about 84%; compared with full supervision that directly uses labeled low illumination video (top-1 accuracy about 86%), the gap is only about 2%. This is sufficient to demonstrate the effectiveness of the method of this embodiment.
The method can be applied to high-level visual tasks in low illumination scenes: besides behavior recognition, high-level tasks such as object detection and face recognition can also adopt the semi-supervised learning method V-DixMatch provided by this embodiment. The frame-mix data augmentation technique of the invention can be applied to any video-related task and can also be generalized to frame mixing between different types of videos. The three-stage training strategy can likewise be applied to consistency learning tasks between other pairs of domains.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise indicated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is to be determined from the appended claims along with their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A low-light video behavior identification method based on semi-supervised learning is characterized by comprising the following steps:
acquiring a normal illumination source video and a low illumination source video;
performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
using the first video sequence and the second video sequence as model input, and using a real label to constrain a model prediction result of the first video sequence, so as to perform supervised training under a normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; carrying out consistency training on the prediction result of the video sequence after the weak augmentation and frame mixing processing to obtain a target model;
and inputting the video sequence to be identified into the target model, and generating a prediction label result corresponding to the video sequence to be identified.
2. The method for identifying the behavior of the low-light video based on the semi-supervised learning as recited in claim 1, wherein the acquiring the normal-light source video and the low-light source video comprises:
randomly sampling the normal illumination source video to obtain a 64-frame continuous video sequence;
randomly sampling the low illumination source video to obtain a 64-frame continuous video sequence;
when a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted to sample the source video to obtain the corresponding video sequence.
3. The method for low-light video behavior recognition based on semi-supervised learning as claimed in claim 1, wherein the performing pixel-level domain adaptation on the normal-light source video to obtain a first video sequence comprises:
performing degradation processing on the normal illumination source video in terms of both contrast and noise by adopting pixel-level adaptation;
wherein the expression of the degradation process is:
N_deg = D_eg(N_in) + δ
wherein N_in is a normal illumination frame; D_eg(·) is the dimming operation applied to N_in; δ is random Gaussian noise; and N_deg is the degraded video frame.
4. The method for identifying the behavior of the low-light video based on the semi-supervised learning as recited in claim 1, wherein the performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence comprises:
continuously sampling a plurality of frames of the low illumination source video to obtain a third video sequence;
performing weak augmentation processing and strong augmentation processing on the third video sequence to obtain a first processing result, wherein the first processing result comprises the weakly augmented video sequence and the strongly augmented video sequence;
setting the ratio of weakly augmented frames in a mixed frame sequence, and randomly acquiring the start position of the weakly augmented frames according to the ratio;
synthesizing the mixed frame sequence according to the start position of the weakly augmented frames;
wherein the weak augmentation processing comprises at least one of: multi-scale cropping and random horizontal flipping;
the strong augmentation processing comprises at least one of: random flipping, random rotation, random Gaussian noise, random blur, random distortion, random brightness, and random contrast.
5. The method for low-light video behavior recognition based on semi-supervised learning according to claim 1, wherein the performing feature level domain adaptation on the features extracted from the normal-light source video and the features extracted from the low-light source video comprises:
performing domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video by using CDAN, wherein the features extracted from the normal illumination source video and the low illumination source video and the corresponding prediction results are used as the input of the CDAN domain adaptation.
6. The method for identifying low-light video behaviors based on semi-supervised learning according to claim 1, wherein the performing consistency training on the prediction result of the video sequence after the weak augmentation and frame mixing processing comprises:
predicting the weakly augmented video sequence, and taking a prediction result with output confidence higher than 0.8 as a pseudo label of the source video of the sequence;
using the pseudo label as a real label to constrain the model's prediction output for the frame-mixed video sequence.
7. A low-light video behavior recognition system based on semi-supervised learning, comprising:
a first module for acquiring a normal illumination source video and a low illumination source video;
a second module for performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
a third module for taking the first video sequence and the second video sequence as model input, and using a real label to constrain the model prediction result of the first video sequence, so as to perform supervised training under normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; and carrying out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model;
and a fourth module for inputting the video sequence to be recognized into the target model and generating a prediction label result corresponding to the video sequence to be recognized.
8. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program realizes the method of any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the storage medium stores a program which is executed by a processor to implement the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the method according to any of claims 1 to 6 when executed by a processor.
CN202210788338.2A 2022-07-06 2022-07-06 Low-light video behavior recognition method and system based on semi-supervised learning Active CN115205739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210788338.2A CN115205739B (en) 2022-07-06 2022-07-06 Low-light video behavior recognition method and system based on semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210788338.2A CN115205739B (en) 2022-07-06 2022-07-06 Low-light video behavior recognition method and system based on semi-supervised learning

Publications (2)

Publication Number Publication Date
CN115205739A true CN115205739A (en) 2022-10-18
CN115205739B CN115205739B (en) 2023-11-28

Family

ID=83578369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210788338.2A Active CN115205739B (en) 2022-07-06 2022-07-06 Low-light video behavior recognition method and system based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN115205739B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200082221A1 (en) * 2018-09-06 2020-03-12 Nec Laboratories America, Inc. Domain adaptation for instance detection and segmentation
CN112183456A (en) * 2020-10-19 2021-01-05 北京深睿博联科技有限责任公司 Multi-scene moving object detection method and device based on sample generation and domain adaptation
CN112307995A (en) * 2020-11-05 2021-02-02 电子科技大学 Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN112966723A (en) * 2021-02-08 2021-06-15 北京百度网讯科技有限公司 Video data augmentation method, video data augmentation device, electronic device and readable storage medium
CN113158815A (en) * 2021-03-27 2021-07-23 复旦大学 Unsupervised pedestrian re-identification method, system and computer readable medium
CN113469289A (en) * 2021-09-01 2021-10-01 成都考拉悠然科技有限公司 Video self-supervision characterization learning method and device, computer equipment and medium
CN114332568A (en) * 2022-03-16 2022-04-12 中国科学技术大学 Training method, system, equipment and storage medium of domain adaptive image classification network
CN114693615A (en) * 2022-03-17 2022-07-01 常州工学院 Deep learning concrete bridge crack real-time detection method based on domain adaptation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200082221A1 (en) * 2018-09-06 2020-03-12 Nec Laboratories America, Inc. Domain adaptation for instance detection and segmentation
CN112183456A (en) * 2020-10-19 2021-01-05 北京深睿博联科技有限责任公司 Multi-scene moving object detection method and device based on sample generation and domain adaptation
CN112307995A (en) * 2020-11-05 2021-02-02 电子科技大学 Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN112966723A (en) * 2021-02-08 2021-06-15 北京百度网讯科技有限公司 Video data augmentation method, video data augmentation device, electronic device and readable storage medium
CN113158815A (en) * 2021-03-27 2021-07-23 复旦大学 Unsupervised pedestrian re-identification method, system and computer readable medium
CN113469289A (en) * 2021-09-01 2021-10-01 成都考拉悠然科技有限公司 Video self-supervision characterization learning method and device, computer equipment and medium
CN114332568A (en) * 2022-03-16 2022-04-12 中国科学技术大学 Training method, system, equipment and storage medium of domain adaptive image classification network
CN114693615A (en) * 2022-03-17 2022-07-01 常州工学院 Deep learning concrete bridge crack real-time detection method based on domain adaptation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JAEMIN NA ET AL: "FixBi: Bridging Domain Spaces for Unsupervised Domain Adaptation", IEEE, pages 1094-1103 *
M. ESAT KALFAOGLU ET AL: "Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition", arXiv:2008.01232v3, pages 1-19 *
LI ZHIFEI AND DUANMU CHUNJIANG: "Human abnormal behavior recognition based on semi-supervised learning", Journal of Zhejiang Normal University (Natural Sciences), pages 258-262 *

Also Published As

Publication number Publication date
CN115205739B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
Ren et al. Deep video dehazing with semantic segmentation
Li et al. End-to-end united video dehazing and detection
Sixt et al. Rendergan: Generating realistic labeled data
Yasarla et al. Uncertainty guided multi-scale residual learning-using a cycle spinning cnn for single image de-raining
Gomez-Ojeda et al. Learning-based image enhancement for visual odometry in challenging HDR environments
CN110969589A (en) Dynamic scene fuzzy image blind restoration method based on multi-stream attention countermeasure network
Pan et al. Joint stereo video deblurring, scene flow estimation and moving object segmentation
Guo et al. Joint raindrop and haze removal from a single image
Xing et al. Domain adaptive video segmentation via temporal pseudo supervision
Choo et al. Learning background subtraction by video synthesis and multi-scale recurrent networks
Cao et al. A two-stage density-aware single image deraining method
Jalata et al. Eqadap: Equipollent domain adaptation approach to image deblurring
CN111625661A (en) Audio and video segment classification method and device
Shin et al. Unsupervised domain adaptation for video semantic segmentation
Wang et al. PFDN: Pyramid feature decoupling network for single image deraining
Dai et al. A gated cross-domain collaborative network for underwater object detection
Wang et al. Undaf: A general unsupervised domain adaptation framework for disparity or optical flow estimation
Lin et al. Real-time foreground object segmentation networks using long and short skip connections
CN115205739B (en) Low-light video behavior recognition method and system based on semi-supervised learning
Wang et al. A multi-scale attentive recurrent network for image dehazing
Kajo et al. Tensor based completion meets adversarial learning: a win–win solution for change detection on unseen videos
Tang et al. Removal of visual disruption caused by rain using cycle-consistent generative adversarial networks
Xing et al. Improved shallow-uwnet for underwater image enhancement
Kusuma et al. Modular ST-MRF environment for moving target detection and tracking under adverse local conditions
Song et al. Pixel-wise object tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant