CN115205739A - Low-illumination video behavior identification method and system based on semi-supervised learning - Google Patents

Low-illumination video behavior identification method and system based on semi-supervised learning

Info

Publication number
CN115205739A
Authority
CN
China
Prior art keywords
video
video sequence
low
illumination source
source video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210788338.2A
Other languages
Chinese (zh)
Other versions
CN115205739B (en)
Inventor
金枝
罗经周
王晨曦
伍鸿俊
齐浩然
黎以宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Original Assignee
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University, Sun Yat Sen University Shenzhen Campus
Priority to CN202210788338.2A priority Critical patent/CN115205739B/en
Publication of CN115205739A publication Critical patent/CN115205739A/en
Application granted granted Critical
Publication of CN115205739B publication Critical patent/CN115205739B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/34 Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a low-illumination video behavior identification method and system based on semi-supervised learning. The method comprises the following steps: acquiring a normal illumination source video and a low illumination source video; carrying out pixel level domain adaptation on the former, and weak augmentation and frame mixing processing on the latter, to obtain a first video sequence and a second video sequence respectively; taking the first video sequence and the second video sequence as model input, and using a real label to constrain the model prediction result of the first video sequence; carrying out feature level domain adaptation on features extracted from the normal illumination source video and features extracted from the low illumination source video; carrying out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model; and inputting the video sequence to be recognized into the target model to generate a prediction label result corresponding to the video sequence to be recognized. The method has high identification accuracy, can effectively improve the robust generalization capability between normal illumination video and low illumination video, and can be widely applied in the technical field of artificial intelligence.

Description

Low-illumination video behavior identification method and system based on semi-supervised learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a low-illumination video behavior identification method and system based on semi-supervised learning.
Background
In the related art of video behavior recognition, three types of methods are generally adopted: 1. fully supervised video behavior recognition methods; 2. domain adaptation methods; 3. consistency training methods. However, each of these methods has corresponding drawbacks.
Fully supervised video behavior recognition is relatively mature, but in low-illumination scenes video behavior labels are difficult to acquire, and the vast majority of publicly available behavior recognition data sets are video data sets captured under normal illumination. Therefore, if a fully supervised method is adopted for low-illumination video behavior recognition, it is easily limited by the available data sets, which restricts the application scenarios of low-illumination video behavior recognition.
Domain adaptation aims at transferring knowledge learned from a labeled source domain to an unlabeled target domain. With the development of generative adversarial networks (GANs), the related art has proposed the conditional domain adversarial network (CDAN), which computes a domain transfer loss conditioned on the prediction associated with a feature. However, normal illumination videos mostly contain little noise, whereas low illumination videos usually suffer not only from low contrast but also from a large amount of noise; when the normal illumination domain and the low illumination domain differ this much, the CDAN technique alone cannot perform domain adaptation effectively.
The consistency training method is mainly used to regularize model predictions so that they are not affected by small perturbations applied to the input samples or to the hidden feature layers. In recent years, the related art has employed consistency constraints to mitigate the reliance on pseudo-label quality. However, such methods work mainly between similar domains; because they depend strongly on the quality of the initially collected pseudo labels, the effect of consistency training between substantially different domains is extremely limited.
Disclosure of Invention
In view of this, embodiments of the present invention provide a low-illumination video behavior identification method and system based on semi-supervised learning, which have high identification accuracy and can effectively improve the robust generalization capability between normal illumination video and low illumination video.
One aspect of the embodiments of the present invention provides a low-light video behavior identification method based on semi-supervised learning, including:
acquiring a normal illumination source video and a low illumination source video;
performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
using the first video sequence and the second video sequence as model input, and using a real label to constrain a model prediction result of the first video sequence, so as to perform supervised training under a normal illumination video; carrying out feature level domain adaptation on features extracted from the normal illumination source video and features extracted from the low illumination source video; carrying out consistency training on the prediction result of the video sequence after the weak augmentation and frame mixing processing to obtain a target model;
and inputting the video sequence to be identified into the target model, and generating a prediction label result corresponding to the video sequence to be identified.
Optionally, the acquiring a normal illumination source video and a low illumination source video includes:
randomly sampling the normal illumination source video to obtain a 64-frame continuous video sequence;
randomly sampling the low illumination source video to obtain a 64-frame continuous video sequence;
when a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted to sample the source video to obtain the corresponding video sequence.
Optionally, the performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence includes:
performing degradation processing on the normal illumination source video in terms of both contrast and noise by adopting pixel-level adaptation;
wherein the expression of the degradation process is:
N_deg = D_eg(N_in) + δ
wherein N_in is a normal illumination frame; D_eg(·) is the dimming operation applied to N_in; δ is random Gaussian noise; and N_deg is the degraded video frame.
Optionally, the performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence includes:
continuously sampling a plurality of frames of the low illumination source video to obtain a third video sequence;
carrying out weak augmentation processing and strong augmentation processing on the third video sequence to obtain a first processing result, wherein the first processing result comprises the weakly augmented video sequence and the strongly augmented video sequence;
setting the ratio of weakly augmented frames in a mixed frame sequence, and randomly acquiring the start position of the weakly augmented frames according to the ratio;
synthesizing the mixed frame sequence according to the start position of the weakly augmented frames;
wherein the weak augmentation processing comprises at least one of: multi-scale cropping and random horizontal flipping;
the strong augmentation processing comprises at least one of: random flipping, random rotation, random Gaussian noise, random blur, random distortion, random brightness, and random contrast.
Optionally, the performing feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video includes:
performing domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video by using CDAN, wherein the features extracted from the normal illumination source video and the low illumination source video and the corresponding prediction results are used as the input of the CDAN domain adaptation.
Optionally, the performing consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing includes:
predicting the weakly augmented video sequence, and taking a prediction result with output confidence higher than 0.8 as a pseudo label of the source video of the sequence;
using the pseudo label as a real label to constrain the model's prediction output for the frame-mixed video sequence.
In another aspect, an embodiment of the present invention further provides a low-light video behavior recognition system based on semi-supervised learning, including:
a first module for acquiring a normal illumination source video and a low illumination source video;
a second module for performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
a third module for taking the first video sequence and the second video sequence as model input, and using a real label to constrain the model prediction result of the first video sequence, so as to perform supervised training under normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; and carrying out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model;
and a fourth module for inputting the video sequence to be identified into the target model and generating a prediction label result corresponding to the video sequence to be identified.
Another aspect of the embodiments of the present invention further provides an electronic device, which includes a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores a program, which is executed by a processor to implement the method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
The embodiment of the invention acquires a normal illumination source video and a low illumination source video; performs pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performs weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence; takes the first video sequence and the second video sequence as model input and uses a real label to constrain the model prediction result of the first video sequence, so as to perform supervised training under normal illumination video; carries out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; carries out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model; and inputs the video sequence to be identified into the target model to generate a prediction label result corresponding to the video sequence to be identified. The method has high identification accuracy and can effectively improve the robust generalization capability between normal illumination video and low illumination video.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating the overall steps provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the overall structure of V-DixMatch;
fig. 3 is a schematic diagram of a classifier structure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
In view of the problems in the prior art, an aspect of the embodiments of the present invention provides a low-light video behavior identification method based on semi-supervised learning, including:
acquiring a normal illumination source video and a low illumination source video;
performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
using the first video sequence and the second video sequence as model input, and using a real label to constrain a model prediction result of the first video sequence, so as to perform supervised training under a normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; carrying out consistency training on the prediction result of the video sequence after the weak augmentation and frame mixing processing to obtain a target model;
and inputting the video sequence to be identified into the target model, and generating a prediction label result corresponding to the video sequence to be identified.
Optionally, the acquiring a normal illumination source video and a low illumination source video includes:
randomly sampling the normal illumination source video to obtain a 64-frame continuous video sequence;
randomly sampling the low illumination source video to obtain a 64-frame continuous video sequence;
when a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted to sample the source video to obtain the corresponding video sequence.
Optionally, the performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence includes:
performing degradation processing on the normal illumination source video in terms of both contrast and noise by adopting pixel-level adaptation;
wherein the expression of the degradation process is:
N_deg = D_eg(N_in) + δ
wherein N_in is a normal illumination frame; D_eg(·) is the dimming operation applied to N_in; δ is random Gaussian noise; and N_deg is the degraded video frame.
Optionally, the performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence includes:
continuously sampling a plurality of frames of the low illumination source video to obtain a third video sequence;
carrying out weak augmentation processing and strong augmentation processing on the third video sequence to obtain a first processing result, wherein the first processing result comprises the weakly augmented video sequence and the strongly augmented video sequence;
setting the ratio of weakly augmented frames in a mixed frame sequence, and randomly acquiring the start position of the weakly augmented frames according to the ratio;
synthesizing the mixed frame sequence according to the start position of the weakly augmented frames;
wherein the weak augmentation processing comprises at least one of: multi-scale cropping and random horizontal flipping;
the strong augmentation processing comprises at least one of: random flipping, random rotation, random Gaussian noise, random blur, random distortion, random brightness, and random contrast.
Optionally, the performing feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video includes:
performing domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video by using CDAN, wherein the features extracted from the normal illumination source video and the low illumination source video and the corresponding prediction results are used as the input of the CDAN domain adaptation.
Optionally, the performing consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing includes:
predicting the weakly augmented video sequence, and taking a prediction result with output confidence higher than 0.8 as a pseudo label of the source video of the sequence;
using the pseudo label as a real label to constrain the model's prediction output for the frame-mixed video sequence.
In another aspect, an embodiment of the present invention further provides a low-light video behavior recognition system based on semi-supervised learning, including:
a first module for acquiring a normal illumination source video and a low illumination source video;
a second module for performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
a third module for taking the first video sequence and the second video sequence as model input, and using a real label to constrain the model prediction result of the first video sequence, so as to perform supervised training under normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; and carrying out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model;
and a fourth module for inputting the video sequence to be recognized into the target model and generating a prediction label result corresponding to the video sequence to be recognized.
Another aspect of the embodiments of the present invention further provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores a program, which is executed by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
The following detailed description of the embodiments of the present invention is made with reference to the accompanying drawings:
Aiming at the problems in the prior art, the invention provides a low-illumination video behavior recognition method and system, V-DixMatch, based on semi-supervised learning, which uses labeled normal illumination videos and unlabeled low illumination videos as the data sets to improve the accuracy of the model system in recognizing behaviors in low illumination videos. Experiments showed that using the domain adaptation technique alone easily makes model training unstable because of the large domain gap, and that using only the existing consistency training technique suffers from poor initial pseudo-label collection quality caused by the same large domain gap, so a high-accuracy model cannot be trained. The invention therefore proposes V-DixMatch, which mainly comprises three parts: joint pixel level domain adaptation, feature level domain adaptation, and consistency training based on a frame mixing technique, thereby effectively improving the robust generalization capability of the model system between normal illumination videos and low illumination videos.
As shown in fig. 1, the method of the present invention comprises three training stages overall, specifically:
Stage one:
Step 1, randomly sampling a 64-frame continuous video sequence from the normal illumination source video (if the source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted).
Step 2, applying pixel level domain adaptation to the normal illumination video sequence.
Step 3, taking the degraded video sequence as model input, and using the real label to constrain the model output, so as to perform supervised training under normal illumination video.
Stage two:
Step 1, randomly sampling a 64-frame continuous video sequence from each of the normal illumination source video and the low illumination source video (if a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted).
Step 2, applying pixel level domain adaptation to the normal illumination video sequence, and applying weak augmentation to the low illumination video sequence for data perturbation.
Step 3, taking the degraded video sequence and the weakly augmented video sequence as model input, and using the real label to constrain the prediction result of the degraded video sequence; meanwhile, carrying out feature level domain adaptation on the normal illumination video features and the low illumination video features.
Stage three:
Step 1, randomly sampling a 64-frame continuous video sequence from each of the normal illumination source video and the low illumination source video (if a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted).
Step 2, applying pixel level domain adaptation to the normal illumination video sequence, and applying weak augmentation and the frame mixing technique to the low illumination video sequence for data perturbation.
Step 3, taking the degraded video sequence, the weakly augmented video sequence and the frame-mixed video sequence as model input, and using the real label to constrain the prediction result of the degraded video sequence; meanwhile, carrying out feature level domain adaptation on the normal illumination video features and the low illumination video features; and computing a consistency loss between the prediction result of the weakly augmented video sequence and the prediction result of the frame-mixed video sequence to perform consistency training.
Step 4, after training is finished, only one model is obtained because the model parameters are shared across the stages. A 64-frame continuous video sequence is sampled from the low illumination source video to be inferred (if the source video has fewer than 64 frames, the cyclic frame-taking strategy is adopted) and used directly as the input of the model, so as to predict the behavior recognition label of that low illumination video.
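The clip sampling used in each stage above can be illustrated with a short sketch. The following Python example of random 64-frame clip sampling with the cyclic frame-taking strategy is a minimal illustration; the function and parameter names are assumptions rather than part of the patent.

import numpy as np

def sample_clip(video_frames, clip_len=64):
    """Randomly sample a contiguous clip of clip_len frames.
    If the video is shorter than clip_len, loop over its frames
    (cyclic frame-taking) until the clip is full."""
    n = len(video_frames)
    if n >= clip_len:
        start = np.random.randint(0, n - clip_len + 1)
        indices = range(start, start + clip_len)
    else:
        # cyclic frame-taking strategy: index modulo the video length
        indices = [i % n for i in range(clip_len)]
    return [video_frames[i] for i in indices]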
As shown in fig. 2, the complete method of this embodiment is composed of three parts: pixel level domain adaptation, feature level domain adaptation, and consistency training between cross-domain videos. First, this embodiment implements pixel level domain adaptation by performing degradation processing on the normal illumination video, to narrow the domain gap between the normal illumination video and the low illumination video. Then, feature level domain adaptation is performed between the low illumination video and the normal illumination video. To take full advantage of the labeled normal illumination video data, this embodiment uses the labeled normal illumination video to provide basic constraints for the model to learn knowledge about the recognized actions. Finally, this embodiment uses a novel frame mixing technique (BFA) to increase the input perturbation of the model during consistency training, thereby further improving the stability of the model.
As shown in fig. 3, this embodiment selects R(2+1)D BERT as the classifier, where FC is a fully connected layer that predicts the final result from the extracted feature vector. The following sections describe the details of each part of V-DixMatch.
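As a rough illustration of this classifier structure (not the exact architecture of the patent), the sketch below wraps an abstract video backbone, standing in for the R(2+1)D + BERT feature extractor, with a fully connected layer; all names are illustrative assumptions.

import torch.nn as nn

class BehaviorClassifier(nn.Module):
    """Sketch of the classifier in Fig. 3: a video backbone producing a
    pooled feature vector, followed by an FC layer giving class logits."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone      # e.g. an R(2+1)D + BERT feature extractor
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):
        feat = self.backbone(clip)    # (batch, feat_dim) pooled video feature
        logits = self.fc(feat)        # (batch, num_classes) action logits
        return logits, feat           # the feature is reused for domain adaptation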
1. Pixel level domain adaptation
The embodiment first adopts pixel-level adaptation to perform degradation processing on a normally-illuminated video from two aspects, namely contrast and noise. Inspired by LEDNet, the degradation process can be defined as:
N_deg = D_eg(N_in) + δ
wherein N_in is a normal illumination frame, D_eg(·) is the dimming operation applied to N_in, δ is random Gaussian noise, and N_deg is the degraded video frame. The dimming operation is consistent with the operation used in LEDNet.
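A minimal sketch of this degradation step is given below; the gamma-style darkening only stands in for the LEDNet dimming operation D_eg, and the parameter ranges are assumed for illustration.

import numpy as np

def degrade_frame(frame, gamma_range=(1.5, 3.0), noise_sigma=0.03):
    """Approximate N_deg = D_eg(N_in) + delta on one uint8 frame:
    darken the frame (contrast/brightness reduction) and add Gaussian noise."""
    img = frame.astype(np.float32) / 255.0
    gamma = np.random.uniform(*gamma_range)
    dimmed = np.power(img, gamma)                            # stand-in for D_eg
    noise = np.random.normal(0.0, noise_sigma, img.shape)    # delta
    degraded = np.clip(dimmed + noise, 0.0, 1.0)
    return (degraded * 255.0).astype(np.uint8)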
2. Feature level domain adaptation
As shown in fig. 1, this embodiment uses CDAN for domain adaptation, where the features and prediction results extracted from the low illumination and normal illumination video frames are used as the input of the CDAN domain adaptation, so that the features the model extracts from normal illumination video and from low illumination video are progressively brought as close as possible, thereby improving the robustness of the model in behavior recognition on low illumination video.
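The conditional adversarial adaptation can be sketched as follows. This is an illustrative PyTorch-style rendering of the CDAN idea (features conditioned on predictions fed to a domain discriminator), not the patent's exact network; the gradient-reversal scheduling used in practice is omitted and all names are assumptions.

import torch
import torch.nn as nn

class ConditionalDomainDiscriminator(nn.Module):
    """Discriminator over the outer product of features and class predictions
    (CDAN-style multilinear conditioning)."""
    def __init__(self, feat_dim, num_classes, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, feat, prob):
        # outer product of predictions and features, flattened per sample
        cond = torch.bmm(prob.unsqueeze(2), feat.unsqueeze(1)).flatten(1)
        return self.net(cond)

def cdan_loss(disc, feat_normal, logits_normal, feat_low, logits_low):
    """Binary domain loss: normal illumination domain = 1, low illumination = 0."""
    bce = nn.BCEWithLogitsLoss()
    d_n = disc(feat_normal, logits_normal.softmax(dim=1).detach())
    d_l = disc(feat_low, logits_low.softmax(dim=1).detach())
    return bce(d_n, torch.ones_like(d_n)) + bce(d_l, torch.zeros_like(d_l))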
3. Frame-blending based consistency training
3.1 frame hybrid data augmentation
In order to make maximum use of the collected pseudo-label samples during consistency training, this embodiment proposes a frame-mix data augmentation technique (BFA). The specific details of BFA are as follows:
1) Continuously sample N frames of a video sample, and call this segment's sequence F:
F = (f_1, f_2, ..., f_N)
2) Apply weak augmentation (multi-scale cropping and random horizontal flipping) and strong augmentation (random flipping, random rotation, random Gaussian noise, random blur, random distortion, random brightness and random contrast) to F:
F_weak = (f_w1, f_w2, ..., f_wN)
F_strong = (f_s1, f_s2, ..., f_sN)
3) Set the ratio α of weakly augmented frames in the mixed clip; α is set to 0.3 in this embodiment. Then randomly acquire the start position P of the weakly augmented frames as follows:
λ ~ Uniform(α, 1 - α)
P ~ Uniform(1, (1 - λ)N + 1)
4) Synthesize the mixed clip F_BFA, in which frames 1 to P-1 and frames P+λN to N are strongly augmented frames, and frames P to P+λN-1 are weakly augmented frames. The mixed clip F_BFA can be represented by the following set:
F_BFA = (f_s1, ..., f_s(P-1), f_wP, ..., f_w(P+λN-1), f_s(P+λN), ..., f_sN)
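A minimal sketch of BFA under these rules is shown below; the function name and the 0-based indexing are assumptions made for illustration.

import numpy as np

def frame_mix(frames_weak, frames_strong, alpha=0.3):
    """Splice a contiguous run of weakly augmented frames into the strongly
    augmented clip: lambda ~ U(alpha, 1-alpha) sets the run length, and the
    start position is drawn uniformly (0-based here; the text uses 1-based)."""
    N = len(frames_weak)
    lam = np.random.uniform(alpha, 1.0 - alpha)
    run = int(round(lam * N))                    # number of weak frames
    start = np.random.randint(0, N - run + 1)    # start position of the weak run
    mixed = list(frames_strong)
    mixed[start:start + run] = frames_weak[start:start + run]
    return mixed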
3.2 loss of consistency
The model predicts on the weakly augmented video sequence, takes a prediction result whose output confidence is higher than 0.8 as the pseudo label of the source video of that sequence, and uses the pseudo label as a real label to constrain the model's prediction output for the frame-mixed video sequence (the classification loss is computed with the Focal loss).
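One way to implement this pseudo-label consistency loss is sketched below; the focal parameter gamma = 2.0 is an assumed value that is not stated in the text.

import torch
import torch.nn.functional as F

def consistency_loss(logits_weak, logits_mixed, threshold=0.8, gamma=2.0):
    """Pseudo labels come from the weakly augmented clip; only predictions
    above the 0.8 confidence threshold supervise the frame-mixed clip,
    using a focal-style classification loss."""
    with torch.no_grad():
        probs = logits_weak.softmax(dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf.ge(threshold).float()        # keep only confident samples
    log_p = F.log_softmax(logits_mixed, dim=1)
    log_pt = log_p.gather(1, pseudo.unsqueeze(1)).squeeze(1)
    focal = -((1.0 - log_pt.exp()) ** gamma) * log_pt
    return (focal * mask).sum() / mask.sum().clamp_min(1.0)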
3.3 three-stage training strategy
The model of this embodiment is first pre-trained in stage one for 50 epochs using the normal illumination video together with pixel level domain adaptation; then stage-two training is carried out on the basis of stage one by adding feature level domain adaptation, for 30 epochs; finally, stage-three training is carried out on the basis of stage two by adding the consistency training based on the BFA technique, for another 30 epochs.
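The three-stage schedule can be summarized as a small configuration; the loss names and the callback interface below are assumptions, since the text specifies only which components are active in each stage and the epoch counts.

STAGES = [
    {"epochs": 50, "losses": ("supervised",)},                       # stage one
    {"epochs": 30, "losses": ("supervised", "cdan")},                # stage two
    {"epochs": 30, "losses": ("supervised", "cdan", "consistency")}  # stage three
]

def run_schedule(train_one_epoch):
    """train_one_epoch(active_losses) is a user-supplied training callback."""
    for stage in STAGES:
        for _ in range(stage["epochs"]):
            train_one_epoch(stage["losses"])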
In conclusion, the invention constructs a low-illumination video behavior recognition method and system, V-DixMatch, based on semi-supervised learning, which uses labeled normal illumination videos and unlabeled low illumination videos as the data sets to improve the accuracy of the model system in recognizing behaviors in low illumination videos. The method comprises three main parts, namely pixel level adaptation, feature level adaptation and consistency training based on frame mixing; within the consistency training process, a novel frame-mix data augmentation technique is proposed, which further improves the feature extraction capability and robustness of the model in low illumination scenes. Experimental comparison shows that the accuracy of this embodiment is about 4% higher than that of the current state-of-the-art method (whose top-1 accuracy is about 80%), reaching about 84%; compared with full supervision that directly uses labeled low illumination video (top-1 accuracy about 86%), the gap is only about 2%. This is sufficient to demonstrate the effectiveness of the method of this embodiment.
The method can be applied to high-level visual tasks in low illumination scenes: besides behavior recognition, high-level tasks such as object detection and face recognition can also adopt the semi-supervised learning method V-DixMatch provided by this embodiment. The frame-mix data augmentation technique of the invention can be applied to any video-related task and can also be generalized to frame mixing between different types of videos. The three-stage training strategy can likewise be applied to consistency learning tasks between other pairs of domains.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise indicated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is to be determined from the appended claims along with their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A low-light video behavior identification method based on semi-supervised learning is characterized by comprising the following steps:
acquiring a normal illumination source video and a low illumination source video;
performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
using the first video sequence and the second video sequence as model input, and using a real label to constrain a model prediction result of the first video sequence, so as to perform supervised training under a normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; carrying out consistency training on the prediction result of the video sequence after the weak augmentation and frame mixing processing to obtain a target model;
and inputting the video sequence to be identified into the target model, and generating a prediction label result corresponding to the video sequence to be identified.
2. The method for identifying the behavior of the low-light video based on the semi-supervised learning as recited in claim 1, wherein the acquiring the normal-light source video and the low-light source video comprises:
randomly sampling the normal illumination source video to obtain a 64-frame continuous video sequence;
randomly sampling the low illumination source video to obtain a 64-frame continuous video sequence;
when a source video has fewer than 64 frames, a cyclic frame-taking strategy is adopted to sample the source video to obtain the corresponding video sequence.
3. The method for low-light video behavior recognition based on semi-supervised learning as claimed in claim 1, wherein the performing pixel-level domain adaptation on the normal-light source video to obtain a first video sequence comprises:
performing degradation processing on the normal illumination source video in terms of both contrast and noise by adopting pixel-level adaptation;
wherein the expression of the degradation process is:
N_deg = D_eg(N_in) + δ
wherein N_in is a normal illumination frame; D_eg(·) is the dimming operation applied to N_in; δ is random Gaussian noise; and N_deg is the degraded video frame.
4. The method for identifying the behavior of the low-light video based on the semi-supervised learning as recited in claim 1, wherein the performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence comprises:
continuously sampling a plurality of frames of the low illumination source video to obtain a third video sequence;
performing weak augmentation processing and strong augmentation processing on the third video sequence to obtain a first processing result, wherein the first processing result comprises the weakly augmented video sequence and the strongly augmented video sequence;
setting the ratio of weakly augmented frames in a mixed frame sequence, and randomly acquiring the start position of the weakly augmented frames according to the ratio;
synthesizing the mixed frame sequence according to the start position of the weakly augmented frames;
wherein the weak augmentation processing comprises at least one of: multi-scale cropping and random horizontal flipping;
the strong augmentation processing comprises at least one of: random flipping, random rotation, random Gaussian noise, random blur, random distortion, random brightness, and random contrast.
5. The method for low-light video behavior recognition based on semi-supervised learning according to claim 1, wherein the performing feature level domain adaptation on the features extracted from the normal-light source video and the features extracted from the low-light source video comprises:
performing domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video by using CDAN, wherein the features extracted from the normal illumination source video and the low illumination source video and the corresponding prediction results are used as the input of the CDAN domain adaptation.
6. The method for identifying low-light video behaviors based on semi-supervised learning according to claim 1, wherein the performing consistency training on the prediction result of the video sequence after the weak augmentation and frame mixing processing comprises:
predicting the weakly augmented video sequence, and taking a prediction result with output confidence higher than 0.8 as a pseudo label of the source video of the sequence;
using the pseudo label as a real label to constrain the model's prediction output for the frame-mixed video sequence.
7. A low-light video behavior recognition system based on semi-supervised learning, comprising:
a first module for acquiring a normal illumination source video and a low illumination source video;
a second module for performing pixel level domain adaptation on the normal illumination source video to obtain a first video sequence, and performing weak augmentation and frame mixing processing on the low illumination source video to obtain a second video sequence;
a third module for taking the first video sequence and the second video sequence as model input, and using a real label to constrain the model prediction result of the first video sequence, so as to perform supervised training under normal illumination video; carrying out feature level domain adaptation on the features extracted from the normal illumination source video and the features extracted from the low illumination source video; and carrying out consistency training on the prediction results of the video sequences after weak augmentation and frame mixing processing to obtain a target model;
and a fourth module for inputting the video sequence to be recognized into the target model and generating a prediction label result corresponding to the video sequence to be recognized.
8. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program realizes the method of any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the storage medium stores a program which is executed by a processor to implement the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the method according to any of claims 1 to 6 when executed by a processor.
CN202210788338.2A 2022-07-06 2022-07-06 Low-light video behavior recognition method and system based on semi-supervised learning Active CN115205739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210788338.2A CN115205739B (en) 2022-07-06 2022-07-06 Low-light video behavior recognition method and system based on semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210788338.2A CN115205739B (en) 2022-07-06 2022-07-06 Low-light video behavior recognition method and system based on semi-supervised learning

Publications (2)

Publication Number Publication Date
CN115205739A true CN115205739A (en) 2022-10-18
CN115205739B CN115205739B (en) 2023-11-28

Family

ID=83578369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210788338.2A Active CN115205739B (en) 2022-07-06 2022-07-06 Low-light video behavior recognition method and system based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN115205739B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200082221A1 (en) * 2018-09-06 2020-03-12 Nec Laboratories America, Inc. Domain adaptation for instance detection and segmentation
CN112183456A (en) * 2020-10-19 2021-01-05 北京深睿博联科技有限责任公司 Multi-scene moving object detection method and device based on sample generation and domain adaptation
CN112307995A (en) * 2020-11-05 2021-02-02 电子科技大学 Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN112966723A (en) * 2021-02-08 2021-06-15 北京百度网讯科技有限公司 Video data augmentation method, video data augmentation device, electronic device and readable storage medium
CN113158815A (en) * 2021-03-27 2021-07-23 复旦大学 Unsupervised pedestrian re-identification method, system and computer readable medium
CN113469289A (en) * 2021-09-01 2021-10-01 成都考拉悠然科技有限公司 Video self-supervision characterization learning method and device, computer equipment and medium
CN114332568A (en) * 2022-03-16 2022-04-12 中国科学技术大学 Training method, system, equipment and storage medium of domain adaptive image classification network
CN114693615A (en) * 2022-03-17 2022-07-01 常州工学院 Deep learning concrete bridge crack real-time detection method based on domain adaptation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200082221A1 (en) * 2018-09-06 2020-03-12 Nec Laboratories America, Inc. Domain adaptation for instance detection and segmentation
CN112183456A (en) * 2020-10-19 2021-01-05 北京深睿博联科技有限责任公司 Multi-scene moving object detection method and device based on sample generation and domain adaptation
CN112307995A (en) * 2020-11-05 2021-02-02 电子科技大学 Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN112966723A (en) * 2021-02-08 2021-06-15 北京百度网讯科技有限公司 Video data augmentation method, video data augmentation device, electronic device and readable storage medium
CN113158815A (en) * 2021-03-27 2021-07-23 复旦大学 Unsupervised pedestrian re-identification method, system and computer readable medium
CN113469289A (en) * 2021-09-01 2021-10-01 成都考拉悠然科技有限公司 Video self-supervision characterization learning method and device, computer equipment and medium
CN114332568A (en) * 2022-03-16 2022-04-12 中国科学技术大学 Training method, system, equipment and storage medium of domain adaptive image classification network
CN114693615A (en) * 2022-03-17 2022-07-01 常州工学院 Deep learning concrete bridge crack real-time detection method based on domain adaptation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JAEMIN NA ET AL: "FixBi: Bridging Domain Spaces for Unsupervised Domain Adaptation", IEEE, pages 1094-1103 *
M. ESAT KALFAOGLU ET AL: "Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition", arXiv:2008.01232v3, pages 1-19 *
LI ZHIFEI AND DUANMU CHUNJIANG: "Human abnormal behavior recognition based on semi-supervised learning", Journal of Zhejiang Normal University (Natural Sciences), pages 258-262 *

Also Published As

Publication number Publication date
CN115205739B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
Ren et al. Deep video dehazing with semantic segmentation
Li et al. End-to-end united video dehazing and detection
Sixt et al. Rendergan: Generating realistic labeled data
Yasarla et al. Uncertainty guided multi-scale residual learning-using a cycle spinning cnn for single image de-raining
Gomez-Ojeda et al. Learning-based image enhancement for visual odometry in challenging HDR environments
CN110969589A (en) Dynamic scene fuzzy image blind restoration method based on multi-stream attention countermeasure network
Pan et al. Joint stereo video deblurring, scene flow estimation and moving object segmentation
Guo et al. Joint raindrop and haze removal from a single image
Xing et al. Domain adaptive video segmentation via temporal pseudo supervision
Choo et al. Learning background subtraction by video synthesis and multi-scale recurrent networks
Cao et al. A two-stage density-aware single image deraining method
Jalata et al. Eqadap: Equipollent domain adaptation approach to image deblurring
CN111625661A (en) Audio and video segment classification method and device
Shin et al. Unsupervised domain adaptation for video semantic segmentation
Wang et al. PFDN: Pyramid feature decoupling network for single image deraining
Dai et al. A gated cross-domain collaborative network for underwater object detection
Wang et al. Undaf: A general unsupervised domain adaptation framework for disparity or optical flow estimation
Lin et al. Real-time foreground object segmentation networks using long and short skip connections
CN115205739B (en) Low-light video behavior recognition method and system based on semi-supervised learning
Wang et al. A multi-scale attentive recurrent network for image dehazing
Kajo et al. Tensor based completion meets adversarial learning: a win–win solution for change detection on unseen videos
Tang et al. Removal of visual disruption caused by rain using cycle-consistent generative adversarial networks
Xing et al. Improved shallow-uwnet for underwater image enhancement
Kusuma et al. Modular ST-MRF environment for moving target detection and tracking under adverse local conditions
Song et al. Pixel-wise object tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant