CN118052985A - Low-light video target segmentation method based on event signal driving - Google Patents
Low-light video target segmentation method based on event signal driving
- Publication number: CN118052985A
- Application number: CN202410215980.0A
- Authority: CN (China)
- Legal status: Pending
Abstract
The invention discloses a low-light video target segmentation method driven by event signals, comprising the following steps: 1. prepare video data in a low-light scene, the target masks, and the corresponding event sequence; 2. construct a low-light video target segmentation model; 3. train the constructed low-light video target segmentation neural network offline; 4. use the trained model to predict masks in the low-light scene, thereby achieving low-light video target segmentation. By using event data as a driving signal, the method improves video target segmentation in low-light scenes and generates accurate target masks.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a low-light video target segmentation method based on event signal driving.
Background
Video object segmentation is a central research topic in computer vision; its main task is to accurately identify and track one or more target objects in a video sequence. The technology has a very wide range of applications, spanning environment perception for autonomous driving systems, video surveillance for urban safety, and video editing software offering innovative interaction modes. With the rapid development and adoption of deep learning, video object segmentation has made a qualitative leap: when processing high-definition video input, existing methods can both segment objects with high precision and track their motion trajectories stably.
However, despite these achievements under standard lighting conditions, video object segmentation still faces significant challenges in low light. In such environments the video frames suffer serious quality degradation, including a marked increase in noise, massive loss of scene detail, and severe color distortion, all of which directly affect the accuracy of segmentation and the stability of tracking. More importantly, most current video object segmentation techniques rely on clear, high-quality video input, a condition that is difficult to meet in practical scenarios such as night-time surveillance or low-illumination autonomous driving. This dependence on high-quality input greatly limits the practical potential of video object segmentation in low-light environments.
Disclosure of Invention
To overcome these shortcomings of the prior art, the invention provides a low-light video target segmentation method driven by event signals. By exploiting the high dynamic range of event data and its ability to capture high-speed motion, the method aims to improve the robustness of video target segmentation in low light, the segmentation of moving objects, and the overall segmentation quality in low-light scenes.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The invention discloses a low-light video target segmentation method based on event signal driving, which is characterized by comprising the following steps:
Step 1, acquiring a video image set I in a low-light scene, a target mask set Y, and the corresponding event sequence E;
Step 2, constructing a low-light video target segmentation neural network comprising a multi-modal encoder and an event-guided memory matching module:
Step 2.1, the multi-modal encoder extracts features from I and E to obtain mixed features;
Step 2.2, the event-guided memory matching module processes the mixed features to obtain a predicted target mask;
Step 3, constructing a total loss function based on the predicted target mask and the target mask set Y;
Step 4, training the low-light video target segmentation neural network by a gradient descent method, computing the total loss function L to update the network parameters, and stopping when the number of training iterations reaches a set value or L converges, thereby obtaining the optimal low-light video target segmentation neural network for processing low-light video images into the corresponding predicted masks.
The low-light video target segmentation method based on event signal driving is also characterized in that the step 1 is performed according to the following steps:
Step 1.1.1, acquiring a video image set I = {I_1, I_2, …, I_t, …, I_T} of the low-light scene and a video image set N = {N_1, N_2, …, N_t, …, N_T} of the corresponding normal-light scene, where I_t denotes the low-light image at time t, N_t denotes the normal-light image at time t, and T is the number of frames;
Step 1.1.2, labeling the target masks of the normal-light video image set with a labeling tool to obtain the target mask set Y = {y_1, y_2, …, y_t, …, y_T} shared by the low-light and normal-light video image sets, where y_t denotes the target mask of the low-light image I_t and the normal-light image N_t at time t;
Step 1.1.3, obtaining the event sequence of the low-light video image set I, denoted E = {E_{0,1}, E_{1,2}, …, E_{t-1,t}, …, E_{T-1,T}}, where E_{t-1,t} denotes the low-light events between the low-light image I_{t-1} at time t-1 and the low-light image I_t at time t.
The multi-modal encoder in step 2.1 comprises an image encoder, an event encoder, and an adaptive cross-modal fusion module;
Step 2.1.1, the image encoder consists of m residual modules and n downsampling modules;
the low-light image I_t at time t is fed into the image encoder for feature extraction, giving the multi-scale image feature F_t^Img of I_t;
Step 2.1.2, the event encoder consists of m residual modules and n downsampling modules;
the low-light events E_{t-1,t} from time t-1 to time t are fed into the event encoder for feature extraction, giving the multi-scale event feature F_t^Evt of E_{t-1,t};
Step 2.1.3, the adaptive cross-modal fusion module concatenates the multi-scale image feature F_t^Img and the multi-scale event feature F_t^Evt along the channel dimension, then applies convolution and average pooling to obtain the multi-scale mixed feature F_t^Cat at time t;
F_t^Cat is multiplied element-wise with F_t^Img and with F_t^Evt, yielding the screened multi-scale image feature F̂_t^Img and the screened multi-scale event feature F̂_t^Evt at time t;
the screened multi-scale event feature F̂_t^Evt undergoes a channel attention operation followed by a spatial attention operation, giving the multi-scale event attention feature Â_t^Evt at time t; Â_t^Evt is summed with the screened multi-scale image feature F̂_t^Img to obtain the multi-scale image feature F̃_t^Img fusing the event information at time t;
F̃_t^Img and the screened multi-scale event feature F̂_t^Evt are convolved and summed to obtain the mixed feature F_t at time t.
The event-guided memory matching module in step 2.2 comprises a memory storage module, an event guidance module, an attention matching module, and a mask decoder;
Step 2.2.1, the memory storage module applies linear transformations to the mixed feature F_t at time t to obtain the key K_t and the value V_t at time t; the mask at time t is denoted Mask_t, initialized at t = 1;
Step 2.2.2, the event guidance module concatenates Mask_t with F_t^Evt along the channel dimension and extracts multi-scale information through convolutions of different kernel sizes and pooling to obtain the filter signal SE_t at time t; SE_t is point-multiplied with F_t^Evt and with Mask_t respectively and the results are summed, finally outputting the strengthened guide signal G_t at time t;
Step 2.2.3, the attention matching module obtains the filtered key K_t′ at time t using formula (1):
K_t′ = K_t · G_t (1)
the attention matrix A_{t+1} at time t+1 is obtained using formula (2):
A_{t+1} = Softmax(Q_{t+1} · (K_t′)^Tr / √d_k) (2)
In formula (2), Q_{t+1} denotes the query obtained from F_{t+1} by linear transformation, d_k denotes the channel dimension of Q_{t+1} and K_t′, Softmax denotes the activation function, and Tr denotes matrix transposition;
the matching result R_{t+1} at time t+1 is obtained using formula (3):
R_{t+1} = A_{t+1}(G_t + V_t) (3)
Step 2.2.4, the mask decoder consists of convolution layers and upsampling layers;
the matching result R_{t+1} and the mixed feature F_{t+1} are concatenated along the channel dimension and fed into the mask decoder, which outputs the predicted target mask Mask_{t+1} at time t+1.
Step 3 is carried out as follows:
Step 3.1, constructing the cross-entropy loss function L_ce^t at time t using formula (4):
L_ce^t = −Σ_p [ y_t(p) log Mask_t(p) + (1 − y_t(p)) log(1 − Mask_t(p)) ] (4)
Step 3.2, constructing the Soft Jaccard loss function L_jac^t at time t using formula (5):
L_jac^t = 1 − Σ_p (y_t(p) · Mask_t(p)) / Σ_p (y_t(p) + Mask_t(p) − y_t(p) · Mask_t(p)) (5)
Step 3.3, constructing the total loss function L at time t using formula (6):
L = α · L_ce^t + β · L_jac^t (6)
In formula (6), α and β are two weighting coefficients, and p indexes the pixels.
The electronic device of the present invention includes a memory and a processor, wherein the memory is configured to store a program for supporting the processor to execute the low-light video object segmentation method, and the processor is configured to execute the program stored in the memory.
The invention also relates to a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, performs the steps of the low-light video object segmentation method.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention proposes an innovative event-signal-driven video object segmentation network that combines event signal data with conventional video segmentation techniques, applying the unique advantages of event signals to the low-light video object segmentation task. The method markedly improves the robustness of video object segmentation in low-illumination environments, and shows clear advantages over mainstream video object segmentation techniques in the segmentation accuracy and stability of fast-moving objects.
2. The invention develops an adaptive cross-modal fusion module. The module adopts a multi-scale fusion strategy that strengthens the information fusion of image frames and event data and exploits the illumination robustness of event data under low-light conditions, significantly improving video object segmentation performance across illumination conditions.
3. The method creatively fuses the event signal with the target mask feature to generate a signal that guides the segmentation network, effectively improving mask matching when the network processes low-illumination video sequences. This addresses the segmentation degradation caused by low matching accuracy under low illumination and enhances the system's applicability in complex environments.
4. The invention trains in a supervised manner and embeds event information deeply into the video target segmentation network, improving the quality of the output masks.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a block diagram of an adaptive cross-modality fusion module of the present invention;
FIG. 3 is a diagram illustrating an event guided memory matching module according to the present invention.
Detailed Description
In this embodiment, facing the challenges posed by low-illumination environments, a low-light video target segmentation method driven by event signals is provided. It builds an adaptive cross-modal fusion module and an event-guided memory matching module by exploiting the high dynamic range of event data and its capture of high-speed motion. The method adapts to video quality degradation under low illumination and reduces the dependence on high-quality video input, thereby improving target segmentation and tracking under such conditions, expanding the application range of video target segmentation technology, and improving its practicality and reliability in complex environments. As shown in fig. 1, the method comprises the following steps:
Step 1, obtaining video data in a low-light scene, the target masks, and the corresponding event sequence:
Step 1.1.1, acquiring a video image set I = {I_1, I_2, …, I_t, …, I_T} of the low-light scene and a video image set N = {N_1, N_2, …, N_t, …, N_T} of the corresponding normal-light scene, where I_t denotes the low-light image at time t, N_t denotes the normal-light image at time t, and T is the number of frames; in this example, T = 5 frames are used during neural network training.
Step 1.1.2, labeling the target masks of the normal-light video image set with a labeling tool to obtain the target mask set Y = {y_1, y_2, …, y_t, …, y_T} shared by the low-light and normal-light video image sets, where y_t denotes the target mask of the low-light image I_t and the normal-light image N_t at time t.
Step 1.1.3, obtaining the event sequence of the low-light video image set I, denoted E = {E_{0,1}, E_{1,2}, …, E_{t-1,t}, …, E_{T-1,T}}, where E_{t-1,t} denotes the low-light events between the low-light image I_{t-1} at time t-1 and the low-light image I_t at time t.
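For concreteness, the sketch below shows one common way to turn an event packet E_{t-1,t} into a dense tensor that an event encoder can consume. The patent does not fix a representation; the voxel-grid form, the bin count, and the sensor resolution used here are assumptions.

```python
# Hedged sketch: accumulate an event packet E_{t-1,t} into a voxel grid.
# The voxel-grid representation, bins=5, and the 480x640 resolution are
# assumptions, not specified by the patent.
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, bins=5, height=480, width=640):
    """Accumulate events (x, y, timestamp, polarity) into a (bins, H, W) grid."""
    grid = np.zeros((bins, height, width), dtype=np.float32)
    if len(ts) == 0:
        return grid
    # Normalize this packet's timestamps to [0, bins - 1].
    t_norm = (np.asarray(ts) - ts[0]) / max(ts[-1] - ts[0], 1e-9) * (bins - 1)
    for x, y, t, p in zip(xs, ys, t_norm, ps):
        b = int(t)  # temporal bin index
        grid[b, int(y), int(x)] += 1.0 if p > 0 else -1.0
    return grid
```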
Step 2, constructing the low-light video target segmentation neural network, which, as shown in fig. 1, comprises a multi-modal encoder and an event-guided memory matching module:
Step 2.1, the multi-modal encoder comprises an image encoder, an event encoder, and an adaptive cross-modal fusion module;
Step 2.1.1, the image encoder consists of m residual modules and n downsampling modules; in this example, m = 4 and n = 3.
The low-light image I_t at time t is fed into the image encoder for feature extraction, giving the multi-scale image feature F_t^Img of I_t.
Step 2.1.2, the event encoder consists of m residual modules and n downsampling modules; in this example, m = 4 and n = 3.
The low-light events E_{t-1,t} from time t-1 to time t are fed into the event encoder for feature extraction, giving the multi-scale event feature F_t^Evt of E_{t-1,t}.
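A minimal PyTorch sketch of the twin encoders of steps 2.1.1 and 2.1.2, with m = 4 residual modules and n = 3 downsampling modules as in this embodiment. The channel widths, the placement of downsampling between residual stages, and returning only the final scale (the patent uses multi-scale features from every stage) are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class Encoder(nn.Module):
    """m residual modules interleaved with n stride-2 downsampling modules."""
    def __init__(self, in_ch, base=64, m=4, n=3):
        super().__init__()
        layers, ch = [nn.Conv2d(in_ch, base, 3, padding=1)], base
        for i in range(m):
            layers.append(ResidualBlock(ch))
            if i < n:  # assumed placement: downsample after the first n stages
                layers.append(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1))
                ch *= 2
        self.stages = nn.Sequential(*layers)
    def forward(self, x):
        return self.stages(x)

image_encoder = Encoder(in_ch=3)  # RGB low-light frame I_t
event_encoder = Encoder(in_ch=5)  # e.g., a 5-bin event voxel grid for E_{t-1,t}
```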
Step 2.1.3, as shown in fig. 2, the adaptive cross-modal fusion module concatenates the multi-scale image feature F_t^Img and the multi-scale event feature F_t^Evt along the channel dimension, then applies convolution and average pooling to obtain the multi-scale mixed feature F_t^Cat at time t;
F_t^Cat is multiplied element-wise with F_t^Img and with F_t^Evt, yielding the screened multi-scale image feature F̂_t^Img and the screened multi-scale event feature F̂_t^Evt at time t;
the screened multi-scale event feature F̂_t^Evt undergoes a channel attention operation followed by a spatial attention operation, giving the multi-scale event attention feature Â_t^Evt at time t; Â_t^Evt is summed with the screened multi-scale image feature F̂_t^Img to obtain the multi-scale image feature F̃_t^Img fusing the event information at time t;
F̃_t^Img and the screened multi-scale event feature F̂_t^Evt are convolved and summed to obtain the mixed feature F_t at time t.
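A hedged single-scale sketch of the adaptive cross-modal fusion module of step 2.1.3 (the patent applies it per scale). The exact channel/spatial attention forms and the final "convolve and sum" combination are assumptions; squeeze-and-excitation-style channel attention and a 7×7 spatial attention are used here.

```python
import torch
import torch.nn as nn

class AdaptiveCrossModalFusion(nn.Module):
    """Sketch of step 2.1.3 at a single scale; attention forms are assumed."""
    def __init__(self, ch):
        super().__init__()
        self.mix = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1),
                                 nn.AvgPool2d(3, stride=1, padding=1))
        # Channel attention (squeeze-and-excitation style) -- assumed form.
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        # Spatial attention -- assumed form.
        self.sa = nn.Sequential(nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid())
        self.out = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, f_img, f_evt):
        f_cat = self.mix(torch.cat([f_img, f_evt], dim=1))  # F_t^Cat
        img_s = f_cat * f_img                               # screened image feature
        evt_s = f_cat * f_evt                               # screened event feature
        evt_att = evt_s * self.ca(evt_s)                    # channel attention
        evt_att = evt_att * self.sa(evt_att)                # spatial attention
        fused = evt_att + img_s                             # image feature fused with events
        return self.out(fused) + evt_s                      # mixed feature F_t (assumed combination)
```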
Step 2.2, the event-guided memory matching module comprises a memory storage module, an event guidance module, an attention matching module, and a mask decoder;
Step 2.2.1, the memory storage module applies linear transformations to the mixed feature F_t at time t to obtain the key K_t and the value V_t at time t; the mask at time t is denoted Mask_t, initialized at t = 1;
Step 2.2.2, as shown in fig. 3, the event guidance module concatenates Mask_t with F_t^Evt along the channel dimension and extracts multi-scale information through convolutions of different kernel sizes and pooling to obtain the filter signal SE_t at time t; SE_t is point-multiplied with F_t^Evt and with Mask_t respectively and the results are summed, finally outputting the strengthened guide signal G_t at time t.
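A sketch of the event guidance module of step 2.2.2 under stated assumptions: the specific kernel sizes (3 and 5), the pooling choice, and the sigmoid gating that produces SE_t are not fixed by the patent.

```python
import torch
import torch.nn as nn

class EventGuidance(nn.Module):
    """Sketch of step 2.2.2: derive the guide signal G_t from Mask_t and F_t^Evt.
    Kernel sizes, pooling, and the sigmoid gate are assumptions."""
    def __init__(self, ch):
        super().__init__()
        self.reduce = nn.Conv2d(ch + 1, ch, 1)  # mask concatenated as one channel
        # Multi-scale extraction with different kernel sizes plus pooling.
        self.k3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.k5 = nn.Conv2d(ch, ch, 5, padding=2)
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.gate = nn.Sigmoid()

    def forward(self, mask_t, f_evt):
        # mask_t: (B, 1, H, W); f_evt: (B, C, H, W)
        x = self.reduce(torch.cat([mask_t, f_evt], dim=1))
        se_t = self.gate(self.k3(x) + self.k5(x) + self.pool(x))  # filter signal SE_t
        # SE_t gates the event feature and the mask, then the two are summed.
        return se_t * f_evt + se_t * mask_t.expand_as(f_evt)      # guide signal G_t
```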
Step 2.2.3, the attention matching module obtains the filtered key K_t′ at time t using formula (1):
K_t′ = K_t · G_t (1)
the attention matrix A_{t+1} at time t+1 is obtained using formula (2):
A_{t+1} = Softmax(Q_{t+1} · (K_t′)^Tr / √d_k) (2)
In formula (2), Q_{t+1} denotes the query vector obtained from F_{t+1} by linear transformation, d_k denotes the channel dimension of the vectors Q_{t+1} and K_t′, Softmax denotes the activation function, and Tr denotes matrix transposition.
The matching result R_{t+1} at time t+1 is obtained using formula (3):
R_{t+1} = A_{t+1}(G_t + V_t) (3)
Step 2.2.4, the mask decoder consists of convolution layers and upsampling layers;
the matching result R_{t+1} and the mixed feature F_{t+1} are concatenated along the channel dimension and fed into the mask decoder, which outputs the predicted target mask Mask_{t+1} at time t+1.
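The attention matching of formulas (1)-(3) can be sketched as a single function. Flattening the spatial maps into tokens and using the channel dimension as d_k are assumptions, as is applying the guide signal by element-wise multiplication.

```python
import torch

def event_guided_matching(q_next, k_t, v_t, g_t):
    """Sketch of formulas (1)-(3); all inputs are (B, C, H, W) tensors
    produced by the modules above."""
    B, C, H, W = k_t.shape
    k_f = (k_t * g_t).flatten(2)               # (1) filtered key K_t' = K_t . G_t
    q_f = q_next.flatten(2)                    # query tokens from F_{t+1}
    # (2) A_{t+1} = Softmax(Q_{t+1} (K_t')^Tr / sqrt(d_k)), d_k = C assumed
    attn = torch.softmax(q_f.transpose(1, 2) @ k_f / C ** 0.5, dim=-1)
    gv = (g_t + v_t).flatten(2)                # (3) read out G_t + V_t
    r = (gv @ attn.transpose(1, 2)).view(B, C, H, W)
    return r                                   # matching result R_{t+1}
```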
Step 3, constructing the loss functions for training the low-light video target segmentation neural network:
Step 3.1, constructing the cross-entropy loss function L_ce^t at time t using formula (4):
L_ce^t = −Σ_p [ y_t(p) log Mask_t(p) + (1 − y_t(p)) log(1 − Mask_t(p)) ] (4)
Step 3.2, constructing the Soft Jaccard loss function L_jac^t at time t using formula (5):
L_jac^t = 1 − Σ_p (y_t(p) · Mask_t(p)) / Σ_p (y_t(p) + Mask_t(p) − y_t(p) · Mask_t(p)) (5)
Step 3.3, constructing the total loss function L at time t using formula (6):
L = α · L_ce^t + β · L_jac^t (6)
In formula (6), α and β are two weighting coefficients, and p indexes the pixels; in this example, α and β are both 0.5.
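A sketch of the training objective of formulas (4)-(6). The per-pixel binary cross-entropy and the product/sum soft-Jaccard variant are the common forms and are assumptions to the extent that the original formula images are not reproduced here.

```python
import torch

def total_loss(pred_mask, gt_mask, alpha=0.5, beta=0.5, eps=1e-6):
    """Sketch of formulas (4)-(6); pred_mask in (0, 1), gt_mask in {0, 1}."""
    p, y = pred_mask.flatten(1), gt_mask.flatten(1)
    # (4) per-pixel binary cross-entropy, averaged
    l_ce = -(y * torch.log(p + eps) + (1 - y) * torch.log(1 - p + eps)).mean()
    # (5) soft Jaccard: 1 - intersection / union, per sample then averaged
    inter = (p * y).sum(dim=1)
    union = p.sum(dim=1) + y.sum(dim=1) - inter
    l_jac = (1 - (inter + eps) / (union + eps)).mean()
    return alpha * l_ce + beta * l_jac  # (6) with alpha = beta = 0.5 here
```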
Step 4, training the low-light video target segmentation neural network by a gradient descent method, computing the total loss function L to update the network parameters, and stopping when the number of training iterations reaches a set value or L converges, thereby obtaining the optimal low-light video target segmentation neural network for processing low-light video images into the corresponding predicted masks.
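Step 4 then reduces to a standard supervised loop; a minimal sketch follows, reusing the total_loss sketch above. The optimizer choice (Adam as the gradient-descent method), the learning rate, the epoch count, the model signature, and initializing with the first-frame mask are all assumptions.

```python
import torch

def train(model, loader, epochs=100, lr=1e-4):
    """Minimal training-loop sketch for step 4; `model` is assumed to map
    (frames, events, initial mask) to the predicted masks of a clip."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # a gradient-descent method
    for epoch in range(epochs):
        for frames, events, masks in loader:           # e.g., T = 5 frames per clip
            pred = model(frames, events, masks[:, 0])  # first-frame mask as init (assumed)
            loss = total_loss(pred, masks)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # In practice, training also stops early once the total loss converges.
```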
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
Claims (7)
1. The low-light video target segmentation method based on event signal driving is characterized by comprising the following steps of:
Step 1, acquiring a video image set I in a low-light scene, a target mask set Y, and the corresponding event sequence E;
Step 2, constructing a low-light video target segmentation neural network comprising a multi-modal encoder and an event-guided memory matching module:
Step 2.1, the multi-modal encoder extracts features from I and E to obtain mixed features;
Step 2.2, the event-guided memory matching module processes the mixed features to obtain a predicted target mask;
Step 3, constructing a total loss function based on the predicted target mask and the target mask set Y;
Step 4, training the low-light video target segmentation neural network by a gradient descent method, computing the total loss function L to update the network parameters, and stopping when the number of training iterations reaches a set value or L converges, thereby obtaining the optimal low-light video target segmentation neural network for processing low-light video images into the corresponding predicted masks.
2. The method for splitting a low-light video object based on event signal driving according to claim 1, wherein the step 1 is performed as follows:
Step 1.1.1, acquiring a video image set I = {I_1, I_2, …, I_t, …, I_T} of the low-light scene and a video image set N = {N_1, N_2, …, N_t, …, N_T} of the corresponding normal-light scene, where I_t denotes the low-light image at time t, N_t denotes the normal-light image at time t, and T is the number of frames;
Step 1.1.2, labeling the target masks of the normal-light video image set with a labeling tool to obtain the target mask set Y = {y_1, y_2, …, y_t, …, y_T} shared by the low-light and normal-light video image sets, where y_t denotes the target mask of the low-light image I_t and the normal-light image N_t at time t;
Step 1.1.3, obtaining the event sequence of the low-light video image set I, denoted E = {E_{0,1}, E_{1,2}, …, E_{t-1,t}, …, E_{T-1,T}}, where E_{t-1,t} denotes the low-light events between the low-light image I_{t-1} at time t-1 and the low-light image I_t at time t.
3. The method for event signal driven low-light video object segmentation according to claim 2, wherein the multi-modal encoder in step 2.1 comprises an image encoder, an event encoder, and an adaptive cross-modal fusion module;
Step 2.1.1, the image encoder consists of m residual modules and n downsampling modules;
the low-light image I_t at time t is fed into the image encoder for feature extraction, giving the multi-scale image feature F_t^Img of I_t;
Step 2.1.2, the event encoder consists of m residual modules and n downsampling modules;
the low-light events E_{t-1,t} from time t-1 to time t are fed into the event encoder for feature extraction, giving the multi-scale event feature F_t^Evt of E_{t-1,t};
Step 2.1.3, the adaptive cross-modal fusion module concatenates the multi-scale image feature F_t^Img and the multi-scale event feature F_t^Evt along the channel dimension, then applies convolution and average pooling to obtain the multi-scale mixed feature F_t^Cat at time t;
F_t^Cat is multiplied element-wise with F_t^Img and with F_t^Evt, yielding the screened multi-scale image feature F̂_t^Img and the screened multi-scale event feature F̂_t^Evt at time t;
the screened multi-scale event feature F̂_t^Evt undergoes a channel attention operation followed by a spatial attention operation, giving the multi-scale event attention feature Â_t^Evt at time t; Â_t^Evt is summed with the screened multi-scale image feature F̂_t^Img to obtain the multi-scale image feature F̃_t^Img fusing the event information at time t;
F̃_t^Img and the screened multi-scale event feature F̂_t^Evt are convolved and summed to obtain the mixed feature F_t at time t.
4. The method for event signal driven low-light video object segmentation according to claim 3, wherein the event-guided memory matching module in step 2.2 comprises a memory storage module, an event guidance module, an attention matching module, and a mask decoder;
Step 2.2.1, the memory storage module applies linear transformations to the mixed feature F_t at time t to obtain the key K_t and the value V_t at time t; the mask at time t is denoted Mask_t, initialized at t = 1;
Step 2.2.2, the event guidance module concatenates Mask_t with F_t^Evt along the channel dimension and extracts multi-scale information through convolutions of different kernel sizes and pooling to obtain the filter signal SE_t at time t; SE_t is point-multiplied with F_t^Evt and with Mask_t respectively and the results are summed, finally outputting the strengthened guide signal G_t at time t;
Step 2.2.3, the attention matching module obtains the filtered key K_t′ at time t using formula (1):
K_t′ = K_t · G_t (1)
the attention matrix A_{t+1} at time t+1 is obtained using formula (2):
A_{t+1} = Softmax(Q_{t+1} · (K_t′)^Tr / √d_k) (2)
In formula (2), Q_{t+1} denotes the query obtained from F_{t+1} by linear transformation, d_k denotes the channel dimension of Q_{t+1} and K_t′, Softmax denotes the activation function, and Tr denotes matrix transposition;
the matching result R_{t+1} at time t+1 is obtained using formula (3):
R_{t+1} = A_{t+1}(G_t + V_t) (3)
Step 2.2.4, the mask decoder consists of convolution layers and upsampling layers;
the matching result R_{t+1} and the mixed feature F_{t+1} are concatenated along the channel dimension and fed into the mask decoder, which outputs the predicted target mask Mask_{t+1} at time t+1.
5. The method for splitting a low-light video object based on event signal driving according to claim 4, wherein said step 3 is performed as follows:
Step 3.1, constructing the cross-entropy loss function L_ce^t at time t using formula (4):
L_ce^t = −Σ_p [ y_t(p) log Mask_t(p) + (1 − y_t(p)) log(1 − Mask_t(p)) ] (4)
Step 3.2, constructing the Soft Jaccard loss function L_jac^t at time t using formula (5):
L_jac^t = 1 − Σ_p (y_t(p) · Mask_t(p)) / Σ_p (y_t(p) + Mask_t(p) − y_t(p) · Mask_t(p)) (5)
Step 3.3, constructing the total loss function L at time t using formula (6):
L = α · L_ce^t + β · L_jac^t (6)
In formula (6), α and β are two weighting coefficients, and p indexes the pixels.
6. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor to perform the low-light video object segmentation method of any one of claims 1-5, the processor being configured to execute the program stored in the memory.
7. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor performs the steps of the low-light video object segmentation method according to any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410215980.0A | 2024-02-27 | 2024-02-27 | Low-light video target segmentation method based on event signal driving |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118052985A | 2024-05-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |