CN113538457B - Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
- Publication number: CN113538457B (application CN202110718738.1A)
- Authority: CN (China)
- Prior art keywords: frequency, convolution, feature map, low, video
- Legal status: Active
Classifications
- G06T7/11—Region-based segmentation (under G06T7/00 Image analysis; G06T7/10 Segmentation; Edge detection)
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F18/00 Pattern recognition; G06F18/21 Design or setup of recognition systems)
- G06N3/045—Combinations of networks (under G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture)
- G06N3/08—Learning methods (under G06N3/02 Neural networks)
- G06F2218/12—Classification; Matching (under G06F2218/00 Aspects of pattern recognition specially adapted for signal processing)
- G06T2207/10016—Video; Image sequence (under G06T2207/00 Indexing scheme for image analysis; G06T2207/10 Image acquisition modality)
Abstract
The invention discloses a video semantic segmentation method utilizing multi-frequency dynamic hole convolution. First, sampled frame images of the video data are enhanced and a shallow visual feature map is extracted by an encoder. A feature frequency separation module is then constructed to obtain the multi-frequency feature maps of each video frame; these are input into a dynamic hole convolution module to obtain the corresponding multi-frequency high-level semantic feature maps, and a segmentation mask of the video frame is obtained through an up-sampling convolutional decoder. The model is trained iteratively with stochastic gradient descent until convergence, after which a new video can be input into the model to obtain its semantic segmentation result. The method separates the feature map of a video frame into different frequencies that characterize the changes of different visual regions, which reduces low-frequency spatial redundancy and computational complexity; the dynamic hole convolution adaptively enlarges the receptive field of the multi-frequency feature maps and improves the model's ability to discriminate between the semantic classes of the video, thereby yielding better video semantic segmentation results.
Description
Technical Field
The invention belongs to the technical field of computer vision, in particular to the field of semantic segmentation in video processing, and relates to a video semantic segmentation method utilizing multi-frequency dynamic hole convolution.
Background
With the increasing number of vehicles of all types, driving safety is a major concern for governments and the public. Drivers of large vehicles, in particular, are prone to visual blind spots, which poses great hidden dangers to driving safety. In recent years, automatic driving technology has attracted much interest in industry, and increasing research effort has been invested in this field. Efficient visual understanding can guarantee the safety of automatic driving, and video semantic segmentation is one of its core technologies. Video semantic segmentation aims to assign pixel-level class labels to temporally correlated video frames, producing a pixel-by-pixel class mask matrix of the same size as the original video frame; it can be widely applied in machine vision, video surveillance, unmanned aerial vehicle reconnaissance, automatic driving, and other fields. For example, in an automatic driving environment, segmenting objects such as roads, pedestrians, and other vehicles in the vehicle's visual scene at the pixel level yields object region information more accurate than bounding boxes, providing the automatic driving system with more precise visual perception so that it can avoid obstacles such as pedestrians and vehicles and ensure driving safety. Currently, the main challenges in video semantic segmentation include the high computational complexity of models, the long processing time for high-resolution video frames, and the difficulty of deploying models in real-time environments.
Traditional semantic segmentation methods mainly fall into the following categories: thresholding, edge-based segmentation, super-pixel clustering, and the like. Threshold segmentation compares the gray value of each pixel of the image with a threshold; pixels whose gray values exceed the threshold are judged foreground and the rest background, but the approach applies only to grayscale images. Edge-based segmentation first performs edge detection on the image, and pixels within the same edge are taken to represent the same object; its drawback is that segmentation precision is limited by the edge detection algorithm. Super-pixel clustering aggregates similar super-pixel blocks to delineate the same object; its drawback is that super-pixel formation is constrained by pixel colors and region textures, so different parts of the same object are easily split into multiple super-pixels, causing segmentation errors. In recent years, deep neural networks have become popular for their strong feature extraction capability; a typical method uses a convolutional neural network as an encoder to extract abstract semantic information from a video frame and obtains a semantic segmentation mask through layer-by-layer up-sampling in a decoder. However, convolutional layers extract only local semantic information from the frame image and struggle to characterize the global scene. Spatial pyramid pooling has therefore been applied to semantic segmentation: multiple parallel pooling operations are applied to the encoder's feature map to obtain compressed feature maps of different sizes, capturing global scene features at multiple receptive-field scales; these are up-sampled back to the initial feature-map size and concatenated into an overall feature map, from which a decoder finally produces the semantic segmentation mask and thus the video semantic segmentation result. (The threshold method is illustrated by the minimal sketch below.)
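A minimal sketch of the classical threshold method described above; the threshold value and toy data are arbitrary choices, not from the patent:

```python
import numpy as np

def threshold_segment(gray, thresh=128):
    """Classical threshold segmentation: pixels brighter than `thresh`
    are labeled foreground (1), all others background (0). Only
    meaningful for single-channel grayscale images."""
    return (gray > thresh).astype(np.uint8)

# Toy usage on a 2x3 grayscale patch.
mask = threshold_segment(np.array([[10, 200, 90], [130, 40, 255]]))
# mask == [[0, 1, 0], [1, 0, 1]]
```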
Existing semantic segmentation methods still have many shortcomings: 1) spatial pyramid pooling considers local and global spatio-temporal structure simultaneously, which makes segmentation more reliable, but applying max or average pooling to high-resolution feature maps brings poor fault tolerance, weak generalization, and high computational complexity; 2) attention mechanisms strengthen long-term semantic dependencies between feature maps, but the resulting models are too large and memory-hungry, which hinders real-time deployment; 3) the Transformer encoder, widely used in natural language processing as a feature extractor, takes a one-dimensional sequence of embedded representations of a two-dimensional image as input and stacks self-attention and multi-layer perceptrons to capture long-term dependencies between video frames, but the lack of weight sharing makes the parameter count huge and self-attention computationally expensive, so real-time performance is hard to guarantee. Moreover, most segmentation methods cannot effectively balance precision and real-time performance and therefore cannot meet the requirements of practical segmentation tasks. Given the high computational complexity and poor generalization of existing segmentation models, a method that guarantees real-time performance while achieving high semantic segmentation precision is urgently needed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video semantic segmentation method using multi-frequency dynamic hole convolution. It separates the feature map into multiple frequencies via the Fourier transform; the multi-frequency feature maps characterize the differing gray-value variations of different visual regions, reducing low-frequency spatial redundancy and computational complexity. Meanwhile, a dynamic hole convolution is designed to adaptively enlarge the receptive field of the multi-frequency feature maps and to improve the model's ability to discriminate between the semantic classes of the video from both global and local perspectives, thereby improving video semantic segmentation precision.
The method firstly acquires a video data set, and then performs the following operations:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, i.e., a deep convolutional neural network, to obtain the corresponding shallow visual feature maps;
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps;
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps;
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, i.e., an up-sampling convolution module, to obtain the segmentation mask of the video frame;
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module, and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result.
Further, the step (1) is specifically:
(1-1) Uniformly sample each video at a rate of 10-15 frames/second and apply enhancement operations to the video frames to obtain a sequence of $N$ video frames, denoted $I = \{I_i\}_{i=1}^{N}$ with $I_i \in \mathbb{R}^{3 \times H \times W}$, where $I_i$ denotes the $i$-th video frame, $\mathbb{R}$ denotes the real number field, 3 is the number of RGB channels, and $H$ and $W$ denote the height and width of a video frame;
(1-2) Using the convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract from the video frame sequence $I$ the shallow visual feature maps $f_i \in \mathbb{R}^{C_f \times H_f \times W_f}$, where $C_f$ denotes the number of channels of the feature map and $H_f$ and $W_f$ denote its height and width. ResNet consists of several modules of stacked convolutional layers; $f_i$ is the feature map obtained from the $i$-th video frame after the first three such modules of ResNet.
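A minimal sketch of this encoder step, assuming PyTorch and the torchvision >= 0.13 weights API, with ResNet-50 as the backbone (the patent names only "ResNet"; the first three stages of ResNet-50 yield the 1024 channels mentioned in the embodiment):

```python
import torch
import torch.nn as nn
import torchvision

# First three convolutional stages of an ImageNet-pretrained ResNet-50
# (children kept: conv1, bn1, relu, maxpool, layer1, layer2, layer3).
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
encoder = nn.Sequential(*list(backbone.children())[:-3])

frames = torch.randn(4, 3, 512, 1024)   # N enhanced sample frames (RGB)
with torch.no_grad():
    shallow = encoder(frames)           # (4, 1024, 32, 64): C_f=1024, H_f=H/16, W_f=W/16
```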
Further, the step (2) is specifically:
(2-1) Construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high/low-frequency feature separation operations on the shallow visual feature map to obtain the multi-frequency feature maps; the high-frequency features characterize the contour regions of the feature map, the low-frequency features characterize its flat regions, and the medium-frequency features characterize its content regions;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply the fast Fourier transform to the shallow visual feature map $f_i$, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram $F_i$ of $f_i$. Translate the low-frequency part of $F_i$ to the center to obtain the shifted spectrogram $F_i^{shift}$, and determine its center position vector $(P, Q)$, where $P = (P_1, \dots, P_{C_f})$ is the vector of abscissa values of the channel centers, $Q = (Q_1, \dots, Q_{C_f})$ is the vector of their ordinate values, and the subscript $r$ denotes the channel index of $F_i^{shift}$;
Then multiply each element of $F_i^{shift}$ by the low-frequency transfer function $H_l(u_{r,a}, v_{r,b})$ to obtain the low-frequency shifted spectrogram $F_i^{l,shift}$. The Gaussian low-pass transfer function is $H_l(u_{r,a}, v_{r,b}) = \exp\left(-D^2(u_{r,a}, v_{r,b}) / (2 D_0^2)\right)$, where $l$ denotes the low-frequency signal, $a$ and $b$ denote the horizontal and vertical coordinates of a pixel with $0 \le a \le H_f$ and $0 \le b \le W_f$, $\exp(\cdot)$ denotes the exponential function, and $D_0$ is the preset standard deviation. Here $D(u_{r,a}, v_{r,b}) = \sqrt{u_{r,a}^2 + v_{r,b}^2}$ is the Euclidean distance from pixel $(a, b)$ of the $r$-th channel of $F_i^{shift}$ to the center $(P_r, Q_r)$; $u_{r,a}$ is the Euclidean distance from spectral position $(a, 0)$ of the $r$-th channel to $P_r$, and $v_{r,b}$ is the Euclidean distance from spectral position $(0, b)$ of the $r$-th channel to $Q_r$;
Similarly, multiply each element of $F_i^{shift}$ by the high-frequency transfer function $H_h(u_{r,a}, v_{r,b}) = 1 - H_l(u_{r,a}, v_{r,b})$ to obtain the high-frequency shifted spectrogram $F_i^{h,shift}$, where $h$ denotes the high-frequency signal;
Translate the low-frequency components of the spectrograms $F_i^{l,shift}$ and $F_i^{h,shift}$ from the center back to their original positions to obtain the low-frequency spectrogram $F_i^{l}$ and the high-frequency spectrogram $F_i^{h}$;
Finally, apply the inverse fast Fourier transform to $F_i^{l}$ and $F_i^{h}$, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map $f_i^{l}$ and the weak high-frequency feature map $f_i^{h}$;
(2-3) Following (2-2), perform a second high/low-frequency separation on the weak high-frequency feature map $f_i^{h}$ to obtain the strong high-frequency feature map $f_i^{hh}$ and the medium-high-frequency feature map $f_i^{hl}$, where $hh$ indicates that the feature map has been high-pass filtered twice and $hl$ indicates that it has been high-pass filtered once and then low-pass filtered once;
Following (2-2), perform a second high/low-frequency separation on the weak low-frequency feature map $f_i^{l}$ to obtain the strong low-frequency feature map $f_i^{ll}$ and the medium-low-frequency feature map $f_i^{lh}$, where $ll$ indicates that the feature map has been low-pass filtered twice and $lh$ indicates that it has been low-pass filtered once and then high-pass filtered once;
(2-4) Concatenate the medium-high-frequency feature map $f_i^{hl}$ and the medium-low-frequency feature map $f_i^{lh}$, apply a 1×1 convolution to obtain a compressed feature map, and down-sample it with a stride-2 max-pooling operation to obtain the medium-frequency feature map $f_i^{m} \in \mathbb{R}^{C_m \times \frac{H_f}{2} \times \frac{W_f}{2}}$, where $m$ denotes the medium-frequency signal and $C_m$ is the channel dimension of the medium-frequency feature map;
(2-5) Apply a 1×1 convolution to the strong low-frequency feature map $f_i^{ll}$ to obtain a compressed feature map, and down-sample it with a stride-4 max-pooling operation to obtain the low-frequency feature map $f_i^{low} \in \mathbb{R}^{C_l \times \frac{H_f}{4} \times \frac{W_f}{4}}$; apply a 1×1 convolution to the strong high-frequency feature map $f_i^{hh}$ to obtain the compressed high-frequency feature map $f_i^{high} \in \mathbb{R}^{C_h \times H_f \times W_f}$, where $C_h$ and $C_l$ denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively.
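A minimal PyTorch sketch of steps (2-2)-(2-5), under the following assumptions: after fftshift the spectral center $(P_r, Q_r)$ is the same for every channel, and the names `split_high_low`, `multi_frequency`, `d0`, and the 1×1 compression convolutions passed in as `conv_mid`, `conv_low`, `conv_high` are illustrative, not names from the patent:

```python
import torch
import torch.nn.functional as F

def split_high_low(x, d0=10.0):
    # One high/low separation from (2-2): FFT -> shift low frequencies
    # to the centre -> multiply by a Gaussian low-pass transfer function
    # H_l (and its complement H_h = 1 - H_l) -> shift back -> inverse FFT.
    # d0 is the standard deviation D_0 (10 in the embodiment).
    _, _, h, w = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    ys = torch.arange(h, dtype=torch.float32, device=x.device) - h // 2
    xs = torch.arange(w, dtype=torch.float32, device=x.device) - w // 2
    d2 = ys[:, None] ** 2 + xs[None, :] ** 2        # squared distance to centre (P, Q)
    h_low = torch.exp(-d2 / (2 * d0 ** 2))          # Gaussian low-pass H_l
    h_high = 1.0 - h_low                            # high-pass H_h
    back = lambda s: torch.fft.ifft2(torch.fft.ifftshift(s, dim=(-2, -1))).real
    return back(spec * h_low), back(spec * h_high)  # weak low, weak high

def multi_frequency(f, conv_mid, conv_low, conv_high):
    # Three separations as in (2-3)-(2-5); conv_* are the 1x1
    # channel-compression layers, whose widths are free design choices.
    f_l, f_h = split_high_low(f)        # weak low / weak high
    f_hl, f_hh = split_high_low(f_h)    # medium-high (h then l), strong high
    f_ll, f_lh = split_high_low(f_l)    # strong low, medium-low (l then h)
    mid = F.max_pool2d(conv_mid(torch.cat([f_hl, f_lh], dim=1)), 2)  # stride-2 pool, (2-4)
    low = F.max_pool2d(conv_low(f_ll), 4)                            # stride-4 pool, (2-5)
    high = conv_high(f_hh)                                           # full resolution
    return low, mid, high
```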
Still further, the step (3) is specifically:
(3-1) Construct a dynamic hole convolution module consisting of a weight calculator and $K$ parallel hole convolution kernels, and input each multi-frequency feature map into the module to obtain the multi-frequency high-level semantic feature maps, comprising a low-frequency, a medium-frequency, and a high-frequency high-level semantic feature map;
(3-2) The dynamic hole convolution operates as follows: the low-frequency feature map $f_i^{low}$ is input to the weight calculator to obtain $K$ weights $w_1, \dots, w_K$, where $w_t$ denotes the weight of the $t$-th hole convolution, $0 \le w_t < 1$, and $\sum_{t=1}^{K} w_t = 1$. The weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, a second fully connected layer, and a Softmax function. Among the $K$ parallel hole convolution kernels $K_1, \dots, K_K$, $K_t$ denotes the $t$-th 3×3 hole convolution with hole rate 2. Each $K_t$ is multiplied by its corresponding weight $w_t$, and the $K$ weighted hole convolutions are summed to obtain the integrated hole convolution kernel $\tilde{K} = \sum_{t=1}^{K} w_t K_t$, exploiting the parameters of multiple parallel hole convolutions to capture different receptive fields. The low-frequency feature map $f_i^{low}$ is then convolved with the integrated kernel $\tilde{K}$ to obtain the low-frequency high-level semantic feature map $s_i^{low}$, whose number of channels is twice $C_l$;
(3-3) Dynamic hole convolution modules are stacked serially, the output of the first module serving as the input of the second. Following (3-2), the medium-frequency feature map $f_i^{m}$ passes through two serial dynamic hole convolution modules to obtain the medium-frequency high-level semantic feature map $s_i^{m}$, whose number of channels is four times $C_m$; similarly, the high-frequency feature map $f_i^{high}$ passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map $s_i^{high}$, whose number of channels is eight times $C_h$.
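A minimal sketch of the dynamic hole convolution module of (3-2), assuming PyTorch; the values of K and the hidden width of the weight calculator, as well as the batch-averaged kernel weights, are simplifying assumptions (a strict per-sample dynamic convolution would need a grouped-convolution trick):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicHoleConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=4, dilation=2, hidden=64):
        super().__init__()
        self.dilation = dilation
        # Weight calculator of (3-2): GAP -> FC -> ReLU -> FC -> Softmax.
        self.weight_calc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, hidden), nn.ReLU(),
            nn.Linear(hidden, k), nn.Softmax(dim=1))
        # K parallel 3x3 hole-convolution kernels K_t.
        self.kernels = nn.Parameter(0.02 * torch.randn(k, out_ch, in_ch, 3, 3))

    def forward(self, x):
        # Kernel weights averaged over the batch for simplicity.
        w = self.weight_calc(x).mean(dim=0)                            # (K,)
        kernel = (w[:, None, None, None, None] * self.kernels).sum(0)  # integrated kernel
        return F.conv2d(x, kernel, padding=self.dilation, dilation=self.dilation)
```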
Still further, the step (4) is specifically:
(4-1) Construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and produces a larger feature map by a convolution operation on the input small feature map;
(4-2) Concatenate the low-frequency high-level semantic feature map $s_i^{low}$, the medium-frequency high-level semantic feature map $s_i^{m}$, and the high-frequency high-level semantic feature map $s_i^{high}$ along the channel dimension to obtain the integrated high-level semantic feature map $t_i$;
(4-3) Input the integrated semantic feature map $t_i$ into the decoder to obtain the segmentation mask $M_i \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the total number of semantic categories; the category assigned to each pixel of the video frame is the one with the highest probability over all categories.
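A minimal sketch of the decoder of (4-1)-(4-3). The three multi-frequency maps have different spatial resolutions, so this sketch assumes they are first interpolated to a common size before channel-wise concatenation; the intermediate channel widths (256, 128) are arbitrary choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, in_ch, num_classes):
        super().__init__()
        # Three transposed-convolution layers as in (4-1); a net x16
        # upsampling undoes the encoder's /16 reduction.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, num_classes, 4, stride=4))

    def forward(self, low, mid, high):
        # (4-2): bring the three maps to a common resolution, then
        # concatenate along the channel dimension.
        size = high.shape[-2:]
        t = torch.cat([F.interpolate(low, size=size),
                       F.interpolate(mid, size=size), high], dim=1)
        return self.up(t)  # (N, C, H, W) scores; argmax over C gives the mask
```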
Still further, the step (5) is specifically:
(5-1) Establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution modules, and the decoder;
(5-2) Sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks $M_i$, $i = 1, \dots, N$; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
(5-3) Input each frame of a new video into the trained model and, following (5-2), sequentially output the corresponding segmentation results $M \in \mathbb{R}^{C \times H \times W}$, where the first dimension indexes the semantic categories.
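A minimal sketch of the training loop of step (5): pixel-wise cross-entropy optimized by stochastic gradient descent. The one-layer stand-in model and random data are placeholders for the full segmentation network and data loader:

```python
import torch
import torch.nn as nn

num_classes = 19
model = nn.Conv2d(3, num_classes, 1)            # stand-in for the full model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for step in range(100):
    frames = torch.randn(2, 3, 64, 64)          # sampled, enhanced video frames
    labels = torch.randint(0, num_classes, (2, 64, 64))
    loss = criterion(model(frames), labels)     # (N,C,H,W) scores vs (N,H,W) labels
    optimizer.zero_grad()
    loss.backward()                             # gradient back-propagation
    optimizer.step()                            # iterate until convergence
```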
The method performs semantic segmentation on video using a feature frequency separation mechanism and dynamic hole convolution modules, and has the following characteristics: 1) unlike existing methods that process the high-resolution feature map uniformly, the feature frequency separation module designed by the invention separates the feature map into features of different frequencies, where high-frequency features represent regions of large variation, low-frequency features represent regions of small variation, and medium-frequency features represent regions of moderate variation; processing the different frequencies separately lets the network learn more targeted semantic features; 2) the dynamic hole convolution module dynamically assigns different weights to several parallel hole convolutions according to the input features, without increasing network depth or width, so that the hole convolutions are effectively fused and more effective semantic features are extracted; 3) most existing methods improve segmentation precision by stacking refinement modules and deepening the network, while neglecting problems such as model redundancy and low segmentation speed, which the present design avoids.
The method is suitable for video semantic segmentation with strict real-time requirements and has the following advantages: 1) the feature frequency separation module effectively separates and distinguishes the different frequency components of the feature map, improving processing efficiency; 2) the dynamic hole convolution module fuses multiple hole convolutions without significantly increasing network complexity, capturing more effective semantic information in the feature map and yielding more accurate segmentation results; 3) features of different frequencies are processed in a targeted manner by dynamic hole convolution modules of different depths, greatly reducing the computational cost of the model and increasing its video semantic segmentation speed. The invention can be applied to practical tasks such as intelligent surveillance, unmanned aerial vehicle reconnaissance, machine vision, and automatic driving.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, the video semantic segmentation method using multi-frequency dynamic hole convolution first samples a given video and inputs the frames into an encoder composed of a convolutional neural network to obtain shallow visual feature maps of the video frames. A feature frequency separation module, composed of the Fourier transform, Gaussian filters, and the inverse Fourier transform, then separates multi-frequency feature maps from the shallow visual feature map. Next, the multi-frequency feature maps are processed at different depths by dynamic hole convolutions, each composed of a weight calculator and several parallel hole convolution kernels, to obtain multi-frequency high-level semantic feature maps. Finally, the multi-frequency high-level semantic feature maps are concatenated, input into a decoder, and up-sampled to obtain the semantic segmentation result. The method extends the idea of separable image frequencies to the shallow visual feature map, distinguishing visual regions of different frequencies; processing the different-frequency feature maps with dynamic hole convolutions of different depths enlarges their receptive fields and reduces the model's computational complexity, so high semantic segmentation precision can be obtained in real time.
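The following sketch assembles the pipeline of fig. 1 from the module sketches given earlier (the `encoder`, `multi_frequency`, `DynamicHoleConv`, and `Decoder` definitions above, assumed to be in scope). The channel widths `c_l`, `c_m`, `c_h` are assumptions; the branch depths (one dynamic hole convolution module for the low-frequency map, two serial for medium, four serial for high) and the channel multipliers (x2, x4, x8) follow the text:

```python
import torch.nn as nn

class MFDHCSegmenter(nn.Module):
    def __init__(self, num_classes, c_f=1024, c_l=128, c_m=128, c_h=64):
        super().__init__()
        self.encoder = encoder                           # ResNet stages, step (1)
        self.conv_mid = nn.Conv2d(2 * c_f, c_m, 1)       # 1x1 compressions, (2-4)/(2-5)
        self.conv_low = nn.Conv2d(c_f, c_l, 1)
        self.conv_high = nn.Conv2d(c_f, c_h, 1)
        self.low_branch = DynamicHoleConv(c_l, 2 * c_l)  # channels x2
        self.mid_branch = nn.Sequential(                 # channels x4
            DynamicHoleConv(c_m, 2 * c_m), DynamicHoleConv(2 * c_m, 4 * c_m))
        self.high_branch = nn.Sequential(                # channels x8
            DynamicHoleConv(c_h, 2 * c_h), DynamicHoleConv(2 * c_h, 4 * c_h),
            DynamicHoleConv(4 * c_h, 8 * c_h), DynamicHoleConv(8 * c_h, 8 * c_h))
        self.decoder = Decoder(2 * c_l + 4 * c_m + 8 * c_h, num_classes)

    def forward(self, frames):
        f = self.encoder(frames)                         # shallow visual features
        low, mid, high = multi_frequency(f, self.conv_mid, self.conv_low, self.conv_high)
        return self.decoder(self.low_branch(low),
                            self.mid_branch(mid),
                            self.high_branch(high))
```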
The method comprises the steps of firstly acquiring a video data set, and then performing the following operations:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, i.e., a deep convolutional neural network, to obtain the corresponding shallow visual feature maps. The step comprises:
(1-1) Uniformly sample each video at a rate of 10 frames/second and apply enhancement operations to the video frames to obtain a sequence of $N$ video frames, denoted $I = \{I_i\}_{i=1}^{N}$ with $I_i \in \mathbb{R}^{3 \times H \times W}$, where $I_i$ denotes the $i$-th video frame, $\mathbb{R}$ denotes the real number field, 3 is the number of RGB channels, and $H$ and $W$ denote the height and width of a video frame;
(1-2) Using the convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract from the video frame sequence $I$ the shallow visual feature maps $f_i \in \mathbb{R}^{C_f \times H_f \times W_f}$, where $C_f$ denotes the number of channels of the feature map (1024 in this embodiment) and $H_f$ and $W_f$ denote its height and width. ResNet consists of several modules of stacked convolutional layers; $f_i$ is the feature map obtained from the $i$-th video frame after the first three such modules of ResNet.
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps. The step comprises:
(2-1) Construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high/low-frequency feature separation operations on the shallow visual feature map to obtain the multi-frequency feature maps; the high-frequency features characterize the contour regions of the feature map, the low-frequency features characterize its flat regions, and the medium-frequency features characterize its content regions;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply the fast Fourier transform to the shallow visual feature map $f_i$, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram $F_i$ of $f_i$. Translate the low-frequency part of $F_i$ to the center to obtain the shifted spectrogram $F_i^{shift}$, and determine its center position vector $(P, Q)$, where $P = (P_1, \dots, P_{C_f})$ is the vector of abscissa values of the channel centers, $Q = (Q_1, \dots, Q_{C_f})$ is the vector of their ordinate values, and the subscript $r$ denotes the channel index of $F_i^{shift}$;
Then multiply each element of $F_i^{shift}$ by the low-frequency transfer function $H_l(u_{r,a}, v_{r,b})$ to obtain the low-frequency shifted spectrogram $F_i^{l,shift}$. The Gaussian low-pass transfer function is $H_l(u_{r,a}, v_{r,b}) = \exp\left(-D^2(u_{r,a}, v_{r,b}) / (2 D_0^2)\right)$, where $l$ denotes the low-frequency signal, $a$ and $b$ denote the horizontal and vertical coordinates of a pixel with $0 \le a \le H_f$ and $0 \le b \le W_f$, $\exp(\cdot)$ denotes the exponential function, and $D_0$ is the preset standard deviation (10 in this embodiment). Here $D(u_{r,a}, v_{r,b}) = \sqrt{u_{r,a}^2 + v_{r,b}^2}$ is the Euclidean distance from pixel $(a, b)$ of the $r$-th channel of $F_i^{shift}$ to the center $(P_r, Q_r)$; $u_{r,a}$ is the Euclidean distance from spectral position $(a, 0)$ of the $r$-th channel to $P_r$, and $v_{r,b}$ is the Euclidean distance from spectral position $(0, b)$ of the $r$-th channel to $Q_r$;
Similarly, multiply each element of $F_i^{shift}$ by the high-frequency transfer function $H_h(u_{r,a}, v_{r,b}) = 1 - H_l(u_{r,a}, v_{r,b})$ to obtain the high-frequency shifted spectrogram $F_i^{h,shift}$, where $h$ denotes the high-frequency signal;
Translate the low-frequency components of the spectrograms $F_i^{l,shift}$ and $F_i^{h,shift}$ from the center back to their original positions to obtain the low-frequency spectrogram $F_i^{l}$ and the high-frequency spectrogram $F_i^{h}$;
Finally, apply the inverse fast Fourier transform to $F_i^{l}$ and $F_i^{h}$, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map $f_i^{l}$ and the weak high-frequency feature map $f_i^{h}$;
(2-3) Following (2-2), perform a second high/low-frequency separation on the weak high-frequency feature map $f_i^{h}$ to obtain the strong high-frequency feature map $f_i^{hh}$ and the medium-high-frequency feature map $f_i^{hl}$, where $hh$ indicates that the feature map has been high-pass filtered twice and $hl$ indicates that it has been high-pass filtered once and then low-pass filtered once;
Following (2-2), perform a second high/low-frequency separation on the weak low-frequency feature map $f_i^{l}$ to obtain the strong low-frequency feature map $f_i^{ll}$ and the medium-low-frequency feature map $f_i^{lh}$, where $ll$ indicates that the feature map has been low-pass filtered twice and $lh$ indicates that it has been low-pass filtered once and then high-pass filtered once;
(2-4) Concatenate the medium-high-frequency feature map $f_i^{hl}$ and the medium-low-frequency feature map $f_i^{lh}$, apply a 1×1 convolution to obtain a compressed feature map, and down-sample it with a stride-2 max-pooling operation to obtain the medium-frequency feature map $f_i^{m} \in \mathbb{R}^{C_m \times \frac{H_f}{2} \times \frac{W_f}{2}}$, where $m$ denotes the medium-frequency signal and $C_m$ is the channel dimension of the medium-frequency feature map;
(2-5) Apply a 1×1 convolution to the strong low-frequency feature map $f_i^{ll}$ to obtain a compressed feature map, and down-sample it with a stride-4 max-pooling operation to obtain the low-frequency feature map $f_i^{low} \in \mathbb{R}^{C_l \times \frac{H_f}{4} \times \frac{W_f}{4}}$; apply a 1×1 convolution to the strong high-frequency feature map $f_i^{hh}$ to obtain the compressed high-frequency feature map $f_i^{high} \in \mathbb{R}^{C_h \times H_f \times W_f}$, where $C_h$ and $C_l$ denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively.
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps. The step comprises:
(3-1) Construct a dynamic hole convolution module consisting of a weight calculator and $K$ parallel hole convolution kernels, and input each multi-frequency feature map into the module to obtain the multi-frequency high-level semantic feature maps, comprising a low-frequency, a medium-frequency, and a high-frequency high-level semantic feature map;
(3-2) The dynamic hole convolution operates as follows: the low-frequency feature map $f_i^{low}$ is input to the weight calculator to obtain $K$ weights $w_1, \dots, w_K$, where $w_t$ denotes the weight of the $t$-th hole convolution, $0 \le w_t < 1$, and $\sum_{t=1}^{K} w_t = 1$. The weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, a second fully connected layer, and a Softmax function. Among the $K$ parallel hole convolution kernels $K_1, \dots, K_K$, $K_t$ denotes the $t$-th 3×3 hole convolution with hole rate 2. Each $K_t$ is multiplied by its corresponding weight $w_t$, and the $K$ weighted hole convolutions are summed to obtain the integrated hole convolution kernel $\tilde{K} = \sum_{t=1}^{K} w_t K_t$, exploiting the parameters of multiple parallel hole convolutions to capture different receptive fields. The low-frequency feature map $f_i^{low}$ is then convolved with the integrated kernel $\tilde{K}$ to obtain the low-frequency high-level semantic feature map $s_i^{low}$, whose number of channels is twice $C_l$;
(3-3) Dynamic hole convolution modules are stacked serially, the output of the first module serving as the input of the second. Following (3-2), the medium-frequency feature map $f_i^{m}$ passes through two serial dynamic hole convolution modules to obtain the medium-frequency high-level semantic feature map $s_i^{m}$, whose number of channels is four times $C_m$; similarly, the high-frequency feature map $f_i^{high}$ passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map $s_i^{high}$, whose number of channels is eight times $C_h$.
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, i.e., an up-sampling convolution module, to obtain the segmentation mask of the video frame. The step comprises:
(4-1) Construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and produces a larger feature map by a convolution operation on the input small feature map;
(4-2) Concatenate the low-frequency high-level semantic feature map $s_i^{low}$, the medium-frequency high-level semantic feature map $s_i^{m}$, and the high-frequency high-level semantic feature map $s_i^{high}$ along the channel dimension to obtain the integrated high-level semantic feature map $t_i$;
(4-3) Input the integrated semantic feature map $t_i$ into the decoder to obtain the segmentation mask $M_i \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the total number of semantic categories; the category assigned to each pixel of the video frame is the one with the highest probability over all categories.
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module, and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result. The step comprises:
(5-1) Establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution modules, and the decoder;
(5-2) Sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks $M_i$, $i = 1, \dots, N$; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
(5-3) Input each frame of a new video into the trained model and, following (5-2), sequentially output the corresponding segmentation results $M \in \mathbb{R}^{C \times H \times W}$, where the first dimension indexes the semantic categories.
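A minimal inference sketch for (5-3), reusing the `MFDHCSegmenter` sketch above; `read_frames` is a hypothetical 10 fps frame sampler, not an API from the patent:

```python
import torch

model = MFDHCSegmenter(num_classes=19)   # trained weights would be loaded here
model.eval()
with torch.no_grad():
    for frame in read_frames("new_video.mp4", fps=10):   # (3, H, W) tensors
        scores = model(frame.unsqueeze(0))               # (1, C, H, W)
        mask = scores.argmax(dim=1)[0]                   # (H, W) semantic labels
```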
The embodiment described here is only an example of the implementation of the inventive concept; the protection scope of the invention should not be considered limited to the specific forms set forth in the embodiment, and also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.
Claims (3)
1. A video semantic segmentation method using multi-frequency dynamic hole convolution, characterized in that a video data set is first acquired and the following operations are then performed:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, i.e., a deep convolutional neural network, to obtain the corresponding shallow visual feature maps. The step comprises:
(1-1) Uniformly sample each video at a rate of 10-15 frames/second and apply enhancement operations to the video frames to obtain a sequence of $N$ video frames, denoted $I = \{I_i\}_{i=1}^{N}$ with $I_i \in \mathbb{R}^{3 \times H \times W}$, where $I_i$ denotes the $i$-th video frame, $\mathbb{R}$ denotes the real number field, 3 is the number of RGB channels, and $H$ and $W$ denote the height and width of a video frame;
(1-2) Using the convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract from the video frame sequence $I$ the shallow visual feature maps $f_i \in \mathbb{R}^{C_f \times H_f \times W_f}$, where $C_f$ denotes the number of channels of the feature map and $H_f$ and $W_f$ denote its height and width; ResNet consists of several modules of stacked convolutional layers, and $f_i$ is the feature map obtained from the $i$-th video frame after the first three such modules of ResNet;
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps. The step comprises:
(2-1) Construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high/low-frequency feature separation operations on the shallow visual feature map to obtain the multi-frequency feature maps; the high-frequency features characterize the contour regions of the feature map, the low-frequency features characterize its flat regions, and the medium-frequency features characterize its content regions;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply the fast Fourier transform to the shallow visual feature map $f_i$, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram $F_i$ of $f_i$. Translate the low-frequency part of $F_i$ to the center to obtain the shifted spectrogram $F_i^{shift}$, and determine its center position vector $(P, Q)$, where $P = (P_1, \dots, P_{C_f})$ is the vector of abscissa values of the channel centers, $Q = (Q_1, \dots, Q_{C_f})$ is the vector of their ordinate values, and the subscript $r$ denotes the channel index of $F_i^{shift}$;
Then multiply each element of $F_i^{shift}$ by the low-frequency transfer function $H_l(u_{r,a}, v_{r,b})$ to obtain the low-frequency shifted spectrogram $F_i^{l,shift}$. The Gaussian low-pass transfer function is $H_l(u_{r,a}, v_{r,b}) = \exp\left(-D^2(u_{r,a}, v_{r,b}) / (2 D_0^2)\right)$, where $l$ denotes the low-frequency signal, $a$ and $b$ denote the horizontal and vertical coordinates of a pixel with $0 \le a \le H_f$ and $0 \le b \le W_f$, $\exp(\cdot)$ denotes the exponential function, and $D_0$ is the preset standard deviation. Here $D(u_{r,a}, v_{r,b}) = \sqrt{u_{r,a}^2 + v_{r,b}^2}$ is the Euclidean distance from pixel $(a, b)$ of the $r$-th channel of $F_i^{shift}$ to the center $(P_r, Q_r)$; $u_{r,a}$ is the Euclidean distance from spectral position $(a, 0)$ of the $r$-th channel to $P_r$, and $v_{r,b}$ is the Euclidean distance from spectral position $(0, b)$ of the $r$-th channel to $Q_r$;
Similarly, multiply each element of $F_i^{shift}$ by the high-frequency transfer function $H_h(u_{r,a}, v_{r,b}) = 1 - H_l(u_{r,a}, v_{r,b})$ to obtain the high-frequency shifted spectrogram $F_i^{h,shift}$, where $h$ denotes the high-frequency signal;
Translate the low-frequency components of the spectrograms $F_i^{l,shift}$ and $F_i^{h,shift}$ from the center back to their original positions to obtain the low-frequency spectrogram $F_i^{l}$ and the high-frequency spectrogram $F_i^{h}$;
Finally, apply the inverse fast Fourier transform to $F_i^{l}$ and $F_i^{h}$, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map $f_i^{l}$ and the weak high-frequency feature map $f_i^{h}$;
(2-3) Following (2-2), perform a second high/low-frequency separation on the weak high-frequency feature map $f_i^{h}$ to obtain the strong high-frequency feature map $f_i^{hh}$ and the medium-high-frequency feature map $f_i^{hl}$, where $hh$ indicates that the feature map has been high-pass filtered twice and $hl$ indicates that it has been high-pass filtered once and then low-pass filtered once;
Following (2-2), perform a second high/low-frequency separation on the weak low-frequency feature map $f_i^{l}$ to obtain the strong low-frequency feature map $f_i^{ll}$ and the medium-low-frequency feature map $f_i^{lh}$, where $ll$ indicates that the feature map has been low-pass filtered twice and $lh$ indicates that it has been low-pass filtered once and then high-pass filtered once;
(2-4) Concatenate the medium-high-frequency feature map $f_i^{hl}$ and the medium-low-frequency feature map $f_i^{lh}$, apply a 1×1 convolution to obtain a compressed feature map, and down-sample it with a stride-2 max-pooling operation to obtain the medium-frequency feature map $f_i^{m} \in \mathbb{R}^{C_m \times \frac{H_f}{2} \times \frac{W_f}{2}}$, where $m$ denotes the medium-frequency signal and $C_m$ is the channel dimension of the medium-frequency feature map;
(2-5) Apply a 1×1 convolution to the strong low-frequency feature map $f_i^{ll}$ to obtain a compressed feature map, and down-sample it with a stride-4 max-pooling operation to obtain the low-frequency feature map $f_i^{low} \in \mathbb{R}^{C_l \times \frac{H_f}{4} \times \frac{W_f}{4}}$; apply a 1×1 convolution to the strong high-frequency feature map $f_i^{hh}$ to obtain the compressed high-frequency feature map $f_i^{high} \in \mathbb{R}^{C_h \times H_f \times W_f}$, where $C_h$ and $C_l$ denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively;
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps. The step comprises:
(3-1) Construct a dynamic hole convolution module consisting of a weight calculator and $K$ parallel hole convolution kernels, and input each multi-frequency feature map into the module to obtain the multi-frequency high-level semantic feature maps, comprising a low-frequency, a medium-frequency, and a high-frequency high-level semantic feature map;
(3-2) The dynamic hole convolution operates as follows: the low-frequency feature map $f_i^{low}$ is input to the weight calculator to obtain $K$ weights $w_1, \dots, w_K$, where $w_t$ denotes the weight of the $t$-th hole convolution, $0 \le w_t < 1$, and $\sum_{t=1}^{K} w_t = 1$. The weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, a second fully connected layer, and a Softmax function. Among the $K$ parallel hole convolution kernels $K_1, \dots, K_K$, $K_t$ denotes the $t$-th 3×3 hole convolution with hole rate 2. Each $K_t$ is multiplied by its corresponding weight $w_t$, and the $K$ weighted hole convolutions are summed to obtain the integrated hole convolution kernel $\tilde{K} = \sum_{t=1}^{K} w_t K_t$. The low-frequency feature map $f_i^{low}$ is then convolved with the integrated kernel $\tilde{K}$ to obtain the low-frequency high-level semantic feature map $s_i^{low}$, whose number of channels is twice $C_l$;
(3-3) Dynamic hole convolution modules are stacked serially, the output of the first module serving as the input of the second. Following (3-2), the medium-frequency feature map $f_i^{m}$ passes through two serial dynamic hole convolution modules to obtain the medium-frequency high-level semantic feature map $s_i^{m}$, whose number of channels is four times $C_m$; the high-frequency feature map $f_i^{high}$ passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map $s_i^{high}$, whose number of channels is eight times $C_h$;
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, i.e., an up-sampling convolution module, to obtain the segmentation mask of the video frame;
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module, and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result.
2. The video semantic segmentation method using multi-frequency dynamic hole convolution according to claim 1, characterized in that step (4) is specifically:
(4-1) Construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and produces a larger feature map by a convolution operation on the input small feature map;
(4-2) Concatenate the low-frequency high-level semantic feature map $s_i^{low}$, the medium-frequency high-level semantic feature map $s_i^{m}$, and the high-frequency high-level semantic feature map $s_i^{high}$ along the channel dimension to obtain the integrated high-level semantic feature map $t_i$.
3. The video semantic segmentation method using multi-frequency dynamic hole convolution according to claim 2, characterized in that step (5) is specifically:
(5-1) Establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution modules, and the decoder;
(5-2) Sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks $M_i$; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110718738.1A CN113538457B (en) | 2021-06-28 | 2021-06-28 | Video semantic segmentation method utilizing multi-frequency dynamic hole convolution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110718738.1A CN113538457B (en) | 2021-06-28 | 2021-06-28 | Video semantic segmentation method utilizing multi-frequency dynamic hole convolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113538457A CN113538457A (en) | 2021-10-22 |
CN113538457B true CN113538457B (en) | 2022-06-24 |
Family
ID=78125962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110718738.1A Active CN113538457B (en) | 2021-06-28 | 2021-06-28 | Video semantic segmentation method utilizing multi-frequency dynamic hole convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113538457B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114494297B (en) * | 2022-01-28 | 2022-12-06 | 杭州电子科技大学 | Adaptive video target segmentation method for processing multiple priori knowledge |
CN114240945B (en) * | 2022-02-28 | 2022-05-10 | 科大天工智能装备技术(天津)有限公司 | Bridge steel cable fracture detection method and system based on target segmentation |
CN114821432B (en) * | 2022-05-05 | 2022-12-02 | 杭州电子科技大学 | Video target segmentation anti-attack method based on discrete cosine transform |
CN116824139B (en) * | 2023-06-14 | 2024-03-22 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Endoscope polyp segmentation method based on boundary supervision and time sequence association |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108985269A (en) * | 2018-08-16 | 2018-12-11 | 东南大学 | Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure |
CN110276354A (en) * | 2019-05-27 | 2019-09-24 | 东南大学 | A kind of training of high-resolution Streetscape picture semantic segmentation and real time method for segmenting |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10147193B2 (en) * | 2017-03-10 | 2018-12-04 | TuSimple | System and method for semantic segmentation using hybrid dilated convolution (HDC) |
CN111210435B (en) * | 2019-12-24 | 2022-10-18 | 重庆邮电大学 | Image semantic segmentation method based on local and global feature enhancement module |
CN111860386B (en) * | 2020-07-27 | 2022-04-08 | 山东大学 | Video semantic segmentation method based on ConvLSTM convolutional neural network |
- 2021-06-28: CN application CN202110718738.1A filed; granted as patent CN113538457B (active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108985269A (en) * | 2018-08-16 | 2018-12-11 | 东南大学 | Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure |
CN110276354A (en) * | 2019-05-27 | 2019-09-24 | 东南大学 | A kind of training of high-resolution Streetscape picture semantic segmentation and real time method for segmenting |
Also Published As
Publication number | Publication date |
---|---|
CN113538457A (en) | 2021-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113538457B (en) | Video semantic segmentation method utilizing multi-frequency dynamic hole convolution | |
CN111242037B (en) | Lane line detection method based on structural information | |
CN112507997B (en) | Face super-resolution system based on multi-scale convolution and receptive field feature fusion | |
CN109190752B (en) | Image semantic segmentation method based on global features and local features of deep learning | |
CN111915592B (en) | Remote sensing image cloud detection method based on deep learning | |
CN109035149B (en) | License plate image motion blur removing method based on deep learning | |
CN113052210A (en) | Fast low-illumination target detection method based on convolutional neural network | |
CN113642634A (en) | Shadow detection method based on mixed attention | |
CN110059728B (en) | RGB-D image visual saliency detection method based on attention model | |
CN109034184B (en) | Grading ring detection and identification method based on deep learning | |
CN112396607A (en) | Streetscape image semantic segmentation method for deformable convolution fusion enhancement | |
CN110399840B (en) | Rapid lawn semantic segmentation and boundary detection method | |
CN113240697B (en) | Lettuce multispectral image foreground segmentation method | |
CN109508639B (en) | Road scene semantic segmentation method based on multi-scale porous convolutional neural network | |
CN112115871B (en) | High-low frequency interweaving edge characteristic enhancement method suitable for pedestrian target detection | |
CN115346071A (en) | Image classification method and system for high-confidence local feature and global feature learning | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
CN113392728B (en) | Target detection method based on SSA sharpening attention mechanism | |
CN111539434B (en) | Infrared weak and small target detection method based on similarity | |
CN117746130A (en) | Weak supervision deep learning classification method based on remote sensing image punctiform semantic tags | |
Yuan et al. | Graph neural network based multi-feature fusion for building change detection | |
CN115035377A (en) | Significance detection network system based on double-stream coding and interactive decoding | |
CN113780305A (en) | Saliency target detection method based on interaction of two clues | |
CN113610857B (en) | Apple grading method and system based on residual error network | |
CN113553919B (en) | Target frequency characteristic expression method, network and image classification method based on deep learning |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |