CN113538457B - Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
- Publication number: CN113538457B (application CN202110718738.1A)
- Authority: CN (China)
- Prior art keywords: frequency, convolution, feature map, low, video
- Legal status: Active
Classifications
- G06T7/11—Region-based segmentation (under G06T7/00 Image analysis; G06T7/10 Segmentation; Edge detection)
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F18/00 Pattern recognition; G06F18/21 Design or setup of recognition systems)
- G06N3/045—Combinations of networks (under G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture)
- G06N3/08—Learning methods (under G06N3/02 Neural networks)
- G06F2218/12—Classification; Matching (under G06F2218/00 Aspects of pattern recognition specially adapted for signal processing)
- G06T2207/10016—Video; Image sequence (under G06T2207/00 Indexing scheme for image analysis; G06T2207/10 Image acquisition modality)
Abstract
The invention discloses a video semantic segmentation method utilizing multi-frequency dynamic hole convolution. First, sampled frame images of the video data are enhanced and a shallow visual feature map is extracted by an encoder. A feature frequency separation module is then constructed to obtain the multi-frequency feature maps of each video frame; these are input into a dynamic hole convolution module to obtain the corresponding multi-frequency high-level semantic feature maps, and a segmentation mask of the video frame is obtained through an up-sampling convolutional decoder. The model is trained iteratively with stochastic gradient descent until convergence, after which a new video can be input into the model to obtain its semantic segmentation result. The method separates the feature map of a video frame into different frequencies that characterize the changes of different visual regions, which reduces low-frequency spatial redundancy and computational complexity; the dynamic hole convolution adaptively enlarges the receptive field of the multi-frequency feature maps and improves the model's ability to discriminate between the semantic classes of the video, thereby yielding better video semantic segmentation results.
Description
Technical Field
The invention belongs to the technical field of computer vision, in particular to the field of semantic segmentation in video processing, and relates to a video semantic segmentation method utilizing multi-frequency dynamic hole convolution.
Background
With the increasing number of vehicles of all types, driving safety is a major concern for governments and the public. Drivers of large vehicles, in particular, are prone to visual blind spots, which poses great hidden dangers to driving safety. In recent years, automatic driving technology has attracted much interest in industry, and increasing research effort has been invested in this field. Efficient visual understanding can guarantee the safety of automatic driving, and video semantic segmentation is one of its core technologies. Video semantic segmentation aims to assign pixel-level class labels to temporally correlated video frames, producing a pixel-by-pixel class mask matrix of the same size as the original video frame; it can be widely applied in machine vision, video surveillance, unmanned aerial vehicle reconnaissance, automatic driving, and other fields. For example, in an automatic driving environment, segmenting objects such as roads, pedestrians, and other vehicles in the vehicle's visual scene at the pixel level yields object region information more accurate than bounding boxes, providing the automatic driving system with more precise visual perception so that it can avoid obstacles such as pedestrians and vehicles and ensure driving safety. Currently, the main challenges in video semantic segmentation include the high computational complexity of models, the long processing time for high-resolution video frames, and the difficulty of deploying models in real-time environments.
Traditional semantic segmentation methods mainly fall into the following categories: thresholding, edge-based segmentation, super-pixel clustering, and the like. Threshold segmentation compares the gray value of each pixel of the image with a threshold; pixels whose gray values exceed the threshold are judged foreground and the rest background, but the approach applies only to grayscale images. Edge-based segmentation first performs edge detection on the image, and pixels within the same edge are taken to represent the same object; its drawback is that segmentation precision is limited by the edge detection algorithm. Super-pixel clustering aggregates similar super-pixel blocks to delineate the same object; its drawback is that super-pixel formation is constrained by pixel colors and region textures, so different parts of the same object are easily split into multiple super-pixels, causing segmentation errors. In recent years, deep neural networks have become popular for their strong feature extraction capability; a typical method uses a convolutional neural network as an encoder to extract abstract semantic information from a video frame and obtains a semantic segmentation mask through layer-by-layer up-sampling in a decoder. However, convolutional layers extract only local semantic information from the frame image and struggle to characterize the global scene. Spatial pyramid pooling has therefore been applied to semantic segmentation: multiple parallel pooling operations are applied to the encoder's feature map to obtain compressed feature maps of different sizes, capturing global scene features at multiple receptive-field scales; these are up-sampled back to the initial feature-map size and concatenated into an overall feature map, from which a decoder finally produces the semantic segmentation mask and thus the video semantic segmentation result. (The threshold method is illustrated by the minimal sketch below.)
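A minimal sketch of the classical threshold method described above; the threshold value and toy data are arbitrary choices, not from the patent:

```python
import numpy as np

def threshold_segment(gray, thresh=128):
    """Classical threshold segmentation: pixels brighter than `thresh`
    are labeled foreground (1), all others background (0). Only
    meaningful for single-channel grayscale images."""
    return (gray > thresh).astype(np.uint8)

# Toy usage on a 2x3 grayscale patch.
mask = threshold_segment(np.array([[10, 200, 90], [130, 40, 255]]))
# mask == [[0, 1, 0], [1, 0, 1]]
```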
Existing semantic segmentation methods still have many shortcomings: 1) spatial pyramid pooling considers local and global spatio-temporal structure simultaneously, which makes segmentation more reliable, but applying max or average pooling to high-resolution feature maps brings poor fault tolerance, weak generalization, and high computational complexity; 2) attention mechanisms strengthen long-term semantic dependencies between feature maps, but the resulting models are too large and memory-hungry, which hinders real-time deployment; 3) the Transformer encoder, widely used in natural language processing as a feature extractor, takes a one-dimensional sequence of embedded representations of a two-dimensional image as input and stacks self-attention and multi-layer perceptrons to capture long-term dependencies between video frames, but the lack of weight sharing makes the parameter count huge and self-attention computationally expensive, so real-time performance is hard to guarantee. Moreover, most segmentation methods cannot effectively balance precision and real-time performance and therefore cannot meet the requirements of practical segmentation tasks. Given the high computational complexity and poor generalization of existing segmentation models, a method that guarantees real-time performance while achieving high semantic segmentation precision is urgently needed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video semantic segmentation method using multi-frequency dynamic hole convolution. It separates the feature map into multiple frequencies via the Fourier transform; the multi-frequency feature maps characterize the differing gray-value variations of different visual regions, reducing low-frequency spatial redundancy and computational complexity. Meanwhile, a dynamic hole convolution is designed to adaptively enlarge the receptive field of the multi-frequency feature maps and to improve the model's ability to discriminate between the semantic classes of the video from both global and local perspectives, thereby improving video semantic segmentation precision.
The method firstly acquires a video data set, and then performs the following operations:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, i.e., a deep convolutional neural network, to obtain the corresponding shallow visual feature maps;
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps;
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps;
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, i.e., an up-sampling convolution module, to obtain the segmentation mask of the video frame;
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module, and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result.
Further, the step (1) is specifically:
(1-1) Uniformly sample each video at a rate of 10-15 frames/second and apply enhancement operations to the video frames to obtain a sequence of $N$ video frames, denoted $I = \{I_i\}_{i=1}^{N}$ with $I_i \in \mathbb{R}^{3 \times H \times W}$, where $I_i$ denotes the $i$-th video frame, $\mathbb{R}$ denotes the real number field, 3 is the number of RGB channels, and $H$ and $W$ denote the height and width of a video frame;
(1-2) Using the convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract from the video frame sequence $I$ the shallow visual feature maps $f_i \in \mathbb{R}^{C_f \times H_f \times W_f}$, where $C_f$ denotes the number of channels of the feature map and $H_f$ and $W_f$ denote its height and width. ResNet consists of several modules of stacked convolutional layers; $f_i$ is the feature map obtained from the $i$-th video frame after the first three such modules of ResNet.
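A minimal sketch of this encoder step, assuming PyTorch and the torchvision >= 0.13 weights API, with ResNet-50 as the backbone (the patent names only "ResNet"; the first three stages of ResNet-50 yield the 1024 channels mentioned in the embodiment):

```python
import torch
import torch.nn as nn
import torchvision

# First three convolutional stages of an ImageNet-pretrained ResNet-50
# (children kept: conv1, bn1, relu, maxpool, layer1, layer2, layer3).
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
encoder = nn.Sequential(*list(backbone.children())[:-3])

frames = torch.randn(4, 3, 512, 1024)   # N enhanced sample frames (RGB)
with torch.no_grad():
    shallow = encoder(frames)           # (4, 1024, 32, 64): C_f=1024, H_f=H/16, W_f=W/16
```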
Further, the step (2) is specifically:
(2-1) Construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high/low-frequency feature separation operations on the shallow visual feature map to obtain the multi-frequency feature maps; the high-frequency features characterize the contour regions of the feature map, the low-frequency features characterize its flat regions, and the medium-frequency features characterize its content regions;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply the fast Fourier transform to the shallow visual feature map $f_i$, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram $F_i$ of $f_i$. Translate the low-frequency part of $F_i$ to the center to obtain the shifted spectrogram $F_i^{shift}$, and determine its center position vector $(P, Q)$, where $P = (P_1, \dots, P_{C_f})$ is the vector of abscissa values of the channel centers, $Q = (Q_1, \dots, Q_{C_f})$ is the vector of their ordinate values, and the subscript $r$ denotes the channel index of $F_i^{shift}$;
Then multiply each element of $F_i^{shift}$ by the low-frequency transfer function $H_l(u_{r,a}, v_{r,b})$ to obtain the low-frequency shifted spectrogram $F_i^{l,shift}$. The Gaussian low-pass transfer function is $H_l(u_{r,a}, v_{r,b}) = \exp\left(-D^2(u_{r,a}, v_{r,b}) / (2 D_0^2)\right)$, where $l$ denotes the low-frequency signal, $a$ and $b$ denote the horizontal and vertical coordinates of a pixel with $0 \le a \le H_f$ and $0 \le b \le W_f$, $\exp(\cdot)$ denotes the exponential function, and $D_0$ is the preset standard deviation. Here $D(u_{r,a}, v_{r,b}) = \sqrt{u_{r,a}^2 + v_{r,b}^2}$ is the Euclidean distance from pixel $(a, b)$ of the $r$-th channel of $F_i^{shift}$ to the center $(P_r, Q_r)$; $u_{r,a}$ is the Euclidean distance from spectral position $(a, 0)$ of the $r$-th channel to $P_r$, and $v_{r,b}$ is the Euclidean distance from spectral position $(0, b)$ of the $r$-th channel to $Q_r$;
Similarly, multiply each element of $F_i^{shift}$ by the high-frequency transfer function $H_h(u_{r,a}, v_{r,b}) = 1 - H_l(u_{r,a}, v_{r,b})$ to obtain the high-frequency shifted spectrogram $F_i^{h,shift}$, where $h$ denotes the high-frequency signal;
Translate the low-frequency components of the spectrograms $F_i^{l,shift}$ and $F_i^{h,shift}$ from the center back to their original positions to obtain the low-frequency spectrogram $F_i^{l}$ and the high-frequency spectrogram $F_i^{h}$;
Finally, apply the inverse fast Fourier transform to $F_i^{l}$ and $F_i^{h}$, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map $f_i^{l}$ and the weak high-frequency feature map $f_i^{h}$;
(2-3) Following (2-2), perform a second high/low-frequency separation on the weak high-frequency feature map $f_i^{h}$ to obtain the strong high-frequency feature map $f_i^{hh}$ and the medium-high-frequency feature map $f_i^{hl}$, where $hh$ indicates that the feature map has been high-pass filtered twice and $hl$ indicates that it has been high-pass filtered once and then low-pass filtered once;
Following (2-2), perform a second high/low-frequency separation on the weak low-frequency feature map $f_i^{l}$ to obtain the strong low-frequency feature map $f_i^{ll}$ and the medium-low-frequency feature map $f_i^{lh}$, where $ll$ indicates that the feature map has been low-pass filtered twice and $lh$ indicates that it has been low-pass filtered once and then high-pass filtered once;
(2-4) Concatenate the medium-high-frequency feature map $f_i^{hl}$ and the medium-low-frequency feature map $f_i^{lh}$, apply a 1×1 convolution to obtain a compressed feature map, and down-sample it with a stride-2 max-pooling operation to obtain the medium-frequency feature map $f_i^{m} \in \mathbb{R}^{C_m \times \frac{H_f}{2} \times \frac{W_f}{2}}$, where $m$ denotes the medium-frequency signal and $C_m$ is the channel dimension of the medium-frequency feature map;
(2-5) Apply a 1×1 convolution to the strong low-frequency feature map $f_i^{ll}$ to obtain a compressed feature map, and down-sample it with a stride-4 max-pooling operation to obtain the low-frequency feature map $f_i^{low} \in \mathbb{R}^{C_l \times \frac{H_f}{4} \times \frac{W_f}{4}}$; apply a 1×1 convolution to the strong high-frequency feature map $f_i^{hh}$ to obtain the compressed high-frequency feature map $f_i^{high} \in \mathbb{R}^{C_h \times H_f \times W_f}$, where $C_h$ and $C_l$ denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively.
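A minimal PyTorch sketch of steps (2-2)-(2-5), under the following assumptions: after fftshift the spectral center $(P_r, Q_r)$ is the same for every channel, and the names `split_high_low`, `multi_frequency`, `d0`, and the 1×1 compression convolutions passed in as `conv_mid`, `conv_low`, `conv_high` are illustrative, not names from the patent:

```python
import torch
import torch.nn.functional as F

def split_high_low(x, d0=10.0):
    # One high/low separation from (2-2): FFT -> shift low frequencies
    # to the centre -> multiply by a Gaussian low-pass transfer function
    # H_l (and its complement H_h = 1 - H_l) -> shift back -> inverse FFT.
    # d0 is the standard deviation D_0 (10 in the embodiment).
    _, _, h, w = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    ys = torch.arange(h, dtype=torch.float32, device=x.device) - h // 2
    xs = torch.arange(w, dtype=torch.float32, device=x.device) - w // 2
    d2 = ys[:, None] ** 2 + xs[None, :] ** 2        # squared distance to centre (P, Q)
    h_low = torch.exp(-d2 / (2 * d0 ** 2))          # Gaussian low-pass H_l
    h_high = 1.0 - h_low                            # high-pass H_h
    back = lambda s: torch.fft.ifft2(torch.fft.ifftshift(s, dim=(-2, -1))).real
    return back(spec * h_low), back(spec * h_high)  # weak low, weak high

def multi_frequency(f, conv_mid, conv_low, conv_high):
    # Three separations as in (2-3)-(2-5); conv_* are the 1x1
    # channel-compression layers, whose widths are free design choices.
    f_l, f_h = split_high_low(f)        # weak low / weak high
    f_hl, f_hh = split_high_low(f_h)    # medium-high (h then l), strong high
    f_ll, f_lh = split_high_low(f_l)    # strong low, medium-low (l then h)
    mid = F.max_pool2d(conv_mid(torch.cat([f_hl, f_lh], dim=1)), 2)  # stride-2 pool, (2-4)
    low = F.max_pool2d(conv_low(f_ll), 4)                            # stride-4 pool, (2-5)
    high = conv_high(f_hh)                                           # full resolution
    return low, mid, high
```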
Still further, the step (3) is specifically:
(3-1) Construct a dynamic hole convolution module consisting of a weight calculator and $K$ parallel hole convolution kernels, and input each multi-frequency feature map into the module to obtain the multi-frequency high-level semantic feature maps, comprising a low-frequency, a medium-frequency, and a high-frequency high-level semantic feature map;
(3-2) The dynamic hole convolution operates as follows: the low-frequency feature map $f_i^{low}$ is input to the weight calculator to obtain $K$ weights $w_1, \dots, w_K$, where $w_t$ denotes the weight of the $t$-th hole convolution, $0 \le w_t < 1$, and $\sum_{t=1}^{K} w_t = 1$. The weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, a second fully connected layer, and a Softmax function. Among the $K$ parallel hole convolution kernels $K_1, \dots, K_K$, $K_t$ denotes the $t$-th 3×3 hole convolution with hole rate 2. Each $K_t$ is multiplied by its corresponding weight $w_t$, and the $K$ weighted hole convolutions are summed to obtain the integrated hole convolution kernel $\tilde{K} = \sum_{t=1}^{K} w_t K_t$, exploiting the parameters of multiple parallel hole convolutions to capture different receptive fields. The low-frequency feature map $f_i^{low}$ is then convolved with the integrated kernel $\tilde{K}$ to obtain the low-frequency high-level semantic feature map $s_i^{low}$, whose number of channels is twice $C_l$;
(3-3) Dynamic hole convolution modules are stacked serially, the output of the first module serving as the input of the second. Following (3-2), the medium-frequency feature map $f_i^{m}$ passes through two serial dynamic hole convolution modules to obtain the medium-frequency high-level semantic feature map $s_i^{m}$, whose number of channels is four times $C_m$; similarly, the high-frequency feature map $f_i^{high}$ passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map $s_i^{high}$, whose number of channels is eight times $C_h$.
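A minimal sketch of the dynamic hole convolution module of (3-2), assuming PyTorch; the values of K and the hidden width of the weight calculator, as well as the batch-averaged kernel weights, are simplifying assumptions (a strict per-sample dynamic convolution would need a grouped-convolution trick):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicHoleConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=4, dilation=2, hidden=64):
        super().__init__()
        self.dilation = dilation
        # Weight calculator of (3-2): GAP -> FC -> ReLU -> FC -> Softmax.
        self.weight_calc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, hidden), nn.ReLU(),
            nn.Linear(hidden, k), nn.Softmax(dim=1))
        # K parallel 3x3 hole-convolution kernels K_t.
        self.kernels = nn.Parameter(0.02 * torch.randn(k, out_ch, in_ch, 3, 3))

    def forward(self, x):
        # Kernel weights averaged over the batch for simplicity.
        w = self.weight_calc(x).mean(dim=0)                            # (K,)
        kernel = (w[:, None, None, None, None] * self.kernels).sum(0)  # integrated kernel
        return F.conv2d(x, kernel, padding=self.dilation, dilation=self.dilation)
```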
Still further, the step (4) is specifically:
(4-1) Construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and produces a larger feature map by a convolution operation on the input small feature map;
(4-2) Concatenate the low-frequency high-level semantic feature map $s_i^{low}$, the medium-frequency high-level semantic feature map $s_i^{m}$, and the high-frequency high-level semantic feature map $s_i^{high}$ along the channel dimension to obtain the integrated high-level semantic feature map $t_i$;
(4-3) Input the integrated semantic feature map $t_i$ into the decoder to obtain the segmentation mask $M_i \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the total number of semantic categories; the category assigned to each pixel of the video frame is the one with the highest probability over all categories.
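A minimal sketch of the decoder of (4-1)-(4-3). The three multi-frequency maps have different spatial resolutions, so this sketch assumes they are first interpolated to a common size before channel-wise concatenation; the intermediate channel widths (256, 128) are arbitrary choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, in_ch, num_classes):
        super().__init__()
        # Three transposed-convolution layers as in (4-1); a net x16
        # upsampling undoes the encoder's /16 reduction.
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, num_classes, 4, stride=4))

    def forward(self, low, mid, high):
        # (4-2): bring the three maps to a common resolution, then
        # concatenate along the channel dimension.
        size = high.shape[-2:]
        t = torch.cat([F.interpolate(low, size=size),
                       F.interpolate(mid, size=size), high], dim=1)
        return self.up(t)  # (N, C, H, W) scores; argmax over C gives the mask
```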
Still further, the step (5) is specifically:
(5-1) Establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution modules, and the decoder;
(5-2) Sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks $M_i$, $i = 1, \dots, N$; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
(5-3) Input each frame of a new video into the trained model and, following (5-2), sequentially output the corresponding segmentation results $M \in \mathbb{R}^{C \times H \times W}$, where the first dimension indexes the semantic categories.
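A minimal sketch of the training loop of step (5): pixel-wise cross-entropy optimized by stochastic gradient descent. The one-layer stand-in model and random data are placeholders for the full segmentation network and data loader:

```python
import torch
import torch.nn as nn

num_classes = 19
model = nn.Conv2d(3, num_classes, 1)            # stand-in for the full model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for step in range(100):
    frames = torch.randn(2, 3, 64, 64)          # sampled, enhanced video frames
    labels = torch.randint(0, num_classes, (2, 64, 64))
    loss = criterion(model(frames), labels)     # (N,C,H,W) scores vs (N,H,W) labels
    optimizer.zero_grad()
    loss.backward()                             # gradient back-propagation
    optimizer.step()                            # iterate until convergence
```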
The method performs semantic segmentation on video using a feature frequency separation mechanism and dynamic hole convolution modules, and has the following characteristics: 1) unlike existing methods that process the high-resolution feature map uniformly, the feature frequency separation module designed by the invention separates the feature map into features of different frequencies, where high-frequency features represent regions of large variation, low-frequency features represent regions of small variation, and medium-frequency features represent regions of moderate variation; processing the different frequencies separately lets the network learn more targeted semantic features; 2) the dynamic hole convolution module dynamically assigns different weights to several parallel hole convolutions according to the input features, without increasing network depth or width, so that the hole convolutions are effectively fused and more effective semantic features are extracted; 3) most existing methods improve segmentation precision by stacking refinement modules and deepening the network, while neglecting problems such as model redundancy and low segmentation speed, which the present design avoids.
The method is suitable for video semantic segmentation with strict real-time requirements and has the following advantages: 1) the feature frequency separation module effectively separates and distinguishes the different frequency components of the feature map, improving processing efficiency; 2) the dynamic hole convolution module fuses multiple hole convolutions without significantly increasing network complexity, capturing more effective semantic information in the feature map and yielding more accurate segmentation results; 3) features of different frequencies are processed in a targeted manner by dynamic hole convolution modules of different depths, greatly reducing the computational cost of the model and increasing its video semantic segmentation speed. The invention can be applied to practical tasks such as intelligent surveillance, unmanned aerial vehicle reconnaissance, machine vision, and automatic driving.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, the video semantic segmentation method using multi-frequency dynamic hole convolution first samples a given video and inputs the frames into an encoder composed of a convolutional neural network to obtain shallow visual feature maps of the video frames. A feature frequency separation module, composed of the Fourier transform, Gaussian filters, and the inverse Fourier transform, then separates multi-frequency feature maps from the shallow visual feature map. Next, the multi-frequency feature maps are processed at different depths by dynamic hole convolutions, each composed of a weight calculator and several parallel hole convolution kernels, to obtain multi-frequency high-level semantic feature maps. Finally, the multi-frequency high-level semantic feature maps are concatenated, input into a decoder, and up-sampled to obtain the semantic segmentation result. The method extends the idea of separable image frequencies to the shallow visual feature map, distinguishing visual regions of different frequencies; processing the different-frequency feature maps with dynamic hole convolutions of different depths enlarges their receptive fields and reduces the model's computational complexity, so high semantic segmentation precision can be obtained in real time.
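The following sketch assembles the pipeline of fig. 1 from the module sketches given earlier (the `encoder`, `multi_frequency`, `DynamicHoleConv`, and `Decoder` definitions above, assumed to be in scope). The channel widths `c_l`, `c_m`, `c_h` are assumptions; the branch depths (one dynamic hole convolution module for the low-frequency map, two serial for medium, four serial for high) and the channel multipliers (x2, x4, x8) follow the text:

```python
import torch.nn as nn

class MFDHCSegmenter(nn.Module):
    def __init__(self, num_classes, c_f=1024, c_l=128, c_m=128, c_h=64):
        super().__init__()
        self.encoder = encoder                           # ResNet stages, step (1)
        self.conv_mid = nn.Conv2d(2 * c_f, c_m, 1)       # 1x1 compressions, (2-4)/(2-5)
        self.conv_low = nn.Conv2d(c_f, c_l, 1)
        self.conv_high = nn.Conv2d(c_f, c_h, 1)
        self.low_branch = DynamicHoleConv(c_l, 2 * c_l)  # channels x2
        self.mid_branch = nn.Sequential(                 # channels x4
            DynamicHoleConv(c_m, 2 * c_m), DynamicHoleConv(2 * c_m, 4 * c_m))
        self.high_branch = nn.Sequential(                # channels x8
            DynamicHoleConv(c_h, 2 * c_h), DynamicHoleConv(2 * c_h, 4 * c_h),
            DynamicHoleConv(4 * c_h, 8 * c_h), DynamicHoleConv(8 * c_h, 8 * c_h))
        self.decoder = Decoder(2 * c_l + 4 * c_m + 8 * c_h, num_classes)

    def forward(self, frames):
        f = self.encoder(frames)                         # shallow visual features
        low, mid, high = multi_frequency(f, self.conv_mid, self.conv_low, self.conv_high)
        return self.decoder(self.low_branch(low),
                            self.mid_branch(mid),
                            self.high_branch(high))
```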
The method comprises the steps of firstly acquiring a video data set, and then performing the following operations:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, i.e., a deep convolutional neural network, to obtain the corresponding shallow visual feature maps. The step comprises:
(1-1) Uniformly sample each video at a rate of 10 frames/second and apply enhancement operations to the video frames to obtain a sequence of $N$ video frames, denoted $I = \{I_i\}_{i=1}^{N}$ with $I_i \in \mathbb{R}^{3 \times H \times W}$, where $I_i$ denotes the $i$-th video frame, $\mathbb{R}$ denotes the real number field, 3 is the number of RGB channels, and $H$ and $W$ denote the height and width of a video frame;
(1-2) Using the convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract from the video frame sequence $I$ the shallow visual feature maps $f_i \in \mathbb{R}^{C_f \times H_f \times W_f}$, where $C_f$ denotes the number of channels of the feature map (1024 in this embodiment) and $H_f$ and $W_f$ denote its height and width. ResNet consists of several modules of stacked convolutional layers; $f_i$ is the feature map obtained from the $i$-th video frame after the first three such modules of ResNet.
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps. The step comprises:
(2-1) Construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high/low-frequency feature separation operations on the shallow visual feature map to obtain the multi-frequency feature maps; the high-frequency features characterize the contour regions of the feature map, the low-frequency features characterize its flat regions, and the medium-frequency features characterize its content regions;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply the fast Fourier transform to the shallow visual feature map $f_i$, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram $F_i$ of $f_i$. Translate the low-frequency part of $F_i$ to the center to obtain the shifted spectrogram $F_i^{shift}$, and determine its center position vector $(P, Q)$, where $P = (P_1, \dots, P_{C_f})$ is the vector of abscissa values of the channel centers, $Q = (Q_1, \dots, Q_{C_f})$ is the vector of their ordinate values, and the subscript $r$ denotes the channel index of $F_i^{shift}$;
Then multiply each element of $F_i^{shift}$ by the low-frequency transfer function $H_l(u_{r,a}, v_{r,b})$ to obtain the low-frequency shifted spectrogram $F_i^{l,shift}$. The Gaussian low-pass transfer function is $H_l(u_{r,a}, v_{r,b}) = \exp\left(-D^2(u_{r,a}, v_{r,b}) / (2 D_0^2)\right)$, where $l$ denotes the low-frequency signal, $a$ and $b$ denote the horizontal and vertical coordinates of a pixel with $0 \le a \le H_f$ and $0 \le b \le W_f$, $\exp(\cdot)$ denotes the exponential function, and $D_0$ is the preset standard deviation (10 in this embodiment). Here $D(u_{r,a}, v_{r,b}) = \sqrt{u_{r,a}^2 + v_{r,b}^2}$ is the Euclidean distance from pixel $(a, b)$ of the $r$-th channel of $F_i^{shift}$ to the center $(P_r, Q_r)$; $u_{r,a}$ is the Euclidean distance from spectral position $(a, 0)$ of the $r$-th channel to $P_r$, and $v_{r,b}$ is the Euclidean distance from spectral position $(0, b)$ of the $r$-th channel to $Q_r$;
Similarly, multiply each element of $F_i^{shift}$ by the high-frequency transfer function $H_h(u_{r,a}, v_{r,b}) = 1 - H_l(u_{r,a}, v_{r,b})$ to obtain the high-frequency shifted spectrogram $F_i^{h,shift}$, where $h$ denotes the high-frequency signal;
Translate the low-frequency components of the spectrograms $F_i^{l,shift}$ and $F_i^{h,shift}$ from the center back to their original positions to obtain the low-frequency spectrogram $F_i^{l}$ and the high-frequency spectrogram $F_i^{h}$;
Finally, apply the inverse fast Fourier transform to $F_i^{l}$ and $F_i^{h}$, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map $f_i^{l}$ and the weak high-frequency feature map $f_i^{h}$;
(2-3) Following (2-2), perform a second high/low-frequency separation on the weak high-frequency feature map $f_i^{h}$ to obtain the strong high-frequency feature map $f_i^{hh}$ and the medium-high-frequency feature map $f_i^{hl}$, where $hh$ indicates that the feature map has been high-pass filtered twice and $hl$ indicates that it has been high-pass filtered once and then low-pass filtered once;
Following (2-2), perform a second high/low-frequency separation on the weak low-frequency feature map $f_i^{l}$ to obtain the strong low-frequency feature map $f_i^{ll}$ and the medium-low-frequency feature map $f_i^{lh}$, where $ll$ indicates that the feature map has been low-pass filtered twice and $lh$ indicates that it has been low-pass filtered once and then high-pass filtered once;
(2-4) Concatenate the medium-high-frequency feature map $f_i^{hl}$ and the medium-low-frequency feature map $f_i^{lh}$, apply a 1×1 convolution to obtain a compressed feature map, and down-sample it with a stride-2 max-pooling operation to obtain the medium-frequency feature map $f_i^{m} \in \mathbb{R}^{C_m \times \frac{H_f}{2} \times \frac{W_f}{2}}$, where $m$ denotes the medium-frequency signal and $C_m$ is the channel dimension of the medium-frequency feature map;
(2-5) Apply a 1×1 convolution to the strong low-frequency feature map $f_i^{ll}$ to obtain a compressed feature map, and down-sample it with a stride-4 max-pooling operation to obtain the low-frequency feature map $f_i^{low} \in \mathbb{R}^{C_l \times \frac{H_f}{4} \times \frac{W_f}{4}}$; apply a 1×1 convolution to the strong high-frequency feature map $f_i^{hh}$ to obtain the compressed high-frequency feature map $f_i^{high} \in \mathbb{R}^{C_h \times H_f \times W_f}$, where $C_h$ and $C_l$ denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively.
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps. The step comprises:
(3-1) Construct a dynamic hole convolution module consisting of a weight calculator and $K$ parallel hole convolution kernels, and input each multi-frequency feature map into the module to obtain the multi-frequency high-level semantic feature maps, comprising a low-frequency, a medium-frequency, and a high-frequency high-level semantic feature map;
(3-2) The dynamic hole convolution operates as follows: the low-frequency feature map $f_i^{low}$ is input to the weight calculator to obtain $K$ weights $w_1, \dots, w_K$, where $w_t$ denotes the weight of the $t$-th hole convolution, $0 \le w_t < 1$, and $\sum_{t=1}^{K} w_t = 1$. The weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, a second fully connected layer, and a Softmax function. Among the $K$ parallel hole convolution kernels $K_1, \dots, K_K$, $K_t$ denotes the $t$-th 3×3 hole convolution with hole rate 2. Each $K_t$ is multiplied by its corresponding weight $w_t$, and the $K$ weighted hole convolutions are summed to obtain the integrated hole convolution kernel $\tilde{K} = \sum_{t=1}^{K} w_t K_t$, exploiting the parameters of multiple parallel hole convolutions to capture different receptive fields. The low-frequency feature map $f_i^{low}$ is then convolved with the integrated kernel $\tilde{K}$ to obtain the low-frequency high-level semantic feature map $s_i^{low}$, whose number of channels is twice $C_l$;
(3-3) Dynamic hole convolution modules are stacked serially, the output of the first module serving as the input of the second. Following (3-2), the medium-frequency feature map $f_i^{m}$ passes through two serial dynamic hole convolution modules to obtain the medium-frequency high-level semantic feature map $s_i^{m}$, whose number of channels is four times $C_m$; similarly, the high-frequency feature map $f_i^{high}$ passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map $s_i^{high}$, whose number of channels is eight times $C_h$.
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, i.e., an up-sampling convolution module, to obtain the segmentation mask of the video frame. The step comprises:
(4-1) Construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and produces a larger feature map by a convolution operation on the input small feature map;
(4-2) Concatenate the low-frequency high-level semantic feature map $s_i^{low}$, the medium-frequency high-level semantic feature map $s_i^{m}$, and the high-frequency high-level semantic feature map $s_i^{high}$ along the channel dimension to obtain the integrated high-level semantic feature map $t_i$;
(4-3) Input the integrated semantic feature map $t_i$ into the decoder to obtain the segmentation mask $M_i \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the total number of semantic categories; the category assigned to each pixel of the video frame is the one with the highest probability over all categories.
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module, and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result. The step comprises:
(5-1) Establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution modules, and the decoder;
(5-2) Sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks $M_i$, $i = 1, \dots, N$; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
(5-3) Input each frame of a new video into the trained model and, following (5-2), sequentially output the corresponding segmentation results $M \in \mathbb{R}^{C \times H \times W}$, where the first dimension indexes the semantic categories.
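A minimal inference sketch for (5-3), reusing the `MFDHCSegmenter` sketch above; `read_frames` is a hypothetical 10 fps frame sampler, not an API from the patent:

```python
import torch

model = MFDHCSegmenter(num_classes=19)   # trained weights would be loaded here
model.eval()
with torch.no_grad():
    for frame in read_frames("new_video.mp4", fps=10):   # (3, H, W) tensors
        scores = model(frame.unsqueeze(0))               # (1, C, H, W)
        mask = scores.argmax(dim=1)[0]                   # (H, W) semantic labels
```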
The embodiment described here is only an example of the implementation of the inventive concept; the protection scope of the invention should not be considered limited to the specific forms set forth in the embodiment, and also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.
Claims (3)
1. A video semantic segmentation method using multi-frequency dynamic hole convolution, characterized in that a video data set is first acquired and the following operations are then performed:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, i.e., a deep convolutional neural network, to obtain the corresponding shallow visual feature maps. The step comprises:
(1-1) Uniformly sample each video at a rate of 10-15 frames/second and apply enhancement operations to the video frames to obtain a sequence of $N$ video frames, denoted $I = \{I_i\}_{i=1}^{N}$ with $I_i \in \mathbb{R}^{3 \times H \times W}$, where $I_i$ denotes the $i$-th video frame, $\mathbb{R}$ denotes the real number field, 3 is the number of RGB channels, and $H$ and $W$ denote the height and width of a video frame;
(1-2) Using the convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract from the video frame sequence $I$ the shallow visual feature maps $f_i \in \mathbb{R}^{C_f \times H_f \times W_f}$, where $C_f$ denotes the number of channels of the feature map and $H_f$ and $W_f$ denote its height and width; ResNet consists of several modules of stacked convolutional layers, and $f_i$ is the feature map obtained from the $i$-th video frame after the first three such modules of ResNet;
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps. The step comprises:
(2-1) Construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high/low-frequency feature separation operations on the shallow visual feature map to obtain the multi-frequency feature maps; the high-frequency features characterize the contour regions of the feature map, the low-frequency features characterize its flat regions, and the medium-frequency features characterize its content regions;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply the fast Fourier transform to the shallow visual feature map $f_i$, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram $F_i$ of $f_i$. Translate the low-frequency part of $F_i$ to the center to obtain the shifted spectrogram $F_i^{shift}$, and determine its center position vector $(P, Q)$, where $P = (P_1, \dots, P_{C_f})$ is the vector of abscissa values of the channel centers, $Q = (Q_1, \dots, Q_{C_f})$ is the vector of their ordinate values, and the subscript $r$ denotes the channel index of $F_i^{shift}$;
Then multiply each element of $F_i^{shift}$ by the low-frequency transfer function $H_l(u_{r,a}, v_{r,b})$ to obtain the low-frequency shifted spectrogram $F_i^{l,shift}$. The Gaussian low-pass transfer function is $H_l(u_{r,a}, v_{r,b}) = \exp\left(-D^2(u_{r,a}, v_{r,b}) / (2 D_0^2)\right)$, where $l$ denotes the low-frequency signal, $a$ and $b$ denote the horizontal and vertical coordinates of a pixel with $0 \le a \le H_f$ and $0 \le b \le W_f$, $\exp(\cdot)$ denotes the exponential function, and $D_0$ is the preset standard deviation. Here $D(u_{r,a}, v_{r,b}) = \sqrt{u_{r,a}^2 + v_{r,b}^2}$ is the Euclidean distance from pixel $(a, b)$ of the $r$-th channel of $F_i^{shift}$ to the center $(P_r, Q_r)$; $u_{r,a}$ is the Euclidean distance from spectral position $(a, 0)$ of the $r$-th channel to $P_r$, and $v_{r,b}$ is the Euclidean distance from spectral position $(0, b)$ of the $r$-th channel to $Q_r$;
Similarly, multiply each element of $F_i^{shift}$ by the high-frequency transfer function $H_h(u_{r,a}, v_{r,b}) = 1 - H_l(u_{r,a}, v_{r,b})$ to obtain the high-frequency shifted spectrogram $F_i^{h,shift}$, where $h$ denotes the high-frequency signal;
Translate the low-frequency components of the spectrograms $F_i^{l,shift}$ and $F_i^{h,shift}$ from the center back to their original positions to obtain the low-frequency spectrogram $F_i^{l}$ and the high-frequency spectrogram $F_i^{h}$;
Finally, apply the inverse fast Fourier transform to $F_i^{l}$ and $F_i^{h}$, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map $f_i^{l}$ and the weak high-frequency feature map $f_i^{h}$;
(2-3) Following (2-2), perform a second high/low-frequency separation on the weak high-frequency feature map $f_i^{h}$ to obtain the strong high-frequency feature map $f_i^{hh}$ and the medium-high-frequency feature map $f_i^{hl}$, where $hh$ indicates that the feature map has been high-pass filtered twice and $hl$ indicates that it has been high-pass filtered once and then low-pass filtered once;
Following (2-2), perform a second high/low-frequency separation on the weak low-frequency feature map $f_i^{l}$ to obtain the strong low-frequency feature map $f_i^{ll}$ and the medium-low-frequency feature map $f_i^{lh}$, where $ll$ indicates that the feature map has been low-pass filtered twice and $lh$ indicates that it has been low-pass filtered once and then high-pass filtered once;
(2-4) Concatenate the medium-high-frequency feature map $f_i^{hl}$ and the medium-low-frequency feature map $f_i^{lh}$, apply a 1×1 convolution to obtain a compressed feature map, and down-sample it with a stride-2 max-pooling operation to obtain the medium-frequency feature map $f_i^{m} \in \mathbb{R}^{C_m \times \frac{H_f}{2} \times \frac{W_f}{2}}$, where $m$ denotes the medium-frequency signal and $C_m$ is the channel dimension of the medium-frequency feature map;
(2-5) Apply a 1×1 convolution to the strong low-frequency feature map $f_i^{ll}$ to obtain a compressed feature map, and down-sample it with a stride-4 max-pooling operation to obtain the low-frequency feature map $f_i^{low} \in \mathbb{R}^{C_l \times \frac{H_f}{4} \times \frac{W_f}{4}}$; apply a 1×1 convolution to the strong high-frequency feature map $f_i^{hh}$ to obtain the compressed high-frequency feature map $f_i^{high} \in \mathbb{R}^{C_h \times H_f \times W_f}$, where $C_h$ and $C_l$ denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively;
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps. The step comprises:
(3-1) Construct a dynamic hole convolution module consisting of a weight calculator and $K$ parallel hole convolution kernels, and input each multi-frequency feature map into the module to obtain the multi-frequency high-level semantic feature maps, comprising a low-frequency, a medium-frequency, and a high-frequency high-level semantic feature map;
(3-2) The dynamic hole convolution operates as follows: the low-frequency feature map $f_i^{low}$ is input to the weight calculator to obtain $K$ weights $w_1, \dots, w_K$, where $w_t$ denotes the weight of the $t$-th hole convolution, $0 \le w_t < 1$, and $\sum_{t=1}^{K} w_t = 1$. The weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, a second fully connected layer, and a Softmax function. Among the $K$ parallel hole convolution kernels $K_1, \dots, K_K$, $K_t$ denotes the $t$-th 3×3 hole convolution with hole rate 2. Each $K_t$ is multiplied by its corresponding weight $w_t$, and the $K$ weighted hole convolutions are summed to obtain the integrated hole convolution kernel $\tilde{K} = \sum_{t=1}^{K} w_t K_t$. The low-frequency feature map $f_i^{low}$ is then convolved with the integrated kernel $\tilde{K}$ to obtain the low-frequency high-level semantic feature map $s_i^{low}$, whose number of channels is twice $C_l$;
(3-3) Dynamic hole convolution modules are stacked serially, the output of the first module serving as the input of the second. Following (3-2), the medium-frequency feature map $f_i^{m}$ passes through two serial dynamic hole convolution modules to obtain the medium-frequency high-level semantic feature map $s_i^{m}$, whose number of channels is four times $C_m$; the high-frequency feature map $f_i^{high}$ passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map $s_i^{high}$, whose number of channels is eight times $C_h$;
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, i.e., an up-sampling convolution module, to obtain the segmentation mask of the video frame;
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module, and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result.
2. The video semantic segmentation method using multi-frequency dynamic hole convolution according to claim 1, characterized in that step (4) is specifically:
(4-1) Construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and produces a larger feature map by a convolution operation on the input small feature map;
(4-2) Concatenate the low-frequency high-level semantic feature map $s_i^{low}$, the medium-frequency high-level semantic feature map $s_i^{m}$, and the high-frequency high-level semantic feature map $s_i^{high}$ along the channel dimension to obtain the integrated high-level semantic feature map $t_i$.
3. The video semantic segmentation method using multi-frequency dynamic hole convolution according to claim 2, characterized in that step (5) is specifically:
(5-1) Establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution modules, and the decoder;
(5-2) Sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks $M_i$; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110718738.1A CN113538457B (en) | 2021-06-28 | 2021-06-28 | Video semantic segmentation method utilizing multi-frequency dynamic hole convolution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110718738.1A CN113538457B (en) | 2021-06-28 | 2021-06-28 | Video semantic segmentation method utilizing multi-frequency dynamic hole convolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113538457A CN113538457A (en) | 2021-10-22 |
CN113538457B true CN113538457B (en) | 2022-06-24 |
Family
ID=78125962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110718738.1A Active CN113538457B (en) | 2021-06-28 | 2021-06-28 | Video semantic segmentation method utilizing multi-frequency dynamic hole convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113538457B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114494297B (en) * | 2022-01-28 | 2022-12-06 | 杭州电子科技大学 | Adaptive video target segmentation method for processing multiple priori knowledge |
CN114240945B (en) * | 2022-02-28 | 2022-05-10 | 科大天工智能装备技术(天津)有限公司 | Bridge steel cable fracture detection method and system based on target segmentation |
CN114821432B (en) * | 2022-05-05 | 2022-12-02 | 杭州电子科技大学 | Video target segmentation anti-attack method based on discrete cosine transform |
CN116824139B (en) * | 2023-06-14 | 2024-03-22 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Endoscope polyp segmentation method based on boundary supervision and time sequence association |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108985269A (en) * | 2018-08-16 | 2018-12-11 | 东南大学 | Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure |
CN110276354A (en) * | 2019-05-27 | 2019-09-24 | 东南大学 | A kind of training of high-resolution Streetscape picture semantic segmentation and real time method for segmenting |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10147193B2 (en) * | 2017-03-10 | 2018-12-04 | TuSimple | System and method for semantic segmentation using hybrid dilated convolution (HDC) |
CN111210435B (en) * | 2019-12-24 | 2022-10-18 | 重庆邮电大学 | Image semantic segmentation method based on local and global feature enhancement module |
CN111860386B (en) * | 2020-07-27 | 2022-04-08 | 山东大学 | Video semantic segmentation method based on ConvLSTM convolutional neural network |
- 2021-06-28: CN application CN202110718738.1A filed; granted as patent CN113538457B (active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108985269A (en) * | 2018-08-16 | 2018-12-11 | 东南大学 | Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure |
CN110276354A (en) * | 2019-05-27 | 2019-09-24 | 东南大学 | A kind of training of high-resolution Streetscape picture semantic segmentation and real time method for segmenting |
Also Published As
Publication number | Publication date |
---|---|
CN113538457A (en) | 2021-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113538457B (en) | Video semantic segmentation method utilizing multi-frequency dynamic hole convolution | |
CN111242037B (en) | Lane line detection method based on structural information | |
CN112507997B (en) | Face super-resolution system based on multi-scale convolution and receptive field feature fusion | |
CN109190752B (en) | Image semantic segmentation method based on global features and local features of deep learning | |
CN111915592B (en) | Remote sensing image cloud detection method based on deep learning | |
CN109035149B (en) | License plate image motion blur removing method based on deep learning | |
CN113052210A (en) | Fast low-illumination target detection method based on convolutional neural network | |
CN113642634A (en) | Shadow detection method based on mixed attention | |
CN110059728B (en) | RGB-D image visual saliency detection method based on attention model | |
CN109034184B (en) | Grading ring detection and identification method based on deep learning | |
CN112396607A (en) | Streetscape image semantic segmentation method for deformable convolution fusion enhancement | |
CN110399840B (en) | Rapid lawn semantic segmentation and boundary detection method | |
CN113240697B (en) | Lettuce multispectral image foreground segmentation method | |
CN109508639B (en) | Road scene semantic segmentation method based on multi-scale porous convolutional neural network | |
CN112115871B (en) | High-low frequency interweaving edge characteristic enhancement method suitable for pedestrian target detection | |
CN115346071A (en) | Image classification method and system for high-confidence local feature and global feature learning | |
CN112149526A (en) | Lane line detection method and system based on long-distance information fusion | |
CN113392728B (en) | Target detection method based on SSA sharpening attention mechanism | |
CN111539434B (en) | Infrared weak and small target detection method based on similarity | |
CN117746130A (en) | Weak supervision deep learning classification method based on remote sensing image punctiform semantic tags | |
Yuan et al. | Graph neural network based multi-feature fusion for building change detection | |
CN115035377A (en) | Significance detection network system based on double-stream coding and interactive decoding | |
CN113780305A (en) | Saliency target detection method based on interaction of two clues | |
CN113610857B (en) | Apple grading method and system based on residual error network | |
CN113553919B (en) | Target frequency characteristic expression method, network and image classification method based on deep learning |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |