CN113538457B - Video semantic segmentation method utilizing multi-frequency dynamic hole convolution - Google Patents

Video semantic segmentation method utilizing multi-frequency dynamic hole convolution Download PDF

Info

Publication number
CN113538457B
Authority
CN
China
Prior art keywords
frequency
convolution
characteristic diagram
low
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110718738.1A
Other languages
Chinese (zh)
Other versions
CN113538457A (en
Inventor
李平
陈俊杰
王然
徐向华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110718738.1A priority Critical patent/CN113538457B/en
Publication of CN113538457A publication Critical patent/CN113538457A/en
Application granted granted Critical
Publication of CN113538457B publication Critical patent/CN113538457B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12Classification; Matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video semantic segmentation method using multi-frequency dynamic hole convolution. First, the sampled frame images of the video data are enhanced and a shallow visual feature map is extracted by an encoder; a feature frequency separation module is then constructed to obtain multi-frequency feature maps for each video frame, which are fed into a dynamic hole convolution module to produce the corresponding multi-frequency high-level semantic feature maps, and the segmentation mask of the video frame is obtained through an up-sampling convolution decoder; finally, the model is trained iteratively with a stochastic gradient descent algorithm until convergence, and a new video is input into the model to obtain its semantic segmentation result. The method separates the feature map of a video frame by frequency to characterize the changes of different visual regions, which reduces low-frequency visual-spatial redundancy and lowers computational complexity; dynamic hole convolution adaptively enlarges the receptive field of the multi-frequency feature maps and improves the ability to discriminate different semantic classes in the video, thereby yielding better video semantic segmentation results.

Description

Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
Technical Field
The invention belongs to the technical field of computer vision, in particular to the field of semantic segmentation in video processing, and relates to a video semantic segmentation method using multi-frequency dynamic hole convolution.
Background
With the rapid growth in the number of vehicles of all types, driving safety has become a major concern for governments and the public. Drivers of large vehicles in particular are prone to visual blind spots, which create serious hidden dangers for driving safety. In recent years, automatic driving technology has attracted great interest in industry, and more and more research effort has been devoted to this field. Efficient visual understanding helps guarantee the safety of automatic driving, and video semantic segmentation is one of its core technologies. Video semantic segmentation aims to assign pixel-level class labels to temporally correlated video frames, producing a pixel-wise class mask matrix of the same size as the original video frame; it can be widely applied in fields such as machine vision, video surveillance, unmanned aerial vehicle reconnaissance and automatic driving. For example, in an automatic driving environment, segmenting objects such as roads, pedestrians or other vehicles in the vehicle's visual scene at the pixel level yields object region information more accurate than a bounding box, providing the automatic driving system with more precise visual perception so that obstacles such as pedestrians and vehicles can be avoided and driving safety ensured. At present, the main challenges in video semantic segmentation include the high computational complexity of models, the long time needed to process high-resolution video frames, and the difficulty of deploying models in real-time environments.
Traditional semantic segmentation methods mainly fall into categories such as threshold-based, edge-based and superpixel-clustering methods. Threshold segmentation compares the gray value of each pixel with a threshold, judging pixels whose gray values exceed the threshold as foreground and the rest as background, but it is only applicable to grayscale images. Edge-based segmentation first performs edge detection on the image, and pixels within the same edge represent the same object; its drawback is that segmentation precision is limited by the edge detection algorithm. Superpixel clustering aggregates similar superpixel blocks to depict the same object; its drawback is that the formation of superpixels is limited by pixel colors and regional textures, and different parts of the same object are easily divided into several superpixels, leading to segmentation errors. In recent years, deep neural networks have become popular because of their strong feature extraction capability: a typical method uses a convolutional neural network as an encoder to extract abstract semantic information from video frames and obtains the semantic segmentation mask through layer-by-layer up-sampling in a decoder. However, convolutional layers can only extract local semantic information from the frame image and struggle to characterize the global scene. Spatial pyramid pooling has therefore been applied to semantic segmentation, and is characterized as follows: multiple parallel pooling operations are applied to the feature map produced by the encoder to obtain compressed feature maps of different sizes that capture global scene features at multiple receptive-field sizes; these are up-sampled back to the size of the initial feature map and concatenated into an overall feature map, and the semantic segmentation mask is finally obtained through a decoder, yielding the video semantic segmentation result.
Existing semantic segmentation methods still have many shortcomings: 1) spatial pyramid pooling considers local and global spatio-temporal structure simultaneously, which makes the segmentation result more reliable, but applying maximum and average pooling operations to high-resolution feature maps brings poor fault tolerance, poor generalization ability and high computational complexity; 2) attention mechanisms strengthen long-range semantic dependencies among feature maps, but the resulting models are too large and memory-hungry, which hinders real-time deployment; 3) the Transformer encoder, widely used as a feature extractor in natural language processing, takes the one-dimensional embedded feature sequence of a two-dimensional image as input and stacks self-attention and multi-layer perceptrons to capture long-range dependencies between video frames, but the lack of weight sharing leads to a huge number of parameters, and the high computational complexity of self-attention makes real-time performance hard to guarantee. Moreover, most segmentation methods cannot effectively balance precision and real-time performance and thus fail to meet the requirements of practical segmentation tasks. Therefore, in view of the high computational complexity and poor generalization of existing segmentation models, a method that can guarantee the real-time performance of the segmentation model while achieving high semantic segmentation precision is urgently needed.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a video semantic segmentation method using multi-frequency dynamic hole convolution. The feature map is separated into multiple frequencies by Fourier transform, and the multi-frequency feature maps characterize the gray-value variations of different visual regions, which reduces low-frequency visual-spatial redundancy and lowers computational complexity; at the same time, dynamic hole convolution is designed to adaptively enlarge the receptive field of the multi-frequency feature maps and to improve, from both global and local perspectives, the model's ability to discriminate different semantic classes of the video, thereby improving video semantic segmentation precision.
The method firstly acquires a video data set, and then performs the following operations:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, namely a deep convolutional neural network, to obtain the corresponding shallow visual feature maps;
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps;
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps;
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, namely an up-sampling convolution module, to obtain the segmentation mask of the video frame;
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result.
Further, the step (1) is specifically:
(1-1) Uniformly sample a single video at a rate of 10-15 frames per second and apply enhancement operations to the sampled frames, obtaining a sequence of N video frames, recorded as I = {I_1, I_2, …, I_N}, where I_i ∈ ℝ^(3×H×W) denotes the i-th video frame, ℝ denotes the real number field, 3 is the number of RGB channels, H is the height of a video frame and W is its width;
(1-2) using a convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract shallow visual feature maps f_i ∈ ℝ^(C_f×H_f×W_f) from the video frame sequence I, where C_f is the number of channels of the feature map, H_f its height and W_f its width; ResNet contains several modules composed of convolutional layers, and f_i is the feature map of the i-th video frame after the first three such modules of ResNet.
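To make (1-1) and (1-2) concrete, the following is a minimal sketch assuming PyTorch and torchvision with a ResNet-50 backbone; splitting off the first three residual stages and the resulting 1024-channel output are assumptions consistent with the description, not the patent's own implementation.

```python
import torch
import torchvision

def extract_shallow_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (N, 3, H, W) enhanced video frames I_1..I_N.
    Returns the shallow visual feature maps f of shape (N, C_f, H_f, W_f)."""
    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")  # ImageNet pre-trained
    # Keep only the stem and the first three residual stages, as described in (1-2).
    backbone = torch.nn.Sequential(
        resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
        resnet.layer1, resnet.layer2, resnet.layer3,  # layer3 output has 1024 channels
    )
    backbone.eval()
    with torch.no_grad():
        return backbone(frames)
```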
Further, the step (2) is specifically:
(2-1) Construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high-low frequency feature separation operations on the shallow visual feature map to obtain multi-frequency feature maps; the high-frequency features describe contour regions of the feature map, the low-frequency features describe flat regions of the feature map, and the medium-frequency features describe content regions of the feature map;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply a fast Fourier transform to the shallow visual feature map f_i, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram F_i of f_i; translate the low-frequency part of F_i to the center to obtain the shifted spectrogram F_i^shift, and determine the center position vector (P, Q) of F_i^shift, where P is the vector formed by the abscissa values of the center points of the channels of F_i^shift, Q is the vector formed by the ordinate values, and the subscript r denotes the channel index of F_i^shift.
Then multiply each element of F_i^shift by the low-frequency transfer function H_l(u_{r,a}, v_{r,b}) to obtain the low-frequency shifted spectrogram; the transfer function of the Gaussian low-pass filter is H_l(u_{r,a}, v_{r,b}) = exp(−D²(u_{r,a}, v_{r,b}) / (2·D_0²)), where l denotes the low-frequency signal, a is the horizontal coordinate of a pixel, b is its vertical coordinate with 0 ≤ a ≤ H_f and 0 ≤ b ≤ W_f, exp(·) is the exponential function and D_0 is the set standard deviation; D(u_{r,a}, v_{r,b}) denotes the Euclidean distance of the pixel (a, b) of the r-th channel from the coordinate point (P_r, Q_r), u_{r,a} is the Euclidean distance of the spectral position (a, 0) of the r-th channel from P_r, and v_{r,b} is the Euclidean distance of the spectral position (0, b) of the r-th channel from Q_r.
In the same way, multiply each element of F_i^shift by the high-frequency transfer function H_h(u_{r,a}, v_{r,b}) to obtain the high-frequency shifted spectrogram, where h denotes the high-frequency signal; translate the low-frequency signals of the low-frequency and high-frequency shifted spectrograms from the center back to their original positions to obtain the low-frequency spectrogram and the high-frequency spectrogram; finally, apply the inverse fast Fourier transform to the low-frequency and high-frequency spectrograms respectively, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map f_i^l and the weak high-frequency feature map f_i^h.
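A minimal sketch of the separation in (2-2), assuming PyTorch: the Gaussian low-pass transfer function follows the description, the default D_0 = 10 follows the later embodiment, and the complementary high-pass filter H_h = 1 - H_l is an assumption, since the text only names that symbol.

```python
import torch

def split_high_low(feat: torch.Tensor, d0: float = 10.0):
    """feat: (C_f, H_f, W_f) spatial-domain feature map.
    Returns (weak_low, weak_high): the low- and high-frequency spatial-domain parts."""
    _, H, W = feat.shape
    spec = torch.fft.fft2(feat)                            # spectrogram of feat
    spec_shift = torch.fft.fftshift(spec, dim=(-2, -1))    # move low frequencies to the centre
    # Gaussian low-pass transfer function H_l = exp(-D^2 / (2 * D0^2)), where D is the
    # Euclidean distance of each spectral position from the centre point of the channel.
    a = torch.arange(H, dtype=torch.float32).view(H, 1) - H // 2
    b = torch.arange(W, dtype=torch.float32).view(1, W) - W // 2
    dist_sq = a ** 2 + b ** 2
    h_low = torch.exp(-dist_sq / (2.0 * d0 ** 2))
    h_high = 1.0 - h_low                                   # assumed complementary high-pass filter
    low = torch.fft.ifftshift(spec_shift * h_low, dim=(-2, -1))    # low-frequency spectrogram
    high = torch.fft.ifftshift(spec_shift * h_high, dim=(-2, -1))  # high-frequency spectrogram
    weak_low = torch.fft.ifft2(low).real                   # weak low-frequency feature map
    weak_high = torch.fft.ifft2(high).real                 # weak high-frequency feature map
    return weak_low, weak_high
```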
(2-3) According to (2-2), perform a second high-low frequency feature separation on the weak high-frequency feature map f_i^h to obtain the strong high-frequency feature map f_i^hh and the medium-high frequency feature map f_i^hl, where hh indicates that the feature map has undergone high-frequency signal filtering twice and hl indicates that it has undergone high-frequency signal filtering once followed by low-frequency signal filtering once; likewise, according to (2-2), perform a second high-low frequency feature separation on the weak low-frequency feature map f_i^l to obtain the strong low-frequency feature map f_i^ll and the medium-low frequency feature map f_i^lh, where ll indicates that the feature map has undergone low-frequency signal filtering twice and lh indicates that it has undergone low-frequency signal filtering once followed by high-frequency signal filtering once;
(2-4) concatenate the medium-high frequency feature map f_i^hl and the medium-low frequency feature map f_i^lh once, apply a convolution operation of size 1×1 to obtain a compressed feature map, and down-sample it with a max-pooling operation of stride 2 to obtain the intermediate-frequency feature map f_i^m, where m denotes the intermediate-frequency signal and the channel dimension of the intermediate-frequency feature map is denoted C_m;
(2-5) apply a convolution operation of size 1×1 to the strong low-frequency feature map f_i^ll to obtain a compressed feature map, and down-sample it with a max-pooling operation of stride 4 to obtain the low-frequency feature map; apply a convolution operation of size 1×1 to the strong high-frequency feature map f_i^hh to obtain the compressed high-frequency feature map; C_h and C_l denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively.
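Continuing the sketch, the following hedged illustration chains the second separation of (2-3) with the compression and pooling of (2-4) and (2-5); the pooling kernel sizes and the compressed channel dimensions C_h, C_m, C_l are assumptions, and split_high_low refers to the function in the previous sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_frequency_maps(feat, conv_h, conv_m, conv_l, d0=10.0):
    """feat: (C_f, H_f, W_f) shallow visual feature map; conv_h/conv_m/conv_l are 1x1
    Conv2d layers compressing to the assumed channel dimensions C_h, C_m, C_l.
    split_high_low is the function from the previous sketch."""
    weak_low, weak_high = split_high_low(feat, d0)         # (2-2): weak low/high parts
    mid_high, strong_high = split_high_low(weak_high, d0)  # (2-3): f^hl (high->low), f^hh
    strong_low, mid_low = split_high_low(weak_low, d0)     # (2-3): f^ll, f^lh (low->high)
    # (2-4): concatenate the two medium-frequency parts, compress with a 1x1 convolution,
    # then down-sample with stride-2 max pooling to get the intermediate-frequency map.
    mid = torch.cat([mid_high, mid_low], dim=0).unsqueeze(0)
    mid = F.max_pool2d(conv_m(mid), kernel_size=2, stride=2)
    # (2-5): 1x1 convolution plus stride-4 max pooling for the low-frequency branch,
    # and a 1x1 convolution alone for the high-frequency branch.
    low = F.max_pool2d(conv_l(strong_low.unsqueeze(0)), kernel_size=4, stride=4)
    high = conv_h(strong_high.unsqueeze(0))
    return high, mid, low

# Example wiring (all channel sizes are assumptions):
C_f, C_h, C_m, C_l = 1024, 256, 128, 64
conv_h = nn.Conv2d(C_f, C_h, kernel_size=1)
conv_m = nn.Conv2d(2 * C_f, C_m, kernel_size=1)
conv_l = nn.Conv2d(C_f, C_l, kernel_size=1)
```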
Still further, the step (3) is specifically:
(3-1) Construct a dynamic hole convolution module consisting of a weight calculator and K parallel hole convolution kernels, and input each of the multi-frequency feature maps into dynamic hole convolution modules to obtain the multi-frequency high-level semantic feature maps, which comprise a low-frequency, an intermediate-frequency and a high-frequency high-level semantic feature map;
(3-2) The specific operation of the dynamic hole convolution is as follows: input the low-frequency feature map into the weight calculator to obtain K weights {w_1, …, w_K}, where w_t denotes the weight of the t-th hole convolution, 0 ≤ w_t < 1 and Σ_{t=1}^{K} w_t = 1; the weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, another fully connected layer and a Softmax function; the K parallel hole convolution kernels are {K_1, …, K_K}, where K_t denotes the t-th 3×3 hole convolution with a hole rate of 2; each K_t is multiplied element-wise by its corresponding weight w_t, and the K weighted parallel hole convolutions are summed to obtain an integrated hole convolution kernel, so that the parameters of multiple parallel hole convolutions capturing different receptive fields are exploited; the low-frequency feature map is then convolved with the integrated hole convolution kernel to obtain the low-frequency high-level semantic feature map, whose channel number is twice that of the low-frequency feature map;
(3-3) dynamic hole convolution modules are stacked in series, with the output of the first module serving as the input of the second; according to (3-2), the intermediate-frequency feature map passes through two serial dynamic hole convolution modules to obtain the intermediate-frequency high-level semantic feature map, whose channel number is four times that of the intermediate-frequency feature map; similarly, the high-frequency feature map passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map, whose channel number is eight times that of the high-frequency feature map.
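A hedged PyTorch sketch of one dynamic hole convolution module from (3-1) and (3-2) follows: the weight calculator (global average pooling, fully connected layer, ReLU, fully connected layer, Softmax) and the weighted fusion of K parallel 3×3 hole convolutions with hole rate 2 follow the description, while the number of kernels K, the hidden width of the fully connected layers and the initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicHoleConv(nn.Module):
    """One dynamic hole (dilated) convolution module with K parallel 3x3 kernels."""

    def __init__(self, in_ch: int, out_ch: int, num_kernels: int = 4, dilation: int = 2):
        super().__init__()
        # K parallel 3x3 hole-convolution kernels stored as one tensor (K, out, in, 3, 3).
        self.kernels = nn.Parameter(0.01 * torch.randn(num_kernels, out_ch, in_ch, 3, 3))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Weight calculator: global average pooling -> FC -> ReLU -> FC -> Softmax.
        self.fc1 = nn.Linear(in_ch, max(in_ch // 4, 1))
        self.fc2 = nn.Linear(max(in_ch // 4, 1), num_kernels)
        self.dilation = dilation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_ch, H, W) multi-frequency feature map.
        gap = x.mean(dim=(2, 3))                                  # global average pooling
        w = F.softmax(self.fc2(F.relu(self.fc1(gap))), dim=1)     # (B, K) weights, sum to 1
        outputs = []
        for i in range(x.size(0)):
            # Integrated kernel: weighted sum of the K parallel hole-convolution kernels.
            kernel = (w[i].view(-1, 1, 1, 1, 1) * self.kernels).sum(dim=0)
            outputs.append(F.conv2d(x[i:i + 1], kernel, self.bias,
                                    padding=self.dilation, dilation=self.dilation))
        return torch.cat(outputs, dim=0)
```

Per (3-3), the low-frequency branch would use one such module, the intermediate-frequency branch two stacked in series and the high-frequency branch four, with each module (by assumption) increasing the channel count as stated above.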
Still further, the step (4) is specifically:
(4-1) Construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and obtains a larger feature map by performing a convolution operation on the input small-size feature map;
(4-2) concatenate the low-frequency high-level semantic feature map, the intermediate-frequency high-level semantic feature map and the high-frequency high-level semantic feature map along the channel dimension to obtain the integrated high-level semantic feature map t_i;
(4-3) input the integrated semantic feature map t_i into the decoder to obtain the segmentation mask of size C×H×W, where C denotes the total number of semantic categories, and the category assigned to each pixel of the video frame is the category with the highest probability among all categories.
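A minimal decoder sketch for step (4), assuming PyTorch: the three transposed convolutions follow (4-1), while the kernel sizes, strides, intermediate channel widths and the bilinear resizing before concatenation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Decoder of step (4): three transposed convolutions over the concatenated features."""

    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(in_ch // 2, in_ch // 4, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(in_ch // 4, num_classes, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, semantic_feats):
        # semantic_feats: [low, mid, high] high-level semantic feature maps. They are
        # resized to a common spatial size here before channel-wise concatenation
        # (an assumption; the patent states only channel-wise concatenation).
        target = semantic_feats[-1].shape[-2:]
        feats = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                 for f in semantic_feats]
        t = torch.cat(feats, dim=1)        # integrated high-level semantic feature map t_i
        return self.up(t)                  # (B, C, H', W') logits; argmax over dim 1 gives
                                           # the per-pixel semantic category
```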
Still further, the step (5) is specifically:
(5-1) Establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module and the decoder;
(5-2) sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks for i = 1, …, N; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
(5-3) input each frame of a new video into the trained model and, following (5-2), sequentially output the corresponding segmentation results, where the first dimension represents the semantic category.
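Step (5) amounts to a standard supervised training loop; the following sketch assumes PyTorch, with the stochastic gradient descent optimizer and cross-entropy loss taken from the description, and the learning rate, momentum and epoch count as placeholders.

```python
import torch
import torch.nn as nn

def train(model, loader, num_epochs=80, lr=0.01, device="cuda"):
    """loader yields (frames, labels): frames (B, 3, H, W) floats, labels (B, H, W) class ids."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(num_epochs):                     # iterate until convergence
        for frames, labels in loader:
            logits = model(frames.to(device))       # (B, C, H, W) segmentation logits
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()                          # gradient back-propagation
            optimizer.step()                         # stochastic gradient descent update
```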
The method performs semantic segmentation of video by means of a feature frequency separation mechanism and dynamic hole convolution modules, and has the following characteristics: 1) unlike existing methods that process the high-resolution feature map uniformly, the feature frequency separation module designed by the invention separates the feature map into features of different frequencies, where the high-frequency features represent regions of large variation, the low-frequency features represent regions of small variation and the medium-frequency features represent regions of moderate variation; processing the different frequencies separately lets the network learn more targeted semantic features; 2) by constructing dynamic hole convolution modules, different weights are dynamically assigned to several parallel hole convolutions according to the input features without increasing the depth or width of the network, so the hole convolutions can be fused effectively and more useful semantic features extracted; 3) most existing methods improve segmentation precision by stacking correction modules and increasing network depth, but neglect problems such as model redundancy and low segmentation speed.
The method is suitable for video semantic segmentation with strict real-time requirements, and its advantages are: 1) the feature frequency separation module effectively separates and distinguishes features of different frequencies in the feature map, which improves processing efficiency; 2) the dynamic hole convolution module fuses several hole convolutions without notably increasing network complexity, capturing more effective semantic information in the feature map and yielding more accurate segmentation results; 3) features of different frequencies are handled in a targeted manner by dynamic hole convolution modules of different depths, which greatly reduces the computation of the model and increases its video semantic segmentation speed. The invention can be applied to practical tasks such as intelligent monitoring, unmanned aerial vehicle reconnaissance, machine vision and automatic driving.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, the video semantic segmentation method using multi-frequency dynamic hole convolution first samples a given video and inputs the frames into an encoder composed of a convolutional neural network to obtain shallow visual feature maps of the video frames; a feature frequency separation module composed of a Fourier transform, Gaussian filters and an inverse Fourier transform then separates multi-frequency feature maps from the shallow visual feature map; next, the multi-frequency feature maps are processed at different depths by dynamic hole convolutions, each composed of a weight calculator and several parallel hole convolution kernels, to obtain the multi-frequency high-level semantic feature maps; finally, the multi-frequency high-level semantic feature maps are concatenated, input into a decoder and up-sampled to obtain the semantic segmentation result. The method extends the idea of separable image frequencies to the shallow visual feature map, so that visual regions of different frequencies can be distinguished; the feature maps of different frequencies are processed by dynamic hole convolutions of different depths, which enlarges their receptive fields, reduces the computational complexity of the model, and achieves high semantic segmentation precision in real time.
The method comprises the steps of firstly acquiring a video data set, and then performing the following operations:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, namely a deep convolutional neural network, to obtain the corresponding shallow visual feature maps; the method comprises the following steps:
(1-1) Uniformly sample a single video at a rate of 10 frames per second and apply enhancement operations to the sampled frames, obtaining a sequence of N video frames, recorded as I = {I_1, I_2, …, I_N}, where I_i ∈ ℝ^(3×H×W) denotes the i-th video frame, ℝ denotes the real number field, 3 is the number of RGB channels, H is the height of a video frame and W is its width;
(1-2) using a convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract shallow visual feature maps f_i ∈ ℝ^(C_f×H_f×W_f) from the video frame sequence I, where C_f is the number of channels of the feature map (1024 in this embodiment), H_f its height and W_f its width; ResNet contains several modules composed of convolutional layers, and f_i is the feature map of the i-th video frame after the first three such modules of ResNet.
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps; the method comprises the following steps:
(2-1) construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high-low frequency feature separation operations on the shallow visual feature map to obtain multi-frequency feature maps; the high-frequency features describe contour regions of the feature map, the low-frequency features describe flat regions of the feature map, and the medium-frequency features describe content regions of the feature map;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply a fast Fourier transform to the shallow visual feature map f_i, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram F_i of f_i; translate the low-frequency part of F_i to the center to obtain the shifted spectrogram F_i^shift, and determine the center position vector (P, Q) of F_i^shift, where P is the vector formed by the abscissa values of the center points of the channels of F_i^shift, Q is the vector formed by the ordinate values, and the subscript r denotes the channel index of F_i^shift.
Then multiply each element of F_i^shift by the low-frequency transfer function H_l(u_{r,a}, v_{r,b}) to obtain the low-frequency shifted spectrogram; the transfer function of the Gaussian low-pass filter is H_l(u_{r,a}, v_{r,b}) = exp(−D²(u_{r,a}, v_{r,b}) / (2·D_0²)), where l denotes the low-frequency signal, a is the horizontal coordinate of a pixel, b is its vertical coordinate with 0 ≤ a ≤ H_f and 0 ≤ b ≤ W_f, exp(·) is the exponential function and D_0 is the set standard deviation (10 in this example); D(u_{r,a}, v_{r,b}) denotes the Euclidean distance of the pixel (a, b) of the r-th channel from the coordinate point (P_r, Q_r), u_{r,a} is the Euclidean distance of the spectral position (a, 0) of the r-th channel from P_r, and v_{r,b} is the Euclidean distance of the spectral position (0, b) of the r-th channel from Q_r.
In the same way, multiply each element of F_i^shift by the high-frequency transfer function H_h(u_{r,a}, v_{r,b}) to obtain the high-frequency shifted spectrogram, where h denotes the high-frequency signal; translate the low-frequency signals of the low-frequency and high-frequency shifted spectrograms from the center back to their original positions to obtain the low-frequency spectrogram and the high-frequency spectrogram; finally, apply the inverse fast Fourier transform to the low-frequency and high-frequency spectrograms respectively, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map f_i^l and the weak high-frequency feature map f_i^h.
(2-3) According to (2-2), perform a second high-low frequency feature separation on the weak high-frequency feature map f_i^h to obtain the strong high-frequency feature map f_i^hh and the medium-high frequency feature map f_i^hl, where hh indicates that the feature map has undergone high-frequency signal filtering twice and hl indicates that it has undergone high-frequency signal filtering once followed by low-frequency signal filtering once; likewise, according to (2-2), perform a second high-low frequency feature separation on the weak low-frequency feature map f_i^l to obtain the strong low-frequency feature map f_i^ll and the medium-low frequency feature map f_i^lh, where ll indicates that the feature map has undergone low-frequency signal filtering twice and lh indicates that it has undergone low-frequency signal filtering once followed by high-frequency signal filtering once;
(2-4) concatenate the medium-high frequency feature map f_i^hl and the medium-low frequency feature map f_i^lh once, apply a convolution operation of size 1×1 to obtain a compressed feature map, and down-sample it with a max-pooling operation of stride 2 to obtain the intermediate-frequency feature map f_i^m, where m denotes the intermediate-frequency signal and the channel dimension of the intermediate-frequency feature map is denoted C_m;
(2-5) apply a convolution operation of size 1×1 to the strong low-frequency feature map f_i^ll to obtain a compressed feature map, and down-sample it with a max-pooling operation of stride 4 to obtain the low-frequency feature map; apply a convolution operation of size 1×1 to the strong high-frequency feature map f_i^hh to obtain the compressed high-frequency feature map; C_h and C_l denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively.
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps; the method comprises the following steps:
(3-1) construct a dynamic hole convolution module consisting of a weight calculator and K parallel hole convolution kernels, and input each of the multi-frequency feature maps into dynamic hole convolution modules to obtain the multi-frequency high-level semantic feature maps, which comprise a low-frequency, an intermediate-frequency and a high-frequency high-level semantic feature map;
(3-2) The specific operation of the dynamic hole convolution is as follows: input the low-frequency feature map into the weight calculator to obtain K weights {w_1, …, w_K}, where w_t denotes the weight of the t-th hole convolution, 0 ≤ w_t < 1 and Σ_{t=1}^{K} w_t = 1; the weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, another fully connected layer and a Softmax function; the K parallel hole convolution kernels are {K_1, …, K_K}, where K_t denotes the t-th 3×3 hole convolution with a hole rate of 2; each K_t is multiplied element-wise by its corresponding weight w_t, and the K weighted parallel hole convolutions are summed to obtain an integrated hole convolution kernel, so that the parameters of multiple parallel hole convolutions capturing different receptive fields are exploited; the low-frequency feature map is then convolved with the integrated hole convolution kernel to obtain the low-frequency high-level semantic feature map, whose channel number is twice that of the low-frequency feature map;
(3-3) dynamic hole convolution modules are stacked in series, with the output of the first module serving as the input of the second; according to (3-2), the intermediate-frequency feature map passes through two serial dynamic hole convolution modules to obtain the intermediate-frequency high-level semantic feature map, whose channel number is four times that of the intermediate-frequency feature map; similarly, the high-frequency feature map passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map, whose channel number is eight times that of the high-frequency feature map.
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, namely an up-sampling convolution module, to obtain the segmentation mask of the video frame; the method comprises the following steps:
(4-1) Construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and obtains a larger feature map by performing a convolution operation on the input small-size feature map;
(4-2) concatenate the low-frequency high-level semantic feature map, the intermediate-frequency high-level semantic feature map and the high-frequency high-level semantic feature map along the channel dimension to obtain the integrated high-level semantic feature map t_i;
(4-3) input the integrated semantic feature map t_i into the decoder to obtain the segmentation mask of size C×H×W, where C denotes the total number of semantic categories, and the category assigned to each pixel of the video frame is the category with the highest probability among all categories.
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result; the method comprises the following steps:
(5-1) Establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module and the decoder;
(5-2) sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks for i = 1, …, N; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
(5-3) input each frame of a new video into the trained model and, following (5-2), sequentially output the corresponding segmentation results, where the first dimension represents the semantic category.
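For illustration only, the following inference sketch wires the hypothetical components from the earlier sketches together to segment a new video as in (5-3); all names are assumptions, and freq_module is assumed to wrap the multi-frequency separation above for batched input.

```python
import torch

@torch.no_grad()
def segment_video(frames, backbone, freq_module, branches, decoder):
    """frames: (N, 3, H, W) sampled frames of a new video.
    backbone/freq_module/branches/decoder are the hypothetical components sketched above;
    branches maps 'low'/'mid'/'high' to the stacked dynamic hole convolution modules."""
    masks = []
    for frame in frames.split(1):
        f = backbone(frame)                              # shallow visual feature map
        high, mid, low = freq_module(f)                  # multi-frequency feature maps
        sem = [branches["low"](low), branches["mid"](mid), branches["high"](high)]
        logits = decoder(sem)                            # (1, C, H', W') segmentation mask
        masks.append(logits.argmax(dim=1))               # per-pixel semantic category
    return torch.cat(masks, dim=0)
```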
The embodiment described above is only one example of the implementation of the inventive concept, and the protection scope of the invention should not be regarded as limited to the specific form set forth in the embodiment; equivalent technical means that those skilled in the art can conceive according to the inventive concept also fall within the protection scope of the invention.

Claims (3)

1. A video semantic segmentation method using multi-frequency dynamic hole convolution, characterized by first acquiring a video data set and then performing the following operations:
Step (1): sample the video to obtain video frames, apply enhancement operations, and input the frames into an encoder, namely a deep convolutional neural network, to obtain the corresponding shallow visual feature maps; the method comprises the following steps:
(1-1) Uniformly sample a single video at a rate of 10-15 frames per second and apply enhancement operations to the sampled frames, obtaining a sequence of N video frames, recorded as I = {I_1, I_2, …, I_N}, where I_i ∈ ℝ^(3×H×W) denotes the i-th video frame, ℝ denotes the real number field, 3 is the number of RGB channels, H is the height of a video frame and W is its width;
(1-2) using a convolutional neural network ResNet pre-trained on the large-scale image library ImageNet, sequentially extract shallow visual feature maps f_i ∈ ℝ^(C_f×H_f×W_f) from the video frame sequence I, where C_f is the number of channels of the feature map, H_f its height and W_f its width; ResNet contains several modules composed of convolutional layers, and f_i is the feature map of the i-th video frame after the first three such modules of ResNet;
Step (2): construct a feature frequency separation module whose input is the shallow visual feature map and whose output is the multi-frequency feature maps; the method comprises the following steps:
(2-1) construct a feature frequency separation module and, exploiting the property that image frequencies are separable, perform three high-low frequency feature separation operations on the shallow visual feature map to obtain multi-frequency feature maps; the high-frequency features describe contour regions of the feature map, the low-frequency features describe flat regions of the feature map, and the medium-frequency features describe content regions of the feature map;
(2-2) the specific operation of high and low frequency feature separation is as follows:
First, apply a fast Fourier transform to the shallow visual feature map f_i, converting the spatial-domain signal into a frequency-domain signal, to obtain the spectrogram F_i of f_i; translate the low-frequency part of F_i to the center to obtain the shifted spectrogram F_i^shift, and determine the center position vector (P, Q) of F_i^shift, where P is the vector formed by the abscissa values of the center points of the channels of F_i^shift, Q is the vector formed by the ordinate values, and the subscript r denotes the channel index of F_i^shift;
then multiply each element of F_i^shift by the low-frequency transfer function H_l(u_{r,a}, v_{r,b}) to obtain the low-frequency shifted spectrogram; the transfer function of the Gaussian low-pass filter is H_l(u_{r,a}, v_{r,b}) = exp(−D²(u_{r,a}, v_{r,b}) / (2·D_0²)), where l denotes the low-frequency signal, a is the horizontal coordinate of a pixel, b is its vertical coordinate with 0 ≤ a ≤ H_f and 0 ≤ b ≤ W_f, exp(·) is the exponential function and D_0 is the set standard deviation; D(u_{r,a}, v_{r,b}) denotes the Euclidean distance of the pixel (a, b) of the r-th channel from the coordinate point (P_r, Q_r), u_{r,a} is the Euclidean distance of the spectral position (a, 0) of the r-th channel from P_r, and v_{r,b} is the Euclidean distance of the spectral position (0, b) of the r-th channel from Q_r;
in the same way, multiply each element of F_i^shift by the high-frequency transfer function H_h(u_{r,a}, v_{r,b}) to obtain the high-frequency shifted spectrogram, where h denotes the high-frequency signal; translate the low-frequency signals of the low-frequency and high-frequency shifted spectrograms from the center back to their original positions to obtain the low-frequency spectrogram and the high-frequency spectrogram; finally, apply the inverse fast Fourier transform to the low-frequency and high-frequency spectrograms respectively, converting the frequency-domain signals back into spatial-domain signals, to obtain the weak low-frequency feature map f_i^l and the weak high-frequency feature map f_i^h;
(2-3) according to (2-2), perform a second high-low frequency feature separation on the weak high-frequency feature map f_i^h to obtain the strong high-frequency feature map f_i^hh and the medium-high frequency feature map f_i^hl, where hh indicates that the feature map has undergone high-frequency signal filtering twice and hl indicates that it has undergone high-frequency signal filtering once followed by low-frequency signal filtering once; likewise, according to (2-2), perform a second high-low frequency feature separation on the weak low-frequency feature map f_i^l to obtain the strong low-frequency feature map f_i^ll and the medium-low frequency feature map f_i^lh, where ll indicates that the feature map has undergone low-frequency signal filtering twice and lh indicates that it has undergone low-frequency signal filtering once followed by high-frequency signal filtering once;
(2-4) concatenate the medium-high frequency feature map f_i^hl and the medium-low frequency feature map f_i^lh once, apply a convolution operation of size 1×1 to obtain a compressed feature map, and down-sample it with a max-pooling operation of stride 2 to obtain the intermediate-frequency feature map f_i^m, where m denotes the intermediate-frequency signal and the channel dimension of the intermediate-frequency feature map is denoted C_m;
(2-5) apply a convolution operation of size 1×1 to the strong low-frequency feature map f_i^ll to obtain a compressed feature map, and down-sample it with a max-pooling operation of stride 4 to obtain the low-frequency feature map; apply a convolution operation of size 1×1 to the strong high-frequency feature map f_i^hh to obtain the compressed high-frequency feature map; C_h and C_l denote the channel dimensions of the high-frequency and low-frequency feature maps, respectively;
Step (3): construct a dynamic hole convolution module whose input is the multi-frequency feature maps and whose output is the multi-frequency high-level semantic feature maps; the method comprises the following steps:
(3-1) construct a dynamic hole convolution module consisting of a weight calculator and K parallel hole convolution kernels, and input each of the multi-frequency feature maps into dynamic hole convolution modules to obtain the multi-frequency high-level semantic feature maps, which comprise a low-frequency, an intermediate-frequency and a high-frequency high-level semantic feature map;
(3-2) the specific operation of the dynamic hole convolution is as follows: input the low-frequency feature map into the weight calculator to obtain K weights {w_1, …, w_K}, where w_t denotes the weight of the t-th hole convolution, 0 ≤ w_t < 1 and Σ_{t=1}^{K} w_t = 1; the weight calculator consists of a global average pooling operation, a fully connected layer, a ReLU function, another fully connected layer and a Softmax function; the K parallel hole convolution kernels are {K_1, …, K_K}, where K_t denotes the t-th 3×3 hole convolution with a hole rate of 2; each K_t is multiplied element-wise by its corresponding weight w_t, and the K weighted parallel hole convolutions are summed to obtain an integrated hole convolution kernel; the low-frequency feature map is then convolved with the integrated hole convolution kernel to obtain the low-frequency high-level semantic feature map, whose channel number is twice that of the low-frequency feature map;
(3-3) dynamic hole convolution modules are stacked in series, with the output of the first module serving as the input of the second; according to (3-2), the intermediate-frequency feature map passes through two serial dynamic hole convolution modules to obtain the intermediate-frequency high-level semantic feature map, whose channel number is four times that of the intermediate-frequency feature map; the high-frequency feature map passes through four serial dynamic hole convolution modules to obtain the high-frequency high-level semantic feature map, whose channel number is eight times that of the high-frequency feature map;
Step (4): input the multi-frequency high-level semantic feature maps into a decoder, namely an up-sampling convolution module, to obtain the segmentation mask of the video frame;
Step (5): iteratively train the video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module and the decoder until convergence, and then input a new video into the model to obtain the corresponding semantic segmentation result.
2. The video semantic segmentation method using multi-frequency dynamic hole convolution according to claim 1, wherein the step (4) is specifically:
(4-1) construct a decoder consisting of three transposed convolution layers; transposed convolution is the inverse process of convolution and obtains a larger feature map by performing a convolution operation on the input small-size feature map;
(4-2) concatenate the low-frequency high-level semantic feature map, the intermediate-frequency high-level semantic feature map and the high-frequency high-level semantic feature map along the channel dimension to obtain the integrated high-level semantic feature map t_i;
(4-3) input the integrated semantic feature map t_i into the decoder to obtain the segmentation mask of size C×H×W, where C denotes the total number of semantic categories, and the category assigned to each pixel of the video frame is the category with the highest probability among all categories.
3. The video semantic segmentation method using multi-frequency dynamic hole convolution according to claim 2, wherein the step (5) is specifically:
(5-1) establish a video semantic segmentation model consisting of the encoder, the feature frequency separation module, the dynamic hole convolution module and the decoder;
(5-2) sequentially input the video frame sequence into the semantic segmentation model to obtain the segmentation masks for i = 1, …, N; adjust the model parameters by gradient back-propagation according to the cross-entropy loss, and iteratively optimize the model until convergence;
(5-3) input each frame of a new video into the trained model and, following (5-2), sequentially output the corresponding segmentation results, where the first dimension represents the semantic category.
CN202110718738.1A 2021-06-28 2021-06-28 Video semantic segmentation method utilizing multi-frequency dynamic hole convolution Active CN113538457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718738.1A CN113538457B (en) 2021-06-28 2021-06-28 Video semantic segmentation method utilizing multi-frequency dynamic hole convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718738.1A CN113538457B (en) 2021-06-28 2021-06-28 Video semantic segmentation method utilizing multi-frequency dynamic hole convolution

Publications (2)

Publication Number Publication Date
CN113538457A CN113538457A (en) 2021-10-22
CN113538457B true CN113538457B (en) 2022-06-24

Family

ID=78125962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718738.1A Active CN113538457B (en) 2021-06-28 2021-06-28 Video semantic segmentation method utilizing multi-frequency dynamic hole convolution

Country Status (1)

Country Link
CN (1) CN113538457B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494297B (en) * 2022-01-28 2022-12-06 杭州电子科技大学 Adaptive video target segmentation method for processing multiple priori knowledge
CN114240945B (en) * 2022-02-28 2022-05-10 科大天工智能装备技术(天津)有限公司 Bridge steel cable fracture detection method and system based on target segmentation
CN114821432B (en) * 2022-05-05 2022-12-02 杭州电子科技大学 Video target segmentation anti-attack method based on discrete cosine transform
CN116824139B (en) * 2023-06-14 2024-03-22 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Endoscope polyp segmentation method based on boundary supervision and time sequence association

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985269A (en) * 2018-08-16 2018-12-11 东南大学 Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure
CN110276354A (en) * 2019-05-27 2019-09-24 东南大学 A kind of training of high-resolution Streetscape picture semantic segmentation and real time method for segmenting

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10147193B2 (en) * 2017-03-10 2018-12-04 TuSimple System and method for semantic segmentation using hybrid dilated convolution (HDC)
CN111210435B (en) * 2019-12-24 2022-10-18 重庆邮电大学 Image semantic segmentation method based on local and global feature enhancement module
CN111860386B (en) * 2020-07-27 2022-04-08 山东大学 Video semantic segmentation method based on ConvLSTM convolutional neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985269A (en) * 2018-08-16 2018-12-11 东南大学 Converged network driving environment sensor model based on convolution sum cavity convolutional coding structure
CN110276354A (en) * 2019-05-27 2019-09-24 东南大学 A kind of training of high-resolution Streetscape picture semantic segmentation and real time method for segmenting

Also Published As

Publication number Publication date
CN113538457A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113538457B (en) Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN111242037B (en) Lane line detection method based on structural information
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN111915592B (en) Remote sensing image cloud detection method based on deep learning
CN109035149B (en) License plate image motion blur removing method based on deep learning
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN113642634A (en) Shadow detection method based on mixed attention
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN110399840B (en) Rapid lawn semantic segmentation and boundary detection method
CN113240697B (en) Lettuce multispectral image foreground segmentation method
CN112115871B (en) High-low frequency interweaving edge characteristic enhancement method suitable for pedestrian target detection
CN115346071A (en) Image classification method and system for high-confidence local feature and global feature learning
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN111539434B (en) Infrared weak and small target detection method based on similarity
CN112132746A (en) Small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment
Yuan et al. Graph neural network based multi-feature fusion for building change detection
CN115035377A (en) Significance detection network system based on double-stream coding and interactive decoding
CN113780305A (en) Saliency target detection method based on interaction of two clues
CN113610857B (en) Apple grading method and system based on residual error network
CN113553919B (en) Target frequency characteristic expression method, network and image classification method based on deep learning
CN118470328A (en) Multi-dimensional attention semantic segmentation method and system for remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant