CN111340049A - Image processing method and device based on wide-area dynamic convolution


Info

Publication number
CN111340049A
Authority
CN
China
Prior art keywords
feature map
convolution
pixel point
feature
determining
Prior art date
Legal status
Granted
Application number
CN202010151431.3A
Other languages
Chinese (zh)
Other versions
CN111340049B (en)
Inventor
季向阳
杨宇
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010151431.3A
Publication of CN111340049A
Application granted
Publication of CN111340049B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image processing method and device based on wide-area dynamic convolution, wherein the method comprises the following steps: extracting features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1; up-sampling the Nth-level feature map to obtain a first feature map of the image to be processed; determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size; performing convolution and pooling on each pixel point in the first feature map according to the convolution kernel corresponding to that pixel point and the size of its adjacent region, to obtain a second feature map of the image to be processed; and determining an image processing result of the image to be processed according to the second feature map. Embodiments of the disclosure can enhance the feature map of the image to be processed, improve its resolution and definition, and thereby improve the accuracy of image processing.

Description

Image processing method and device based on wide-area dynamic convolution
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus based on wide-area dynamic convolution.
Background
At present, convolutional neural networks are widely used in computer vision tasks. In general, an original image is input into a convolutional neural network, and after multiple levels of convolution and down-sampling, a feature map with a lower resolution and a higher channel count than the original image is obtained. The feature map is considered to encode certain spatial information (that is, it retains the adjacency relationships between pixels of the original image), and the feature vector of each pixel is considered to encode rich semantic information.
However, such feature maps suffer from low resolution and blurring, so the accuracy of image processing results determined from them is poor. For example, the convolution kernel of a standard convolution operation is spatially shared: the feature map obtained through standard convolution is relatively smooth, and edge information in the original image is easily lost, so standard convolution performs poorly on dense estimation tasks such as semantic segmentation and optical flow prediction.
Disclosure of Invention
In view of this, the present disclosure provides an image processing method and apparatus based on wide-area dynamic convolution.
According to an aspect of the present disclosure, there is provided an image processing method based on wide-area dynamic convolution, the method including:
extracting the features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
carrying out up-sampling on the Nth-level feature map to obtain a first feature map of the image to be processed;
determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size;
performing convolution and pooling on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and the size of an adjacent region of each pixel point to obtain a second feature map of the image to be processed;
and determining an image processing result of the image to be processed according to the second feature map.
In a possible implementation manner, determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size includes:
carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, wherein the size of the third feature map is the same as that of the first feature map;
performing convolution and activation processing on the third feature map to obtain a fourth feature map, wherein the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and determining convolution kernel parameters corresponding to the pixel points according to the fourth feature map.
In a possible implementation manner, determining, according to the fourth feature map, a convolution kernel parameter corresponding to each pixel point includes:
determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and converting the feature vector corresponding to each pixel point in the first feature map into a convolution kernel parameter corresponding to each pixel point.
In a possible implementation manner, performing convolution and pooling processing on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and a size of an adjacent region of each pixel point to obtain a second feature map of the image to be processed includes:
determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
for any pixel point in the first feature map, performing convolution processing on the first feature map according to a convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in a neighboring area of the pixel point;
pooling first convolution responses of all pixel points in the adjacent areas of the pixel points, and determining second convolution responses of the pixel points;
and determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
In a possible implementation manner, the resizing and fusing the N-level feature maps to obtain a third feature map includes:
according to the size of the first feature map, carrying out size adjustment on the N-level feature map to obtain N fifth feature maps, wherein the size of the fifth feature maps is the same as that of the first feature map;
and adding the N feature values of the pixel points with the same position in the N fifth feature maps to obtain a third feature map.
In one possible implementation, the method is implemented by a neural network, the neural network includes a feature extraction network, a convolution kernel generation network, and a wide-area dynamic convolution network, the feature extraction network is used for feature extraction, the convolution kernel generation network is used for determining a convolution kernel corresponding to each pixel point, the wide-area dynamic convolution network is used for determining a second feature map,
wherein the method further comprises:
training the feature extraction network according to a preset training set, wherein the training set comprises a plurality of sample images, reference feature maps of the sample images and reference processing results of the sample images;
and training the convolution kernel generation network and the wide-area dynamic convolution network according to the training set and the trained feature extraction network.
According to another aspect of the present disclosure, there is provided an image processing apparatus based on wide-area dynamic convolution, the apparatus comprising:
the feature extraction module is used for extracting features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
the up-sampling module is used for up-sampling the Nth-level feature map to obtain a first feature map of the image to be processed;
a convolution kernel determining module, configured to determine, according to the N-level feature map and a preset convolution kernel size, a convolution kernel corresponding to each pixel point in the first feature map;
the convolution and pooling processing module is used for performing convolution and pooling processing on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and the size of an adjacent region of each pixel point to obtain a second feature map of the image to be processed;
and the processing result determining module is used for determining the image processing result of the image to be processed according to the second feature map.
In one possible implementation, the convolution kernel determining module includes:
the feature map fusion submodule is used for carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, and the size of the third feature map is the same as that of the first feature map;
the convolution and activation submodule is used for performing convolution and activation processing on the third feature map to obtain a fourth feature map, the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and the convolution kernel determining submodule is used for determining convolution kernel parameters corresponding to each pixel point according to the fourth feature map.
In one possible implementation, the convolution kernel determination submodule is configured to:
determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and converting the feature vector corresponding to each pixel point in the first feature map into a convolution kernel parameter corresponding to each pixel point.
In one possible implementation, the convolution and pooling processing module includes:
the adjacent region determining submodule is used for determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
the convolution submodule is used for performing, for any pixel point in the first feature map, convolution processing on the first feature map according to the convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in the neighboring area of the pixel point;
the pooling submodule is used for pooling first convolution responses of all the pixel points in the adjacent areas of the pixel points and determining second convolution responses of the pixel points;
and the feature map determining submodule is used for determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
According to the embodiments of the disclosure, feature extraction can be performed on the image to be processed to obtain an N-level feature map, and the Nth-level feature map can be up-sampled to obtain the first feature map of the image to be processed. The convolution kernel corresponding to each pixel point in the first feature map is then determined according to the N-level feature map and the preset convolution kernel size, and each pixel point is convolved and pooled according to its corresponding convolution kernel and the size of its adjacent region to obtain a second feature map of the image to be processed, from which the image processing result of the image to be processed is determined. In this way, the first feature map of the image to be processed is enhanced into a second feature map with high resolution and high definition, and determining the image processing result from the second feature map improves the accuracy of image processing.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 illustrates a flowchart of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a neural network of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a convolution kernel generation network according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating an application scenario of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure.
Fig. 5 illustrates a block diagram of an image processing apparatus based on wide-area dynamic convolution according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 illustrates a flowchart of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure. As shown in fig. 1, the method includes:
step S11, extracting the features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
step S12, performing up-sampling on the Nth-level feature map to obtain a first feature map of the image to be processed;
step S13, determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size;
step S14, according to the convolution kernel corresponding to each pixel point in the first feature map and the size of the adjacent region of each pixel point, performing convolution and pooling processing on each pixel point in the first feature map respectively to obtain a second feature map of the image to be processed;
and step S15, determining the image processing result of the image to be processed according to the second feature map.
According to the embodiments of the disclosure, feature extraction can be performed on the image to be processed to obtain an N-level feature map, and the Nth-level feature map can be up-sampled to obtain the first feature map of the image to be processed. The convolution kernel corresponding to each pixel point in the first feature map is then determined according to the N-level feature map and the preset convolution kernel size, and each pixel point is convolved and pooled according to its corresponding convolution kernel and the size of its adjacent region to obtain a second feature map of the image to be processed, from which the image processing result of the image to be processed is determined. In this way, the first feature map of the image to be processed is enhanced into a second feature map with high resolution and high definition, and determining the image processing result from the second feature map improves the accuracy of image processing.
The wide-area dynamic convolution according to the embodiments of the present disclosure may include: performing dynamic convolution on the first feature map of the image to be processed according to position-specific and sample-specific convolution kernels, and pooling the dynamic convolution responses within the adjacent region of each pixel point in the first feature map, so as to determine the wide-area dynamic convolution response of each pixel point. Position specificity indicates that different pixel points in the first feature map correspond to different convolution kernels; sample specificity indicates that the convolution kernels are generated from the multi-level feature maps of the image to be processed and therefore vary with the input. The hyper-parameters of the wide-area dynamic convolution can be set and adjusted by those skilled in the art according to image processing requirements, and may include the convolution kernel size, the adjacent region size, the dilation (void) rate, the stride used when convolving each pixel point, and the like. The disclosure does not limit the hyper-parameters of the wide-area dynamic convolution or their specific values.
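As a concrete illustration of these hyper-parameters, the following Python sketch collects them in one place (a hypothetical configuration object; the field names and default values are ours, not the patent's):

```python
from dataclasses import dataclass

# Hypothetical container for the wide-area dynamic convolution hyper-parameters
# named above; field names and default values are illustrative assumptions.
@dataclass
class WideAreaDynConvConfig:
    kernel_size: int = 5        # k: each per-pixel kernel is k x k
    neighborhood_size: int = 3  # side length of the pooled adjacent region
    dilation: int = 1           # dilation (void) rate used when convolving
    stride: int = 1             # step length used when convolving each pixel
```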
In one possible implementation, the image processing method based on wide-area dynamic convolution may be applied to a processor. The processor may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations, such as a GPU (Graphics Processing Unit), an NPU (Neural-network Processing Unit), or a DSP (Digital Signal Processor). The present disclosure is not limited to a particular type of processor.
In one possible implementation, the image processing method based on wide-area dynamic convolution can be used for improving the accuracy of image processing in a computer vision task. Among other things, computer vision tasks may include image classification, target detection, optical flow estimation, image semantic segmentation, and so forth. The present disclosure is not limited to the details of the computer vision task.
In one possible implementation manner, in step S11, feature extraction may be performed on the image to be processed, so as to obtain an N-level feature map. The value of N may be an integer greater than 1, such as 3, 5, etc., and a person skilled in the art may set the specific value of N according to the actual situation, which is not limited in the present disclosure. The skilled person can also determine the way of extracting the features of the image to be processed according to the actual situation, and the present disclosure does not limit this.
In a possible implementation manner, when the image processing method is implemented by a neural network, feature extraction may be performed on the image to be processed by a feature extraction network. It should be understood that the feature extraction network may be a convolutional neural network, a deep neural network, etc., for example, the feature extraction network may include a plurality of convolutional layers, downsampling, etc., and the present disclosure does not limit the specific type of feature extraction network.
In one possible implementation manner, in step S12, the nth level feature map may be upsampled to obtain a first feature map of the image to be processed. The upsampling may be performed in various manners such as bilinear interpolation, transposed convolution, etc., which is not limited by the present disclosure.
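As a minimal sketch of this step (PyTorch-style code is our assumption; the disclosure names no framework), bilinear up-sampling of the Nth-level feature map might look like this:

```python
import torch
import torch.nn.functional as F

# feat_n stands for the Nth-level feature map; its shape is illustrative.
feat_n = torch.randn(1, 256, 32, 32)
# Bilinear up-sampling to obtain the "first feature map" of the text;
# transposed convolution would be an equally valid choice here.
first_feature_map = F.interpolate(feat_n, scale_factor=4, mode="bilinear",
                                  align_corners=False)
# first_feature_map now has shape (1, 256, 128, 128).
```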
In a possible implementation manner, after the first feature map of the image to be processed is obtained, in step S13, the convolution kernel corresponding to each pixel point in the first feature map may be determined according to the N-level feature map and a preset convolution kernel size. That is, each pixel point in the first feature map may correspond to a different convolution kernel, and the convolution kernels are determined according to the N-level feature map obtained in the feature extraction process and the preset convolution kernel size. The convolution kernel size may be set according to actual needs, for example, 3 × 3, 5 × 5, or 7 × 7, which is not limited by this disclosure.
In one possible implementation, the number of convolution kernels corresponding to any pixel point in the first feature map is the same as the number of channels of the first feature map. For example, when the first feature map is a single channel, the number of convolution kernels corresponding to any pixel point in the first feature map is 1; when the first feature map is a multi-channel feature map, for example, when the number of channels of the first feature map is 64, the number of convolution kernels corresponding to any pixel point in the first feature map is 64.
In a possible implementation manner, after determining the convolution kernel corresponding to each pixel point in the first feature map, in step S14, each pixel point in the first feature map may be convolved and pooled respectively according to the convolution kernel corresponding to each pixel point in the first feature map and the size of the neighboring region of each pixel point, so as to obtain a second feature map of the image to be processed.
In a possible implementation manner, each pixel point in the first feature map may be respectively convolved according to a convolution kernel corresponding to each pixel point in the first feature map, a plurality of convolution responses in a neighboring region of each pixel point are determined, and then pooling (for example, average pooling) is respectively performed on the plurality of convolution responses in the neighboring region of each pixel point, so as to obtain a second feature map of the image to be processed.
In a possible implementation manner, when the first feature map is multi-channel, the first feature map may be processed channel by channel to obtain a second feature map of the image to be processed.
In one possible implementation, after obtaining the second feature map of the image to be processed, in step S15, an image processing result of the image to be processed may be determined according to the second feature map. For example, an image classification result, an object detection result, an image semantic segmentation result, and the like of the image to be processed may be determined according to the second feature map.
In one possible implementation, step S13 may include:
carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, wherein the size of the third feature map is the same as that of the first feature map;
performing convolution and activation processing on the third feature map to obtain a fourth feature map, wherein the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and determining convolution kernel parameters corresponding to the pixel points according to the fourth feature map.
In a possible implementation manner, when the convolution kernel corresponding to each pixel point in the first feature map is determined according to the N-level feature map and the preset convolution kernel size, because the N levels of feature maps have different sizes, they may first be resized by up-sampling, convolution, and the like, and the resized feature maps may then be fused to obtain a third feature map whose size is the same as that of the first feature map. That is, N feature maps of different sizes are resized and fused into a third feature map of the same size as the first feature map. The size of the first feature map may include a height H, a width W, and a channel number C.
In one possible implementation, after the third feature map is obtained, convolution and activation processing may be performed on it to obtain a fourth feature map, where the height and width of the fourth feature map are the same as those of the first feature map, and the number of channels of the fourth feature map is k × k times that of the first feature map, where k × k is the preset convolution kernel size and k is a positive integer.
For example, assume that the preset convolution kernel size is 5 × 5 and the size of the third feature map is 512 × H × W (where 512 is the number of channels). Then 1 × 1 convolution and ReLU activation, 3 × 3 convolution and ReLU activation, and 1 × 1 convolution and ReLU activation may be performed sequentially on the third feature map to obtain the fourth feature map: a first intermediate feature map of size 256 × H × W is obtained after the 1 × 1 convolution and ReLU activation of the third feature map; a second intermediate feature map of size 256 × H × W is obtained after the 3 × 3 convolution and ReLU activation of the first intermediate feature map; and the fourth feature map of size (256 × 5 × 5) × H × W is obtained after the 1 × 1 convolution and ReLU activation of the second intermediate feature map.
In one possible implementation, after obtaining the fourth feature map with size (C × k × k) × H × W, the convolution kernel parameters corresponding to each pixel point in the first feature map may be determined according to the fourth feature map, and the C × k × k feature values corresponding to each pixel point may correspond to the convolution kernel parameters of C channels of each pixel point.
In this embodiment, when the size of the first feature map is C × H × W, the N-level feature map can be resized and fused to obtain a third feature map of the same size as the first feature map, and the third feature map can be convolved and activated to obtain a fourth feature map of size (C × k × k) × H × W. The convolution kernel parameters corresponding to each pixel point in the first feature map are then determined from the fourth feature map, so that information lost in the feature extraction process (e.g., lost edge information) is taken into account, improving the accuracy of the convolution kernel parameters.
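To make the shape bookkeeping of this kernel-generation stage concrete, here is a hedged PyTorch-style sketch of the 1 × 1 / 3 × 3 / 1 × 1 convolution-and-ReLU sequence; the channel widths (512 → 256 → 256 → C × k × k) follow the worked example above and are assumptions rather than requirements of the method:

```python
import torch.nn as nn

C, k = 256, 5  # assumed first-feature-map channels and preset kernel size

# 1x1 conv + ReLU, 3x3 conv + ReLU, 1x1 conv + ReLU, applied in sequence to the
# third feature map (B, 512, H, W) to produce the fourth feature map
# (B, C*k*k, H, W).
kernel_head = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, C * k * k, kernel_size=1), nn.ReLU(inplace=True),
)
```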
In a possible implementation manner, the resizing and fusing of the N-level feature map to obtain a third feature map includes: according to the size of the first feature map, carrying out size adjustment on the N-level feature map to obtain N fifth feature maps, wherein the size of the fifth feature maps is the same as that of the first feature map; and adding the N feature values of the pixel points with the same position in the N fifth feature maps to obtain a third feature map.
For example, the N-level feature maps with different sizes may be respectively subjected to upsampling and 3 × 3 convolution processing to obtain N fifth feature maps with the size of C × H × W, and then N feature values of pixel points with the same position in the N fifth feature maps are added to obtain a third feature map.
In this embodiment, N fifth feature maps are obtained by resizing the N-level feature maps, and N feature values of pixels having the same position in the N fifth feature maps are added to obtain a third feature map, so that the N-level feature maps can be fused into one feature map.
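A sketch of this resize-and-sum fusion under the same PyTorch-style assumption (module and argument names are ours):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResizeAndSumFusion(nn.Module):
    """Up-sample each of the N level feature maps to the first feature map's
    size, apply a 3x3 convolution to each, and sum the results element-wise
    to obtain the third feature map."""
    def __init__(self, in_channels_per_level, out_channels):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=3, padding=1)
            for c in in_channels_per_level
        )

    def forward(self, feature_maps, target_hw):
        fused = 0
        for conv, fmap in zip(self.convs, feature_maps):
            x = F.interpolate(fmap, size=target_hw, mode="bilinear",
                              align_corners=False)
            fused = fused + conv(x)  # add same-position feature values
        return fused  # the "third feature map"
```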
In a possible implementation manner, determining the convolution kernel parameters corresponding to each pixel point according to the fourth feature map may include: determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, where the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer; and converting the feature vector corresponding to each pixel point in the first feature map into the convolution kernel parameters corresponding to that pixel point.
For example, when the size of the fourth feature map is (C × k × k) × H × W, for any pixel (i, j) in the first feature map, where i and j are positive integers, the C × k × k-dimensional vector with position (i, j) in the fourth feature map may be determined as the feature vector of the pixel (i, j), and then the C × k × k-dimensional feature vector of the pixel (i, j) may be transformed channel by channel (for example, transformed using the transformation function Reshape), so as to obtain the convolution kernel parameters of C channels corresponding to the pixel (i, j).
In this embodiment, the feature vector corresponding to each pixel point in the first feature map can be determined according to the fourth feature map, and the feature vector is converted into the convolution kernel parameter corresponding to each pixel point, so that the convolution kernel parameter corresponding to each pixel point can be determined simply and quickly, and the processing efficiency can be improved.
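The reshape itself can be illustrated in a few lines (PyTorch-style; the shapes follow the text, the code is an illustrative assumption):

```python
import torch

B, C, k, H, W = 1, 256, 5, 128, 128       # illustrative sizes
fourth_feature_map = torch.randn(B, C * k * k, H, W)

# The C*k*k-dimensional vector at each position (i, j) becomes C kernels of
# size k x k for the pixel at (i, j): the channel-by-channel Reshape of the text.
per_pixel_kernels = fourth_feature_map.view(B, C, k * k, H, W)
# per_pixel_kernels[b, c, :, i, j] is the flattened k x k kernel of channel c
# for pixel (i, j).
```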
In one possible implementation, step S14 may include:
determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
for any pixel point in the first feature map, performing convolution processing on the first feature map according to a convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in a neighboring area of the pixel point;
pooling first convolution responses of all pixel points in the adjacent areas of the pixel points, and determining second convolution responses of the pixel points;
and determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
In a possible implementation manner, the neighboring region of each pixel point may be determined according to the size of the neighboring region of each pixel point in the first feature map. For any pixel point in the first feature map, when determining the neighboring region thereof, the position of the pixel point may be used as a center point, and the neighboring region thereof is determined according to the size of the neighboring region. The positions of the pixels are different, and the number of the pixels in the adjacent area may be different.
In a possible implementation manner, for any pixel point in the first feature map, convolution processing may be performed on the first feature map according to a convolution kernel corresponding to the pixel point, and a first convolution response of each pixel point in a neighboring region of the pixel point is determined.
For example, when the size of the first feature map of the single channel is 16 × 16 and the size of the neighborhood is 3 × 3, the pixel (5, 5) in the first feature map may be taken as a center point, the 3 × 3 neighborhood around the center point may be determined as the neighborhood, the neighborhood includes 9 pixels, and then the convolution processing is performed on the first feature map according to the convolution kernel corresponding to the pixel (5, 5), so as to determine the first convolution response of the 9 pixels in the neighborhood of the pixel (5, 5).
In a possible implementation manner, the first convolution response of each pixel point in the neighboring area of the pixel point may be pooled, and the second convolution response of the pixel point is determined. Wherein the pooling treatment may be an average pooling. For example, the first convolution responses of 9 pixels in the neighborhood of pixel (5, 5) may be averaged, and the average may be determined as the second convolution response of pixel (5, 5).
In a possible implementation manner, the second feature map of the image to be processed may be determined according to the second convolution response of each pixel point in the first feature map. When the first feature map is multi-channel, the first feature map can be processed channel by channel to obtain a second feature map of the image to be processed.
In this embodiment, convolution processing may be performed on each pixel point in the first feature map to obtain a plurality of first convolution responses in the neighboring region of each pixel point, pooling processing may be performed on the plurality of first convolution responses in the neighboring region of each pixel point to obtain a second convolution response of each pixel point, and then the second feature map of the image to be processed may be determined.
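One way to realize this convolve-then-pool step is sketched below (PyTorch-style, our assumption). Because the kernel W(u) stays fixed while v ranges over u's adjacent region, averaging the first convolution responses over the region equals applying the per-pixel kernels to a box-filtered first feature map; this rearrangement and the zero padding at the borders are implementation choices of the sketch, not requirements stated in the disclosure:

```python
import torch
import torch.nn.functional as F

def wide_area_dynamic_conv(first_map, kernels, k, neighborhood):
    """first_map: (B, C, H, W); kernels: (B, C, k*k, H, W) per-pixel kernels.
    Returns the (B, C, H, W) map of second convolution responses."""
    B, C, H, W = first_map.shape
    # Average the feature map over each pixel's neighborhood; border pixels
    # average over their (smaller) actual neighborhoods.
    smoothed = F.avg_pool2d(first_map, neighborhood, stride=1,
                            padding=neighborhood // 2, count_include_pad=False)
    # Extract the k x k patch around every pixel of the smoothed map.
    patches = F.unfold(smoothed, kernel_size=k, padding=k // 2)
    patches = patches.view(B, C, k * k, H, W)
    # Dot each patch with its own pixel's kernel: the second convolution response.
    return (patches * kernels).sum(dim=2)

# Example usage with illustrative shapes (kernels would come from the
# convolution kernel generation step):
x = torch.randn(1, 256, 64, 64)
w = torch.randn(1, 256, 25, 64, 64)
second_map = wide_area_dynamic_conv(x, w, k=5, neighborhood=3)
```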
In one possible implementation, the second convolution response $G_u$ of any pixel point $u$ in the first feature map can be determined by the following formula (1):

$$G_u = \frac{1}{|\mathcal{N}(u)|} \sum_{v \in \mathcal{N}(u)} \big(F \circledast W(u)\big)_v \qquad (1)$$

where $F$ is a single-channel map of the first feature map of the image to be processed, $G$ is the single-channel map corresponding to $F$ in the second feature map of the image to be processed, $u$ denotes any pixel point in $F$, $\mathcal{N}(u)$ denotes the adjacent region of $u$, $v$ denotes any pixel point in the adjacent region $\mathcal{N}(u)$, $W(u)$ denotes the convolution kernel corresponding to pixel point $u$, and $\circledast$ denotes the standard convolution operation.
The first feature map of the image to be processed can be processed channel by channel using formula (1) to obtain the second feature map of the image to be processed.
In one possible implementation, the method may be implemented by a neural network, which may include a feature extraction network, a convolution kernel generation network, and a wide-area dynamic convolution network, the feature extraction network may be used for feature extraction, the convolution kernel generation network may be used for determining a convolution kernel corresponding to each pixel point, the wide-area dynamic convolution network may be used for determining a second feature map,
wherein the method further comprises: training the feature extraction network according to a preset training set, wherein the training set comprises a plurality of sample images, reference feature maps of the sample images and reference processing results of the sample images; and training the convolution kernel generation network and the wide-area dynamic convolution network according to the training set and the trained feature extraction network.
In one possible implementation, the feature extraction network may be trained according to a plurality of sample images in the training set and reference feature maps of the plurality of sample images. In the training process, the parameter values of the feature extraction network may be adjusted in a direction that minimizes the first loss function, and when the first loss function decreases to a certain degree or converges within a certain threshold, the adjustment is stopped, and the trained feature extraction network is obtained. The present disclosure does not limit the first penalty function used in the training process.
In a possible implementation manner, after the training of the feature extraction network is completed, the convolution kernel generation network and the wide-area dynamic convolution network can be trained end to end according to the multiple sample images in the training set and the reference processing results of those sample images. In the training process, the parameter values of the convolution kernel generation network and the wide-area dynamic convolution network can be adjusted in the direction that minimizes the second loss function; when the second loss function decreases to a certain degree or converges within a certain threshold, the adjustment is stopped, and the trained convolution kernel generation network and wide-area dynamic convolution network are obtained. The present disclosure does not limit the second loss function used in the training process.
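The two-stage schedule can be summarized in code as follows; this is a toy sketch with stand-in networks, random data, and assumed loss functions, none of which come from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_net = nn.Conv2d(3, 8, 3, padding=1)  # stand-in feature extraction network
enhance_net = nn.Conv2d(8, 8, 3, padding=1)  # stand-in for kernel generation +
                                             # wide-area dynamic convolution
decoder = nn.Conv2d(8, 2, 1)                 # stand-in result head

# Stage 1: train the feature extraction network against reference feature maps.
opt1 = torch.optim.SGD(feature_net.parameters(), lr=1e-2)
for _ in range(5):
    image = torch.randn(2, 3, 16, 16)
    ref_feat = torch.randn(2, 8, 16, 16)              # reference feature map
    loss1 = F.mse_loss(feature_net(image), ref_feat)  # assumed first loss function
    opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: freeze the trained extractor, then train the remaining networks
# end to end against reference processing results.
for p in feature_net.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.SGD(list(enhance_net.parameters()) + list(decoder.parameters()),
                       lr=1e-2)
for _ in range(5):
    image = torch.randn(2, 3, 16, 16)
    ref_result = torch.randint(0, 2, (2, 16, 16))     # reference processing result
    logits = decoder(enhance_net(feature_net(image)))
    loss2 = F.cross_entropy(logits, ref_result)       # assumed second loss function
    opt2.zero_grad(); loss2.backward(); opt2.step()
```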
Fig. 2 shows a schematic diagram of a neural network of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure. As shown in fig. 2, the neural network includes a feature extraction network 22, a convolution kernel generation network 24, and a wide-area dynamic convolution network 25. The image 21 to be processed can be input into the feature extraction network 22 for feature extraction to obtain a 5-level feature map; the 5th-level feature map is up-sampled to obtain a first feature map 23 of the image 21 to be processed, and the 5-level feature map is input into the convolution kernel generation network 24 to obtain the convolution kernels corresponding to each pixel point of the first feature map 23; the first feature map 23 and the convolution kernels corresponding to its pixel points are input into the wide-area dynamic convolution network 25 to perform wide-area dynamic convolution, obtaining a second feature map 26 of the image to be processed, and an image processing result 27 of the image to be processed can then be obtained according to the second feature map 26.
Fig. 3 shows a schematic diagram of a convolution kernel generation network according to an embodiment of the present disclosure. As shown in fig. 3, the 5 levels of feature maps of the image to be processed are feature map 311, feature map 312, feature map 313, feature map 314, and feature map 315. The 5 levels of feature maps can be respectively up-sampled and processed with 3 × 3 convolution to obtain feature map 321, feature map 322, feature map 323, feature map 324, and feature map 325, each with the same size as the first feature map (not shown) of the image to be processed. Feature map 321, feature map 322, feature map 323, feature map 324, and feature map 325 are then fused to obtain feature map 33, and feature map 33 is sequentially subjected to 1 × 1 convolution and ReLU activation, 3 × 3 convolution and ReLU activation, and 1 × 1 convolution and ReLU activation to obtain feature map 34. According to feature map 34, the convolution kernel parameters corresponding to each pixel point in the first feature map of the image to be processed can be determined.
Fig. 4 is a schematic diagram illustrating an application scenario of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure. As shown in fig. 4, the image processing method is implemented by a neural network and is used for performing image semantic segmentation on an image 41 to be processed. Feature extraction is performed on the image 41 to be processed by a full convolution network composed of 5 convolution groups (the first 4 convolution groups include a down-sampling layer), obtaining 5 levels of feature maps, namely feature map 42, feature map 43, feature map 44, feature map 45, and feature map 46. Feature map enhancement is then performed on the 5 levels of feature maps to obtain an enhanced feature map 48, and feature map 48 is input into a decoder to generate a semantic segmentation result 49.
When the feature map is enhanced, the input 5 levels of feature maps are subjected to up-sampling, 3 × 3 convolution, and fusion processing, and the fused feature map is sequentially subjected to 1 × 1 convolution and ReLU activation, 3 × 3 convolution and ReLU activation, and 1 × 1 convolution and ReLU activation to obtain feature map 471. Meanwhile, the 5th-level feature map (namely, feature map 46) is up-sampled to obtain feature map 472. Wide-area dynamic convolution is then performed on feature map 472 according to the dynamic convolution kernels determined from feature map 471, obtaining feature map 48.
The convolution groups shown in fig. 4 can have various structures and can be selected from different networks, such as the VGG series, ResNet series, Inception series, and the like; the disclosure is not limited thereto. The decoder shown in fig. 4 may be composed of one or more convolutional layers and up-sampling layers, or may be a more complex decoder, as the present disclosure is not limited in this respect.
In a possible implementation manner, the neural network in fig. 4 may be trained end to end according to a preset semantic segmentation training set. In training, the network loss L of the neural network can be determined using the following formula (2):
$$L = \frac{1}{|\mathcal{D}|} \sum_{I \in \mathcal{D}} \frac{1}{H'W'} \sum_{m} \ell\big(S^*_m, f_m(I;\theta)\big) \qquad (2)$$

where $\mathcal{D}$ denotes the training set of the neural network for semantic segmentation, $I$ denotes any sample image in $\mathcal{D}$, $S^*$ denotes the segmentation label in $\mathcal{D}$ corresponding to the sample image $I$, $m$ denotes any pixel point in the sample image $I$, $S^*_m$ denotes the class label of the pixel point $m$ expressed as a one-hot encoding, $\theta$ denotes the network parameters of the neural network, $f_m(I;\theta)$ denotes the class probability distribution predicted by the neural network for the sample image $I$ at pixel point $m$, $\ell(\cdot,\cdot)$ denotes the cross-entropy calculated between $S^*_m$ and $f_m(I;\theta)$, $H'$ denotes the height of the sample image $I$, $W'$ denotes the width of the sample image $I$, and $H'W'$ denotes the total number of pixels in the sample image $I$.
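For a single sample image, the inner average of formula (2) is the familiar per-pixel cross-entropy; a hedged PyTorch-style illustration with made-up shapes and class count:

```python
import torch
import torch.nn.functional as F

num_classes, Hp, Wp = 21, 64, 64                    # illustrative values
logits = torch.randn(1, num_classes, Hp, Wp)        # f(I; theta), pre-softmax
label = torch.randint(0, num_classes, (1, Hp, Wp))  # S*, integer class indices

# reduction="mean" averages the per-pixel cross-entropy over the H'W' pixels;
# averaging this value over the training set yields the network loss L.
loss_for_I = F.cross_entropy(logits, label, reduction="mean")
```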
The parameter values of the neural network may be adjusted in a direction that minimizes the network loss L, and when the network loss L decreases to a certain degree or converges within a certain threshold, the adjustment is stopped, and a trained neural network is obtained. The trained neural network can be used to complete the image semantic segmentation task.
In a possible implementation manner, a deep learning optimization algorithm, such as Stochastic Gradient Descent (SGD) or Adam (Adaptive Moment Estimation), may be applied to minimize the network loss L. The present disclosure is not so limited.
According to the image processing method based on the wide-area dynamic convolution, the feature map in the image processing process can be enhanced, so that the enhanced feature map has more accurate space geometric information, and the edge information of an object and a structure in the image to be processed can be retained, so that the accuracy of the image processing result of the image to be processed can be improved.
Fig. 5 illustrates a block diagram of an image processing apparatus based on wide-area dynamic convolution according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes:
the feature extraction module 51 is configured to perform feature extraction on an image to be processed to obtain an N-level feature map, where N is an integer greater than 1;
the upsampling module 52 is configured to upsample the Nth-level feature map to obtain a first feature map of the image to be processed;
a convolution kernel determining module 53, configured to determine, according to the N-level feature map and a preset convolution kernel size, a convolution kernel corresponding to each pixel point in the first feature map;
a convolution and pooling processing module 54, configured to perform convolution and pooling processing on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and a size of an adjacent region of each pixel point, so as to obtain a second feature map of the to-be-processed image;
and the processing result determining module 55 is configured to determine an image processing result of the image to be processed according to the second feature map.
In one possible implementation, the convolution kernel determining module includes:
the feature map fusion submodule is used for carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, and the size of the third feature map is the same as that of the first feature map;
the convolution and activation submodule is used for performing convolution and activation processing on the third feature map to obtain a fourth feature map, the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and the convolution kernel determining submodule is used for determining convolution kernel parameters corresponding to each pixel point according to the fourth feature map.
In one possible implementation, the convolution kernel determination submodule is configured to:
determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and converting the feature vector corresponding to each pixel point in the first feature map into a convolution kernel parameter corresponding to each pixel point.
In one possible implementation, the convolution and pooling processing module includes:
the adjacent region determining submodule is used for determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
the convolution submodule is used for performing, for any pixel point in the first feature map, convolution processing on the first feature map according to the convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in the neighboring area of the pixel point;
the pooling submodule is used for pooling first convolution responses of all the pixel points in the adjacent areas of the pixel points and determining second convolution responses of the pixel points;
and the feature map determining submodule is used for determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. An image processing method based on wide-area dynamic convolution, characterized in that the method comprises:
extracting the features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
carrying out up-sampling on the Nth-level feature map to obtain a first feature map of the image to be processed;
determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size;
performing convolution and pooling on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and the size of an adjacent region of each pixel point to obtain a second feature map of the image to be processed;
and determining an image processing result of the image to be processed according to the second feature map.
2. The method of claim 1, wherein determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a predetermined convolution kernel size comprises:
carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, wherein the size of the third feature map is the same as that of the first feature map;
performing convolution and activation processing on the third feature map to obtain a fourth feature map, wherein the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and determining convolution kernel parameters corresponding to the pixel points according to the fourth feature map.
3. The method of claim 2, wherein determining convolution kernel parameters corresponding to the respective pixel points from the fourth feature map comprises:
determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and converting the feature vector corresponding to each pixel point in the first feature map into a convolution kernel parameter corresponding to each pixel point.
4. The method according to claim 1, wherein the convolving and pooling the pixels in the first feature map according to the convolution kernels corresponding to the pixels in the first feature map and the sizes of the neighboring regions of the pixels to obtain the second feature map of the image to be processed comprises:
determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
for any pixel point in the first feature map, performing convolution processing on the first feature map according to a convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in a neighboring area of the pixel point;
pooling first convolution responses of all pixel points in the adjacent areas of the pixel points, and determining second convolution responses of the pixel points;
and determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
5. The method of claim 2, wherein resizing and fusing the N-level feature maps to obtain a third feature map comprises:
according to the size of the first feature map, carrying out size adjustment on the N-level feature map to obtain N fifth feature maps, wherein the size of the fifth feature maps is the same as that of the first feature map;
and adding the N feature values of the pixel points with the same position in the N fifth feature maps to obtain a third feature map.
6. The method according to any one of claims 1 to 5, wherein the method is implemented by a neural network comprising a feature extraction network, a convolution kernel generation network and a wide-area dynamic convolution network, the feature extraction network being used for feature extraction, the convolution kernel generation network being used for determining the convolution kernel corresponding to each pixel point, and the wide-area dynamic convolution network being used for determining the second feature map,
wherein the method further comprises:
training the feature extraction network on a preset training set, wherein the training set comprises a plurality of sample images, reference feature maps of the sample images and reference processing results of the sample images;
and training the convolution kernel generation network and the wide-area dynamic convolution network according to the training set and the trained feature extraction network.
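The two-stage schedule might look like the sketch below. The optimizer, loss functions, and loader signature are all assumptions, and the head mapping the second feature map to the final processing result is folded into `dyn_conv` for brevity.

```python
def train_two_stage(backbone, kernel_gen, dyn_conv, loader,
                    feat_loss_fn, task_loss_fn, epochs=10, lr=1e-3):
    """Illustrative sketch of claim 6's two training stages."""
    # Stage 1: fit the feature extraction network to the reference
    # feature maps of the sample images.
    opt = torch.optim.Adam(backbone.parameters(), lr=lr)
    for _ in range(epochs):
        for image, ref_feats, _ in loader:
            loss = feat_loss_fn(backbone(image), ref_feats)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: freeze the trained extractor, then fit the kernel
    # generation and wide-area dynamic convolution networks to the
    # reference processing results.
    backbone.requires_grad_(False)
    params = list(kernel_gen.parameters()) + list(dyn_conv.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for image, _, ref_result in loader:
            feats = backbone(image)
            result = dyn_conv(feats, kernel_gen(feats))
            loss = task_loss_fn(result, ref_result)
            opt.zero_grad()
            loss.backward()
            opt.step()
```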
7. An image processing apparatus based on wide-area dynamic convolution, the apparatus comprising:
a feature extraction module, configured to perform feature extraction on an image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
an up-sampling module, configured to up-sample the Nth-level feature map to obtain a first feature map of the image to be processed;
a convolution kernel determining module, configured to determine, according to the N-level feature map and a preset convolution kernel size, a convolution kernel corresponding to each pixel point in the first feature map;
a convolution and pooling processing module, configured to perform convolution and pooling processing on each pixel point in the first feature map according to the convolution kernel corresponding to each pixel point in the first feature map and the size of the adjacent region of each pixel point, to obtain a second feature map of the image to be processed;
and a processing result determining module, configured to determine an image processing result of the image to be processed according to the second feature map.
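Claims 7 to 10 mirror the method claims as apparatus modules. Purely for orientation, the sketches above could be wired together roughly as follows; the random stand-in feature maps replace a real feature extraction module, and every name and shape is an assumption.

```python
# Stand-in N-level feature maps; a real feature extraction module
# (e.g. a CNN backbone) would produce these from the input image.
C, N = 16, 3
feature_maps = [torch.randn(1, C, 64 // 2 ** i, 64 // 2 ** i)
                for i in range(N)]

# Up-sampling module: enlarge the Nth-level map into the first feature map.
first = F.interpolate(feature_maps[-1], scale_factor=2,
                      mode='bilinear', align_corners=False)

# Convolution kernel determining module (claims 2-3 sketch).
third = fuse_feature_maps(feature_maps, first.shape[-2:])
kernels = KernelGenerator(C, k=3)(third)

# Convolution and pooling processing module (claim 4 sketch).
second = wide_area_dynamic_conv(first, kernels, k=3, d=2)
print(second.shape)  # torch.Size([1, 16, 32, 32])
```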
8. The apparatus of claim 7, wherein the convolution kernel determining module comprises:
a feature map fusion submodule, configured to resize and fuse the N-level feature map to obtain a third feature map, wherein the size of the third feature map is the same as that of the first feature map;
a convolution and activation submodule, configured to perform convolution and activation processing on the third feature map to obtain a fourth feature map, wherein the height and width of the fourth feature map are the same as those of the first feature map, the number of channels of the fourth feature map is k × k times that of the first feature map, k × k is the preset convolution kernel size, and k is a positive integer;
and a convolution kernel determining submodule, configured to determine, according to the fourth feature map, the convolution kernel parameters corresponding to each pixel point.
9. The apparatus of claim 8, wherein the convolution kernel determining submodule is configured to:
determine, according to the fourth feature map, a feature vector corresponding to each pixel point in the first feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and convert the feature vector corresponding to each pixel point in the first feature map into the convolution kernel parameters corresponding to that pixel point.
10. The apparatus of claim 7, wherein the convolution and pooling processing module comprises:
an adjacent region determining submodule, configured to determine the adjacent region of each pixel point according to the size of the adjacent region of each pixel point in the first feature map;
a convolution submodule, configured to, for any pixel point in the first feature map, perform convolution processing on the first feature map according to the convolution kernel corresponding to the pixel point, and determine a first convolution response of each pixel point in the adjacent region of the pixel point;
a pooling submodule, configured to pool the first convolution responses of all pixel points in the adjacent region of the pixel point, and determine a second convolution response of the pixel point;
and a feature map determining submodule, configured to determine the second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
CN202010151431.3A 2020-03-06 2020-03-06 Image processing method and device based on wide-area dynamic convolution Active CN111340049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010151431.3A CN111340049B (en) 2020-03-06 2020-03-06 Image processing method and device based on wide-area dynamic convolution

Publications (2)

Publication Number Publication Date
CN111340049A (en) 2020-06-26
CN111340049B CN111340049B (en) 2023-06-09

Family

ID=71185938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010151431.3A Active CN111340049B (en) 2020-03-06 2020-03-06 Image processing method and device based on wide-area dynamic convolution

Country Status (1)

Country Link
CN (1) CN111340049B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183291A (en) * 2020-09-22 2021-01-05 上海蜜度信息技术有限公司 Method and system for detecting tiny object in image, storage medium and terminal
CN112200008A (en) * 2020-09-15 2021-01-08 青岛邃智信息科技有限公司 Face attribute recognition method in community monitoring scene
CN113936163A (en) * 2020-07-14 2022-01-14 武汉Tcl集团工业研究院有限公司 Image processing method, terminal and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7649927B1 (en) * 2006-12-29 2010-01-19 Kiomars Anvari Equalizer filter with dynamically configurable convolution filter
CN107578054A (en) * 2017-09-27 2018-01-12 北京小米移动软件有限公司 Image processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIALIN WU ET AL.: "Dynamic Filtering with Large Sampling Field for ConvNets", arXiv *
FENG Jiawen; ZHANG Limin; DENG Xiangyang: "Application of dual-channel convolutional neural network in static gesture recognition", Computer Engineering and Applications *

Also Published As

Publication number Publication date
CN111340049B (en) 2023-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant