CN111340049A - Image processing method and device based on wide-area dynamic convolution


Info

Publication number
CN111340049A
Authority
CN
China
Prior art keywords
feature map
convolution
pixel point
feature
determining
Prior art date
Legal status
Granted
Application number
CN202010151431.3A
Other languages
Chinese (zh)
Other versions
CN111340049B (en)
Inventor
季向阳
杨宇
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010151431.3A
Publication of CN111340049A
Application granted
Publication of CN111340049B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image processing method and device based on wide-area dynamic convolution, wherein the method comprises the following steps: extracting features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1; up-sampling the Nth-level feature map to obtain a first feature map of the image to be processed; determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size; performing convolution and pooling on each pixel point in the first feature map according to the convolution kernel corresponding to that pixel point and the size of its adjacent region, to obtain a second feature map of the image to be processed; and determining an image processing result of the image to be processed according to the second feature map. Embodiments of the disclosure can enhance the feature map of the image to be processed, improve its resolution and definition, and thereby improve the accuracy of image processing.

Description

Image processing method and device based on wide-area dynamic convolution
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus based on wide-area dynamic convolution.
Background
At present, convolutional neural networks are widely used in computer vision tasks. In general, an original image is input into a convolutional neural network, and after multiple levels of convolution and down-sampling, a feature map with a lower resolution and a higher channel count than the original image is obtained. The feature map is considered to encode certain spatial information (that is, it retains the adjacency relationships between pixels of the original image), and the feature vector of each pixel is considered to encode rich semantic information.
However, such feature maps suffer from low resolution and blurring, so the accuracy of image processing results determined from them is poor. For example, the convolution kernel of a standard convolution operation is spatially shared: the feature map obtained through standard convolution is relatively smooth, and edge information in the original image is easily lost, so standard convolution performs poorly on dense estimation tasks such as semantic segmentation and optical flow prediction.
Disclosure of Invention
In view of this, the present disclosure provides an image processing method and apparatus based on wide-area dynamic convolution.
According to an aspect of the present disclosure, there is provided an image processing method based on wide-area dynamic convolution, the method including:
extracting the features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
carrying out up-sampling on the Nth-level feature map to obtain a first feature map of the image to be processed;
determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size;
performing convolution and pooling on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and the size of an adjacent region of each pixel point to obtain a second feature map of the image to be processed;
and determining an image processing result of the image to be processed according to the second feature map.
In a possible implementation manner, determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size includes:
carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, wherein the size of the third feature map is the same as that of the first feature map;
performing convolution and activation processing on the third feature map to obtain a fourth feature map, wherein the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and determining convolution kernel parameters corresponding to the pixel points according to the fourth feature map.
In a possible implementation manner, determining, according to the fourth feature map, a convolution kernel parameter corresponding to each pixel point includes:
determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and converting the feature vector corresponding to each pixel point in the first feature map into a convolution kernel parameter corresponding to each pixel point.
In a possible implementation manner, performing convolution and pooling processing on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and a size of an adjacent region of each pixel point to obtain a second feature map of the image to be processed includes:
determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
for any pixel point in the first feature map, performing convolution processing on the first feature map according to a convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in a neighboring area of the pixel point;
pooling first convolution responses of all pixel points in the adjacent areas of the pixel points, and determining second convolution responses of the pixel points;
and determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
In a possible implementation manner, the resizing and fusing the N-level feature maps to obtain a third feature map includes:
according to the size of the first feature map, carrying out size adjustment on the N-level feature map to obtain N fifth feature maps, wherein the size of the fifth feature maps is the same as that of the first feature map;
and adding the N feature values of the pixel points with the same position in the N fifth feature maps to obtain a third feature map.
In one possible implementation, the method is implemented by a neural network, the neural network includes a feature extraction network, a convolution kernel generation network, and a wide-area dynamic convolution network, the feature extraction network is used for feature extraction, the convolution kernel generation network is used for determining a convolution kernel corresponding to each pixel point, the wide-area dynamic convolution network is used for determining a second feature map,
wherein the method further comprises:
training the feature extraction network according to a preset training set, wherein the training set comprises a plurality of sample images, reference feature maps of the sample images and reference processing results of the sample images;
and training the convolution kernel generation network and the wide-area dynamic convolution network according to the training set and the trained feature extraction network.
According to another aspect of the present disclosure, there is provided an image processing apparatus based on wide-area dynamic convolution, the apparatus comprising:
the feature extraction module is used for extracting features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
the up-sampling module is used for up-sampling the Nth-level feature map to obtain a first feature map of the image to be processed;
a convolution kernel determining module, configured to determine, according to the N-level feature map and a preset convolution kernel size, a convolution kernel corresponding to each pixel point in the first feature map;
the convolution and pooling processing module is used for performing convolution and pooling processing on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and the size of an adjacent region of each pixel point to obtain a second feature map of the image to be processed;
and the processing result determining module is used for determining the image processing result of the image to be processed according to the second feature map.
In one possible implementation, the convolution kernel determining module includes:
the feature map fusion submodule is used for carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, and the size of the third feature map is the same as that of the first feature map;
the convolution and activation submodule is used for performing convolution and activation processing on the third feature map to obtain a fourth feature map, the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and the convolution kernel determining submodule is used for determining convolution kernel parameters corresponding to each pixel point according to the fourth feature map.
In one possible implementation, the convolution kernel determination submodule is configured to:
determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and converting the feature vector corresponding to each pixel point in the first feature map into a convolution kernel parameter corresponding to each pixel point.
In one possible implementation, the convolution and pooling processing module includes:
the adjacent region determining submodule is used for determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
the convolution submodule is used for performing, for any pixel point in the first feature map, convolution processing on the first feature map according to the convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in the neighboring area of the pixel point;
the pooling submodule is used for pooling first convolution responses of all the pixel points in the adjacent areas of the pixel points and determining second convolution responses of the pixel points;
and the feature map determining submodule is used for determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
According to the embodiments of the disclosure, feature extraction can be performed on the image to be processed to obtain an N-level feature map, and the Nth-level feature map can be up-sampled to obtain the first feature map of the image to be processed. The convolution kernel corresponding to each pixel point in the first feature map is then determined according to the N-level feature map and the preset convolution kernel size, and each pixel point is convolved and pooled according to its corresponding convolution kernel and the size of its adjacent region to obtain a second feature map of the image to be processed, from which the image processing result of the image to be processed is determined. In this way, the first feature map of the image to be processed is enhanced into a second feature map with high resolution and high definition, and determining the image processing result from the second feature map improves the accuracy of image processing.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 illustrates a flowchart of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a neural network of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a convolution kernel generation network according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating an application scenario of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure.
Fig. 5 illustrates a block diagram of an image processing apparatus based on wide-area dynamic convolution according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 illustrates a flowchart of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure. As shown in fig. 1, the method includes:
step S11, extracting the features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
step S12, performing up-sampling on the Nth-level feature map to obtain a first feature map of the image to be processed;
step S13, determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size;
step S14, according to the convolution kernel corresponding to each pixel point in the first feature map and the size of the adjacent region of each pixel point, performing convolution and pooling processing on each pixel point in the first feature map respectively to obtain a second feature map of the image to be processed;
and step S15, determining the image processing result of the image to be processed according to the second feature map.
According to the embodiments of the disclosure, feature extraction can be performed on the image to be processed to obtain an N-level feature map, and the Nth-level feature map can be up-sampled to obtain the first feature map of the image to be processed. The convolution kernel corresponding to each pixel point in the first feature map is then determined according to the N-level feature map and the preset convolution kernel size, and each pixel point is convolved and pooled according to its corresponding convolution kernel and the size of its adjacent region to obtain a second feature map of the image to be processed, from which the image processing result of the image to be processed is determined. In this way, the first feature map of the image to be processed is enhanced into a second feature map with high resolution and high definition, and determining the image processing result from the second feature map improves the accuracy of image processing.
The wide-area dynamic convolution according to the embodiments of the present disclosure may include: performing dynamic convolution on the first feature map of the image to be processed according to position-specific and sample-specific convolution kernels, and pooling the dynamic convolution responses within the adjacent region of each pixel point in the first feature map, so as to determine the wide-area dynamic convolution response of each pixel point. Position specificity indicates that different pixel points in the first feature map correspond to different convolution kernels; sample specificity indicates that the convolution kernels are generated from the multi-level feature maps of the image to be processed and therefore vary with the input. The hyper-parameters of the wide-area dynamic convolution can be set and adjusted by those skilled in the art according to image processing requirements, and may include the convolution kernel size, the adjacent region size, the dilation (void) rate, the stride used when convolving each pixel point, and the like. The disclosure does not limit the hyper-parameters of the wide-area dynamic convolution or their specific values.
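As a concrete illustration of these hyper-parameters, the following Python sketch collects them in one place (a hypothetical configuration object; the field names and default values are ours, not the patent's):

```python
from dataclasses import dataclass

# Hypothetical container for the wide-area dynamic convolution hyper-parameters
# named above; field names and default values are illustrative assumptions.
@dataclass
class WideAreaDynConvConfig:
    kernel_size: int = 5        # k: each per-pixel kernel is k x k
    neighborhood_size: int = 3  # side length of the pooled adjacent region
    dilation: int = 1           # dilation (void) rate used when convolving
    stride: int = 1             # step length used when convolving each pixel
```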
In one possible implementation, the image processing method based on wide-area dynamic convolution may be applied to a processor. The processor may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations, such as a GPU (Graphics Processing Unit), an NPU (Neural-network Processing Unit), or a DSP (Digital Signal Processor). The present disclosure is not limited to a particular type of processor.
In one possible implementation, the image processing method based on wide-area dynamic convolution can be used for improving the accuracy of image processing in a computer vision task. Among other things, computer vision tasks may include image classification, target detection, optical flow estimation, image semantic segmentation, and so forth. The present disclosure is not limited to the details of the computer vision task.
In one possible implementation manner, in step S11, feature extraction may be performed on the image to be processed, so as to obtain an N-level feature map. The value of N may be an integer greater than 1, such as 3, 5, etc., and a person skilled in the art may set the specific value of N according to the actual situation, which is not limited in the present disclosure. The skilled person can also determine the way of extracting the features of the image to be processed according to the actual situation, and the present disclosure does not limit this.
In a possible implementation manner, when the image processing method is implemented by a neural network, feature extraction may be performed on the image to be processed by a feature extraction network. It should be understood that the feature extraction network may be a convolutional neural network, a deep neural network, etc., for example, the feature extraction network may include a plurality of convolutional layers, downsampling, etc., and the present disclosure does not limit the specific type of feature extraction network.
In one possible implementation manner, in step S12, the nth level feature map may be upsampled to obtain a first feature map of the image to be processed. The upsampling may be performed in various manners such as bilinear interpolation, transposed convolution, etc., which is not limited by the present disclosure.
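As a minimal sketch of this step (PyTorch-style code is our assumption; the disclosure names no framework), bilinear up-sampling of the Nth-level feature map might look like this:

```python
import torch
import torch.nn.functional as F

# feat_n stands for the Nth-level feature map; its shape is illustrative.
feat_n = torch.randn(1, 256, 32, 32)
# Bilinear up-sampling to obtain the "first feature map" of the text;
# transposed convolution would be an equally valid choice here.
first_feature_map = F.interpolate(feat_n, scale_factor=4, mode="bilinear",
                                  align_corners=False)
# first_feature_map now has shape (1, 256, 128, 128).
```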
In a possible implementation manner, after the first feature map of the image to be processed is obtained, in step S13, the convolution kernel corresponding to each pixel point in the first feature map may be determined according to the N-level feature map and a preset convolution kernel size. That is, each pixel point in the first feature map may correspond to a different convolution kernel, and the convolution kernels are determined according to the N-level feature map obtained in the feature extraction process and the preset convolution kernel size. The convolution kernel size may be set according to actual needs, for example, 3 × 3, 5 × 5, or 7 × 7, which is not limited by this disclosure.
In one possible implementation, the number of convolution kernels corresponding to any pixel point in the first feature map is the same as the number of channels of the first feature map. For example, when the first feature map is a single channel, the number of convolution kernels corresponding to any pixel point in the first feature map is 1; when the first feature map is a multi-channel feature map, for example, when the number of channels of the first feature map is 64, the number of convolution kernels corresponding to any pixel point in the first feature map is 64.
In a possible implementation manner, after determining the convolution kernel corresponding to each pixel point in the first feature map, in step S14, each pixel point in the first feature map may be convolved and pooled respectively according to the convolution kernel corresponding to each pixel point in the first feature map and the size of the neighboring region of each pixel point, so as to obtain a second feature map of the image to be processed.
In a possible implementation manner, each pixel point in the first feature map may be respectively convolved according to a convolution kernel corresponding to each pixel point in the first feature map, a plurality of convolution responses in a neighboring region of each pixel point are determined, and then pooling (for example, average pooling) is respectively performed on the plurality of convolution responses in the neighboring region of each pixel point, so as to obtain a second feature map of the image to be processed.
In a possible implementation manner, when the first feature map is multi-channel, the first feature map may be processed channel by channel to obtain a second feature map of the image to be processed.
In one possible implementation, after obtaining the second feature map of the image to be processed, in step S15, an image processing result of the image to be processed may be determined according to the second feature map. For example, an image classification result, an object detection result, an image semantic segmentation result, and the like of the image to be processed may be determined according to the second feature map.
In one possible implementation, step S13 may include:
carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, wherein the size of the third feature map is the same as that of the first feature map;
performing convolution and activation processing on the third feature map to obtain a fourth feature map, wherein the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and determining convolution kernel parameters corresponding to the pixel points according to the fourth feature map.
In a possible implementation manner, when the convolution kernel corresponding to each pixel point in the first feature map is determined according to the N-level feature map and the preset convolution kernel size, because the N levels of feature maps have different sizes, they may first be resized by up-sampling, convolution, and the like, and the resized feature maps may then be fused to obtain a third feature map whose size is the same as that of the first feature map. That is, N feature maps of different sizes are resized and fused into a third feature map of the same size as the first feature map. The size of the first feature map may include a height H, a width W, and a channel number C.
In one possible implementation, after the third feature map is obtained, convolution and activation processing may be performed on it to obtain a fourth feature map, where the height and width of the fourth feature map are the same as those of the first feature map, and the number of channels of the fourth feature map is k × k times that of the first feature map, where k × k is the preset convolution kernel size and k is a positive integer.
For example, assume that the preset convolution kernel size is 5 × 5 and the size of the third feature map is 512 × H × W (where 512 is the number of channels). Then 1 × 1 convolution and ReLU activation, 3 × 3 convolution and ReLU activation, and 1 × 1 convolution and ReLU activation may be performed sequentially on the third feature map to obtain the fourth feature map: a first intermediate feature map of size 256 × H × W is obtained after the 1 × 1 convolution and ReLU activation of the third feature map; a second intermediate feature map of size 256 × H × W is obtained after the 3 × 3 convolution and ReLU activation of the first intermediate feature map; and the fourth feature map of size (256 × 5 × 5) × H × W is obtained after the 1 × 1 convolution and ReLU activation of the second intermediate feature map.
In one possible implementation, after obtaining the fourth feature map with size (C × k × k) × H × W, the convolution kernel parameters corresponding to each pixel point in the first feature map may be determined according to the fourth feature map, and the C × k × k feature values corresponding to each pixel point may correspond to the convolution kernel parameters of C channels of each pixel point.
In this embodiment, when the size of the first feature map is C × H × W, the N-level feature map can be resized and fused to obtain a third feature map of the same size as the first feature map, and the third feature map can be convolved and activated to obtain a fourth feature map of size (C × k × k) × H × W. The convolution kernel parameters corresponding to each pixel point in the first feature map are then determined from the fourth feature map, so that information lost in the feature extraction process (e.g., lost edge information) is taken into account, improving the accuracy of the convolution kernel parameters.
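To make the shape bookkeeping of this kernel-generation stage concrete, here is a hedged PyTorch-style sketch of the 1 × 1 / 3 × 3 / 1 × 1 convolution-and-ReLU sequence; the channel widths (512 → 256 → 256 → C × k × k) follow the worked example above and are assumptions rather than requirements of the method:

```python
import torch.nn as nn

C, k = 256, 5  # assumed first-feature-map channels and preset kernel size

# 1x1 conv + ReLU, 3x3 conv + ReLU, 1x1 conv + ReLU, applied in sequence to the
# third feature map (B, 512, H, W) to produce the fourth feature map
# (B, C*k*k, H, W).
kernel_head = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, C * k * k, kernel_size=1), nn.ReLU(inplace=True),
)
```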
In a possible implementation manner, the resizing and fusing of the N-level feature map to obtain a third feature map includes: according to the size of the first feature map, carrying out size adjustment on the N-level feature map to obtain N fifth feature maps, wherein the size of the fifth feature maps is the same as that of the first feature map; and adding the N feature values of the pixel points with the same position in the N fifth feature maps to obtain a third feature map.
For example, the N-level feature maps with different sizes may be respectively subjected to upsampling and 3 × 3 convolution processing to obtain N fifth feature maps with the size of C × H × W, and then N feature values of pixel points with the same position in the N fifth feature maps are added to obtain a third feature map.
In this embodiment, N fifth feature maps are obtained by resizing the N-level feature maps, and N feature values of pixels having the same position in the N fifth feature maps are added to obtain a third feature map, so that the N-level feature maps can be fused into one feature map.
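A sketch of this resize-and-sum fusion under the same PyTorch-style assumption (module and argument names are ours):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResizeAndSumFusion(nn.Module):
    """Up-sample each of the N level feature maps to the first feature map's
    size, apply a 3x3 convolution to each, and sum the results element-wise
    to obtain the third feature map."""
    def __init__(self, in_channels_per_level, out_channels):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=3, padding=1)
            for c in in_channels_per_level
        )

    def forward(self, feature_maps, target_hw):
        fused = 0
        for conv, fmap in zip(self.convs, feature_maps):
            x = F.interpolate(fmap, size=target_hw, mode="bilinear",
                              align_corners=False)
            fused = fused + conv(x)  # add same-position feature values
        return fused  # the "third feature map"
```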
In a possible implementation manner, determining the convolution kernel parameters corresponding to each pixel point according to the fourth feature map may include: determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, where the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer; and converting the feature vector corresponding to each pixel point in the first feature map into the convolution kernel parameters corresponding to that pixel point.
For example, when the size of the fourth feature map is (C × k × k) × H × W, for any pixel (i, j) in the first feature map, where i and j are positive integers, the C × k × k-dimensional vector with position (i, j) in the fourth feature map may be determined as the feature vector of the pixel (i, j), and then the C × k × k-dimensional feature vector of the pixel (i, j) may be transformed channel by channel (for example, transformed using the transformation function Reshape), so as to obtain the convolution kernel parameters of C channels corresponding to the pixel (i, j).
In this embodiment, the feature vector corresponding to each pixel point in the first feature map can be determined according to the fourth feature map, and the feature vector is converted into the convolution kernel parameter corresponding to each pixel point, so that the convolution kernel parameter corresponding to each pixel point can be determined simply and quickly, and the processing efficiency can be improved.
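The reshape itself can be illustrated in a few lines (PyTorch-style; the shapes follow the text, the code is an illustrative assumption):

```python
import torch

B, C, k, H, W = 1, 256, 5, 128, 128       # illustrative sizes
fourth_feature_map = torch.randn(B, C * k * k, H, W)

# The C*k*k-dimensional vector at each position (i, j) becomes C kernels of
# size k x k for the pixel at (i, j): the channel-by-channel Reshape of the text.
per_pixel_kernels = fourth_feature_map.view(B, C, k * k, H, W)
# per_pixel_kernels[b, c, :, i, j] is the flattened k x k kernel of channel c
# for pixel (i, j).
```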
In one possible implementation, step S14 may include:
determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
for any pixel point in the first feature map, performing convolution processing on the first feature map according to a convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in a neighboring area of the pixel point;
pooling first convolution responses of all pixel points in the adjacent areas of the pixel points, and determining second convolution responses of the pixel points;
and determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
In a possible implementation manner, the neighboring region of each pixel point may be determined according to the size of the neighboring region of each pixel point in the first feature map. For any pixel point in the first feature map, when determining the neighboring region thereof, the position of the pixel point may be used as a center point, and the neighboring region thereof is determined according to the size of the neighboring region. The positions of the pixels are different, and the number of the pixels in the adjacent area may be different.
In a possible implementation manner, for any pixel point in the first feature map, convolution processing may be performed on the first feature map according to a convolution kernel corresponding to the pixel point, and a first convolution response of each pixel point in a neighboring region of the pixel point is determined.
For example, when the size of the first feature map of the single channel is 16 × 16 and the size of the neighborhood is 3 × 3, the pixel (5, 5) in the first feature map may be taken as a center point, the 3 × 3 neighborhood around the center point may be determined as the neighborhood, the neighborhood includes 9 pixels, and then the convolution processing is performed on the first feature map according to the convolution kernel corresponding to the pixel (5, 5), so as to determine the first convolution response of the 9 pixels in the neighborhood of the pixel (5, 5).
In a possible implementation manner, the first convolution response of each pixel point in the neighboring area of the pixel point may be pooled, and the second convolution response of the pixel point is determined. Wherein the pooling treatment may be an average pooling. For example, the first convolution responses of 9 pixels in the neighborhood of pixel (5, 5) may be averaged, and the average may be determined as the second convolution response of pixel (5, 5).
In a possible implementation manner, the second feature map of the image to be processed may be determined according to the second convolution response of each pixel point in the first feature map. When the first feature map is multi-channel, the first feature map can be processed channel by channel to obtain a second feature map of the image to be processed.
In this embodiment, convolution processing may be performed on each pixel point in the first feature map to obtain a plurality of first convolution responses in the neighboring region of each pixel point, pooling processing may be performed on the plurality of first convolution responses in the neighboring region of each pixel point to obtain a second convolution response of each pixel point, and then the second feature map of the image to be processed may be determined.
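One way to realize this convolve-then-pool step is sketched below (PyTorch-style, our assumption). Because the kernel W(u) stays fixed while v ranges over u's adjacent region, averaging the first convolution responses over the region equals applying the per-pixel kernels to a box-filtered first feature map; this rearrangement and the zero padding at the borders are implementation choices of the sketch, not requirements stated in the disclosure:

```python
import torch
import torch.nn.functional as F

def wide_area_dynamic_conv(first_map, kernels, k, neighborhood):
    """first_map: (B, C, H, W); kernels: (B, C, k*k, H, W) per-pixel kernels.
    Returns the (B, C, H, W) map of second convolution responses."""
    B, C, H, W = first_map.shape
    # Average the feature map over each pixel's neighborhood; border pixels
    # average over their (smaller) actual neighborhoods.
    smoothed = F.avg_pool2d(first_map, neighborhood, stride=1,
                            padding=neighborhood // 2, count_include_pad=False)
    # Extract the k x k patch around every pixel of the smoothed map.
    patches = F.unfold(smoothed, kernel_size=k, padding=k // 2)
    patches = patches.view(B, C, k * k, H, W)
    # Dot each patch with its own pixel's kernel: the second convolution response.
    return (patches * kernels).sum(dim=2)

# Example usage with illustrative shapes (kernels would come from the
# convolution kernel generation step):
x = torch.randn(1, 256, 64, 64)
w = torch.randn(1, 256, 25, 64, 64)
second_map = wide_area_dynamic_conv(x, w, k=5, neighborhood=3)
```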
In one possible implementation, the second convolution response $G_u$ of any pixel point $u$ in the first feature map can be determined by the following formula (1):

$$G_u = \frac{1}{|\mathcal{N}(u)|} \sum_{v \in \mathcal{N}(u)} \big(F \circledast W(u)\big)_v \qquad (1)$$

where $F$ is a single-channel map of the first feature map of the image to be processed, $G$ is the single-channel map corresponding to $F$ in the second feature map of the image to be processed, $u$ denotes any pixel point in $F$, $\mathcal{N}(u)$ denotes the adjacent region of $u$, $v$ denotes any pixel point in the adjacent region $\mathcal{N}(u)$, $W(u)$ denotes the convolution kernel corresponding to pixel point $u$, and $\circledast$ denotes the standard convolution operation.
The first feature map of the image to be processed can be processed channel by channel using formula (1) to obtain the second feature map of the image to be processed.
In one possible implementation, the method may be implemented by a neural network, which may include a feature extraction network, a convolution kernel generation network, and a wide-area dynamic convolution network, the feature extraction network may be used for feature extraction, the convolution kernel generation network may be used for determining a convolution kernel corresponding to each pixel point, the wide-area dynamic convolution network may be used for determining a second feature map,
wherein the method further comprises: training the feature extraction network according to a preset training set, wherein the training set comprises a plurality of sample images, reference feature maps of the sample images and reference processing results of the sample images; and training the convolution kernel generation network and the wide-area dynamic convolution network according to the training set and the trained feature extraction network.
In one possible implementation, the feature extraction network may be trained according to a plurality of sample images in the training set and reference feature maps of the plurality of sample images. In the training process, the parameter values of the feature extraction network may be adjusted in a direction that minimizes the first loss function, and when the first loss function decreases to a certain degree or converges within a certain threshold, the adjustment is stopped, and the trained feature extraction network is obtained. The present disclosure does not limit the first penalty function used in the training process.
In a possible implementation manner, after the training of the feature extraction network is completed, the convolution kernel generation network and the wide-area dynamic convolution network can be trained end to end according to the multiple sample images in the training set and the reference processing results of those sample images. In the training process, the parameter values of the convolution kernel generation network and the wide-area dynamic convolution network can be adjusted in the direction that minimizes the second loss function; when the second loss function decreases to a certain degree or converges within a certain threshold, the adjustment is stopped, and the trained convolution kernel generation network and wide-area dynamic convolution network are obtained. The present disclosure does not limit the second loss function used in the training process.
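The two-stage schedule can be summarized in code as follows; this is a toy sketch with stand-in networks, random data, and assumed loss functions, none of which come from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_net = nn.Conv2d(3, 8, 3, padding=1)  # stand-in feature extraction network
enhance_net = nn.Conv2d(8, 8, 3, padding=1)  # stand-in for kernel generation +
                                             # wide-area dynamic convolution
decoder = nn.Conv2d(8, 2, 1)                 # stand-in result head

# Stage 1: train the feature extraction network against reference feature maps.
opt1 = torch.optim.SGD(feature_net.parameters(), lr=1e-2)
for _ in range(5):
    image = torch.randn(2, 3, 16, 16)
    ref_feat = torch.randn(2, 8, 16, 16)              # reference feature map
    loss1 = F.mse_loss(feature_net(image), ref_feat)  # assumed first loss function
    opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: freeze the trained extractor, then train the remaining networks
# end to end against reference processing results.
for p in feature_net.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.SGD(list(enhance_net.parameters()) + list(decoder.parameters()),
                       lr=1e-2)
for _ in range(5):
    image = torch.randn(2, 3, 16, 16)
    ref_result = torch.randint(0, 2, (2, 16, 16))     # reference processing result
    logits = decoder(enhance_net(feature_net(image)))
    loss2 = F.cross_entropy(logits, ref_result)       # assumed second loss function
    opt2.zero_grad(); loss2.backward(); opt2.step()
```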
Fig. 2 shows a schematic diagram of a neural network of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure. As shown in fig. 2, the neural network includes a feature extraction network 22, a convolution kernel generation network 24, and a wide-area dynamic convolution network 25. The image 21 to be processed can be input into the feature extraction network 22 for feature extraction to obtain a 5-level feature map; the 5th-level feature map is up-sampled to obtain a first feature map 23 of the image 21 to be processed, and the 5-level feature map is input into the convolution kernel generation network 24 to obtain the convolution kernels corresponding to each pixel point of the first feature map 23; the first feature map 23 and the convolution kernels corresponding to its pixel points are input into the wide-area dynamic convolution network 25 to perform wide-area dynamic convolution, obtaining a second feature map 26 of the image to be processed, and an image processing result 27 of the image to be processed can then be obtained according to the second feature map 26.
Fig. 3 shows a schematic diagram of a convolution kernel generation network according to an embodiment of the present disclosure. As shown in fig. 3, the 5 levels of feature maps of the image to be processed are feature map 311, feature map 312, feature map 313, feature map 314, and feature map 315. The 5 levels of feature maps can be respectively up-sampled and processed with 3 × 3 convolution to obtain feature map 321, feature map 322, feature map 323, feature map 324, and feature map 325, each with the same size as the first feature map (not shown) of the image to be processed. Feature map 321, feature map 322, feature map 323, feature map 324, and feature map 325 are then fused to obtain feature map 33, and feature map 33 is sequentially subjected to 1 × 1 convolution and ReLU activation, 3 × 3 convolution and ReLU activation, and 1 × 1 convolution and ReLU activation to obtain feature map 34. According to feature map 34, the convolution kernel parameters corresponding to each pixel point in the first feature map of the image to be processed can be determined.
Fig. 4 is a schematic diagram illustrating an application scenario of an image processing method based on wide-area dynamic convolution according to an embodiment of the present disclosure. As shown in fig. 4, the image processing method is implemented by a neural network and is used for performing image semantic segmentation on an image 41 to be processed. Feature extraction is performed on the image 41 to be processed by a full convolution network composed of 5 convolution groups (the first 4 convolution groups include a down-sampling layer), obtaining 5 levels of feature maps, namely feature map 42, feature map 43, feature map 44, feature map 45, and feature map 46. Feature map enhancement is then performed on the 5 levels of feature maps to obtain an enhanced feature map 48, and feature map 48 is input into a decoder to generate a semantic segmentation result 49.
When the feature map is enhanced, the input 5 levels of feature maps are subjected to up-sampling, 3 × 3 convolution, and fusion processing, and the fused feature map is sequentially subjected to 1 × 1 convolution and ReLU activation, 3 × 3 convolution and ReLU activation, and 1 × 1 convolution and ReLU activation to obtain feature map 471. Meanwhile, the 5th-level feature map (namely, feature map 46) is up-sampled to obtain feature map 472. Wide-area dynamic convolution is then performed on feature map 472 according to the dynamic convolution kernels determined from feature map 471, obtaining feature map 48.
The convolution groups shown in fig. 4 can have various structures and can be selected from different networks, such as the VGG series, ResNet series, Inception series, and the like; the disclosure is not limited thereto. The decoder shown in fig. 4 may be composed of one or more convolutional layers and up-sampling layers, or may be a more complex decoder, as the present disclosure is not limited in this respect.
In a possible implementation manner, the neural network in fig. 4 may be trained end to end according to a preset semantic segmentation training set. In training, the network loss L of the neural network can be determined using the following formula (2):
$$L = \frac{1}{|\mathcal{D}|} \sum_{I \in \mathcal{D}} \frac{1}{H'W'} \sum_{m} \ell\big(S^*_m, f_m(I;\theta)\big) \qquad (2)$$

where $\mathcal{D}$ denotes the training set of the neural network for semantic segmentation, $I$ denotes any sample image in $\mathcal{D}$, $S^*$ denotes the segmentation label in $\mathcal{D}$ corresponding to the sample image $I$, $m$ denotes any pixel point in the sample image $I$, $S^*_m$ denotes the class label of the pixel point $m$ expressed as a one-hot encoding, $\theta$ denotes the network parameters of the neural network, $f_m(I;\theta)$ denotes the class probability distribution predicted by the neural network for the sample image $I$ at pixel point $m$, $\ell(\cdot,\cdot)$ denotes the cross-entropy calculated between $S^*_m$ and $f_m(I;\theta)$, $H'$ denotes the height of the sample image $I$, $W'$ denotes the width of the sample image $I$, and $H'W'$ denotes the total number of pixels in the sample image $I$.
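For a single sample image, the inner average of formula (2) is the familiar per-pixel cross-entropy; a hedged PyTorch-style illustration with made-up shapes and class count:

```python
import torch
import torch.nn.functional as F

num_classes, Hp, Wp = 21, 64, 64                    # illustrative values
logits = torch.randn(1, num_classes, Hp, Wp)        # f(I; theta), pre-softmax
label = torch.randint(0, num_classes, (1, Hp, Wp))  # S*, integer class indices

# reduction="mean" averages the per-pixel cross-entropy over the H'W' pixels;
# averaging this value over the training set yields the network loss L.
loss_for_I = F.cross_entropy(logits, label, reduction="mean")
```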
The parameter values of the neural network may be adjusted in a direction that minimizes the network loss L, and when the network loss L decreases to a certain degree or converges within a certain threshold, the adjustment is stopped, and a trained neural network is obtained. The trained neural network can be used to complete the image semantic segmentation task.
In a possible implementation manner, a deep learning optimization algorithm, such as Stochastic Gradient Descent (SGD) or Adam (Adaptive Moment Estimation), may be applied to minimize the network loss L. The present disclosure is not so limited.
According to the image processing method based on the wide-area dynamic convolution, the feature map in the image processing process can be enhanced, so that the enhanced feature map has more accurate space geometric information, and the edge information of an object and a structure in the image to be processed can be retained, so that the accuracy of the image processing result of the image to be processed can be improved.
Fig. 5 illustrates a block diagram of an image processing apparatus based on wide-area dynamic convolution according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes:
the feature extraction module 51 is configured to perform feature extraction on an image to be processed to obtain an N-level feature map, where N is an integer greater than 1;
the upsampling module 52 is configured to upsample the Nth-level feature map to obtain a first feature map of the image to be processed;
a convolution kernel determining module 53, configured to determine, according to the N-level feature map and a preset convolution kernel size, a convolution kernel corresponding to each pixel point in the first feature map;
a convolution and pooling processing module 54, configured to perform convolution and pooling processing on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and a size of an adjacent region of each pixel point, so as to obtain a second feature map of the to-be-processed image;
and the processing result determining module 55 is configured to determine an image processing result of the image to be processed according to the second feature map.
In one possible implementation, the convolution kernel determining module includes:
the feature map fusion submodule is used for carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, and the size of the third feature map is the same as that of the first feature map;
the convolution and activation submodule is used for performing convolution and activation processing on the third feature map to obtain a fourth feature map, the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and the convolution kernel determining submodule is used for determining convolution kernel parameters corresponding to each pixel point according to the fourth feature map.
In one possible implementation, the convolution kernel determination submodule is configured to:
determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and converting the feature vector corresponding to each pixel point in the first feature map into a convolution kernel parameter corresponding to each pixel point.
In one possible implementation, the convolution and pooling processing module includes:
the adjacent region determining submodule is used for determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
the convolution submodule is used for performing, for any pixel point in the first feature map, convolution processing on the first feature map according to the convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in the neighboring area of the pixel point;
the pooling submodule is used for pooling first convolution responses of all the pixel points in the adjacent areas of the pixel points and determining second convolution responses of the pixel points;
and the feature map determining submodule is used for determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. An image processing method based on wide-area dynamic convolution, characterized in that the method comprises:
extracting the features of the image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
carrying out up-sampling on the Nth-level feature map to obtain a first feature map of the image to be processed;
determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a preset convolution kernel size;
performing convolution and pooling on each pixel point in the first feature map according to a convolution kernel corresponding to each pixel point in the first feature map and the size of an adjacent region of each pixel point to obtain a second feature map of the image to be processed;
and determining an image processing result of the image to be processed according to the second feature map.
2. The method of claim 1, wherein determining a convolution kernel corresponding to each pixel point in the first feature map according to the N-level feature map and a predetermined convolution kernel size comprises:
carrying out size adjustment and fusion processing on the N-level feature map to obtain a third feature map, wherein the size of the third feature map is the same as that of the first feature map;
performing convolution and activation processing on the third feature map to obtain a fourth feature map, wherein the height and the width of the fourth feature map are the same as those of the first feature map, the channel number of the fourth feature map is k × k times of that of the first feature map, k × k is a preset convolution kernel size, and k is a positive integer;
and determining convolution kernel parameters corresponding to the pixel points according to the fourth feature map.
3. The method of claim 2, wherein determining convolution kernel parameters corresponding to the respective pixel points from the fourth feature map comprises:
determining a feature vector corresponding to each pixel point in the first feature map according to the fourth feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and converting the feature vector corresponding to each pixel point in the first feature map into a convolution kernel parameter corresponding to each pixel point.
4. The method according to claim 1, wherein the convolving and pooling the pixels in the first feature map according to the convolution kernels corresponding to the pixels in the first feature map and the sizes of the neighboring regions of the pixels to obtain the second feature map of the image to be processed comprises:
determining the adjacent region of each pixel point according to the adjacent region size of each pixel point in the first feature map;
for any pixel point in the first feature map, performing convolution processing on the first feature map according to a convolution kernel corresponding to the pixel point, and determining a first convolution response of each pixel point in a neighboring area of the pixel point;
pooling first convolution responses of all pixel points in the adjacent areas of the pixel points, and determining second convolution responses of the pixel points;
and determining a second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
5. The method of claim 2, wherein resizing and fusing the N-level feature maps to obtain a third feature map comprises:
according to the size of the first feature map, carrying out size adjustment on the N-level feature map to obtain N fifth feature maps, wherein the size of the fifth feature maps is the same as that of the first feature map;
and adding the N feature values of the pixel points with the same position in the N fifth feature maps to obtain a third feature map.
6. The method according to any one of claims 1 to 5, wherein the method is implemented by a neural network comprising a feature extraction network, a convolution kernel generation network and a wide-area dynamic convolution network, the feature extraction network being used for feature extraction, the convolution kernel generation network being used for determining the convolution kernel corresponding to each pixel point, and the wide-area dynamic convolution network being used for determining the second feature map,
wherein the method further comprises:
training the feature extraction network on a preset training set, wherein the training set comprises a plurality of sample images, reference feature maps of the sample images and reference processing results of the sample images;
and training the convolution kernel generation network and the wide-area dynamic convolution network according to the training set and the trained feature extraction network.
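The two-stage schedule might look like the sketch below. The optimizer, loss functions, and loader signature are all assumptions, and the head mapping the second feature map to the final processing result is folded into `dyn_conv` for brevity.

```python
def train_two_stage(backbone, kernel_gen, dyn_conv, loader,
                    feat_loss_fn, task_loss_fn, epochs=10, lr=1e-3):
    """Illustrative sketch of claim 6's two training stages."""
    # Stage 1: fit the feature extraction network to the reference
    # feature maps of the sample images.
    opt = torch.optim.Adam(backbone.parameters(), lr=lr)
    for _ in range(epochs):
        for image, ref_feats, _ in loader:
            loss = feat_loss_fn(backbone(image), ref_feats)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: freeze the trained extractor, then fit the kernel
    # generation and wide-area dynamic convolution networks to the
    # reference processing results.
    backbone.requires_grad_(False)
    params = list(kernel_gen.parameters()) + list(dyn_conv.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for image, _, ref_result in loader:
            feats = backbone(image)
            result = dyn_conv(feats, kernel_gen(feats))
            loss = task_loss_fn(result, ref_result)
            opt.zero_grad()
            loss.backward()
            opt.step()
```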
7. An image processing apparatus based on wide-area dynamic convolution, the apparatus comprising:
a feature extraction module, configured to perform feature extraction on an image to be processed to obtain an N-level feature map, wherein N is an integer greater than 1;
an up-sampling module, configured to up-sample the Nth-level feature map to obtain a first feature map of the image to be processed;
a convolution kernel determining module, configured to determine, according to the N-level feature map and a preset convolution kernel size, a convolution kernel corresponding to each pixel point in the first feature map;
a convolution and pooling processing module, configured to perform convolution and pooling processing on each pixel point in the first feature map according to the convolution kernel corresponding to each pixel point in the first feature map and the size of the adjacent region of each pixel point, to obtain a second feature map of the image to be processed;
and a processing result determining module, configured to determine an image processing result of the image to be processed according to the second feature map.
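Claims 7 to 10 mirror the method claims as apparatus modules. Purely for orientation, the sketches above could be wired together roughly as follows; the random stand-in feature maps replace a real feature extraction module, and every name and shape is an assumption.

```python
# Stand-in N-level feature maps; a real feature extraction module
# (e.g. a CNN backbone) would produce these from the input image.
C, N = 16, 3
feature_maps = [torch.randn(1, C, 64 // 2 ** i, 64 // 2 ** i)
                for i in range(N)]

# Up-sampling module: enlarge the Nth-level map into the first feature map.
first = F.interpolate(feature_maps[-1], scale_factor=2,
                      mode='bilinear', align_corners=False)

# Convolution kernel determining module (claims 2-3 sketch).
third = fuse_feature_maps(feature_maps, first.shape[-2:])
kernels = KernelGenerator(C, k=3)(third)

# Convolution and pooling processing module (claim 4 sketch).
second = wide_area_dynamic_conv(first, kernels, k=3, d=2)
print(second.shape)  # torch.Size([1, 16, 32, 32])
```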
8. The apparatus of claim 7, wherein the convolution kernel determining module comprises:
a feature map fusion submodule, configured to resize and fuse the N-level feature map to obtain a third feature map, wherein the size of the third feature map is the same as that of the first feature map;
a convolution and activation submodule, configured to perform convolution and activation processing on the third feature map to obtain a fourth feature map, wherein the height and width of the fourth feature map are the same as those of the first feature map, the number of channels of the fourth feature map is k × k times that of the first feature map, k × k is the preset convolution kernel size, and k is a positive integer;
and a convolution kernel determining submodule, configured to determine, according to the fourth feature map, the convolution kernel parameters corresponding to each pixel point.
9. The apparatus of claim 8, wherein the convolution kernel determining submodule is configured to:
determine, according to the fourth feature map, a feature vector corresponding to each pixel point in the first feature map, wherein the dimension of the feature vector is C × k × k, C is the number of channels of the first feature map, and C is a positive integer;
and convert the feature vector corresponding to each pixel point in the first feature map into the convolution kernel parameters corresponding to that pixel point.
10. The apparatus of claim 7, wherein the convolution and pooling processing module comprises:
an adjacent region determining submodule, configured to determine the adjacent region of each pixel point according to the size of the adjacent region of each pixel point in the first feature map;
a convolution submodule, configured to, for any pixel point in the first feature map, perform convolution processing on the first feature map according to the convolution kernel corresponding to the pixel point, and determine a first convolution response of each pixel point in the adjacent region of the pixel point;
a pooling submodule, configured to pool the first convolution responses of all pixel points in the adjacent region of the pixel point, and determine a second convolution response of the pixel point;
and a feature map determining submodule, configured to determine the second feature map of the image to be processed according to the second convolution response of each pixel point in the first feature map.
CN202010151431.3A 2020-03-06 2020-03-06 Image processing method and device based on wide-area dynamic convolution Active CN111340049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010151431.3A CN111340049B (en) 2020-03-06 2020-03-06 Image processing method and device based on wide-area dynamic convolution

Publications (2)

Publication Number Publication Date
CN111340049A (en) 2020-06-26
CN111340049B CN111340049B (en) 2023-06-09

Family

ID=71185938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010151431.3A Active CN111340049B (en) 2020-03-06 2020-03-06 Image processing method and device based on wide-area dynamic convolution

Country Status (1)

Country Link
CN (1) CN111340049B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183291A (en) * 2020-09-22 2021-01-05 上海蜜度信息技术有限公司 Method and system for detecting tiny object in image, storage medium and terminal
CN112200008A (en) * 2020-09-15 2021-01-08 青岛邃智信息科技有限公司 Face attribute recognition method in community monitoring scene
CN113936163A (en) * 2020-07-14 2022-01-14 武汉Tcl集团工业研究院有限公司 Image processing method, terminal and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7649927B1 (en) * 2006-12-29 2010-01-19 Kiomars Anvari Equalizer filter with dynamically configurable convolution filter
CN107578054A (en) * 2017-09-27 2018-01-12 北京小米移动软件有限公司 Image processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIALIN WU ET AL.: "Dynamic Filtering with Large Sampling Field for ConvNets", arXiv *
FENG Jiawen; ZHANG Limin; DENG Xiangyang: "Application of dual-channel convolutional neural network in static gesture recognition", Computer Engineering and Applications *

Also Published As

Publication number Publication date
CN111340049B (en) 2023-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant