CN113936071A - Image processing method and device - Google Patents

Image processing method and device

Info

Publication number
CN113936071A
Authority
CN
China
Prior art keywords
image
features
pixel
local
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111211465.8A
Other languages
Chinese (zh)
Inventor
周焕祥
戴宇荣
王斌
黄晓政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Tsinghua University
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111211465.8A
Publication of CN113936071A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The present disclosure relates to an image processing method and apparatus. The image processing method includes: extracting shallow features of an image; performing detail feature extraction on a local scale on the image based on the shallow features to obtain local dynamic features of the image, and performing region feature extraction on a global scale on the image based on the shallow features to obtain deformable features of the image; fusing the local dynamic features and the deformable features of the image to obtain fused features; and reconstructing the fused features to obtain a reconstructed image. The image processing method and apparatus improve both the effect and the efficiency of defocus deblurring in image processing.

Description

Image processing method and device
Technical Field
The present disclosure relates to the field of image processing technology. More particularly, the present disclosure relates to an image processing method and apparatus.
Background
In the related art, defocus blur is a common phenomenon in photography: sometimes it is a visual effect deliberately created by the photographer, and sometimes it introduces unwanted degradation of image quality. When shooting with a camera equipped with a large-aperture lens, objects outside the depth of field tend to exhibit varying degrees of blur distortion. The degree of defocus blur depends on a variety of factors. On the one hand, across different defocused images, differences in aperture, depth of field, optical aberrations and other camera settings directly lead to differences in imaging effect and defocus degree. On the other hand, within a single image, differences in scene depth across regions also produce different degrees of defocus; the farther an object lies outside the depth-of-field range, the more severely it is defocused and blurred. Algorithms that take a defocused image as input and generate an all-in-focus image therefore have broad application prospects, for example in image refocusing and image correction, and the all-in-focus image also provides a good input for other computer vision tasks, such as scene analysis based on semantic segmentation, depth estimation, and object detection.
Defocused image deblurring is a classic ill-posed problem whose goal is to restore the input defocused image so that it is as close as possible to the corresponding fully focused image. It has been studied by many researchers over the past 20 years. One class of conventional methods relies on defocus estimation and generally adopts a two-stage architecture: the first stage performs defocus estimation, and the second stage removes the blur under the guidance of the defocus map. The defocus map is usually represented by a per-pixel scalar, and the deblurring step is usually implemented by non-blind deconvolution. This type of approach has two significant drawbacks. First, since the real point spread function is not necessarily a disc-shaped (pillbox) function, a defocus map represented by a per-pixel scalar has limited expressive power and cannot provide sufficient information for accurately removing defocus blur in complex scenes. Second, the space and time overhead of non-blind deconvolution is large, making it unsuitable for deployment in practical applications. Such algorithms also generalize poorly: they have some effect on slight defocus blur, but little effect in moderately and severely blurred scenes.
Disclosure of Invention
An exemplary embodiment of the present disclosure provides an image processing method and apparatus to solve at least the image processing problems in the related art; however, the exemplary embodiments are not required to solve any of the problems described above.
According to an exemplary embodiment of the present disclosure, there is provided an image processing method including: extracting shallow features of the image; extracting detail features on a local scale of the image based on the shallow features to obtain local dynamic features of the image, and extracting regional features on a global scale of the image based on the shallow features to obtain deformable features of the image; fusing the local dynamic features and the deformable features of the image to obtain fused features; and reconstructing the fusion characteristics to obtain a reconstructed image.
Optionally, before performing detail feature extraction on a local scale on an image, the image processing method further includes: predicting the convolution attribute of each pixel of the image based on the shallow feature, wherein the convolution attribute of each pixel comprises a local convolution kernel parameter of each pixel, and a deformable position offset and a sampling weight of each pixel, and the step of extracting the detail feature on the local scale of the image comprises the following steps: performing detail feature extraction on the image on a local scale based on the convolution attribute of each pixel of the image, wherein the step of performing region feature extraction on the image on a global scale comprises the following steps: and performing regional feature extraction on the image on a global scale based on the convolution attribute of each pixel of the image.
Optionally, the step of predicting the convolution attribute of each pixel of the image based on the shallow features may include: performing multi-level decomposition on the image by a two-dimensional discrete wavelet transform to obtain a plurality of detail sub-bands of the image at different scales; performing feature extraction on the detail sub-bands at each scale to obtain sub-band features of different scales; encoding the shallow features at each scale to obtain encoder features of different scales; combining the sub-band features and the encoder features of the same scale to obtain combined features of different scales; and decoding the combined features of different scales to obtain the local convolution kernel parameter, the deformable position offset, and the sampling weight of each pixel of the image.
Optionally, the detail sub-bands may include structural information.
Alternatively, the step of combining the subband features of the same scale with the encoder features may comprise: combining the sub-band features and the encoder features of the same scale from shallow to deep in an iterative manner.
Optionally, the step of performing detail feature extraction on the image on a local scale based on the convolution property of each pixel of the image may include: determining a local dynamic convolution kernel based on the local convolution kernel parameters aiming at each pixel of the image to obtain the local dynamic convolution kernel of each pixel of the image; and performing detail feature extraction on the image on a local scale based on the local dynamic convolution kernel of each pixel of the image.
Optionally, the step of performing detail feature extraction on a local scale on the image based on the dynamic convolution kernel of each pixel of the image may include: for each pixel of the image, a dynamic convolution kernel of the pixel is convolved with the pixel.
Optionally, the step of performing region feature extraction on the image on a global scale based on the convolution attribute of each pixel of the image may include: and performing regional feature extraction on the image on a global scale based on the deformable position offset and the sampling weight of each pixel of the image.
Alternatively, the step of performing region feature extraction on the image on a global scale based on the deformable position offset and the sampling weight of each pixel of the image may include: for each pixel of the image, the position of the pixel is shifted based on the amount of the deformable position shift of the pixel, and the pixel is convolved with the sampling weight of the shifted pixel.
Optionally, before performing detail feature extraction on a local scale on an image, the image processing method may further include: obtaining convolution attributes of the image from the shallow features, wherein the convolution attributes include convolution kernel parameters, position offset and sampling weight, and the step of extracting detail features on the local scale of the image may include: extracting detail features on a local scale from the image based on the convolution attribute of the image, wherein extracting region features on a global scale from the image may include: and extracting regional features on the global scale of the image based on the convolution attribute of the image.
Optionally, the step of extracting detail features on a local scale from the image based on the convolution attribute of the image may include: determining a convolution kernel of the image based on the convolution kernel parameters of the image; and carrying out convolution operation on the convolution kernel of the image and each pixel of the image.
Optionally, the step of performing region feature extraction on the image on a global scale based on the convolution attribute of the image may include: shifting a position of each pixel of the image based on the amount of positional shift of the image; and performing convolution operation on the sampling weight of the image and each pixel after the offset.
According to an exemplary embodiment of the present disclosure, there is provided an image processing apparatus including: a feature extraction unit configured to extract a shallow feature of an image; the feature processing unit is configured to extract detail features on a local scale of the image based on the shallow features to obtain local dynamic features of the image, extract regional features on a global scale of the image based on the shallow features to obtain deformable features of the image, and fuse the local dynamic features and the deformable features of the image to obtain fused features; and an image reconstruction unit configured to reconstruct the fusion features, resulting in a reconstructed image.
Optionally, the feature processing unit may include: a pixel-by-pixel prediction unit configured to predict a convolution property of each pixel of the image based on the shallow feature, wherein the convolution property of each pixel includes a local convolution kernel parameter of each pixel, and a deformable position offset and a sampling weight of each pixel; and the self-adaptive feature fusion unit is configured to extract detail features on a local scale of the image based on the convolution attribute of each pixel of the image, extract regional features on a global scale of the image based on the convolution attribute of each pixel of the image, and fuse the local dynamic features and the deformable features of the image to obtain fusion features.
Alternatively, the pixel-by-pixel prediction unit may be configured to: perform multi-level decomposition on the image by a two-dimensional discrete wavelet transform to obtain a plurality of detail sub-bands of the image at different scales; perform feature extraction on the detail sub-bands at each scale to obtain sub-band features of different scales; encode the shallow features at each scale to obtain encoder features of different scales; combine the sub-band features and the encoder features of the same scale to obtain combined features of different scales; and decode the combined features of different scales to obtain the local convolution kernel parameter, the deformable position offset, and the sampling weight of each pixel of the image.
Optionally, the detail sub-bands may include structural information.
Alternatively, the pixel-by-pixel prediction unit may be configured to: combining the sub-band features and the encoder features of the same scale from shallow to deep in an iterative manner.
Optionally, the adaptive feature fusion unit may comprise a dynamic branching unit configured to: determining a local dynamic convolution kernel based on the local convolution kernel parameters aiming at each pixel of the image to obtain the local dynamic convolution kernel of each pixel of the image; and performing detail feature extraction on the image on a local scale based on the local dynamic convolution kernel of each pixel of the image.
Optionally, the dynamic branching unit may be configured to: for each pixel of the image, a dynamic convolution kernel of the pixel is convolved with the pixel.
Optionally, the adaptive feature fusion unit may comprise a deformable branching unit configured to: and performing regional feature extraction on the image on a global scale based on the deformable position offset and the sampling weight of each pixel of the image.
Optionally, the deformable branching unit may be configured to: for each pixel of the image, the pixel position is shifted based on the amount of the deformable position shift of the pixel, and the sampling weight of the pixel is convolved with the shifted pixel.
Optionally, the feature processing unit may be configured to: acquiring convolution attributes of the image from the shallow features, wherein the convolution attributes comprise convolution kernel parameters, position offset and sampling weight; and extracting detail features on a local scale of the image based on the convolution attribute of the image, and extracting region features on a global scale of the image based on the convolution attribute of the image.
Optionally, the feature processing unit may be configured to: determining a convolution kernel of the image based on the convolution kernel parameters of the image; and carrying out convolution operation on the convolution kernel of the image and each pixel of the image.
Optionally, the feature processing unit may be configured to: shifting a position of each pixel of the image based on the amount of positional shift of the image; and performing convolution operation on the sampling weight of the image and each pixel after the offset.
According to an exemplary embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement an image processing method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to execute an image processing method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement an image processing method according to an exemplary embodiment of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
preventing loss of detail in defocused regions;
improving the defocus deblurring effect and efficiency of image processing.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 illustrates an exemplary system architecture to which exemplary embodiments of the present disclosure may be applied.
Fig. 2 illustrates a flowchart of an image processing method according to an exemplary embodiment of the present disclosure.
Fig. 3 illustrates a flowchart of an image processing method according to another exemplary embodiment of the present disclosure.
Fig. 4 illustrates an example structural diagram of an image processing network for implementing the image processing method in fig. 3 according to an example embodiment of the present disclosure.
FIG. 5 shows a schematic diagram of a two-dimensional Haar transform according to an example embodiment of the present disclosure.
Fig. 6 illustrates a prediction flow of a pixel-by-pixel attribute prediction module according to an exemplary embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of local dynamic features through local dynamic branching according to an example embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of deriving a deformable feature by deformable branches according to an exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram of an image processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 10 illustrates a block diagram of an image processing apparatus according to another exemplary embodiment of the present disclosure.
Fig. 11 shows a block diagram of the adaptive feature fusion unit 103 according to an exemplary embodiment of the present disclosure.
Fig. 12 is a block diagram of an electronic device 1200 according to an example embodiment of the disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Herein, the expression "at least one of the items" covers three parallel cases: "any one of the items", "any combination of two or more of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
In the related art, the principle of defocus-blur imaging based on a thin convex lens is as follows. Light rays emitted by an object on the focal plane are refracted by the lens and converge to a single point on the imaging plane, forming a sharp image; light rays emitted by an object on a defocused plane diverge over a circular area of the imaging plane, forming a circle of confusion. The degree of defocus of a pixel can be measured by the diameter of its circle of confusion, and within a single image this degree varies only with scene depth. Ideally, the diameter of the circle of confusion is 0 if and only if the object lies on the focal plane, and the farther the defocused plane is from the focal plane, the larger the diameter of the corresponding circle of confusion and the more severe the blur. In most shooting environments, even a single scene contains objects at various depths, so the degree of defocus in a single defocused image varies spatially. In regions with slight defocus, detail texture information is well preserved, and the network needs to effectively extract discriminative deep features; in regions with severe defocus, objects diffuse and overlap each other, and the network needs a larger receptive field to better perceive the overall structure of the objects. In the frequency domain, defocus blur generally manifests as a loss of detail information, meaning that there is significant loss of high-frequency detail. Some convolutional neural networks can recover part of the high-frequency detail information without additional modification, but their performance is limited and not robust enough. In order to further enhance the extraction, processing, and reconstruction of high-frequency detail information, the present disclosure starts from a strategy that lets the network explicitly use the information of wavelet-domain detail subbands as additional input when designing the pixel-by-pixel attribute prediction module.
In the present disclosure, in removing the defocus blur caused by photographic subjects outside the depth of field,
first, to handle the spatial variation of the point spread function scale, an adaptive feature fusion strategy may be adopted that combines the dual features from a local dynamic branch and a deformable branch;
second, to address the loss of detail in defocused regions, a pixel-by-pixel attribute prediction module based on wavelet features may be adopted to assign different convolution attributes to each pixel.
Hereinafter, an image processing method and apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 1 to 12.
Fig. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or send messages (e.g., image processing requests) or the like. Various image applications, such as photographing software, video recording software, and the like, may be installed on the terminal apparatuses 101, 102, 103. The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may include, but are not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal device 101, 102, 103 is software, it may be installed in the electronic devices listed above, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or it may be implemented as a single software or software module. And is not particularly limited herein.
The terminal devices 101, 102, 103 may be equipped with an image capture device (e.g., a camera) to capture image (or video) data. In practice, the smallest visual unit that makes up a video is a Frame (Frame). Each frame is a static image. Temporally successive sequences of frames are composited together to form a motion video.
The server 105 may be a server providing various services, such as a background server providing support for image applications installed on the terminal devices 101, 102, 103. The background server can process the received image to achieve the effect of removing the defocus blur.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the image processing method provided by the embodiment of the present disclosure is generally executed by a terminal device, but may also be executed by a server, or may also be executed by cooperation of the terminal device and the server. Accordingly, the image processing apparatus may be provided in the terminal device, the server, or both the terminal device and the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation, and the disclosure is not limited thereto.
Fig. 2 illustrates a flowchart of an image processing method according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step S201, shallow features of an image are extracted. Here, the image is a single defocused image, and the shallow features include texture features.
In step S202, detail feature extraction on the local scale is performed on the image based on the shallow feature to obtain a local dynamic feature of the image, and region feature extraction on the global scale is performed on the image based on the shallow feature to obtain a deformable feature of the image.
In an exemplary embodiment of the present disclosure, before extracting detail features on a local scale from an image, a convolution property of the image may also be first obtained from a shallow feature. Then, when the detail feature extraction on the local scale is carried out on the image, the detail feature extraction on the local scale is carried out on the image based on the convolution attribute of the image, and when the region feature extraction on the global scale is carried out on the image, the region feature extraction on the global scale is carried out on the image based on the convolution attribute of the image. Here, the convolution properties may include convolution kernel parameters, position offsets, and sampling weights.
In an exemplary embodiment of the present disclosure, when extracting detail features on a local scale from an image based on convolution properties of the image, a convolution kernel of the image may be first determined based on convolution kernel parameters of the image, and then the convolution kernel of the image is convolved with each pixel of the image.
In an exemplary embodiment of the present disclosure, when extracting the region feature on the global scale from the image based on the convolution property of the image, the position of each pixel of the image may be first shifted based on the position shift amount of the image, and then the sampling weight of the image may be respectively convolved with each pixel after the shift.
In step S203, the local dynamic feature and the deformable feature of the image are fused to obtain a fused feature.
In exemplary embodiments of the present disclosure, the two types of features (local dynamic features and deformable features) may be fused by concatenating them along the feature (channel) dimension.
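As a concrete illustration, a minimal PyTorch sketch of this feature-dimension concatenation is given below; the tensor names and shapes are illustrative assumptions, not taken from the disclosure.

```python
import torch

# Hypothetical feature maps from the two branches, shaped (N, C, H, W).
local_dynamic_feat = torch.randn(1, 64, 128, 128)
deformable_feat = torch.randn(1, 64, 128, 128)

# Fuse by stitching (concatenating) along the feature/channel dimension.
fused_feat = torch.cat([local_dynamic_feat, deformable_feat], dim=1)  # (1, 128, 128, 128)
```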
In step S204, the fusion features are reconstructed to obtain a reconstructed image.
In exemplary embodiments of the present disclosure, the fused features may be reconstructed using any image reconstruction method.
Fig. 3 illustrates a flowchart of an image processing method according to another exemplary embodiment of the present disclosure. Fig. 4 illustrates an example structural diagram of an image processing network for implementing the image processing method of fig. 3 according to an example embodiment of the present disclosure. In fig. 4, the image processing network includes a feature extraction module, a pixel-by-pixel attribute prediction module, an adaptive feature fusion module, and a reconstruction module. As shown in fig. 4, the four modules are connected in series, with the output of each module serving as the input of the next, until the reconstruction module produces the deblurred result.
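A minimal PyTorch-style skeleton of this four-module serial pipeline is sketched below; the module names, their interfaces, and the residual connection to the input image are illustrative assumptions for exposition, not the exact network of the disclosure.

```python
import torch
import torch.nn as nn

class DefocusDeblurNet(nn.Module):
    """Illustrative skeleton: four modules connected in series."""
    def __init__(self, feat_extractor, attr_predictor, fusion_module, reconstructor):
        super().__init__()
        self.feat_extractor = feat_extractor   # shallow feature extraction
        self.attr_predictor = attr_predictor   # pixel-wise convolution attributes
        self.fusion_module = fusion_module     # local dynamic branch + deformable branch
        self.reconstructor = reconstructor     # encoder-decoder reconstruction

    def forward(self, defocused_img):
        shallow = self.feat_extractor(defocused_img)
        attrs = self.attr_predictor(defocused_img, shallow)  # kernels, offsets, weights
        fused = self.fusion_module(shallow, attrs)
        residual = self.reconstructor(fused)
        return defocused_img + residual        # residual added back to the input image
```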
Referring to fig. 3, in step S301, shallow features of an image are extracted. Here, the image is a single defocused image. Specifically, a single defocused image may be input into the image processing network, and the feature extraction module extracts shallow features from the input single defocused image.
In step S302, the convolution attribute of each pixel of the image is predicted based on the shallow feature. Here, the convolution attribute of each pixel includes a local convolution kernel parameter, a deformable position offset, and a sampling weight of each pixel.
In an exemplary embodiment of the present disclosure, when predicting the convolution attribute of each pixel of an image based on the shallow features, the image may first be decomposed at multiple levels by a two-dimensional discrete wavelet transform to obtain multiple detail subbands of the image at different scales; feature extraction is performed on the detail subbands at each scale to obtain subband features (also referred to as wavelet subband features) of different scales; the shallow features are encoded at each scale to obtain encoder features of different scales; the subband features and encoder features of the same scale are then combined to obtain combined features of different scales; and finally the combined features of different scales are decoded to obtain the local convolution kernel parameter, the deformable position offset, and the sampling weight of each pixel of the image.
In an exemplary embodiment of the present disclosure, the detail sub-band may include structural information.
In an exemplary embodiment of the present disclosure, when combining sub-band features of the same scale with encoder features, the sub-band features of the same scale and the encoder features may be combined in an iterative manner from a shallow layer to a deep layer.
In fig. 4, the pixel-by-pixel attribute prediction module is based on wavelet features. The pixel-by-pixel attribute prediction module can employ a U-Net like encoder-decoder architecture with the goal of predicting the convolution attributes (including the convolution kernel parameters of the local dynamic branches, as well as the position offset and sampling weights of the deformable branches) of each pixel of the image for the adaptive feature fusion module. The pixel-wise attribute prediction module exploits the detail subband characteristics of wavelet transforms (e.g., two-dimensional Haar transforms) as additional information, flexibly incorporating them into the encoder-decoder network structure.
Specifically, the pixel-by-pixel attribute prediction module performs multi-level decomposition on an input single defocused image by adopting two-dimensional Haar transformation to obtain detail sub-bands with different scales, and then combines the detail sub-bands with the depth characteristics of an encoder-decoder network, wherein the whole combination process is performed in an iterative manner from a shallow layer to a deep layer.
FIG. 5 shows a schematic diagram of a two-dimensional Haar transform according to an example embodiment of the present disclosure. As shown in fig. 5, the input single defocused image is decomposed into four sub-bands after a single-level two-dimensional Haar transform, where the brightness of the last three sub-bands is scaled for better visualization. Each subband is one-fourth the size of the input single defocused image (i.e., the original image) and contains different information: the approximation subband represents the original image at a lower-resolution scale and contains most of the luminance information; the three detail sub-bands represent the details of the original image in the horizontal, vertical, and diagonal directions, respectively, and contain sparse structural information. The decomposition process of the two-dimensional discrete wavelet transform is lossless, and the corresponding inverse transform can reconstruct the original image exactly with the four subbands as input.
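For reference, a single-level two-dimensional Haar decomposition of this kind can be written directly in NumPy as below; this is a generic sketch of the standard transform (assuming a single-channel image with even height and width), not code taken from the disclosure.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar transform of a single-channel image with even H and W.
    Returns the approximation subband and three detail subbands, each one
    quarter the size of the input (orientation naming depends on convention)."""
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # approximation: most of the luminance information
    lh = (a - b + c - d) / 2.0  # detail subband, one orientation
    hl = (a + b - c - d) / 2.0  # detail subband, second orientation
    hh = (a - b - c + d) / 2.0  # detail subband, diagonal orientation
    return ll, lh, hl, hh
```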
In the present disclosure, in order to avoid the phenomenon that the input defocused image loses high-frequency information in the frequency domain, it is desirable to enhance the detail sub-band information in the encoder-decoder network structure, and guide the network to better detect and reconstruct the high-frequency texture.
Fig. 6 illustrates a prediction flow of a pixel-by-pixel attribute prediction module according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the pixel-by-pixel attribute prediction module first decomposes the input defocused image into multi-scale subbands by a multi-level two-dimensional Haar wavelet transform. The detail sub-bands at each scale are then converted into wavelet features by a shallow wavelet feature extraction network, as shown by the dashed box in fig. 6. The wavelet features are then combined with the encoder features of the corresponding scale by concatenation. The entire combination process proceeds iteratively from shallow to deep.
In a specific implementation, the pixel-by-pixel attribute prediction module can be obtained by modifying a standard U-Net structure as follows: pooling operations are replaced with convolution layers of stride 2 to retain more feature information, and the ordinary convolution layers in the encoder are replaced with residual modules.
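The sketch below shows, under assumed channel sizes, what one encoder stage of such a modified U-Net might look like: pooling replaced by a stride-2 convolution, a residual block in the encoder, and same-scale wavelet subband features concatenated with the encoder features. It is an illustrative assumption, not the exact module of the disclosure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class EncoderStage(nn.Module):
    """One stage: concatenate same-scale wavelet features, downsample with a
    stride-2 convolution (instead of pooling), then apply a residual block."""
    def __init__(self, in_ch, wavelet_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch + wavelet_ch, out_ch, 3, stride=2, padding=1)
        self.res = ResidualBlock(out_ch)
    def forward(self, enc_feat, wavelet_feat):
        x = torch.cat([enc_feat, wavelet_feat], dim=1)  # combine by concatenation
        return self.res(self.down(x))
```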
In step S303, detail feature extraction on the local scale is performed on the image based on the convolution attribute of each pixel of the image to obtain a local dynamic feature of the image, and region feature extraction on the global scale is performed on the image based on the convolution attribute of each pixel of the image to obtain a deformable feature of the image.
In an exemplary embodiment of the present disclosure, when performing detail feature extraction on an image on a local scale and region feature extraction on a global scale based on a convolution attribute of each pixel of the image, a local dynamic convolution kernel may be first determined for each pixel of the image based on a local convolution kernel parameter to obtain a local dynamic convolution kernel of each pixel of the image, and then the image may be performed with detail feature extraction on the local scale based on the local dynamic convolution kernel of each pixel of the image.
In an exemplary embodiment of the present disclosure, when extracting detail features on a local scale of an image based on a dynamic convolution kernel of each pixel of the image, a convolution operation may be performed on the dynamic convolution kernel of the pixel and the pixel for each pixel of the image.
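One common way to implement such per-pixel dynamic convolution is via torch.nn.functional.unfold, as sketched below; this is a generic formulation under assumed tensor shapes (channel-shared k x k kernels predicted per position), not necessarily the exact operator used in the disclosure.

```python
import torch
import torch.nn.functional as F

def local_dynamic_conv(feat, kernels, k=3):
    """feat:    (N, C, H, W) input features.
    kernels: (N, k*k, H, W) per-pixel kernels from the attribute prediction module.
    Each spatial position is filtered with its own k x k kernel (shared across channels)."""
    n, c, h, w = feat.shape
    patches = F.unfold(feat, kernel_size=k, padding=k // 2)   # (N, C*k*k, H*W)
    patches = patches.view(n, c, k * k, h * w)
    weights = kernels.view(n, 1, k * k, h * w)                # broadcast over channels
    out = (patches * weights).sum(dim=2)                      # (N, C, H*W)
    return out.view(n, c, h, w)
```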
In an exemplary embodiment of the present disclosure, in performing the region feature extraction on the global scale on the image based on the convolution attribute of each pixel of the image, the region feature extraction on the global scale may be performed on the image based on the deformable position offset and the sampling weight of each pixel of the image.
In an exemplary embodiment of the present disclosure, when performing region feature extraction on a global scale on an image based on a deformable position shift amount and a sampling weight of each pixel of the image, a position of the pixel may be shifted based on the deformable position shift amount of the pixel for each pixel of the image, and the shifted pixel may be convolved with the sampling weight of the pixel.
As shown in fig. 4, the adaptive feature fusion module includes a local dynamic branch and a deformable branch. The two branches extract features from the defocused image in parallel, and their features are fused and passed to the next module. The two branches use local dynamic convolution and deformable convolution, respectively; compared with a standard convolution layer, this adds degrees of freedom along three dimensions (convolution kernel parameters, position offsets, and sampling weights) and enhances the ability to model defocus blur of different scales.
Fig. 7 shows a schematic diagram of obtaining local dynamic features through the local dynamic branch according to an example embodiment of the present disclosure. As shown in fig. 7, the local dynamic branch uses local dynamic convolution instead of a standard convolutional layer; its principle is that each pixel (or pixel position) extracts features with adaptively generated convolution kernel parameters. In single-image defocus deblurring, the local dynamic branch removes the parameter-sharing constraint of standard convolution, which has two advantages. On one hand, the local dynamic branch can accommodate the overall differences in defocus level from sample to sample. On the other hand, it can adapt to local differences in blur degree inside a single image.
Fig. 8 shows a schematic diagram of deriving deformable features through the deformable branch according to an exemplary embodiment of the present disclosure. As shown in fig. 8, when convolving a pixel position, the deformable branch generates the offsets and weights of the sampling points around the current position in a position-dependent manner instead of using a fixed neighborhood grid as the sampling points, which brings three advantages. First, the receptive field of the deformable branch is larger. Second, the sampling locations of the deformable branch are more flexible. Finally, the deformable branch equipped with sampling weights can also act as an important feature-screening mechanism.
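Modulated deformable convolution of this kind is available in recent versions of torchvision as deform_conv2d, which accepts per-position sampling offsets and, via the mask argument, per-position sampling weights. The sketch below shows one way such a deformable branch could be wired; the channel sizes and weight initialization are illustrative assumptions, not the exact branch of the disclosure.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableBranch(nn.Module):
    """Applies a k x k deformable convolution whose sampling offsets and weights
    are predicted per pixel (e.g., by the pixel-wise attribute prediction module)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

    def forward(self, feat, offsets, sample_weights):
        # offsets:        (N, 2*k*k, H, W) x/y shifts of each sampling point
        # sample_weights: (N, k*k, H, W)   modulation weight of each sampling point
        return deform_conv2d(feat, offsets, self.weight,
                             padding=self.k // 2, mask=sample_weights)
```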
The local dynamic branch is functionally complementary to the deformable branch: the former provides discriminative detail features on a local scale, while the latter provides region-level features on a global scale that adapt to the degree of defocus. By combining the two, the image processing network can extract relatively complete texture information and defocus information from the blurred image for optimized reconstruction.
In step S304, the local dynamic feature and the deformable feature of the image are fused to obtain a fused feature.
In exemplary embodiments of the present disclosure, the two types of features (local dynamic features and deformable features) may be fused in a way that stitches in the feature dimension.
In step S305, the fusion features are reconstructed to obtain a reconstructed image.
In exemplary embodiments of the present disclosure, the fused features may be reconstructed using any image reconstruction method. The reconstruction module in fig. 4 uses the same encoder-decoder network structure as the pixel-by-pixel attribute prediction module to better restore the multi-scale object information of the defocused image, and adds this information to the input defocused image to obtain the final deblurring result.
Furthermore, in an exemplary embodiment of the present disclosure, the image processing network shown in fig. 4 may be trained with a combination of two loss functions: the Charbonnier loss and the SSIM loss. The Charbonnier loss is a slight variation of the L1 loss that measures the pixel-by-pixel difference between the deblurring result and the all-in-focus ground truth in image space. The SSIM loss, based on the structural similarity index, integrates metrics along multiple dimensions (brightness, contrast, and structure) and matches human subjective perception better than pixel-wise losses such as the L1, L2, and Charbonnier losses. The overall loss function is, or includes, a weighted sum of the two.
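A minimal sketch of such a combined objective is given below; the Charbonnier term follows its standard definition, the SSIM term is represented by a placeholder callable (ssim) that would come from any standard implementation, and the weight lambda_ssim is an assumed hyperparameter rather than a value from the disclosure.

```python
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    # Smooth L1-like penalty: sqrt(diff^2 + eps^2), averaged over all pixels.
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def total_loss(pred, target, ssim, lambda_ssim=0.5):
    # Weighted sum of the pixel-wise Charbonnier term and an SSIM-based term;
    # `ssim` is any function returning structural similarity in [0, 1].
    return charbonnier_loss(pred, target) + lambda_ssim * (1.0 - ssim(pred, target))
```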
The image processing method according to the exemplary embodiment of the present disclosure has been described above in conjunction with fig. 1 to 3. Hereinafter, an image processing apparatus and units thereof according to an exemplary embodiment of the present disclosure will be described with reference to fig. 9 to 11.
Fig. 9 is a block diagram of an image processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 9, the image processing apparatus 900 includes a feature extraction unit 91, a feature processing unit 92, and an image reconstruction unit 93.
The feature extraction unit 91 is configured to extract shallow features of an image.
The feature processing unit 92 is configured to perform detail feature extraction on the image on a local scale based on the shallow feature to obtain a local dynamic feature of the image, perform region feature extraction on the image on a global scale based on the shallow feature to obtain a deformable feature of the image, and fuse the local dynamic feature and the deformable feature of the image to obtain a fused feature.
In an exemplary embodiment of the present disclosure, the feature processing unit 92 may be configured to: acquiring convolution attributes of the image from the shallow features, wherein the convolution attributes comprise convolution kernel parameters, position offset and sampling weight; and extracting detail features on a local scale of the image based on the convolution attribute of the image, and extracting region features on a global scale of the image based on the convolution attribute of the image.
In an exemplary embodiment of the present disclosure, the feature processing unit 92 may be configured to: determining a convolution kernel of the image based on the convolution kernel parameters of the image; and carrying out convolution operation on the convolution kernel of the image and each pixel of the image.
In an exemplary embodiment of the present disclosure, the feature processing unit 92 may be configured to: shifting a position of each pixel of the image based on the amount of positional shift of the image; and performing convolution operation on the sampling weight of the image and each pixel after the offset.
The image reconstruction unit 93 is configured to reconstruct the fused features, resulting in a reconstructed image.
Fig. 10 illustrates a block diagram of an image processing apparatus according to another exemplary embodiment of the present disclosure.
Referring to fig. 10, the image processing apparatus 1000 includes a feature extraction unit 101, a pixel-by-pixel prediction unit 102, an adaptive feature fusion unit 103, and an image reconstruction unit 104.
The feature extraction unit 101 is configured to extract shallow features of an image.
The pixel-by-pixel prediction unit 102 is configured to predict a convolution property of each pixel of the image based on the shallow feature, wherein the convolution property of each pixel includes a local convolution kernel parameter, a deformable position offset, and a sampling weight of each pixel.
In an exemplary embodiment of the present disclosure, the pixel-by-pixel prediction unit is configured to: perform multi-level decomposition on the image by a two-dimensional discrete wavelet transform to obtain a plurality of detail sub-bands of the image at different scales; perform feature extraction on the detail sub-bands at each scale to obtain sub-band features of different scales; encode the shallow features at each scale to obtain encoder features of different scales; combine the sub-band features and the encoder features of the same scale to obtain combined features of different scales; and decode the combined features of different scales to obtain the local convolution kernel parameter, the deformable position offset, and the sampling weight of each pixel of the image.
In an exemplary embodiment of the present disclosure, the detail sub-band includes structural information.
In an exemplary embodiment of the present disclosure, the pixel-by-pixel prediction unit is configured to: combining the sub-band features and the encoder features of the same scale from shallow to deep in an iterative manner.
The adaptive feature fusion unit 103 is configured to perform detail feature extraction on the image on a local scale based on the convolution attribute of each pixel of the image to obtain a local dynamic feature of the image, perform region feature extraction on the image on a global scale based on the convolution attribute of each pixel of the image to obtain a deformable feature of the image, and fuse the local dynamic feature and the deformable feature of the image to obtain a fusion feature.
In an exemplary embodiment of the present disclosure, the adaptive feature fusion unit 103 may include a dynamic branching unit 1031 configured to: determining a local dynamic convolution kernel based on the local convolution kernel parameters aiming at each pixel of the image to obtain the local dynamic convolution kernel of each pixel of the image; and performing detail feature extraction on the image on a local scale based on the local dynamic convolution kernel of each pixel of the image.
In an exemplary embodiment of the present disclosure, the dynamic branch unit 1031 may be configured to: for each pixel of the image, a dynamic convolution kernel of the pixel is convolved with the pixel.
In an exemplary embodiment of the present disclosure, the adaptive feature fusion unit 103 may include a deformable branching unit 1032 configured to: and performing regional feature extraction on the image on a global scale based on the deformable position offset and the sampling weight of each pixel of the image.
In an exemplary embodiment of the present disclosure, the deformable branch unit 1032 may be configured to: for each pixel of the image, the position of the pixel is shifted based on the amount of the deformable position shift of the pixel, and the sampling weight of the pixel is convolved with the shifted pixel.
The image reconstruction unit 104 is configured to reconstruct the fused features, resulting in a reconstructed image.
Fig. 11 shows a block diagram of the adaptive feature fusion unit 103 according to an exemplary embodiment of the present disclosure. As shown in fig. 11, the adaptive feature fusion unit 103 includes a dynamic branching unit 1031 and a deformable branching unit 1032.
In exemplary embodiments of the present disclosure, the feature extraction unit 101, the pixel-by-pixel prediction unit 102, the adaptive feature fusion unit 103, and the image reconstruction unit 104 may be similar to, or may perform functions similar to, the feature extraction module, the pixel-by-pixel attribute prediction module, the adaptive feature fusion module, and the reconstruction module in fig. 4, respectively. Likewise, the dynamic branching unit 1031 and the deformable branching unit 1032 may be similar to, or may perform functions similar to, the local dynamic branch and the deformable branch in fig. 4, respectively.
In an exemplary embodiment of the present disclosure, the image processing apparatus may include the image processing network in fig. 4.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The image processing apparatus according to the exemplary embodiment of the present disclosure has been described above with reference to fig. 9 to 11. Next, an electronic apparatus according to an exemplary embodiment of the present disclosure is described with reference to fig. 12.
Fig. 12 is a block diagram of an electronic device 1200 according to an example embodiment of the disclosure.
Referring to fig. 12, an electronic device 1200 includes at least one memory 1201 and at least one processor 1202, the at least one memory 1201 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 1202, perform a method of image processing according to an exemplary embodiment of the present disclosure.
In exemplary embodiments of the present disclosure, the electronic device 1200 may be a PC computer, a tablet device, a personal digital assistant, a smartphone, or other device capable of executing the above-described set of instructions. Here, the electronic device 1200 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or sets of instructions) individually or in combination. The electronic device 1200 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 1200, the processor 1202 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 1202 may execute instructions or code stored in the memory 1201, where the memory 1201 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 1201 may be integrated with the processor 1202, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 1201 may include a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 1201 and the processor 1202 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 1202 is able to read files stored in the memory.
In addition, the electronic device 1200 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 1200 may be connected to each other via a bus and/or a network.
There is also provided, in accordance with an example embodiment of the present disclosure, a computer-readable storage medium, such as the memory 1201, including instructions executable by the processor 1202 of the electronic device 1200 to perform the above-described method. Alternatively, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, which comprises computer programs/instructions, which when executed by a processor, implement the method of image processing according to an exemplary embodiment of the present disclosure.
The image processing method and apparatus according to the exemplary embodiment of the present disclosure have been described above with reference to fig. 1 to 12. However, it should be understood that: the image processing apparatus and units thereof shown in fig. 9 to 11 may be respectively configured as software, hardware, firmware, or any combination thereof to perform a specific function, the electronic device shown in fig. 12 is not limited to include the above-shown components, but some components may be added or deleted as needed, and the above components may also be combined.
According to the image processing method and apparatus of the present disclosure, shallow features of an image are extracted; detail feature extraction on a local scale is performed on the image based on the shallow features to obtain local dynamic features, and region feature extraction on a global scale is performed to obtain deformable features; the local dynamic features and deformable features of the image are fused to obtain fused features; and the fused features are reconstructed to obtain a reconstructed image. In this way, both the effect and the efficiency of defocus deblurring in image processing are improved.
In addition, according to the image processing method and apparatus of the present disclosure, to handle the spatial variation of the point spread function scale, an adaptive feature fusion strategy may be adopted that combines the dual features from the local dynamic branch and the deformable branch, so that local-scale feature information is captured by the local dynamic branch while global-scale feature information is taken into account by the deformable branch.
In addition, according to the image processing method and apparatus of the present disclosure, to address the loss of detail in defocused regions, a pixel-by-pixel attribute prediction module based on wavelet features may be adopted to assign different convolution attributes to each pixel, thereby enhancing the recovery of textures at different scales.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An image processing method, comprising:
extracting shallow features of the image;
extracting detail features on a local scale of the image based on the shallow features to obtain local dynamic features of the image, and extracting regional features on a global scale of the image based on the shallow features to obtain deformable features of the image;
fusing the local dynamic features and the deformable features of the image to obtain fused features;
and reconstructing the fused features to obtain a reconstructed image.
2. The image processing method according to claim 1, wherein before extracting the detail features on the local scale of the image, the method further comprises:
predicting a convolution property of each pixel of the image based on the shallow features, wherein the convolution property of each pixel comprises a local convolution kernel parameter of each pixel, and a deformable position offset and a sampling weight of each pixel,
wherein the step of extracting the detail features on the local scale of the image comprises: extracting the detail features on the local scale of the image based on the convolution property of each pixel of the image,
and wherein the step of extracting the regional features on the global scale of the image comprises: extracting the regional features on the global scale of the image based on the convolution property of each pixel of the image.
3. The image processing method according to claim 2, wherein the step of predicting the convolution property of each pixel of the image based on the shallow feature comprises:
performing multi-level decomposition on the image by using a two-dimensional discrete wavelet transform to obtain a plurality of detail sub-bands of the image at different scales;
extracting features of the detail sub-bands at each scale to obtain sub-band features at different scales;
encoding the shallow features at each scale to obtain encoded features at different scales;
combining the sub-band features and the encoded features of the same scale, respectively, to obtain combined features at different scales;
and decoding the combined features at different scales to obtain the local convolution kernel parameter, the deformable position offset and the sampling weight of each pixel of the image.
4. The image processing method according to claim 3, wherein the detail sub-bands include structural information.
5. The image processing method according to claim 3, wherein the step of combining the sub-band features of the same scale with the encoded features comprises:
combining the sub-band features and the encoded features of the same scale iteratively, from shallow scales to deep scales.
6. The image processing method according to claim 2, wherein the step of extracting detail features on a local scale from the image based on the convolution property of each pixel of the image comprises:
determining, for each pixel of the image, a local dynamic convolution kernel based on the local convolution kernel parameters of the pixel, to obtain the local dynamic convolution kernel of each pixel of the image;
and performing detail feature extraction on the image on the local scale based on the local dynamic convolution kernel of each pixel of the image.
7. The image processing method according to claim 6, wherein the step of performing detail feature extraction on the local scale of the image based on the local dynamic convolution kernel of each pixel of the image comprises:
for each pixel of the image, convolving the local dynamic convolution kernel of the pixel with the pixel.
8. An image processing apparatus characterized by comprising:
a feature extraction unit configured to extract a shallow feature of an image;
a feature processing unit configured to extract detail features on a local scale of the image based on the shallow features to obtain local dynamic features of the image, extract regional features on a global scale of the image based on the shallow features to obtain deformable features of the image, and fuse the local dynamic features and the deformable features of the image to obtain fused features; and
an image reconstruction unit configured to reconstruct the fused features to obtain a reconstructed image.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image processing method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, which, when executed by a processor of an electronic device, causes the electronic device to perform the image processing method according to any one of claims 1 to 7.
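For illustration only, the following non-authoritative sketch approximates the per-pixel local dynamic convolution referred to in claims 6 and 7: each pixel's neighbourhood is gathered and filtered with that pixel's own kernel. Tensor shapes, the softmax normalisation of the kernels, and the function name are assumptions, not part of the claims.

```python
# Assumed sketch of per-pixel local dynamic convolution (cf. claims 6-7).
import torch
import torch.nn.functional as F

def local_dynamic_conv(features: torch.Tensor, kernels: torch.Tensor, k: int = 3):
    """
    features: (B, C, H, W) feature map.
    kernels:  (B, k*k, H, W) per-pixel kernel parameters predicted elsewhere.
    Returns   (B, C, H, W) features filtered with a different kernel at each pixel.
    """
    b, c, h, w = features.shape
    # Gather each pixel's k x k neighbourhood: (B, C, k*k, H*W).
    patches = F.unfold(features, kernel_size=k, padding=k // 2).view(b, c, k * k, h * w)
    # Softmax normalisation of the per-pixel kernels is an assumption, used here
    # only to keep responses on a comparable scale.
    kern = torch.softmax(kernels.view(b, k * k, h * w), dim=1).unsqueeze(1)
    # Weighted sum over the neighbourhood dimension.
    out = (patches * kern).sum(dim=2)
    return out.view(b, c, h, w)

# Usage: feats = torch.randn(1, 64, 32, 32); ks = torch.randn(1, 9, 32, 32)
# out = local_dynamic_conv(feats, ks)  # same spatial size as the input
```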
CN202111211465.8A 2021-10-18 2021-10-18 Image processing method and device Pending CN113936071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111211465.8A CN113936071A (en) 2021-10-18 2021-10-18 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111211465.8A CN113936071A (en) 2021-10-18 2021-10-18 Image processing method and device

Publications (1)

Publication Number Publication Date
CN113936071A true CN113936071A (en) 2022-01-14

Family

ID=79279983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111211465.8A Pending CN113936071A (en) 2021-10-18 2021-10-18 Image processing method and device

Country Status (1)

Country Link
CN (1) CN113936071A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116228897A (en) * 2023-03-10 2023-06-06 北京百度网讯科技有限公司 Image processing method, image processing model and training method
CN116228897B (en) * 2023-03-10 2024-04-23 北京百度网讯科技有限公司 Image processing method, image processing model and training method
CN116342582A (en) * 2023-05-11 2023-06-27 湖南工商大学 Medical image classification method and medical equipment based on deformable attention mechanism
CN116342582B (en) * 2023-05-11 2023-08-04 湖南工商大学 Medical image classification method and medical equipment based on deformable attention mechanism

Similar Documents

Publication Publication Date Title
US8948513B2 (en) Blurring based content recognizer
US9430817B2 (en) Blind image deblurring with cascade architecture
Pearl et al. Nan: Noise-aware nerfs for burst-denoising
WO2021048607A1 (en) Motion deblurring using neural network architectures
CN106603941B (en) HDR image conversion method and system with self-adaptive computation complexity
WO2022134971A1 (en) Noise reduction model training method and related apparatus
US11641446B2 (en) Method for video frame interpolation, and electronic device
Gryaditskaya et al. Motion aware exposure bracketing for HDR video
WO2014155290A1 (en) Enhancing motion pictures with accurate motion information
CN113066034A (en) Face image restoration method and device, restoration model, medium and equipment
CN112991254A (en) Disparity estimation system, method, electronic device, and computer-readable storage medium
KR102342526B1 (en) Method and Apparatus for Video Colorization
CN111460876B (en) Method and apparatus for identifying video
CN113936071A (en) Image processing method and device
CN108241855B (en) Image generation method and device
Du et al. A comprehensive survey: Image deraining and stereo‐matching task‐driven performance analysis
CN116486009A (en) Monocular three-dimensional human body reconstruction method and device and electronic equipment
CN114066722B (en) Method and device for acquiring image and electronic equipment
CN115456891A (en) Under-screen camera image restoration method based on U-shaped dynamic network
CN114419517A (en) Video frame processing method and device, computer equipment and storage medium
Akyazi et al. A new objective metric to predict image quality using deep neural networks
Sevcenco et al. Light field editing in the gradient domain
Que et al. Residual dense U‐Net for abnormal exposure restoration from single images
CN115908962B (en) Training method of neural network, pulse signal reconstruction image generation method and device
CN112991188B (en) Image processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination