CN112862715A - Real-time and controllable scale space filtering method - Google Patents
- Publication number
- CN112862715A (application number CN202110172012.2A)
- Authority
- CN
- China
- Prior art keywords
- image
- network
- filtering
- edge image
- edge
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T5/70
- G06F18/24 — Classification techniques
- G06F18/253 — Fusion techniques of extracted features
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06N3/088 — Non-supervised learning, e.g. competitive learning
- G06T7/13 — Edge detection
- G06T7/136 — Segmentation; edge detection involving thresholding
- G06V10/44 — Local feature extraction by analysis of parts of the pattern
- G06V10/464 — Salient features using a plurality of salient features, e.g. bag-of-words [BoW] representations
- G06T2207/10004 — Still image; photographic image
Abstract
The invention discloses a real-time and controllable scale space filtering method that cyclically outputs multiple filtering results of different scales by means of a recurrent neural network model. The recurrent neural network feeds its own output back in as its next input to obtain a new output, and repeats this process until a termination condition is met. The recurrent neural network model comprises a guide network G and a stripping network P.
Description
Technical Field
The invention relates to the field of image processing based on deep learning, in particular to a real-time and controllable scale space filtering method.
Background
Existing traditional methods such as L0 [1], RGF [2], RTV [3], and muGIF [4] require many iterative optimization operations to obtain reasonable filtering results, making them very slow and time-consuming to execute. Deep learning methods [13][14][15][16] overcome this slowness, because a deep neural network needs only a single forward pass at test time; however, current deep learning methods can only produce a filtering result at one scale, cannot obtain multiple images of different scales at once, and must be retrained with different hyper-parameters for each smoothing scale. PIO [5] can adjust the network's hyper-parameters to obtain filtering results of different scales without retraining the neural network. However, for all current traditional and deep learning methods, obtaining filtering results of different scales requires adjusting model hyper-parameters (such as the weights of different loss functions or the number of network layers), which is not intuitive: a hyper-parameter is merely a number, a scalar, and cannot reflect the spatial and semantic information of the image. In other words, hyper-parameters cannot intuitively control which parts of the image should be smoothed and which should be preserved.
The importance of multi-scale representations of images has long been well validated, giving rise to the idea of scale-space filtering. Much as the human eye perceives different features of an object at different viewing distances, the same object yields different features when imaged at different sizes, that is, at different scales. Maps provide an intuitive example of the scale space: at one scale a map conveys the layout of a province or even a country, while at another it reveals the detail of a single town. According to the human vision mechanism and scale-space theory, people obtain different information at different scales; but when a computer vision system analyzes an unknown scene, it has no way of knowing the scale of objects in the image in advance, so descriptions of the image at multiple scales must be considered simultaneously to find the optimal scale of the object of interest. Images are therefore often organized into a series of image sets at different scales in which features of interest are detected. In short, a larger scale provides more structural information about a scene, while a smaller scale reflects more texture information.
In the existing literature, various image filtering methods attempt to decompose an image into structural and texture portions at a single scale. However, for different images and tasks it is difficult to determine which scales are correct or best. Rather than eliminating this scale ambiguity and seeking a single optimal separation strategy, the image can be separated in an organized, natural, and efficient manner into results of varying scale for the user to choose from. Building a multi-scale representation of an image is thus a very useful capability, as users wish to adjust and select the most satisfactory results in various multimedia, computer vision, and graphics applications. Applications of scale-space filtering in these fields include image restoration, image stylization, stereo matching, optical flow, and semantic flow, among others.
Over the past few decades, the computer vision and multimedia fields have paid increasing attention to organizing images hierarchically or at multiple scales, drawing on research into how the human eye perceives the external world. Taking image segmentation as an example, an image may be spatially partitioned into a set of object instances or superpixels, which can serve as a basis for further processing. In contrast to image segmentation, another hierarchical image organization scheme is explored from the perspective of information extraction. This task is called image separation to distinguish it from image segmentation.
The work "Image smoothing via unsupervised learning" explores how a deep neural network can learn directly from data (without supervision labels) to produce an ideal filtering effect. The advantage of using deep learning is that it not only produces good filtering results but is also faster. However, deep learning usually requires ground-truth filtered images (GT) as supervision during training, and such supervision data are often hard to obtain: manually labeling the training images is time-consuming and laborious. To solve this problem, that work, similar to optimization-based methods, designs the training objective as a loss function, so that the deep neural network can be trained in an unsupervised, label-free setting. Its major contributions can be summarized as:
(1) An unsupervised image smoothing framework is proposed. Unlike previous approaches, it does not require ground-truth labels for supervised training and can achieve the desired results by learning from any sufficiently diverse set of image data.
(2) Filtering results comparable to or even better than those of prior methods are obtained, based on a newly designed image smoothing loss function built on a spatially adaptive Lp flattening criterion and an edge-preserving regularizer.
(3) The proposed method is based on a convolutional neural network, which is far less computationally intensive than most previous methods. For example, processing a 1280 × 720 image on a modern GPU requires only 5 ms.
The specific method of this work takes the original image as the input of a deep convolutional neural network, which outputs the smoothed result of the original image. The work designs a fully convolutional network with dilated (hole) convolutions and skip connections. The network contains 26 convolutional layers in total, all using 3×3 convolution kernels and outputting 64 feature maps (except the last layer, which outputs a 3-channel image); each layer is followed by a batch normalization layer and a ReLU activation function. The third convolutional layer down-samples the feature map to half size using a stride-2 convolution, and the third-to-last convolutional layer restores the feature map to the input size using a deconvolution.
Image smoothing generally requires contextual information about the image, and this work gradually enlarges the receptive field by introducing hole (dilated) convolutions with an exponentially increasing dilation rate. More specifically, every two adjacent residual blocks share the same dilation rate, and the rate doubles every two residual blocks.
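The dilation schedule just described can be sketched in a few lines. The block count and the one-dilated-conv-per-block simplification are illustrative assumptions, not the patent's exact architecture:

```python
def dilation_schedule(num_blocks: int) -> list[int]:
    # Each pair of adjacent residual blocks shares a dilation rate;
    # the rate doubles every two blocks: 1, 1, 2, 2, 4, 4, ...
    return [2 ** (b // 2) for b in range(num_blocks)]

def receptive_field(kernel: int, dilations: list[int]) -> int:
    # A k x k conv with dilation d grows the receptive field by
    # (k - 1) * d; stacking convolutions accumulates the growth.
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

rates = dilation_schedule(8)    # [1, 1, 2, 2, 4, 4, 8, 8]
rf = receptive_field(3, rates)  # 1 + 2*(1+1+2+2+4+4+8+8) = 61
```

Eight such blocks of 3×3 convolutions already cover a 61-pixel-wide context window, which illustrates why exponentially increasing dilation is an efficient way to gather image context.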
The network also adopts residual learning: the last layer of the network outputs a residual image, and the sum of the residual image and the input image is the final filtering result.
The purpose of image smoothing is to reduce unimportant image details while maintaining the original image structure. The overall loss function of this work for image smoothing is as follows:
ε = ε_d + λ_f · ε_f + λ_e · ε_e

where ε_d is the data retention term, ε_f is the flattening regularization term, ε_e is the edge-preservation term, and λ_f and λ_e are weights that balance the different loss terms.
The data retention term minimizes the difference between the input image and the output filtered image to ensure structural similarity. Denoting the input image by I and the output image by T (note that T here has a different meaning from T in the present embodiment, and is valid only in this section on the conventional method), a simple data retention term in RGB color space can be defined as:

ε_d = (1/N) Σ_i ‖I_i − T_i‖²
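A minimal numpy sketch of such a data retention term; the mean-squared form is an assumption consistent with the description:

```python
import numpy as np

def data_term(I: np.ndarray, T: np.ndarray) -> float:
    # eps_d = (1/N) * sum_i ||I_i - T_i||^2, summed over colour
    # channels and averaged over the N pixels, keeping the filtered
    # output T close to the input I.
    n_pixels = I.shape[0] * I.shape[1]
    return float(np.sum((I - T) ** 2) / n_pixels)
```

For a zero image I and a constant 0.5 image T of shape (2, 2, 3), the term evaluates to 0.75, and it is exactly zero when T equals I.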
where i represents the index of the pixel and N is the total number of pixels. During filtering, some important edges may be lost or weakened, because the objectives of pixel-value smoothing and edge preservation conflict to some degree. To address this problem, the work proposes an explicit edge-preservation constraint that retains the important edge pixels. Before introducing this constraint, the concept of a guide image is needed: it refers to the edge response of an image in appearance. One simple form of the edge response is the sum of local gradient magnitudes:

E_i(I) = Σ_c Σ_{j∈N(i)} |I_{i,c} − I_{j,c}|
where N(i) represents the neighborhood of the i-th pixel and c indexes the color channels. For the output filtering result T, the edge response is likewise denoted E(T). The edge-preservation term is defined by minimizing the difference between the edge responses E(I) and E(T). Let B be a binary map, where B_i = 1 denotes an important edge pixel; the edge-preservation term is then defined as:

ε_e = (1/N_B) Σ_{i: B_i = 1} (E_i(I) − E_i(T))²
where N_B is the total number of important edge pixels. The definition of an important edge is subjective and varied, and depends on the application scenario. An ideal way to obtain the binary map B would be manual labeling of user preferences; however, pixel-level manual labeling is quite time-consuming and laborious. This work uses existing edge detection techniques to obtain B; since this is not a major contribution of the work, the process is not described further. Given enough training images and edge maps B, the deep network explicitly learns the information of the important edges by minimizing the edge-preservation term, and reflects the features of those edges in the filtering results.
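The edge response and the edge-preservation term above can be sketched with numpy; the neighbor set (right/bottom differences) and the squared-difference form are assumptions for illustration:

```python
import numpy as np

def edge_response(img: np.ndarray) -> np.ndarray:
    # E_i: absolute differences to the right/bottom neighbours,
    # accumulated over colour channels -- one simple local-gradient
    # form of the edge response.
    dx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))
    dy = np.abs(np.diff(img, axis=0, append=img[-1:, :]))
    return (dx + dy).sum(axis=-1)

def edge_term(I: np.ndarray, T: np.ndarray, B: np.ndarray) -> float:
    # eps_e: mean squared edge-response difference over the pixels
    # flagged as important edges (B_i = 1).
    E_I, E_T = edge_response(I), edge_response(T)
    n_b = max(int(B.sum()), 1)  # guard against an empty edge map
    return float(np.sum(B * (E_I - E_T) ** 2) / n_b)
```

When the output equals the input, the edge responses match and the term vanishes; any smoothing that weakens a flagged edge raises it.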
To achieve better quality and greater flexibility, this work proposes a new smoothing/flattening term with a spatially variable Lp norm. To remove unwanted image details, the flattening term ensures the smoothness of the filtering result by penalizing the gradients between adjacent pixels:

ε_f = (1/N) Σ_i Σ_{j∈N_h(i)} w_{i,j} |T_i − T_j|^{p_i}
where N_h(i) denotes the neighboring pixels in an h×h window around the i-th pixel, and w_{i,j} is the weight of each pixel pair. The weight w_{i,j} can be computed from the spatial position relation or from the pixel value relation, respectively:

w_{i,j} = exp(−((x_i − x_j)² + (y_i − y_j)²) / (2σ_s²))   or   w_{i,j} = exp(−Σ_c (I_{i,c} − I_{j,c})² / (2σ_r²))

where σ_r and σ_s are the standard deviations of the Gaussian functions on pixel values and on spatial position differences, c denotes the image channel, and x and y are the pixel coordinates. Determining the p value of the Lp norm is not easy. To decide in the algorithm which regions of the image use which p values, this work uses an edge-guided image to define the p value of each image pixel i and its corresponding weight as:
where p_large = 2 and p_small = 0.8 are the two values of p, and c_1 and c_2 are two non-negative thresholds. Note that the value of p is not determined by the input image but is conditioned on the output filtering result.
The reason for determining p in this way is as follows. When minimizing the loss function, the L_0.8 norm is applied first; because the p_small = 0.8 regularization term produces some over-sharpened artifacts in the output image, the L_2 norm is then applied to suppress them. These pseudo-structures are identified as structures whose edge response on the original image I is low (e.g., E_i(I) < c_1) but significantly enhanced in the output image T (E_i(T) − E_i(I) > c_2). Without the L_2 norm, a strong smoothing effect can still be achieved, but stair-stepping artifacts appear. On the other hand, without the L_0.8 norm, the L_2 norm alone yields a very blurry optimized image in which many important structures are not well preserved. In contrast, the complete regularization term achieves a better visual effect.
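The pair-wise weights and the spatially varying choice of p can be sketched as follows; the σ and threshold defaults are illustrative assumptions, not values from the work:

```python
import numpy as np

def spatial_weight(xi, yi, xj, yj, sigma_s=7.0):
    # Gaussian weight on the spatial distance between pixels i and j.
    return np.exp(-((xi - xj) ** 2 + (yi - yj) ** 2) / (2 * sigma_s ** 2))

def range_weight(Ii, Ij, sigma_r=0.1):
    # Gaussian weight on the colour difference, summed over channels c.
    d2 = np.sum((np.asarray(Ii, float) - np.asarray(Ij, float)) ** 2)
    return np.exp(-d2 / (2 * sigma_r ** 2))

def select_p(E_I, E_T, c1=0.05, c2=0.05, p_small=0.8, p_large=2.0):
    # Use p_large (L2) on suspected over-sharpening artifacts --
    # pixels whose edge response is low in the input (E_i(I) < c1)
    # but strongly enhanced in the output (E_i(T) - E_i(I) > c2) --
    # and p_small (L0.8) everywhere else; conditioned on the output T.
    artifact = (E_I < c1) & (E_T - E_I > c2)
    return np.where(artifact, p_large, p_small)
```

A pixel with input edge response 0.01 that jumps to 0.2 in the output is flagged as an artifact and assigned p = 2, while an unchanged strong edge keeps p = 0.8.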
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a real-time and controllable scale space filtering method. As a deep learning method, it requires only one forward propagation operation at execution time, is fast, and can obtain multiple filtering results of an image at different scales. The multi-scale filtering results are controlled by introducing edge images, which is more intuitive than controlling the degree of smoothing with hyper-parameters alone.
The purpose of the invention is realized by the following technical scheme:
a real-time and controllable scale space filtering method is based on a recurrent neural network model consisting of a guide network G and a stripping network P, and comprises the following steps:
(1) The original image I is used as the input of the guide network G, which cyclically outputs several edge images G_t of the image at different scales; the edge image G_t and the filtering result I_{t−1} are then input together into the stripping network P. Here t denotes the t-th step of the cycle, t = 1, 2, 3, …, T, and I_0 = I denotes the original image.

(2) Under the guidance of the edge image G_t, the stripping network P outputs the next filtering result I_t.

(3) The filtering result I_t is cyclically fed back, together with the edge image G_{t+1}, as input to the stripping network to obtain a new filtering result I_{t+1}. This operation is repeated until the number of cycles reaches the set total T. The filtering result I_t output by the stripping network P maintains the same picture structure as the input edge image G_t; that is, image peeling is performed under the guidance of G_t.
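The three-step cycle above can be sketched schematically. guide_G and strip_P below are hypothetical stand-ins (a thresholded gradient and an edge-gated blur) for the learned networks, used only to show the data flow:

```python
import numpy as np

def guide_G(I, t):
    # Stand-in guide network: a progressively sparser edge map G_t
    # (the threshold grows with t, so later scales keep fewer edges).
    g = np.abs(np.diff(I, axis=1, append=I[:, -1:]))
    return (g > 0.1 * t).astype(float)

def strip_P(I_prev, G_t):
    # Stand-in stripping network: smooth where G_t = 0, keep pixels
    # that G_t marks as structure.
    blur = (I_prev + np.roll(I_prev, 1, axis=1)) / 2.0
    return np.where(G_t > 0, I_prev, blur)

def scale_space_filter(I, T=3):
    results, I_t = [], I              # I_0 = I is the original image
    for t in range(1, T + 1):
        G_t = guide_G(I, t)           # edge image for scale t
        I_t = strip_P(I_t, G_t)       # peel the next component off I_{t-1}
        results.append(I_t)
    return results                    # [I_1, I_2, ..., I_T]
```

One call yields the whole scale-space stack: each I_t is produced from I_{t−1} under G_t, matching the recurrent structure of steps (1)-(3).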
Further, the stripping network P can peel each component from the image hierarchically: the structures/edges contained in the filtering result of each step are a subset of the structures/edges contained in the filtering result of the previous step.

Further, for edge pixels of the edge image G_t, the gradient of the corresponding co-located pixels in the filtering result I_t should remain unchanged; for non-edge pixels of G_t, the more thoroughly the corresponding co-located pixels in I_t are smoothed, the better.

Further, the edge image G_t can be obtained by any existing deep or non-deep edge detection method.

Further, each step of the cycle in the filtering method is implemented by a single forward propagation operation or by iterating two or three steps of operations.

Furthermore, the stripping network P can learn from any existing filtering method with supervision, or can be trained in an unsupervised manner. At each step of the cycle, the core of the stripping network P is to accept the output I_{t−1} of the previous step as input and perform image filtering under the guidance of the edge image G_t, so that I_t is peeled off from I_{t−1}.

Further, the filtering result of the input image is obtained by setting a hyper-parameter, and the guide network G is used to establish the relation between the hyper-parameter and the edge image G_t; different hyper-parameters are set for different scales, each corresponding to one edge image G_t.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. The invention formally defines a general image separation problem and, by introducing the concept of scale-space filtering, focuses on a specific member of the image separation family of tasks: hierarchical image peeling. The initial image separation problem is simplified through theoretical analysis, converting the original complex problem into a series of small sub-problems and greatly reducing its complexity.
2. The method is a deep learning method that requires only one forward propagation operation at execution time, is fast, and can obtain multiple filtering results of an image at different scales. The invention introduces edge images to control the multi-scale filtering results, which is more intuitive than controlling the degree of smoothing with hyper-parameters alone.
3. Compared with adjusting numerical hyper-parameters of the model, the invention adopts a more intuitive and flexible way to generate filtering results of different scales, namely using perceptually meaningful edge images in place of hyper-parameters.
4. Many current tasks expect the model to run in real time, so in addition to useful functionality, the efficiency of the model is crucial. The invention designs a lightweight recurrent neural network model, the hierarchical image peeling network, which completes the hierarchical image peeling task efficiently and effectively and flexibly handles both supervised and unsupervised settings. The model size is about 3.5 MB, and on a GTX 2080Ti GPU each pass over a 1080p image runs at over 60 fps, giving the method strong practical value.
5. The method realizes hierarchical organization of an image, ultimately obtains its multi-scale representation, and can extract information of interest from the image at different scales.
Drawings
FIGS. 1a and 1b are schematic diagrams of general image filtering; FIGS. 1c to 1g are schematic diagrams of scale-space filtering.
Fig. 2 is a schematic diagram of a framework structure and a working process of the recurrent neural network model in this embodiment. The numbers below each network block in the figure represent the number of channels output by the corresponding convolution module in the neural network, and the letters K, S and D represent the size of the convolution kernel, the step size of the convolution and the expansion rate of the hole convolution, respectively.
Fig. 3a to 3c show an input image, a guide map and a final result map during application of the method of the present invention, respectively.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Over the past few decades, the computer vision and multimedia fields have paid increasing attention to organizing images hierarchically or at multiple scales, drawing on research into how the human eye perceives the external world. Taking image segmentation as an example, an image may be spatially partitioned into a set of object instances or superpixels, which can serve as a basis for further processing. Unlike image segmentation, this embodiment introduces another level of image organization from the perspective of information extraction, referred to as image separation to distinguish it from image segmentation.
The object of the invention is to hierarchically organize images to obtain a multi-scale representation. Specifically, given an image I, a plurality of components C_i satisfying a hierarchical relationship are gradually peeled off from it, and adding these components at the pixel level recovers the original image I. Unlike image segmentation (which considers the overall space), this embodiment decomposes an image into a series of components from the perspective of scale-space filtering. Image filtering, also called image smoothing, refers to removing image texture while keeping the main structure of the image unchanged. The structure of an image is generally its edge part, reflecting the overall contour and shape of objects in the image; the texture of an image refers to visual patterns distributed within objects, repeating in a regular or irregular manner. Fig. 1a shows an example of image filtering: the edges of the dog, the dog's facial features, the leash, and the square frame can be regarded as the structure of the image, while the densely packed small squares distributed over the dog's body can be regarded as texture details. Whether a pixel in an image belongs to structure or texture is quite subjective; for example, in Figs. 1a and 1b the dog's eyes and mouth are judged to be structure and are therefore preserved rather than smoothed, but they could equally be judged to be texture and smoothed away. Neither choice is inherently better; it depends on how the user wants to smooth the image. In other words, the result of image filtering is not unique, and many different choices are possible.
As shown in Figs. 1c to 1g, scale-space filtering means that a single input image is processed into multiple filtering results of different scales and degrees, describing the image in a multi-scale manner so that interesting information can be obtained at each scale.
The core of the invention is to design a real-time and controllable scale space filtering method. To simplify the description, let P_t denote the sum of the components peeled off in the first t steps, so that I_t and P_t satisfy I = I_t + P_t; here n denotes that a total of n components are separated from the original image I. To perform scale-space filtering efficiently, the present embodiment provides a recurrent neural network model that cyclically outputs multiple filtering results of different scales. The recurrent neural network feeds its own output back in as its next input to obtain a new output, and repeats this process until a termination condition is met. The model can be trained in both supervised and unsupervised manners, and comprises a guide network G and a stripping network P. The specific method is as follows: the original image I is used as the input of the guide network, which cyclically outputs several edge images G_t of the image at different scales; the edge image G_t and the filtering result I_{t−1} of the previous step (t = 1, 2, 3, …, T; the zeroth-step filtering result is the original image, i.e., I_0 = I) are input together into the stripping network, which outputs the next filtering result I_t under the guidance of G_t. The filtering result I_t is then cyclically fed back, together with the edge image G_{t+1}, as input to the stripping network to obtain a new filtering result I_{t+1}; this operation is repeated until the number of cycles reaches the set total T. It is emphasized that the filtering result I_t output by the stripping network maintains the same picture structure as the input edge image G_t; that is, image peeling is performed under the guidance of G_t.
Specifically, definition 1 (image separation): given an image I, separating I into its constituent components C = {C_1, C_2, ..., C_n} is called image decomposition, where n denotes that n components are separated from the image.
The object of the present embodiment can be represented by definition 1; ∇C_i represents the first-order gradient of a decomposed component. The gradient represents the structure and detail information of the image and is obtained by subtracting adjacent pixels of the image, that is, each pixel value of the image gradient is the difference between a pixel at the corresponding position in the original image and its adjacent pixel. It is calculated by the formulas ∇_x I(x, y) = I(x+1, y) − I(x, y) and ∇_y I(x, y) = I(x, y+1) − I(x, y), where x, y are the position indices of the image pixels. In general, the gradient corresponding to the edge portions of the image is large. Due to the complex relationships between the components C_i, it is very difficult to resolve multiple components directly from the image; to make the problem easier to handle, sequential image stripping reduces it to a series of sub-problems that iteratively strip two components from the image.
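The adjacent-pixel difference just described can be written directly in NumPy; the forward-difference convention with zero padding at the border is one common choice:

```python
import numpy as np

def image_gradient(img):
    # first-order forward differences; last row/column padded with zeros
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]   # I(x+1, y) - I(x, y)
    gy[:-1, :] = img[1:, :] - img[:-1, :]   # I(x, y+1) - I(x, y)
    return gx, gy
```

Large magnitudes of (gx, gy) mark the edge regions where, per the text above, the gradient is expected to be large.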
Theorem 1 (sequential image stripping): suppose that for any t, [C_t, I_t] are both obtained from I_{t-1}, so that the two components are iteratively separated by stripping C_t from I_{t-1}; the result obtained is the same as the result of separating all components directly from the image.
Proof: according to the property of hierarchical stripping, the non-zero elements of ∇I_t should be a subset of the non-zero elements of ∇I_{t-1}. From the structure-retention property, ∇C_t and ∇I_t should have no correlation between them, which can be expressed as ∇C_t ∘ ∇I_t = 0. Now, given I_t = I_{t+1} + C_{t+1} and ∇C_{t+1} ∘ ∇I_{t+1} = 0, we can obtain ∇C_t ∘ ∇I_{t+1} = 0 and ∇C_t ∘ ∇C_{t+1} = 0. By analogy, ∇C_i ∘ ∇C_j = 0 for all i ≠ j and ∇C_i ∘ ∇I_t = 0 for all i ≤ t, which is exactly the condition obtained when all components are separated directly from the image. Theorem 1 is proved.
Hierarchical image stripping is a special member of the image-separation family of tasks: it separates/strips out the components of an image progressively rather than all at once from the original. The above analysis transforms the initial problem into a sequential one, which naturally suggests a solution using recurrent neural networks. Each cycle step can be viewed as performing a controllable, structure-retaining image stripping operation. The goal of each cycle step can generally be written in the form:

[C_t, I_t] ← argmin Φ(C_t) + Ψ(I_t), subject to I_{t-1} = I_t + C_t,
where Φ(·) and Ψ(·) are regularization terms with respect to C_t and I_t, and σ denotes a hyper-parameter that controls the filtering/stripping strength. The formula expresses that the C_t and I_t output at each cycle step should be a solution that minimizes the objective function Φ(C_t) + Ψ(I_t). Many conventional methods, e.g. L0 [1], RGF [2], RTV [3] and muGIF [4], involve operations with very large amounts of computation, such as matrix inversion, and are therefore slow, which limits their real-time application in practical scenarios. With a deep learning approach, a deep neural network can be trained from an input-output perspective to imitate the effect of a conventional method (using the results of the conventional method as supervision data). Once training is completed, the network only needs one forward-propagation operation at execution time, which greatly reduces the required computational overhead. However, using the numerical parameter σ to control the degree of filtering/stripping is not intuitive, and it is difficult to adjust the hyper-parameter σ to obtain the desired result; after all, σ is only a number, a scalar. In contrast, guidance information that is meaningful and intuitive in visual perception is more practical. Among the many visually meaningful cues available, edge images are a good choice because they are very simple and intuitively reflect the semantic features and overall outline of an image. Unfortunately, in general, the edge image used as guidance at each stage of the loop is unknown. To solve this problem, the invention predicts a reasonable edge image in advance for each stage of the cycle using the guide network G, i.e. G_t ← G(I_{t-1}), and then uses the predicted edge image to guide the stripping network P to strip C_t from I_{t-1}.
With the above considerations in mind, the invention proposes a looping strategy that cyclically uses G_t as a guide to strip C_t from I_{t-1}, i.e. [C_t, I_t] ← P(I_{t-1}, G_t), where t denotes the t-th step of the cycle. Notably, G_t can not only be output by the guide network but may also be an edge image created by the user in a customized manner. All image pixel values used in the invention lie in the range [0, 1]; for an image with values in [0, 255], an image in [0, 1] can be obtained by a simple normalization operation.
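The [0, 255] → [0, 1] normalization mentioned above is a single division; a minimal sketch:

```python
import numpy as np

def to_unit_range(img_u8):
    # map an 8-bit image with values in [0, 255] to float values in [0, 1]
    return img_u8.astype(np.float32) / 255.0
```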
As shown in fig. 2, the recurrent neural network model of this embodiment logically comprises two modules: a guide network G and a stripping network P. The stripping network P operates with the edge image G_t as a condition; G_t is either output by the guide network G or provided by the user. Through this logical partitioning, the guide network and the stripping network can be decoupled to a large extent, further simplifying the problem. In addition, the partitioning greatly constrains the solution space of the original problem, making it smaller, which benefits the simplification and training of the model.
First, the stripping network. The stripping network can not only learn the effect of any existing conventional filtering method with supervision (using the results of the conventional method as supervision during training), but can also be trained in an unsupervised manner. At each cycle step, the core task of the stripping network is to accept the output of the previous step I_{t-1} as input and perform image filtering under the guidance of the edge image G_t, thereby stripping C_t from I_{t-1}. Whatever reasonable edge image G_t is used as guidance, the stripping result I_t should strictly comply with it. Since this section mainly introduces the stripping network, assume that the edge image G_t already exists and is known; the following section details how the guide network is designed to obtain G_t. Then, based on the hard constraint I_{t-1} = I_t + C_t, the stripping network may take I_{t-1} as input and output either I_t or C_t: as long as one of the two is output, the other can simply be obtained by subtracting the output from I_{t-1}. Furthermore, to better account for the contextual information of the image, the stripping network needs a larger receptive field. Thus, as shown in fig. 2, dilated (hole) convolutions are introduced to gradually increase the receptive field, with the dilation rate increasing exponentially. Although a deeper network could also enlarge the receptive field, a very deep network has a very large number of parameters; to save the storage overhead of the model, the receptive field is enlarged not by deepening the network but by introducing dilated convolutions.
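The effect of exponentially increasing dilation rates on the receptive field can be checked with a few lines of arithmetic. The 3 × 3 kernels and layer counts below are illustrative assumptions, not the exact architecture of fig. 2:

```python
def receptive_field(num_layers, kernel=3, exponential=True):
    # receptive field of a stack of stride-1 dilated convolutions:
    # each layer adds (kernel - 1) * dilation pixels of context
    rf = 1
    for i in range(num_layers):
        dilation = 2 ** i if exponential else 1
        rf += (kernel - 1) * dilation
    return rf
```

With 5 layers, dilation rates 1, 2, 4, 8, 16 give a 63-pixel receptive field versus 11 for plain convolutions, at the same parameter count — which is exactly the trade-off the text invokes.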
Second, the guide network. For an input image I_{t-1}, the filtering result is, in general, obtained by setting the hyper-parameter σ_t (different σ_t for different scales), which can be represented as Î_t = F(I_{t-1}, σ_t), where F denotes a specific filtering method and Î_t is the target result of image filtering. What the guide network needs to do is to establish the relation between the value σ_t and the corresponding guide map (the edge image in the present invention) that has visual-perceptual meaning. To achieve this, the gradient ∇Î_t can be used as the supervision Ĝ_t for G_t to train the guide network. However, there is some difference between the image gradient and an edge image in the true sense. A reasonable edge image should be semantic and at the same time binary (a binary image takes values 0 or 1). A semantic edge image reflects the targets in the image that can be perceived by the human eye, and a binary edge image avoids ambiguity when judging whether a certain pixel belongs to the edge image. To reduce the difference between the gradient and a true edge image: on the one hand, similar to the stripping network, the guide network also employs dilated convolutions to increase the receptive field and better learn the contextual characteristics of the image; the contextual information learned through dilated convolution markedly improves the network's perception of semantically meaningful objects. On the other hand, a Sigmoid activation function is added to the last layer of the network to force the output as close to binary as possible. More details can be found in the guide-network part of fig. 2. However, the real supervision Î_t is not always available during the training phase. Therefore, when Î_t is absent, its gradient ∇Î_t should be approximated using the gradient of I_0, i.e. ∇I_0.
The approximation Ĝ_t need not be particularly precise, as long as it reflects the overall structure of the image. By repeatedly using the gradient ∇I_0 of I_0, a series of Ĝ_t at different scales can be constructed to train a reasonable guide network.
Why not obtain the edge image G_t directly with a ready-made edge detection method, such as the conventional edge detection operators Roberts, Sobel and Canny, or other deep-learning-based edge detection methods? Theoretically, the existing edge detection methods are feasible, but they suffer from sensitivity to noise, inaccurate edge localization, overly thick detected edges, and failure to satisfy the hierarchical property. In addition, the basic idea of the deep learning methods is to learn multi-scale features of a single input image and then fuse them to obtain the final predicted edge image. Existing deep learning methods need to acquire multi-scale features by means of a pre-trained image classification network such as AlexNet or VGG16; their specific approach is to fuse the features output by different layers of the pre-trained classification network using 1 × 1 convolutions and then obtain the predicted edge image from the fused features, where the features of different layers can be regarded as features at multiple scales. As shown in fig. 2, the framework of this embodiment can overcome these problems of the existing deep learning methods, because the invention reuses the network parameters and cyclically feeds the feature F_{t-1} output at the previous step back in as input to obtain the next feature F_t; F_t has a larger receptive field than F_{t-1}.
How the neural network model of this embodiment is trained is as follows. For different scales, different σ_t need to be set; I_t can be seen as the filtering result obtained by setting a particular σ_t, and each σ_t has a corresponding edge image G_t. T denotes the total number of cycles; in this embodiment, T is provisionally set to 4. It should be noted that the recurrent neural network may be cycled any number of times, not limited to T times. The interval of filtering degree at each step is controlled by σ_t, and σ_t may be adjusted according to particular needs.
The loss functions during network training apply to both supervised and unsupervised training modes and comprise a guiding consistency loss, a stripping reconstruction loss, a stripping retention loss and a stripping consistency loss. For supervised image stripping, Î_t and Ĝ_t exist and are obtainable, where Î_t can be generated by any existing image filtering method and used as supervision in neural network training; for unsupervised image stripping, Î_t is unavailable and Ĝ_t is constructed from the gradient of I_0 as described above.
The specific form of each loss function is described below. The guiding consistency loss ensures that the edge image G_t output by the guide network remains consistent with Ĝ_t. Let 1 denote an image of the same size as I_t whose pixel values are all 1, and let ∘ denote the Hadamard product, i.e. pixel-by-pixel multiplication of pixels at the same position in two images. If Ĝ_t is not available, it can be obtained as
Ĝ_t = 1 − (1 − Ĝ'_t) ∘ (1 − G_gr),

where Ĝ'_t is the gradient-based approximation and G_gr serves to further enhance the important edges in the image. G_gr can be a manually annotated edge image or the binarization of ∇I_0. For simplicity of description, the following uniformly uses Ĝ_t to represent the supervision target. The guiding consistency loss can be expressed as:

L_gc = ‖(G_t − Ĝ_t) ∘ Ĝ_t‖_1 + β_g ‖(G_t − Ĝ_t) ∘ (1 − Ĝ_t)‖_1,
where ‖·‖_1 denotes the 1-norm and β_g is a constant that balances the two terms of the loss function. The guiding consistency loss compares the values of G_t and Ĝ_t to determine, for each position, whether the pixel of G_t belongs to the edge image.
The stripping reconstruction loss requires the output result I_t and Î_t to be as close as possible in color space. Let ‖·‖_2 be the 2-norm; the loss takes the form:

L_sr = ‖I_t − Î_t‖_2.
The purpose of the stripping retention loss is to keep the gradient of the structural part of I_t unchanged. The pixels of I_t identified as belonging to the structure are those at the same positions as pixels of G_t whose values are close to 1; in other words, the structural pixels of I_t correspond to the pixels of G_t with values close to 1 (considered to belong to the edge image). Since the gradient of an image naturally reflects its structural information, the stripping retention loss is defined as the distance between the gradients of I_t and I_{t-1} on the structural region, i.e. the G_t-weighted distance between ∇I_t and ∇I_{t-1}:

L_sm = ‖G_t ∘ (∇I_t − ∇I_{t-1})‖_1.
the loss of peel consistency severely constrains the image peeling process such that the result of peeling is in accordance with GtAnd the consistency is maintained. Peeling consistency loss pair ItEach pixel of (a) is smoothed to a different degree. For ItThe punishment of the pixels belonging to the structure is small, and the punishment of the pixels belonging to the texture is large. The specific form of peel consistency loss is:
Here ε is used to avoid a zero denominator. To stabilize training and increase the convergence rate, the guide network and the stripping network are trained independently of each other. Since the stripping network requires G_t, the guide network is first trained with the guiding consistency loss; the parameters of the trained guide network are then fixed, and the stripping network is trained with the stripping reconstruction loss, the stripping retention loss and the stripping consistency loss.
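The three stripping-network losses can be sketched in NumPy. Note that the display equations of the patent are not fully recoverable from this text, so the exact norms and weightings below are a reconstruction from the surrounding prose, intended as an illustrative approximation only:

```python
import numpy as np

def grad(img):
    # forward differences, zero-padded at the border
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]
    gy[:-1, :] = img[1:, :] - img[:-1, :]
    return gx, gy

def stripping_losses(I_t, I_prev, I_hat, G_t, eps=1e-6):
    # reconstruction: stay close to the target filtering result in color space
    l_rec = np.sqrt(np.sum((I_t - I_hat) ** 2))
    # retention: keep gradients unchanged where G_t marks structure (values near 1)
    gx, gy = grad(I_t); px, py = grad(I_prev)
    l_ret = np.sum(np.abs(G_t * (gx - px))) + np.sum(np.abs(G_t * (gy - py)))
    # consistency: penalize gradients heavily where G_t is near 0 (texture)
    l_con = np.sum((np.abs(gx) + np.abs(gy)) / (G_t + eps))
    return l_rec, l_ret, l_con
```

The division by G_t + eps realizes the stated behavior: structure pixels (G_t near 1) are barely penalized, while texture pixels (G_t near 0) incur a large smoothing penalty.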
Further, to demonstrate the significant progress of the method of the present invention, some experimental results are described below.
First, to fully exploit the multi-scale character of the method, a new strategy is provided for applying it to the saliency detection task. Image saliency detection means using a computer algorithm to simulate the visual characteristics of human beings so as to extract the salient regions of an image. The salient regions of an image can be regarded as the parts most noticeable to the human eye when looking at an image; in general, the salient, high-contrast and color-varying regions attract the eye. Saliency detection is closely related to the selective nature of the human visual system; its goal is to locate important and salient regions or objects in an image, and it is an important and popular research direction in computer vision. This example first uses the existing saliency detection models CSF [6] and EGNet [7] to perform saliency detection on the original image I and the four filtering results generated by the method of the invention (5 images in total), and then trains a lightweight network (only 91 KB) on the DUTS-TR [8] dataset to predict a better saliency map from these five saliency detections. Following the evaluation practice of the related literature, saliency detection quality is measured by the mean absolute error, computed as MAE(S_o, S_gt) := mean(|S_o − S_gt|), where S_o is the saliency map output by the model and S_gt is the ground-truth saliency map (annotated by humans). The evaluation datasets of this example are the public saliency detection datasets ECSSD [9], PASCAL-S [10], HKU-IS [11], SOD [12] and DUTS-TE [8].
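The MAE measure above is one line of NumPy:

```python
import numpy as np

def mae(S_o, S_gt):
    # mean absolute error between predicted and ground-truth saliency maps
    return float(np.mean(np.abs(S_o - S_gt)))
```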
The results show that the method can effectively improve the performance of the existing saliency detection model, because some features useful for saliency detection may be more prominent in different scales, and the removal of unwanted textures in the image helps to enhance the contrast of a salient region. In addition to saliency detection, the present invention can flexibly improve the performance of many other visual and graphical models.
In addition, this embodiment is compared with other filtering methods. The conventional methods compared include L0 [1], RTV [3], RGF [2], SD [17], muGIF [4], realLS [18] and enBF [19]; the deep learning methods include DEAF [15], FIP [16] and PIO [5]. In the visual results, Ours-S (supervised with results produced by muGIF) and Ours denote the models obtained by supervised and unsupervised training, respectively. To evaluate image quality, the gradient correlation coefficient (GCC) is used to evaluate the degree of irrelevance between the stripped-out image texture and structure. For fairness, the hyper-parameters of the compared methods are also carefully adjusted so that all methods achieve a similar degree of filtering/smoothing.
Table 1: quantitative comparison of GCC and execution speed for each method
Note 1: quantitative comparison using GCC. For a fair comparison, the smoothing/filtering degree of all compared methods is controlled to 0.146 ± 0.01. The best results are in bold. Smaller GCC values indicate better results.
Note 2: run-time contrast when processing 1080p (1627 × 1080) images. The time of the CPU is not marked in the table, and the time of the GPU is usedAnd (4) marking.
As can be seen from the quantitative results in Table 1, the method of the invention ranks first on the GCC index compared with the other methods, which shows that the ∇C_t and ∇I_t obtained with the recurrent neural network framework of the invention satisfy the mutual-orthogonality property well. In addition, whether running on a CPU or a GPU, the recurrent neural network model executes much faster than the conventional methods. Thanks to the advantages of deep learning, the recurrent model and PIO achieve real-time speed when processing 1080p images. In terms of visual effect, L0, RGF and PIO produce very poor results, and PIO additionally suffers a severe color-shift problem when the degree of filtering/smoothing is increased. While RTV and muGIF perform relatively well, neither completely smooths or preserves certain areas of the image; in contrast, the present method achieves visually pleasing results both in smoothing the texture details of the image and in preserving its main structures/edges. It is worth mentioning that, apart from the method of the invention, neither the conventional methods nor the deep learning methods can generate a filtering result that follows a guide image, whether the guide is an edge image whose scale changes step by step or an edge image provided/edited by the user. As shown in figs. 3a to 3c, model flexibility is verified using a manually edited guide map, which combines four edge images of different scales. Compared with the other methods, only the method of the invention successfully outputs a filtering result structurally consistent with the guide image.
The invention also evaluates the edge images output by the guide network. Because the framework has the multi-scale property, the edge image output by the guide network at each step of the loop can be used to construct an edge confidence map. The edge confidence map is also an edge image, except that the value of each pixel is not necessarily close to 0 or 1; many pixel values may lie around 0.5, because each value can be regarded as the probability that the pixel belongs to an edge, and the larger the value, the more likely the pixel belongs to an edge. Specifically, the manually annotated ground-truth edge images of the BSDS500 dataset are used as G_gr during unsupervised learning to train the guide network; at execution time the guide network is iterated 24 times, and the edge images obtained over the 24 iterations are averaged to obtain the edge confidence map. The constructed edge confidence map is quantitatively evaluated with a precision-recall curve, and non-maximum suppression is applied to the confidence map before evaluation.
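Averaging the per-iteration edge maps into a confidence map, as described above, is a simple stack-and-mean:

```python
import numpy as np

def edge_confidence(edge_maps):
    # each map is (near-)binary; the mean acts as a per-pixel vote,
    # so a value of 0.5 means half of the iterations marked the pixel as edge
    return np.mean(np.stack(edge_maps, axis=0), axis=0)
```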
Finally, an ablation experiment on the recurrent neural network model of the invention is presented. Since the guide network has only the guiding consistency loss, no ablation analysis of its loss function is necessary. With I_{t-1} as input, the stripping network has two execution modes: one outputs I_t and the other outputs C_t. The invention prefers to output C_t, i.e. C_t ← P(I_{t-1}, G_t), because C_t contains less information than I_t and has a simpler distribution.
In addition, the technical scheme of the invention admits the following alternatives:
Alternative one: the edge image G_t is not obtained from the guide network output, but from any existing deep or non-deep edge detection method.
Alternative two: each loop step is not a single forward-propagation operation but iterates a further two or three steps, which is equivalent to embedding a small loop inside the large loop. (By contrast, the recurrent model of the invention performs only one forward operation per cycle step, with no small loop embedded in the large loop.)
Alternative three: the guide network and the stripping network interact not by taking the output of the guide network as the input of the stripping network, but by using the guide network to directly output the model parameters of the stripping network.
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.
Reference documents:
[1]Li Xu,Cewu Lu,Yi Xu,and Jiaya Jia.Image smoothing via L0 gradient minimization.TOG,30(6):112,2011.
[2]Qi Zhang,Xiaoyong Shen,Li Xu,and Jiaya Jia.Rolling guidance filter.In ECCV,2014.
[3]L.Xu,Q.Yan,Y.Xia,and J.Jia.Structure extraction from texture via relative total variation.TOG.,31(6):139,2012.
[4]X.Guo,Y.Li,J.Ma,and H.Ling.Mutually guided image filtering.TPAMI,42(3):694–707,2020.
[5]Qingnan Fan,Dongdong Chen,Lu Yuan,Gang Hua,Nenghai Yu,and Baoquan Chen.A general decoupled learning framework for parameterized image operators.TPAMI,2019.
[6]Shang-Hua Gao,Yong-Qiang Tan,Ming-Ming Cheng,Chengze Lu,Yunpeng Chen,and Shuicheng Yan.Highly efficient salient object detection with 100k parameters.In ECCV,2020.
[7]Jia-Xing Zhao,Jiang-Jiang Liu,Deng-Ping Fan,Yang Cao,Jufeng Yang,and Ming-Ming Cheng.Egnet:edge guidance network for salient object detection.In ICCV,Oct 2019.
[8]Chuan Yang,Lihe Zhang,Huchuan Lu,Xiang Ruan,and Ming-Hsuan Yang.Saliency detection via graph-based manifold ranking.In CVPR,pages 3166–3173,2013.
[9]Q.Yan,L.Xu,J.Shi,and J.Jia.Hierarchical saliency detection.In CVPR,pages 1155–1162,2013.
[10]Y.Li,X.Hou,C.Koch,J.M.Rehg,and A.L.Yuille.The secrets of salient object segmentation.In CVPR,pages 280–287,2014.
[11]Guanbin Li and Y.Yu.Visual saliency based on multiscale deep features.In CVPR,pages 5455–5463,2015.
[12]David Martin,Charless Fowlkes,Doron Tal,and Jitendra Malik.A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics.In ICCV,2001.
[13]Michal Gharbi,Gaurav Chaurasia,Sylvain Paris,and Frdo Durand.Deep joint demosaicking and denoising.TOG,35(6):1–12,2016.
[14]Sifei Liu,Jinshan Pan,and Ming-Hsuan Yang.Learning recursive filters for low-level vision via a hybrid neural network.In ECCV,2016.
[15]Li Xu,Jimmy S.J.Ren,Qiong Yan,Renjie Liao,and Jiaya Jia.Deep edge-aware filters.In ICML,2015.
[16]Qifeng Chen,Jia Xu,and Vladlen Koltun.Fast image processing with fully-convolutional networks.In ICCV,pages 2516–2525,2017.
[17]Bumsub Ham,Minsu Cho,and Jean Ponce.Robust guided image filtering using nonconvex potentials.TPAMI,40(1):192–207,2017.
[18]Wei Liu,Pingping Zhang,Xiaolin Huang,Jie Yang,Chunhua Shen,and Ian Reid.Real-time image smoothing via iterative least squares.TOG,39(3):28,2020.
[19]Wei Liu,Pingping Zhang,Xiaogang Chen,Chunhua Shen,Xiaolin Huang,and Jie Yang.Embedding bilateral filter in least squares for efficient edge-preserving image smoothing.TCSVT,30(1):23–35,2020.
[20]John Canny.A computational approach to edge detection.TPAMI,8(6):679–698,1986.
[21]David R Martin,Charless C Fowlkes,and Jitendra Malik.Learning to detect natural image boundaries using local brightness,color,and texture cues.TPAMI,26(5):530–549,2004.
[22]Pablo Arbelaez,Michael Maire,Charless Fowlkes,and Jitendra Malik.Contour detection and hierarchical image segmentation.TPAMI,33(5):898–916,2011.
[23]Zhile Ren and Gregory Shakhnarovich.Image segmentation by cascaded region agglomeration.In CVPR,pages 2011–2018,2013.
[24]Piotr Dollár and C Lawrence Zitnick.Structured forests for fast edge detection.In CVPR,pages 1841–1848,2013.
[25]Wei Shen,Xinggang Wang,Yan Wang,Xiang Bai,and Zhijiang Zhang.DeepContour:A deep convolutional feature learned by positive-sharing loss for contour detection.In CVPR,pages 3982–3991,2015.
[26]Saining Xie and Zhuowen Tu.Holistically-nested edge detection.In CVPR,2015.
[27]Yun Liu,Ming-Ming Cheng,Xiaowei Hu,Kai Wang,and Xiang Bai.Richer convolutional features for edge detection.In CVPR,2017.
Claims (7)
1. A real-time and controllable scale-space filtering method, characterized in that, based on a recurrent neural network model composed of a guide network G and a stripping network P, the method comprises the following steps:
(1) using the original image I as the input of the guide network G, cyclically outputting a plurality of edge images G_t of the image at different scales, and then inputting the edge image G_t and the filtering result I_{t-1} together into the stripping network P; wherein t denotes the t-th step of the cycle, t = 1, 2, 3, ..., T, and I_0 = I denotes the original image;
(2) the stripping network P, under the guidance of the edge image G_t, outputs the next filtering result I_t;
(3) the filtering result I_t is cyclically taken together with the edge image G_{t+1} as input to the stripping network to again obtain a new filtering result I_{t+1}; this operation is repeated until the number of cycles reaches the set total number of cycles T; wherein the filtering result I_t output by the stripping network P maintains the same picture structure as the input edge image G_t, i.e. image stripping is performed under the guidance of G_t.
2. A real-time and controllable scale-space filtering method according to claim 1, wherein the stripping network P is capable of stripping each component from the image hierarchically: the structures/edges contained in the filtering result of each step are a subset of the structures/edges contained in the filtering result of the previous step.
3. A real-time and controllable scale-space filtering method according to claim 1, characterized in that, for the edge pixels of the edge image G_t, the gradient of the corresponding co-located pixels in the filtering result I_t should remain unchanged; for the non-edge pixels of G_t, the more thoroughly the corresponding co-located pixels in I_t are smoothed, the better.
4. A real-time and controllable scale-space filtering method according to claim 1, characterized in that the edge image G_t can be obtained by any existing deep or non-deep edge detection method.
5. A real-time and controllable scale space filtering method according to claim 1, wherein the loop of each step in the filtering method is implemented by a single forward propagation operation or by an iterative two-step or three-step operation.
6. A real-time and controllable scale-space filtering method according to claim 1, wherein the stripping network P can learn any existing filtering method with supervision and can also be trained in an unsupervised manner; at each step of the cycle, the core of the stripping network P is to accept the output of the previous step I_{t-1} as input and perform image filtering under the guidance of the edge image G_t, thereby stripping C_t from I_{t-1}.
7. A real-time and controllable scale-space filtering method according to claim 1, wherein the filtering result of the input image is obtained by setting a hyper-parameter, and the guide network G is used to establish the relation between the hyper-parameter and the edge image G_t; different hyper-parameters are set for different scales, and each hyper-parameter corresponds to an edge image G_t.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110172012.2A CN112862715B (en) | 2021-02-08 | 2021-02-08 | Real-time and controllable scale space filtering method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112862715A true CN112862715A (en) | 2021-05-28 |
CN112862715B CN112862715B (en) | 2023-06-30 |
Family
ID=75989229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110172012.2A Active CN112862715B (en) | 2021-02-08 | 2021-02-08 | Real-time and controllable scale space filtering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112862715B (en) |
Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100074552A1 (en) * | 2008-09-24 | 2010-03-25 | Microsoft Corporation | Removing blur from an image |
CN102521798A (en) * | 2011-11-11 | 2012-06-27 | 浙江捷尚视觉科技有限公司 | Image automatic recovering method for cutting and selecting mask structure based on effective characteristic |
CN105931192A (en) * | 2016-03-21 | 2016-09-07 | 温州大学 | Image texture filtering method based on weighted median filtering |
CN107622481A (en) * | 2017-10-25 | 2018-01-23 | 沈阳东软医疗系统有限公司 | Reduce the method, apparatus and computer equipment of CT picture noises |
CN107633490A (en) * | 2017-09-19 | 2018-01-26 | 北京小米移动软件有限公司 | Image processing method, device and storage medium |
CN107844751A (en) * | 2017-10-19 | 2018-03-27 | 陕西师范大学 | The sorting technique of guiding filtering length Memory Neural Networks high-spectrum remote sensing |
CN108280831A (en) * | 2018-02-02 | 2018-07-13 | 南昌航空大学 | A kind of acquisition methods and system of image sequence light stream |
CN108492308A (en) * | 2018-04-18 | 2018-09-04 | 南昌航空大学 | A kind of determination method and system of variation light stream based on mutual structure guiding filtering |
CN109118451A (en) * | 2018-08-21 | 2019-01-01 | 李青山 | A kind of aviation orthography defogging algorithm returned based on convolution |
CN109272539A (en) * | 2018-09-13 | 2019-01-25 | 云南大学 | The decomposition method of image texture and structure based on guidance figure Total Variation |
CN109450406A (en) * | 2018-11-13 | 2019-03-08 | 中国人民解放军海军航空大学 | A kind of filter construction based on Recognition with Recurrent Neural Network |
CN109978764A (en) * | 2019-03-11 | 2019-07-05 | 厦门美图之家科技有限公司 | A kind of image processing method and calculate equipment |
CN110009580A (en) * | 2019-03-18 | 2019-07-12 | 华东师范大学 | The two-way rain removing method of single picture based on picture block raindrop closeness |
CN110246099A (en) * | 2019-06-10 | 2019-09-17 | 浙江传媒学院 | It is a kind of keep structural edge image remove texture method |
CN110276721A (en) * | 2019-04-28 | 2019-09-24 | 天津大学 | Image super-resolution rebuilding method based on cascade residual error convolutional neural networks |
CN110689021A (en) * | 2019-10-17 | 2020-01-14 | 哈尔滨理工大学 | Real-time target detection method in low-visibility environment based on deep learning |
CN110910317A (en) * | 2019-08-19 | 2020-03-24 | 北京理工大学 | Tongue image enhancement method |
CN110991463A (en) * | 2019-11-04 | 2020-04-10 | 同济大学 | Multi-scale guided filtering feature extraction method under guide of super-pixel map |
CN111275642A (en) * | 2020-01-16 | 2020-06-12 | 西安交通大学 | Low-illumination image enhancement method based on significant foreground content |
CN111462012A (en) * | 2020-04-02 | 2020-07-28 | 武汉大学 | SAR image simulation method for generating countermeasure network based on conditions |
CN111626330A (en) * | 2020-04-23 | 2020-09-04 | 南京邮电大学 | Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation |
CN111639471A (en) * | 2020-06-01 | 2020-09-08 | 浙江大学 | Electromagnetic interference filter design method based on recurrent neural network |
CN112132753A (en) * | 2020-11-06 | 2020-12-25 | 湖南大学 | Infrared image super-resolution method and system for multi-scale structure guide image |
- 2021-02-08 CN CN202110172012.2A patent/CN112862715B/en active Active
Non-Patent Citations (4)
Title |
---|
LIXIANG ZHEN; LU YONG GE; XUZHI MING; YAOJING PING: "RADAR Cross Section Measurement and Imaging Related to Ship Target in the Sea Environment", PROCEDIA COMPUTER SCIENCE, 6 February 2019 (2019-02-06) * |
ZHANG YONGXIN (author), Xinhua Publishing House * |
ZHANG YANYONG; ZHANG SHA; ZHANG YU et al.: "Autonomous Driving Perception and Computing Based on Multimodal Fusion", Journal of Computer Research and Development, 31 December 2020 (2020-12-31) * |
Also Published As
Publication number | Publication date |
---|---|
CN112862715B (en) | 2023-06-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | A closed-form solution to photorealistic image stylization | |
Tian et al. | Deep learning on image denoising: An overview | |
Li et al. | Single image dehazing via conditional generative adversarial network | |
Zhu et al. | A fast single image haze removal algorithm using color attenuation prior | |
Zhou et al. | UGIF-Net: An efficient fully guided information flow network for underwater image enhancement | |
CN108875935B (en) | Natural image target material visual characteristic mapping method based on generation countermeasure network | |
Brox et al. | Unsupervised segmentation incorporating colour, texture, and motion | |
Liu et al. | Interactive image segmentation based on level sets of probabilities | |
Atoum et al. | Color-wise attention network for low-light image enhancement | |
Basaran et al. | An efficient framework for visible–infrared cross modality person re-identification | |
WO2015192115A1 (en) | Systems and methods for automated hierarchical image representation and haze removal | |
CN111488865A (en) | Image optimization method and device, computer storage medium and electronic equipment | |
CA3137297C (en) | Adaptive convolutions in neural networks | |
KR102311796B1 (en) | Method and Apparatus for Deblurring of Human Motion using Localized Body Prior | |
CN111681198A (en) | Morphological attribute filtering multimode fusion imaging method, system and medium | |
Feng et al. | URNet: A U-Net based residual network for image dehazing | |
CN113379707A (en) | RGB-D significance detection method based on dynamic filtering decoupling convolution network | |
CN115880720A (en) | Non-labeling scene self-adaptive human body posture and shape estimation method based on confidence degree sharing | |
Qu et al. | UMLE: unsupervised multi-discriminator network for low light enhancement | |
Wang et al. | Adaptive shape prior in graph cut segmentation | |
Yuan et al. | Explore double-opponency and skin color for saliency detection | |
CN116342377A (en) | Self-adaptive generation method and system for camouflage target image in degraded scene | |
Guo et al. | Progressive Domain Translation Defogging network for real-world fog images | |
CN113627342B (en) | Method, system, equipment and storage medium for video depth feature extraction optimization | |
CN112862715B (en) | Real-time and controllable scale space filtering method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||