CN112862715B - Real-time and controllable scale space filtering method

Info

Publication number: CN112862715B (application number CN202110172012.2A)
Authority: CN (China)
Prior art keywords: image, network, stripping, filtering, edge
Legal status: Active (granted)
Other versions: CN112862715A (Chinese-language publication)
Inventors: 郭晓杰 (Guo Xiaojie), 付园斌 (Fu Yuanbin)
Applicant and current assignee: Tianjin University
Priority: CN202110172012.2A

Classifications

    • G06T5/70
    • G06F18/24 Classification techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06T7/13 Edge detection
    • G06T7/136 Segmentation; Edge detection involving thresholding
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/464 Salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06T2207/10004 Still image; Photographic image


Abstract

The invention discloses a real-time and controllable scale-space filtering method that cyclically outputs multiple filtering results of different scales by means of a specially designed recurrent neural network model. Here, "recurrent" means that the output of the neural network is fed back as its input to produce a new output, which is in turn fed back again, and so on until a loop termination condition is met. The recurrent neural network model comprises a guidance network G and a stripping network P.

Description

Real-time and controllable scale space filtering method
Technical Field
The invention relates to the field of image processing based on deep learning, in particular to a real-time and controllable scale space filtering method.
Background
Traditional methods such as L0 [1], RGF [2], RTV [3] and muGIF [4] need many optimization iterations to reach a reasonable filtering result, so they are very slow and consume a large amount of time at execution. Deep learning methods [13][14][15][16] overcome this drawback because a deep neural network needs only a single forward-propagation operation at test time; however, current deep learning methods can only produce a filtering result of a single scale, cannot obtain several images of different scales at once, and must be retrained with different hyper-parameter settings for each desired degree of smoothing. PIO [5] can adjust the hyper-parameters of the network to obtain filtering results of different scales without retraining the neural network. Nevertheless, for all traditional methods and current deep learning methods, filtering results of different scales are obtained by adjusting hyper-parameters of the model (such as the weights of different loss functions or the number of network layers), which is not intuitive: a hyper-parameter is merely a number, a scalar, and cannot directly reflect the spatial and semantic information of the image. In other words, a hyper-parameter cannot intuitively control which portions of the image need to be smoothed and which portions should be preserved.
The importance of multi-scale representations of images has long been fully verified, which gave rise to the idea of scale-space filtering. Similar to the human eye observing an object, the perceived features differ when the distance to the object differs; that is, for the same object in the visual range, the features differ when the size of its imaging, i.e., its scale, differs. An intuitive example of a scale space is the different scales of a map: a map of large scale may convey the location information of a province or even a country, while a map of small scale may only convey the location information of a town, but in detail. According to human visual mechanisms and scale-space theory, people usually obtain different information at different scales; but when a computer vision system analyses an unknown scene, the computer has no way of knowing the scales of the objects in the image in advance, so descriptions of the image at multiple scales need to be considered simultaneously in order to find the optimal scale of the object of interest. For this reason, an image is often constructed as a series of image sets of different scales so that the features of interest can be detected at the appropriate scale. Briefly, larger scales provide more structural information about a scene, while smaller scales reflect more texture information.
In the existing literature, various image filtering methods attempt to decompose an image into a structural part and a texture part at a single scale. However, for different images and tasks it is difficult to determine in a principled way which scale is correct or best. Rather than disambiguating the scale and searching for a single optimal separation strategy, the image should be separated in an organized, natural and efficient manner, yielding a plurality of results of diverse scales for the user to choose from. Establishing a multi-scale representation of an image is therefore a very useful capability, since users wish to adjust and select the most satisfactory result in a variety of multimedia, computer vision and graphics applications. Applications of scale-space filtering in these fields include image restoration, image stylization, stereo matching, optical flow, semantic flow, and the like.
In the last decades, through research into how the human eye perceives the outside world, the computer vision and multimedia fields have paid increasing attention to organizing images hierarchically or at multiple scales. Taking image segmentation as an example, an image may be spatially segmented into a set of object instances or superpixels, which can serve as the basis for subsequent processing. Unlike image segmentation, another hierarchical image organization scheme is studied here from the point of view of information extraction. This task is called image separation to distinguish it from image segmentation.
Among prior work, the study on image smoothing via unsupervised learning explores how to learn directly from data with deep neural networks (without supervision data) to produce the desired filtering effect. The advantage of using deep learning is that it can produce a better filtering effect while also running faster. However, deep learning usually requires ground-truth filtered images (GT) as supervision during training, which are often difficult to obtain, and manually producing supervision for the training images is time-consuming and laborious. To address this problem, that work designs the training objective as a loss function, similar to optimization-based approaches, and trains the deep neural network in an unsupervised, label-free manner. Its major contributions can be summarized as:
(1) An unsupervised image smoothing framework is presented. Unlike previous approaches, the proposed framework does not require real labels as supervision for training; learning from any sufficiently diverse image dataset can achieve the desired effect.
(2) Filtering results comparable to or even better than previous state-of-the-art methods can be obtained, and a new image smoothing loss function is designed, based on a spatially adaptive Lp flattening criterion and an edge-preserving regularizer.
(3) The proposed method is based on convolutional neural networks and has a much lower computational cost than most previous methods. For example, processing a 1280×720 image on a modern GPU requires only 5 ms.
The specific approach of that work is to take the original image as the input of a deep convolutional neural network, which outputs the smoothing result of the original image. The work designs a fully convolutional neural network with hole (dilated) convolutions and skip connections. The network contains 26 convolution layers in total, all using 3×3 convolution kernels and outputting 64 feature maps (except the 3-channel image of the last layer), each followed by a batch normalization layer and a ReLU activation function. The third convolution layer downsamples the feature maps by half using a stride-2 convolution, and the third-to-last convolution layer uses a deconvolution to restore the feature maps to the same size as the input image.
Image smoothing typically requires information about the image context, so that work introduces hole convolutions with exponentially increasing dilation rates to gradually enlarge the network's receptive field on the image. More specifically, every two adjacent residual blocks share the same dilation rate, and the dilation rate of the next two residual blocks is multiplied by two.
The network also adopts residual learning: the last layer of the network outputs a residual image, which is added to the input image to form the final filtering result.
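As a concrete illustration of this kind of architecture, the following is a minimal PyTorch sketch of a residual smoothing network with a stride-2 downsampling layer, dilated residual blocks whose dilation rate doubles every two blocks, and a deconvolution that restores the resolution. The layer count, channel widths and dilation schedule used here are simplified assumptions and do not reproduce the cited work's exact 26-layer configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 dilated conv layers with BN + ReLU and a skip connection."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(ch))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

class SmoothingNet(nn.Module):
    """Residual smoothing network: predicts a residual image that is added to the input."""
    def __init__(self, ch=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            # the third layer halves the spatial resolution with a stride-2 convolution
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        # pairs of residual blocks share a dilation rate that doubles every two blocks
        dilations = [1, 1, 2, 2, 4, 4, 8, 8]
        self.body = nn.Sequential(*[ResBlock(ch, d) for d in dilations])
        self.tail = nn.Sequential(
            # deconvolution restores the original resolution (assumes even-sized inputs)
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1))  # 3-channel residual image, no BN/ReLU

    def forward(self, x):
        return x + self.tail(self.body(self.head(x)))  # residual learning
```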
The purpose of image smoothing is to reduce unimportant image details while preserving the original image structure. The overall loss function used by that work for image smoothing is:

ε = ε_d + λ_f · ε_f + λ_e · ε_e,

where ε_d is the data retention term, ε_f is the flattening (regularization) term, ε_e is the edge retention term, and λ_f and λ_e weight the different loss terms.
The data retention term minimizes the difference between the input image and the output filtered image to ensure their similarity. Denote the input image by I and the output image by T (note that T here has a different meaning from the T elsewhere in this description, which denotes the total number of loop steps; the present meaning applies only to this prior-art discussion). In the RGB color space, a simple data retention term can be defined as

ε_d = (1/N) Σ_i ||I_i − T_i||²,

where i denotes the pixel index and N is the total number of pixels. During filtering, some important edges may be lost or weakened, because the objective of smoothing pixel values conflicts to some degree with the objective of keeping edges. To address this problem, that work proposes an explicit edge-preserving constraint that retains significant edge pixels. Before introducing this constraint, the concept of a guidance image is introduced, which refers to the edge response of the image. One simple form of edge response is the local gradient value

E_i(I) = Σ_c Σ_{j∈N(i)} |I_{i,c} − I_{j,c}|,

where N(i) denotes the neighborhood of the i-th pixel and c denotes the c-th color channel. The edge response of the output filtering result T is denoted E(T) in the same way. The edge retention term is defined by minimizing the difference between the edge responses E(I) and E(T). Let B be a binary image in which B_i = 1 marks an important edge point; the edge retention term is defined as

ε_e = (1 / Σ_i B_i) Σ_i B_i |E_i(I) − E_i(T)|,

where Σ_i B_i is the number of all significant edge points. What counts as an important edge is rather subjective and diverse and must be decided according to the application scenario. The ideal way to obtain the binary image B is manual marking according to the user's preference; however, pixel-level manual marking is quite time-consuming and labor-intensive, so that work uses existing edge detection techniques to obtain B. Since this process is not a major contribution of that work, it is not described in further detail. Given enough training images and edge maps B, the deep network learns the information of the important edges explicitly by minimizing the edge retention term and reflects the features of the important edges in the filtering result.
That work further proposes a new smoothing/flattening term with a spatially varying Lp norm for better quality and greater flexibility. To remove unwanted image details, the smoothing or flattening term enforces smoothness of the filtering result by penalizing the gradients between adjacent pixels:

ε_f = (1/N) Σ_i Σ_{j∈N_h(i)} w_{i,j} |T_i − T_j|^{p_i},

where N_h(i) denotes the neighboring pixels in an h×h window around the i-th pixel and w_{i,j} is the weight of each pixel pair. The weight w_{i,j} can be computed from the spatial positions or from the pixel values, respectively:

w^s_{i,j} = exp( −((x_i − x_j)² + (y_i − y_j)²) / (2σ_s²) ),
w^r_{i,j} = exp( −Σ_c (I_{i,c} − I_{j,c})² / (2σ_r²) ),

where σ_s and σ_r are the standard deviations of the Gaussian functions over spatial distance and pixel-value difference, respectively, c denotes the image channel, and x and y are pixel coordinates. Determining the value of p in the Lp norm is not trivial. To decide which p value is used in which image regions, that work uses the edge guidance image to define the p value for each pixel i (and its effect on the corresponding weight) as

p_i = p_large if E_i(I) < c_1 and E_i(T) − E_i(I) > c_2; p_i = p_small otherwise,

where p_large = 2 and p_small = 0.8 are the two possible values of p, and c_1 and c_2 are two non-negative thresholds. Note that the value of p is not determined by the input image alone but is conditioned on the output filtering result.
The reason for determining the p value in this way is the following: when minimizing the loss function, the L0.8 norm is applied first, but the p_small = 0.8 regularization causes some over-sharpened artifacts in the output image, so the L2 norm is applied to suppress them. These spurious structures are identified as structures whose edge response is low in the original image I (E_i(I) < c_1) but significantly enhanced in the output image T (E_i(T) − E_i(I) > c_2). Without the L2 norm a strong smoothing effect can still be achieved, but staircase-like artifacts appear; on the other hand, without the L0.8 norm the optimized image becomes very blurred because of the L2 norm, and many important structures are not well preserved. In contrast, better visual effects are obtained with the complete regularization term.
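For concreteness, a hedged PyTorch sketch of this prior-art objective as reconstructed above follows: a data term, an edge-preserving term on marked edge pixels, and a spatially adaptive Lp flattening term. The neighborhood is reduced to the 4-neighborhood, only the range (pixel-value) weight is used, and all constants are illustrative assumptions rather than the cited work's settings.

```python
import torch
import torch.nn.functional as F

def edge_response(img):
    """E_i: sum over channels of absolute forward differences (a simple local gradient response)."""
    dx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().sum(1)  # B x H x (W-1)
    dy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().sum(1)  # B x (H-1) x W
    return F.pad(dx, (0, 1)) + F.pad(dy, (0, 0, 0, 1))        # B x H x W

def smoothing_loss(I, T, B_edge, lam_f=1.0, lam_e=0.1,
                   p_large=2.0, p_small=0.8, c1=0.1, c2=0.05, sigma_r=0.1):
    """eps = eps_d + lam_f * eps_f + lam_e * eps_e (approximate reconstruction of the objective)."""
    # data term: keep the output close to the input
    eps_d = ((I - T) ** 2).mean()

    # edge-preserving term: keep the edge response unchanged at the marked edge pixels (B_edge in {0,1})
    EI, ET = edge_response(I), edge_response(T)
    eps_e = (B_edge * (EI - ET).abs()).sum() / B_edge.sum().clamp(min=1.0)

    # spatially adaptive Lp flattening term over the 4-neighborhood,
    # with range weights computed on the input image (a simplification)
    p_map = torch.where((EI < c1) & (ET - EI > c2),
                        torch.full_like(EI, p_large), torch.full_like(EI, p_small))
    eps_f = 0.0
    for dh, dw in [(0, 1), (1, 0)]:
        diff = T - torch.roll(T, shifts=(-dh, -dw), dims=(2, 3))
        ref  = I - torch.roll(I, shifts=(-dh, -dw), dims=(2, 3))
        w = torch.exp(-(ref ** 2).sum(1) / (2 * sigma_r ** 2))          # range weight
        eps_f = eps_f + (w * diff.abs().sum(1).clamp(min=1e-6) ** p_map).mean()

    return eps_d + lam_f * eps_f + lam_e * eps_e
```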
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a real-time and controllable scale-space filtering method. The method belongs to the family of deep learning methods, needs only one forward-propagation operation per loop step and is therefore fast, and it can obtain multiple filtering results of different scales for one image. An edge image of the input image is introduced to control the multi-scale filtering results, which is more intuitive than controlling the degree of smoothing with hyper-parameters alone.
The invention aims at realizing the following technical scheme:
A real-time and controllable scale-space filtering method based on a recurrent neural network model comprising a guidance network G and a stripping network P, comprising the following steps:
(1) Take the original image I as the input of the guidance network G, and use the guidance network to cyclically output multiple edge images G_t of the image at different scales; then feed the edge image G_t and the filtering result I_{t-1} together into the stripping network P; here t denotes the t-th step of the loop, t = 1, 2, 3, …, T, and I_0 = I denotes the original image;
(2) Under the guidance of the edge image G_t, the stripping network P outputs the filtering result I_t of the next step;
(3) The filtering result I_t is fed back and, together with the edge image G_{t+1}, used again as input to the stripping network to obtain a new filtering result I_{t+1}; this operation is repeated until the number of loop steps reaches the set total number T. The filtering result I_t output by the stripping network P maintains the same image structure as the input edge image G_t, i.e., the image stripping is performed under the guidance of G_t (see the sketch after this list).
Further, the stripping network P can hierarchically strip each component from the image: the structure/edges contained in the filtering result of each step are a subset of the structure/edges contained in the filtering result of the previous step.
Further, for pixels of the edge image G_t that belong to edges, the gradients of the co-located pixels in the filtering result I_t should remain unchanged; for pixels of G_t that do not belong to edges, the co-located pixels in the filtering result I_t are smoothed all the more thoroughly.
Further, the edge image G_t can be obtained by any existing deep or non-deep edge detection method.
Further, each loop step of the filtering method is realized by a single forward-propagation operation, or by iterating such an operation two or three times.
Further, the stripping network P can learn any existing filtering method in a supervised manner, and can also be trained in an unsupervised manner; at each loop step, the core task of the stripping network P is to accept the output I_{t-1} of the previous step as input and perform image filtering under the guidance of the edge image G_t, thereby stripping I_t out of I_{t-1}.
Further, the filtering result of an input image can be obtained by setting a hyper-parameter, and the guidance network G is used to establish the link between the hyper-parameter and the edge image G_t; for different scales, different hyper-parameters are set, and each hyper-parameter corresponds to an edge image G_t.
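The loop of steps (1) to (3) can be sketched in PyTorch as follows, with the guidance network G and the stripping network P treated as black-box modules. This is a schematic of the data flow only; the default step count T = 4 and the convention that P returns the pair (C_t, I_t) are assumptions consistent with the detailed description below.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def scale_space_filter(G: nn.Module, P: nn.Module, image: torch.Tensor, T: int = 4):
    """Cyclically strip an image into T filtering results of increasingly coarse scale.

    image: a B x 3 x H x W tensor with values in [0, 1].
    Returns the filtering results [I_1, ..., I_T] and the edge images [G_1, ..., G_T].
    """
    I_t = image                      # I_0 = I, the original image
    results, guides = [], []
    for t in range(1, T + 1):
        G_t = G(I_t)                 # step (1): predict the edge image for this scale from I_{t-1}
        _, I_t = P(I_t, G_t)         # step (2): strip under the guidance of G_t; P returns (C_t, I_t)
        results.append(I_t)          # step (3): I_t is fed back in the next iteration
        guides.append(G_t)
    return results, guides

# A user-supplied edge image may replace G(I_t) at any step, e.g. `_, I_t = P(I_t, user_edge_map)`.
```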
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. The present invention formally defines a general image separation problem and, by introducing the concept of scale-space filtering, focuses on a specific member of the image separation family of tasks: hierarchical image stripping. Through theoretical analysis the original image separation problem is simplified, converting the original complex problem into a series of small sub-problems and thereby greatly reducing its complexity.
2. The method belongs to the family of deep learning methods, needs only one forward-propagation operation per step at execution time and is therefore fast, and it can obtain multiple filtering results of different scales for one image. The invention introduces an edge image of the input image to control the multi-scale filtering results, which is more intuitive than controlling the degree of smoothing with hyper-parameters alone.
3. Compared with the numerical hyper-parameters of a model, the invention adopts a more intuitive and flexible way to generate filtering results of different scales, namely it replaces the hyper-parameter with a perceptually meaningful edge image.
4. Many existing tasks require a model to run in real time. As a core capability of the model, efficiency matters as much as usefulness. The invention designs a lightweight recurrent neural network model, namely a hierarchical image stripping network, so that the task of hierarchical image stripping can be completed efficiently and effectively, and both the supervised and the unsupervised setting can be handled flexibly. The model size is about 3.5 MB, and on a GTX 2080 Ti GPU each pass over a 1080p image runs at more than 60 fps, so the method has very strong practical value.
5. The method can realize hierarchical organization of the images, finally obtain multi-scale representation of the images, and can acquire interested information in different scales aiming at the images.
Drawings
FIGS. 1a and 1b are schematic diagrams illustrating general image filtering; fig. 1c to 1g are schematic diagrams of scale-space filtering.
Fig. 2 is a schematic diagram of the framework structure and the working process of the recurrent neural network model in this embodiment. The number below each network block in the figure denotes the number of channels output by the corresponding convolution module, and the letters K, S and D denote the convolution kernel size, the convolution stride and the dilation rate of the hole (dilated) convolution, respectively.
Fig. 3a to 3c show the input image, the guidance map and the resulting map, respectively, during the application of the method according to the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In the last decades, through research into how the human eye perceives the outside world, the computer vision and multimedia fields have paid increasing attention to organizing images hierarchically or at multiple scales. Taking image segmentation as an example, an image may be spatially segmented into a set of object instances or superpixels, which can serve as the basis for subsequent processing. Unlike image segmentation, this embodiment introduces another hierarchical image organization scheme from the point of view of information extraction, which is referred to as image separation to distinguish it from image segmentation.
The object of the invention is to organize images hierarchically and obtain a multi-scale representation of the image. Specifically, given an image I, multiple components C_i satisfying a hierarchical relationship are gradually stripped from the image, and adding these components at the pixel level recovers the original image I. Unlike image segmentation (which partitions the spatial domain), the present embodiment decomposes an image into a series of components from the perspective of scale-space filtering. The term image filtering, also referred to herein as image smoothing, means removing image texture while keeping the main structure of the image unchanged. The structure of an image generally corresponds to its edges and reflects the overall outline and shape of the objects in the image; the texture of an image refers to visual patterns that are distributed inside objects and appear repeatedly in a regular or irregular manner. Fig. 1a shows a filtering example: the dog, the dog's facial features, the rope pulling the dog and the edges of the square frame can be regarded as the structure of the image, while the densely packed small squares distributed over the dog's body can be regarded as texture details. Judging whether a pixel belongs to structure or texture is quite subjective; for example, in fig. 1a and 1b the eyes and mouth of the dog are judged to be structure and are therefore preserved rather than smoothed, but they could equally well be judged to be texture and removed, and neither choice is more correct than the other; it depends entirely on what the user wants the smoothed image to look like. In other words, the result of image filtering is not unique, and many different choices are possible. As shown in fig. 1c to 1g, scale-space filtering means processing a single input image to obtain multiple filtering results of different scales and different degrees, thereby describing the image at multiple scales so that the information of interest can be obtained at each scale.
The core of the invention is to design a real-time and controllable scale-space filtering method. For simplicity of explanation, let I_t denote the filtering result after the t-th stripping step and let P_t = C_1 + C_2 + … + C_t denote the accumulation of the components stripped so far, so that I_t and P_t satisfy I = I_t + P_t, where n denotes that n components are stripped in total from the original image I. In order to perform scale-space filtering effectively, the present embodiment provides a recurrent neural network model that cyclically outputs multiple filtering results of different scales. "Recurrent" means that the output of the neural network is fed back as its input to obtain a new output, which is again fed back, and so on until the loop termination condition is met. The recurrent neural network model can be trained both in a supervised and in an unsupervised manner, and it comprises a guidance network G and a stripping network P. The method takes the original image I as the input of the guidance network, which cyclically outputs multiple edge images G_t of the image at different scales; the edge image G_t and the filtering result I_{t-1} of the previous step (t = 1, 2, 3, …, T; the result of the zeroth step is the original image, i.e., I_0 = I) are then fed together into the stripping network, which outputs the filtering result I_t of the next step under the guidance of the edge image G_t. The filtering result I_t is fed back and, together with the edge image G_{t+1}, used again as input to the stripping network to obtain a new filtering result I_{t+1}; this operation is repeated until the number of loop steps reaches the set total number T. It should be emphasized that the filtering result I_t output by the stripping network maintains the same image structure as the input edge image G_t, i.e., the image stripping is performed under the guidance of G_t.
Specifically, Definition 1 (image separation): given an image I, separating it yields its components C = {C_1, C_2, …, C_n} such that I = C_1 + C_2 + … + C_n; this is called image separation, where n denotes that n components are separated from the image.
The object of the present embodiment can be expressed by Definition 1. ∇C_i denotes the first-order gradient of a decomposed component. The gradient can represent the structural and detail information of an image and is obtained by differencing adjacent pixels; that is, each value of the image gradient is the result of subtracting a pixel of the original image from its adjacent pixel. The calculation formula is ∇I(x, y) = ( I(x+1, y) − I(x, y), I(x, y+1) − I(x, y) ), where (x, y) is the position index of an image pixel. Generally, the gradient corresponding to the edge portions of an image is large. Because of the complex relationships among the multiple components C_i, directly decomposing multiple components from an image is very difficult. To make the problem easier to handle, the sequential image stripping introduced below reduces it to a series of sub-problems in which two components are iteratively stripped from an image.
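A minimal sketch of this forward-difference gradient (the zero padding at the image border is an assumption made so that the output keeps the input size):

```python
import torch

def image_gradient(img: torch.Tensor):
    """First-order forward differences of an image.

    grad_x(x, y) = I(x+1, y) - I(x, y) and grad_y(x, y) = I(x, y+1) - I(x, y);
    the last column/row is zero-padded so the outputs keep the input size.
    img: B x C x H x W tensor.
    """
    grad_x = torch.zeros_like(img)
    grad_y = torch.zeros_like(img)
    grad_x[:, :, :, :-1] = img[:, :, :, 1:] - img[:, :, :, :-1]
    grad_y[:, :, :-1, :] = img[:, :, 1:, :] - img[:, :, :-1, :]
    return grad_x, grad_y
```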
Theorem 1 (sequential image stripping): assume that for any t, [C_t, I_t] are the two components stripped from I_{t-1}. Iteratively stripping C_t from I_{t-1} in this way yields the same result as directly separating all components from the image at once.
Proof: by the hierarchical-stripping property, the non-zero elements of ∇I_t should be a subset of the non-zero elements of ∇I_{t-1}. By the structure-preserving property, ∇I_t and ∇C_t should have no correlation with each other, which can be written as ⟨∇I_t, ∇C_t⟩ = 0. Now, given I_t = I_{t+1} + C_{t+1} and ∇I_t = ∇I_{t+1} + ∇C_{t+1}, it follows that ⟨∇I_{t+1}, ∇C_t⟩ = 0 and ⟨∇C_{t+1}, ∇C_t⟩ = 0. Continuing in the same manner yields ⟨∇I_n, ∇C_k⟩ = 0 and ⟨∇C_j, ∇C_k⟩ = 0 for all j > k, together with I = I_n + C_1 + C_2 + … + C_n, which are exactly the conditions obtained by separating all components from the image directly. Theorem 1 is thus proved.
Hierarchical image stripping is a special member of the image separation family of tasks: it progressively separates/strips the components of the image step by step instead of separating all components from the original image at once. The above analysis converts the initial problem into a sequential processing problem, which naturally suggests solving it with a recurrent neural network. Each loop step can be regarded as performing one image stripping operation with a controllable retained structure. The goal of each loop step can generally be written in the form

[C_t, I_t] ← argmin_{C_t, I_t} Φ(C_t) + σ Ψ(I_t)  subject to  I_{t-1} = C_t + I_t,

where Φ(·) and Ψ(·) are the terms related to C_t and I_t, respectively, and σ is a hyper-parameter that controls the filtering/stripping strength. The formula expresses that the C_t and I_t output at each loop step should minimize the objective function Φ(C_t) + Ψ(I_t).

Many traditional methods, such as L0 [1], RGF [2], RTV [3] and muGIF [4], involve very computationally intensive operations such as matrix inversion, which is slow and limits their use in real-time situations. With a deep learning method, a deep neural network can be trained from the input-output point of view to imitate the effect of a traditional method (the result of the traditional method is used as supervision to train the deep neural network). Once training is completed, the neural network needs only one forward-propagation operation at execution time, so the required computation is greatly reduced. However, using the numerical parameter σ to control the degree of filtering/stripping is not intuitive, and it is difficult to obtain the desired result by adjusting such a hyper-parameter, which is after all only a number, a scalar. In contrast, guidance information that is visually meaningful and intuitive is more practical. Among the many candidate visually meaningful cues, edge images are a good choice, because an edge image is very simple and intuitively reflects the semantic features and the overall outline of the image. Unfortunately, in general the edge image used as guidance at each loop step is unknown. To solve this problem, the invention predicts a reasonable edge image for each loop step in advance with the guidance network G, i.e., G_t ← G(I_{t-1}), and then uses the predicted edge image to direct the stripping network P to strip I_t out of I_{t-1}. With the above considerations, the invention therefore adopts a cyclic strategy that repeatedly takes G_t as guidance and strips I_t from I_{t-1}, i.e., [C_t, I_t] ← P(I_{t-1}, G_t), where t denotes the t-th step of the loop. Notably, G_t can not only be output by the guidance network; the user is also allowed to create the edge image G_t by hand. All image pixel values used in the invention lie in the range [0, 1]; an image with a value range of [0, 255] can be converted by a simple normalization operation.
As shown in fig. 2, the recurrent neural network model of this embodiment logically comprises two modules: the guidance network G and the stripping network P. The stripping network P executes with the edge image G_t as its condition, and the edge image G_t is either output by the guidance network G or provided by the user. With this logical partition, the guidance network and the stripping network can be decoupled to a large extent, which further simplifies the problem. In addition, this partition greatly constrains the solution space of the original problem, so that the solution space becomes smaller, which benefits model compactness and training.
First, the stripping network. The stripping network can learn the effect of any existing traditional filtering method in a supervised manner (using the result of the traditional method as supervision during training), and can also be trained in an unsupervised manner. At each loop step, the core task of the stripping network is to accept the output I_{t-1} of the previous step as input and perform image filtering under the guidance of the edge image G_t, thereby stripping I_t out of I_{t-1}. Whatever the guiding edge image G_t looks like, the stripping result I_t should strictly follow its guidance. Since this part mainly introduces the stripping network, the edge image G_t is assumed to be given and known here; the next part describes in detail how the guidance network that produces G_t is designed. Based on the hard constraint I_{t-1} = I_t + C_t, the stripping network can be defined to take I_{t-1} as input and output either I_t or C_t: as long as one of the two is output, the other is obtained simply by subtracting the output from I_{t-1}. Furthermore, to take the context information of the image into account, the stripping network needs a large receptive field. Therefore, as shown in fig. 2, hole (dilated) convolutions are introduced to gradually enlarge the receptive field, with the dilation rate increasing exponentially. Although a larger receptive field could also be achieved by designing a deeper network, a very deep network has a very large number of parameters; to save the storage cost of the model, the receptive field is therefore enlarged not by deepening the network but by introducing hole convolutions, which do not increase the number of network parameters.
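A hedged PyTorch sketch of a stripping network of this kind follows: it takes I_{t-1} concatenated with the edge image G_t, uses hole (dilated) convolutions with exponentially increasing dilation rates to enlarge the receptive field without adding depth, and outputs the stripped component C_t so that I_t = I_{t-1} − C_t by construction. The channel widths, depth and dilation schedule are assumptions and not the exact configuration of fig. 2.

```python
import torch
import torch.nn as nn

class StrippingNet(nn.Module):
    """P(I_{t-1}, G_t) -> (C_t, I_t), with I_{t-1} = I_t + C_t enforced by construction."""
    def __init__(self, ch=24, dilations=(1, 2, 4, 8, 16, 1)):
        super().__init__()
        layers, in_ch = [], 4                        # 3 image channels + 1 edge-image channel
        for d in dilations:                          # exponentially increasing dilation rates
            layers += [nn.Conv2d(in_ch, ch, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
            in_ch = ch
        layers += [nn.Conv2d(ch, 3, 3, padding=1)]   # predict the stripped component C_t
        self.net = nn.Sequential(*layers)

    def forward(self, I_prev, G_t):
        C_t = self.net(torch.cat([I_prev, G_t], dim=1))
        I_t = I_prev - C_t                           # hard constraint: I_{t-1} = I_t + C_t
        return C_t, I_t
```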
Second, the guidance network. For an input image I_{t-1}, the filtering result obtained by setting a hyper-parameter σ_t (with different σ_t for different scales) can be expressed as Î_t = F(I_{t-1}, σ_t), where F denotes a specific filtering method and Î_t is the target result of the image filtering. The only thing the guidance network has to do is to establish the connection between the numerical value σ_t and the corresponding visually meaningful guidance map (in the present invention, an edge image). To achieve this goal, the gradient ∇Î_t of Î_t can be used as the supervision of G_t for training the guidance network. However, there is still a certain difference between an image gradient and an edge image in the true sense. A reasonable edge image should be both semantic and binary (a binary image takes values of 0 or 1): an edge image that considers semantics can reflect the objects that human eyes perceive in the image, and a binary edge image avoids ambiguity when judging whether a pixel belongs to an edge. To reduce the gap between the gradient and a true edge image, on the one hand, similarly to the stripping network, the guidance network also employs hole convolutions to enlarge the receptive field and better learn the contextual characteristics of the image; the context information obtained through hole convolutions clearly improves the network's perception of semantically meaningful objects. On the other hand, a Sigmoid activation function is added to the last layer of the network to force the output to be as close to binary as possible. Further details can be seen in the guidance-network part of fig. 2. However, the real supervision data Î_t is not always available during the training phase. Therefore, when Î_t is absent, the gradient ∇Î_t of Î_t should be approximated by the gradient ∇I_0 of I_0. The approximation does not need to be particularly accurate as long as it reflects the overall structure of the image. By repeatedly using the gradient ∇I_0 of I_0, a series of supervision maps of different scales can be constructed to train a reasonable guidance network.
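A corresponding hedged sketch of the guidance network: hole (dilated) convolutions for context and a Sigmoid on the last layer to push the output toward a binary edge image. The concrete layer configuration is again an assumption, and the feature-recycling variant discussed in the next paragraph is omitted for brevity.

```python
import torch
import torch.nn as nn

class GuidanceNet(nn.Module):
    """G(I_{t-1}) -> G_t, a single-channel edge image with values pushed toward {0, 1}."""
    def __init__(self, ch=24, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        layers, in_ch = [], 3
        for d in dilations:                          # growing receptive field via dilation
            layers += [nn.Conv2d(in_ch, ch, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
            in_ch = ch
        layers += [nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid()]   # near-binary edge output
        self.net = nn.Sequential(*layers)

    def forward(self, I_prev):
        return self.net(I_prev)
```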
Why is the edge image G_t not obtained directly with an existing edge detection method, such as the traditional edge detection operators Roberts, Sobel and Canny, or other edge detection methods based on deep learning? In theory this is feasible, but existing methods suffer from sensitivity to noise, inaccurate edge localization, overly thick detected edges, failure to satisfy the hierarchy property, and so on. In addition, the basic idea of the deep learning methods is to learn multi-scale features of a single input image and then fuse these multi-scale features to obtain the final predicted edge map. Existing deep learning methods acquire the multi-scale features by means of a pre-trained image classification network such as AlexNet or VGG16: they use 1×1 convolutions to fuse the features output by different layers of the pre-trained classification network and then obtain the predicted edge map from the fused features, so the features of different layers of the pre-trained classification network act as the multi-scale features. As shown in fig. 2, the framework of this embodiment can overcome these problems of existing deep learning methods, because the invention reuses the parameters of the network and cyclically feeds the feature F_{t-1} of the previous step back in to obtain the next feature F_t, where F_t has a larger receptive field than F_{t-1}.
The neural network model of this embodiment is trained as follows. For different scales, different σ_t need to be set; I_t can be regarded as the filtering result obtained by setting a different σ_t, and each σ_t has a corresponding edge image G_t. T denotes the total number of loop steps, and this embodiment provisionally sets T = 4. It should be noted that the recurrent neural network may loop any number of times, not only T times. The interval between the filtering degrees of consecutive steps is controlled by σ_t, which can be adjusted to specific needs.
The loss functions used during network training apply to both the supervised and the unsupervised training mode, and comprise a guidance consistency loss, a stripping reconstruction loss, a stripping retention loss and a stripping consistency loss. For supervised image stripping, Ĝ_t and Î_t exist and are available, where Î_t can be generated by any existing image filtering method and serves as supervision when training the neural network; for unsupervised image stripping, Î_t is not available.
The specific forms of the loss functions are described below. The guidance consistency loss ensures that the edge image G_t output by the guidance network stays consistent with Ĝ_t. Let 1 denote an all-ones image of the same size as I_t, and let ∘ denote the Hadamard product, i.e., pixel-wise multiplication of two images at the same positions. If Ĝ_t is not available, it can be obtained by first binarizing the (approximated) gradient ∇Î_t and then merging the result with G_gr, for example

Ĝ_t ← bin(∇Î_t),   Ĝ_t ← 1 − (1 − Ĝ_t) ∘ (1 − G_gr),

where G_gr serves to further enhance the important edges in the image; G_gr can either be a manually marked edge image or the binarization of ∇I_0. For simplicity of explanation, Ĝ_t is used below to denote the guidance supervision in both the supervised and the unsupervised case. The guidance consistency loss can then be expressed as

ℓ_g = || Ĝ_t ∘ (1 − G_t) ||_1 + β_g || (1 − Ĝ_t) ∘ G_t ||_1,

where ||·||_1 denotes the 1-norm and β_g is a constant used to balance the two terms of the loss. By comparing the magnitudes of the two terms, i.e., the responses at the positions marked as edges in Ĝ_t and at the remaining positions, it is judged whether the pixel of G_t at a certain position belongs to the edge image.
The stripping reconstruction loss requires the output result I_t and Î_t to be as close as possible in color space. Let ||·||_2 denote the 2-norm; the loss takes the form

ℓ_r = || I_t − Î_t ||_2.
the purpose of peel retention loss is to retain I t The gradient of the moieties in (a) is unchanged. I t Pixels and G identified as belonging to a structure t Pixels with median value close to 1 have the same position, in other words, I t Is a knot of (2)The pixels correspond to G t Pixels with a middle pixel value close to 1 (considered to belong to the edge image). Since the gradient of the image can naturally reflect the structural information of the image, the definition of the peel holding loss is I t And
Figure GDA00042345315000001118
the distance between gradients of (i.e.)>
Figure GDA00042345315000001119
And->
Figure GDA00042345315000001120
Distance between:
Figure GDA0004234531500000121
the peel consistency loss severely constrains the image peeling process such that the result of the peeling is consistent with G t And keep the same. Peel consistency loss vs. I t To a different degree. For I t The penalty is smaller for pixels belonging to the structure and larger for pixels belonging to the texture. The specific form of peel consistency loss is:
Figure GDA0004234531500000122
where e is used to avoid denominator 0. In order to make the training more stable and to increase the convergence speed during training, the bootstrap network and the peeling network are each trained independently. G is required due to the network stripping t The stripping process is guided as input, so that the guiding network is trained with the guided consistency loss, the parameters of the trained guiding network are fixed, and the stripping network is trained with the stripping reconstruction loss, the stripping maintenance loss, and the stripping consistency loss.
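The training losses can be sketched as follows. Because the loss formulas above are themselves reconstructions of garbled equations, every formula in this sketch should be read as an approximation of the patent's losses rather than their authoritative form; `grad_mag` is a simple forward-difference gradient magnitude.

```python
import torch

def grad_mag(img):
    """Sum of absolute forward differences, zero-padded to keep the input size."""
    gx = torch.zeros_like(img)
    gy = torch.zeros_like(img)
    gx[..., :, :-1] = (img[..., :, 1:] - img[..., :, :-1]).abs()
    gy[..., :-1, :] = (img[..., 1:, :] - img[..., :-1, :]).abs()
    return gx + gy

def stripping_losses(I_t, I_prev, G_t, I_hat=None, eps=1e-3):
    """Approximate forms of the three stripping losses for one loop step."""
    losses = {}
    if I_hat is not None:  # stripping reconstruction loss (supervised case): match I_hat in color space
        losses["reconstruction"] = ((I_t - I_hat) ** 2).mean()
    # stripping retention loss: keep gradients unchanged where G_t marks structure
    losses["retention"] = (G_t * (grad_mag(I_t) - grad_mag(I_prev)).abs()).mean()
    # stripping consistency loss: smooth non-edge regions more strongly; eps avoids a zero denominator
    losses["consistency"] = (grad_mag(I_t) / (G_t + eps)).mean()
    return losses

def guidance_loss(G_t, G_hat, beta_g=0.5):
    """Approximate two-term guidance consistency loss balanced by beta_g."""
    return (G_hat * (1 - G_t)).mean() + beta_g * ((1 - G_hat) * G_t).mean()
```

In the two-stage scheme described above, `guidance_loss` would be minimized first to train the guidance network; with its parameters then frozen, the three stripping losses would train the stripping network.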
Further, to demonstrate the significant progress of the method of the present invention, the following is presented in conjunction with some experimental results:
First, in order to make full use of the multi-scale nature of the method, a new strategy is proposed for applying it to the saliency detection task. Image saliency detection uses a computer algorithm to simulate the visual characteristics of human beings and extract the salient regions of an image. The salient regions can be regarded as the parts that most attract the human eye when looking at an image; in general, regions that stand out, have strong contrast or change in color attract attention. Saliency detection is closely related to the selective processing of content by the human visual system; its goal is to locate the important and salient regions or objects in an image, and it is an important and popular research direction in computer vision. This embodiment first uses the existing saliency detection models CSF [6] and EGNet [7] to perform saliency detection on the original image I and on the four filtering results generated by the method of the invention (five images in total), and then trains a lightweight network (only 91 KB) on the DUTS-TR [8] dataset to predict a better saliency map from the five saliency detection results. Following the evaluation practice in the related literature, the quality of saliency detection is measured with the mean absolute error, computed as MAE(S_o, S_gt) := mean(|S_o − S_gt|), where S_o is the saliency map output by the model and S_gt is the ground-truth saliency map (manually annotated). The evaluation datasets of this embodiment are the public saliency detection datasets ECSSD [9], PASCAL-S [10], HKU-IS [11], SOD [12] and DUTS-TE [8]. The results show that the method can effectively improve the performance of existing saliency detection models, because some features useful for saliency detection become more prominent at certain scales, and removing unwanted textures helps to enhance the contrast of the salient regions. Besides saliency detection, the invention can flexibly enhance the performance of many other vision and graphics models.
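The mean-absolute-error criterion used here is simply the pixel-wise average of |S_o − S_gt|; a one-function sketch:

```python
import torch

def mae(S_o: torch.Tensor, S_gt: torch.Tensor) -> float:
    """MAE(S_o, S_gt) := mean(|S_o - S_gt|) for saliency maps with values in [0, 1]."""
    return (S_o - S_gt).abs().mean().item()
```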
In addition, this embodiment is compared with other filtering methods. The traditional methods used for comparison include L0 [1], RTV [3], RGF [2], SD [17], muGIF [4], realLS [18] and enBF [19]; the deep learning methods include DEAF [15], FIP [16] and PIO [5]. In the visual results, Ours-S (whose supervision Î_t is generated by muGIF) and its unsupervised counterpart represent the models obtained by supervised and unsupervised training, respectively. To evaluate the quality of the results, the gradient correlation coefficient (GCC) between the stripped texture and structure (i.e., between ∇C_t and ∇I_t) is used to evaluate how uncorrelated the two are. For fairness, the hyper-parameters of the compared methods are also carefully adjusted so that all methods reach a similar degree of filtering/smoothing.
Table 1: quantitative comparison of the GCC and the execution speed of each method (the table itself is reproduced as an image in the original publication; it lists the GCC value and the CPU/GPU run time of every compared method).
Note 1: quantitative comparison with respect to GCC. For a fair comparison, the degree of smoothing/filtering was controlled at 0.146 ± 0.01 for all compared methods. The best results are shown in bold. A smaller GCC value indicates a better result.
Note 2: run-time comparison when processing 1080p (1627×1080) images. CPU times in the table carry no mark, while GPU times are specially marked.
From the quantitative results of Table 1 it can be seen that, compared with the other methods, the method of the invention ranks first on the GCC index, which shows that the ∇I_t and ∇C_t obtained with the recurrent neural network framework of the invention satisfy the property of mutual orthogonality well. In addition, the recurrent neural network model executes much faster than the traditional methods, whether running on a CPU or a GPU. Thanks to the advantages of deep learning, the recurrent neural network model and PIO reach real-time speed when processing 1080p images. In terms of visual effect, it is observed that as the degree of filtering/smoothing increases, the visual results of L0, RGF and PIO become very poor, and PIO additionally suffers from a severe color-shift problem. RTV and muGIF perform relatively well, but neither method completely smooths or preserves certain regions of the image; in contrast, the present method achieves visually pleasing results both in smoothing the texture details and in preserving the main structure/edges of the image. It should be noted that, except for the method of the invention, neither the traditional methods nor the deep learning methods can use an edge image whose scale changes step by step, or an edge image provided/edited by the user, as the guidance image and generate a filtering result that follows the instructions of that guidance image. As shown in figs. 3a to 3c, the flexibility of the model is verified with a manually edited guidance map, which combines edge images of four different scales into one map. Compared with the other methods, only the method of the invention can successfully output a filtering result whose structure is consistent with the guidance image.
The invention also evaluates the edge images output by the guidance network. Because of the multi-scale nature of the framework, the edge image output by the guidance network at each loop step can be used to construct an edge confidence map. The edge confidence map is itself an edge image, except that the value of each pixel is not necessarily close to 0 or 1; many pixel values may lie near 0.5, because the value of each pixel in the confidence map can be interpreted as the probability that the pixel belongs to an edge, and the larger the value, the more likely the pixel is an edge. Specifically, the manually annotated ground-truth edges of the BSDS500 dataset are used as G_gr to train the guidance network in the unsupervised setting; at execution time the guidance network is iterated 24 times, and the 24 edge images obtained are averaged to form the edge confidence map. The constructed edge confidence map is evaluated quantitatively with a precision-recall curve, and non-maximum suppression is applied to the confidence map before evaluation.
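The construction of the edge confidence map (averaging the edge images of 24 loop iterations) can be sketched as follows; the non-maximum suppression step is omitted, and advancing the loop with the stripping network P between iterations is an assumption about how successive edge images of different scales are produced.

```python
import torch

@torch.no_grad()
def edge_confidence_map(G, P, image, steps: int = 24):
    """Average the guidance network's edge images over `steps` loop iterations.

    Each value of the returned map can be read as the probability that the pixel is an edge.
    """
    acc, I_t = 0.0, image
    for _ in range(steps):
        G_t = G(I_t)             # edge image at the current scale
        _, I_t = P(I_t, G_t)     # advance one stripping step so later edge images become coarser
        acc = acc + G_t
    return acc / steps           # non-maximum suppression would be applied afterwards for evaluation
```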
Finally, an ablation experiment on the recurrent neural network model is described. Since the guidance network has only the single guidance consistency loss, an ablation analysis of its loss function is neither necessary nor meaningful. With I_{t-1} as input, there are two possible implementations of the stripping network: one outputs I_t, the other outputs C_t. The invention prefers to output C_t, i.e., C_t ← P(I_{t-1}, G_t), because C_t contains less information than I_t and has a simpler distribution.
In addition, the technical scheme of the invention can also be replaced by the following alternatives:
Alternative scheme 1: the edge image G_t is not output by the guidance network but is obtained by any existing deep or non-deep edge detection method.
Alternative scheme 2: each loop step is not a single forward-propagation operation but iterates two or three more times, which is equivalent to embedding a small loop inside the large loop. In the recurrent neural network model of the invention, each loop step involves only one forward operation, and no small loop is embedded in the large loop.
Alternative scheme 3: the guidance network and the stripping network interact not by taking the output of the guidance network as the input of the stripping network, but by directly using the guidance network to output the model parameters of the stripping network.
The invention is not limited to the embodiments described above. The above description of specific embodiments is intended to describe and illustrate the technical aspects of the present invention, and is intended to be illustrative only and not limiting. Numerous specific modifications can be made by those skilled in the art without departing from the spirit of the invention and scope of the claims, which are within the scope of the invention.
References:
[1] Li Xu, Cewu Lu, Yi Xu, and Jiaya Jia. Image smoothing via L0 gradient minimization. TOG, 30(6):112, 2011.
[2] Qi Zhang, Xiaoyong Shen, Li Xu, and Jiaya Jia. Rolling guidance filter. In ECCV, 2014.
[3] L. Xu, Q. Yan, Y. Xia, and J. Jia. Structure extraction from texture via relative total variation. TOG, 31(6):139, 2012.
[4] X. Guo, Y. Li, J. Ma, and H. Ling. Mutually guided image filtering. TPAMI, 42(3):694-707, 2020.
[5] Qingnan Fan, Dongdong Chen, Lu Yuan, Gang Hua, Nenghai Yu, and Baoquan Chen. A general decoupled learning framework for parameterized image operators. TPAMI, 2019.
[6] Shang-Hua Gao, Yong-Qiang Tan, Ming-Ming Cheng, Chengze Lu, Yunpeng Chen, and Shuicheng Yan. Highly efficient salient object detection with 100k parameters. In ECCV, 2020.
[7] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng. EGNet: edge guidance network for salient object detection. In ICCV, Oct 2019.
[8] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166-3173, 2013.
[9] Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In CVPR, pages 1155-1162, 2013.
[10] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In CVPR, pages 280-287, 2014.
[11] Guanbin Li and Y. Yu. Visual saliency based on multiscale deep features. In CVPR, pages 5455-5463, 2015.
[12] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[13] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. Deep joint demosaicking and denoising. TOG, 35(6):1-12, 2016.
[14] Sifei Liu, Jinshan Pan, and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. In ECCV, 2016.
[15] Li Xu, Jimmy S. J. Ren, Qiong Yan, Renjie Liao, and Jiaya Jia. Deep edge-aware filters. In ICML, 2015.
[16] Qifeng Chen, Jia Xu, and Vladlen Koltun. Fast image processing with fully-convolutional networks. In ICCV, pages 2516-2525, 2017.
[17] Bumsub Ham, Minsu Cho, and Jean Ponce. Robust guided image filtering using nonconvex potentials. TPAMI, 40(1):192-207, 2017.
[18] Wei Liu, Pingping Zhang, Xiaolin Huang, Jie Yang, Chunhua Shen, and Ian Reid. Real-time image smoothing via iterative least squares. TOG, 39(3):28, 2020.
[19] Wei Liu, Pingping Zhang, Xiaogang Chen, Chunhua Shen, Xiaolin Huang, and Jie Yang. Embedding bilateral filter in least squares for efficient edge-preserving image smoothing. TCSVT, 30(1):23-35, 2020.
[20] John Canny. A computational approach to edge detection. TPAMI, 8(6):679-698, 1986.
[21] David R. Martin, Charless C. Fowlkes, and Jitendra Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. TPAMI, 26(5):530-549, 2004.
[22] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. TPAMI, 33(5):898-916, 2011.
[23] Zhile Ren and Gregory Shakhnarovich. Image segmentation by cascaded region agglomeration. In CVPR, pages 2011-2018, 2013.
[24] Piotr Dollár and C. Lawrence Zitnick. Structured forests for fast edge detection. In CVPR, pages 1841-1848, 2013.
[25] Wei Shen, Xinggang Wang, Yan Wang, Xiang Bai, and Zhijiang Zhang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In CVPR, pages 3982-3991, 2015.
[26] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In CVPR, 2015.
[27] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In CVPR, 2017.

Claims (5)

1. A real-time and controllable scale-space filtering method, characterized in that, based on a cyclic neural network model comprising a guiding network G and a stripping network P, the method comprises the following steps:
(1) The original image I is taken as the input of the guiding network G, which outputs, in a cyclic manner, a plurality of edge images G_t of the image at different scales; the edge image G_t and the filtering result I_{t-1} are then fed together into the stripping network P. Here t denotes the t-th loop step, t = 1, 2, 3, ..., T, and I_0 = I denotes the original image. The filtering result of the input image is obtained by setting a hyper-parameter, and the guiding network G is used to establish the link between the hyper-parameter and the edge image G_t; different hyper-parameters are set for different scales, each hyper-parameter corresponding to one edge image G_t.
The loss functions used during network training apply to both the supervised and the unsupervised training mode and comprise a guidance consistency loss, a stripping reconstruction loss, a stripping maintenance loss and a stripping consistency loss. For supervised image stripping, a target filtering result Î_t and a target edge image Ĝ_t are available: Î_t is the target result of the image filtering, produced by applying a filtering method F with the scale hyper-parameter σ_t (a different σ_t is used for each scale), and it can be generated by any existing image filtering method to serve as supervision when training the neural network; Ĝ_t is derived from the gradient ∇Î_t of Î_t and is used as the edge image G_t when training the guiding network G. The image gradient represents the structure and detail information of the image and is obtained by differencing adjacent pixels, i.e., each value of the image gradient is the result of subtracting a pixel of the original image from its adjacent pixel at the corresponding position. Ĝ_t can likewise be generated by any existing image filtering method and used as supervision during training. For unsupervised image stripping, such targets are not given in advance; with I_0 = I denoting the original image, Ĝ_t is constructed from the gradient ∇I_0 of I_0.
The guidance consistency loss ensures that the edge image G_t output by the guiding network stays consistent with Ĝ_t. Let 1 denote an image of the same size as I_t whose pixel values are all 1, and let ⊙ denote the Hadamard product, i.e., the pixel-wise multiplication of pixels at the same positions of two images. If Ĝ_t is not available, it is constructed with the help of G_gr, which serves to further enhance the important edges in the image; G_gr can either be a manually annotated edge image or a binarization of the gradient-derived edge map. σ_t is a hyper-parameter, and different scales require different values of σ_t; each σ_t has a corresponding edge image G_t, and the interval of each filtering degree is controlled by σ_t. The guidance consistency loss is an ℓ1-norm penalty whose two terms are balanced by a constant β_g; by comparing the magnitudes of these two terms it decides whether the pixel at a given position of G_t belongs to the edge image.
The stripping consistency loss constrains the image stripping process so that the stripping result stays consistent with G_t. It smooths I_t to different degrees: pixels of I_t belonging to structures receive a small penalty, while pixels belonging to textures receive a large penalty; in its specific form a small constant ε appears in the denominator to avoid division by zero. The guiding network and the stripping network are trained independently: because the stripping network needs G_t as input to guide the stripping process, the guiding network is first trained with the guidance consistency loss, its parameters are then fixed, and the stripping network is trained with the stripping reconstruction loss, the stripping maintenance loss and the stripping consistency loss.
(2) Under the guidance of the edge image G_t, the stripping network P outputs the filtering result I_t of the next step.
(3) The filtering result I_t is fed back and, together with the edge image G_{t+1}, serves as the input to the stripping network, yielding a new filtering result I_{t+1}; this operation is repeated until the number of loop steps reaches the preset total number T. The filtering result I_t output by the stripping network P maintains the same image structure as the input edge image G_t, i.e., image stripping is performed under the guidance of G_t.
2. A real-time and controllable scale-space filtering method according to claim 1, characterized in that the stripping network P is capable of hierarchically stripping individual components from the image: the structures or edges contained in the filtering result of each step are a subset of those contained in the filtering result of the previous step.
3. A real-time and controllable scale-space filtering method according to claim 1, characterized in that for pixels of the edge image G_t that belong to edges, the gradients of the pixels at the same locations in the filtering result I_t should remain unchanged, while for pixels of G_t that do not belong to edges, the pixels at the same locations in the filtering result I_t are smoothed all the more thoroughly.
4. A real-time and controllable scale-space filtering method according to claim 1, characterized in that the edge image G_t can be obtained by any existing deep or non-deep edge image detection method.
5. A real-time and controllable scale-space filtering method according to claim 1, characterized in that each loop step in the filtering method is implemented either by a single forward propagation operation or by iterating the operation two or three times.
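For illustration only, the following Python sketch shows the loop structure described in claim 1. It assumes G_net and P_net stand for trained implementations of the guiding network G and the stripping network P, and that each per-scale hyper-parameter σ_t is passed to the guiding network explicitly; these callables, and the way σ_t is supplied, are assumptions of the sketch rather than details fixed by the claim, and the loss functions are not reproduced here.

def scale_space_filtering(G_net, P_net, image, sigmas):
    """Run the recurrent model: at step t the guiding network predicts the edge
    image G_t of the original image for the scale hyper-parameter sigma_t, and
    the stripping network, guided by G_t, maps I_{t-1} to the next filtering
    result I_t. The total number of loop steps T equals len(sigmas)."""
    I = image                           # I_0 = original image
    results = []
    for sigma_t in sigmas:              # t = 1 .. T
        G_t = G_net(image, sigma_t)     # edge image for this scale (assumption: sigma passed explicitly)
        I = P_net(I, G_t)               # one forward pass of the stripping network per step
        results.append((G_t, I))
    return results                      # (G_t, I_t) pairs for t = 1 .. T

Consistent with claim 1, training would proceed in two stages: the guiding network is first trained with the guidance consistency loss, its parameters are then fixed, and the stripping network is trained with the stripping reconstruction, stripping maintenance and stripping consistency losses.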
CN202110172012.2A 2021-02-08 2021-02-08 Real-time and controllable scale space filtering method Active CN112862715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110172012.2A CN112862715B (en) 2021-02-08 2021-02-08 Real-time and controllable scale space filtering method

Publications (2)

Publication Number Publication Date
CN112862715A CN112862715A (en) 2021-05-28
CN112862715B (en) 2023-06-30

Family

ID=75989229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110172012.2A Active CN112862715B (en) 2021-02-08 2021-02-08 Real-time and controllable scale space filtering method

Country Status (1)

Country Link
CN (1) CN112862715B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931192A (en) * 2016-03-21 2016-09-07 温州大学 Image texture filtering method based on weighted median filtering
CN109118451A (en) * 2018-08-21 2019-01-01 李青山 A kind of aviation orthography defogging algorithm returned based on convolution
CN110276721A (en) * 2019-04-28 2019-09-24 天津大学 Image super-resolution rebuilding method based on cascade residual error convolutional neural networks

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8406564B2 (en) * 2008-09-24 2013-03-26 Microsoft Corporation Removing blur from an image
CN102521798B (en) * 2011-11-11 2015-07-15 浙江捷尚视觉科技股份有限公司 Image automatic recovering method for cutting and selecting mask structure based on effective characteristic
CN107633490B (en) * 2017-09-19 2023-10-03 北京小米移动软件有限公司 Image processing method, device and storage medium
CN107844751B (en) * 2017-10-19 2021-08-27 陕西师范大学 Method for classifying hyperspectral remote sensing images of guide filtering long and short memory neural network
CN107622481B (en) * 2017-10-25 2022-09-30 东软医疗系统股份有限公司 Method and device for reducing CT image noise and computer equipment
CN108280831B (en) * 2018-02-02 2020-04-21 南昌航空大学 Method and system for acquiring image sequence optical flow
CN108492308B (en) * 2018-04-18 2020-09-08 南昌航空大学 Method and system for determining variable light split flow based on mutual structure guided filtering
CN109272539A (en) * 2018-09-13 2019-01-25 云南大学 The decomposition method of image texture and structure based on guidance figure Total Variation
CN109450406B (en) * 2018-11-13 2022-09-23 中国人民解放军海军航空大学 Filter construction method based on recurrent neural network
CN109978764B (en) * 2019-03-11 2021-03-02 厦门美图之家科技有限公司 Image processing method and computing device
CN110009580B (en) * 2019-03-18 2023-05-12 华东师范大学 Single-picture bidirectional rain removing method based on picture block rain drop concentration
CN110246099B (en) * 2019-06-10 2021-09-07 浙江传媒学院 Image de-texturing method for keeping structure edge
CN110910317B (en) * 2019-08-19 2020-08-14 北京理工大学 Tongue image enhancement method
CN110689021A (en) * 2019-10-17 2020-01-14 哈尔滨理工大学 Real-time target detection method in low-visibility environment based on deep learning
CN110991463B (en) * 2019-11-04 2023-05-02 同济大学 Multi-scale guide filtering feature extraction method under guidance of super-pixel map
CN111275642B (en) * 2020-01-16 2022-05-20 西安交通大学 Low-illumination image enhancement method based on significant foreground content
CN111462012A (en) * 2020-04-02 2020-07-28 武汉大学 SAR image simulation method for generating countermeasure network based on conditions
CN111626330B (en) * 2020-04-23 2022-07-26 南京邮电大学 Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation
CN111639471B (en) * 2020-06-01 2022-05-27 浙江大学 Electromagnetic interference filter design method based on recurrent neural network
CN112132753B (en) * 2020-11-06 2022-04-05 湖南大学 Infrared image super-resolution method and system for multi-scale structure guide image

Also Published As

Publication number Publication date
CN112862715A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
Hao et al. Low-light image enhancement with semi-decoupled decomposition
Li et al. Single image dehazing via conditional generative adversarial network
Brox et al. Unsupervised segmentation incorporating colour, texture, and motion
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
JP5645842B2 (en) Image processing apparatus and method using scale space
Messaoud et al. Structural consistency and controllability for diverse colorization
WO2009142858A2 (en) Geodesic image and video processing
CA3137297C (en) Adaptive convolutions in neural networks
KR102311796B1 (en) Method and Apparatus for Deblurring of Human Motion using Localized Body Prior
Fu et al. Edge-aware deep image deblurring
CN111681198A (en) Morphological attribute filtering multimode fusion imaging method, system and medium
Feng et al. URNet: A U-Net based residual network for image dehazing
Que et al. Attentive composite residual network for robust rain removal from single images
Qin et al. Etdnet: An efficient transformer deraining model
DiPaola et al. Using artificial intelligence techniques to emulate the creativity of a portrait painter
CN114581571A (en) Monocular human body reconstruction method and device based on IMU and forward deformation field
CN115880720A (en) Non-labeling scene self-adaptive human body posture and shape estimation method based on confidence degree sharing
Kratzwald et al. Improving video generation for multi-functional applications
CN112862715B (en) Real-time and controllable scale space filtering method
CN116342377A (en) Self-adaptive generation method and system for camouflage target image in degraded scene
Yeung et al. Extracting smooth and transparent layers from a single image
Paliwal et al. Multi-stage raw video denoising with adversarial loss and gradient mask
Li et al. A review of image colourisation
CN113379715A (en) Underwater image enhancement and data set true value image acquisition method
Wolf et al. Instance separation emerges from inpainting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant