WO2023128790A1

WO2023128790A1 - Image segmentation with super resolution and frequency domain enhancements

Info

Publication number: WO2023128790A1
Application number: PCT/RU2021/000628
Authority: WO
Inventors: Dmitry Vladimirovich Dylov; Oleg Yur'yevich ROGOV; Vito Michele LELI; Aleksandr Yevgenyevich SARACHAKOV; Viktor SHIPITSIN
Original assignee: Autonomous Non-Profit Organization For Higher Education "Skolkovo Institute Of Science And Technology"
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2023-07-06

Abstract

The invention proposes to improve image segmentation pipeline by using an adaptive module for image segmentation comprising one or more artificial agents based on reinforcement learning, wherein one of the agents is implemented as an additional layer of an image segmentation neural network and is configured to obtain an optimal frequency domain image processing in terms of a given metric. Corresponding methods and system are proposed intended to improve image segmentation results and allow the existing segmentation means to be used with the most simple and commercially available image capturing devices and provide good results, especially in terms of IoU score.

Description

IMAGE SEGMENTATION WITH SUPER RESOLUTION AND FREQUENCY DOMAIN ENHANCEMENTS

FIELD OF THE INVENTION

The following generally relates to improving the efficiency of computer vision in medicine, and in particular to image segmentation.

BACKGROUND OF THE INVENTION

According to analytical studies, the number of performed blood tests can be estimated as more than 1.5 million per day, with roughly 45% of them involving discomfort of various degrees of severity, e.g., rashes, hematomas, or damaged veins due to repeated venipuncture. The risk group includes people with obesity (2.1 billion people in the world), diabetes (415 million), chronic venous damage (30 million), infants and children up to 10 years (more than 1.5 billion).

The peripheral difficult venous access (PDVA) problem is characterized by poorly discernible and non-palpable veins, such that even a highly experienced doctor resorts to technological aids for guiding the needle of the venipuncture device. The most frequent causes are thin or deep veins, excessive adipose tissue layers, loss of color contrast due to the tone or the hairiness of the skin, edemas, and prior puncture damage.

Unfortunately, revealing the location of the vein by touching the patient’s hand increases the likelihood of sample contamination and, as a result, the likelihood of a false test result. Also, incorrect detection can lead to hematomas, fainting, nerve damage and similar negative consequences for the patient. There are techniques that can visualize blood vessels better than the naked human eye (e.g., CT and ultrasound imaging). However, they are quite expensive and require special procedures, which limits their wide use during injections. In this context, the near-infrared (NIR) reflection method is cheaper and allows non-invasive detection, which makes it a promising technique for wide use. The NIR image can clearly show the contours of the veins on a separate screen or directly on the body part of the patient using a visible light projector. The use of such instruments is reported to result in improvement of the catheterization process by 81%, in reduction of the procedure time by 78%, and in doubling of the first-time venipuncture success rate in the pediatric care segment.

The pending invention is focused on improving the segmentation of NIR images by using a multi-agent approach with some peculiarities accounting for the NIR imaging field. The inherent noise present in the NIR sensors affects the performance of the embedded vein image segmentation algorithms, causing the projected mask to be insufficiently accurate and/or misaligned with the true position of the vasculature. Besides a missing functional feedback loop, required for fixing the corrupted projection, the development of a universal vein scanner is also hindered by other algorithmic and image processing challenges. These stem from the high variation in the vasculature contrast and the dynamic range among the patients (e.g., due to the skin tone, thickness, etc.), as well as the size, flatness, and position of the imaged body part.

Image segmentation is a very broad domain of computer vision. In medical image segmentation, the U-Net encoder-decoder model must be mentioned. At the moment there are various modifications, each of which brings improvements and represents a state-of-the-art solution, such as Attention U-Net, U-Net++, and U-Net 3+. Accordingly, numerous techniques using deep learning have already been proposed for medical image segmentation.

One of the main problems in vein image segmentation is the lack of a clear pattern (or even a pattern) of veins in visible light. However, the infrared light could help us when veins are not easily visible. It penetrates the skin about 3 mm. So, infrared images are used for biometric identification of a person by highlighting the individual characteristics of the vein pattern. In other words, low initial image quality and the need to use several different filters for preprocessing is an issue to be solved by the present invention.

The most widely used principle has been to employ fully convolutional segmentation networks with minor modifications and train them on custom datasets. Standard approaches have been successfully developed in similar known solutions. For example, CN107492071 proposes utilizing a segmentation algorithm based on deep learning to obtain a normalized image labeled with the attention area; multiplying the pixel matrix of the normalized image and the pixel matrix of the medical image to obtain a pre-reconstruction image; and performing super-resolution reconstruction on the pre-reconstruction image by using a super-resolution algorithm based on deep learning to obtain a reconstructed image. Various other prior art solutions are known that use neural networks to obtain some improvements in segmentation, but they lack precision in terms of the Intersection over Union (IoU) score. This is an important metric, particularly for vein imaging, since in industrial vein scanners precision is crucial to exclude errors during injections and the attendant damage.

To overcome the problem of low quality image that is further used for segmentation, according to one aspect of the present invention it is proposed to use the wave nature of images - the frequency-domain here carries some information that can be useful for improved image segmentation.

Using the direct and inverse Fourier transform, it is possible to switch from the spatial domain of the image description to the frequency domain and vice versa. Considering the image in frequency space makes it possible to account for the physics of infrared examination. Fourier analysis has been successfully used for dynamic structure segmentation problems, where dynamic structures were distinguished using only the phase spectrum. What is more, such a technique can be of particular use in medical semantic segmentation using domain adaptation, where the frequency amplitudes of the source and target images are explored, and in various high-frequency low-dimensional regression problems, where Fourier features improved the results of coordinate-based multilayer perceptrons for image regression, 3D shape regression, MRI reconstruction, and inverse rendering tasks.

In view of all the above-mentioned problems and special aspects, it is an object of this invention to provide an improved approach to image segmentation that can be used with the simplest commercially available image capturing devices and still provide good results, especially in terms of IoU score.

SUMMARY OF THE INVENTION

The aspects of the invention described herein address the above-referenced problem of improving the image segmentation, while the procedures of are not covered in detail by the present description. According to particular embodiments, the proposed invention involves using a frequency-space trainable layer for improving quality in different computer vision tasks, such as segmentation and corruption restoration. The invention also touches upon the problem of dataset augmentation for training models that are more resistant to frequency noise. Some of the embodiments also apply a technique for additionally improving the vein segmentation pipeline by employing a super-resolution step of image processing.

For solving the above mentioned problems of image segmentation, according to one of the aspects of the proposed invention it is offered to use an adaptive module for image segmentation comprising one or more artificial agents based on reinforcement learning, wherein one of the agents is implemented as an additional layer of an image segmentation neural network and is configured to obtain an optimal frequency domain image processing in terms of a given metric. According to another aspect, the invention relates to a system for image segmentation, the system comprising: a pretrained baseline segmentation neural network; wherein the pretrained baseline segmentation neural network comprises an additional trainable layer that has been trained along with the segmentation neural network.

According to some of the embodiments, the additional trainable layer is configured for frequency domain image enhancement, including at least one of denoising and corruption restoration.

According to some of the embodiments, the system further comprises a super-resolution generative adversarial neural network.

According to some of the embodiments, the system is used for obtaining a veins mask from the image.

According to another aspect, the present invention relates to a method of image segmentation for obtaining a veins mask from the image, the proposed method comprising the steps of: using a super-resolution neural network to obtain an image with higher resolution; using a pretrained baseline segmentation neural network to obtain a veins mask from the image with higher resolution, wherein the segmentation neural network comprises an additional trainable layer that has been trained along with the segmentation neural network.

According to some embodiments, the additional trainable layer is configured for frequency domain image enhancement, including at least one of denoising and corruption restoration.

In another aspect, the invention proposes a method of image segmentation for obtaining veins mask from the image, wherein the method comprises: cropping and denoising the image; increasing the resolution of the cropped denoised image; segmenting the resulting cropped denoised high-resolution image using a pretrained baseline segmentation neural network that comprises an additional trainable layer that has been trained along with the segmentation neural network.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention may take form in various components and arrangements of components, and in various steps and arrangements of steps. The drawings are only for purposes of explaining the essential features of the proposed method and system and illustrating the preferred embodiments, and are not to be construed as limiting the invention. For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:

FIG. 1 shows an exemplary embodiment of a vein imaging system including the proposed image segmentation module.

FIG. 2 shows the denoising results (metrics) for different embodiments using different models: DnCNN FL - with the Fourier layer; DnCNN FL1 - with the Fourier layer, logarithmic version.

FIG. 3 illustrates performance metrics: the Denoiser PSNR for added Rician noise.

FIG. 4 illustrates different metrics during the learning process: a)-f) PSNR, SSIM and FSIM metrics for noise and the same for erasing corruption on validation sets; g)-h) loss functions and Dice scores for U-Net models with and without denoising after corruption on validation sets.

FIG. 5 depicts an image processing flowchart for the embodiment that includes increasing the image resolution.

DETAILED DESCRIPTION OF EMBODIMENTS

The pending invention was originally developed along with another improvement of a NIR vein imaging system relating to the projection alignment agent. In order to illustrate the full range of capabilities the pending invention may provide, it will be further described in an embodiment including the duet of the denoiser and the aligner agents that are capable of learning together within a near-infrared 2D imager under the RL framework.

It should be understood that though the following embodiments illustrate the particular application of the proposed invention in near-infrared image segmentation, the principles of the invention are also beneficial for many other applications involving image segmentation.

One of the embodiments of the vein imaging system comprising the image segmentation module of the pending invention is shown in Fig. 1. The Denoiser optimizes the spectral decomposition of the NIR image for the encoder-decoder (E-D) network to generate an optimal mask; the Aligner then searches for an optimal image-mask alignment. The adjusted mask updates both the RL environment in the Aligner and the adaptive layer F in the Denoiser.

To address the denoising problem of the NIR images, the pending invention employs the frequency formalism by means of the 2D Fourier transform over an unbounded domain. Denoting $\omega_m = e^{-2\pi i/m}$ and $\omega_n = e^{-2\pi i/n}$ for an image of $m \times n$ pixels, one can write the spectral decomposition as follows:

$$F(u, v) = \sum_{x=0}^{m-1} \sum_{y=0}^{n-1} f(x, y)\, \omega_m^{ux}\, \omega_n^{vy}, \qquad (1)$$

$$f(x, y) = \frac{1}{mn} \sum_{u=0}^{m-1} \sum_{v=0}^{n-1} F(u, v)\, \omega_m^{-ux}\, \omega_n^{-vy}, \qquad (2)$$

where $f(x, y)$ and $F(u, v)$ denote the image in the spatial and in the frequency domains, respectively. Considering the image in frequency space, it is possible to account for the physics of the infrared signal. Because hemoglobin in the veins absorbs light in the NIR range rather well, one can visualize the enhanced veins against the background. The main method of visual analysis of the Fourier transform is to calculate its spectrum, i.e., the coordinate-wise absolute value $|F(u, v)|$, or the energy spectrum $|F(u, v)|^2$.

To filter the image in the frequency domain, one needs a function that modifies its spectrum in a specific way. By changing the frequencies in the Fourier spectrum of the image, the vein contrast in the imaged area can be increased. A high-pass filter would naturally enhance the edges of the veins, while a low-pass filter would smooth out the scatter detail. There is therefore a need to define an optimal frequency representation, which may vary from image to image. This problem can be solved by exploiting different enhancement approaches through an agent that learns the optimal frequency representation for every image in the dataset. That is, according to the present invention, an adaptive learnable layer is devised that is a part of a denoising module (Denoiser) and is trained together with the Denoiser, as explained in more detail below.
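As an illustration of the fixed (non-adaptive) filtering discussed above, the following minimal NumPy sketch computes the spectrum and the energy spectrum of an image and applies ideal low-pass and high-pass filters. The function names, the random stand-in image and the cut-off value are assumptions made for the example only; they are not taken from the described embodiments.

```python
import numpy as np


def radial_frequency(shape):
    """Distance of each FFT bin from the zero frequency, in cycles per pixel."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)


def filter_in_frequency_domain(image, transfer):
    """Forward FFT, multiply the spectrum by `transfer`, inverse FFT."""
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * transfer))


rng = np.random.default_rng(0)
image = rng.random((64, 64))           # stand-in for a NIR frame

spectrum = np.fft.fft2(image)
magnitude = np.abs(spectrum)           # spectrum: coordinate-wise absolute value
energy = magnitude ** 2                # energy spectrum

cutoff = 0.1                           # hypothetical cut-off, cycles per pixel
radius = radial_frequency(image.shape)
low_pass = (radius <= cutoff).astype(float)
high_pass = 1.0 - low_pass

smoothed = filter_in_frequency_domain(image, low_pass)    # coarse structure
edges = filter_in_frequency_domain(image, high_pass)      # fine detail
```

Because the two masks are complementary, the two filtered images add back up to the input. A fixed cut-off of this kind is exactly what the adaptive layer is meant to replace with weights learnt per frequency.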

Now turning to the adaptive denoising global filtering layer (Layer F in Fig. 1), its function is defined by the following logic. As the simplest model, we can denoise the image by direct ramping of the spectrum:

$$\hat{F}(u, v) = W(u, v) \circ F(u, v), \qquad (3)$$

where $W$ is a pre-trained non-negative matrix of weights (the exact algorithm can be found in the Appendix) and $\circ$ denotes element-wise multiplication. Typically, the majority of frequency values in the $F(u, v)$ matrix are rather small, suggesting a log transform $\log(1 + |F(u, v)|)$ for boosting the visibility of the small values across the image.

So, one could also learn the damping/amplification weights of certain frequencies as follows:

Note that in both configurations, (3) and (4), the weight matrix W(u, v) adapts at each frequency according to a given optimization problem. If the problem at hand is to maximize the segmentation quality to find the optimal mask of the object, W(u, v) will update itself with the optimal frequencies for that particular object. The proposed implementation of the adaptive filtering algorithm for the layer F is a sequence of operations: perform the Fourier transform, replace the spectrum with one of the above techniques, (3) or (4), and transform back into the spatial domain. Given an important symmetry property of the spectrum of a real-valued image,

$$F(-u, -v) = \overline{F(u, v)},$$

which allows a twofold reduction of the size of the image in the frequency space, the direct and the inverse Fast Fourier Transforms (FFT) for real signals can be used in practice. The adaptive layer could be placed prior to any denoising network (in various embodiments, different network models can be used here, e.g., the popular state-of-the-art model DnCNN or a basic U-Net), jointly comprising the proposed adaptive agent Denoiser. According to one of the embodiments, the FSIM metric, which also relies on the Fourier spectra, was used to guide the Denoiser agent at each time step t. Should the mask of the object change due to motion of the object (the forearm) or be adjusted by the other agent, the Aligner, at (t + 1), the Denoiser would update the weights of the filtering layer accordingly until a new ‘optimal’ mask is produced. As such, filtering provides an optimal denoising sequence for a given frame in terms of the proposed frequency-space paradigm. Possible Denoiser actions at each step are to select one or a combination of the filtering layers:

a no-change action, which means not changing the spectrum and, consequently, no action taken by the RL algorithm, and the actions that correspond to the weight matrices in operations (3) and (4). Weight matrices in adaptive layers are trained regardless of the selected action. The possible actions are therefore the different denoising filters mentioned above.
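The sequence of operations described for layer F (Fourier transform, element-wise reweighting of the spectrum, inverse transform) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the class name is hypothetical, only the direct ramping configuration is shown, and the training of the weights together with the base network is omitted.

```python
import numpy as np


class FourierFilterLayer:
    """Reweights the spectrum of a real image element-wise (hypothetical sketch)."""

    def __init__(self, height, width):
        self.shape = (height, width)
        # The real-input FFT keeps only width // 2 + 1 columns, because the
        # remaining half of the spectrum follows from conjugate symmetry.
        self.weights = np.ones((height, width // 2 + 1))

    def forward(self, image):
        spectrum = np.fft.rfft2(image)                # spatial -> frequency
        filtered = spectrum * self.weights            # element-wise reweighting
        return np.fft.irfft2(filtered, s=self.shape)  # frequency -> spatial


rng = np.random.default_rng(0)
image = rng.random((32, 32))           # stand-in for a NIR frame
layer = FourierFilterLayer(32, 32)

unchanged = layer.forward(image)       # all-ones weights leave the image as is

layer.weights[:] = 0.0
layer.weights[0, 0] = 1.0              # keep only the zero-frequency term
flat = layer.forward(image)            # every pixel equals the image mean
```

With all weights equal to one the layer reproduces its input, which corresponds to the action that leaves the spectrum unchanged. In a trainable version the weight matrix would be a non-negative parameter updated by the optimizer of the base network.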

The Denoiser agent is meant to learn a policy that ensures the quality improvement of the image at each step while maximizing the cumulative reward. Therefore, a reward function is designed as follows:

where PSNR is calculated for the pair of images acquired at steps (t + 1) and (t), and in particular $\mathrm{PSNR}_0 = 0$.
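For reference, PSNR between two frames can be computed as in the sketch below. The reward shown here is an assumption made purely for illustration: it takes the reward at each step to be the increment of PSNR over the previous step, which may differ from the reward function of the described embodiment. The PSNR trace is illustrative and does not represent measured results.

```python
import numpy as np


def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


def step_rewards(psnr_values):
    """ASSUMPTION: the reward at each step is the PSNR increment over the previous step."""
    return np.diff(np.asarray(psnr_values, dtype=float))


frame_t = np.zeros((8, 8))
frame_t1 = frame_t + 0.1
value = psnr(frame_t, frame_t1)                 # about 20 dB for an MSE of 0.01

psnr_trace = [0.0, 20.0, 23.0, 22.0]            # illustrative values, starting from 0
rewards = step_rewards(psnr_trace)              # increments: 20, 3, -1
cumulative = float(rewards.sum())               # equals last value minus first value
```

Under this assumed form the cumulative reward telescopes to the overall PSNR change, so a policy that maximizes it favors filter choices that raise image quality over the whole sequence.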

To demonstrate the proposed denoising and alignment procedures, various embodiments were successfully implemented and tested, several of which are described below. According to one of the embodiments, the implemented approach was tested on an annotated public dataset of raw NIR images of the forearms of 90 patients (with equally represented genders and a mean patient age of ~25 years, including various skin types). To collect the dataset, the venipuncture procedures typically done in clinics were performed, without actually using the needles. The patient’s elbow was put on a cushion and a tourniquet was applied to the upper arm. Then, the image of the forearm area was acquired with the NIR camera.

The example of DnCNN Training

The dataset images are resized to image_size = (512, 512), which is a common resolution used by U-Net. The batches have batch_size = 4, a tuned hyperparameter known to a person skilled in the art. DnCNN is selected as the baseline denoising model, the objective of which is to estimate a residual image from the noisy input image. To hold a proper comparison, these hyperparameters are selected: init_features = 32 (16) - number of features in the initial convolution, num_layers = 20 (17) - number of layers, for the base model without (with) the adaptive layer, respectively. The L1 loss is commonly used for denoising tasks in computer vision. However, by weighing the colour and the brightness of pixels regardless of the local structure, the L1 loss function is inferior to MS-SSIM, which is more sensitive to contrast in high-frequency regions than SSIM and relies on structural content. Thus, in this part, a Combined Loss function of MS-SSIM and L1 loss, with weights 0.8 and 0.2 respectively, was used. The PSNR, SSIM, and FSIM metrics were also studied.

The example of U-Net Training

The inputs of the baseline U-Net are the denoised images, in order to provide the Aligner with {I_t, M_t} pairs. The following model hyperparameters are used: init_features = 16 - number of channels in the initial convolution, depth = 3 - number of downsampling steps. We use a Combined Loss function of Dice and Cross Entropy losses with approximately equal weights (0.6 and 0.4, respectively). The Dice coefficient is measured to evaluate the quality of segmentation. To optimize the loss functions in both aforementioned training processes, the Adam optimizer with a learning rate of 0.001 is used, together with learning rate reduction on plateau.
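The combined segmentation loss mentioned above can be illustrated with the following NumPy sketch. It assumes binary (vein / background) masks, a soft Dice coefficient computed on predicted probabilities, a Dice loss defined as one minus the Dice coefficient, and binary cross entropy; these specifics and the function names are assumptions for the example, while the weights 0.6 and 0.4 are those stated in the text.

```python
import numpy as np


def dice_coefficient(prob, target, eps=1e-7):
    """Soft Dice coefficient between predicted probabilities and a binary mask."""
    intersection = np.sum(prob * target)
    return float((2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps))


def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))


def combined_loss(prob, target, w_dice=0.6, w_ce=0.4):
    """Weighted sum of Dice loss (1 - Dice) and cross entropy."""
    dice_loss = 1.0 - dice_coefficient(prob, target)
    return w_dice * dice_loss + w_ce * binary_cross_entropy(prob, target)


target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0                            # toy vein mask

perfect = combined_loss(target, target)           # close to zero
inverted = combined_loss(1.0 - target, target)    # large: every pixel is wrong
```

The Dice term rewards overlap with the thin vein regions, while the cross entropy term penalizes confident per-pixel errors.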

The aforementioned hyperparameters were selected with respect to the need to compare the baseline model with and without the adaptive layer, with approximately the same number of parameters: 168,225 and 167,169, respectively. A decrease in the number of initial channels and the total number of layers led to a faster learning process for the model with an adaptive layer due to the higher speed of element-wise multiplications and the Fourier transform operations. To verify efficiency, two different modes of image corruption were chosen: Rician noise and erasing. According to this embodiment, the algorithm for obtaining an image with Rician noise is identical to that used by the authors in J. Yang, J. Fan, D. Ai, S. Zhou, S. Tang, and Y. Wang, “Brain MR image denoising for Rician noise using pre-smooth non-local means filter,” Biomed. Eng. Online, vol. 14, no. 1, pp. 1-20, 2015. This type of noise is present in sampling-based medical scanners (infrared and others, e.g., MRI).

The noisy image could be written as:

where σ ~ U(0, 0.1) is taken from a uniform distribution and N(μ = 0; σ) is a random tensor with the same dimensions as the image I. The erasing corruption is well described in the prior art and mimics the corruption occurring when the needle or a hand blocks a part of the field of view. The denoising results are provided in the table in Fig. 2, and the learning curves are shown in Fig. 4. For each curve, the average value is highlighted in bold, and the area around it is the standard deviation. The training/validation curves are also shown in the Appendix. The issue of especially low-level illumination, e.g., with Poisson noise statistics, is outside the scope of this study and will require additional forays into NIR-specific regularization schemes.
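The two corruption modes can be sketched in NumPy as follows. The Rician model used here is the textbook one (the magnitude of a signal with independent Gaussian noise in two channels) and is an assumption; the exact expression of the embodiment may differ. The erasing function, its parameters and the rectangle size limits are likewise hypothetical.

```python
import numpy as np


def add_rician_noise(image, rng, max_sigma=0.1):
    """Textbook Rician corruption with sigma drawn from U(0, max_sigma)."""
    sigma = rng.uniform(0.0, max_sigma)
    real = image + rng.normal(0.0, sigma, image.shape)   # noisy in-phase part
    imag = rng.normal(0.0, sigma, image.shape)           # noise-only quadrature part
    return np.sqrt(real ** 2 + imag ** 2)


def random_erase(image, rng, max_fraction=0.3, fill=0.0):
    """Overwrite one random rectangle, mimicking a needle or hand in the view."""
    h, w = image.shape
    eh = rng.integers(1, max(1, int(h * max_fraction)) + 1)
    ew = rng.integers(1, max(1, int(w * max_fraction)) + 1)
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    erased = image.copy()
    erased[top:top + eh, left:left + ew] = fill
    return erased


rng = np.random.default_rng(0)
image = 0.5 + 0.5 * rng.random((64, 64))    # stand-in frame with values in [0.5, 1)

noisy = add_rician_noise(image, rng)
erased = random_erase(image, rng)
changed = erased != image                   # pixels inside the erased rectangle
```

Unlike additive Gaussian noise, the Rician corruption is always non-negative and biases dark regions upwards, which is one reason it is treated as a separate case.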

According to different embodiments, using denoised images results in an increase in metrics as compared to the original baseline performance. In the embodiment with the Rician noise corruption, the model with 117,410 parameters obtained a better result after denoising than without it (Dice score: 0.79 vs. 0.74, respectively). The exemplary U-Net learning curves are provided in Fig. 4: (g, h) with standard k-fold cross-validation and (a) for denoising. The results obtained according to this embodiment indicate that no overfitting occurred.

As mentioned above, the Denoiser agent is trained in a continuous scenario. A stream of NIR images is processed by the agent with different sets of filtering actions. The agent solves the problem of balancing exploration and exploitation. Exploitation of the currently known best FSIM combination is needed to maximize the overall quality of the images received. Exploration is needed to identify possible new combinations that may lead to a better result due to the non-stationarity of the metric among different images in the stream. Fig. 3 shows the reward collected in such a non-stationary scenario. According to yet another aspect of the invention, the proposed technique also involves increasing the resolution of the image. To assess the efficiency of this approach, the following actions were performed (see Fig. 5).

74 initial IR images of hands, as well as the corresponding segmentation masks of the venous area, were processed using random augmentation techniques (random crop, random rotation, random color blending). The resulting augmented dataset of 2000 images was split into 1500 training images (training set 1) and 500 images left as a hold-out test set (case 1). Another pair of train and test sets was obtained by applying an SR network (ESRGAN in this embodiment) to the input images (IR images only) - case 2 (training set 2).
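The dataset preparation described above can be sketched as follows. This is a simplified illustration under stated assumptions: rotation is restricted to quarter turns, color blending is omitted, the images are small random stand-ins, and the function name and crop size are hypothetical. The key point shown is that the same geometric transform must be applied to an image and to its mask.

```python
import numpy as np


def augment_pair(image, mask, crop, rng):
    """Apply one random crop and one random quarter-turn rotation to image and mask."""
    h, w = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    turns = int(rng.integers(0, 4))            # simplification of random rotation
    window = (slice(top, top + crop), slice(left, left + crop))
    return np.rot90(image[window], turns), np.rot90(mask[window], turns)


rng = np.random.default_rng(0)
images = rng.random((74, 32, 32))              # stand-ins for the 74 IR images
masks = images > 0.5                           # stand-in vein masks

pairs = []
for _ in range(2000):
    i = rng.integers(0, len(images))
    pairs.append(augment_pair(images[i], masks[i], crop=24, rng=rng))

order = rng.permutation(len(pairs))
train = [pairs[j] for j in order[:1500]]       # training set
test = [pairs[j] for j in order[1500:]]        # hold-out test set
```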

The pre-trained baseline segmentation network (U-Net in this embodiment) uses the same architecture in both cases, except that the case 2 network has two additional 2D-convolution + max-pooling layers to increase effectiveness in processing high-resolution images. The case 1 setup takes in the augmented image, processes it through the base segmentation network, and then the augmented image is compared to the output segmentation mask by means of IoU. The obtained gradient is used to train the base network. The hold-out test set is used to evaluate the final IoU value reached by the setup.

The superior perceptual performance of in-pipeline usage of ESRGAN was observed in testing. The quality of super-resolved images obtained by the above embodiments with different network architectures suggests that shallower networks have the potential to provide very efficient alternatives at a small reduction of qualitative performance. The result of the comparison between combinations of the above-described features also suggests that the use of ESRGAN has a substantial impact on the performance of deep segmentation networks such as U-Net. Further, when aiming for clinically oriented solutions to the super-resolution problem, the choice of the aforementioned loss function is of particular importance.

According to one example of the case 2 setup embodiment, it takes in the augmented image and performs the super-resolution step, which scales the image resolution from 512x512 to 2048x2048 px. The obtained SR image undergoes segmentation. The resulting segmentation mask image has a resolution of 512x512, and the IoU score is calculated. The resulting IoUs on the hold-out test set were recorded and compared.
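The IoU score and the resolution bookkeeping of the case 2 setup can be illustrated with the sketch below. The stand-ins are assumptions: nearest-neighbour upsampling replaces the SR network, a threshold replaces the segmentation network, and two 2x2 max-pooling steps represent the two additional pooling layers that bring 2048x2048 back to 512x512. All function names are hypothetical.

```python
import numpy as np


def iou(pred, target):
    """Intersection over Union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0                      # both masks empty: treated as full agreement
    return float(np.logical_and(pred, target).sum() / union)


def upsample_nearest(image, factor):
    """Stand-in for the SR step: repeat every pixel `factor` times per axis."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)


def max_pool_2x2(x):
    """Halve both dimensions by taking the maximum over 2x2 blocks."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


pred = np.zeros((4, 4), dtype=bool)
pred[0:2] = True
target = np.zeros((4, 4), dtype=bool)
target[1:3] = True
score = iou(pred, target)               # 4 shared pixels / 12 pixels in the union

rng = np.random.default_rng(0)
image = rng.random((512, 512))          # stand-in augmented frame
sr_image = upsample_nearest(image, 4)   # 512x512 -> 2048x2048
mask_hr = sr_image > 0.5                # stand-in segmentation at high resolution
mask = max_pool_2x2(max_pool_2x2(mask_hr))   # 2048x2048 -> 512x512
```

Because the final mask has the same 512x512 size as the ground-truth mask, the IoU values of case 1 and case 2 can be compared directly on the hold-out set.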

Summing up the above description, the pending invention adds to the feedback loop mechanisms entailed in NIR imagers by introducing the duet of the adaptive denoiser agents. The Denoiser agent learns the proper frequency decomposition of the acquired infrared data by co-training the adaptive layer and the base denoising model. The independent Fourier layer can be used with any model that solves a restoration or any other denoising problem, for each of which the weight matrices are formed depending on the learnt frequency amplification or attenuation.

The proposed solution can be easily deployed in embedded systems with the advantage of a limited selection of actions. It is also to be noted that the advantageous effect of the invention occurs without a noticeable lag or any other impact on the vein imaging system’s performance, offering an efficient control solution for the infrared imaging or other similar imaging techniques where the proposed denoiser module can be of help.

The invention has been described with reference to the preferred embodiments. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the invention be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

CLAIMS:

1. An adaptive module for image segmentation comprising one or more artificial agents based on reinforcement learning, wherein one of the agents is implemented as an additional layer of an image segmentation neural network and is configured to obtain an optimal frequency domain image processing in terms of a given metric.

2. A system for image segmentation, the system comprising: a pretrained baseline segmentation neural network; wherein the pretrained baseline segmentation neural network comprises an additional trainable layer that has been trained along with the segmentation neural network.

3. The system for image segmentation of claim 2, wherein the additional trainable layer is configured for frequency domain image enhancement, including at least one of denoising and corruption restoration.

4. The system for image segmentation of any one of claims 1-2, further comprising a super-resolution generative adversarial neural network.

5. The system for image segmentation of any one of claims 1-4, wherein the system is used for obtaining veins mask from the image.

6. A method of image segmentation for obtaining veins mask from the image, the method comprising: using a super-resolution neural network to obtain an image with higher resolution; using a pretrained baseline segmentation neural network to obtain a veins mask from the image with higher resolution, wherein the segmentation neural network comprises an additional trainable layer that has been trained along with the segmentation neural network.

7. The method of image segmentation of claim 6, wherein the additional trainable layer is configured for frequency domain image enhancement, including at least one of denoising and corruption restoration.

8. A method of image segmentation for obtaining veins mask from the image, wherein the method comprises: cropping and denoising the image; increasing the resolution of the cropped denoised image; segmenting the resulting cropped denoised high-resolution image using a pretrained baseline segmentation neural network that comprises an additional trainable layer that has been trained along with the segmentation neural network.

9. The method of image segmentation for obtaining veins mask from the image of claim 8, wherein the additional trainable layer is configured for frequency domain image enhancement, including at least one of denoising and corruption restoration.