JP2023169526A

JP2023169526A - Video denoising method and device by twin-domain process in space-time domain and frequency conversion domain

Info

Publication number: JP2023169526A
Application number: JP2022080677A
Authority: JP
Inventors: Tsutomu Matsunaga (松永 力)
Original assignee: For A Co Ltd
Current assignee: For A Co Ltd
Priority date: 2022-05-17
Filing date: 2022-05-17
Publication date: 2023-11-30

Abstract

PROBLEM TO BE SOLVED: To provide a video denoising device and a video denoising method that can handle complex motion, such as scale changes and rotation of image frames in a video. SOLUTION: The video denoising method removes noise contained in a video by combining spatio-temporal processing that directly uses the pixel values of the images in the video with processing performed in a transform domain obtained by converting the image into the frequency domain. Preferably, the spatio-temporal processing that directly uses image pixels performs weighted averaging according to image similarity obtained by block matching. Furthermore, the method applies a frame cyclic filter and performs weighted addition of the result of self-matching within the current-frame input image and the result of block matching between the current-frame input image and the previous-frame output image. SELECTED DRAWING: Figure 1

Description

The present invention relates to a video denoising method using dual-domain processing in the spatio-temporal domain and the frequency transform domain, and to an apparatus therefor.

Denoising of images and video is one of the basic video processing operations, and it is a problem that is both old and new. A huge amount of research has been carried out, and various methods are still being proposed; they can be broadly divided into "methods in the spatial domain" and "methods in the transform domain."

A "method in the spatial domain" handles pixel values in the spatial domain directly and exploits the spatio-temporal similarity of images. Typical examples are the bilateral filter [Non-Patent Document 17] and the non-local means (NLM) filter [Non-Patent Document 3] [Non-Patent Document 4].

The present inventor [Non-Patent Document 10] [Non-Patent Document 12] simultaneously corrected rolling-shutter motion distortion and shake in video captured by a camera with a CMOS sensor. For the moving-camera case, the inventor heuristically introduced a recursive bilateral filter, an extension of the bilateral filter [Non-Patent Document 17], applied to the time series of the estimated translation parameters.

The inventor called this the "extended bilateral filter" and also used it for image noise removal [Non-Patent Document 11]. Although these methods are relatively simple, their noise removal performance is high. They can be regarded as corresponding to predictive coding in image and video compression [Non-Patent Document 7].

On the other hand, a "method in the transform domain" removes noise components by reducing the signal level in the transform domain and then applies the inverse transform [Non-Patent Document 6] [Non-Patent Document 16] [Non-Patent Document 14] [Non-Patent Document 5] [Non-Patent Document 18]. It corresponds to transform coding in image and video compression [Non-Patent Document 7], and the Fourier transform, which uses trigonometric functions as its basis, is probably the most familiar example. Among other transforms, the Walsh-Hadamard transform has historically been widely used because it can be computed from sums and differences alone. The Karhunen-Loève transform is often treated theoretically as the optimal transform in the least-squares sense.

More recently, the cosine transform has been adopted in the international standards for image and video compression, JPEG [Non-Patent Document 1] and MPEG [Non-Patent Document 2]. The wavelet transform was proposed in the 1980s and has since been used in various fields.

In denoising as well, various "-lets" that go beyond the framework of orthogonal transforms have subsequently been proposed and used [Non-Patent Document 16] [Non-Patent Document 14] [Non-Patent Document 5] [Non-Patent Document 8].

[Multi-frame extension and transform-domain level reduction of the residual image]
Normally, the non-local means filter is applied to a single image; here it is extended to multiple frames of a moving image sequence. FIG. 1 shows the configuration of the multi-frame non-local means filter. FIG. 1(a) is the frame acyclic configuration, in which the current frame I_n is the reference frame and the search area is taken in the previous frame I_{n-1}. FIG. 1(b) is the frame cyclic configuration. Z^{-1} denotes a one-frame delay; the current frame I_n is the reference frame and the search area is taken in the previous-frame output O_{n-1}. Liu et al. [Non-Patent Document 9] also extended the non-local means filter along the frame time direction. They performed video denoising using a frame acyclic configuration with 11 frames.

Frame cyclic (recursive) filtering has long been used as a noise reducer for video signals [Patent Document 1] [Non-Patent Document 15] [Non-Patent Document 6] [Non-Patent Document 13] [Non-Patent Document 7]. Compared with a frame acyclic configuration, it requires less frame memory and less frame delay. It is an infinite impulse response (IIR) filter in the frame time direction. Fukinuki [Patent Document 1] [Non-Patent Document 7] showed that, by applying level reduction to the inter-frame difference in frame cyclic filtering, a first-order-lag IIR filter can be built without a multiplier (FIG. 2(a)). FIG. 2 shows block diagrams of frame cyclic filtering; FIG. 2(a) shows frame cyclic filtering with level reduction [Patent Document 1] [Non-Patent Document 7].
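The recursion described above can be illustrated with a minimal sketch. The update O_n = I_n - S[I_n - O_{n-1}], the shape of `level_reduce` (small differences are passed with a fixed gain, large ones are zeroed) and the parameter names `T` and `gain` are assumptions made for illustration; the source does not give the exact curve used in [Patent Document 1]. In hardware, a function of this kind could be realised as a lookup table, which is one way to read the "no multiplier" remark.

```python
import numpy as np

def level_reduce(d, T=50.0, gain=0.5):
    """Hypothetical level-reduction operator S[d].

    Differences with |d| <= T are treated as noise and passed (scaled by gain);
    larger differences are treated as motion and set to zero.
    """
    d = np.asarray(d, dtype=np.float64)
    return np.where(np.abs(d) <= T, gain * d, 0.0)

def recursive_noise_reducer(frames, T=50.0, gain=0.5):
    """Frame cyclic filter: O_n = I_n - S[I_n - O_{n-1}], with O_0 = I_0."""
    outputs = []
    prev_out = None  # plays the role of the one-frame delay (frame memory)
    for frame in frames:
        frame = np.asarray(frame, dtype=np.float64)
        if prev_out is None:
            cur_out = frame.copy()
        else:
            noise = level_reduce(frame - prev_out, T, gain)
            cur_out = frame - noise
        outputs.append(cur_out)
        prev_out = cur_out
    return outputs
```

For a static pixel the output behaves like a first-order IIR average of the input, while a difference larger than `T` passes the new input through unchanged.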

The inter-frame difference (residual image) is used to discriminate noise from inter-frame motion, based on the prior knowledge that the noise level is small compared with inter-frame motion. Note that the threshold function for level reduction, the level-reduction operator S[x] of FIG. 5, is used not to obtain an image from which the noise has been removed but to extract the noise component from the inter-frame difference.

The difficulty in handling video signals, which are moving images, is the disturbance caused by motion. Simple level reduction of the inter-frame difference does not separate motion from noise sufficiently: when there is inter-frame motion, an afterimage of the previous frame can appear as a disturbance, depending on the threshold. Ebihara et al. [Non-Patent Document 6] separated motion from noise by applying the Hadamard transform to the inter-frame difference. Furthermore, Ninomiya et al. [Non-Patent Document 13] proposed a motion-compensated frame cyclic filter that estimates inter-frame motion by block matching (FIG. 2(b)). FIG. 2(b) shows motion-compensated frame cyclic filtering with level reduction [Non-Patent Document 13].
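As an illustration of block-matching motion estimation in general, the following sketch searches for the translation that minimises the sum of squared differences (SSD) between a block of the current frame and displaced blocks of the previous frame. The function name, the SSD criterion, and the block and search sizes are assumptions for illustration and are not taken from [Non-Patent Document 13].

```python
import numpy as np

def block_match(cur_frame, prev_frame, top, left, block=8, search=4):
    """Find the displacement (dy, dx) into prev_frame that best matches
    the block of cur_frame whose top-left corner is (top, left).

    Returns ((dy, dx), cost), where cost is the minimum SSD found.
    """
    target = cur_frame[top:top + block, left:left + block].astype(np.float64)
    height, width = prev_frame.shape
    best_shift, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the previous frame.
            if y < 0 or x < 0 or y + block > height or x + block > width:
                continue
            candidate = prev_frame[y:y + block, x:x + block].astype(np.float64)
            cost = float(np.sum((target - candidate) ** 2))
            if cost < best_cost:
                best_cost, best_shift = cost, (dy, dx)
    return best_shift, best_cost
```

Because the search only covers horizontal and vertical displacements, this kind of matching models translation only, which is the limitation the invention addresses.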

Here, filtering with the frame cyclic configuration is redefined in a unified way as the problem of estimating the pixel values of the current frame from the previous-frame output. Frame cyclic filtering with level reduction can be written as follows (FIG. 2(a)).

Block matching that estimates the motion between the previous-frame output and the current-frame input in the frame cyclic configuration can be written as follows.

This is "maximum likelihood estimation" in the presence of inter-frame motion.
Furthermore, introducing the non-local means filter gives

(FIG. 2(c)), which can be regarded as "Bayesian estimation." FIG. 2(c) shows motion-compensated frame cyclic non-local means with level reduction.
Finally, replacing the level reduction in the motion-compensated frame cyclic non-local means filter with transform-domain level reduction gives the following (FIG. 2(d)). FIG. 2(d) shows motion-compensated frame cyclic non-local means with transform-domain level reduction.

[Patent Document 1] Takahiko Fukinuki, Noise suppression device for television signals, Japanese Examined Patent Publication No. S59-17580 (filed June 2, 1978).

[Non-Patent Document 1] ITU-T Recommendation T.81, "Information technology - Digital compression and coding of continuous-tone still images - Requirements and guidelines", Retrieved 2009-11-07 (aka ISO/IEC 10918-1:1994).
[Non-Patent Document 2] ISO/IEC 13818-2:2003, "Information technology - Generic coding of moving pictures and associated audio information - Part 2: Video", Retrieved 2013-09-27 (aka ITU-T Recommendation H.262).
[Non-Patent Document 3] A. Buades, B. Coll, and J.-M. Morel, A non-local algorithm for image denoising, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, CA, U.S.A., Vol. 2, pp. 60-65, June 2005.
[Non-Patent Document 4] A. Buades, B. Coll, and J.-M. Morel, A review of image denoising methods, with a new one, Multiscale Modeling and Simulation, Vol. 4, No. 2, pp. 490-530, 2006.
[Non-Patent Document 5] M. N. Do and M. Vetterli, The contourlet transform: an efficient directional multiresolution image representation, IEEE Transactions on Image Processing, Vol. 14, No. 12, pp. 2091-2106, December 2005.
[Non-Patent Document 6] Norio Ebihara, Tadao Fujita, Kaichi Tatsuzawa, and Tsutomu Takamori, Noise reducer for television signals using Hadamard transform, Journal of the Institute of Television Engineers of Japan, Vol. 37, No. 12, pp. 1030-1036, December 1983 (in Japanese).
[Non-Patent Document 7] Takahiko Fukinuki, "Multidimensional signal processing of TV images", Nikkan Kogyo Shimbun, November 1988 (in Japanese).
[Non-Patent Document 8] E. Le Pennec and S. Mallat, Sparse geometric image representations with bandelets, IEEE Transactions on Image Processing, Vol. 14, No. 4, pp. 423-438, April 2005.
[Non-Patent Document 9] C. Liu and W. T. Freeman, A high-quality video denoising algorithm based on reliable motion estimation, Proceedings of the 11th European Conference on Computer Vision (ECCV'10): Part III, Heraklion, Crete, Greece, pp. 706-719, September 2010.
[Non-Patent Document 10] Tsutomu Matsunaga, Rolling shutter distortion correction and video stabilization without using corresponding points, Proceedings of the 19th Image Sensing Symposium (SSII2013), Yokohama (Pacifico Yokohama), June 2013 (in Japanese).
[Non-Patent Document 11] Tsutomu Matsunaga, Extended bilateral filter using an infinite impulse response system, Proceedings of the 19th Image Sensing Symposium (SSII2013), Yokohama (Pacifico Yokohama), June 2013 (in Japanese).
[Non-Patent Document 12] Tsutomu Matsunaga, Rolling shutter distortion correction and video stabilization without using corresponding points: from translation to rotation, Proceedings of the 21st Image Sensing Symposium (SSII2015), Yokohama (Pacifico Yokohama), June 2015 (in Japanese).
[Non-Patent Document 13] Yuichi Ninomiya and Yoshimichi Otsuka, Motion-compensated noise reducer, Journal of the Institute of Television Engineers of Japan, Vol. 39, No. 10, pp. 956-962, October 1985 (in Japanese).
[Non-Patent Document 14] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli, Image denoising using scale mixtures of Gaussians in the wavelet domain, IEEE Transactions on Image Processing, Vol. 12, No. 11, pp. 1338-1351, November 2003.
[Non-Patent Document 15] J. P. Rossi, Digital techniques for reducing television noise, SMPTE Journal, Vol. 87, No. 3, pp. 134-140, March 1978.
[Non-Patent Document 16] J.-L. Starck, E. J. Candes, and D. L. Donoho, The curvelet transform for image denoising, IEEE Transactions on Image Processing, Vol. 11, No. 6, pp. 670-684, June 2002.
[Non-Patent Document 17] C. Tomasi and R. Manduchi, Bilateral filtering for gray and color images, Proceedings of the Sixth IEEE International Conference on Computer Vision (ICCV'98), Bombay, India, pp. 839-846, January 1998.
[Non-Patent Document 18] G. Yu and G. Sapiro, DCT image denoising: a simple and effective image denoising algorithm, Image Processing On Line, Vol. 1 (2011).

The problem is how to separate image components from noise components. For video in particular, the method must cope not only with the motion of the camera capturing the subject but also with the motion of objects within the image frame. So far, separation has been attempted using the difference in pixel-value level between noise and motion, or by estimating and compensating the motion, but the conventional techniques cannot be said to be sufficient. To begin with, accurately estimating the various motions between image frames that contain noise has been considered difficult, and it also incurs processing cost and delay. In particular, motion estimation based on block matching models translational motion, a combination of horizontal and vertical displacements, and it has been considered difficult to handle complex motion such as the enlargement, reduction, and rotation caused by camera zoom and camera movement.

Rather than estimating the motion between image frames directly, the invention "speculatively executes" block matching between image frames to which predetermined small geometric transformations, such as small displacements, scale changes, and rotations, have been applied. The motion between consecutive image frames is almost always small and can be approximated well by small parameter values of each geometric transformation. As far as computational resources permit, it is also possible, and easy, to increase the number of "speculative executions" with different parameter values.
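The idea of speculative execution can be sketched as follows: the previous frame is warped with each of a small, fixed set of (scale, rotation) parameters, and every warped candidate is scored against a block of the current frame. The nearest-neighbour `warp`, the exponential similarity weight, the smoothing parameter `h`, and all function names are assumptions for illustration; the source does not specify the interpolation or the parameter values.

```python
import numpy as np

def warp(img, scale=1.0, angle_deg=0.0):
    """Nearest-neighbour scale/rotation about the image centre (inverse mapping)."""
    height, width = img.shape
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    yy, xx = np.mgrid[0:height, 0:width].astype(np.float64)
    y0, x0 = yy - cy, xx - cx
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    # For each output pixel, compute the source coordinate it is sampled from.
    xs = (c * x0 + s * y0) / scale + cx
    ys = (-s * x0 + c * y0) / scale + cy
    xi = np.clip(np.rint(xs).astype(int), 0, width - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, height - 1)
    return img[yi, xi]

def speculative_weights(cur_block, prev_out, top, left, scales, angles, h=0.1):
    """Similarity weight of each speculatively transformed previous frame
    for the block of the current frame located at (top, left)."""
    size = cur_block.shape[0]
    weights = {}
    for scale in scales:
        for angle in angles:
            candidate = warp(prev_out, scale, angle)[top:top + size, left:left + size]
            d2 = float(np.mean((cur_block - candidate) ** 2))
            weights[(scale, angle)] = float(np.exp(-d2 / (h * h)))
    return weights
```

Each (scale, angle) pair is evaluated independently of the others, so the loop body could be distributed across parallel workers.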

The present invention proposes a method that combines a "method in the spatial domain" with a "method in the transform domain" within the temporal frame cyclic filter that has been used as a noise reducer for video signals [Non-Patent Document 15] [Non-Patent Document 6] [Non-Patent Document 13] [Non-Patent Document 7]. The frame cyclic filter of the earlier work is redefined, using denoising techniques proposed to date, as the problem of estimating the pixel values of the current frame from the previous-frame output.

The present invention can cope with complex motion, such as scale changes and rotation of image frames in a video, and enables highly accurate noise removal. Because the motion in the video is not estimated directly, complex motion can be handled while the cost of motion estimation is reduced. The frame cyclic configuration requires no extra frame memory and introduces no frame delay. The "speculative executions" can be processed independently and in parallel as far as computational resources permit, which arguably makes this the best approach to video denoising.

FIG. 1 is a diagram illustrating the multi-frame non-local means filter.
FIG. 2 is a diagram illustrating block diagrams of frame cyclic filtering.
FIG. 3 is a schematic block diagram illustrating the overall video denoising process using dual-domain processing according to the present invention.
FIG. 4 is a block diagram illustrating details of the video denoising process using dual-domain processing.
FIG. 5 is a diagram illustrating a graph of the level-reduction operator S[ξ] (T = 50).
FIG. 6 is a further extension of the detailed block diagram of FIG. 4 of the video denoising process using dual-domain processing.
FIG. 7 is a conceptual diagram illustrating the operation of non-local means filtering, in which a reference block of 5×5 pixels is formed around the pixel of interest (A) and block matching is performed over a search area of 13×13 pixels.
FIG. 8 is a diagram illustrating the 64 discrete cosine transform basis images of an 8×8 image. The upper left is the constant basis; whiter means a larger value and blacker a smaller value; the horizontal variation becomes more rapid toward the right, and the vertical variation becomes more rapid toward the bottom.
FIG. 9 is a diagram illustrating an example of video denoising. In both the upper and lower rows, the left side is the noisy input image (Before (σ_n = 20)) and the right side is the denoised image (After (2D denoised / 3D++ denoised)). The upper row is the first frame of the moving image sequence, and the lower row is the result for frame 120 (see FIG. 10).
FIG. 10 is a graph of the peak signal-to-noise ratio (PSNR [dB]) of the difference image between the noise-free true image and the denoising result, plotted against the frame number of the moving image sequence.

(Overview)
The present invention uses dual-domain processing to perform denoising that removes the noise contained in a video. Dual-domain processing performs denoising by combining two kinds of processing: spatio-temporal processing that directly uses the pixel values in the image, and processing performed in a transform domain obtained by converting the image into the frequency domain.

The spatio-temporal processing that directly uses the pixels of the image performs weighted averaging according to image similarity obtained by block matching. In addition, a frame cyclic filter is applied, and the result of self-matching within the current-frame input image and the result of block matching against the previous-frame output image are combined by weighted addition.
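A minimal single-pixel sketch of this spatio-temporal stage is given below, using the 5×5 reference block and 13×13 search area mentioned for FIG. 7. The exponential weight, the smoothing parameter `h`, and in particular the fixed blending weight `alpha` between the self-matching result and the previous-frame result are assumptions for illustration; the source does not state how the two results are weighted.

```python
import numpy as np

def nlm_pixel(ref_img, search_img, y, x, patch=5, search=13, h=10.0):
    """Non-local means estimate of ref_img[y, x] from blocks of search_img.

    The reference block is centred on (y, x) in ref_img; candidate blocks
    are centred on every pixel of the search window in search_img.
    """
    p, s = patch // 2, search // 2
    ref_pad = np.pad(np.asarray(ref_img, dtype=np.float64), p, mode="reflect")
    src_pad = np.pad(np.asarray(search_img, dtype=np.float64), p, mode="reflect")
    ref_block = ref_pad[y:y + patch, x:x + patch]
    height, width = search_img.shape
    num = den = 0.0
    for cy in range(max(0, y - s), min(height, y + s + 1)):
        for cx in range(max(0, x - s), min(width, x + s + 1)):
            cand = src_pad[cy:cy + patch, cx:cx + patch]
            d2 = np.mean((ref_block - cand) ** 2)   # block dissimilarity
            w = np.exp(-d2 / (h * h))               # similarity weight
            num += w * src_pad[cy + p, cx + p]      # centre pixel of candidate
            den += w
    return num / den

def multiframe_nlm_pixel(cur, prev_out, y, x, alpha=0.5, **kwargs):
    """Blend self-matching (within cur) with matching against prev_out.

    alpha is a hypothetical fixed weight for the self-matching result.
    """
    z_self = nlm_pixel(cur, cur, y, x, **kwargs)
    z_prev = nlm_pixel(cur, prev_out, y, x, **kwargs)
    return alpha * z_self + (1.0 - alpha) * z_prev
```

Setting `alpha=1.0` reduces the sketch to ordinary single-frame non-local means.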

By performing block matching against the results of "speculatively executing" predetermined small scale and rotation transformations on the previous-frame output image, the method can handle not only simple translational motion but also enlargement, reduction, and rotation of the image.

The result of this spatio-temporal processing is subtracted from the current-frame input image to form a residual image. The residual image is converted into the frequency domain by the discrete cosine transform, and level reduction in the transform domain separates the noise components from the image components, which further improves the noise removal performance.
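The transform-domain step can be sketched for a single square block as follows. The orthonormal DCT-II matrix is standard; the hard-threshold form of the level-reduction operator (coefficients with magnitude at most `T` are kept as noise, larger ones are treated as image structure and zeroed) and the function names are assumptions for illustration, since the exact operator is not reproduced in this text.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix C, so that C @ x is the 1-D DCT of x."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C

def extract_noise_block(residual_block, T):
    """Extract the noise part of a square residual block in the DCT domain."""
    C = dct_matrix(residual_block.shape[0])
    coef = C @ residual_block @ C.T                 # forward 2-D DCT
    kept = np.where(np.abs(coef) <= T, coef, 0.0)   # hypothetical level reduction
    return C.T @ kept @ C                           # inverse 2-D DCT

def dual_domain_block(noisy_block, z_block, T):
    """Output = input minus the noise extracted from (input - spatio-temporal result)."""
    residual = noisy_block - z_block
    return noisy_block - extract_noise_block(residual, T)
```

With a very large `T` the whole residual is treated as noise and the output equals the spatio-temporal result; with `T = 0` nothing is removed and the output equals the input.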

One new characteristic processing configuration is the introduction of a frame cyclic configuration into dual-domain processing, which extends it to three-dimensional processing along the time direction using the previous-frame output. The dual-domain processing here consists of a non-local means filter that performs weighted averaging according to image similarity, followed by a discrete cosine transform of the residual image (the difference between the non-local means filter output and the noisy input image) into the frequency domain, where a level-reduction operator extracts the noise with higher precision.

As a further new aspect, the results of applying predetermined geometric transformations, such as scale and rotation transformations, to the previous-frame output are used in the block matching that measures image similarity in the non-local means filter. This "speculative execution" allows noise to be extracted with high precision in the presence of various complex motions in the video.

In addition, the following can be cited as new and important characteristic technical ideas for configuring and constructing the method and device of the present invention:
・A frame cyclic multi-frame non-local means filter processing unit that performs spatio-temporal processing directly using the pixel values in the image
・A weighted summation unit that takes the weighted sum of the results of the frame cyclic multi-frame non-local means filter processing
・A first residual image generation unit that subtracts the output of the weighted summation unit from the noisy input image
・A frequency-domain transform unit that converts the residual image into the frequency domain by the discrete cosine transform
・A transform-domain level reduction unit that applies a transform-domain level threshold to the frequency-transformed residual image, which contains both noise components and image components, to discriminate between them and extract the noise components
・An inverse frequency-domain transform unit that returns the output of the transform-domain level reduction unit to pixel values by the inverse discrete cosine transform
・A second residual image generation unit that subtracts the final noise component output by the inverse frequency-domain transform unit from the noisy input image
・A frame memory unit that stores the noise-removed output image from the second residual image generation unit
・A geometric transformation unit that applies geometric transformations, such as scale and rotation transformations, to the previous-frame image stored in the frame memory
As described later, the geometric transformation unit may be omitted or bypassed (see, for example, FIG. 4).

The method and device described above can be implemented as a hardware device that processes baseband video signals, or as software that processes MXF (Material Exchange Format) files together with a computer-based device that executes it. With a device that converts MXF files to baseband video signals, or vice versa, any such configuration is possible. It is also possible to transmit compressed camera video or MXF files over IP (Internet Protocol) and perform the processing in the cloud. Various system configurations are conceivable, such as decoding IP-transmitted compressed video into a baseband video signal, denoising it, and then compressing the result again for stream delivery.

FIG. 3 is a schematic block diagram of the overall video denoising process using dual-domain processing according to the present invention. Spatio-temporal domain denoising is applied to the noisy input video (Input I_n) using the previous-frame output O_{n-1}, which is obtained by feeding the video denoising output (Output O_n) back through a frame memory. The subscripts n-1 and n denote frame numbers in the video.
Next, frequency transform domain denoising is applied to the difference between the spatio-temporal domain denoised video and the input video. The result is subtracted from the input video to give the final video denoising output (Output O_n). A geometric transformation that scales (enlarges or reduces) or rotates the image may be applied to the previous-frame output O_{n-1} from the frame memory.
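The data flow of FIG. 3 can be summarised in a short sketch in which the two denoising stages and the optional geometric transformation are passed in as functions. The function and parameter names are hypothetical, and using the current frame itself as the reference for the first frame (when the frame memory is still empty) is an assumption for illustration.

```python
import numpy as np

def denoise_sequence(frames, spatio_temporal, transform_noise, geometric=None):
    """Frame cyclic dual-domain denoising loop.

    spatio_temporal(I_n, ref) -> Z_n  : spatio-temporal domain denoising
    transform_noise(residual) -> N_n  : noise extracted in the transform domain
    geometric(O_prev) -> warped O_prev: optional geometric transformation
    """
    prev_out = None  # frame memory
    outputs = []
    for frame in frames:
        I_n = np.asarray(frame, dtype=np.float64)
        if prev_out is None:
            ref = I_n                      # no previous output yet
        elif geometric is not None:
            ref = geometric(prev_out)
        else:
            ref = prev_out
        Z_n = spatio_temporal(I_n, ref)    # spatio-temporal stage
        N_n = transform_noise(I_n - Z_n)   # noise from the residual
        O_n = I_n - N_n                    # final output for this frame
        outputs.append(O_n)
        prev_out = O_n                     # store in frame memory
    return outputs
```

Only the previous output is stored between iterations, which reflects the single frame memory of the frame cyclic configuration.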

FIG. 4 is a detailed block diagram of the video denoising process using dual-domain processing. The noisy input video (Input I_n) undergoes spatio-temporal domain denoising (Spatio-Temporal Domain Denoising) using the previous-frame output O_{n-1}, which is obtained by feeding the video denoising output (Output O_n) back through the frame memory (Frame Memory).

A non-local means (NLM) filter [Non-Patent Document 3] [Non-Patent Document 4] is used for the spatio-temporal domain denoising (described in detail later). The video denoising output (Output O_n) is therefore expressed as follows.
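
The following is a hedged reconstruction of that expression from the block-diagram description of FIG. 3 and FIG. 4, not the patent's own typeset formula. It assumes Z_n denotes the weighted-sum output of the non-local means stage, R_n the first residual image, S the level-reduction operator, and it writes the frequency transform as a calligraphic T (a notation chosen here to keep it distinct from the threshold T):

```latex
R_n = I_n - Z_n, \qquad
O_n = I_n - \mathcal{T}^{-1}\!\left[\, S\!\left[\, \mathcal{T}\!\left[ R_n \right] \right] \right]
```

Read this way, the frequency-domain stage estimates the noise contained in the residual R_n, and only that estimate is removed from the input.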

Here, T is the threshold that distinguishes noise components from image components. The relationship in Equation 7 is plotted in FIG. 5.
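
As a minimal sketch of how such a threshold could act on transform coefficients, the snippet below uses a hard-threshold rule. This rule is an assumption: the exact shape of the operator in Equation 7 and FIG. 5 is not recoverable from this text, and the function name `level_reduce` is hypothetical.

```python
def level_reduce(coeffs, threshold):
    """Return the part of each coefficient that is treated as noise.

    Assumed hard-threshold form of S[xi]: a coefficient whose magnitude is
    below the threshold is regarded as noise and kept; a larger coefficient
    is regarded as image content and set to zero.
    """
    return [c if abs(c) < threshold else 0.0 for c in coeffs]


# Example with the threshold value 55 used in the simulation section.
noise = level_reduce([3.0, -60.0, 54.9, 55.0], threshold=55.0)
```

The output of this step is a noise estimate, not the denoised signal; the denoised result is obtained only after the estimate is transformed back and subtracted from the input.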

FIG. 6 further extends the detailed block diagram of FIG. 4. The weighted-sum (Σ) output Z_n of the non-local means (NLM) filters, which perform the spatio-temporal domain denoising, is as follows.

The geometric transforms use predetermined parameters and constitute a "speculative execution" that predicts the motion of the camera or of objects in the video. For example, if the camera zooms in while panning, it suffices to apply a translation that shifts the image horizontally and a scale transform that enlarges it.

The non-local means filter itself already performs a search equivalent to translation through block matching. Even so, predicting such a translation and speculatively applying it to the previous-frame output O_{n-1} from the frame memory lets the method handle camera pans and zooms whose displacement exceeds the block-matching search range of the filter, which is expected to improve noise removal performance.

Even when a prediction misses, the matching residual in the non-local means filter becomes very large, so the corresponding search blocks do not contribute to noise removal (the weighted output of those search-block pixels is nearly 0).
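
The effect of a missed prediction can be illustrated with a small sketch. It assumes the exponential weight exp(-residual / h^2) commonly used in non-local means filters and a mean-squared block difference as the residual; the patent's own weight formula is not reproduced in this text, and the names `mean_sq_diff` and `combine_candidates` are hypothetical.

```python
import math


def mean_sq_diff(a, b):
    """Mean squared difference between two equally sized blocks (flat lists)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combine_candidates(reference, candidates, h):
    """Weighted average of candidate blocks; weights decay with the residual."""
    weights = [math.exp(-mean_sq_diff(reference, c) / (h * h)) for c in candidates]
    total = sum(weights)
    estimate = [
        sum(w * c[i] for w, c in zip(weights, candidates)) / total
        for i in range(len(reference))
    ]
    return estimate, weights


reference = [100.0, 120.0, 140.0, 160.0, 180.0]  # block from the current frame
hit = [101.0, 119.0, 141.0, 159.0, 181.0]        # candidate from a correct prediction
miss = [10.0, 200.0, 30.0, 220.0, 20.0]          # candidate from a wrong prediction
estimate, weights = combine_candidates(reference, [hit, miss], h=22.0)
```

With h = 22, the well-matched candidate receives a weight close to 1 while the mismatched one receives a weight many orders of magnitude smaller, so the combined estimate is essentially the well-matched block.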

Each geometric transform and its non-local means filter are processed independently and in parallel. Because the execution is speculative, it is less accurate than estimating the true camera motion. However, precisely estimating camera motion from noisy images is difficult in the first place, and doing so incurs high computational cost and latency.

Motion between consecutive image frames is generally small and can be approximated well by small parameter values in each geometric transform. As long as computational resources permit, the number of speculative executions with different parameter values can easily be increased. "Speculative execution" here means carrying out several executions or preparations without knowing which one is the "hit", and then making effective use only of the one that later turns out to be the hit (the correct answer).

FIG. 7 is a conceptual diagram of the non-local means filter operation. A 5×5-pixel reference block is formed around the pixel of interest (A), and block matching is performed over a 13×13-pixel search area. The center pixel of each block is weighted according to its matching residual, and the weighted values are averaged. This is repeated while moving the block one pixel at a time.

[Non-local means filter]
[Pixel-wise processing]

The block matching residual SSD(u, v) is as follows.

This performs block matching of a (2N+1)×(2N+1)-pixel reference block against a (2W+1)×(2W+1)-pixel search area within the image itself; it is a weighted average that gives larger weights to similar neighboring block pixels. FIG. 7 illustrates the operation for a 5×5-pixel block (N = 2) around the pixel of interest (A) and a 13×13-pixel search area (W = 6): the center pixel of each block is weighted according to its matching residual, the weighted values are averaged, and the block is moved one pixel at a time.
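
The pixel-wise procedure can be sketched for a single-channel image as follows. This is an illustration under stated assumptions rather than the patent's exact formulation: the residual is normalized by the block area, the weight is exp(-SSD / h^2), and coordinates are clamped at the image border. The function name `nlm_pixel` and its parameter names are hypothetical.

```python
import math


def nlm_pixel(img, y, x, n=2, w=6, h=22.0):
    """Pixel-wise non-local means estimate for pixel (y, x) of a 2-D list.

    n: half size of the reference block, giving (2n+1) x (2n+1) pixels.
    w: half size of the search area, giving (2w+1) x (2w+1) pixels.
    h: smoothing parameter; larger values smooth more strongly.
    """
    height, width = len(img), len(img[0])

    def px(r, c):
        # Clamp coordinates so that blocks near the border stay in the image.
        return img[min(max(r, 0), height - 1)][min(max(c, 0), width - 1)]

    num = den = 0.0
    for v in range(-w, w + 1):
        for u in range(-w, w + 1):
            # Residual between the reference block and the block shifted by (u, v).
            ssd = 0.0
            for j in range(-n, n + 1):
                for i in range(-n, n + 1):
                    d = px(y + j, x + i) - px(y + v + j, x + u + i)
                    ssd += d * d
            ssd /= (2 * n + 1) ** 2
            weight = math.exp(-ssd / (h * h))
            # Only the center pixel of the matched block enters the average.
            num += weight * px(y + v, x + u)
            den += weight
    return num / den


flat = [[50.0] * 9 for _ in range(9)]
edge = [[0.0] * 4 + [200.0] * 5 for _ in range(9)]
flat_value = nlm_pixel(flat, 4, 4)
edge_value = nlm_pixel(edge, 4, 1)
```

On a flat image every block matches and the output equals the input value. Next to a strong edge, blocks that straddle the edge get negligible weight, so the dark side stays dark instead of being blurred across the edge.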

The larger h in Equation 11, the stronger the smoothing. For the color components C, besides three-dimensional RGB or three-dimensional luminance/chrominance YUV, one-dimensional Y plus two-dimensional UV, or one dimension each for Y, U and V, can be used. The present invention applies the orthogonal color transform (described later) to the RGB color image and uses one-dimensional C_1 (Y) plus two-dimensional C_2C_3 (UV) on the result. For the one- and two-dimensional cases, the SSD is computed with k = 1 and k = 2, respectively, and the normalization coefficients likewise become 1 and 2.

[Block-wise processing]
Non-local means filtering of all pixel values in a block {B_k | k = 1, 2, 3} of a color image is as follows.

As in pixel-wise processing, the block B_k is (2N+1)×(2N+1) pixels in size and is estimated as a weighted average over a (2W+1)×(2W+1)-pixel search area, with weights ω(u, v) derived from the matching residual of the block matching SSD(u, v). Because processing proceeds while moving the block one pixel at a time, the estimated block pixels overlap; the final output, obtained by averaging all overlapping pixel values, is as follows.

Although the filter is called "non-local", the actual processing uses as its search area the reference block centered on the pixel of interest and the "local" blocks around it. Strictly speaking, it should therefore be called a "local" means filter. In [Non-Patent Document 3] and [Non-Patent Document 4], however, the search area is the "non-local" region covering the entire image, and such processing is not impossible.

In the present invention the search area consists of "local" blocks within a practical range, but the name "non-local means (NLM) filter" is used, following [Non-Patent Document 3] and [Non-Patent Document 4].

[Orthogonal color transform of RGB color images]
In the present invention, the RGB color image undergoes the following orthogonal color transform [Non-Patent Document 18].

This transform matrix is the 3-point cosine transform basis, and the transformed C_1, C_2, C_3 correspond approximately to luminance/chrominance Y, U, V, in that order. The non-local means filter and the frequency-domain denoising (described in detail later) are applied to each transformed channel, and the inverse transform converts the result back to an RGB color image. Because the transform is orthogonal, its inverse matrix is the transpose of the transform matrix.
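
A small sketch of such a transform is given below. The matrix itself is not reproduced in this text, so the orthonormal 3-point DCT-II basis is assumed here; the function names `dct3_matrix`, `forward` and `inverse` are hypothetical.

```python
import math


def dct3_matrix():
    """Orthonormal 3-point DCT-II basis (assumed form of the color transform)."""
    rows = []
    for k in range(3):
        scale = math.sqrt(1.0 / 3.0) if k == 0 else math.sqrt(2.0 / 3.0)
        rows.append(
            [scale * math.cos(math.pi * (2 * n + 1) * k / 6.0) for n in range(3)]
        )
    return rows


def forward(rgb, m):
    """(R, G, B) -> (C1, C2, C3): multiply by the transform matrix."""
    return [sum(m[k][n] * rgb[n] for n in range(3)) for k in range(3)]


def inverse(c, m):
    """(C1, C2, C3) -> (R, G, B): multiply by the transpose of the matrix."""
    return [sum(m[k][n] * c[k] for k in range(3)) for n in range(3)]


m = dct3_matrix()
rgb = [200.0, 120.0, 30.0]
restored = inverse(forward(rgb, m), m)
gray = forward([90.0, 90.0, 90.0], m)
```

Under this assumption the rows are proportional to (1, 1, 1), (1, 0, -1) and (1, -2, 1), so a gray pixel maps to a pure C1 value with C2 and C3 equal to zero, which matches the luminance/chrominance interpretation.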

FIG. 8 illustrates the 64 discrete cosine transform basis images for an 8×8 image. The upper-left basis image is constant; whiter pixels indicate larger values and darker pixels smaller values. Horizontal variation becomes more rapid toward the right, and vertical variation more rapid toward the bottom.

[Denoising by level reduction in the frequency domain]
The basis functions of the discrete cosine transform (DCT) are defined as follows.

Letting f_n denote the discretized signal, the discrete cosine transform and the inverse discrete cosine transform are as follows.

For a two-dimensional image, the one-dimensional DCT above is applied in the horizontal direction, and a one-dimensional DCT is then applied to the result in the vertical direction. Accordingly, for a two-dimensional image the basis functions of Equation 15 take the form of the two-dimensional basis images of FIG. 8, which shows the 64 DCT basis images for an 8×8 image.
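
The row-then-column construction can be sketched as follows. Equations 15 and 16 are not reproduced in this text, so the orthonormal DCT-II normalization is an assumption, and the function names are hypothetical.

```python
import math


def dct_1d(f):
    """Orthonormal 1-D DCT-II of a list of samples."""
    n = len(f)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(
            scale
            * sum(f[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        )
    return out


def idct_1d(xi):
    """Inverse of dct_1d: weighted sum of the cosine basis functions."""
    n = len(xi)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * xi[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct_2d(block):
    """2-D DCT: 1-D DCT along each row, then along each column."""
    rows = [dct_1d(r) for r in block]
    return transpose([dct_1d(c) for c in transpose(rows)])


def idct_2d(coeffs):
    """2-D inverse DCT: undo the column pass, then the row pass."""
    cols = transpose([idct_1d(c) for c in transpose(coeffs)])
    return [idct_1d(r) for r in cols]


block = [[float(4 * r + c) for c in range(4)] for r in range(4)]
restored = idct_2d(dct_2d(block))
flat_coeffs = dct_2d([[10.0] * 4 for _ in range(4)])
```

For a constant block only the upper-left coefficient is non-zero, which corresponds to the constant basis image at the upper left of FIG. 8.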

The DCT coefficients {ξ_k} usually approach 0 rapidly as k increases. An image can therefore be represented well by only the few terms with small k in the first expression of Equation 16. Because of this property, the DCT is also used for image compression.

Multiplying the basis images by arbitrary DCT coefficients {ξ_k} and summing them can represent an arbitrary image. In other words, the inverse DCT in the second expression of Equation 16 restores the original image from the basis images and the DCT coefficients {ξ_k} obtained by the first expression of Equation 16.

Therefore, if the noise component alone can be separated from the DCT coefficients {ξ_k}, which contain both image and noise components, subtracting the separated noise component from the noisy input image yields a denoised image.

The frequency-domain stage processes the residual image between the noisy input image and the result of the preceding spatio-temporal denoising, that is, the non-local means filter. Because the level of the noise component is expected to be small relative to that of the image component, applying the level-reduction operator S[ξ] of Equation 7 to the transformed residual image can be expected to extract the noise component alone with higher accuracy.

The transform is applied block by block, for example to 8×8 image blocks, and the denoising result is obtained for each block. The processing block is therefore shifted one pixel at a time, and the "ensemble average" of all denoised processing blocks is taken at the end. This is the same as in block-wise non-local means filtering.
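
The sliding-block procedure can be sketched in one dimension to keep it short. The sketch assumes an orthonormal DCT-II and a hard-threshold level reduction that keeps coefficients below the threshold as noise; both are assumptions, since the corresponding equations are not reproduced in this text, and the names `dct_basis` and `extract_noise` are hypothetical.

```python
import math


def dct_basis(n):
    """Rows are the orthonormal DCT-II basis vectors of length n."""
    return [
        [
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        ]
        for k in range(n)
    ]


def extract_noise(residual, block=8, threshold=55.0):
    """Estimate the noise in a 1-D residual with overlapping DCT blocks.

    Each block is transformed, coefficients at or above the threshold are
    zeroed (assumed to be image content), the block is transformed back,
    and all overlapping block results are averaged per sample.
    Requires len(residual) >= block.
    """
    basis = dct_basis(block)
    acc = [0.0] * len(residual)
    cnt = [0] * len(residual)
    for start in range(len(residual) - block + 1):
        seg = residual[start:start + block]
        coeffs = [sum(b * s for b, s in zip(row, seg)) for row in basis]
        kept = [c if abs(c) < threshold else 0.0 for c in coeffs]
        rec = [sum(basis[k][i] * kept[k] for k in range(block)) for i in range(block)]
        for i, value in enumerate(rec):
            acc[start + i] += value
            cnt[start + i] += 1
    return [a / c for a, c in zip(acc, cnt)]


small = [1.0, -2.0, 3.0, -1.0, 0.5, 2.0, -3.0, 1.5, 0.0, -0.5, 2.5, -1.5]
noise_small = extract_noise(small)
noise_large = extract_noise([100.0] * 12)
```

A low-level residual passes through unchanged and is removed entirely as noise, whereas a strong constant component produces a large coefficient that is rejected, so it is not subtracted from the input.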

FIG. 9 shows an example of video denoising. In both the upper and lower rows, the left side is the noisy input image (Before (σ_n = 20)) and the right side is the denoised image (After (2D denoised / 3D₊₊ denoised)). The upper row is the first frame of the video sequence and the lower row is frame 120 (see FIG. 10).

[Video sequence simulation]
In FIG. 9, the first frame of the video sequence (upper row) is the result of denoising the current-frame input alone, without any previous-frame output (2D denoised in FIG. 10). For frame 120 (lower row), the spatio-temporal non-local means filter uses the current-frame input, the frame-cyclic previous-frame output, and two speculative executions that scale the previous-frame output (scale factors 0.99 and 0.98); this is 3D₊₊ denoised in FIG. 10. The ratio of weighting coefficients was 1 for the current-frame input, 99 for the frame-cyclic previous-frame output, and 99 for each scale-transformed previous-frame output. The non-local means filter parameters were the same in all cases: reference block size 5×5, search area 15×15, and smoothing parameter h = 22. For the frequency-transform-domain denoising, the residual image obtained by subtracting the non-local means filter result from the noisy input image was transformed by a discrete cosine transform over 16×16 image regions, and the threshold T of the level-reduction operator S[ξ] of Equation 7 was set to 55. The noisy input images were generated by adding noise following a normal distribution with mean 0 and standard deviation 20 to the noise-free true images.

FIG. 10 plots, against the frame number of the video sequence, the peak signal-to-noise ratio (PSNR [dB]) computed from the difference image between the noise-free true image and the denoising result. Four cases are shown: the non-local means filter using the current-frame input and the frame-cyclic previous-frame output (3D denoised); additionally using one speculative execution that scales the previous-frame output (3D₊ denoised); using two speculative executions (3D₊₊ denoised); and using the current-frame input only (2D denoised).

The noisy input images were generated by adding noise following a normal distribution with mean 0 and standard deviation 20 to the noise-free true images; their PSNR averaged 22.54 (±0.0418) [dB] (standard deviation in parentheses).
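
For orientation, the PSNR figure can be related to the noise level with a short sketch. It assumes 8-bit video with a peak value of 255, which the text does not state explicitly; the function name `psnr` is hypothetical.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat lists of pixel values."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak / mse)


# An error of magnitude 20 at every pixel gives a mean squared error of 400,
# the same as ideal unclipped noise with standard deviation 20.
value = psnr([0.0, 0.0, 0.0, 0.0], [20.0, -20.0, 20.0, -20.0])
```

Under that assumption, unclipped noise with standard deviation 20 corresponds to about 22.11 dB. The reported average of 22.54 dB is slightly higher; clipping of the noisy pixel values to the valid range would be one possible explanation, although the text does not say so.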

In most frames, the PSNR increases from 2D denoised (current-frame input only), to 3D denoised (current-frame input plus frame-cyclic previous-frame output), to 3D₊ denoised and 3D₊₊ denoised, which add an increasing number of speculative executions that scale the previous-frame output.

In some frames (around frame 70), 3D denoised, which uses the current-frame input and the frame-cyclic previous-frame output, achieves the highest PSNR. This is because in those frames neither the camera nor the locally moving objects in the image undergo a scale change (here, a reduction of the image frame), so the motion can be regarded as purely translational. The rise in PSNR in the latter part of the sequence (from roughly frame 100 onward) occurs because the flat road-surface region at the bottom of the image grows.

(Supplementary explanation)
In the description above, the frame-cyclic multi-frame non-local means filter processing unit, which performs spatio-temporal processing directly on pixel values in the image, and the weighted summation unit, which takes the weighted sum of the results of that filter processing, correspond to "Spatio-Temporal Domain Denoising" in FIG. 3; to "NLM" (non-local means filter), w_1, w_2 and Σ in FIG. 4; and to "NLM", w_1, ..., w_n and Σ in FIG. 6.

The first residual image generation unit, which subtracts the output of the weighted summation unit from the noisy input image, corresponds to the subtraction "-" immediately to the right of "Spatio-Temporal Domain Denoising" in FIG. 3, and to the subtraction "-" immediately to the right of Σ in FIG. 4 and in FIG. 6.

The frequency domain transform unit, which transforms the residual image into the frequency domain by discrete cosine transform; the transform-domain level reduction unit, which uses a transform-domain level threshold to distinguish noise components from image components in the frequency-transformed residual image and extracts the noise components; and the inverse frequency domain transform unit, which returns the output of the transform-domain level reduction unit to pixel values by inverse discrete cosine transform, correspond to "Frequency Transform Domain Denoising" in FIG. 3 and to "T", "S" and "T^-1" in FIG. 4 and in FIG. 6.

The second residual image generation unit, which subtracts the final noise component output by the inverse frequency domain transform unit from the noisy input image, corresponds to the subtraction "-" immediately to the right of "T^-1" in FIG. 3, in FIG. 4 and in FIG. 6.

The frame memory unit, which stores the denoised output image from the second residual image generation unit, corresponds to "Frame Memory" in FIG. 3, FIG. 4 and FIG. 6.

The geometric transformation unit, which applies geometric transforms such as scale and rotation transforms to the previous-frame image stored in the frame memory, corresponds to "Geometric Transform" in FIG. 3 and to "Translation", "Rotation" and "Scale" in FIG. 6. FIG. 4 does not show a "Geometric Transform"; as FIG. 4 illustrates, the processing can also bypass the geometric transform entirely.

The disclosure in the embodiments above is not limited to the specific examples described. Within the scope of the technical idea of the present invention, known or well-known techniques available to those skilled in the art may be applied and/or adapted as appropriate.

The present invention is suitable for video equipment in general, and in particular for captured video such as security surveillance footage.

Claims

A video denoising device comprising:
a frame-cyclic multi-frame non-local means filter processing unit that performs spatio-temporal processing directly using pixel values in an image;
a weighted summation unit that takes a weighted sum of the processing results of the frame-cyclic multi-frame non-local means filter processing unit;
a first residual image generation unit that generates a first residual image by subtracting the output of the weighted summation unit from an input noisy image;
a frequency domain transform unit that transforms the first residual image into the frequency domain by discrete cosine transform;
a transform-domain level reduction unit that, using a transform-domain level threshold, distinguishes noise components from image components in the residual image frequency-transformed by the frequency domain transform unit, and extracts the noise components;
an inverse frequency domain transform unit that returns the output of the transform-domain level reduction unit to pixel values by inverse discrete cosine transform;
a second residual image generation unit that subtracts the final noise component output by the inverse frequency domain transform unit from the input noisy image; and
a frame memory unit that stores the denoised output image from the second residual image generation unit.

A video denoising method comprising:
a frame-cyclic multi-frame non-local means filter processing step of performing spatio-temporal processing directly using pixel values in an image;
a weighted summation step of taking a weighted sum of the results of the frame-cyclic multi-frame non-local means filter processing;
a first residual image generation step of generating a first residual image by subtracting the result of the weighted summation from an input noisy image;
a frequency domain transform step of transforming the first residual image into the frequency domain by discrete cosine transform;
a transform-domain level reduction step of using a transform-domain level threshold to distinguish noise components from image components in the residual image frequency-transformed by the frequency domain transform, and extracting the noise components;
an inverse frequency domain transform step of returning the result of the transform-domain level reduction to pixel values by inverse discrete cosine transform;
a second residual image generation step of subtracting the final noise component obtained by the inverse frequency domain transform from the input noisy image; and
a step of storing, in a frame memory, the denoised output image obtained in the second residual image generation step.

A video denoising method characterized by performing denoising that removes noise contained in a video by combining two processes: spatio-temporal processing that directly uses pixel values of images in the video, and processing in a transform domain in which the image is transformed into the frequency domain.

The video denoising method according to claim 3,
wherein the spatio-temporal processing that directly uses pixels in the image performs weighted averaging according to image similarity by block matching, and further applies a frame cyclic filter to take a weighted sum of the result of self-matching within the current-frame input image and the result of block matching between the current-frame input image and the previous-frame output image.

The video denoising method according to claim 4,
wherein block matching is further performed against the results of "speculatively executing" predetermined small scale transforms and rotation transforms on the previous-frame output image, so that the method handles not only simple translational motion but also image enlargement/reduction, rotation, and the like.

The video denoising method according to claim 5,
wherein the residual image obtained by subtracting the spatio-temporal processing result from the current-frame input image is transformed into the frequency domain by discrete cosine transform, and noise components and image components are separated by level reduction in the transform domain, thereby further improving noise removal performance.

A video denoising device comprising a denoising processing unit that removes noise contained in a video by combining two processes: spatio-temporal processing that directly uses pixel values of images in the video, and processing in a transform domain in which the image is transformed into the frequency domain.

The video denoising device according to claim 7,
wherein the spatio-temporal processing by the denoising processing unit, which directly uses pixels in the image, performs weighted averaging according to image similarity by block matching, and further applies a frame cyclic filter to take a weighted sum of the result of self-matching within the current-frame input image and the result of block matching between the current-frame input image and the previous-frame output image.

The video denoising device according to claim 8,
wherein the denoising processing unit further performs block matching against the results of "speculatively executing" predetermined small scale transforms and rotation transforms on the previous-frame output image, and thereby handles not only simple translational motion but also image enlargement/reduction, rotation, and the like.

The video denoising device according to claim 9,
wherein the denoising processing unit further transforms the residual image, obtained by subtracting the spatio-temporal processing result from the current-frame input image, into the frequency domain by discrete cosine transform, and separates noise components from image components by level reduction in the transform domain, thereby further improving noise removal performance.

The video denoising device according to claim 1,
A video denoising device, further comprising: a geometric transformation unit that performs geometric transformation on a previous frame image stored in the frame memory unit.

The video denoising device according to claim 11,
The video denoising device, wherein the geometrical transformation is a scale transformation or a rotational transformation.

The video denoising method according to claim 2,
A video denoising method, further comprising a geometric transformation step of performing geometric transformation on the previous frame image stored in the frame memory.

The video denoising method according to claim 13,
A video denoising method, wherein the geometric transformation is a scale transformation or a rotation transformation.