JP2020113829A

JP2020113829A - Moving image processing method and moving image processing apparatus

Info

Publication number: JP2020113829A
Application number: JP2019001491A
Authority: JP
Inventors: Hajime Nagahara; Tadashi Ogawara; Michitaka Yoshida
Original assignee: Osaka University NUC
Current assignee: Osaka University NUC
Priority date: 2019-01-08
Filing date: 2019-01-08
Publication date: 2020-07-27
Anticipated expiration: 2039-01-08
Also published as: JP7272625B2

Abstract

PROBLEM TO BE SOLVED: To provide a moving image processing method that can determine an appropriate exposure pattern according to the type of image sensor. SOLUTION: A moving image processing method includes a compression step of generating a compressed moving image by capturing images with repeated exposure that is thinned out temporally and spatially, using an image sensor in which pixels are arranged two-dimensionally, and a first machine learning step S1 of optimizing, by machine learning and prior to the compression step, an exposure pattern that specifies the mode of the exposure. In the compression step, the compressed moving image is generated using the exposure pattern obtained by the optimization in the first machine learning step S1. SELECTED DRAWING: Figure 12

Description

The present disclosure relates to a moving image processing method and to an apparatus that executes the method.

In recent years, video captured by IoT (Internet of Things) devices such as surveillance cameras and vehicle-mounted cameras has been actively analyzed. The video (that is, moving images) captured by these cameras is collected in a data center and used for analysis and other purposes. To reduce the required capacity of the communication path, compression processing such as lowering the spatial resolution and the temporal resolution (hereinafter also referred to as the frame rate) of the video is necessary. However, lowering the spatial resolution makes the video unclear, and lowering the frame rate loses the motion information in the video. Compressed video sensing using coded exposure images has been proposed as a means of resolving this trade-off between spatial resolution and temporal resolution.

For example, Patent Document 1 discloses a technique in which the light field acquired at each individual pixel of a camera sensor is modulated according to a corresponding modulation function, a frame integrated over each exposure time is generated, and the generated frames are reconstructed by a convex optimization method.

Patent Document 1: Japanese Patent No. 5726057

Non-Patent Document 1: T. Sonoda, H. Nagahara, K. Endo, Y. Sugiyama, R. Taniguchi, "High-speed imaging using CMOS image sensor with quasi pixel-wise exposure", International Conference on Computational Photography (ICCP), pp. 1-11, 2016.
Non-Patent Document 2: M. Iliadis, L. Spinoulas, A. K. Katsaggelos, "Deep fully-connected networks for video compressive sensing", Digital Signal Processing 72: 9-18, 2018.
Non-Patent Document 3: Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, S. K. Nayar, "Video from a single coded exposure photograph using a learned over-complete dictionary", International Conference on Computer Vision (ICCV), pp. 287-294, 2011.
Non-Patent Document 4: J. Yang, X. Yuan, X. Liao, P. Llull, D. J. Brady, G. Sapiro, L. Carin, "Video compressive sensing using Gaussian mixture models", IEEE Transactions on Image Processing, pp. 4863-4878, 2014.
Non-Patent Document 5: M. Iliadis, L. Spinoulas, A. K. Katsaggelos, "DeepBinaryMask: Learning a Binary Mask for Video Compressive Sensing", arXiv preprint arXiv:1607.03343, 2016.
Non-Patent Document 6: M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, "Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1", arXiv preprint arXiv:1602.02830, 2016.
Non-Patent Document 7: M. Gygli, H. Grabner, H. Riemenschneider, L. V. Gool, "Creating Summaries from User Videos", ECCV, 2014, https://people.ee.ethz.ch/gyglim/vsum/
Non-Patent Document 8: Räty, T. D.: Survey on Contemporary Remote Surveillance Systems for Public Safety, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), Vol. 40, No. 5, pp. 493-515, 2010.
Non-Patent Document 9: Li, Y., Ai, H., Yamashita, T., Lao, S. and Kawade, M.: Tracking in Low Frame Rate Video: A Cascade Particle Filter with Discriminative Observers of Different Life Spans, Vol. 30, No. 10, pp. 1728-1740, 2008.
Non-Patent Document 10: Yoshida, M., Torii, A., Okutomi, M., Endo, K., Sugiyama, Y., Taniguchi, R.-i. and Nagahara, H.: Joint optimization for compressive video sensing and reconstruction under hardware constraints, Proceedings of European Conference on Computer Vision (ECCV), 2018.
Non-Patent Document 11: Bobick, A. F. and Davis, J. W.: The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 3, pp. 257-267, 2001.
Non-Patent Document 12: Blank, M., Gorelick, L., Shechtman, E., Irani, M. and Basri, R.: Actions as Space-Time Shapes, Proceedings of International Conference on Computer Vision (ICCV), pp. 1395-1402, 2005.
Non-Patent Document 13: Laptev, I.: On Space-Time Interest Points, International Journal of Computer Vision, Vol. 64, No. 2, pp. 107-123, 2005.
Non-Patent Document 14: Dalal, N. and Triggs, B.: Histograms of oriented gradients for human detection, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 886-893, 2005.
Non-Patent Document 15: Klaser, A., Marszalek, M. and Schmid, C.: A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of British Machine Vision Conference (BMVC) (Everingham, M., Needham, C. and Fraile, R., eds.), Leeds, United Kingdom, British Machine Vision Association, pp. 275:1-10, 2008.
Non-Patent Document 16: Csurka, G., Dance, C. R., Fan, L., Willamowski, J. and Bray, C.: Visual categorization with bags of keypoints, Proceedings of European Conference on Computer Vision (ECCV), pp. 1-22, 2004.
Non-Patent Document 17: Laptev, I., Marszalek, M., Schmid, C. and Rozenfeld, B.: Learning realistic human actions from movies, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2008.
Non-Patent Document 18: Simonyan, K. and Zisserman, A.: Two-Stream Convolutional Networks for Action Recognition in Videos, Advances in Neural Information Processing Systems (NIPS) (Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. and Weinberger, K. Q., eds.), Curran Associates, Inc., pp. 568-576, 2014.
Non-Patent Document 19: Tran, D., Bourdev, L., Fergus, R., Torresani, L. and Paluri, M.: Learning Spatiotemporal Features with 3D Convolutional Networks, Proceedings of International Conference on Computer Vision (ICCV), pp. 4489-4497, 2015.
Non-Patent Document 20: Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M. and Zisserman, A.: The Kinetics Human Action Video Dataset, CoRR, Vol. abs/1705.06950, 2017.
Non-Patent Document 21: Carreira, J. and Zisserman, A.: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724-4733, 2017.
Non-Patent Document 22: Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A.: Going deeper with convolutions, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, 2015.
Non-Patent Document 23: Schüldt, C., Laptev, I. and Caputo, B.: Recognizing Human Actions: A Local SVM Approach, Proceedings of International Conference on Pattern Recognition (ICPR), Washington, DC, USA, IEEE Computer Society, pp. 32-36, 2004.

In the conventional technique described in Patent Document 1, the exposure state of each pixel is modulated based on a modulation function. However, it is hard to say that the optimum exposure pattern for each frame of the video captured by the camera is appropriately determined according to the type of image sensor.

The present disclosure therefore provides a moving image processing method and a moving image processing apparatus that can determine an appropriate exposure pattern according to the type of image sensor.

A moving image processing method according to one aspect of the present disclosure includes: a compression step of generating a compressed moving image by capturing images with repeated exposure that is thinned out temporally and spatially, using an image sensor in which pixels are arranged two-dimensionally; and a first machine learning step of optimizing, by machine learning and prior to the compression step, an exposure pattern that specifies the mode of the exposure. In the compression step, the compressed moving image is generated using the exposure pattern obtained by the optimization in the first machine learning step.

A moving image processing apparatus according to one aspect of the present disclosure is used with a camera that generates a compressed moving image by capturing images with repeated exposure that is thinned out temporally and spatially, using an image sensor in which pixels are arranged two-dimensionally. The apparatus includes: a first machine learning unit that optimizes, by machine learning, an exposure pattern that specifies the mode of the exposure; and an output unit that outputs the exposure pattern obtained by the optimization performed by the first machine learning unit.

Note that these general or specific aspects may be realized by a system, an apparatus, a method, an integrated circuit, a computer program, or a non-transitory recording medium such as a computer-readable CD-ROM, or by any combination of a system, an apparatus, a method, an integrated circuit, a computer program, and a recording medium.

According to the moving image processing method and the moving image processing apparatus according to one aspect of the present disclosure, an appropriate exposure pattern can be determined according to the type of image sensor.

FIG. 1 is a diagram showing an example of the flow of general compressed sensing of a moving image.
FIG. 2 is a diagram showing examples of exposure patterns that satisfy hardware implementation constraints.
FIG. 3 is a diagram showing an example of the structure of an SBE (Single Bump Exposure) sensor.
FIG. 4 is a diagram showing the number of exposures during one frame in the SBE sensor.
FIG. 5 is a diagram showing an example of the structure of an RCE (Row Column-wise Exposure) sensor.
FIG. 6 is a diagram showing the number of exposures during one frame in the RCE sensor.
FIG. 7 is a diagram showing an example of a global motion representation.
FIG. 8 is a diagram showing another example of a global motion representation.
FIG. 9 is a diagram for explaining the outline of a method of recognizing human actions.
FIG. 10 is a block diagram showing an example of the functional configuration of the moving image processing system according to the embodiment.
FIG. 11 is a diagram showing an example of the configuration of the machine learning unit according to the embodiment.
FIG. 12 is a flowchart showing an example of the moving image processing method according to the embodiment.
FIG. 13 is a diagram showing an example of the artificial intelligence used in the embodiment.
FIG. 14 is a diagram showing an example of the configuration of the machine learning step in the embodiment.
FIG. 15 is a diagram showing an example of updating a binarized exposure pattern.
FIG. 16 is a diagram showing the results of Experimental Example 2.
FIG. 17 is a diagram showing the results of Experimental Example 3.
FIG. 18 is a diagram showing an example of the flow of compressed sensing of a color moving image.
FIG. 19 is a diagram showing an example of a color filter pattern.
FIG. 20 is a diagram showing an example of the exposure pattern and the color filter pattern used in the experimental examples.
FIG. 21 is a diagram showing the results of Experimental Example 4.
FIG. 22 is a flowchart showing an example of the moving image processing method according to Modification 2.
FIG. 23 is a diagram showing an example of the configuration of the machine learning step in Modification 2.
FIG. 24 is a diagram showing one scene of each action class in the KTH Action dataset.
FIG. 25 is a diagram showing an example of the comparison methods in Experimental Example 5.
FIG. 26 is a diagram showing an example of the exposure at a given pixel of an image input to the neural network.
FIG. 27 is a diagram showing the confusion matrix of each comparison method.
FIG. 28 is a diagram showing the results of Experimental Example 6.

(Findings that form the basis of the present disclosure)
Moving images with high spatial resolution and a high frame rate are useful for analyzing what is actually happening. Such moving images are usually captured with a high-speed camera. To read out from the sensor at high speed, a high-speed camera provides a buffer for each pixel and also carries parallel AD converters to shorten the analog-to-digital (AD) conversion time. Such a special sensor is very expensive, and because the circuit becomes complex, the area of the phototransistor shrinks, which degrades sensitivity. Methods using compressed sensing have therefore been proposed as one means of acquiring moving images with high spatial resolution and a high frame rate (Non-Patent Documents 1 to 4).

Normally, a moving image is captured by continuously capturing a plurality of still images using a sensor with a global shutter, in which all pixels are exposed simultaneously. In contrast, compressed video sensing has a compression step and a reconstruction step: the moving image is compressed while it is being captured, and the original moving image is reconstructed from the compressed moving image. More specifically, in the compression step, the image sensor captures a single image while randomly shifting the exposure timing between adjacent pixels. This yields a coded exposure image in which temporal information is sampled into a single image. In the reconstruction step, a moving image of multiple frames is then reconstructed from the single image using the different temporal information contained in the coded exposure image obtained in the compression step. FIG. 1 is a diagram showing an example of the flow of general compressed sensing of a moving image. As shown in FIG. 1, in compressed video sensing, the sensing unit captures a moving image containing a series of scenes while shifting the exposure timing of each pixel according to an exposure pattern, thereby creating a coded exposure image in which the temporal information is aggregated into a single image. The reconstruction unit then reconstructs the moving image containing the series of scenes using the different temporal information contained in the coded exposure image. This compressed sensing model is expressed by the following equation (1).

$$ y = \phi x \qquad (1) $$

In the equation, x is the unknown dynamic scene (the unknown moving image), y is the coded exposure image, and φ is the coded exposure pattern.

In general, compressed sensing reconstructs the unknown moving image x from the coded exposure image y using the coded exposure pattern φ. Equation (1) shows that the quality of the moving image x reconstructed in this way depends on the compression performance of the coded exposure pattern φ.
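The forward model of equation (1) can be illustrated with a minimal sketch. It assumes that φ is a binary mask with one value per pixel and per sub-exposure, and that the coded exposure image y is the per-pixel sum of the masked frames; this reading, the function name `coded_exposure`, and the toy sizes are illustrative assumptions, not details taken from the disclosure.

```python
import random

def coded_exposure(video, pattern):
    """Sum the frames of `video` over time, keeping only exposed samples.

    video   : list of T frames, each an H x W list of intensities (x)
    pattern : list of T binary masks of the same shape (phi); 1 = exposed
    returns : a single H x W coded exposure image (y)
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    image = [[0.0] * W for _ in range(H)]
    for t in range(T):
        for r in range(H):
            for c in range(W):
                image[r][c] += pattern[t][r][c] * video[t][r][c]
    return image

# Toy scene: 4 sub-frames of a 2 x 2 image whose brightness equals t + 1.
T, H, W = 4, 2, 2
video = [[[float(t + 1)] * W for _ in range(H)] for t in range(T)]

# Hypothetical random binary exposure pattern (assumption for illustration).
rng = random.Random(0)
pattern = [[[rng.randint(0, 1) for _ in range(W)] for _ in range(H)]
           for _ in range(T)]

y = coded_exposure(video, pattern)
```

With an all-ones pattern the result reduces to an ordinary long exposure, which is why the choice of φ determines how much temporal information survives in y.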

Compressed video sensing requires capturing an image in which each pixel is exposed at a random timing, and various coded exposure patterns have been proposed for this purpose. However, typical CCD (Charge Coupled Device) and CMOS (Complementary Metal Oxide Semiconductor) sensors use either a global shutter, in which all pixels are exposed simultaneously, or a rolling shutter, in which pixels are exposed in readout order, so a sensor that is ideal for compressed video sensing does not generally exist. For this reason, either a coded exposure pattern that assumes ideal random exposure or a coded exposure pattern that takes hardware implementation constraints into account is used.

For example, Non-Patent Document 5 discloses a method of optimizing the exposure pattern on the assumption of an ideal sensor in which random exposure can be controlled for each pixel (hereinafter, a completely random sensor). Specifically, in Non-Patent Document 5, simulation experiments are performed in which the exposure time of each pixel is divided into 16 and a random pattern of 4 pixels × 4 pixels × 16 is repeated to form a coded exposure pattern of 8 pixels × 8 pixels × 16. In Non-Patent Document 1, pseudo-random coded exposure is realized using a prototype CMOS sensor that can control exposure for each pixel. Because of hardware constraints, the demonstration experiment used an 8 × 8 coded exposure pattern in which exposure is performed simultaneously along columns and rows.
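The tiling described for the completely random sensor can be sketched as follows. The sketch assumes a uniformly random binary 4 × 4 pattern for each of 16 sub-exposures, repeated twice in each spatial direction to give 8 × 8 × 16; the helper names and the uniform sampling are assumptions for illustration and do not reproduce the optimization of Non-Patent Document 5.

```python
import random

def random_base_pattern(size=4, steps=16, seed=0):
    """Draw a binary size x size pattern for each of `steps` sub-exposures."""
    rng = random.Random(seed)
    return [[[rng.randint(0, 1) for _ in range(size)]
             for _ in range(size)]
            for _ in range(steps)]

def tile_pattern(pattern, reps=2):
    """Repeat every time slice `reps` times vertically and horizontally."""
    return [[list(row) * reps for _ in range(reps) for row in frame]
            for frame in pattern]

base = random_base_pattern()   # 16 slices of 4 x 4
tiled = tile_pattern(base)     # 16 slices of 8 x 8
```

Each entry `tiled[t][r][c]` equals `base[t][r % 4][c % 4]`, so only the 4 × 4 × 16 base values are free parameters.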

In the following, two examples of hardware implementation constraints are described with reference to the drawings: a CMOS sensor assumed as a realistic sensor that can control exposure for each pixel (see Non-Patent Document 3), and a prototype CMOS sensor that can control exposure for each pixel (see Non-Patent Document 1). The CMOS sensor assumed in Non-Patent Document 3 is referred to as an SBE (Single Bump Exposure) sensor, and the prototype CMOS sensor of Non-Patent Document 1 is referred to as an RCE (Row Column-wise Exposure) sensor.

FIG. 2 is a diagram showing examples of exposure patterns that satisfy hardware implementation constraints. Part (a) of FIG. 2 shows an exposure pattern that can be implemented on the completely random sensor described above, part (b) of FIG. 2 shows an exposure pattern that can be implemented on the SBE sensor, and part (c) of FIG. 2 shows an exposure pattern that can be implemented on the RCE sensor.

FIG. 3 is a diagram showing an example of the structure of the SBE sensor. As shown in FIG. 3, the SBE sensor is an ordinary CMOS sensor with address lines added so that exposure can be controlled for each pixel, and it is a feasible sensor. An ordinary CMOS sensor is often equipped with a rolling shutter that reads out one row at a time by controlling the address of each row. In addition, because an ordinary CMOS sensor has no per-pixel buffer, non-destructive readout is not possible. In the SBE sensor, on the other hand, a circuit that determines the address of each column is built into an ordinary CMOS sensor, which makes per-pixel readout possible. FIG. 4 is a diagram showing an example of the number of exposures during one frame in the SBE sensor. As shown in FIG. 4, in the SBE sensor each pixel is exposed once during one frame. The exposure start and end timings shown are only an example and are random in each frame. As shown in part (b) of FIG. 2, Non-Patent Document 3 discloses, as an exposure pattern that can be implemented on the SBE sensor (hereinafter also referred to as a coded exposure pattern), a single-exposure coded exposure pattern in which each pixel is exposed once with an arbitrary start and end. Non-Patent Document 3 also reports, using a 7 × 7 coded exposure pattern, a simulation experiment and an experiment with an emulated implementation using reflective optics and Liquid Crystal on Silicon (LCoS).
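The single-exposure constraint of the SBE sensor can be made concrete with a short sketch that samples a pattern and checks it. The number of sub-exposures per frame (16 here), the uniform sampling of start and end, and the function names are assumptions for illustration; the text only fixes the 7 × 7 spatial size and the rule that each pixel is exposed once per frame.

```python
import random

def sbe_pattern(height, width, steps, seed=0):
    """Give every pixel one contiguous exposure with random start and end."""
    rng = random.Random(seed)
    pattern = [[[0] * width for _ in range(height)] for _ in range(steps)]
    for r in range(height):
        for c in range(width):
            start = rng.randrange(steps)        # first exposed sub-exposure
            end = rng.randrange(start, steps)   # last exposed one (inclusive)
            for t in range(start, end + 1):
                pattern[t][r][c] = 1
    return pattern

def is_single_bump(pattern):
    """Return True if every pixel has exactly one contiguous run of 1s."""
    steps = len(pattern)
    height, width = len(pattern[0]), len(pattern[0][0])
    for r in range(height):
        for c in range(width):
            seq = [pattern[t][r][c] for t in range(steps)]
            # Count 0 -> 1 transitions, treating the frame start as 0.
            rises = sum(1 for t in range(steps)
                        if seq[t] == 1 and (t == 0 or seq[t - 1] == 0))
            if rises != 1:
                return False
    return True

pattern = sbe_pattern(7, 7, 16)
```

A check such as `is_single_bump` is the kind of feasibility test an optimizer would need in order to keep candidate patterns implementable on this sensor type.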

FIG. 5 is a diagram showing an example of the structure of the RCE sensor. As shown in FIG. 5, the RCE sensor is a prototype CMOS sensor to which signal lines have been added to control exposure. FIG. 5 shows the upper-left portion of the RCE sensor. The RCE sensor has an 8 × 8 block structure. As additional signal lines for controlling exposure, it has eight Reset signal lines and eight Transfer signal lines; each Reset signal line is shared every eight columns, and each Transfer signal line is shared every eight rows. The coded exposure pattern is therefore the same for every block. The RCE sensor is also capable of non-destructive readout. FIG. 6 is a diagram showing the number of exposures during one frame in the RCE sensor. As shown in FIG. 6, in the RCE sensor each pixel can be exposed multiple times during one frame. However, the RCE sensor has only eight Reset signal lines and eight Transfer signal lines, and within one block each Reset signal line and each Transfer signal line is shared among the pixels of a column and of a row, respectively. For this reason, Non-Patent Document 1 performs a demonstration experiment using, as a coded exposure pattern that can be implemented on the RCE sensor, an 8 × 8 coded exposure pattern in which exposure is performed simultaneously along columns and rows.
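The block structure of the RCE sensor can be sketched under a simplifying assumption. The text does not give the exact mapping from the Reset and Transfer signals to pixel exposure, so the rule below (a pixel is exposed in a sub-exposure when both its row line and its column line are active) is an assumption made purely for illustration, as are the function names and the choice of 16 sub-exposures.

```python
import random

BLOCK = 8  # block size of the RCE sensor stated in the text

def rce_block_pattern(steps, seed=0):
    """Sample one 8 x 8 pattern per sub-exposure from row and column signals."""
    rng = random.Random(seed)
    pattern = []
    for _ in range(steps):
        rows = [rng.randint(0, 1) for _ in range(BLOCK)]  # one value per row line
        cols = [rng.randint(0, 1) for _ in range(BLOCK)]  # one value per column line
        # Assumed rule: exposed only where both shared lines are active.
        pattern.append([[r & c for c in cols] for r in rows])
    return pattern

def expand_to_sensor(block_pattern, height, width):
    """Repeat the block pattern over the sensor, since lines are shared every 8 pixels."""
    return [[[frame[r % BLOCK][c % BLOCK] for c in range(width)]
             for r in range(height)]
            for frame in block_pattern]

block = rce_block_pattern(16)
sensor = expand_to_sensor(block, 16, 24)
```

Under this assumption each sub-exposure has only 16 free bits instead of 64, which shows why such a sensor cannot realize an arbitrary per-pixel random pattern.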

Because cameras actually used for compressed sensing are subject to various constraints of this kind, the coded exposure pattern must be optimized with the hardware implementation constraints taken into account.

The inventors of the present application have found that, by using a DNN (Deep Neural Network) to determine an optimum coded exposure pattern that satisfies the hardware implementation constraints, it is possible to reconstruct video of better image quality than video reconstructed from an image compressed with a coded exposure pattern determined by a conventional method (hereinafter, a compressed image). The inventors have also found that optimizing the decoder that reconstructs the video (moving image) from the compressed image at the same time as optimizing the coded exposure pattern improves the reconstruction quality still further over the conventional methods.

Next, conventional techniques for action recognition will be described. Action recognition once relied on 3D models. However, because it is difficult to construct an accurate 3D model from video, methods that use a global or local representation of motion are adopted instead in many cases. Global motion representations use the structure or shape of the human body, or a global representation of the motion. FIG. 7 is a diagram showing an example of a global motion representation, and FIG. 8 is a diagram showing another example. For example, as shown in FIG. 7, Non-Patent Document 11 discloses the Motion Energy Image (MEI), which accumulates binary images so as to encode motion information in a single image, and the Motion History Image (MHI), which represents time by luminance. As shown in FIG. 8, Non-Patent Document 12 discloses the Space-Time Volume (STV), in which the contours of an object are stacked along the time axis. These global approaches have difficulty capturing changes in viewpoint and appearance, and the STV cannot capture fine details. Local motion representations, on the other hand, build local features for action recognition by following the same procedure as general image recognition: interest point detection, local descriptor extraction, and local descriptor aggregation. For interest point detection in the spatio-temporal domain, Non-Patent Document 13 discloses Space-Time Interest Points (STIP), which extend the two-dimensional Harris corner detector to three dimensions. Non-Patent Document 15 discloses, as spatio-temporal local descriptors, the use of the Histograms of Oriented Gradients (HOG) described in Non-Patent Document 14 as a motion descriptor, and also discloses Histograms of Optical Flow (HOF), which encode pixel-level motion within a video clip. For descriptor aggregation, Bag-of-Features (BoF) (Non-Patent Document 16) was used, as in image recognition. For category classification in particular, the Support Vector Machine (SVM), which had been highly regarded in text classification, came to be applied to BoF vectors as well (Non-Patent Document 17).
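
As a non-limiting illustration, the MEI and the MHI mentioned above may be computed as in the following Python sketch. The function name `motion_energy_and_history`, the decay of one level per frame, and the premise that binary motion masks are already available (for example, from frame differencing) are assumptions introduced here for explanation and are not taken from Non-Patent Document 11.

```python
import numpy as np


def motion_energy_and_history(motion_masks, tau):
    """Compute an MEI and an MHI from binary motion masks.

    motion_masks: array of shape (T, H, W), True where motion was detected.
    tau: number of frames over which motion is remembered (assumed parameter).
    Returns (mei, mhi) evaluated at the last frame.
    """
    masks = np.asarray(motion_masks, dtype=bool)
    mhi = np.zeros(masks.shape[1:], dtype=np.int64)
    for mask in masks:
        # Moving pixels are stamped with tau; all others fade by one level.
        mhi = np.where(mask, tau, np.maximum(mhi - 1, 0))
    # The MEI is the union of the motion masks over the last tau frames.
    mei = masks[-tau:].any(axis=0)
    return mei, mhi


# A single point moving left to right across a 1x4 image over 4 frames.
masks = np.zeros((4, 1, 4), dtype=bool)
for t in range(4):
    masks[t, 0, t] = True
mei, mhi = motion_energy_and_history(masks, tau=3)
```

In this sketch, brighter MHI values correspond to more recent motion, while the MEI only records where motion occurred, which is consistent with the description that the MHI represents time by luminance.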

As convolutional neural networks (CNNs) attracted attention in the field of image recognition, CNNs came to be used in the field of video recognition as well. A CNN can be used at any of the stages of interest point detection, local descriptor extraction, and local descriptor aggregation, and is used not only to extract features from image frames but also in combination with optical flow, HOG, or the like. Non-Patent Document 18 discloses using RGB image frames and accumulated optical flow as appearance information and motion information, respectively, and further improving accuracy by combining the two streams. This greatly improved on the recognition accuracy previously obtained without deep learning on data sets such as UCF101 and HMDB51, and many studies based on the two-stream network have followed. Non-Patent Document 19, on the other hand, discloses a network (C3D: Convolution 3D) that models appearance and motion simultaneously by convolving in three dimensions. Although inferior to the two-stream 2D CNN, it achieves good accuracy using Sports-1M, a large-scale video data set. Non-Patent Document 20 discloses Kinetics, a large-scale, curated data set for action recognition. It shows that a relatively small 3D CNN, even without pre-training, can reach accuracy close to that of a 2D CNN pre-trained on ImageNet by training on this curated data. Non-Patent Document 21 discloses I3D, a 3D extension of GoogLeNet (Inception v1) (Non-Patent Document 22), which is a 22-layer 2D CNN, and achieves state-of-the-art accuracy by training on the Kinetics data set.

As described above, various techniques for action recognition have been disclosed. Here, a solution based on compressed sensing is considered for human action recognition in video surveillance systems, that is, for the data compression trade-off in video analysis. If compressed video sensing is simply applied, reconstructing a moving image from the coded exposure image makes it possible to perform video analysis in the same way as for an ordinary moving image.

The amount of information in a coded exposure image is its size W×H, whereas, letting T be the exposure time, the amount of information in the unknown moving image is W×H×T. Reconstruction therefore has to recover more information than was observed, so the solution cannot be determined uniquely. For this reason, Non-Patent Document 3 and Non-Patent Document 1 reconstruct the moving image with a sparse-optimization method that assumes the moving image can be represented by basis moving images and sparse coefficients for them, and that obtains a number of coefficients sufficiently smaller than the amount of observed information. Non-Patent Document 3 uses, as the sparse optimization method, the Orthogonal Matching Pursuit (OMP) algorithm, which performs L0-norm regularization. Sparse optimization is generally known to be an NP-hard problem. Reconstruction based on sparse optimization therefore takes an enormous amount of time and cannot be called practical. Non-Patent Document 4 assumes that a moving image can be represented by a Gaussian Mixture Model (GMM) and discloses a faster approach that reconstructs the moving image from the expected value of the posterior probability given the coded exposure image. Non-Patent Document 2 discloses a still faster reconstruction approach that uses deep learning: an autoencoder whose encoder is the coded exposure is trained, which yields a decoder that reconstructs a moving image from a coded exposure image.
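
The reason the unknown moving image cannot be determined uniquely may be illustrated, for example, by the following Python sketch. It assumes that the coded exposure image is the per-pixel sum of the frames weighted by a binary exposure pattern; the function name `coded_exposure`, the array sizes, and the pattern in which every pixel is exposed in exactly one frame are hypothetical choices made only for this illustration.

```python
import numpy as np


def coded_exposure(video, pattern):
    """Integrate a (T, H, W) video through a (T, H, W) binary exposure pattern.

    Each pixel accumulates light only in the frames where the pattern is 1,
    so T frames collapse into a single H x W coded exposure image.
    """
    return (video * pattern).sum(axis=0)


T, H, W = 4, 2, 3
rng = np.random.default_rng(0)

# Hypothetical pattern: every pixel is exposed in exactly one of the T frames.
exposed_frame = rng.integers(0, T, size=(H, W))
pattern = (np.arange(T)[:, None, None] == exposed_frame).astype(np.int64)

video_a = rng.integers(0, 256, size=(T, H, W))
# A second video that differs from video_a only at unexposed samples.
video_b = np.where(pattern == 1, video_a, 255 - video_a)

coded_a = coded_exposure(video_a, pattern)
coded_b = coded_exposure(video_b, pattern)

# W*H observed values against W*H*T unknowns.
num_observed, num_unknown = coded_a.size, video_a.size
```

Here `video_a` and `video_b` are different moving images that yield the same coded exposure image, which is why a prior such as sparsity, a GMM, or a learned decoder is needed to choose among the candidates.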

In an automatic surveillance system, suspicious human behavior within the camera's field of view must be detected or predicted and the operator alerted. The present inventors therefore focus on human action recognition as the video analysis. FIG. 9 is a diagram for explaining an outline of methods for human action recognition. For example, as shown in (b) of FIG. 9, when compressed video sensing is applied to human action recognition, the dimensionality is first raised by reconstructing a moving image from the coded exposure image and then lowered by estimating an action label from the moving image, which is inefficient. Because a coded exposure image contains temporal information, it is considered that action recognition can be performed directly, without going through reconstruction of the moving image, as shown in (a) of FIG. 9. The present inventors therefore found a method that uses deep learning to recognize human actions directly from a single coded exposure image captured by a coded exposure camera.

An outline of one aspect of the present disclosure is as follows.

With this, the exposure pattern is optimized by machine learning, so an exposure pattern appropriate to the type of image sensor can be determined.

For example, in the moving image processing method according to one aspect of the present disclosure, the exposure pattern may be information that specifies, for each frame constituting the compressed moving image, which of the pixels constituting the image sensor are used for exposure.

With this, a compressed moving image captured using the exposure pattern carries both temporal information, indicating in which of the plurality of frames each pixel was exposed, and spatial information, indicating the position of each pixel in the compressed moving image. Higher compression efficiency is therefore obtained than when a compressed moving image is generated by sacrificing only temporal information or only spatial information, as in conventional methods.

For example, the moving image processing method according to one aspect of the present disclosure may further include a reconstruction step of generating an output moving image by reconstructing the compressed moving image generated in the compression step, with the target being the unknown moving image that would be obtained if all pixels constituting the image sensor were exposed in all frames.

With this, an output moving image close to the unknown moving image, which would be obtained by imaging with an exposure pattern that is not thinned out temporally or spatially, is reconstructed from the compressed moving image.

For example, the moving image processing method according to one aspect of the present disclosure may further include a second machine learning step of training, prior to the reconstruction step, an artificial intelligence that receives the compressed moving image as input and outputs the output moving image, and in the reconstruction step, the output moving image may be generated using the artificial intelligence trained in the second machine learning step.

With this, the use of machine learning yields an output moving image reconstructed from the compressed moving image with high quality.

For example, in the moving image processing method according to one aspect of the present disclosure, the artificial intelligence may be a neural network including a sensing layer that generates the compressed moving image from the unknown moving image by an operation using weighting coefficients corresponding to the exposure pattern, and a reconstruction layer that generates the output moving image by reconstructing the compressed moving image generated by the sensing layer, and the first machine learning step and the second machine learning step may be performed by supervised learning of the artificial intelligence including the sensing layer and the reconstruction layer.

With this, the process of compressing the unknown moving image and the process of reconstructing the unknown moving image from the compressed moving image can be performed using a single artificial intelligence. Moreover, because the artificial intelligence optimizes, by supervised learning, both the exposure pattern for compressing the unknown moving image and the reconstruction algorithm for reconstructing the unknown moving image from the compressed moving image, it can learn efficiently from inputs and ground-truth data.

For example, the moving image processing method according to one aspect of the present disclosure may further include a motion detection step of identifying, from the compressed moving image generated in the compression step, the type of motion shown by the unknown moving image that would be obtained if all pixels constituting the image sensor were exposed in all frames, and generating motion information indicating the identified type of motion.

With this, motion information indicating the type of motion shown by the moving image can be generated directly from the temporal and spatial information contained in the compressed moving image, without reconstructing the moving image. Because the amount of data is smaller than in conventional approaches, the type of motion shown by the moving image can be identified quickly and accurately.

For example, the moving image processing method according to one aspect of the present disclosure may further include a third machine learning step of training, prior to the motion detection step, an artificial intelligence that receives the compressed moving image as input and outputs the motion information, and in the motion detection step, the motion information may be generated using the artificial intelligence trained in the third machine learning step.

With this, the use of machine learning allows motion to be detected from the compressed moving image with high quality.

For example, in the moving image processing method according to one aspect of the present disclosure, the artificial intelligence may be a neural network including a sensing layer that generates the compressed moving image from the unknown moving image by an operation using weighting coefficients corresponding to the exposure pattern, and a motion detection layer that generates the motion information from the compressed moving image generated by the sensing layer, and the first machine learning step and the third machine learning step may be performed by supervised learning of the artificial intelligence including the sensing layer and the motion detection layer.

With this, the process of compressing the unknown moving image and the process of generating, from the compressed moving image, motion information indicating the type of motion in the unknown moving image can be performed using a single artificial intelligence. Moreover, because the artificial intelligence optimizes, by supervised learning, both the exposure pattern for compressing the unknown moving image and the motion information generation algorithm for generating, from the compressed moving image, motion information indicating the type of motion in the unknown moving image, it can learn efficiently from inputs and ground-truth data.
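
As a non-limiting illustration, a forward pass through a sensing layer followed by a motion detection layer may be sketched in Python as follows. The present disclosure does not specify the structure of the motion detection layer at this point; the single fully connected layer with softmax, the function names, and the action labels used below are all hypothetical, and the weights are fixed by hand rather than obtained by supervised learning.

```python
import numpy as np


def sensing_layer(video, pattern):
    """Compress a (T, H, W) video into one (H, W) coded image using binary weights."""
    return (video * pattern).sum(axis=0)


def motion_detection_layer(coded_image, weights, bias):
    """Assumed classifier: one fully connected layer followed by softmax."""
    scores = weights @ coded_image.ravel() + bias
    scores = scores - scores.max()  # subtract the maximum for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


def recognize(video, pattern, weights, bias, labels):
    """Map a video to a motion label without reconstructing the video."""
    probs = motion_detection_layer(sensing_layer(video, pattern), weights, bias)
    return labels[int(np.argmax(probs))], probs


T, H, W = 4, 2, 2
labels = ("walk", "run", "wave")          # hypothetical motion types
video = np.ones((T, H, W))
pattern = np.zeros((T, H, W))
pattern[0] = 1.0                          # every pixel exposed in frame 0 only
weights = np.zeros((len(labels), H * W))  # untrained placeholder weights
bias = np.array([0.0, 2.0, 1.0])
label, probs = recognize(video, pattern, weights, bias, labels)
```

The point of the sketch is the data flow: the classifier input has H×W values rather than H×W×T, and no intermediate moving image is produced.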

For example, in the moving image processing method according to one aspect of the present disclosure, the image sensor may include a color filter that selectively passes light of a specific color for each of the pixels; in the compression step, the compressed moving image may be generated by performing imaging with repeated exposure in which the pattern of the color filter is varied temporally and spatially; in the first machine learning step, a color filter pattern that specifies how the pattern of the color filter varies temporally and spatially may additionally be optimized by machine learning prior to the compression step; and in the compression step, the compressed moving image may be generated using the color filter pattern obtained by the optimization in the first machine learning step.

With this, the color filter pattern best suited to each frame constituting an unknown color moving image can be selected and applied, so the data amount of the compressed moving image can be reduced while retaining sufficient information for reconstructing the moving image. The compression performance for unknown color moving images is thereby improved. Because machine learning optimizes not only the exposure pattern but also the color filter pattern, an appropriate exposure pattern and color filter pattern can be determined according to the type of image sensor that supports color imaging.

Furthermore, these general or specific aspects may be implemented as a system, an apparatus, a method, an integrated circuit, a computer program, or a non-transitory recording medium such as a computer-readable CD-ROM, or as any combination of a system, an apparatus, a method, an integrated circuit, a computer program, and a recording medium.

Embodiments will be described in detail below with reference to the drawings. Each of the embodiments described below shows a general or specific example. The numerical values, shapes, materials, constituent elements, arrangement and connection of the constituent elements, steps, order of the steps, and the like shown in the following embodiments are examples and are not intended to limit the scope of the claims. Among the constituent elements in the following embodiments, those not recited in the independent claims, which represent the broadest concept, are described as optional constituent elements.

In the following description, ordinal numbers such as first, second, and third may be attached to elements. These ordinal numbers are attached to elements in order to identify them and do not necessarily correspond to a meaningful order. These ordinal numbers may be interchanged, newly assigned, or removed as appropriate.

(Embodiment)
First, the moving image processing system according to the present embodiment will be described with reference to FIG. 10. FIG. 10 is a block diagram showing an example of the functional configuration of the moving image processing system 300 according to the embodiment.

As shown in FIG. 10, the moving image processing system 300 includes a moving image processing apparatus 100 and a camera 200. The camera 200 includes an image sensor in which pixels are arranged two-dimensionally, and generates a compressed moving image by performing imaging with repeated exposure that is thinned out temporally and spatially, using the exposure pattern output from the moving image processing apparatus 100. The camera 200 includes an exposure pattern holding unit 90, which acquires and holds the exposure patterns optimized by the moving image processing apparatus 100, and a compressed moving image generation unit 80, which generates the compressed moving image by selecting, from the plurality of exposure patterns held in the exposure pattern holding unit 90, an exposure pattern appropriate to the type of the image sensor and applying it to the image sensor.

The moving image processing apparatus 100 includes a communication unit 10, a control unit 20, a display unit 60, and an input unit 70. The control unit 20 includes a machine learning unit 30, a reconstruction unit 40, and a motion information generation unit 50.

The machine learning unit 30 trains an artificial intelligence (not shown) such as a neural network. The machine learning unit 30 may be divided into a plurality of functional units, such as first, second, and third units, according to what the artificial intelligence is trained to do. For example, a first machine learning unit (not shown) trains an artificial intelligence for optimizing the exposure pattern, which specifies the mode of exposure. A second machine learning unit (not shown) trains an artificial intelligence that receives the compressed moving image as input and outputs the output moving image. A third machine learning unit (not shown) trains an artificial intelligence that receives the compressed moving image as input and outputs motion information. The machine learning unit 30 trains the artificial intelligence using, for example, labeled training data. The exposure pattern is information that specifies, for each frame constituting the compressed moving image, which of the pixels constituting the image sensor are used for exposure. Optimizing the exposure pattern means selecting, from among a plurality of exposure patterns, an exposure pattern that satisfies the hardware implementation constraints and is best suited to each frame constituting the moving image.

The reconstruction unit 40 generates an output moving image by reconstructing the compressed moving image generated by the camera 200, with the target being the unknown moving image that would be obtained if all pixels constituting the image sensor were exposed in all frames.

The motion information generation unit 50 identifies, from the compressed moving image generated by the camera 200, the type of motion shown by the unknown moving image that would be obtained if all pixels constituting the image sensor were exposed in all frames, and generates motion information indicating the identified type of motion.

The communication unit 10 includes an output unit (not shown) that outputs, to the camera 200, the exposure pattern obtained through optimization by the first machine learning unit (not shown), and an acquisition unit (not shown) that acquires the compressed moving image generated by the camera 200. The communication unit 10 may communicate using wireless communication such as Wi-Fi (registered trademark) or wired communication such as Ethernet (registered trademark), or may communicate using Bluetooth (registered trademark), specified low-power radio, or visible light communication.

The display unit 60 is, for example, a display, and displays, for example, the moving image reconstructed by the reconstruction unit 40 in accordance with a user instruction entered through the input unit 70. The input unit 70 is, for example, a keyboard, a mouse, a touch panel, or a microphone, and accepts input of user instructions. The moving image processing apparatus 100 need not include the display unit 60 and the input unit 70; these may instead be provided, for example, in an apparatus other than the moving image processing apparatus 100. The moving image processing apparatus 100 may be implemented in the camera 200, implemented in a computer, or provided on a server connected via a communication network such as the Internet.

Next, the moving image processing method according to the embodiment will be described. FIG. 11 is a flowchart showing an example of the moving image processing method according to the embodiment.

As shown in FIG. 11, the camera 200 generates a compressed moving image (compression step S10). More specifically, the camera 200 generates the compressed moving image by performing imaging with repeated exposure that is thinned out temporally and spatially, using an image sensor in which pixels are arranged two-dimensionally. In the compression step, the compressed moving image is generated using the exposure pattern obtained by the optimization in the first machine learning step described later. Here, imaging with repeated exposure that is thinned out temporally and spatially means imaging in which, from among a plurality of exposure patterns that specify the mode of exposure for each pixel of the image sensor, the exposure pattern best suited to each of the plurality of frames constituting the moving image is selected and applied to that frame.

Next, the moving image processing apparatus 100 reconstructs a moving image from the compressed moving image generated by the camera 200 (reconstruction step S20). More specifically, the moving image processing apparatus 100 generates an output moving image by reconstructing the compressed moving image generated by the camera 200 in the compression step S10, with the target being the unknown moving image that would be obtained if all pixels constituting the image sensor were exposed in all frames. In the reconstruction step S20, the output moving image is generated using the artificial intelligence trained in the second machine learning step described later.

Prior to each of these two steps, the machine learning unit 30 may train the artificial intelligence used in that step. The learning steps in which the machine learning unit 30 trains the artificial intelligence, and the artificial intelligence itself, are described below.

FIG. 12 is a diagram showing an example of the learning steps for the artificial intelligence used in the compression and reconstruction steps. As shown in FIG. 12, the machine learning steps include a first machine learning step S1 of training, prior to the compression step S10, an artificial intelligence for optimizing the exposure pattern, and a second machine learning step S2 of training, prior to the reconstruction step S20, an artificial intelligence that receives the compressed moving image as input and outputs the output moving image, thereby optimizing the reconstruction algorithm. These steps may be performed simultaneously or separately, and in any order. Both steps may be performed, or only one of them; that is, they may be performed as needed.

Next, an example of the artificial intelligence used for compressing and reconstructing a moving image will be described with reference to FIG. 13. FIG. 13 is a diagram showing an example of the artificial intelligence used for compressing and reconstructing a moving image in the embodiment.

The artificial intelligence is configured as a neural network (NN), for example a deep neural network (DNN). The artificial intelligence includes a sensing layer (hereinafter also referred to as a compressed sensing layer) that generates a compressed moving image from an unknown moving image by an operation using weighting coefficients corresponding to the exposure pattern, and a reconstruction layer that generates an output moving image by reconstructing the compressed moving image generated by the sensing layer.

As shown in FIG. 13, the sensing layer generates a compressed moving image, that is, a coded moving image (Wp×Hp), by selecting, from among a plurality of binarized exposure patterns, the exposure pattern best suited to each frame constituting the moving image (Wp×Hp×T) captured by the camera 200 and applying it to that frame.

Here, the plurality of binarized exposure patterns include, for example, exposure patterns that can be implemented on a sensor capable of completely random exposure at every pixel, as shown in (a) of FIG. 2, and exposure patterns prepared in consideration of hardware implementation constraints, as shown in (b) and (c) of FIG. 2. Completely random exposure at every pixel means that, for each frame constituting the moving image, randomly selected pixels among all the pixels are exposed. For example, for every kind of hardware on which implementation is conceivable, all types of exposure patterns that satisfy the implementation constraints of that hardware are prepared in advance and stored in a memory (not shown). The artificial intelligence selects, from the exposure patterns stored in the memory (not shown), the exposure pattern best suited to each frame of the moving image captured by the camera 200 and has it output from the moving image processing apparatus 100 to the camera 200, whereby the moving image is optimally coded, that is, compressed.

Exposure patterns subject to hardware implementation constraints can be derived easily from the structure of the hardware. For example, when the hardware is an SBE sensor (see FIG. 3), it is desirable, in view of the dynamic range of the SBE sensor, that the exposure time be the same for all pixels. What the SBE sensor can control to improve compression performance is therefore the exposure start timing. Accordingly, for the SBE sensor, all types of exposure patterns are derived by enumerating every possible exposure start timing (start time (seconds) t = 0, 1, 2, ..., T - d), where d is the exposure time (see (b) of FIG. 2).
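As a non-limiting sketch, the enumeration of start timings for a single pixel of an SBE-type sensor may be written as follows. The sketch assumes that time is discretized into T frames and that each pixel is exposed once for d consecutive frames; the function name `sbe_pixel_patterns` is hypothetical.

```python
def sbe_pixel_patterns(total_frames, exposure_length):
    """List every single-pixel exposure sequence with one exposure of
    `exposure_length` consecutive frames, for start times 0 .. T - d."""
    if not 0 < exposure_length <= total_frames:
        raise ValueError("exposure length must satisfy 0 < d <= T")
    patterns = []
    for start in range(total_frames - exposure_length + 1):
        patterns.append([
            1 if start <= t < start + exposure_length else 0
            for t in range(total_frames)
        ])
    return patterns


patterns = sbe_pixel_patterns(total_frames=16, exposure_length=4)
print(len(patterns))   # 13 start timings: t = 0, 1, ..., 12
print(patterns[0])     # exposure during frames 0 to 3 only
```

Every enumerated sequence has the same exposure time d, which reflects the constraint that the exposure time be the same for all pixels.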

When the hardware is an RCE sensor (see FIG. 5), all combinations of the Reset signal (8 bits) and the Transfer signal (8 bits) are first generated. The exposure patterns produced by each of the generated signal combinations are then simulated, whereby all types of exposure patterns are derived (see (c) of FIG. 2).

As shown in FIG. 13, the reconstruction layer receives the compressed moving image created by the sensing layer at its input layer and outputs an output moving image from its output layer. More specifically, the reconstruction layer reconstructs a moving image composed of a plurality of frames from the single image (compressed moving image) that the sensing layer compressed using the optimum exposure pattern for each frame of the moving image. The reconstruction layer uses a DNN to learn the nonlinear mapping from the single input image to the moving image composed of a plurality of frames. As shown in FIG. 13, this DNN has four hidden layers and uses ReLU (Rectified Linear Unit) as the activation function. The DNN is trained so as to reduce the error between the training moving image and the reconstructed moving image. Because the reconstructed moving image is evaluated by the peak signal-to-noise ratio (PSNR), the mean squared error (MSE), which is closely related to PSNR, is used as the loss function.
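The relationship between MSE and PSNR referred to above may be illustrated by the following sketch. It uses the standard definition PSNR = 10·log10(peak² / MSE); the peak value of 255 (8-bit pixel values) and the flat-list data layout are assumptions made for illustration.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; a lower MSE gives a higher PSNR."""
    error = mse(reference, estimate)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / error)


print(mse([0, 10], [0, 0]))      # 50.0
print(psnr([0, 0], [255, 255]))  # 0.0 (the error equals the peak value)
```

Because PSNR decreases monotonically as MSE increases, minimizing MSE during training is equivalent to maximizing PSNR for a fixed peak value.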

As described above, the artificial intelligence (here, a DNN) that compresses and reconstructs a moving image includes the sensing layer and the reconstruction layer. The first machine learning step and the second machine learning step, which are the machine learning applied to this artificial intelligence, are performed by supervised learning using training moving images. The artificial intelligence in the present embodiment can thereby learn the optimization of the exposure pattern for compressed sensing and the optimization of the reconstruction algorithm of the decoder simultaneously.

Next, the machine learning procedure for the artificial intelligence (here, a DNN) used to compress and reconstruct a moving image will be described more specifically. FIG. 14 is a diagram showing an example of the configuration of the machine learning step in the embodiment.

As described above, the DNN consists of two layers: a sensing layer that optimizes the exposure pattern while satisfying hardware implementation constraints, and a reconstruction layer that reconstructs a moving image from an observed image, which is the compressed moving image. As shown in FIG. 14, the DNN is trained (that is, machine learning is performed) by, for example, the following procedure. Here, an example of machine learning in which the first learning step and the second learning step are carried out simultaneously is described.
(1) In the Forward pass, which proceeds from the sensing layer to the reconstruction layer, the sensing layer uses a binarized exposure pattern, that is, binarized weights, and the reconstruction layer uses continuous-valued weights.
(2) Gradients are obtained by error backpropagation.
(3) The continuous-valued weights of the entire network are updated using the obtained gradients.
(4) The updated continuous-valued weights are binarized in consideration of the hardware implementation constraints. The binarized weights used in the sensing layer are thereby updated.

Because actual compressed sensing uses a binarized exposure pattern, binarized weights are used in the Forward pass when training the neural network, whereas in the Backward pass the weights are relaxed to continuous values so that they are differentiable (Non-Patent Document 6). FIG. 15 is a diagram showing an example of updating the binarized exposure pattern. The weights for the next Forward pass are chosen from the plurality of binarized exposure patterns generated in advance: the pattern closest to the continuous-valued weights derived in the Backward pass is selected using the inner product, and the binarized exposure pattern is updated to that pattern.
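The training procedure (1) to (4) and the inner-product selection may be sketched, in a deliberately reduced form, as follows. This is not the DNN of the embodiment: the sketch assumes a single pixel, replaces the four-hidden-layer decoder with one linear weight per frame, and applies the gradient computed at the binarized weights directly to the continuous weights. The names `nearest_pattern` and `train`, the learning rate, and the toy data are all hypothetical.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def nearest_pattern(continuous, candidates):
    """Pick the pre-generated binary pattern with the largest inner product."""
    return max(candidates, key=lambda cand: dot(cand, continuous))


def train(samples, candidates, epochs=50, lr=0.01, seed=0):
    rng = random.Random(seed)
    frames = len(candidates[0])
    cont = [rng.uniform(0.0, 1.0) for _ in range(frames)]  # continuous sensing weights
    dec = [rng.uniform(0.0, 0.5) for _ in range(frames)]   # continuous decoder weights
    binary = nearest_pattern(cont, candidates)
    for _ in range(epochs):
        for x in samples:
            # (1) Forward: binarized sensing weights, continuous decoder weights.
            y = dot(binary, x)
            x_hat = [w * y for w in dec]
            # (2) Backward: gradients of the mean squared error.
            g_out = [2.0 * (xh - xt) / frames for xh, xt in zip(x_hat, x)]
            g_dec = [g * y for g in g_out]
            g_y = dot(g_out, dec)
            g_cont = [g_y * xt for xt in x]  # applied to the continuous weights (assumption)
            # (3) Update the continuous-valued weights of the whole network.
            dec = [w - lr * g for w, g in zip(dec, g_dec)]
            cont = [c - lr * g for c, g in zip(cont, g_cont)]
            # (4) Re-binarize under the hardware constraint.
            binary = nearest_pattern(cont, candidates)
    return binary, cont, dec


# SBE-like candidates for T = 4 and d = 2 (illustration only).
candidates = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
data_rng = random.Random(1)
samples = [[data_rng.random() for _ in range(4)] for _ in range(20)]
binary, cont, dec = train(samples, candidates)
print(binary)  # always one of the three implementable candidates
```

Because the binarized weights are always drawn from the pre-generated candidates, the pattern used in the Forward pass satisfies the hardware constraint at every iteration. The raw inner product is adequate here because all candidates contain the same number of exposed frames.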

[Experimental Examples]
[Experimental Example 1] Machine learning of the DNN
Machine learning of the DNN was performed by the following procedure. The size of the network (DNN) was determined from the size of the patch to be reconstructed. This experimental example used the prototype sensor described in Non-Patent Document 1, whose exposure can be controlled. The patch size was therefore set to Wp = Hp = 8 and T = 16 (Wp×Hp×T in FIG. 13). The reconstruction layer had four hidden layers. The same training data (training moving images) was used for all methods in the following experimental examples (Non-Patent Document 7). The training data came from a benchmark data set for video summarization: from 20 of its moving images, four 16-frame scenes each were extracted at random, and each scene was rotated (90°, 180°, 270°) and flipped. Using the 829,440 patches prepared in this way, the network (DNN), which simultaneously optimizes the exposure pattern and the decoder for reconstruction, was trained end-to-end. Training was run for 250 epochs with a mini-batch size of 200.

[Experimental Example 2] Simulation experiment
A simulation experiment of compression and reconstruction of moving images was performed assuming the SBE sensor and the RCE sensor as implementation targets. Fourteen moving images, each composed of 16 frames with a spatial resolution of 256×256, were used in the experiment. The quality of the reconstructed moving images was evaluated by the peak signal-to-noise ratio (PSNR). FIG. 16 is a diagram showing the results of Experimental Example 2.

In the simulation experiment targeting the SBE sensor, a compressed moving image was simulated as if captured by applying, to each frame of the moving image, an exposure pattern selected at random from the plurality of exposure patterns implementable on the SBE sensor illustrated in (b) of FIG. 2. The simulated compressed moving image was then input to the reconstruction network, and a moving image composed of 16 frames was reconstructed (Handcraft SBE in FIG. 16). In this case, machine learning of the DNN was performed for the decoder only, that is, only the learning that optimizes the reconstruction algorithm in the reconstruction layer was performed.

In contrast, as disclosed in the above embodiment, machine learning of the DNN was performed by the first learning step and the second learning step, a compressed moving image captured with the optimum exposure pattern selected for each frame of the moving image was simulated, and the moving image was reconstructed (Optimized SBE in FIG. 16).

The simulation experiment targeting the RCE sensor was performed in the same way as the simulation experiment targeting the SBE sensor, except that the plurality of exposure patterns implementable on the RCE sensor illustrated in (c) of FIG. 2 were used (Handcraft RCE and Optimized RCE in FIG. 16).

FIG. 16 shows the evaluation results for only three of the 14 moving images. The leftmost column of FIG. 16 shows one scene from each moving image used in the test. The top row of FIG. 16 is a moving image of a mail delivery vehicle with a letter mark on its side. The middle row of FIG. 16 is a moving image of a plurality of vehicles traveling. The bottom row of FIG. 16 is a moving image of a performer playing a cello.

As shown in FIG. 16, the simulation experiments targeting the SBE sensor and the RCE sensor gave the following evaluation results for each sensor.

Comparing the moving images in the top row (Car) of FIG. 16, the letter mark was reproduced more clearly in the moving images compressed and reconstructed using the DNN disclosed in the present embodiment (Optimized SBE and Optimized RCE) than when the exposure patterns were selected at random (Handcraft SBE and Handcraft RCE). The PSNR values were also higher for Optimized than for Handcraft, confirming that the reconstructed moving images had less noise and higher quality.

Likewise, for the moving images in the middle row (Traffic) and the bottom row (Cello) of FIG. 16, the moving images compressed and reconstructed using the DNN disclosed in the present embodiment (Optimized SBE and Optimized RCE) showed sharper subject contours, smaller differences in pixel values among the patches, and smoother, more continuous changes in pixel values at patch boundaries.

Although not shown, the reconstruction quality of Handcraft and Optimized was also evaluated by the average PSNR over the 14 reconstructed moving images. The average PSNR was 28.37 dB for Handcraft SBE, 29.32 dB for Optimized SBE, 27.58 dB for Handcraft RCE, and 27.82 dB for Optimized RCE. These results also confirm that reconstruction quality is better when the moving image is compressed with the optimum exposure pattern and reconstructed with the optimized reconstruction algorithm using the DNN disclosed in the present embodiment.
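The improvement implied by the reported averages can be worked out as in the following sketch. The PSNR values are those stated above; the helper names and the conversion to an MSE ratio (which follows from the standard PSNR definition for a fixed peak value) are added for illustration only.

```python
# Average PSNR in dB over the 14 reconstructed moving images, as reported above.
mean_psnr_db = {
    "Handcraft SBE": 28.37,
    "Optimized SBE": 29.32,
    "Handcraft RCE": 27.58,
    "Optimized RCE": 27.82,
}


def gain_db(sensor):
    """PSNR gain of Optimized over Handcraft for the given sensor."""
    diff = mean_psnr_db[f"Optimized {sensor}"] - mean_psnr_db[f"Handcraft {sensor}"]
    return round(diff, 2)


def mse_ratio(gain):
    """MSE of Optimized relative to Handcraft implied by a PSNR gain in dB."""
    return 10.0 ** (-gain / 10.0)


print(gain_db("SBE"))  # 0.95
print(gain_db("RCE"))  # 0.24
print(mse_ratio(gain_db("SBE")) < 1.0)  # True: a positive gain means a lower MSE
```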

From the above results, regardless of whether the SBE sensor or the RCE sensor was the implementation target, the quality of the moving image obtained by compression and reconstruction (hereinafter, reconstruction quality) was better in every scene when the compressed moving image was generated by applying the optimized exposure pattern. It was thus confirmed that the DNN disclosed in the present embodiment can find exposure patterns better suited to each frame of the moving image even under hardware implementation constraints and, at the same time, can optimize the reconstruction algorithm in the reconstruction layer, so that not only compression performance but also reconstruction performance is improved.

[Experimental Example 3] Real experiment
Next, the DNN disclosed in the present embodiment was applied to real hardware to compress and reconstruct a moving image. The sensor used can actually control exposure pixel by pixel (Patent Document 2: JP-A-2015-216594). The moving image was captured at 15 frames per second (fps), and 16 subframes were reconstructed, so the reconstructed moving image is equivalent to 240 fps. The moving image was captured using the exposure pattern optimized in the sensing layer, and the moving image was reconstructed from the compressed moving image using the reconstruction layer with the optimized reconstruction algorithm. Because the sensor uses a rolling shutter, the order in which the exposure pattern is applied differs with the position of the pixel on the sensor (here, row by row). The sensor size is 672×512 pixels. Because a moving image captured by the sensor is divided into 16 subframes, the order of the exposure patterns applied to each frame changes every 512÷16 = 32 rows. A separate decoder therefore has to be used for each row. FIG. 17 is a diagram showing the results of Experimental Example 3. The leftmost image in FIG. 17 is the moving image actually captured, the second image from the left is the compressed moving image, and the images on the right are three of the 16 subframes constituting the reconstructed moving image. A camera equipped with the sensor captured a metronome whose weight was swinging from side to side. As shown in FIG. 17, the swinging of the metronome weight could be confirmed from the images of three of the 16 subframes constituting the reconstructed moving image.
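The frame-rate and row arithmetic in this experimental example may be checked with the following sketch. The function names are hypothetical, and grouping rows into bands that share one exposure-pattern order is an interpretation of the 512÷16 = 32 statement, not a detail confirmed by the disclosure.

```python
def reconstructed_fps(capture_fps, subframes):
    """Equivalent frame rate when each captured frame yields `subframes` subframes."""
    return capture_fps * subframes


def rows_per_band(sensor_rows, subframes):
    """Number of consecutive rows assumed to share one exposure-pattern order."""
    return sensor_rows // subframes


def band_index(row, sensor_rows=512, subframes=16):
    """Index of the row band (0 .. subframes - 1) containing the given sensor row."""
    if not 0 <= row < sensor_rows:
        raise ValueError("row is outside the sensor")
    return row // rows_per_band(sensor_rows, subframes)


print(reconstructed_fps(15, 16))  # 240
print(rows_per_band(512, 16))     # 32
print(band_index(33))             # 1 (rows 32 to 63 form the second band)
```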

(Summary)
The present embodiment disclosed a moving image processing method that, taking into account the hardware implementation constraints on exposure patterns in compressed video sensing, uses a DNN as an example of artificial intelligence to simultaneously optimize the selection of an exposure pattern suited to each frame of a moving image and the reconstruction algorithm of the decoder that reconstructs the moving image from the compressed moving image. In the above experimental examples, moving images were compressed and reconstructed using the moving image processing method of the present disclosure under the implementation constraints of two sensors, the SBE sensor and the RCE sensor. The results of the experimental examples confirmed that, with the moving image processing method of the present disclosure, the optimum exposure pattern for each frame of the captured moving image can be selected regardless of the constraints of the hardware on which the method is implemented, and that the quality of the unknown moving image reconstructed from the compressed moving image is also improved.

(Modification 1)
The embodiment described a method of compressing and reconstructing a monochrome moving image; the moving image processing method according to Modification 1 is applicable to color moving images. The following description focuses on the differences from the embodiment.

The moving image processing system according to Modification 1 differs from the embodiment in that the camera includes an image sensor with a color filter that selectively passes light of a specific color for each pixel.

In the moving image processing method according to Modification 1, the compression step generates the compressed moving image by capturing with exposure in which the pattern of the color filter is varied temporally and spatially. In the first machine learning step, a color filter pattern that specifies how the pattern of the color filter varies temporally and spatially is additionally optimized by machine learning prior to the compression step, and the compression step generates the compressed moving image using the color filter pattern obtained by the optimization in the first machine learning step. In other words, the moving image processing method according to Modification 1 differs from the embodiment in that the first learning step includes a step of optimizing the color filter pattern in addition to optimizing the exposure pattern, and in that the compression step selects each exposure pattern according to the optimum color filter pattern applied to each frame of the moving image.

FIG. 18 is a diagram showing an example of the flow of compressed sensing of a color moving image. As shown in FIG. 18, the flow of the moving image processing method according to Modification 1 is the same as that of the moving image processing method according to the embodiment, except that a color filter pattern suited to each frame of the moving image is selected from among a plurality of optimized color filter patterns.

FIG. 19 is a diagram showing examples of color filter patterns. (a) of FIG. 19 shows an example of a Bayer-pattern color filter, and (b) of FIG. 19 shows an example of a color filter optimized by the moving image processing method according to Modification 1.
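As a non-limiting sketch, sampling a color frame through a color filter pattern may be expressed as follows. The RGGB tile used here is one common Bayer arrangement and is an assumption, since the arrangement in (a) of FIG. 19 is not reproduced in the text; the function name `mosaic` and the data layout are hypothetical.

```python
CHANNEL_INDEX = {"R": 0, "G": 1, "B": 2}

# One common Bayer tile (assumed arrangement), repeated over the sensor.
BAYER_RGGB = [["R", "G"],
              ["G", "B"]]


def mosaic(rgb_frame, filter_tile):
    """Keep, at each pixel, only the channel passed by the color filter.

    rgb_frame[y][x] : (r, g, b) tuple
    filter_tile     : 2-D tile of "R"/"G"/"B", tiled periodically
    """
    tile_h = len(filter_tile)
    tile_w = len(filter_tile[0])
    return [
        [pixel[CHANNEL_INDEX[filter_tile[y % tile_h][x % tile_w]]]
         for x, pixel in enumerate(row)]
        for y, row in enumerate(rgb_frame)
    ]


frame = [[(1, 2, 3), (1, 2, 3)],
         [(1, 2, 3), (1, 2, 3)]]
print(mosaic(frame, BAYER_RGGB))  # [[1, 2], [2, 3]]
```

An optimized color filter pattern would correspond to a different tile passed as `filter_tile`; the sampling operation itself is unchanged.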

FIG. 20 is a diagram showing examples of exposure patterns applied with different color filters. The left diagram in FIG. 20 is an example of an exposure pattern used when capturing a monochrome moving image. The center diagram in FIG. 20 is an example of an exposure pattern used when capturing a color moving image with a Bayer-pattern color filter. The right diagram in FIG. 20 is an example of an exposure pattern used when capturing a color moving image with an optimized color filter pattern. In Modification 1, selecting a color filter pattern suited to each frame of the captured moving image improves the performance of compressing and reconstructing color moving images.

[Experimental Example 4]
In this experimental example, a simulation experiment of compression and reconstruction of color moving images was performed assuming the RCE sensor as the implementation target. Twenty-five moving images, each composed of 16 frames with a spatial resolution of 256×256, were used in the experiment. The quality of the reconstructed moving images was evaluated by the peak signal-to-noise ratio (PSNR). FIG. 21 is a diagram showing an example of the results of Experimental Example 4. The leftmost diagram in FIG. 21 (Original Video) is an example of the color moving images used in the experiment. The content and results of the simulation experiment are described below using, as an example, the result obtained for the Original Video shown at the left end of FIG. 21.

First, in the simulation experiment, a compressed moving image was simulated as if captured by applying, to each frame of the color moving image, a Bayer-pattern color filter and an exposure pattern selected at random from the plurality of exposure patterns implementable on the RCE sensor illustrated in (c) of FIG. 2. The simulated compressed moving image was then input to the reconstruction network, and a moving image composed of 16 frames was reconstructed (second diagram from the left in FIG. 21). In this case, machine learning of the DNN was performed for the decoder only, that is, only the learning that optimizes the reconstruction algorithm in the reconstruction layer was performed. A reconstructed moving image obtained using a DNN in which only the decoder is optimized in this way is referred to as "decoder only".

Next, a compressed moving image was simulated as if captured by applying, to each frame of the color moving image, a Bayer-pattern color filter and the optimum exposure pattern selected for that frame, and the moving image was reconstructed (third diagram from the left in FIG. 21). In this case, machine learning of the DNN optimized the exposure pattern and the reconstruction algorithm in the reconstruction layer through the first machine learning step and the second machine learning step. A reconstructed moving image obtained using a DNN in which the exposure pattern and the decoder are optimized in this way is referred to as "exposure pattern + decoder".

Next, a compressed moving image was simulated as if captured by applying, to each frame of the color moving image, the optimum color filter pattern and the optimum exposure pattern selected for that frame, and the moving image was reconstructed (rightmost diagram in FIG. 21). In this case, machine learning of the DNN optimized the color filter pattern and the exposure pattern in the first machine learning step and optimized the reconstruction algorithm in the reconstruction layer in the second machine learning step. A reconstructed moving image obtained using a DNN in which the color filter pattern, the exposure pattern, and the decoder are optimized in this way is referred to as "color filter + exposure pattern + decoder".

As shown in FIG. 21, comparing the reconstructed moving images for decoder only, exposure pattern + decoder, and color filter + exposure pattern + decoder, the PSNR value was 24.18 dB for decoder only, 23.92 dB for exposure pattern + decoder, and 23.34 dB for color filter + exposure pattern + decoder. It was therefore confirmed that, among these reconstructed moving images, the color filter + exposure pattern + decoder reconstruction had the least noise and high reconstruction quality.

In addition, among these reconstructed moving images, the one obtained using the DNN trained to optimize all of the color filter, the exposure pattern, and the decoder, as disclosed in Modification 1, showed clear subject colors and contours and smoother, more continuous changes in pixel values at patch boundaries.

Although not shown, the reconstruction quality for the case where only the decoder was optimized (hereinafter, decoder only), the case where the exposure pattern and the decoder were optimized (hereinafter, exposure pattern), and the case where the color filter pattern, the exposure pattern, and the decoder were optimized (hereinafter, color filter + exposure pattern) was evaluated by the average PSNR over the 25 reconstructed moving images. The average PSNR was 26.56 dB for decoder only, 26.43 dB for exposure pattern, and 26.76 dB for color filter + exposure pattern. These results also confirm that the reconstructed moving image obtained by compressing the color moving image with the optimum color filter pattern and exposure pattern and reconstructing it with the optimized reconstruction algorithm, according to the moving image processing method of Modification 1, has good reconstruction quality.

(Modification 2)
The embodiment and Modification 1 described methods of compressing and reconstructing a moving image; Modification 2 describes a method of detecting the motion of a subject from a compressed moving image. The following description focuses on the differences from the embodiment and Modification 1. FIG. 22 is a flowchart showing an example of the moving image processing method according to Modification 2. FIG. 23 is a diagram showing an example of the configuration of the machine learning step in Modification 2.

As shown in FIG. 22, in the moving image processing method according to Modification 2, the camera 200 (see FIG. 10) generates a compressed moving image (compression step S10). Next, the motion information generation unit 50 (see FIG. 10) performs motion detection (detection step S30): from the compressed moving image generated in the compression step S10, it identifies the type of motion shown by the unknown moving image that would be obtained if all the pixels of the image sensor were exposed in all frames, and generates motion information indicating the identified type of motion.

As shown in FIG. 23, the moving image processing method according to Modification 2 further includes a third machine learning step S3 in which, prior to the motion detection step S30, an artificial intelligence that takes a compressed moving image as input and outputs motion information is trained by machine learning. In the motion detection step S30, the motion information is generated using the artificial intelligence trained in the third machine learning step S3.

Although not illustrated, the artificial intelligence is a neural network that includes a sensing layer, which generates a compressed moving image from the unknown moving image by an operation using weighting factors corresponding to the exposure pattern, and a motion detection layer, which generates motion information from the compressed moving image generated by the sensing layer. The first machine learning step S1 and the third machine learning step S3 are performed by supervised learning of the artificial intelligence including the sensing layer and the motion detection layer.
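
The operation of the sensing layer can be illustrated with a minimal NumPy sketch. It assumes that the weighting factors form an array with the same shape as the clip and that the layer is a weighted sum over the time axis; the function name `sensing_layer` and the toy data are hypothetical and not taken from the disclosure.

```python
import numpy as np

def sensing_layer(video, pattern):
    """Collapse a clip of T frames into one coded exposure image.

    video:   array of shape (T, H, W), the fully exposed ("unknown") clip
    pattern: array of shape (T, H, W), weights in [0, 1]; with binary
             values, 1 means "pixel exposed in this frame", 0 means not
    """
    video = np.asarray(video, dtype=np.float64)
    pattern = np.asarray(pattern, dtype=np.float64)
    if video.shape != pattern.shape:
        raise ValueError("video and pattern must have the same shape")
    # Weighted sum over the time axis: each pixel integrates only the
    # frames in which the pattern exposes it.
    return (pattern * video).sum(axis=0)

# Toy example: 4 frames of 2x2 pixels, each pixel exposed in one frame only.
video = np.arange(16, dtype=np.float64).reshape(4, 2, 2)
pattern = np.zeros((4, 2, 2))
pattern[0, 0, 0] = pattern[1, 0, 1] = pattern[2, 1, 0] = pattern[3, 1, 1] = 1.0
coded = sensing_layer(video, pattern)
```

During training the weights would be optimized together with the motion detection layer; the sketch only shows the forward computation for a fixed pattern.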

The moving image processing method according to Modification 2 is expected to provide the following effects, as with the moving image processing methods according to the embodiment and Modification 1.

In the moving image processing method disclosed in Modification 2, as described above for the embodiment and Modification 1, a coded exposure image (a so-called compressed moving image) is captured using a sensor capable of randomly exposing each pixel of the image sensor. The data amount can be compressed by the length of the coded exposure, that is, by the number of frames captured with the exposure pattern applied. For example, when a compressed moving image is generated from a moving image of 16 frames by applying an optimum exposure pattern to every frame, the data amount of the compressed moving image is 1/16 of that of the original moving image. A reduction in communication volume and in the power consumed for transmission is therefore expected.

In ordinary compression methods, a moving image is compressed after it has been captured by a camera. In contrast, in the moving image processing method disclosed in Modification 2, as described above for the embodiment and Modification 1, each pixel of the image sensor is randomly exposed to capture a coded exposure image, so information sufficient for reconstructing the moving image can be acquired already compressed into a single frame, which is highly efficient. Compared with conventional methods, a reduction in costs such as the power required for compressing the moving image is therefore expected.

Furthermore, in the moving image processing method disclosed in Modification 2, as described above, information sufficient for reconstructing the moving image can be acquired compressed into a single frame, so the data amount can be greatly reduced compared with conventional methods. One example of a conventional method is the recent approach to recognizing, for example, the motion of a subject in a moving image. That approach improves recognition accuracy by extracting spatiotemporal features from the temporal and spatial information of a camera-captured moving image through three-dimensional convolution. However, the spatiotemporal information of a moving image obtained by three-dimensional convolution involves a large number of parameters and a large data amount. Consequently, to recognize (identify) the motion of a subject in a moving image from this spatiotemporal information, the neural network becomes a large-scale network with more layers than an ordinary neural network, and its number of parameters also increases. In addition, because more data is needed to train the network sufficiently, a large-scale data set is also required. Such a network therefore requires computational resources such as a large-scale GPU cluster during training, and the training time becomes enormous. In contrast, in the moving image processing method disclosed in Modification 2, a coded exposure image generated in the same way as in the embodiment and Modification 1 is used as the input to the neural network for identifying the motion of a subject in a moving image. Spatiotemporal features can then be extracted by two-dimensional convolution, without the three-dimensional convolution required by the conventional method. Compared with the conventional method based on three-dimensional convolution, the number of parameters required for machine learning decreases and the amount of data also becomes smaller, so more efficient machine learning can be expected and accuracy improves even with a small amount of training data.

As the network architecture for the moving image processing method according to Modification 2, consider the two-dimensional convolutional neural network (CNN) shown in Table 1.

As shown in Table 1, the two-dimensional CNN is composed of, for example, eight 3×3 two-dimensional convolution layers with stride 1, five 2×2 max pooling layers, and two fully connected layers. To simplify the calculation, the bias term is ignored; the number of parameters P of a given convolution layer is then expressed by the following equation (2), using the number of input channels C_in, the number of output channels C_out, and the kernel size K of that layer.

Accordingly, whereas Non-Patent Document 19 uses a 3×3×3 three-dimensional convolution kernel, which it reports as the best, the method disclosed in this modification uses a 3×3 two-dimensional convolution kernel, so the number of parameters in the convolution layers is approximately 1/3 of that of the method of Non-Patent Document 19.
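
The 1/3 figure can be checked with a short sketch. It assumes that the parameter count has the form P = C_in × C_out × K, with K counted as the number of kernel elements (9 for 3×3, 27 for 3×3×3); this form is an assumption consistent with the description above, and the channel counts and the function name `conv_params` are hypothetical.

```python
from math import prod

def conv_params(c_in, c_out, kernel_shape):
    """Weights of one convolution layer, bias ignored.

    Assumed form: C_in * C_out * (number of kernel elements).
    """
    return c_in * c_out * prod(kernel_shape)

# Hypothetical channel counts, chosen only for illustration.
c_in, c_out = 64, 128
p2d = conv_params(c_in, c_out, (3, 3))      # 2D kernel 3x3
p3d = conv_params(c_in, c_out, (3, 3, 3))   # 3D kernel 3x3x3
ratio = p2d / p3d
```

Under this assumption the ratio depends only on the kernel shapes, so it is the same for every pair of channel counts.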

The neural network of the method disclosed in this modification is trained and evaluated as follows. Assume classification over K kinds of actions C = {C_1, C_2, ..., C_K}. Let a moving image of length N showing an action a ∈ C be I = {I_1, I_2, ..., I_N}. When the length of the coded exposure pattern is L, the length of a video clip is also L, and the video clip is expressed by the following equation (3).

The coded exposure pattern is applied to each video clip, and the result is denoted X_i. For I, the network is trained using the pairs {(X_i, a)}. For evaluation, the output Y_i for each input X_i is averaged over the entire moving image, and the class with the maximum value is taken as the action label of the moving image. That is, the probability p(C_j | X_i) that the input X_i at a given time belongs to action C_j is expressed by the following equation (5).

The action label a^* estimated for I is then expressed by the following equation (6).

With M denoting the total number of moving images in the data set, the recognition accuracy S is calculated using the following equation (7).
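
The evaluation procedure described in prose above (average the per-clip outputs over a moving image, take the class with the maximum value, and count correct labels over M moving images) can be sketched as follows. The softmax used to turn outputs into probabilities is an assumption, as are the function names and the toy outputs.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; turns raw network outputs into class probabilities."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_video_label(clip_outputs):
    """Average per-clip class probabilities over a video and take the argmax."""
    probs = softmax(np.asarray(clip_outputs, dtype=np.float64))
    return int(probs.mean(axis=0).argmax())

def accuracy(predicted, actual):
    """Fraction of videos whose estimated label equals the true label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float((predicted == actual).mean())

# Hypothetical outputs for two videos, three clips each, K = 3 classes.
video_a = [[2.0, 0.1, 0.1], [1.5, 0.2, 0.0], [0.1, 0.3, 0.2]]
video_b = [[0.0, 0.1, 3.0], [0.2, 0.1, 2.5], [0.1, 2.0, 0.3]]
preds = [predict_video_label(video_a), predict_video_label(video_b)]
score = accuracy(preds, [0, 2])
```

Averaging before the argmax lets confident clips outweigh ambiguous ones, as with the third clip of each toy video.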

[Experimental Example 5] Evaluation experiment
A simulation experiment was performed in which actions were recognized directly from coded exposure images.

[1] Data set
The KTH Action data set (Non-Patent Document 23) was used for the simulation experiment. FIG. 24 is a diagram showing one scene from each action class in the KTH Action data set. As shown in FIG. 24, the data set is divided into six action classes: "walking", "jogging", "running", "boxing", "handwaving", and "handclapping". Each action class was captured with a camera in a fixed position, recording 25 subjects performing the six kinds of actions in four scenarios. The moving images of each action class average 4 seconds in length and are grayscale videos with an image resolution of 600 dpi. They were captured at 25 fps and downsampled to a spatial resolution of 160×120. Following the split of Non-Patent Document 23, the 25 subjects were divided into 8 for training the neural network, 8 for validation, and 9 for testing.

[2] Comparison methods
For training, each moving image in the above data set was divided into 16-frame video clips selected so that no frames overlap between consecutive clips, and each clip was randomly cropped to a spatial resolution of 112×112. Method (d) below identifies the action classes using a neural network (NN) trained with these video clips as input. Methods (a) to (c) below identify the action classes using neural networks trained with compressed moving images as input, obtained by applying a different compression process to the video clips in each method. In (a) to (d) below, the compression of a video clip was performed so that the result is 1/16 of the information amount of the video clip.
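
The clip preparation described above can be sketched in NumPy. The handling of leftover frames (they are dropped here) and the use of the same crop window for every frame of a clip are assumptions; the function names are hypothetical.

```python
import numpy as np

def split_into_clips(video, clip_len=16):
    """Split (N, H, W) frames into non-overlapping clips of clip_len frames.

    Trailing frames that do not fill a whole clip are dropped (an assumption).
    """
    video = np.asarray(video)
    n_clips = video.shape[0] // clip_len
    trimmed = video[: n_clips * clip_len]
    return trimmed.reshape(n_clips, clip_len, *video.shape[1:])

def random_crop(clip, size=112, rng=None):
    """Crop the same random size x size window from every frame of a clip."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = clip.shape[-2:]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return clip[..., top : top + size, left : left + size]

# A 100-frame video (4 seconds at 25 fps) at 160x120 (width x height).
video = np.zeros((100, 120, 160), dtype=np.uint8)
clips = split_into_clips(video)
crop = random_crop(clips[0], rng=np.random.default_rng(0))
```

With 100 frames and 16-frame clips, six clips are produced and the last four frames are unused.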

FIG. 25 is a diagram showing an example of the comparison methods in Experimental Example 5. FIG. 26 is a diagram showing an example of the exposure at a given pixel of the image input to the NN. Methods (a) to (d) are described more specifically below with reference to FIGS. 25 and 26.

(a) Coded exposure image
The video clips were compressed by the compression method of the present disclosure. More specifically, an optimum coded exposure pattern was applied to each frame constituting a video clip to generate a coded exposure image, which was used as the input of the CNN. The coded exposure pattern was 8×8 in size, and a random pattern that exposes the pixels of each video clip at a ratio of 1/16 was used. The coded exposure image has 1/16 the frame rate of the moving image, and the exposure time of each pixel varies according to the coded exposure pattern. For example, as shown in (a) of FIG. 26, the coded exposure image is a single-frame image, and a given pixel of that frame may be exposed, for example, several times within the frame. In the coded exposure pattern used in this experiment, the exposure time is equal to the exposure time for capturing one frame of the moving image.
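
One way to build a pattern with these properties is sketched below. It assumes that each pixel of the 8×8 tile is exposed in exactly one randomly chosen frame out of 16 and that the tile is repeated across the image plane; both points, and the function names, are assumptions rather than details stated in the disclosure.

```python
import numpy as np

def random_exposure_tile(tile=8, n_frames=16, rng=None):
    """Binary pattern of shape (n_frames, tile, tile).

    Each pixel of the tile is exposed in exactly one randomly chosen frame,
    i.e. for 1/n_frames of the clip.
    """
    rng = np.random.default_rng() if rng is None else rng
    chosen = rng.integers(0, n_frames, size=(tile, tile))
    frames = np.arange(n_frames)[:, None, None]
    return (frames == chosen[None, :, :]).astype(np.float64)

def tile_pattern(tile_pat, height, width):
    """Repeat the tile over the image plane and cut to (height, width)."""
    t = tile_pat.shape[1]
    reps_h = -(-height // t)  # ceiling division
    reps_w = -(-width // t)
    return np.tile(tile_pat, (1, reps_h, reps_w))[:, :height, :width]

rng = np.random.default_rng(0)
pattern = tile_pattern(random_exposure_tile(rng=rng), 112, 112)
clip = rng.random((16, 112, 112))
coded = (pattern * clip).sum(axis=0)  # one coded exposure image per clip
```

The coded image holds one value per pixel for the whole 16-frame clip, which matches the 1/16 data amount stated above.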

(b) Averaged image
As a simple way of compressing temporal information into one image, an averaged image obtained by averaging a video clip along the time axis was used as the input of the CNN. As shown in (b) of FIG. 26, a given pixel of the averaged image is exposed throughout the frame. The averaged image is therefore equivalent to one frame of a moving image with 1/16 the frame rate and 16 times the exposure time.

(c) One-frame image
For comparison with an image that carries no temporal information, a single-frame image was used. One of the 16 frames constituting a video clip was selected and used as the input of the CNN. As shown in (c) of FIG. 26, the one-frame image is equivalent to one frame of a moving image with 1/16 the frame rate and the same exposure time.
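
The two single-image baselines (b) and (c) can be sketched as follows. Which frame method (c) selects is not stated, so the `index` parameter is an assumption, and the function names and toy clip are hypothetical.

```python
import numpy as np

def averaged_image(clip):
    """Method (b): mean over the time axis of a (T, H, W) clip."""
    return np.asarray(clip, dtype=np.float64).mean(axis=0)

def single_frame(clip, index=0):
    """Method (c): one frame of the clip; the choice of frame is an assumption."""
    return np.asarray(clip, dtype=np.float64)[index]

# A bright pixel moving one column per frame across a 1x4 image.
clip = np.stack([np.eye(4)[t][None, :] for t in range(4)])  # shape (4, 1, 4)
avg = averaged_image(clip)         # motion smeared evenly over all columns
one = single_frame(clip, index=2)  # position at one instant only
```

In the toy clip the averaged image is uniform, so the direction of motion cannot be read from it, and the single frame records only where the pixel was at one instant.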

(d) Moving image
As a method corresponding to the conventional three-dimensional convolutional network (C3D: Convolution 3D), the video clips were used as input and training was performed with C3D (Non-Patent Document 19). C3D originally takes three RGB channels as input, but it was changed to a single grayscale channel and trained without pre-training. In a video clip, all pixels are exposed in every frame. Therefore, as shown in (d) of FIG. 26, a given pixel is exposed for the duration of each frame in all frames.

[3] Experimental results
Table 2 shows the results of the simulation experiment in which all action classes of the data set were identified using the comparison methods (a) to (d). The identification accuracy in Table 2 is the average of the identification accuracies over the action classes of the data set.

As shown in Table 2, when machine learning was performed with (a) the coded exposure image as the input of the CNN to identify the motion of the subject in the moving image, an identification accuracy very close to that of the conventional method (d) using the moving image was obtained. In contrast, the identification methods trained with (b) the averaged image or (c) the one-frame image as the input of the CNN must identify the spatiotemporal information of the moving image from spatial information or temporal information alone, and their identification accuracy was markedly lower than with (a) the coded exposure image as the input of the CNN.

FIG. 27 is a diagram showing the confusion matrices of comparison methods (a) to (d). The confusion matrices in FIG. 27 show that method (b), with the averaged image as the input of the CNN, had lower recognition accuracy for "handwaving" than method (a), with the coded exposure image as the input of the CNN, as did method (c), with the one-frame image as the input of the CNN. Moreover, the identification accuracy for "walking", "jogging", and "running" was markedly lower with methods (b) and (c) than with method (a), showing that these action classes are difficult for those methods to distinguish.

On the other hand, method (a), with the coded exposure image as the input of the CNN, showed the same tendency in identifying each of the above action classes as method (d), which corresponds to the conventional method with the moving image (here, the video clip) as the input of C3D. Furthermore, the recognition accuracy of method (a) for each action class was high, approaching that of method (d).

[Experimental Example 6]
In Experimental Example 5, 16-frame video clips were used as the images to be identified. In Experimental Example 6, a simulation experiment was performed in which the video clip length L (hereinafter, the number of frames) was varied to examine how the recognition accuracy of the action classes changes as the number of frames increases. FIG. 28 is a diagram showing the results of Experimental Example 6.

For method (d), C3D takes a 16-frame video clip as input. When a video clip shorter than 16 frames was used, each frame was repeated 16/L times to produce a 16-frame moving image, and the adjusted video clip was input to C3D. When a video clip longer than 16 frames was used, the number of input frames of C3D was changed to L. Note that, because of the resulting increase in the expressive power of the network and the shortage of data, the C3D results do not allow a fair comparison.
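
The adjustment of short clips for C3D can be sketched as below. The sketch reads "repeating the same frame 16/L times" as repeating each frame consecutively and assumes that L divides 16; the function name is hypothetical.

```python
import numpy as np

def pad_clip_by_repetition(clip, target_len=16):
    """Repeat each frame target_len / L times so a short clip fills target_len frames.

    Assumes L divides target_len (e.g. L = 4 or 8 for target_len = 16).
    """
    clip = np.asarray(clip)
    length = clip.shape[0]
    if target_len % length != 0:
        raise ValueError("clip length must divide target_len")
    return np.repeat(clip, target_len // length, axis=0)

short = np.arange(4).reshape(4, 1, 1)   # 4 frames of 1x1 pixels
padded = pad_clip_by_repetition(short)  # frames 0,0,0,0,1,1,1,1,...
```

Repeating frames keeps the temporal order of the clip while matching the fixed 16-frame input of the network.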

For method (b), with the averaged image as input, the identification accuracy improved slightly as the number of frames in the video clip increased from 4 to 8, but the recognition accuracy decreased when the number of frames exceeded 8. Method (c), with the one-frame image as input, showed the same tendency as method (b). It is therefore considered that, when the moving image to be identified is averaged along the time axis as in the averaged image of method (b), the temporal information of the moving image is progressively lost, so that beyond a certain number of frames the temporal information needed to identify the motion can no longer be obtained.

In contrast, with method (a), with the coded exposure image as input, the recognition accuracy improved as the number of frames L in the video clip increased up to 16. Because this is the same tendency as method (d), with the moving image (video clip) as input, the coded exposure image is considered to retain sufficient temporal information. However, the recognition accuracy of method (a) decreased when the video clip became longer than 16 frames. This is considered to be because the temporal information to be characterized increased to the point where the coded exposure pattern used for method (a) in this experiment could no longer represent it.

(Summary)
Modification 2 described an example of a moving image processing method that applies compressed sensing to the trade-off problem of action recognition in video surveillance systems and recognizes human actions directly from a single image captured by a coded exposure camera (a so-called compressed moving image) using a two-dimensional CNN, without generating a reconstructed moving image from the compressed moving image. To evaluate the effectiveness of the moving image processing method according to Modification 2, a simulation experiment using the KTH Action data set was performed in Experimental Example 5. The results of Experimental Example 5 show that, even though the amount of input data to the neural network is compressed to 1/16, the moving image processing method according to Modification 2 achieved a high identification accuracy approaching that obtained when human actions are identified using a three-dimensional CNN (for example, C3D) with the moving image as input.

(Other embodiments)
The moving image processing method and moving image processing apparatus according to one or more aspects of the present disclosure have been described above based on the embodiment, but the present disclosure is not limited to this embodiment. Forms obtained by applying various modifications conceivable to those skilled in the art to the embodiment, and forms constructed by combining components of different embodiments, may also be included within the scope of the one or more aspects of the present disclosure, as long as they do not depart from the gist of the present disclosure.

For example, the moving image processing system in the above embodiment was described as including one camera, but it may include two or more cameras. Because a plurality of captured moving images can then be acquired, abnormal behavior can be detected more quickly and more accurately from the plurality of moving images obtained.

Also, for example, some or all of the components of the moving image processing apparatus in the above embodiment may be configured as a single system LSI (Large Scale Integration). For example, the moving image processing apparatus may be configured as a system LSI having a communication unit and a control unit.

A system LSI is a super-multifunctional LSI manufactured by integrating a plurality of components on a single chip; specifically, it is a computer system including a microprocessor, a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. A computer program is stored in the ROM. The system LSI achieves its functions by the microprocessor operating in accordance with the computer program.

Although the term system LSI is used here, it may also be called an IC, an LSI, a super LSI, or an ultra LSI depending on the degree of integration. The method of circuit integration is not limited to LSI, and it may be realized by a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after the LSI is manufactured, or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used.

Furthermore, if circuit integration technology that replaces LSI emerges from advances in semiconductor technology or from another derived technology, the functional blocks may naturally be integrated using that technology. The application of biotechnology is one possibility.

One aspect of the present disclosure may be not only such a moving image processing apparatus but also a moving image processing method whose steps are the characteristic components of the moving image processing apparatus. One aspect of the present disclosure may also be a computer program that causes a computer to execute the characteristic steps of the moving image processing method, or a non-transitory computer-readable recording medium on which such a computer program is recorded.

In each of the above embodiments, each component may be configured with dedicated hardware or realized by executing a software program suited to that component. Each component may be realized by a program execution unit such as a CPU or processor reading and executing a software program recorded on a recording medium such as a hard disk or semiconductor memory. The software that realizes the moving image processing apparatus and the like of the above embodiment is the following program.

That is, this program causes a computer to execute a moving image processing method that includes a compression step of generating a compressed moving image by performing imaging with repeated exposure that is thinned out temporally and spatially, using an image sensor in which pixels are arranged two-dimensionally, and a first machine learning step of optimizing, by machine learning prior to the compression step, an exposure pattern that specifies the mode of exposure, wherein, in the compression step, the compressed moving image is generated using the exposure pattern obtained by the optimization in the first machine learning step.

Regardless of the constraints of the hardware on which it is implemented, the present disclosure makes it possible to capture a moving image by applying an exposure pattern appropriate to the type of hardware to each frame constituting the moving image, so that, for example, a compressed moving image with high compression quality can be generated while the moving image is being captured. The moving image processing apparatus of the present disclosure is therefore widely applicable to, for example, observation cameras and surveillance cameras.

10 communication unit
20 control unit
30 machine learning unit
40 reconstruction unit
50 motion information generation unit
60 display unit
70 input unit
80 compressed moving image generation unit
90 exposure pattern holding unit
100 moving image processing apparatus
200 camera
300 moving image processing system

Claims

A moving image processing method including:
a compression step of generating a compressed moving image by performing imaging with repeated exposure that is thinned out temporally and spatially, using an image sensor in which pixels are arranged two-dimensionally; and
a first machine learning step of optimizing, by machine learning prior to the compression step, an exposure pattern that specifies the mode of the exposure,
wherein, in the compression step, the compressed moving image is generated using the exposure pattern obtained by the optimization in the first machine learning step.

The exposure pattern is information designating pixels used for exposure among the pixels forming the image sensor for each frame forming the compressed moving image.
The moving image processing method according to claim 1.

The method further includes a reconstruction step of generating an output moving image by reconstructing, from the compressed moving image generated in the compression step, the unknown moving image that would be obtained if all the pixels forming the image sensor were exposed in all frames.
The moving image processing method according to claim 1.

The method further includes, prior to the reconstruction step, a second machine learning step of training, by machine learning, an artificial intelligence that takes the compressed moving image as input and outputs the output moving image.
In the reconstruction step, the output moving image is generated using the artificial intelligence trained in the second machine learning step.
The moving image processing method according to claim 3.

The moving image processing method according to claim 4,
wherein the artificial intelligence is a neural network including: a sensing layer that generates the compressed moving image from the unknown moving image by a calculation using weighting factors corresponding to the exposure pattern; and a reconstruction layer that generates the output moving image by reconstruction from the compressed moving image generated by the sensing layer, and
wherein the first machine learning step and the second machine learning step are performed by supervised learning of the artificial intelligence including the sensing layer and the reconstruction layer.
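
The relationship between the sensing layer and the reconstruction layer can be sketched as a forward pass in Python with NumPy. This is a minimal sketch under stated assumptions, not the claimed network: the threshold used to turn real-valued sensing weights into a 0/1 exposure pattern, the single dense reconstruction layer, the mean squared error loss, and all function and variable names are hypothetical.

```python
import numpy as np

def binarize(weights):
    """Map real-valued sensing weights to a 0/1 exposure pattern.
    Thresholding at zero is an assumption made for this sketch."""
    return (weights > 0).astype(np.float64)

def sensing_layer(video, weights):
    """Generate the compressed frame from the full video (T, H, W)."""
    return (video * binarize(weights)).sum(axis=0)

def reconstruction_layer(coded, w_rec, b_rec, shape):
    """Hypothetical single dense layer mapping the coded frame back to T frames."""
    return (w_rec @ coded.ravel() + b_rec).reshape(shape)

def mse_loss(output, target):
    """Supervised loss: the fully exposed video serves as the teacher signal."""
    return float(np.mean((output - target) ** 2))

rng = np.random.default_rng(0)
T, H, W = 4, 3, 3
video = rng.random((T, H, W))                  # training sample (teacher data)
sense_w = rng.standard_normal((T, H, W))       # learnable sensing weights
w_rec = 0.1 * rng.standard_normal((T * H * W, H * W))
b_rec = np.zeros(T * H * W)

coded = sensing_layer(video, sense_w)
output = reconstruction_layer(coded, w_rec, b_rec, video.shape)
loss = mse_loss(output, video)
```

In an actual training loop, both `sense_w` and the reconstruction weights would be updated to reduce the loss, so that the exposure pattern and the reconstruction are optimized together. Because the thresholding in `binarize` has zero gradient almost everywhere, some workaround (for example a straight-through estimator) would be needed; that choice is outside what the claims specify.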

The moving image processing method according to claim 1,
further comprising a motion detection step of identifying, from the compressed moving image generated in the compression step, the type of motion shown in an unknown moving image that would be obtained if all the pixels forming the image sensor were exposed in all frames, and generating motion information indicating the identified type of motion.

The moving image processing method according to claim 6,
further comprising, prior to the motion detection step, a third machine learning step of training, by machine learning, an artificial intelligence that takes the compressed moving image as input and outputs the motion information,
wherein, in the motion detection step, the motion information is generated using the artificial intelligence trained in the third machine learning step.

The moving image processing method according to claim 7,
wherein the artificial intelligence is a neural network including: a sensing layer that generates the compressed moving image from the unknown moving image by a calculation using weighting factors corresponding to the exposure pattern; and a motion detection layer that generates the motion information from the compressed moving image generated by the sensing layer, and
wherein the first machine learning step and the third machine learning step are performed by supervised learning of the artificial intelligence including the sensing layer and the motion detection layer.
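
The motion detection variant can be sketched in the same way. The example below, in Python with NumPy, assumes that the motion detection layer is a single softmax classifier over a small set of motion labels. The labels, the classifier form, and all names are hypothetical and serve only to show how motion information could be produced directly from the compressed frame without reconstructing the video.

```python
import numpy as np

# Hypothetical motion types; the claims do not enumerate any.
MOTION_LABELS = ("walking", "running", "waving")

def sensing_layer(video, pattern):
    """Generate the compressed frame from the full video (T, H, W)."""
    return (video * pattern).sum(axis=0)

def motion_detection_layer(coded, w_cls, b_cls):
    """Hypothetical softmax classifier over the flattened coded frame."""
    logits = w_cls @ coded.ravel() + b_cls
    logits = logits - logits.max()      # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
T, H, W = 4, 3, 3
video = rng.random((T, H, W))
pattern = (rng.random((T, H, W)) < 0.25).astype(np.float64)  # 0/1 exposure pattern
w_cls = rng.standard_normal((len(MOTION_LABELS), H * W))
b_cls = np.zeros(len(MOTION_LABELS))

coded = sensing_layer(video, pattern)
probs = motion_detection_layer(coded, w_cls, b_cls)
motion_info = MOTION_LABELS[int(np.argmax(probs))]
```

With supervised learning, each training video would carry a motion label, and the exposure pattern and classifier weights would be adjusted together against a classification loss such as cross-entropy.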

The moving image processing method according to claim 1,
wherein the image sensor includes a color filter that, for each of the pixels, selectively passes light of a specific color,
wherein, in the compression step, the compressed moving image is generated by capturing images with exposure in which the pattern of the color filter is changed temporally and spatially,
wherein, in the first machine learning step, a color filter pattern that specifies the temporal and spatial variation of the pattern of the color filter is optimized by machine learning prior to the compression step, and
wherein, in the compression step, the compressed moving image is generated using the color filter pattern obtained by the optimization in the first machine learning step.
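
One possible reading of the color filter pattern can be sketched in Python with NumPy. The sketch assumes three color channels, a pattern stored as an integer channel index per pixel and per sub-frame, and a compressed frame formed by summing the filtered sub-frames. The function name `compress_color` and this representation are hypothetical.

```python
import numpy as np

def compress_color(video_rgb, filter_pattern):
    """Form one coded frame through a color filter that varies in time and space.

    video_rgb:      (T, H, W, 3) array, the full-color scene per sub-frame.
    filter_pattern: (T, H, W) integer array with values 0, 1 or 2, giving the
                    color channel passed at each pixel in each sub-frame
                    (assumed representation).
    """
    video_rgb = np.asarray(video_rgb, dtype=np.float64)
    filter_pattern = np.asarray(filter_pattern)
    if not np.issubdtype(filter_pattern.dtype, np.integer):
        raise TypeError("filter_pattern must hold integer channel indices")
    if video_rgb.shape[:3] != filter_pattern.shape or video_rgb.shape[-1] != 3:
        raise ValueError("expected video (T, H, W, 3) and pattern (T, H, W)")
    if filter_pattern.min() < 0 or filter_pattern.max() > 2:
        raise ValueError("channel indices must be 0, 1 or 2")
    # Pick, for every pixel and sub-frame, the channel the filter passes.
    selected = np.take_along_axis(video_rgb, filter_pattern[..., None], axis=-1)[..., 0]
    return selected.sum(axis=0)

rng = np.random.default_rng(0)
T, H, W = 3, 2, 2
video_rgb = rng.random((T, H, W, 3))
filter_pattern = rng.integers(0, 3, size=(T, H, W))
coded_color = compress_color(video_rgb, filter_pattern)
```

In this representation, optimizing the color filter pattern in the first machine learning step would amount to choosing the entries of `filter_pattern`.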

A moving image processing apparatus used in a camera that generates a compressed moving image by capturing images with repeated exposures that are thinned out temporally and spatially, using an image sensor in which pixels are arranged two-dimensionally, the apparatus comprising:
a first machine learning unit that optimizes, by machine learning, an exposure pattern that specifies the mode of the exposure; and
an output unit that outputs the exposure pattern obtained by the optimization by the first machine learning unit.