JP2022129792A - Area conversion apparatus, area conversion method, and area conversion system

Info

Publication number: JP2022129792A
Application number: JP2021028615A
Authority: JP
Inventors: Mohit Chabra; Klinkigt Martin; Tomoaki Yoshinaga
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2021-02-25
Filing date: 2021-02-25
Publication date: 2022-09-06

Abstract

To provide area conversion means for appropriately changing an aspect ratio of an image while maintaining quality and semantic information of the image to facilitate highly accurate object detection. SOLUTION: An area conversion apparatus comprises: an image frame identification unit that identifies a target image frame to be subjected to area conversion; an area-of-interest detection unit that detects an area of interest in the target image frame and calculates reliability of a processing operation for extracting an area-of-interest image including the area of interest from the target image frame; an image frame processing unit that extracts the area-of-interest image using the processing operation if the reliability of the processing operation satisfies a predetermined reliability criterion; a background synthesizing means determining unit that determines background synthesizing means for converting the target image frame into a predetermined aspect ratio if the reliability of the processing operation does not satisfy the predetermined reliability criterion or if the area-of-interest image does not satisfy a predetermined aspect ratio criterion; and a background synthesizing unit that converts the target image frame to the predetermined aspect ratio using the background synthesizing means. SELECTED DRAWING: Figure 5

Description

The present disclosure relates to an area conversion apparatus, an area conversion method, and an area conversion system.

In recent years, with the spread of information technology, large numbers of sensors have been deployed throughout society and an extremely large amount of data has been accumulated. Against this background, various ways of making use of the accumulated image data are being studied. In particular, as visual content such as photographs, videos, and images increases, there is demand for machine learning models that can flexibly detect and accurately identify objects in that content.

Deep convolutional neural networks are known as one type of machine learning model that can recognize arbitrary objects and activities with high accuracy. A deep convolutional neural network represents the nerve cells (neurons) in the human brain and their connections, that is, a neural circuit, using a mathematical model called an artificial neuron. Object detection by such neural networks is applied in various fields such as autonomous driving, natural language processing, medical research, and robotics.

However, the structure of a conventional deep convolutional neural network does not have so-called shift invariance, and depending on the aspect ratio of the input image, the accuracy of object detection may decrease.

One means for changing the aspect ratio of an image is described, for example, in European Patent No. 1968008 (Patent Document 1).

Patent Document 1 describes "a method for content-aware image reconstruction, comprising: increasing the size of an image using image scaling to obtain a source image; generating an energy image from the source image according to an energy function; determining, from the energy image, one or more seams according to a minimization function such that each seam extending from one edge of the source image to the opposite edge has a minimum energy; and removing each seam from the source image to obtain a target image that preserves the content and rectangular shape of the source image, and reducing the source image to the size of the original image."

Patent Document 1: European Patent No. 1968008

In Patent Document 1, the aspect ratio is converted by reconstructing the image using seams, each of which is a set of connected pixels, and an energy function. The seams are found by minimizing the energy using dynamic programming.
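The dynamic-programming seam search attributed to Patent Document 1 can be sketched as follows. This is a minimal illustration of the general seam-carving idea, not the implementation of Patent Document 1: the function names `find_vertical_seam` and `remove_seam`, the list-of-rows image representation, and the 8-connected seam model are all assumptions made for the example.

```python
def find_vertical_seam(energy):
    """Return, for each row, the column of the minimum-energy vertical seam.

    `energy` is a list of rows of per-pixel energy values (hypothetical input;
    the source does not specify the energy function).
    """
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = minimum cumulative energy of any seam ending at pixel (r, c)
    cost = [list(energy[0])] + [[0.0] * cols for _ in range(rows - 1)]
    for r in range(1, rows):
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            cost[r][c] = energy[r][c] + min(cost[r - 1][lo:hi + 1])

    # Backtrack from the cheapest pixel in the last row to the first row.
    seam = [0] * rows
    seam[-1] = min(range(cols), key=lambda c: cost[-1][c])
    for r in range(rows - 2, -1, -1):
        c = seam[r + 1]
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        seam[r] = min(range(lo, hi + 1), key=lambda k: cost[r][k])
    return seam


def remove_seam(image, seam):
    """Delete one pixel per row, narrowing the image by one column."""
    return [row[:c] + row[c + 1:] for row, c in zip(image, seam)]
```

Each seam removal requires a full pass over the cost table, which is consistent with the computational burden noted for this approach.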

However, the image reconstruction means using seams and an energy function described in Patent Document 1 requires substantial computing resources for dynamic programming, and it may also introduce artifacts into the image and degrade its quality. Furthermore, Patent Document 1 determines semantic content based on the magnitude of the image gradient and the energy function; for large datasets and the like, it is difficult to define an energy function that adequately captures the semantic content of an image, and important semantic information may be lost.

Accordingly, an object of the present disclosure is to provide area conversion means that, by applying background synthesizing means that takes the semantic content of an image into account, appropriately changes the aspect ratio of the image while maintaining its quality and semantic information, thereby enabling highly accurate object detection.

To solve the above problem, one representative area conversion apparatus of the present disclosure includes: an image frame identification unit that identifies, from an image sequence, a target image frame to be subjected to area conversion; a region-of-interest detection unit that detects a region of interest in the target image frame and calculates the reliability of a processing operation for extracting a region-of-interest image including the region of interest from the target image frame; an image frame processing unit that, when the reliability of the processing operation satisfies a predetermined reliability criterion, extracts the region-of-interest image from the target image frame using the processing operation; a background synthesizing means determination unit that, when the reliability of the processing operation does not satisfy the predetermined reliability criterion, or when the region-of-interest image extracted from the target image frame does not satisfy a predetermined aspect ratio criterion, determines, from a plurality of candidate background synthesizing means, background synthesizing means for converting the target image frame to a predetermined aspect ratio by adding background pixels to or deleting background pixels from the target image frame; and a background synthesizing unit that uses the background synthesizing means to generate a final image in which the target image frame has been converted to the predetermined aspect ratio by adding or deleting background pixels.

According to the present disclosure, by applying background synthesizing means that takes the semantic content of an image into account, it is possible to provide area conversion means that appropriately changes the aspect ratio of the image while maintaining its quality and semantic information, thereby enabling highly accurate object detection.
Problems, configurations, and effects other than those described above will become clear from the following description of the modes for carrying out the invention.

FIG. 1 is a diagram showing an example in which the object detection accuracy of a neural network decreases when the aspect ratio is inappropriate.
FIG. 2 is a block diagram of a computer system for implementing embodiments of the present invention.
FIG. 3 is a diagram showing an example of the configuration of an area conversion system according to an embodiment of the present disclosure.
FIG. 4 is a diagram for explaining background synthesis, which is an example of area conversion according to an embodiment of the present disclosure.
FIG. 5 is a flowchart showing the flow of area conversion processing according to an embodiment of the present disclosure.
FIG. 6 is a diagram showing an example of processing by the region-of-interest detection unit according to an embodiment of the present disclosure.
FIG. 7 is a diagram showing an example of the reliability of a processing operation calculated by the region-of-interest detection unit according to an embodiment of the present disclosure.
FIG. 8 is a diagram showing another example of the reliability of a processing operation calculated by the region-of-interest detection unit according to an embodiment of the present disclosure.
FIG. 9 is a diagram showing an example of processing by the background synthesizing means determination unit according to an embodiment of the present disclosure.
FIG. 10 is a diagram showing an example of first background synthesizing means according to an embodiment of the present disclosure.
FIG. 11 is a diagram showing an example of second background synthesizing means according to an embodiment of the present disclosure.
FIG. 12 is a diagram showing an example of third background synthesizing means according to an embodiment of the present disclosure.
FIG. 13 is a diagram showing an example of fourth background synthesizing means according to an embodiment of the present disclosure.
FIG. 14 is a diagram for explaining a Gaussian process regression model according to an embodiment of the present disclosure.
FIG. 15 is a diagram showing an example of fifth background synthesizing means according to an embodiment of the present disclosure.
FIG. 16 is a diagram showing an example of sixth background synthesizing means according to an embodiment of the present disclosure.
FIG. 17 is a diagram showing an example of seventh background synthesizing means according to an embodiment of the present disclosure.
FIG. 18 is a diagram showing an example of eighth background synthesizing means according to an embodiment of the present disclosure.

Embodiments of the present invention are described below with reference to the drawings. The present invention is not limited by these embodiments. In the drawings, identical parts are denoted by identical reference numerals.

First, with reference to FIG. 1, an example in which the object detection accuracy of a neural network decreases when the aspect ratio is inappropriate will be described.

FIG. 1 is a diagram showing an example in which the object detection accuracy of a neural network decreases when the aspect ratio is inappropriate.

As described above, the structure of a conventional deep convolutional neural network does not have so-called shift invariance, and when the aspect ratio of the input image is inappropriate, the accuracy of object detection may decrease.
In general, a deep convolutional neural network performs object detection after converting the input image to a predetermined aspect ratio. Known means for converting the aspect ratio of an image include, for example, resizing and cropping.

In image resizing, a target area in an image is enlarged or reduced without considering the semantic content of the image. However, when an image is processed by resizing, there is no guarantee that the aspect ratio of the target area to be detected is maintained, and objects in the target area may be distorted or deformed. When a neural network then receives an image whose aspect ratio has been converted by resizing, its detection accuracy is limited by this distortion or deformation.
As an example, as shown in FIG. 1, when the target area 102 in the image 101 is converted to a predetermined aspect ratio by resizing 103, the aspect ratio of the object in the converted image 104 is not maintained and the object is deformed. This deformation may reduce the object detection accuracy of the neural network 105.
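The deformation caused by resizing can be quantified with a short calculation. The sketch below is illustrative only: the function name and the specific sizes (a 1920x1080 frame resized to a 416x416 network input, with a 100x200 object) are assumptions chosen for the example and do not appear in the source.

```python
def object_ratio_after_resize(obj_w, obj_h, img_w, img_h, out_w, out_h):
    """Width/height ratio of an object after the whole image is resized."""
    scale_x = out_w / img_w  # horizontal scaling factor
    scale_y = out_h / img_h  # vertical scaling factor
    return (obj_w * scale_x) / (obj_h * scale_y)


# Hypothetical numbers: the object starts with a width/height ratio of 0.5.
before = 100 / 200
after = object_ratio_after_resize(100, 200, 1920, 1080, 416, 416)
print(before, after)
```

Because the horizontal and vertical scaling factors differ whenever the frame and the target have different aspect ratios, the object's own ratio changes by the factor `scale_x / scale_y`; it is preserved only when the two factors are equal.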

In cropping, a region of a predetermined size is cut out of an image without considering the semantic content of the image. However, when an image is processed by cropping, the cropped region may include objects other than the object to be detected, or may omit part of the object. When a neural network then receives an image whose aspect ratio has been converted by cropping, its detection accuracy is limited by the presence of unnecessary objects and by deformation of the object to be detected.
As an example, as shown in FIG. 1, when the target area 112 in the image 111 is converted to a predetermined aspect ratio by cropping 113, the target area in the converted image 114 includes a plurality of objects. Because a plurality of objects exist in the target area, the object detection accuracy of the neural network 115 may decrease.

Therefore, as described above, the present disclosure applies background synthesizing means that takes the semantic content of an image into account, and can thereby provide area conversion means that appropriately changes the aspect ratio of the image while maintaining its quality and semantic information, enabling highly accurate object detection.

Next, a computer system 200 for implementing embodiments of the present disclosure will be described with reference to FIG. 2. The mechanisms and apparatus of the various embodiments disclosed herein may be applied to any suitable computing system. The major components of the computer system 200 include one or more processors 201, a memory 202, a terminal interface 203, a storage interface 204, an I/O (input/output) device interface 205, and a network interface 206. These components may be interconnected via a memory bus 210, an I/O bus 211, a bus interface unit 220, and an I/O bus interface unit 221.

The computer system 200 may include one or more general-purpose programmable central processing units (CPUs) 201A and 201B, collectively referred to as the processor 201. In some embodiments, the computer system 200 may include multiple processors, and in other embodiments, the computer system 200 may be a single-CPU system. Each processor 201 executes instructions stored in the memory 202 and may include an on-board cache.

In some embodiments, the memory 202 may include random access semiconductor memory, storage devices, or storage media (either volatile or non-volatile) for storing data and programs. The memory 202 may store all or part of the programs, modules, and data structures that implement the functions described herein. For example, the memory 202 may store an area conversion application 230. In some embodiments, the area conversion application 230 may include instructions or descriptions that execute the functions described below on the processor 201.

In some embodiments, the area conversion application 230 may be implemented in hardware via semiconductor devices, chips, logic gates, circuits, circuit cards, and/or other physical hardware devices instead of, or in addition to, a processor-based system. In some embodiments, the area conversion application 230 may include data other than instructions or descriptions. In some embodiments, a camera, sensor, or other data input device (not shown) may be provided so as to communicate directly with the bus interface unit 220, the processor 201, or other hardware of the computer system 200.

The computer system 200 may include a bus interface unit 220 that handles communication among the processor 201, the memory 202, a display system 240, and the I/O bus interface unit 221. The I/O bus interface unit 221 may be coupled to the I/O bus 211 for transferring data to and from various I/O units. The I/O bus interface unit 221 may communicate, via the I/O bus 211, with a plurality of I/O interface units 203, 204, 205, and 206, also known as I/O processors (IOPs) or I/O adapters (IOAs).

The display system 240 may include a display controller, a display memory, or both. The display controller can provide video data, audio data, or both to a display device 241. The computer system 200 may also include devices such as one or more sensors configured to collect data and provide that data to the processor 201.

For example, the computer system 200 may include a biometric sensor that collects heart rate data, stress level data, and the like; an environmental sensor that collects humidity data, temperature data, pressure data, and the like; and a motion sensor that collects acceleration data, motion data, and the like. Other types of sensors can also be used. The display system 240 may be connected to a display device 241 such as a standalone display screen, television, tablet, or handheld device.

The I/O interface units provide the ability to communicate with various storage or I/O devices. For example, the terminal interface unit 203 allows the attachment of user I/O devices 250, which include user output devices such as video display devices, speakers, and televisions, and user input devices such as keyboards, mice, keypads, touch pads, trackballs, buttons, light pens, or other pointing devices. By operating a user input device through a user interface, a user may enter input data and instructions into the user I/O device 250 and the computer system 200 and receive output data from the computer system 200. The user interface may, for example, be displayed on a display device, played through a speaker, or printed via a printer, by way of the user I/O device 250.

The storage interface 204 allows the attachment of one or more disk drives or direct access storage devices 260 (typically magnetic disk drive storage devices, although they may be an array of disk drives or other storage devices configured to appear as a single disk drive). In some embodiments, the storage device 260 may be implemented as any secondary storage device. The contents of the memory 202 may be stored in the storage device 260 and read from the storage device 260 as needed. The I/O device interface 205 may provide an interface to other I/O devices such as printers and fax machines. The network interface 206 may provide a communication path so that the computer system 200 and other devices can communicate with each other. This communication path may be, for example, a network 270.

In some embodiments, the computer system 200 may be a multi-user mainframe computer system, a single-user system, or a device such as a server computer that has no direct user interface and receives requests from other computer systems (clients). In other embodiments, the computer system 200 may be a desktop computer, portable computer, laptop, tablet computer, pocket computer, telephone, smartphone, or any other suitable electronic device.

Next, the configuration of an area conversion system according to an embodiment of the present disclosure will be described with reference to FIG. 3.

FIG. 3 is a diagram showing an example of the configuration of an area conversion system 300 according to an embodiment of the present disclosure. The area conversion system 300 shown in FIG. 3 is a system for providing area conversion means that appropriately changes the aspect ratio of an image while maintaining its quality and semantic information, thereby facilitating highly accurate object detection. As shown in FIG. 3, the area conversion system 300 according to the embodiment of the present disclosure mainly includes an image acquisition device 301, a storage unit 302, an area conversion device 303, and a client terminal 304. The functions of the functional units of the area conversion system 300 shown in FIG. 3 may be implemented by the computer system 200 described with reference to FIG. 2.
The image acquisition device 301, the storage unit 302, the area conversion device 303, and the client terminal 304 shown in FIG. 3 may be connected via any communication network (not shown in FIG. 3), such as the Internet or a LAN (Local Area Network).

The image acquisition device 301 is a functional unit for acquiring an image sequence to be analyzed by a neural network. The image acquisition device 301 may be any device configured to acquire images or video, such as an RGB camera, an infrared camera, or a LiDAR sensor. As an example, the image acquisition device 301 may be a surveillance camera installed to monitor a station platform. The image acquisition device 301 may store the acquired image sequence in the storage unit 302 and also transmit it to the area conversion device 303.
Here, an image sequence is a set of images including at least one image, and may be, for example, a video.

The storage unit 302 stores the image sequence acquired by the image acquisition device 301 and the analysis results of the neural network 330 described later. The storage unit 302 may be local storage such as a hard disk drive or solid state drive, or a distributed storage service such as cloud storage.

The area conversion device 303 is a device for appropriately changing the aspect ratio of an image while maintaining its quality and semantic information by applying background synthesizing means that takes the semantic content of the image into account. Here, semantic information is information for determining the objects and activities in an image. Background synthesizing means is means for converting a target image frame to a predetermined aspect ratio by adding or deleting background pixels.
As shown in FIG. 3, the area conversion device 303 includes a preprocessing unit 310, an area conversion unit 320, a neural network 330, and an output unit 340.
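The definition of background synthesizing means given above, reaching a predetermined aspect ratio by adding background pixels rather than by stretching or cropping the content, can be illustrated with a minimal sketch. The constant `fill` value is only a placeholder: it is not one of the background synthesizing means of the disclosure, and the function name and the list-of-rows image representation are assumptions made for the example.

```python
def pad_to_aspect_ratio(image, target_ratio, fill=0):
    """Pad a 2-D image (list of rows) so that width / height is close to
    `target_ratio`, leaving the original pixels untouched.

    `fill` stands in for synthesized background pixels (placeholder only).
    """
    height, width = len(image), len(image[0])
    if width / height < target_ratio:
        # Frame is too narrow: add background columns on the left and right.
        extra = round(height * target_ratio) - width
        left = extra // 2
        right = extra - left
        return [[fill] * left + list(row) + [fill] * right for row in image]
    # Frame is too wide (or already correct): add background rows.
    extra = round(width / target_ratio) - height
    top = extra // 2
    bottom = extra - top
    return ([[fill] * width for _ in range(top)]
            + [list(row) for row in image]
            + [[fill] * width for _ in range(bottom)])
```

Because only background pixels are added, the width/height ratio of every object in the original frame is unchanged, in contrast to resizing.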

The preprocessing unit 310 is a functional unit for performing preprocessing on the image sequence acquired by the image acquisition device 301. For example, as preprocessing, the preprocessing unit 310 may convert the image sequence acquired by the image acquisition device 301 into a predetermined data format, delete non-target images from the image sequence, or decrypt encrypted information.

The area conversion unit 320 is a functional unit for determining and executing the background synthesizing means described above. As shown in FIG. 3, the area conversion unit 320 includes an image frame identification unit 321, a region-of-interest detection unit 322, an image frame processing unit 323, a background synthesizing means determination unit 324, and a background synthesizing unit 325.

The image frame identification unit 321 is a functional unit that identifies, from the image sequence acquired by the image acquisition device 301, a target image frame to be subjected to area conversion.
The region-of-interest detection unit 322 is a functional unit that detects a region of interest in the target image frame identified by the image frame identification unit 321 and calculates the reliability of a processing operation for extracting a region-of-interest image including that region of interest from the target image frame.
The image frame processing unit 323 is a functional unit that, when the reliability of the processing operation for extracting the region-of-interest image from the target image frame satisfies a predetermined reliability criterion, extracts the region-of-interest image from the target image frame using the processing operation.
The background synthesizing means determination unit 324 is a functional unit that, when the reliability of the processing operation for extracting the region-of-interest image from the target image frame does not satisfy the predetermined reliability criterion, or when the region-of-interest image extracted from the target image frame does not satisfy a predetermined aspect ratio criterion, determines, from a plurality of candidate background synthesizing means, background synthesizing means for converting the target image frame to a predetermined aspect ratio by adding background pixels to or deleting background pixels from the target image frame.
The background synthesizing unit 325 is a functional unit that uses the background synthesizing means determined by the background synthesizing means determination unit 324 to convert the target image frame to the predetermined aspect ratio by adding background pixels to or deleting background pixels from the target image frame.

The neural network 330 is a neural network that receives and analyzes the image (final image) converted to an appropriate aspect ratio by the area conversion unit 320. For example, the neural network 330 may be a deep convolutional neural network configured to perform object detection on the image converted to an appropriate aspect ratio by the area conversion unit 320. The analysis result produced by the neural network 330 is stored in the storage unit 302 and transferred to the output unit 340.
In one embodiment, the neural network 330 includes an input layer, one or more intermediate layers, and an output layer as convolution layers. In the neural network 330, the N-th intermediate layer receives the values output from the (N-1)-th layer as input values and performs a convolution operation on them using a plurality of weight filters having weight coefficients, thereby generating the values to be output to the (N+1)-th layer. Through this convolution operation, the neural network 330 can extract image features and perform processing such as object detection.
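The layer-to-layer convolution described above can be sketched as follows. This is a minimal illustration and not the network used in the disclosure: the function names, the 3x3 input, and the filter values are assumptions, and the sliding-window operation is implemented, as is usual in deep learning, without flipping the filter.

```python
def conv2d_valid(image, kernel):
    """Slide one weight filter over a 2-D input (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def conv_layer(prev_output, filters):
    """Layer N: apply every weight filter to the output of layer N-1."""
    return [conv2d_valid(prev_output, k) for k in filters]


# Hypothetical 3x3 input and two 2x2 weight filters, for illustration only.
layer_input = [[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]]
filters = [
    [[1, 0], [0, 1]],  # sums the main diagonal of each window
    [[1, 1], [0, 0]],  # sums the top row of each window
]
feature_maps = conv_layer(layer_input, filters)  # one map per filter
```

Each filter yields one feature map, and the list of maps is what the next layer would receive as its input.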

The output unit 340 is a functional unit that outputs the analysis result generated by the neural network 330. The output unit 340 may, for example, transmit the analysis result to a predetermined notification destination via a communication network such as the Internet. For example, in one embodiment, the output unit 340 may transmit the analysis result generated by the neural network 330 to the client terminal 304.
The client terminal 304 is a device used by the client who requested the neural network analysis, and may be any device, such as a desktop computer, a notebook computer, or a mobile terminal such as a smartphone or tablet.

The area conversion system 300 configured as described above applies a background synthesizing means that takes the semantic content of the image into account. It can thereby provide an area conversion means that appropriately changes the aspect ratio of an image while maintaining the quality and semantic information of the image, and that facilitates highly accurate object detection.

Next, background synthesis, which is an example of area conversion according to an embodiment of the present disclosure, will be described with reference to FIG. 4.

FIG. 4 is a diagram for explaining background synthesis, which is an example of area conversion according to an embodiment of the present disclosure. As described above, by applying a background synthesizing means that takes the semantic content of the image into account, the present disclosure can appropriately change the aspect ratio of an image while maintaining the quality and semantic information of the image.

The background synthesizing means according to the embodiment of the present disclosure is a means for converting a target image frame to be subjected to area conversion to a predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame. The background synthesis according to the embodiment of the present disclosure may also be performed in consideration of the semantic content of the image. For example, as will be described later, in one embodiment, the background synthesis that converts the target image frame to the predetermined aspect ratio may be performed based on a region of interest or a saliency center that contains a large amount of semantic information useful for neural network analysis.

As an example of the background synthesizing means according to the embodiment of the present disclosure, as shown in FIG. 4, a target image frame 401 to be subjected to area conversion can be uniformly scaled to a desired size, a first synthesis region 402 consisting of background pixels can be added above the target image frame 401, and a second synthesis region 403 consisting of background pixels can be added below the target image frame 401. In one embodiment, the positions at which the first synthesis region 402 and the second synthesis region 403 are added may be determined based on the region of interest or the saliency center of the target image frame 401.
As a result, the aspect ratio of the image can be appropriately changed while maintaining semantic information useful for neural network analysis.
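The scale-then-pad example of FIG. 4 can be sketched as a small size calculation. This is a hedged illustration rather than the disclosed implementation: `letterbox_plan` and its parameters are hypothetical names, and placing the salient row at the vertical middle of the output is only one possible way to use the saliency center.

```python
def letterbox_plan(src_w, src_h, dst_w, dst_h, center_y=None):
    """Return (scaled_w, scaled_h, pad_top, pad_bottom) for vertical padding.

    The frame is scaled uniformly so that its width equals dst_w, then
    background rows are added above and below it to reach dst_h.
    center_y is an assumed normalized saliency-center height (0.0 to 1.0).
    """
    scale = dst_w / src_w
    scaled_h = round(src_h * scale)
    total_pad = dst_h - scaled_h
    if total_pad < 0:
        raise ValueError("frame is too tall; vertical padding does not apply")
    if center_y is None:
        pad_top = total_pad // 2  # split the background evenly by default
    else:
        # Try to put the salient row at the vertical middle of the output.
        wanted = dst_h / 2 - center_y * scaled_h
        pad_top = min(max(round(wanted), 0), total_pad)
    return dst_w, scaled_h, pad_top, total_pad - pad_top


# A 1920x1080 frame converted to a square 640x640 input (assumed sizes).
even = letterbox_plan(1920, 1080, 640, 640)
shifted = letterbox_plan(1920, 1080, 640, 640, center_y=0.25)
```

With the even split, the two synthesis regions have the same height. With a saliency center in the upper quarter of the frame, more background is added above the frame than below it.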

An example of background synthesis has been described above in order to explain the concept of background synthesis according to the embodiment of the present disclosure. As will be described later, however, the background synthesizing means actually applied is selected from several candidate background synthesizing means based on the characteristics of the image. This allows each target image frame to undergo the area conversion processing suited to it.

Next, the flow of area conversion processing according to the embodiment of the present disclosure will be described with reference to FIG. 5.

FIG. 5 is a flowchart showing the flow of area conversion processing 500 according to the embodiment of the present disclosure. The area conversion processing 500 shown in FIG. 5 performs area conversion that appropriately changes the aspect ratio of an image while maintaining the quality and semantic information of the image, thereby facilitating highly accurate object detection. It is executed by the functional units of the area conversion unit (for example, the area conversion unit 320 shown in FIG. 3).

First, in step S501, the image frame identification unit (for example, the image frame identification unit 321 shown in FIG. 3) identifies, from the image sequence acquired by the image acquisition device (for example, the image acquisition device 301 shown in FIG. 3), a target image frame to be subjected to area conversion. The target image frame here may be, for example, an image frame that includes an object to be analyzed by the neural network, selected from the plurality of frames forming the image sequence. The image frame identification unit may, for example, identify as the target image frame an image frame containing an object whose similarity to an object specified in advance by the user satisfies a predetermined similarity criterion.
Further, as will be described later, in one embodiment, the image frame identification unit may identify as the target image frame an image frame that includes an object of a class satisfying a predetermined influence criterion.

Next, in step S502, the preprocessing unit (for example, the preprocessing unit 310 shown in FIG. 3) performs texture filtering (texture smoothing) on the target image frame identified in step S501. Texture filtering here is a method of determining the texture color of a pixel from the colors of its neighbors when texture is mapped onto pixels. In other words, in the texture filtering process, the preprocessing unit divides the texture pixels into smaller pixel units and blends them. Possible texture filtering methods include, for example, Gaussian Blur, Median Blur, and Total Relative Variation Regularization.
Performing texture filtering on the target image frame can improve the accuracy of the region-of-interest detection, background synthesis, and other processing described later.
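Of the smoothing methods named above, a median blur is the simplest to show in full. The following is a minimal pure-Python sketch, assuming a single-channel image stored as a list of rows. The function name, the border handling, and the sample values are illustrative assumptions, and a production system would more likely call an image-processing library.

```python
from statistics import median


def median_blur(image, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 neighborhood.

    Border pixels use only the neighbors that fall inside the image.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            row.append(median(window))
        out.append(row)
    return out


# A flat gray patch with one noisy pixel (hypothetical values).
noisy = [[10, 10, 10],
         [10, 200, 10],
         [10, 10, 10]]
smoothed = median_blur(noisy)
```

The isolated outlier is replaced by the value of its neighbors. This suppresses fine texture before regions of interest are detected.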

Next, in step S503, the region-of-interest detection unit (for example, the region-of-interest detection unit 322 shown in FIG. 3) detects a region of interest in the target image frame processed in step S502 and calculates the reliability of a processing operation for extracting a region-of-interest image including that region of interest from the target image frame.
The region of interest here is a region of the target image frame that includes the object to be detected. For example, if the object to be detected is a "black car", the region of interest is a region of the target image frame that is likely to contain a black car.
The processing operation here is an operation for extracting the region-of-interest image from the target image frame, and may include, for example, cropping and carving.
The reliability here is a measure of the appropriateness of the processing operation described above and, as will be described later, is used to determine the area conversion means for changing the aspect ratio of the target image frame.
Furthermore, in step S503, the region-of-interest detection unit may calculate a saliency map of the target image frame and, based on this saliency map, determine a saliency center indicating the center of saliency. The saliency map here is a data structure representing the strength of saliency across the target image frame, and indicates the degree of "human interest" in each part of the target image frame. The saliency map may be obtained, for example, based on features such as the edges and colors of objects in the target image frame.
The details of the region-of-interest detection unit will be described with reference to FIG. 6, so their description is omitted here.
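One simple way to derive a saliency center from a saliency map is to take the saliency-weighted centroid. The sketch below assumes the map is a 2-D list of non-negative values. The centroid rule and the fallback to the geometric center are assumptions made for illustration, not the method stated in the disclosure.

```python
def saliency_center(saliency_map):
    """Return the (x, y) centroid of a 2-D map of non-negative saliency values."""
    total = sum(sum(row) for row in saliency_map)
    if total == 0:
        # No salient pixel: fall back to the geometric center of the frame.
        h, w = len(saliency_map), len(saliency_map[0])
        return (w - 1) / 2, (h - 1) / 2
    cx = sum(x * v for row in saliency_map for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(saliency_map) for v in row) / total
    return cx, cy


# Hypothetical 3x3 map whose salient pixels lie in the right-hand column.
example_map = [[0, 0, 0],
               [0, 0, 4],
               [0, 0, 4]]
center = saliency_center(example_map)
```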

Next, in step S504, the image frame processing unit (for example, the image frame processing unit 323 shown in FIG. 3) determines whether the reliability calculated in step S503 for the processing operation for extracting the region-of-interest image from the target image frame satisfies a predetermined reliability criterion. The reliability criterion here may be set, for example, by an administrator of the area conversion system 300, or may be set based on the results of past processing operations or the like.
If the reliability of the processing operation satisfies the predetermined reliability criterion, the processing proceeds to step S505. If it does not, the processing proceeds to step S507.

In step S505, the image frame processing unit executes the processing operation for extracting the region-of-interest image from the target image frame. For example, the image frame processing unit may execute cropping or carving of the image as the processing operation.
Cropping here is a means for obtaining a region-of-interest image showing only the region of interest by deleting unnecessary pixels (those not included in the region-of-interest image) from the periphery of the target image frame. Carving here is a means for obtaining a region-of-interest image showing only the region of interest by cutting out and deleting an arbitrary region (not limited to the periphery) of the target image frame. Carving is effective, for example, when a plurality of regions of interest exist and a region-of-interest image showing only those regions is to be obtained by deleting the unnecessary background pixels between them.
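The difference between the two processing operations can be illustrated on a toy frame. In this sketch, cropping keeps one rectangle and carving deletes whole columns from anywhere in the frame. Removing entire columns is a simplifying assumption made for illustration, since the disclosure does not specify how the carved regions are chosen, and all names are hypothetical.

```python
def crop(image, box):
    """Keep only the pixels inside box = (left, top, right, bottom), end-exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def carve_columns(image, keep_ranges):
    """Delete every column outside the given (start, end) ranges.

    Unlike cropping, the removed columns may lie in the middle of the frame,
    for example between two regions of interest.
    """
    return [
        [v for start, end in keep_ranges for v in row[start:end]]
        for row in image
    ]


# Hypothetical 2x6 frame: objects "A" and "B" separated by background ".".
frame = [list("A...B."),
         list("A...B.")]
cropped = crop(frame, (0, 0, 5, 2))              # the gap between A and B remains
carved = carve_columns(frame, [(0, 1), (4, 5)])  # the gap is removed
```

Cropping can only trim the border, so the background between the two objects survives. Carving removes it and leaves only the two regions of interest side by side.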

In step S506, the background synthesizing means determination unit (for example, the background synthesizing means determination unit 324 shown in FIG. 3) determines whether the region-of-interest image extracted from the target image frame satisfies a predetermined aspect ratio criterion. This aspect ratio criterion may be, for example, a criterion that defines the target aspect ratio of the image to be input to the neural network.
If the region-of-interest image extracted from the target image frame satisfies the predetermined aspect ratio criterion, the processing proceeds to step S509. If it does not, the processing proceeds to step S507.

In step S507, the background synthesizing means determination unit 324 determines, from a plurality of candidate background synthesizing means, a background synthesizing means that converts the target image frame to the predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame. In one embodiment, the background synthesizing means determination unit 324 may assign to each candidate a suitability score indicating how appropriate that background synthesizing means is, based on the characteristics of the target image frame (the size of the region of interest, the composition of the objects to be detected, and so on). It may then select the candidate whose suitability score satisfies a predetermined condition (for example, the candidate with the highest suitability score).
The details of the plurality of candidate background synthesizing means will be described later, so their description is omitted here.
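The score-then-select step of S507 can be sketched as follows. The candidate names, the scoring functions, and the `roi_area_ratio` feature are all hypothetical placeholders. The disclosure does not define how suitability scores are computed, so this only illustrates choosing the highest-scoring candidate.

```python
def choose_background_synthesis(candidates, frame_features, min_score=None):
    """Score every candidate and return (best_name, all_scores).

    candidates maps a method name to a scoring function that takes the
    frame features and returns a number (higher means more suitable).
    Returns (None, all_scores) if the best score is below min_score.
    """
    scores = {name: score(frame_features) for name, score in candidates.items()}
    best = max(scores, key=scores.get)
    if min_score is not None and scores[best] < min_score:
        return None, scores
    return best, scores


# Hypothetical candidates and scoring rules, for illustration only.
candidates = {
    "padding": lambda f: 1.0 - f["roi_area_ratio"],
    "background_removal": lambda f: f["roi_area_ratio"],
}
best, scores = choose_background_synthesis(candidates, {"roi_area_ratio": 0.2})
```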

Next, in step S508, the background synthesizing unit 325 converts the target image frame to the predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame using the background synthesizing means determined by the background synthesizing means determination unit 324 in step S507. As described above, in one embodiment, the background synthesis may be performed based on the region of interest or the saliency center. As a result, the aspect ratio of the image can be appropriately changed while maintaining semantic information useful for neural network analysis.
The case where the target image frame is converted to the predetermined aspect ratio has been described above as an example, but the present disclosure is not limited to this. For example, the background synthesizing means determined in step S507 may be applied to the region-of-interest image extracted in step S505 so that the region-of-interest image is converted to the predetermined aspect ratio.

Next, in step S509, the preprocessing unit 310 performs preprocessing for inputting the target image frame to the neural network. For example, in one embodiment, the preprocessing unit may, as preprocessing, convert the target image frame into a predetermined data format, compress the target image frame, or rotate the target image frame.

In step S510, the neural network (for example, the neural network 330 shown in FIG. 3) analyzes the target image frame and outputs an analysis result indicating the outcome of the analysis. As an example, the neural network may perform object detection on the image converted to an appropriate aspect ratio by the area conversion unit and generate an analysis result indicating the outcome of this object detection.

According to the area conversion processing 500 described above, applying a background synthesizing means that takes the semantic content of the image into account makes it possible to appropriately change the aspect ratio of an image while maintaining the quality and semantic information of the image. In addition, using an image whose aspect ratio has been changed appropriately in this way as the target of neural network analysis can improve the accuracy of object detection.

Next, the processing of the region-of-interest detection unit according to the embodiment of the present disclosure will be described with reference to FIGS. 6 to 8.

FIG. 6 is a diagram illustrating an example of the processing of the region-of-interest detection unit 322 according to the embodiment of the present disclosure.
As described above, the region-of-interest detection unit 322 according to the embodiment of the present disclosure is a functional unit that detects a region of interest 602 and a saliency center 603 in an input target image frame 601 and calculates a reliability 604 of the processing operation for extracting a region-of-interest image including the region of interest 602 from the target image frame.
The region of interest 602 here is a region of the target image frame 601 that includes the object to be detected. For example, if the object to be detected is a "black car", the region of interest 602 is a region of the target image frame 601 that is likely to contain a black car.
The saliency center 603 here is the center of saliency in the target image frame 601. In other words, the saliency center 603 indicates the coordinates of the center of the visually salient region in the target image frame 601. As an example, the saliency center 603 may be the coordinates of the center of the region of interest 602.

In one embodiment, the region-of-interest detection unit 322 may be a neural network model trained to detect the region of interest 602 in an image. The region-of-interest detection unit 322 may be trained, for example, to maximize the IoU (Intersection over Union) score between the predicted region of interest and the region of interest indicated by the ground truth.
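The IoU score mentioned above is the area of overlap between two regions divided by the area of their union. A minimal sketch for axis-aligned boxes follows. The `(left, top, right, bottom)` box representation is an assumption made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (left, top, right, bottom)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A score of 1.0 means the predicted region matches the ground truth exactly, and 0.0 means the two do not overlap at all.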

The reliability 604 is a measure of the appropriateness of processing operations such as cropping and carving, and is used to determine the area conversion means for changing the aspect ratio of the target image frame. As shown in FIG. 6, the reliability 604 may include a bounding box that defines the coordinates of the region extracted by a processing operation such as cropping or carving, a cropping affinity indicating the appropriateness of cropping, a carving affinity indicating the appropriateness of carving, and a saliency center confidence indicating the certainty of the saliency center 603.

In one embodiment, the region-of-interest detection unit 322 may perform a statistical analysis of the spatial distribution of regions of interest in target image frames. For example, in one embodiment, the region-of-interest detection unit 322 may estimate the region of interest of the target image frame currently being analyzed based on the spatial distribution and appearance frequency of the regions of interest detected in previously analyzed images.
As an example, if traffic lights frequently appeared on the left side in previously analyzed road images (for example, often enough to satisfy a predetermined appearance frequency criterion), the region-of-interest detection unit 322 may, on receiving a road image, estimate that a region of interest containing a traffic light exists on the left side of that image. The result of this statistical analysis of the spatial distribution of regions of interest may be used to detect the region of interest, and may also be used to determine the background synthesizing means.
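The traffic-light example can be sketched as a simple frequency count over past detections. Dividing the frame into three vertical strips, the function names, and the sample coordinates are all assumptions made for illustration. The disclosure does not specify how the spatial distribution is represented.

```python
from collections import Counter


def spatial_prior(past_centers, frame_width, bins=3):
    """Relative frequency of past region-of-interest centers per vertical strip."""
    if not past_centers:
        raise ValueError("at least one past detection is required")
    counts = Counter(
        min(int(x / frame_width * bins), bins - 1) for x, _y in past_centers
    )
    total = sum(counts.values())
    return [counts[b] / total for b in range(bins)]


def likely_strip(prior, min_frequency):
    """Index of the most frequent strip, or None if it is not frequent enough."""
    best = max(range(len(prior)), key=prior.__getitem__)
    return best if prior[best] >= min_frequency else None


# Hypothetical (x, y) centers of traffic lights found in past 640-pixel-wide frames.
history = [(50, 40), (80, 35), (120, 50), (500, 45), (90, 30)]
prior = spatial_prior(history, frame_width=640)
strip = likely_strip(prior, min_frequency=0.6)  # strip 0 is the left third
```

Here four of the five past detections fall in the left third of the frame. The left strip therefore passes the assumed frequency criterion and would be proposed as the likely location of the region of interest.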

FIG. 7 is a diagram showing an example of the reliability of the processing operation calculated by the region-of-interest detection unit according to the embodiment of the present disclosure.

As described above, the region-of-interest detection unit 322 according to the embodiment of the present disclosure calculates, for each processing operation such as cropping and carving, a reliability indicating the appropriateness of that processing operation. This reliability may include, for example, a cropping affinity indicating the appropriateness of cropping and a carving affinity indicating the appropriateness of carving. When the cropping affinity calculated by the region-of-interest detection unit 322 satisfies a predetermined cropping affinity criterion, the image frame processing unit described above processes the target image frame using cropping. When the carving affinity satisfies a predetermined carving affinity criterion, the image frame processing unit processes the target image frame using carving.
On the other hand, when the reliability of the processing operation does not satisfy the predetermined reliability criterion (that is, the cropping affinity does not satisfy the predetermined cropping affinity criterion and the carving affinity does not satisfy the predetermined carving affinity criterion), or when the region-of-interest image extracted from the target image frame does not satisfy the predetermined aspect ratio criterion, the background synthesizing means determination unit described above determines, from a plurality of candidate background synthesizing means, a background synthesizing means that converts the target image frame to the predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame.
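The branching described above can be summarized in a short decision function. This is a sketch under stated assumptions: each criterion is modeled as a simple threshold comparison, cropping is checked before carving, and the aspect ratio of the extracted image is passed in as a number. The names and the tolerance parameter are hypothetical.

```python
def select_conversion(crop_affinity, carve_affinity,
                      crop_threshold, carve_threshold,
                      roi_aspect, target_aspect, tolerance=0.01):
    """Return the list of operations to apply to one target image frame."""
    if crop_affinity >= crop_threshold:
        operation = "crop"
    elif carve_affinity >= carve_threshold:
        operation = "carve"
    else:
        # Neither processing operation is reliable enough.
        return ["background_synthesis"]
    if abs(roi_aspect - target_aspect) <= tolerance:
        return [operation]
    # Extraction is reliable, but the result has the wrong aspect ratio.
    return [operation, "background_synthesis"]
```

For instance, a frame with a high cropping affinity whose extracted region already has the target aspect ratio needs only the crop. A frame for which both affinities fall below their thresholds goes straight to background synthesis.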

Consider the target image frame 701 shown in FIG. 7 as an example. As described above, after receiving the target image frame 701, the region-of-interest detection unit 322 calculates the region of interest 702 and the saliency center 703 of the target image frame 701. The region-of-interest detection unit 322 then calculates the cropping affinity and the carving affinity based on the calculated region of interest 702 and saliency center 703.
As described above, the cropping affinity becomes lower as more semantic information (for example, pixels of the region of interest or pixels at the saliency center) would be lost if cropping were performed on the target image frame. In other words, the more information is predicted to be lost by cropping, the lower the cropping affinity.
Similarly, the carving affinity becomes lower as more semantic information (for example, pixels of the region of interest or pixels at the saliency center) would be lost if carving were performed on the target image frame. In other words, the more information is predicted to be lost by carving, the lower the carving affinity.
For example, when many objects to be verified exist in the region of interest 702, as shown in FIG. 7, performing cropping is likely to lose pixels of those objects. Cropping is therefore of low appropriateness, and the cropping affinity takes a value that does not satisfy the cropping affinity criterion.
Similarly, when the objects to be verified in the region of interest 702 are close together and the gaps between them are narrow, performing carving is likely to lose pixels of those objects. Carving is therefore of low appropriateness, and the carving affinity takes a value that does not satisfy the carving affinity criterion.
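One way to make "the more information is lost, the lower the affinity" concrete is to measure the fraction of region-of-interest pixels that an operation would keep. The sketch below is only an assumed proxy. The disclosure does not give a formula for the affinities, and the mask representation (2-D lists of 0 and 1) and the function name are hypothetical.

```python
def retained_fraction(roi_mask, kept_mask):
    """Fraction of region-of-interest pixels that survive an operation.

    roi_mask marks pixels belonging to a region of interest (1) or not (0);
    kept_mask marks pixels the operation would keep (1) or delete (0).
    """
    roi_total = sum(sum(row) for row in roi_mask)
    if roi_total == 0:
        return 1.0  # nothing of interest can be lost
    kept = sum(
        r and k
        for roi_row, kept_row in zip(roi_mask, kept_mask)
        for r, k in zip(roi_row, kept_row)
    )
    return kept / roi_total


# Hypothetical 2x2 example: one of three region-of-interest pixels is deleted.
roi = [[1, 1],
       [0, 1]]
kept_by_crop = [[1, 0],
                [1, 1]]
fraction = retained_fraction(roi, kept_by_crop)
```

A value near 1.0 would correspond to a high affinity, and lower values would indicate that the operation removes pixels of the objects of interest.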

FIG. 8 is a diagram showing another example of the reliability of the processing operation calculated by the region-of-interest detection unit 322 according to the embodiment of the present disclosure.

Consider the target image frame 801 shown in FIG. 8 as an example. As described above, after receiving the target image frame 801, the region-of-interest detection unit 322 calculates the region of interest 802 and the saliency center 803 of the target image frame 801. The region-of-interest detection unit 322 then calculates the cropping affinity and the carving affinity based on the calculated region of interest 802 and saliency center 803.
When there is only one region of interest 802 and it contains only one object, as in the target image frame 801, a region-of-interest image showing only the region of interest 802 can be extracted by cropping. Cropping is therefore highly appropriate, and the cropping affinity takes a value that satisfies the cropping affinity criterion.
Similarly, when there is only one region of interest 802 and it contains only one object, as in the target image frame 801, a region-of-interest image showing only the region of interest 802 can also be extracted by carving. Carving is therefore highly appropriate, and the carving affinity takes a value that satisfies the carving affinity criterion. In principle, however, cropping requires fewer computing resources than carving. When both cropping and carving satisfy their affinity criteria, it is therefore desirable to use cropping in order to save computing resources.

Consider the target image frame 811 shown in FIG. 8 as another example. As described above, after receiving the target image frame 811, the region of interest detection unit 322 calculates the regions of interest 812 and the saliency center 813 of the target image frame 811. The region of interest detection unit 322 then calculates a cropping affinity and a carving affinity based on the calculated regions of interest 812 and saliency center 813.
When, as in the target image frame 811, there are multiple regions of interest 812, cropping is likely to include unwanted background pixels lying between the regions of interest 812. Cropping is therefore poorly suited, and the cropping affinity takes a value that does not satisfy the cropping affinity criterion.
By contrast, when there are multiple regions of interest 812, carving can cut out each region of interest 812 separately and extract a region-of-interest image showing only the regions of interest 812 while excluding the unwanted background pixels between them. Carving is therefore highly suitable, and the carving affinity takes a value that satisfies the carving affinity criterion.
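The selection logic described in the two examples can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name, the numeric thresholds, and the reading of "satisfies the criterion" as "greater than or equal to a threshold" are all assumptions. The fallback to background synthesis reflects the case in which neither operation is reliable enough.

```python
def choose_processing_operation(crop_affinity, carve_affinity,
                                crop_threshold=0.5, carve_threshold=0.5):
    """Pick a processing operation from two affinity scores.

    The thresholds are hypothetical; the text only requires that each
    affinity be compared against its own criterion.
    """
    # Cropping is checked first because it is the cheaper operation.
    if crop_affinity >= crop_threshold:
        return "cropping"
    if carve_affinity >= carve_threshold:
        return "carving"
    # Neither operation meets its criterion: fall back to background synthesis.
    return "background_synthesis"
```

For a frame like 801 both affinities would be high and cropping is returned; for a frame like 811 only the carving affinity is high and carving is returned.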

As described above, the processing of the region of interest detection unit 322 explained with reference to FIGS. 6 to 8 makes it possible to determine the region of interest and the saliency center, which carry much of the semantic information of the target image frame, and to calculate the reliability of a processing operation for extracting a region-of-interest image containing the region of interest from the target image frame. This in turn makes it possible to determine an area conversion means that takes the semantic content of the image into account.

Next, the processing of the background synthesizing means determination unit according to the embodiment of the present disclosure will be described with reference to FIG. 9.

FIG. 9 is a diagram showing an example of processing by the background synthesizing means determination unit 324 according to the embodiment of the present disclosure. As described above, when the reliability of the processing operation for extracting the region-of-interest image from the target image frame does not satisfy a predetermined reliability criterion, or when the region-of-interest image extracted from the target image frame does not satisfy a predetermined aspect ratio criterion, the background synthesizing means determination unit 324 selects, from a plurality of candidate background synthesizing means, a background synthesizing means that converts the target image frame to a predetermined aspect ratio by adding or deleting background pixels.

The background synthesizing means determination unit 324 may be a machine learning model configured to calculate, based on the target image frame 901 and the region of interest 902, a suitability score indicating how appropriate each background synthesizing means candidate 903 is. For example, in one embodiment, the background synthesizing means determination unit 324 may be a neural network model, a support vector machine model, or a decision tree model.
As an example, after receiving the target image frame 901 and the region of interest 902 detected by the region of interest detection unit 322, the background synthesizing means determination unit 324 may assign a suitability score to each of the plurality of background synthesizing means candidates 903 based on characteristics of the target image frame 901 (such as the size of the region of interest 902 and the composition of the objects to be detected). The background synthesizing means determination unit 324 may then determine the candidate 903 whose suitability score meets a predetermined condition (for example, the candidate with the highest suitability score) as the background synthesizing means to be applied to the target image frame 901.
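The final selection step can be sketched as below, assuming the per-candidate scores have already been produced by some model. The function name and the dictionary layout are hypothetical; only the "pick the highest score" rule comes from the text.

```python
def select_background_synthesis(scores):
    """Return the name of the candidate with the highest score.

    `scores` maps a candidate name to its score, for example
    {"zero_padding": 0.2, "median_fill": 0.7}.
    """
    if not scores:
        raise ValueError("at least one candidate is required")
    return max(scores, key=scores.get)
```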

As described above, a background synthesizing means here is a means for converting the target image frame to a predetermined aspect ratio by adding or deleting background pixels. It may include existing techniques such as ZeroPadding, ReflectionPadding, and ReplicationPadding, and may include the first to eighth background synthesizing means described later.
Although several background synthesizing means are described in the present disclosure as examples, the present disclosure is not limited to them, and any background synthesizing means may be used.

According to the background synthesizing means determination unit 324 described above, an appropriate background synthesizing means that takes the semantic content of the image into account can be selected even when, for example, the target image frame cannot be changed to the target aspect ratio by a processing operation such as cropping or carving. This makes it possible to change the aspect ratio of the image appropriately while maintaining its quality and semantic information, which facilitates highly accurate object detection.

Next, the background synthesizing means according to the embodiment of the present disclosure will be described with reference to FIGS. 10 to 18.
Several background synthesizing means for converting the aspect ratio of an image are described below. These background synthesizing means may be used alone or in combination.

FIG. 10 is a diagram showing an example of the first background synthesizing means 1000 according to the embodiment of the present disclosure. The first background synthesizing means 1000 shown in FIG. 10 converts a target image frame to a target aspect ratio by adding background pixels to the periphery of the target image frame, and is executed by a background synthesizing unit (for example, the background synthesizing unit 325 shown in FIG. 3).

First, the background synthesizing unit generates channel images 1002 by splitting the target image frame 1001 into the channels that compose it. For example, if the target image frame 1001 is an RGB image, the background synthesizing unit generates channel images 1002 by decomposing the target image frame 1001 into the three channels R, G, and B.
The background synthesizing unit then calculates the median pixel value of each channel image 1002 (1010).

Next, after calculating the medians of the channel images 1002, the background synthesizing unit generates a synthesized background image 1003 that has the same number of channels as the target image frame 1001 and whose size satisfies a predetermined size criterion relative to the size of the target image frame 1001. The pixel value of each pixel of the synthesized background image 1003 may be, for example, the median of the corresponding channel image 1002.
The size criterion here may be any magnification, for example a size 20%, 30%, or 50% larger than the target image frame 1001.

Next, based on the region of interest and saliency center 1004 detected by the region of interest detection unit 322 described above, the background synthesizing unit generates a first combined image 1005 by combining the synthesized background image 1003 and the target image frame 1001 using a predetermined frame blending means 1020. For example, in one embodiment, the background synthesizing unit may calculate the center coordinates of the synthesized background image 1003 using the saliency center, and then align and combine the synthesized background image 1003 and the target image frame 1001 based on those center coordinates.
The frame blending means 1020 here is a means for blurring the edge between the synthesized background image 1003 and the target image frame 1001, and may include any existing frame blending technique such as Poisson blending, wavelet-based blending, or alpha blending.

Next, the background synthesizing unit generates a final image 1006 by converting the first combined image 1005 to the target aspect ratio using a predetermined size conversion means (for example, scaling).
According to the first background synthesizing means 1000 described above, an image whose aspect ratio has been changed appropriately can be obtained without distorting or deforming objects, while maintaining the quality and semantic information of the image.
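The median-fill steps above can be sketched with NumPy as follows. Several simplifications are assumptions made for illustration: the function name is hypothetical, the frame is pasted onto the canvas with a hard edge instead of Poisson, wavelet-based, or alpha blending, the alignment rule (placing the saliency center as close to the canvas center as the canvas allows) is one possible reading of the text, and the final scaling step is omitted.

```python
import numpy as np

def median_background_pad(frame, saliency_center, scale=1.2):
    """Embed `frame` (H x W x C, uint8) in a larger median-colored canvas.

    `saliency_center` is (row, col) in frame coordinates; `scale` is the
    size criterion (1.2 means 20% larger than the frame).
    """
    h, w, c = frame.shape
    # Per-channel median over all pixels of the frame.
    medians = np.median(frame.reshape(-1, c), axis=0).astype(frame.dtype)
    big_h, big_w = int(round(h * scale)), int(round(w * scale))
    canvas = np.empty((big_h, big_w, c), dtype=frame.dtype)
    canvas[:] = medians
    # Offset of the frame inside the canvas, clamped so the frame fits.
    top = int(np.clip(big_h // 2 - saliency_center[0], 0, big_h - h))
    left = int(np.clip(big_w // 2 - saliency_center[1], 0, big_w - w))
    canvas[top:top + h, left:left + w] = frame  # hard paste, no blending
    return canvas
```

The returned canvas corresponds roughly to the first combined image 1005 before blending; a resize to the target aspect ratio would follow.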

FIG. 11 is a diagram showing an example of the second background synthesizing means 1100 according to the embodiment of the present disclosure. The second background synthesizing means 1100 shown in FIG. 11 uses an encoder-decoder model to add to the target image frame a synthesized region that approximates the background region of the target image frame, thereby converting the target image frame to a target aspect ratio. It is executed by a background synthesizing unit (for example, the background synthesizing unit 325 shown in FIG. 3).

First, the background synthesizing unit trains encoder-decoder models 1110 and 1120 to generate a first synthesized region 1103 that approximates a target region 1102 in a training image 1101, thereby producing trained encoder-decoder models 1111 and 1121.
More specifically, in the training stage, the encoder model 1110 receives the target region 1102 in the training image 1101. The target region 1102 is a set of contiguous pixels along one or more directions and is preferably a background region of the training image 1101 excluding its region of interest. In one embodiment, the target region 1102 may include a plurality of ordered regions.

After receiving the target region 1102, the encoder model 1110 generates a latent representation that approximates the target region 1102 and outputs it to the decoder model 1120. Next, based on the received latent representation, the decoder model 1120 generates a first synthesized region 1103 representing the regions preceding and following the target region 1102. Training the encoder-decoder models 1110 and 1120 to generate a more accurate first synthesized region 1103 yields the trained encoder-decoder models 1111 and 1121.

Next, in the inference stage, the trained encoder model 1111 receives a background region 1105 in the target image frame 1104. Like the target region 1102 described above, the background region 1105 is a set of contiguous pixels along one or more directions and is preferably a background region of the target image frame 1104 excluding its region of interest. In one embodiment, the background region 1105 may include a plurality of ordered regions.

After receiving the background region 1105, the trained encoder model 1111 generates a latent representation that approximates the background region 1105 and outputs it to the trained decoder model 1121. Next, based on the received latent representation, the trained decoder model 1121 generates a second synthesized region 1106 representing the regions preceding and following the background region 1105.

Next, based on the region of interest and saliency center 1107 of the target image frame 1104 detected by the region of interest detection unit 322 described above, the background synthesizing unit inserts the second synthesized region 1106 into the target image frame 1104, thereby converting the target image frame 1104 to a predetermined aspect ratio and generating a final image 1108. Here, the background synthesizing unit may determine the position at which the second synthesized region 1106 is inserted into the target image frame 1104 based on the region of interest and saliency center 1107 of the target image frame 1104.

According to the second background synthesizing means 1100 described above, an image whose aspect ratio has been changed appropriately can be obtained without distorting or deforming objects, while maintaining the quality and semantic information of the image.

FIG. 12 is a diagram showing an example of the third background synthesizing means 1200 according to the embodiment of the present disclosure.
The third background synthesizing means 1200 shown in FIG. 12 uses a Gaussian process regression model to add to the periphery of the target image frame a synthesized region that approximates the background region of the target image frame, thereby converting the target image frame to a target aspect ratio. It is executed by a background synthesizing unit (for example, the background synthesizing unit 325 shown in FIG. 3).

First, the background synthesizing unit trains a Gaussian process regression model to generate a synthesized peripheral region (for example, a first synthesized peripheral region) that approximates a peripheral region located at the periphery of a training image, thereby producing a trained Gaussian process regression model 1210.

Next, the background synthesizing unit inputs a peripheral region 1202 located at the periphery of the target image frame 1201 into the trained Gaussian process regression model 1210. The Gaussian process regression model 1210 then generates a synthesized peripheral region (for example, a second synthesized peripheral region) that approximates the peripheral region 1202 (1220).
For example, the trained Gaussian process regression model 1210 may predict a plurality of synthesized peripheral regions and then select, from among them, one whose prediction error with respect to the peripheral region 1202 satisfies a predetermined criterion. The prediction error here may be, for example, the mean squared error of the pixel values.

The trained Gaussian process regression model 1210 goes around the target image frame 1201 once and generates a synthesized peripheral region approximating each region of the periphery. By arranging the generated synthesized peripheral regions to follow the shape of the target image frame 1201 so that the result is larger than the target image frame 1201, a synthesized peripheral image 1203 larger than the target image frame 1201 is obtained.

After that, based on the region of interest and saliency center 1204 detected by the region of interest detection unit 322 described above, the background synthesizing unit generates a second combined image 1205 by combining the synthesized peripheral image 1203 and the target image frame 1201 using a predetermined frame blending means 1230. For example, in one embodiment, the background synthesizing unit 325 may calculate the center coordinates of the target image frame 1201 using the saliency center, and then align and combine the synthesized peripheral image 1203 and the target image frame 1201 based on those center coordinates.
As described above, the frame blending means 1230 here is a means for blurring the edge between the synthesized peripheral image 1203 and the target image frame 1201, and may include any existing frame blending technique such as Poisson blending, wavelet-based blending, or alpha blending.

Next, the background synthesizing unit generates a final image 1206 by converting the second combined image 1205 to the target aspect ratio using a predetermined size conversion means (for example, scaling).
According to the third background synthesizing means 1200 described above, an image whose aspect ratio has been changed appropriately can be obtained without distorting or deforming objects, while maintaining the quality and semantic information of the image.

FIG. 13 is a diagram showing an example of the fourth background synthesizing means 1300 according to the embodiment of the present disclosure. The fourth background synthesizing means 1300 shown in FIG. 13 uses a Gaussian process regression model to generate patch seams in the target image frame, and converts the target image frame to a target aspect ratio by adding the generated patch seams to, or deleting them from, the target image frame. It is executed by a background synthesizing unit (for example, the background synthesizing unit 325 shown in FIG. 3).

First, the background synthesizing unit divides the target image frame 1301 into mutually non-overlapping regions along a predetermined direction (the vertical direction in the case of the target image frame 1301 shown in FIG. 13), and then uses a trained Gaussian process regression model 1310 to generate, for each region, a synthesized region that approximates it.
As an example, as shown in FIG. 13, the trained Gaussian process regression model 1310 may generate, as a patch seam 1302, a plurality of consecutive synthesized regions arranged so as to span the plurality of regions along the direction in which the target image frame 1301 was divided. Here, a patch seam is an n-connected set of pixels in an image that extends from the top edge to the bottom edge or from the left edge to the right edge. In one embodiment, the width of the patch seam 1302 may be determined according to the target aspect ratio.

Next, the background synthesizing unit can convert the target image frame 1301 to the target aspect ratio by deleting or adding patch seams 1302 in the target image frame 1301. For example, if the target aspect ratio is smaller than that of the target image frame 1301, the background synthesizing unit can delete the generated patch seam 1302 from the target image frame 1301 to generate a final image 1303 in which the target image frame 1301 has been reduced to the target aspect ratio.
Conversely, if the target aspect ratio is larger than that of the target image frame 1301, the background synthesizing unit can generate additional patches based on the generated patch seam 1302 and insert them into the target image frame 1301 to generate a final image 1304 in which the target image frame 1301 has been enlarged to the target aspect ratio.

According to the fourth background synthesizing means 1300 described above, an image whose aspect ratio has been changed appropriately can be obtained without distorting or deforming objects, while maintaining the quality and semantic information of the image.

FIG. 14 is a diagram for explaining a Gaussian process regression model 1400 according to the embodiment of the present disclosure. As described above, the background synthesizing means according to the embodiment of the present disclosure may use a so-called Gaussian process regression model 1400. The Gaussian process regression model 1400 here is a nonparametric, kernel-based probabilistic model that estimates a function from input variables to real-valued output variables. In other words, the Gaussian process regression model 1400 is a probabilistic model that can predict unobserved data points based on the similarity among a plurality of data points.

In the present disclosure, the Gaussian process regression model 1400 can take a background region in the target image frame as input and generate a synthesized background region that is highly similar to that background region. Adding the generated synthesized background region to the target image frame makes it possible to convert the target image frame to an arbitrary aspect ratio.

For example, as described above for the fourth background synthesizing means explained with reference to FIG. 13, the Gaussian process regression model 1400 can be used to generate patch seams of the target image frame. As described above, a patch seam is an n-connected set of pixels in an image that extends from the top edge to the bottom edge or from the left edge to the right edge. More specifically, in an image of height H and width W, a patch seam is a set of regions called patches P^r, each having a center point C(r) at coordinates (i, j).
Here, i lies in the range (1, W) and j lies in the range (1, H). Each patch has a height h and a width w.

In other words, a patch seam S is a set of patches indexed by r, where r lies in the range (1, N), and is defined by Equations 1 to 3 below.

Here, when r = 1, a pixel in the first row of the image is selected if the patch seam extends vertically, and a pixel in the first column of the image is selected if the patch seam extends horizontally. The height and width of a patch may be selected based on the height and width of the image, or based on the target aspect ratio.
In addition, N = H/h if the patch seam extends vertically, and N = W/w if the patch seam extends horizontally.

Gaussian process regression computes a probability distribution over the space of all admissible functions that fit the data. The procedure is described below.

First, assuming that a prior for the Gaussian process, such as a mean m(x), and a covariance function, such as k(x, x'), are defined in advance, the Gaussian process is given by Equations 4 and 5 below.

Here, y is the observed data, f_* is the unobserved data, X is the training data, X_* is the test data used for inference, K is the kernel function, μ is the mean of the observed data, and μ_* is the mean of the test data.
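For orientation, the standard textbook form of a Gaussian process prior and the corresponding joint distribution can be written with the symbols defined above as follows. This is a sketch of the general form only, an assumption and not a reproduction of Equations 4 and 5; the noise-free posterior mean in the last line is likewise the textbook result.

```latex
f(x) \sim \mathcal{GP}\bigl(m(x),\; k(x, x')\bigr)

\begin{bmatrix} y \\ f_* \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \mu \\ \mu_* \end{bmatrix},\;
\begin{bmatrix}
K(X, X) & K(X, X_*) \\
K(X_*, X) & K(X_*, X_*)
\end{bmatrix}
\right)

\mathbb{E}[f_* \mid y] = \mu_* + K(X_*, X)\, K(X, X)^{-1} (y - \mu)
```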

In the training stage, the parameters of the kernel function K are learned, and in the inference stage, so-called stochastic sampling is performed on the learned distribution.
As an example, a kernel function with parameters σ and M is given by Equation 6 below.

In the fourth background synthesizing means 1300 described above, X consists of pixel values in the target image frame, indexed as {1, 2, ..., N+k}.

Here, X represents the intensities of the pixel values indexed as {1, 2, ..., N-k}, X_* represents the intensities of the pixel values indexed as {N-k+1, N-k+2, ..., N}, y represents the intensities of the pixel values indexed as {k+1, k+2, ..., N}, f_* represents the intensities of the pixel values indexed as {N+1, N+2, ..., N+k}, and K represents a kernel function such as a radial basis function.
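The indexing scheme above pairs each pixel intensity with the intensity k positions later, so a model fitted on (X, y) can be queried with the last k observed intensities X_* to predict k new intensities f_*. The NumPy sketch below illustrates this on a one-dimensional pixel sequence. It rests on several assumptions that are not taken from the disclosure: a plain radial basis function kernel with hypothetical `sigma` and `length_scale` parameters (not the σ and M of Equation 6), a constant mean shared by μ and μ_*, a small noise term for numerical stability, intensities normalized to roughly [0, 1], and use of the posterior mean in place of stochastic sampling.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0, length_scale=1.0):
    """Radial basis function kernel between two 1-D arrays."""
    diff = a[:, None] - b[None, :]
    return sigma ** 2 * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_extrapolate(pixels, k, noise=1e-3, sigma=1.0, length_scale=1.0):
    """Predict k intensities beyond the end of a 1-D pixel sequence.

    Requires len(pixels) > k. Returns the GP posterior mean for f_*.
    """
    pixels = np.asarray(pixels, dtype=float)
    n = pixels.size
    X, y = pixels[: n - k], pixels[k:]   # indices 1..N-k and k+1..N
    X_star = pixels[n - k:]              # indices N-k+1..N
    mu = y.mean()                        # constant mean (assumption)
    K_xx = rbf_kernel(X, X, sigma, length_scale) + noise * np.eye(X.size)
    K_sx = rbf_kernel(X_star, X, sigma, length_scale)
    # Posterior mean: mu_* + K(X_*, X) [K(X, X) + noise I]^-1 (y - mu)
    return mu + K_sx @ np.linalg.solve(K_xx, y - mu)
```

Applied row by row (or column by column), the predicted values could serve as the pixels of an added patch; a learned kernel and proper sampling would replace the fixed parameters used here.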

In one embodiment, in the fourth background synthesizing means 1300 described above, the parameters of the kernel function K may be determined by a neural network using gradient descent. In another embodiment, the parameters of the kernel function K may be determined by a gradient-based method that maximizes the likelihood.

The covariance values K(X, X), K(X_*, X), K(X_*, X_*), and K(X, X_*) may be calculated by a neural network.

In the training stage, the mean squared error between the measured values and the predicted values may be used to tune the parameters of the Gaussian process regression model. In the inference stage, the learned parameters may be used to sample from the Gaussian process regression model.

In the inference stage, the learned parameters may also be used to estimate the saliency of the pixels indexed as {N+1, N+2, ..., N+k} based on the similarity between the measured values and the predicted values. The saliency estimated in this way may further be used to identify patch seams to be added to the target image frame or patch seams to be deleted from it.

As an example, as shown in FIG. 14, once the Gaussian process regression model 1400 according to the embodiment of the present disclosure has been trained on training data X taken from training images, it can receive test data X_* and output, as unobserved data f_*, a synthesized background region that approximates the test data X_*. Adding this unobserved data f_* to the target image frame converts the target image frame to the target aspect ratio.

FIG. 15 is a diagram showing an example of the fifth background synthesizing means 1500 according to the embodiment of the present disclosure.
The fifth background synthesizing means 1500 shown in FIG. 15 converts a target image frame to a target aspect ratio by adding background pixels having a constant pixel value to the target image frame, and is executed by a background synthesizing unit (for example, the background synthesizing unit 325 shown in FIG. 3).

First, the background synthesizing unit generates a background composite image 1502 by adding background pixels having a constant pixel value to the target image frame 1501. For example, as shown in FIG. 15, the background synthesizing unit may generate the background composite image 1502 by adding background pixels having a constant pixel value around the periphery of the target image frame 1501. The background synthesizing unit then generates a final image 1503 by converting the background composite image 1502 to the target aspect ratio using a predetermined size conversion means (for example, scaling).
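
The constant-value padding step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the frame is a NumPy array of shape H x W or H x W x C, padding is split evenly on both sides, and the function name `pad_to_aspect_ratio` and its parameters are hypothetical rather than taken from the original.

```python
import numpy as np

def pad_to_aspect_ratio(frame, target_ratio, fill_value=0):
    """Pad a frame with constant pixels so that width / height is about target_ratio."""
    h, w = frame.shape[:2]
    if w / h < target_ratio:
        # Frame is too narrow: add columns.
        new_h, new_w = h, int(round(h * target_ratio))
    else:
        # Frame is too wide (or already matches): add rows.
        new_h, new_w = int(round(w / target_ratio)), w
    top = (new_h - h) // 2
    left = (new_w - w) // 2
    # Canvas filled with the constant background value (an assumed choice).
    canvas = np.full((new_h, new_w) + frame.shape[2:], fill_value, dtype=frame.dtype)
    canvas[top:top + h, left:left + w] = frame
    return canvas
```

The padded result could then be scaled to the input size of the downstream network with any image resizing routine; that step is omitted here.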

In one embodiment, the final image 1503 generated here can be used to generate training images for training a given neural network. For example, a perturbed region 1504 is generated in the final image 1503 by perturbing the pixel values of the added background pixels. The trained neural network 330 can then generate an adversarial training image 1505 based on the final image 1503 containing the perturbed region 1504 and on the region of interest and/or saliency center of the target image frame 1501 detected by the region-of-interest detection unit 322 described above.
In one embodiment, the trained neural network 330 may be generated using a projected gradient descent technique. The input image may also be processed under a predetermined constraint requiring pixel values to lie within the range 0 to 255.
Because the adversarial training image 1505 contains a perturbed region, objects in it are harder to detect than in ordinary images, which makes it useful for training a neural network for object detection.
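
To make the idea of perturbing only the added background pixels under a 0 to 255 constraint concrete, the following is a hedged sketch of a projected-gradient update restricted to a mask. It is not the procedure of the original: the loss gradient is supplied by a hypothetical callable `grad_fn`, the perturbation is bounded in the L-infinity norm by an assumed budget `epsilon`, and `mask` is assumed to have the same shape as the image.

```python
import numpy as np

def pgd_perturb_background(image, mask, grad_fn, epsilon=8.0, step=2.0, iters=10):
    """Projected gradient ascent applied only to pixels selected by `mask`.

    image:   array with values in [0, 255]
    mask:    boolean array, True where background pixels were added
    grad_fn: hypothetical callable returning d(loss)/d(image) for a given image
    """
    original = image.astype(np.float64)
    adv = original.copy()
    for _ in range(iters):
        grad = grad_fn(adv)
        # Move in the direction that increases the loss, inside the mask only.
        adv = adv + step * np.sign(grad) * mask
        # Project back onto the epsilon-ball around the original image.
        adv = np.clip(adv, original - epsilon, original + epsilon)
        # Enforce the valid pixel range.
        adv = np.clip(adv, 0.0, 255.0)
    return adv
```

In practice `grad_fn` would come from automatic differentiation of a detector's loss; here it is left abstract so the sketch stays self-contained.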

Thus, the fifth background synthesizing means 1500 described above yields an image whose aspect ratio has been changed appropriately, without distorting or deforming objects and while maintaining the quality and semantic information of the image.

FIG. 16 is a diagram showing an example of the sixth background synthesizing means 1600 according to an embodiment of the present disclosure.
The sixth background synthesizing means 1600 shown in FIG. 16 converts the target image frame to the target aspect ratio by duplicating background pixels present on one side of the target image frame onto the opposite side, and is executed by a background synthesizing unit (for example, the background synthesizing unit 325 shown in FIG. 3).

First, the region-of-interest detection unit 322 described above receives and processes the target image frame 1601 to detect the region of interest and/or saliency center 1602 in the target image frame 1601. Then, based on the region of interest and/or saliency center 1602 detected by the region-of-interest detection unit 322, the background synthesizing unit duplicates the background pixels 1603 present on one side of the target image frame 1601 onto the opposite side of the target image frame 1601 as a synthetic background region 1604, thereby generating a final image 1605 in which the target image frame has been converted to the target aspect ratio. The size of the background pixel area duplicated onto the opposite side of the target image frame 1601 may be determined based on, for example, the region of interest and/or saliency center 1602.
Although the case where background pixels on the left side of the target image frame 1601 are duplicated onto the right side has been described here as an example, the present disclosure is not limited to this. For example, background pixels on the right side of the target image frame 1601 may be duplicated onto the left side, or background pixels in the upper part of the target image frame 1601 may be duplicated onto the lower part.

More specifically, the background synthesizing unit may duplicate the background pixels according to Equation 7 below.

Here, i_offset is the difference between i_specified, which specifies the horizontal coordinate at which the synthetic background region is placed, and i_actual, which specifies the horizontal coordinate of the background pixels; j_offset is the difference between j_specified, which specifies the vertical coordinate at which the synthetic background region is placed, and j_actual, which specifies the vertical coordinate of the background pixels. I(i, j) denotes the intensity of the pixel value at coordinates (i, j), H denotes the height of the target image frame, and W denotes the width of the target image frame.
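
The offset-based copy described by these definitions can be sketched as below. This is an assumption-laden illustration rather than a reproduction of Equation 7: it handles only a horizontal copy from the left side to the right side (so j_offset is zero), and the function name `duplicate_background` and its parameters are hypothetical.

```python
import numpy as np

def duplicate_background(frame, src_cols, extra_cols):
    """Append a copy of left-side background columns to the right side of `frame`.

    src_cols:   (start, stop) column range of the background pixels to copy
    extra_cols: number of columns to add on the right (at most stop - start)
    """
    h, w = frame.shape[:2]
    start, stop = src_cols
    if extra_cols > stop - start:
        raise ValueError("not enough background columns to copy")
    out = np.empty((h, w + extra_cols) + frame.shape[2:], dtype=frame.dtype)
    out[:, :w] = frame
    # i_offset = i_specified - i_actual, identical for every copied column here.
    i_offset = w - start
    for i_actual in range(start, start + extra_cols):
        out[:, i_actual + i_offset] = frame[:, i_actual]
    return out
```

For example, `duplicate_background(frame, (0, 2), 2)` widens a frame by two columns that repeat its two leftmost columns.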

Thus, the sixth background synthesizing means 1600 described above yields an image whose aspect ratio has been changed appropriately, without distorting or deforming objects and while maintaining the quality and semantic information of the image.

FIG. 17 is a diagram showing an example of the seventh background synthesizing means 1700 according to an embodiment of the present disclosure.
The seventh background synthesizing means 1700 shown in FIG. 17 converts the target image frame to the target aspect ratio by reflecting (mirroring) background pixels present in a specific region of the target image frame into a destination region, and is executed by a background synthesizing unit (for example, the background synthesizing unit 325 shown in FIG. 3).

First, the region-of-interest detection unit 322 described above receives and processes the target image frame 1701 to detect the region of interest and/or saliency center 1702 in the target image frame 1701. Then, based on the region of interest and/or saliency center 1702 detected by the region-of-interest detection unit 322, the background synthesizing unit reflects the background pixels 1703 present in a specific region of the target image frame 1701 into a destination region of the target image frame 1701 as a synthetic background region 1704, thereby generating a final image 1705 in which the target image frame 1701 has been converted to the target aspect ratio. The size of the background pixel area to be reflected may be determined based on, for example, the region of interest and/or saliency center 1702. The background pixels 1703 and the synthetic background region 1704 may be selected by the user, or may be selected by the background synthesizing unit based on the results of past background synthesis.
Although the case where background pixels on the left side of the target image frame 1701 are reflected immediately next to them has been described here as an example, the present disclosure is not limited to this, and the background pixels may be reflected at any position.

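More specifically, the background synthesizing unit may reflect the background pixels according to Equation 8 below.

Here, i_offset is the difference between i_specified, which specifies the horizontal coordinate at which the synthetic background region is placed, and i_actual, which specifies the horizontal coordinate of the background pixels; j_offset is the difference between j_specified, which specifies the vertical coordinate at which the synthetic background region is placed, and j_actual, which specifies the vertical coordinate of the background pixels. I(i, j) denotes the intensity of the pixel value at coordinates (i, j), H denotes the height of the target image frame, and W denotes the width of the target image frame.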
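
A reflection of left-side background pixels placed immediately next to them can be sketched as follows. This is only one possible reading under stated assumptions (a horizontal mirror of the leftmost columns attached to the left edge) and does not reproduce Equation 8; the function name `reflect_left_background` and the parameter `strip_width` are hypothetical.

```python
import numpy as np

def reflect_left_background(frame, strip_width):
    """Mirror the leftmost `strip_width` columns and attach the mirror to the left edge."""
    if strip_width > frame.shape[1]:
        raise ValueError("strip_width exceeds the frame width")
    strip = frame[:, :strip_width]
    mirrored = strip[:, ::-1]  # flip the strip horizontally
    return np.concatenate([mirrored, frame], axis=1)
```

Because the mirrored strip meets the original at the frame edge, pixel values are continuous across the join, which is the usual motivation for reflecting rather than copying.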

In one embodiment, the background synthesizing unit may calculate the similarity between the background pixels 1703 and the synthetic background region 1704. For example, as the measure of similarity, the background synthesizing unit may calculate the mean squared error between the background pixels 1703 and the synthetic background region 1704, may calculate the divergence in pixel-value intensity between them, or may use a distance measure such as Euclidean distance. In one embodiment, the background synthesizing unit may then be trained to maximize the similarity between the background pixels 1703 and the synthetic background region 1704 (that is, to minimize such an error or distance).
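
The two measures named above can be computed as in the following minimal sketch, which assumes both regions are arrays of the same shape. The function names are illustrative; lower values of either measure indicate higher similarity.

```python
import numpy as np

def mean_squared_error(a, b):
    """Mean squared difference between two equally shaped pixel regions."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equally shaped pixel regions."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.sqrt(np.sum((a - b) ** 2)))
```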

Thus, the seventh background synthesizing means 1700 described above yields an image whose aspect ratio has been changed appropriately, without distorting or deforming objects and while maintaining the quality and semantic information of the image.

FIG. 18 is a diagram showing an example of the eighth background synthesizing means 1800 according to an embodiment of the present disclosure.
The eighth background synthesizing means 1800 shown in FIG. 18 uses an affine transformation suitability determination model to determine the affine transformation suitability of each region in the target image frame, and then enlarges each region based on its suitability, thereby converting the target image frame to the target aspect ratio. It is executed by a background synthesizing unit (for example, the background synthesizing unit 325 shown in FIG. 3).

First, the background synthesizing unit generates a grid image 1802 by dividing the target image frame 1801 into non-overlapping blocks of equal size. The background synthesizing unit then inputs the grid image 1802 to an affine transformation suitability determination model 1810.
The affine transformation suitability determination model 1810 determines, for each block included in the grid image 1802, the suitability of that block for affine transformation. An affine transformation is a transformation that includes translation (moving all points a fixed distance in a fixed direction) and linear transformation (scaling, shearing, and rotation).
Suitability for affine transformation here indicates the probability that no disturbance such as distortion or deformation occurs in a given region when an affine transformation is applied to it. In other words, a region with high suitability is unlikely to be deformed even when an affine transformation is applied, whereas a region with low suitability is easily deformed when an affine transformation is applied.
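
For reference, the definition of an affine transformation given above corresponds to the standard textbook form below for a 2-D point. The symbols A, t, s_x, and s_y are illustrative and do not come from the original, and this is not a reconstruction of the document's Equation 9.

```latex
\[
\mathbf{x}' = A\,\mathbf{x} + \mathbf{t},
\qquad
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},
\qquad
\mathbf{t} = \begin{pmatrix} t_x \\ t_y \end{pmatrix}
\]
% Pure scaling (enlargement or reduction) is the special case
\[
A = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix},
\qquad
\mathbf{t} = \mathbf{0}
\]
```

Here the linear part A covers scaling, shearing, and rotation, and t is the translation.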

By analyzing the grid image 1802 using the affine transformation suitability determination model 1810, the suitability of each block included in the grid image 1802 for affine transformation can be determined. The affine transformation suitability determination model 1810 then calculates, based on the determined suitability of each block, affine transformation parameters 1820 indicating the magnification of each block. The affine transformation parameters 1820 may be expressed, for example, as the matrix shown in Equation 9.

Here, the elements of the matrix (a11, a12, a21, a22) are the affine transformation parameters of different blocks.

Next, the background synthesizing unit enlarges (or reduces) each block in the grid image 1802 according to the magnification specified in the affine transformation parameters 1820 determined by the affine transformation suitability determination model 1810, thereby generating a final image 1803 in which the target image frame 1801 has been converted to the target aspect ratio.
Because the final image 1803 is generated according to the magnifications specified in the affine transformation parameters 1820, blocks with high suitability for affine transformation (that is, blocks that are unlikely to be deformed) are enlarged more.
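
The idea of enlarging more suitable blocks by a larger factor can be illustrated with the simplified sketch below. It makes several assumptions that are not in the original: blocks are vertical strips rather than a two-dimensional grid, only the width changes, the suitability scores are given (non-negative, not all zero) instead of being produced by a model, and nearest-neighbour resampling is used. The name `stretch_columns` and its parameters are hypothetical.

```python
import numpy as np

def stretch_columns(frame, n_blocks, suitability, target_width):
    """Widen `frame` to `target_width`, giving more extra width to more suitable blocks."""
    h, w = frame.shape[:2]
    edges = np.linspace(0, w, n_blocks + 1).astype(int)
    widths = np.diff(edges)
    weights = np.asarray(suitability, dtype=np.float64)
    weights = weights / weights.sum()
    extra = target_width - w  # assumed to be non-negative
    # Share the extra columns by weight; rounding leftovers go to the best block.
    added = np.floor(weights * extra).astype(int)
    added[np.argmax(weights)] += extra - added.sum()
    pieces = []
    for k in range(n_blocks):
        block = frame[:, edges[k]:edges[k + 1]]
        new_w = widths[k] + added[k]
        # Nearest-neighbour resampling of this block's columns.
        idx = (np.arange(new_w) * widths[k] / new_w).astype(int)
        idx = np.minimum(idx, widths[k] - 1)
        pieces.append(block[:, idx])
    return np.concatenate(pieces, axis=1)
```

With suitability scores of 0 and 1 for two strips, the first strip is left unchanged and the second absorbs all of the added width.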

Thus, the eighth background synthesizing means 1800 described above yields an image whose aspect ratio has been changed appropriately, without distorting or deforming objects and while maintaining the quality and semantic information of the image.

In one embodiment, after the final image generated by the background synthesizing means described above and converted to the target aspect ratio has been analyzed by the neural network, the background synthesizing unit uses a predetermined influence function to calculate the degree of influence that a class in the final image had on the analysis result. The image frame identification unit then identifies images containing a class that satisfies a predetermined influence criterion as target image frames, and does not identify images that contain no such class as target image frames (that is, it excludes them from training of, or inference by, the neural network).

The influence function here is a function used in machine learning to calculate the degree of influence that an individual training data item had on inference (by a trained model). As an example, the influence function of a data point z (for example, a class) for which the model parameter θ is included in Θ may be defined as in Equation 10 below and FIG. 11.

Here, L(z_i, θ) is the loss function and δ is a perturbation of z.
Thus, using the influence function makes it possible to efficiently select images containing classes that have a high degree of influence on the neural network's analysis results, and to exclude images containing classes that have a low degree of influence. In this way, the training efficiency and detection accuracy of the neural network can be improved.
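
For orientation, one formulation of influence functions that is widely used in the machine-learning literature is shown below, written with the same symbols L(z_i, θ), Θ, and δ. It is offered as an assumption about the general form only; the definitions in the original (Equation 10 and the following one) are not reproduced here and may differ.

```latex
% Empirical risk minimizer over n training points z_1, ..., z_n
\[
\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta})
\]
% Influence of upweighting a training point z on the parameters
\[
\mathcal{I}_{\mathrm{up,params}}(z) = -H_{\hat{\theta}}^{-1} \, \nabla_{\theta} L(z, \hat{\theta})
\]
% Influence of replacing z by the perturbed point z_\delta = z + \delta
\[
\mathcal{I}_{\mathrm{pert,params}}(z, \delta)
= -H_{\hat{\theta}}^{-1} \left( \nabla_{\theta} L(z_{\delta}, \hat{\theta}) - \nabla_{\theta} L(z, \hat{\theta}) \right)
\]
```

Under this formulation, a data point whose influence has a large magnitude changes the learned parameters, and hence the analysis result, more strongly than a point whose influence is close to zero.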

According to the domain conversion means of the embodiments of the present disclosure described above, applying a background synthesizing means that takes the semantic content of an image into account makes it possible to change the aspect ratio of the image appropriately while maintaining its quality and semantic information, thereby facilitating highly accurate object detection. The domain conversion means according to the embodiments of the present disclosure may also be applied to a variety of fields and problems, for example: object detection in images, object classification, object key-point detection, object tracking, image segmentation, processing or preprocessing of images input to a neural network, robot operation, adjustment of images acquired from devices with different resolutions, image thumbnail generation, and parameter adjustment for performance benchmarks.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and various modifications are possible without departing from the gist of the present invention.

300 Domain conversion system
301 Image acquisition device
302 Storage unit
303 Domain conversion device
310 Preprocessing unit
320 Domain conversion unit
321 Image frame identification unit
322 Region-of-interest detection unit
323 Image frame processing unit
324 Background synthesizing means determination unit
325 Background synthesizing unit
330 Neural network
340 Output unit
304 Client terminal

Claims

A domain conversion device comprising:
an image frame identification unit that identifies, from an image sequence, a target image frame to be subjected to domain conversion;
a region-of-interest detection unit that detects a region of interest in the target image frame and calculates a reliability of a processing operation for extracting, from the target image frame, a region-of-interest image including the region of interest;
an image frame processing unit that extracts the region-of-interest image from the target image frame using the processing operation if the reliability of the processing operation satisfies a predetermined reliability criterion;
a background synthesizing means determination unit that, if the reliability of the processing operation does not satisfy the predetermined reliability criterion or if the region-of-interest image extracted from the target image frame does not satisfy a predetermined aspect ratio criterion, determines, from a plurality of background synthesizing means candidates, a background synthesizing means for converting the target image frame to a predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame; and
a background synthesizing unit that generates a final image in which the target image frame is converted to the predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame using the background synthesizing means.

The domain conversion device according to claim 1, wherein the background synthesizing unit,
as a first background synthesizing means included in the plurality of background synthesizing means candidates:
identifies a saliency center in the target image frame;
generates channel images by decomposing the target image frame into the channels constituting the target image frame;
calculates the median value of the pixels of each channel image;
generates a synthetic background image that has the same number of channels as the target image frame, consists of pixels having the median values of the channel images, and satisfies a predetermined size criterion relative to the size of the target image frame; and
generates the final image converted to the predetermined aspect ratio by combining the synthetic background image and the target image frame using a predetermined frame blending means, based on the region of interest and the saliency center of the target image frame.

The domain conversion device according to claim 1, wherein the background synthesizing unit,
as a second background synthesizing means included in the plurality of background synthesizing means candidates:
generates a trained encoder-decoder model by training an encoder-decoder model that generates a first synthetic region approximating a region of interest in a training image;
identifies a saliency center in the target image frame;
generates, using the trained encoder-decoder model, a second synthetic region approximating a background region in the target image frame; and
generates the final image, in which the target image frame is converted to the predetermined aspect ratio, by inserting the second synthetic region into the target image frame based on the region of interest and the saliency center of the target image frame.

The domain conversion device according to claim 1, wherein the background synthesizing unit,
as a third background synthesizing means included in the plurality of background synthesizing means candidates:
generates a trained Gaussian process regression model by training a Gaussian process regression model that generates a first synthetic peripheral region approximating a peripheral region present at the periphery of a training image;
identifies a saliency center in the target image frame;
generates, using the trained Gaussian process regression model, a second synthetic peripheral region approximating a peripheral region present at the periphery of the target image frame; and
generates the final image converted to the predetermined aspect ratio by combining the second synthetic peripheral region and the target image frame using a predetermined frame blending means, based on the region of interest and the saliency center of the target image frame.

The domain conversion device according to claim 1, wherein the background synthesizing unit,
as a fourth background synthesizing means included in the plurality of background synthesizing means candidates:
divides the target image frame into a first image portion and a second image portion in a region other than the region of interest;
generates, using a Gaussian process regression model, a patch seam that overlaps both the first image portion and the second image portion;
generates a processed first image portion and a processed second image portion by removing, from the first image portion and the second image portion, the regions overlapped by the patch seam; and
generates the final image, in which the target image frame is converted to the predetermined aspect ratio, by combining the processed first image portion and the processed second image portion.

The domain conversion device according to claim 1, wherein the background synthesizing unit,
as a fifth background synthesizing means included in the plurality of background synthesizing means candidates,
generates the final image, in which the target image frame is converted to the predetermined aspect ratio, by adding background pixels having a constant pixel value to the target image frame.

The domain conversion device according to claim 1,
further comprising a neural network that performs a predetermined analysis process on the final image and generates an analysis result.

The domain conversion device according to claim 7, wherein
the background synthesizing unit calculates, using a predetermined influence function, a degree of influence of a class in the final image on the analysis result, and
the image frame identification unit identifies, as the target image frame, an image containing a class that satisfies a predetermined influence criterion.

A domain conversion method comprising:
identifying, from an image sequence, a target image frame to be subjected to domain conversion;
detecting a region of interest in the target image frame and calculating a reliability of a processing operation for extracting, from the target image frame, a region-of-interest image including the region of interest;
extracting the region-of-interest image from the target image frame using the processing operation if the reliability of the processing operation satisfies a predetermined reliability criterion;
determining, from a plurality of background synthesizing means candidates, a background synthesizing means for converting the target image frame to a predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame, if the reliability of the processing operation does not satisfy the predetermined reliability criterion or if the region-of-interest image extracted from the target image frame does not satisfy a predetermined aspect ratio criterion; and
converting the target image frame to the predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame using the background synthesizing means.

A domain conversion system comprising:
an image acquisition device that acquires an image sequence;
a domain conversion device that performs domain conversion on an image; and
a client terminal,
wherein the image acquisition device, the domain conversion device, and the client terminal are connected via a communication network, and
the domain conversion device includes:
an image frame identification unit that receives the image sequence from the image acquisition device and identifies, from the image sequence, a target image frame to be subjected to domain conversion;
a region-of-interest detection unit that detects a region of interest in the target image frame and calculates a reliability of a processing operation for extracting, from the target image frame, a region-of-interest image including the region of interest;
an image frame processing unit that extracts the region-of-interest image from the target image frame using the processing operation if the reliability of the processing operation satisfies a predetermined reliability criterion;
a background synthesizing means determination unit that, if the reliability of the processing operation does not satisfy the predetermined reliability criterion or if the region-of-interest image extracted from the target image frame does not satisfy a predetermined aspect ratio criterion, determines, from a plurality of background synthesizing means candidates, a background synthesizing means for converting the target image frame to a predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame;
a background synthesizing unit that generates a final image in which the target image frame is converted to the predetermined aspect ratio by adding background pixels to, or deleting background pixels from, the target image frame using the background synthesizing means;
a neural network that performs a predetermined analysis process on the final image and generates an analysis result; and
an output unit that transmits the analysis result from the neural network to the client terminal.