JP7448721B2 - Imaging device and video processing system

Info

Publication number: JP7448721B2
Application number: JP2023504880A
Authority: JP
Inventors: 嵩臣神田
Original assignee: Hitachi Kokusai Electric Inc
Current assignee: Hitachi Kokusai Electric Inc
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2024-03-12
Anticipated expiration: 2041-03-08
Also published as: JPWO2022190157A1; WO2022190157A1

Description

The present invention relates to an imaging device and a video processing system, and more particularly to an imaging device and a video processing system that support inference processing by machine learning and have a video processing function for protecting privacy.

In recent years, demand has grown for cameras, such as surveillance cameras, that capture large numbers of people. These cameras are connected to a LAN (Local Area Network), which has the advantage of allowing remote video monitoring. On the other hand, if security is breached, the captured information may leak, which can be a problem from the standpoint of privacy protection.

To address this, Patent Document 1 discloses a method of protecting privacy by applying reversible processing, such as mosaic processing or mask processing, to captured images. The original image can be recovered from a processed image by performing the corresponding restoration processing.

Patent Document 1: Japanese Patent Application Publication No. 2009-33738

With the method of Patent Document 1, if the processed images leak outside together with the restoration information needed for restoration processing, a malicious third party can perform the restoration and obtain the original images. Preventing this requires distributing irreversibly processed images over the LAN, but in that case the original images cannot be restored. As a result, face recognition and action recognition using image recognition technology can no longer be performed.

In view of the above problems, an object of the present invention is to provide an imaging device and a video processing system that can convey predetermined information about an image while protecting the image information more strongly.

To achieve the above object, a representative imaging device of the present invention captures video to acquire an image, detects a predetermined region in the image, resizes the detected region and extracts feature quantities of that region, and outputs an image in which the extracted feature quantities, arranged two-dimensionally as a mask image, are placed in the detected region of the acquired image.

Furthermore, a video processing system of the present invention includes an imaging device and a video processing device. The imaging device captures video to acquire an image, detects a predetermined region in the image, resizes the detected region and extracts feature quantities of that region, and outputs an image in which the extracted feature quantities, arranged two-dimensionally as a mask image, are placed in the detected region of the acquired image. The video processing device receives the image output by the imaging device, obtains the feature quantities from the mask image, and performs inference processing based on those feature quantities.

According to the present invention, an imaging device and a video processing system can convey predetermined information about an image while protecting the image information more strongly.
Problems, configurations, and effects other than those described above will become clear from the following embodiments.

FIG. 1 is a block diagram showing an embodiment of the video processing system of the present invention.
FIG. 2 is a block diagram showing an example of the processing system section of FIG. 1.
FIG. 3 is a diagram showing an example of the process for calculating the feature quantities used in the video processing system of the present invention.
FIG. 4 is a diagram showing an example of the processing of the imaging device in the video processing system of the present invention.
FIG. 5 is a diagram showing an example of the processing of the video processing device in the video processing system of the present invention.

A mode for carrying out the present invention will now be described.

FIG. 1 is a block diagram showing an embodiment of the video processing system of the present invention. The video processing system in FIG. 1 includes an imaging device 1 and a video processing device 5. The imaging device 1 includes an imaging section 2 and a processing system section 3. The video processing device 5 includes a processing system section 6 and a display output section 7. The display output section 7 need not be part of the video processing device 5 and may be configured as a separate unit. A personal computer, a tablet computer, a server, or the like can be used as the video processing device 5.

The imaging device 1 includes one or more cameras and can be placed in various locations, for example as a surveillance camera at a location to be monitored.

The imaging section 2 is a camera arrangement that obtains information by focusing incident light onto an image sensor through a lens and an aperture. Examples of the image sensor include a CCD (Charge-Coupled Device) image sensor and a CMOS (Complementary Metal Oxide Semiconductor) image sensor. The obtained information is sent to the processing system section 3. The imaging section 2 can perform capture processing using a video processing IC (Integrated Circuit) such as an FPGA (Field Programmable Gate Array). This video processing IC may instead be integrated with the processing system section 3.

The processing system section 3 acquires the information captured by the imaging section 2 and performs the processing shown in FIG. 4. A specific configuration example is described later with reference to FIG. 2, and the specific processing is described later with reference to FIG. 4. The processed information is sent to the processing system section 6.

The processing system section 6 acquires the information from the processing system section 3 and performs the processing shown in FIG. 5. A specific configuration example is described later with reference to FIG. 2, and the specific processing is described later with reference to FIG. 5.

The display output section 7 is a device that can display the content processed by the processing system section 6, for example a liquid crystal display (LCD), an organic EL (OEL) display, or a touch panel.

The imaging device 1 and the video processing device 5 can exchange information via the Internet or a similar network, for example by being connected to a LAN. Information may also be exchanged via a dedicated communication line. In this way, the processing results of an imaging device 1 at a remote location can be checked on the video processing device 5. The imaging device 1 and the video processing device 5 need not correspond one-to-one: a plurality of video processing devices 5 may serve one imaging device 1, and one video processing device 5 may serve a plurality of imaging devices 1. The video processing device 5 may also be configured so that it can set up and operate the imaging device 1.

FIG. 2 is a block diagram showing an example of the processing system section of FIG. 1. The computer system 300 of FIG. 2 is described as a specific example of the processing system sections 3 and 6.

The main components of computer system 300 include one or more processors 302, memory 304, terminal interface 312, storage interface 314, I/O (input/output) device interface 316, and network interface 318. These components may be interconnected via memory bus 306, I/O bus 308, bus interface 309, and I/O bus interface 310.

Computer system 300 may include one or more processing devices 302A and 302B, collectively referred to as processor 302. Each processor 302 executes instructions stored in memory 304 and may include an onboard cache. A CPU (Central Processing Unit), an FPGA (Field-Programmable Gate Array), a GPU (Graphics Processing Unit), or the like can be used as the processing device.

Memory 304 may include random access semiconductor memory, storage devices, or storage media (either volatile or nonvolatile) for storing data and programs. Memory 304 represents the entire virtual memory of computer system 300 and may include the virtual memory of other computer systems connected to computer system 300 via a network. Although memory 304 may be regarded conceptually as a single entity, it may have a more complex arrangement, such as a hierarchy of caches and other memory devices.

Memory 304 may store all or part of the programs, modules, and data structures that implement the functions described in this embodiment. For example, memory 304 may store application 350. Application 350 may include instructions or statements that execute the functions described below on processor 302, or instructions or statements that are interpreted by other instructions or statements. Application 350 may be implemented in hardware via semiconductor devices, chips, logic gates, circuits, circuit cards, and/or other physical hardware devices instead of, or in addition to, a processor-based system. Application 350 may include data other than instructions or statements. Other data input devices, such as cameras and sensors, may also be provided so as to communicate directly with bus interface 309, processor 302, or other hardware of computer system 300.

Computer system 300 may include a bus interface 309 that handles communication among processor 302, memory 304, display system 324, and I/O bus interface 310. I/O bus interface 310 may be coupled to I/O bus 308 for transferring data to and from various I/O units. I/O bus interface 310 may communicate, via I/O bus 308, with the plurality of I/O interfaces 312, 314, 316, and 318, which are also known as I/O processors (IOPs) or I/O adapters (IOAs). Display system 324 may include a display controller, display memory, or both. The display controller can provide video data, audio data, or both to display device 326. Computer system 300 may also include devices, such as one or more sensors, configured to collect data and provide that data to processor 302. Display system 324 may be connected to a display device 326 such as a standalone display screen, a television, a tablet, or a handheld device. Display device 326 may include a speaker for rendering audio. Alternatively, a speaker for rendering audio may be connected to an I/O interface. The functions provided by display system 324 may also be implemented by an integrated circuit that includes processor 302. Similarly, the functions provided by bus interface 309 may be implemented by an integrated circuit that includes processor 302.

The I/O interfaces provide the ability to communicate with various storage or I/O devices. For example, terminal interface 312 allows the attachment of user I/O devices 320, which include user output devices such as video display devices and televisions with speakers, and user input devices such as a keyboard, mouse, keypad, touch pad, trackball, buttons, light pen, or other pointing device. By operating a user input device through the user interface, a user may enter input data and instructions into user I/O device 320 and computer system 300 and receive output data from computer system 300. The user interface may, for example, be displayed on a display device or played through a speaker via user I/O device 320.

Storage interface 314 allows the attachment of one or more disk drives or direct access storage devices 322. Storage device 322 may be implemented as any secondary storage device. The contents of memory 304 may be stored in storage device 322 and read from it as needed. I/O device interface 316 may provide an interface to other I/O devices. Network interface 318 may provide a communication path, such as network 330, so that computer system 300 and other devices can communicate with each other.

Although computer system 300 is described with a bus structure that provides a direct communication path among processor 302, memory 304, bus interface 309, display system 324, and I/O bus interface 310, it may instead include point-to-point links in hierarchical, star, or web configurations, multiple hierarchical buses, or parallel or redundant communication paths. Furthermore, although I/O bus interface 310 and I/O bus 308 are each shown as a single unit, computer system 300 may in practice include multiple I/O bus interfaces 310 or multiple I/O buses 308. Also, although multiple I/O interfaces are shown that separate I/O bus 308 from the various communication paths leading to the various I/O devices, some or all of the I/O devices may instead be connected directly to a single system I/O bus.

Computer system 300 may be a multi-user mainframe computer system, a single-user system, or a device such as a server computer that has no direct user interface and receives requests from other computer systems (clients).

When the computer system 300 of FIG. 2 is used as the processing system section 3 of FIG. 1, the display device 326 is optional and may be omitted. The imaging section 2 can be treated as a user I/O device 320. When the computer system 300 of FIG. 2 is used as the processing system section 6 of FIG. 1, the display device 326 can serve as the display output section 7. The network 330 can serve as the network between the processing system section 3 and the processing system section 6.

FIG. 3 is a diagram showing an example of the process for calculating the feature quantities used in the video processing system of the present invention. FIG. 3 shows an example machine learning configuration that uses a CNN (Convolutional Neural Network) to estimate a person from a face image. The number shown above each layer is the number of neurons in that layer; these values are examples.

A portion of a given image is fed into the input layer 11, passed through the first-stage convolution layer 12 and pooling layer 13, and then on to the second-stage convolution layer 14 and pooling layer 15. These stages are followed by the fully connected layers, which consist of an input layer 16, an intermediate layer 17, and an output layer 18. The number of neurons in the output layer 18 equals the number of classes; for face recognition, this is approximately the number of people who can be identified. When a portion of an image is fed into the input layer 11, it is first resized, for example from 200×200 to 64×64.

The input layer 11 receives image information of a specific size (64×64 pixels in FIG. 3). In the example of FIG. 3, this is an image of a person's face captured by face detection.

Next, the convolution layer 12 performs convolution processing, applying filters to the image received by the input layer 11. Filtering reduces the size (to 60×60 in FIG. 3). One output is produced for each prepared filter (eight in FIG. 3).

Next, the pooling layer 13 performs pooling processing, compressing the information output by the convolution layer 12. This halves the size (to 30×30 in FIG. 3).

Next, the convolution layer 14 performs convolution processing. The information compressed by the pooling layer 13 is filtered again, further reducing the size (to 26×26 in FIG. 3). One output is produced for each prepared filter (16 in FIG. 3).

Next, the pooling layer 15 performs pooling processing, compressing the information output by the convolution layer 14. This halves the size (to 13×13 in FIG. 3).

Next, the input layer 16 of the fully connected layers holds the three-dimensional output of the pooling layer 15 (13×13×16) rearranged into one-dimensional information (2704 values). This information represents the feature quantities. Although FIG. 3 shows the convolution and pooling layers repeated twice (two stages), the configuration is not limited to this, and more repetitions may be used.
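The layer sizes quoted for FIG. 3 can be checked with a short calculation. The sketch below is illustrative only: the 5×5 filter size, stride 1, no padding, and 2×2 pooling are assumptions chosen because they reproduce the stated sizes (64 → 60 → 30 → 26 → 13); the patent does not state the filter size, and the function names are hypothetical.

```python
from typing import List, Tuple

KERNEL = 5       # assumption: 5x5 filters, which give 64 -> 60 and 30 -> 26
POOL_WINDOW = 2  # assumption: 2x2 pooling, which halves each side


def conv_out(side: int, kernel: int = KERNEL) -> int:
    """Side length after a stride-1 convolution without padding."""
    return side - kernel + 1


def pool_out(side: int, window: int = POOL_WINDOW) -> int:
    """Side length after non-overlapping pooling."""
    return side // window


def trace_sizes(input_side: int = 64, last_filters: int = 16) -> Tuple[List[int], int]:
    """Return the side length after each stage and the flattened feature count."""
    sides = [input_side]
    for stage in (conv_out, pool_out, conv_out, pool_out):
        sides.append(stage(sides[-1]))
    feature_count = sides[-1] * sides[-1] * last_filters
    return sides, feature_count


sides, feature_count = trace_sizes()
print(sides)          # [64, 60, 30, 26, 13]
print(feature_count)  # 2704, which is also 52 * 52
```

The final count of 2704 values is what allows the features to be laid out as a 52×52 square in the mask image step.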

A mask image can be formed from the input layer 16 of the fully connected layers. Here, a mask image means an image from which the original image cannot be identified (for a face, the person cannot be identified from the image alone). This is irreversible video processing: once the mask image has been formed, the original image can no longer be restored.

Specifically, as shown in FIG. 3, the one-dimensional information 16-1 of the input layer 16 of the fully connected layers (2704 values in FIG. 3) is rearranged into two-dimensional image information 16-2 (52×52 in FIG. 3). This information can be held as image data: as intensity information for a monochrome image, or as color and intensity information for a color image. For example, each pixel can carry 8 bits of information in a monochrome image, or 24 bits in an RGB color image. The 52×52-pixel image information is then enlarged into a 200×200-pixel mask image 16-3. This conversion matches the mask to the size of the originally captured face image.

For inference processing, the created mask image 16-3 is converted back into the original one-dimensional information. Specifically, the mask image 16-3 (200×200 in FIG. 3) is resized back to the two-dimensional image information 16-4 (52×52 in FIG. 3) that it had before enlargement, and is then rearranged into the one-dimensional information 16-1 (2704 values in FIG. 3). In this way, the information of the input layer 16 of the fully connected layers can be temporarily converted into the mask image 16-3 and carried in the image.
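The round trip described here (2704 values → 52×52 → 200×200 → 52×52 → 2704 values) can be sketched as follows. This is a minimal illustration, assuming 8-bit monochrome values and nearest-neighbor resizing; the patent does not specify the interpolation method, and all function names are hypothetical. With no codec in between, nearest-neighbor sampling recovers the values exactly; after a lossy codec they would only be approximately recovered.

```python
from typing import List

Image = List[List[int]]


def to_square(values: List[int], side: int) -> Image:
    """Arrange a flat list of side*side values into a square image, row by row."""
    if len(values) != side * side:
        raise ValueError("value count must equal side * side")
    return [values[r * side:(r + 1) * side] for r in range(side)]


def resize_nearest(img: Image, new_side: int) -> Image:
    """Nearest-neighbor resize of a square image, sampling at pixel centers."""
    old_side = len(img)
    # Source index for destination index d: floor((d + 0.5) * old / new)
    idx = [((2 * d + 1) * old_side) // (2 * new_side) for d in range(new_side)]
    return [[img[r][c] for c in idx] for r in idx]


def flatten(img: Image) -> List[int]:
    """Rearrange a 2D image back into a flat list, row by row."""
    return [v for row in img for v in row]


# Stand-in feature values (assumption: already quantized to 8 bits).
features = [(7 * i) % 256 for i in range(2704)]

mask = resize_nearest(to_square(features, 52), 200)  # imaging device side
restored = flatten(resize_nearest(mask, 52))         # video processing device side

print(len(mask), len(mask[0]))  # 200 200
print(restored == features)     # True (no lossy codec applied here)
```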

Next, the intermediate layer 17 of the fully connected layers uses 1000 neurons in FIG. 3. This is only an example, and a suitable number can be chosen as needed. The intermediate layer 17 may also be expanded into multiple layers.

The output layer 18 of the fully connected layers, which follows, uses 100 neurons. Here the number of neurons is the number of classes, that is, the number of categories that can be distinguished. For example, in face recognition the outputs correspond to person A, person B, person C, and so on, and the person is estimated from the neuron that fires most strongly. This inference processing can classify 100 people. Alternatively, 99 people can be classified and the remaining class used for "other".
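The "strongest neuron wins" rule for the output layer can be illustrated with a small sketch. The labels, scores, and function name below are hypothetical stand-ins; the example uses the 99-people-plus-"other" layout mentioned above.

```python
from typing import Sequence


def classify(scores: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label of the output neuron with the highest activation."""
    if len(scores) != len(labels):
        raise ValueError("one label is required per output neuron")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]


# 100 output neurons: 99 registered people plus one "other" class.
labels = [f"person_{i:02d}" for i in range(99)] + ["other"]

scores = [0.0] * 100
scores[42] = 0.9   # hypothetical activation for person_42
scores[99] = 0.3   # hypothetical activation for "other"

result = classify(scores, labels)
print(result)  # person_42
```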

FIG. 4 is a diagram showing an example of the processing of the imaging device in the video processing system of the present invention. This processing is performed on the imaging device 1 side and, unless otherwise noted, by the processing system section 3 of the imaging device 1. Irreversible video processing is performed here.

The imaging device 1 first performs video capture 21. This is done by the imaging section 2 and can be implemented with an image sensor and a video processing IC such as an FPGA. The capture is recorded as video, for example at 30 frames per second (30 fps) or more. The video captured by the imaging section 2 is sent to the processing system section 3 one frame at a time, and each frame can be processed individually.

Next, the processing system section 3 performs face detection 22 on the input video. Face detection 22 identifies the shape of a human face and detects the region that contains it. This is done automatically using existing techniques. When something is identified as a human face, its region is detected. To support the processing described later, a region identified as a face can be treated as detected only when it has at least a certain number of pixels. If the number of bits handled by one neuron of the input layer 16 equals the number of bits per pixel, the minimum region in the example of FIG. 4 is set to 52×52 pixels.
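The minimum-size condition on detected faces amounts to a simple filter over the detector output. The sketch below assumes that a face detector (not shown) returns axis-aligned boxes; the `Box` type and `usable_faces` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

MIN_SIDE = 52  # 52 x 52 pixels can hold the 2704 feature values, one per pixel


@dataclass(frozen=True)
class Box:
    """Axis-aligned face region in pixel coordinates."""
    x: int
    y: int
    width: int
    height: int


def usable_faces(boxes: List[Box], min_side: int = MIN_SIDE) -> List[Box]:
    """Keep only regions large enough to carry the feature mask."""
    return [b for b in boxes if b.width >= min_side and b.height >= min_side]


detections = [Box(10, 10, 200, 200), Box(300, 40, 40, 40), Box(500, 80, 52, 52)]
kept = usable_faces(detections)
print(len(kept))  # 2
```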

Next, the detection-region resizing unit performs detection-region resizing 23, which resizes the region detected by face detection 22 to a predetermined size. Because the size of the region detected by face detection 22 is not constant, this step converts it to a fixed size suitable for the feature calculation that follows. In the example of FIG. 4, a 200×200-pixel region is converted to 64×64 pixels.

Next, the feature calculation unit performs feature calculation 24 on the detection region. Here, the feature quantities needed for face recognition are obtained using a CNN or the like. This calculation is the same as the processing from the input layer 11 through the input layer 16 of the fully connected layers described with reference to FIG. 3.

Next, feature rearrangement/resizing 25 is performed. This step converts the data into a format whose size fits the region where the face was detected. The input layer 16 of the fully connected layers yields 2704 feature neurons, which form a 52×52 region when arranged in two dimensions. The region detected by face detection 22, on the other hand, is 200×200. To fit the 52×52 data derived from the feature neurons into the 200×200 face detection region, the data of each neuron is enlarged and assigned to approximately 4 pixels (200/52 ≈ 3.85 along each side). This converts the 52×52 data into 200×200 data. The feature rearrangement/resizing 25 is the same as the processing from the one-dimensional information 16-1 to the mask image 16-3 described with reference to FIG. 3.

The larger the enlargement ratio described above, the smaller the variation between pixels and between frames in the mask region. This smooths abrupt changes between pixels and between frames, making the data easier to process with a lossy codec. The feature quantities must fit in the data area of the minimum image size at which face detection is performed; depending on this minimum size, the output of an intermediate pooling layer of the CNN, for example, can also be treated as the feature quantities.

Next, mask processing 26 applies the rearranged feature amounts to the original image at the region detected by face detection 22. The rearranged feature amounts (200×200) are fitted to the region detected by face detection 22 as the mask image 16-3 and are thereby placed on the original image. Because the colors and densities of the mask image 16-3 are determined by the feature amounts, it differs from the original image of the detected region and carries information that differs from a human face.
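As a rough sketch of this overlay step, the snippet below pastes a square mask onto a single-channel image held as a list of rows. The function name, the image layout, and the bounds check are assumptions made for illustration only.

```python
def apply_mask(image, mask, top, left):
    """Return a copy of image with mask pasted so that its top-left
    corner lands at row `top`, column `left`."""
    height, width = len(mask), len(mask[0])
    if top < 0 or left < 0 or top + height > len(image) or left + width > len(image[0]):
        raise ValueError("mask does not fit inside the image")
    out = [row[:] for row in image]          # leave the input untouched
    for dy, mask_row in enumerate(mask):
        out[top + dy][left:left + width] = mask_row
    return out


image = [[0] * 640 for _ in range(480)]      # placeholder frame
mask = [[7] * 200 for _ in range(200)]       # placeholder 200 x 200 mask image
masked = apply_mask(image, mask, top=100, left=50)
print(masked[100][50], masked[99][50])
```

Only the detected region is overwritten; every pixel outside it keeps its original value.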

Next, mask-processing metadata addition 27 is performed on the image that has undergone mask processing 26. The metadata includes, for example, the index number of the masked image, the coordinates of the starting point of the mask on the image, and the length of one side of the mask. This attaches the information needed to identify the masked region and the information needed to identify the masked image.
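One way to picture this metadata is as a small record carried with each masked image. The field names and the JSON serialization below are hypothetical; the description lists only the kinds of information involved (image index, starting point, side length), not a concrete format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MaskMetadata:
    image_index: int  # index number of the masked image
    start_x: int      # x coordinate of the mask's starting point
    start_y: int      # y coordinate of the mask's starting point
    side: int         # length of one side of the square mask

    def region(self):
        """Return (left, top, right, bottom); right and bottom are exclusive."""
        return (self.start_x, self.start_y,
                self.start_x + self.side, self.start_y + self.side)


meta = MaskMetadata(image_index=42, start_x=50, start_y=100, side=200)
payload = json.dumps(asdict(meta))             # what might travel with the image
restored = MaskMetadata(**json.loads(payload))
print(restored.region())
```

The receiving side needs only these few values to locate the mask image inside the frame.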

Next, external output 28 is performed. At output, a codec is applied to reduce the amount of data to be transmitted. Video generally uses a lossy codec, but some applications need only intermittent transmission of images, in which case a lossless codec may be used. The output information is sent to the video processing device 5 via the Internet or a similar network.

FIG. 5 shows an example of the processing performed by the video processing device in the video processing system of the present invention. This processing is performed on the video processing device 5 side and, unless otherwise noted, by the processing system unit 6 of the video processing device 5. Here, inference processing using machine learning is performed to identify a person.

First, the video data containing the images output from the imaging device 1 at the external output 28 in FIG. 4 is supplied to the video input unit of the video processing device 5 as video input 31.

Next, the feature extraction/resizing/rearrangement unit performs feature extraction, resizing, and rearrangement 32 using the metadata of the video data. First, the mask image 16-3 is extracted from the video data; its extent can be determined from the attached metadata. The mask image is then returned to two-dimensional image information 16-4 (52×52 in FIG. 5) and further rearranged into one-dimensional information 16-5 (2704 values in FIG. 5). This is the same as in FIG. 3. The feature values are obtained in this way. Because resizing and codec processing have been applied along the way, the values may be slightly shifted and may not match the originals exactly. The shift is small enough not to affect the subsequent step of obtaining an inference result from the feature amounts, however, and values equal or close to the original feature amounts (one-dimensional information 16-1) are obtained.
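The recovery of the feature values can be sketched as below. Everything here is illustrative: the helper names are hypothetical, block averaging is an assumed way of shrinking the mask, and a deterministic ±1 perturbation stands in for resizing and codec error.

```python
def crop_square(frame, top, left, side):
    """Cut out the side x side region whose top-left corner is (top, left)."""
    return [row[left:left + side] for row in frame[top:top + side]]


def shrink_by_block_mean(mask, out_size):
    """Reduce a square mask to out_size x out_size by averaging the pixels
    that a nearest-neighbour enlargement would have copied from each cell."""
    in_size = len(mask)
    sums = [[0.0] * out_size for _ in range(out_size)]
    counts = [[0] * out_size for _ in range(out_size)]
    for y, row in enumerate(mask):
        cy = y * out_size // in_size
        for x, value in enumerate(row):
            cx = x * out_size // in_size
            sums[cy][cx] += value
            counts[cy][cx] += 1
    return [[s / c for s, c in zip(srow, crow)] for srow, crow in zip(sums, counts)]


def flatten(grid):
    """Rearrange a 2D grid into a one-dimensional list (row-major)."""
    return [value for row in grid for value in row]


# Test frame: enlarge known features to 200 x 200, perturb every pixel by
# -1, 0 or +1 to imitate codec error, and paste the result into a frame.
original = [(i * 7) % 256 for i in range(2704)]
src = [i * 52 // 200 for i in range(200)]
noisy_mask = [[original[src[y] * 52 + src[x]] + ((x + y) % 3 - 1)
               for x in range(200)] for y in range(200)]
frame = [[0] * 320 for _ in range(240)]
top, left, side = 20, 30, 200        # would be read from the metadata
for dy, mask_row in enumerate(noisy_mask):
    frame[top + dy][left:left + side] = mask_row

recovered = flatten(shrink_by_block_mean(crop_square(frame, top, left, side), 52))
max_error = max(abs(a - b) for a, b in zip(original, recovered))
print(len(recovered), max_error <= 1)
```

Averaging over the enlarged block tends to cancel small per-pixel errors, which is one reason a larger enlargement ratio can make the recovered values more robust.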

Next, acquisition 33 of an inference result from the feature amounts is performed. This is the same as the processing of the fully connected layers 16 to 18 in FIG. 3. Here, the unit that obtains inference results from feature amounts determines the class. In the example of FIG. 5, the inference processing identifies an individual from the face.

The above processing is made possible by storing information about individuals' faces in the video processing device 5. For example, to output classes for 100 people, information for 100 people is held so that an individual can be identified from the feature amounts. It is also possible to prepare one additional class that indicates "another person" for cases where the subject matches none of the pre-registered people.
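The inference step, including an extra class for unregistered people, might look like the toy example below. The layer sizes, weights, and labels are invented for illustration; a real system would use the trained parameters of the fully connected layers 16 to 18 and, for instance, 100 registered people plus one "other" class.

```python
import math


def dense(x, weights, bias):
    """One fully connected layer; weights holds one row per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)                              # subtract the max for stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def classify(features, layers, labels):
    """Run features through the dense layers and return (label, probability)."""
    x = features
    for i, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if i < len(layers) - 1:                # no activation after the last layer
            x = relu(x)
    probs = softmax(x)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy parameters (assumption): 4 features -> 3 hidden -> 3 classes.
labels = ["person_A", "person_B", "other"]
layers = [
    ([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]], [0, 0, 0]),
    ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0, 0, 0]),
]
print(classify([5, 1, 0, 0], layers, labels)[0])   # person_A
print(classify([0, 0, 2, 3], layers, labels)[0])   # other
```

Treating "other" as an ordinary output class keeps the decision a single argmax over the network outputs.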

In addition, the conventions needed for feature extraction, such as the data structure of the feature amounts and the parameters of the neural network, are shared in advance between the imaging device 1 and the video processing device 5. As a result, when the mask image 16-3 is sent to the video processing device 5, it can be returned to the one-dimensional information 16-5 and the class can be output from the feature amounts. The video processing device 5 may also be given a function for configuring these parameters on the imaging device 1.

The embodiment above showed an example of identifying a person by face detection, but a person's behavior can be identified as well. For example, the imaging device 1 is provided with a person detection function, detects the whole person, and masks the whole person with a two-dimensional image that contains the feature amounts. The video processing device 5 then infers from those feature amounts what the masked person is doing. In this case, a class is output for each type of human behavior.

(Effects)
The embodiment above achieves irreversible masking of person regions (a face or a whole person) where privacy protection matters. At the same time, the destination receives the data needed to identify the person or behavior and runs the inference as post-processing when required. This makes it possible to determine who the person is, or what they are doing, even in the masked region.

With conventional reversible masking, decoding the masked portion restores, for example, the original image of the person, and if that image leaks, all the personal information contained in it leaks as well. With the method of this embodiment, by contrast, even if the information is leaked and decoded by a malicious third party, the exposure is held to a minimum: for face recognition, only the label information associated with the face, such as a name; for behavior recognition, only the label information of the behavior.

Furthermore, if the imaging device itself carried the inference through to the person recognition or behavior recognition result, that result would have to be transmitted, and the label information would leak if the communication were intercepted. In this embodiment, by contrast, the inference from the feature amounts is performed on the receiving video processing device 5. Even if data from the imaging device 1 leaks, inference is therefore not possible without knowing the agreed conventions, such as the data structure of the feature amounts and the structure of the neural network parameters. The information from the imaging device 1 is thus doubly protected, in addition to communication encryption, which makes the data harder to decode. Embedding the feature amounts in the mask image 16-3 also reduces the amount of data to be transmitted.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments and includes various modifications. For example, the embodiments were described in detail to make the present invention easy to understand, and the invention is not necessarily limited to configurations that include every element described. In addition, other configurations may be added to, deleted from, or substituted for part of the configuration of each embodiment.

For example, in the embodiment above the feature amounts are embedded in the mask image 16-3 to reduce the amount of data transmitted. However, a configuration is also possible in which the image is given a suitable mask that carries no feature information (for example, a mask of uniform color and density) and the feature information and the image are transmitted separately.
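A sketch of this alternative configuration follows. The function names, the fill value of 128, and the dictionary layout of the separately transmitted data are all assumptions; the description states only that a uniform mask is applied and that the feature information travels apart from the image.

```python
def flat_mask(image, top, left, side, value=128):
    """Return a copy of image with a uniform square drawn over the region."""
    out = [row[:] for row in image]
    for y in range(top, top + side):
        out[y][left:left + side] = [value] * side
    return out


def build_outputs(image, features, top, left, side):
    """Produce the two items that would be transmitted separately:
    the uniformly masked image and a record holding the feature amounts."""
    masked = flat_mask(image, top, left, side)
    side_channel = {"region": (top, left, side), "features": list(features)}
    return masked, side_channel


image = [[x + 10 * y for x in range(8)] for y in range(8)]   # tiny placeholder image
masked, side_channel = build_outputs(image, [0.25, 0.5, 0.75], top=2, left=3, side=4)
print(masked[2][3], masked[1][3], side_channel["features"])
```

In this variant the image itself carries no feature information, so the receiver reads the features directly from the separate record instead of recovering them from pixel values.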

In addition, although the embodiment above showed an example using a CNN, the present invention can also be applied when a DNN (Deep Neural Networks) method is used for the machine learning.

1... Imaging device, 2... Imaging unit, 3... Processing system unit, 5... Video processing device, 6... Processing system unit, 7... Display output unit, 11... Input layer, 12... Convolution layer, 13... Pooling layer, 14... Convolution layer, 15... Pooling layer, 16... Input layer of fully connected layers, 17... Intermediate layer of fully connected layers, 18... Output layer of fully connected layers, 21... Video capture, 22... Face detection, 23... Resizing of detected region, 24... Feature calculation for detected region, 25... Feature rearrangement/resizing, 26... Mask processing on original image, 27... Mask-processing metadata addition, 28... External output, 31... Video input, 32... Feature extraction/resizing/rearrangement, 33... Acquisition of inference result from features, 300... Computer system, 302... Processor, 302A, 302B... Processing devices, 304... Memory, 306... Memory bus, 308... I/O bus, 309... Bus interface, 310... I/O bus interface, 312... Terminal interface, 314... Storage interface, 316... I/O device interface, 318... Network interface, 320... User I/O device, 322... Storage device, 324... Display system, 326... Display device, 330... Network, 350... Application

Claims

An imaging device that captures video to acquire an image, detects a predetermined region in the image, resizes the detected region to extract a feature amount of the detected region, and outputs an image in which the extracted feature amount, arranged two-dimensionally as a mask image, is placed in the detected region of the acquired image,
wherein the extraction of the feature amount is performed using a CNN (Convolutional Neural Networks) or DNN (Deep Neural Networks) method.

The imaging device according to claim 1,
wherein the predetermined region is a human face region.

An imaging device that captures video to acquire an image, detects a predetermined region in the image, resizes the detected region to extract a feature amount of the detected region, and outputs an image in which the extracted feature amount, arranged two-dimensionally as a mask image, is placed in the detected region of the acquired image,
wherein the output image is provided with information specifying the extent of the mask image within the image.

A video processing system comprising an imaging device and a video processing device,
wherein the imaging device captures video to acquire an image, detects a predetermined region in the image, resizes the detected region to extract a feature amount of the detected region, and outputs an image in which the extracted feature amount, arranged two-dimensionally as a mask image, is placed in the detected region of the acquired image, and
the video processing device receives the image output from the imaging device, acquires the feature amount from the mask image, and performs inference processing based on the feature amount.

The video processing system according to claim 4,
wherein the extraction of the feature amount and the inference processing are performed using a CNN (Convolutional Neural Networks) or DNN (Deep Neural Networks) method.

The video processing system according to claim 4,
wherein the predetermined region is a human face region and the inference processing is processing that identifies a person.

The video processing system according to claim 4,
wherein the video processing device has a function of setting, on the imaging device, the parameters used for extracting the feature amount.