JP2019008421A

JP2019008421A - Processing method, program, information processing apparatus, and image processing apparatus

Info

Publication number: JP2019008421A
Application number: JP2017121615A
Authority: JP
Inventors: 敬貴小宮山; Noritaka Komiyama
Original assignee: Konica Minolta Inc
Current assignee: Konica Minolta Inc
Priority date: 2017-06-21
Filing date: 2017-06-21
Publication date: 2019-01-17
Anticipated expiration: 2037-06-21
Also published as: JP6879072B2

Abstract

PROBLEM TO BE SOLVED: To provide a processing method for performing convolution operations efficiently using a processor capable of processing SIMD instructions. SOLUTION: In an information processing apparatus comprising a plurality of registers usable for SIMD and a processor that performs SIMD processing, the size of the registers to be used is selected based on the kernel size, a kernel division amount is determined from the number of registers of the selected size and the kernel size, each kernel portion divided by the determined division amount is stored in the registers, and convolution operations are processed in parallel. SELECTED DRAWING: Figure 1

Description

The present invention relates to a processing method, a program, an information processing apparatus, and an image processing apparatus.

In the field of image recognition, which recognizes objects in acquired images, recognition accuracy has been improving. This is largely due to multi-layered convolutional neural networks (hereinafter, “CNN”).

CNN processing includes a plurality of convolution layers, each of which performs a convolution operation. Because a convolution operation repeats many product-sum operations, it requires a large amount of processing and takes time.

For example, the technique disclosed in Patent Document 1 performs the convolution operation using a graphics processing unit (hereinafter, “GPU”). If the image data to be processed is simply expanded into the GPU memory during the convolution operation, the required memory size increases, and so does the cost. To solve this problem, the technique disclosed in Patent Document 1 divides the image data into a plurality of data blocks, reads the divided data blocks and a plurality of filters into local memory simultaneously, and computes the convolution operations in parallel.

Patent Document 1: Japanese Patent Laying-Open No. 2016-4572

However, the technique disclosed in Patent Document 1 presupposes a GPU. When image recognition is to be performed by an embedded system mounted in a device such as a surveillance camera, equipping the embedded system with a GPU is impractical because of its large power consumption. As a result, a CPU with low power consumption has to be used.

The present invention has been made in view of the above circumstances. Its object is to provide a processing method that performs convolution operations efficiently using an information processing apparatus that includes a processor with low power consumption, in particular a processor capable of processing SIMD (Single Instruction Multiple Data) instructions.

The above object of the present invention is achieved by the following means.

(1) A processing method for controlling an information processing apparatus comprising a plurality of registers available for SIMD and a processor that performs SIMD processing, the method comprising:
(a) selecting a size of the registers to be used based on a kernel size;
(b) determining a kernel division amount from the number of registers of the selected size and the kernel size; and
(c) storing each kernel portion, divided by the determined kernel division amount, in the registers and processing convolution operations in parallel.

(2) The processing method according to (1), wherein, when n is the number of kernel rows stored in the registers at one time, step (b) sets n to the maximum value satisfying the following expression:
k − (x + j) ≥ (b + 1)n
where j = (w + 1 − y) / b and b = w / q, with j and b each rounded up to the next integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers necessary for the convolution operation
y: number of pixels to be skipped.

(3) The processing method according to (1), wherein, when the kernel size is 11 × 11, step (b) determines that the kernel is divided into groups of two rows, and step (c) stores the kernel in the registers two rows at a time and, after storing, executes the convolution operation for each two rows.

(4) The processing method according to (3), wherein, when the kernel size is 11 × 11, step (a) selects 128-bit registers, and step (c) stores four input pixels in a 128-bit register and processes the convolution operations in parallel, four pixels at a time.

(5) The processing method according to (1), wherein, when the kernel size is 5 × 5 or smaller, step (b) determines that the kernel is not divided, and step (c) stores the whole kernel in the registers at one time and executes the convolution operation after storing.

(6) The processing method according to (5), wherein, when the kernel size is 5 × 5 or smaller, step (a) selects 64-bit registers, and step (c) stores two input pixels in a 64-bit register and processes the convolution operations in parallel, two pixels at a time.

(7) The processing method according to any one of (1) to (6), wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in an intermediate buffer is updated while the data stored in the input-pixel registers is slid.

(8) A program for controlling an information processing apparatus comprising a plurality of registers available for SIMD and a processor that performs SIMD processing, the program causing execution of a method comprising:
(a) selecting a size of the registers to be used based on a kernel size;
(b) determining a kernel division amount from the number of registers of the selected size and the kernel size; and
(c) storing each kernel portion, divided by the determined kernel division amount, in the registers and processing convolution operations in parallel.

(9) The program according to (8), wherein, when n is the number of kernel rows stored in the registers at one time, step (b) sets n to the maximum value satisfying the following expression:
k − (x + j) ≥ (b + 1)n
where j = (w + 1 − y) / b and b = w / q, with j and b each rounded up to the next integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers necessary for the convolution operation
y: number of pixels to be skipped.

(10) The program according to (8), wherein, when the kernel size is 11 × 11, step (b) determines that the kernel is divided into groups of two rows, and step (c) stores the kernel in the registers two rows at a time and, after storing, executes the convolution operation for each two rows.

(11) The program according to (10), wherein, when the kernel size is 11 × 11, step (a) selects 128-bit registers, and step (c) stores four input pixels in a 128-bit register and processes the convolution operations in parallel, four pixels at a time.

(12) The program according to (8), wherein, when the kernel size is 5 × 5 or smaller, step (b) determines that the kernel is not divided, and step (c) stores the whole kernel in the registers at one time and executes the convolution operation after storing.

(13) The program according to (12), wherein, when the kernel size is 5 × 5 or smaller, step (a) selects 64-bit registers, and step (c) stores two input pixels in a 64-bit register and processes the convolution operations in parallel, two pixels at a time.

(14) The program according to any one of (8) to (13), wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in an intermediate buffer is updated while the data stored in the input-pixel registers is slid.

(15) An information processing apparatus comprising a plurality of registers available for SIMD and a processor that performs SIMD processing, wherein the apparatus selects a size of the registers to be used based on a kernel size, determines a kernel division amount from the number of registers of the selected size and the kernel size, stores each kernel portion divided by the determined kernel division amount in the registers, and processes convolution operations in parallel.

(16) The information processing apparatus according to (15), wherein, when n is the number of rows of the kernel, divided by the determined division amount, that are stored in the registers at one time, n is set to the maximum value satisfying the following expression:
k − (x + j) ≥ (b + 1)n
where j = (w + 1 − y) / b and b = w / q, with j and b each rounded up to the next integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers necessary for the convolution operation
y: number of pixels to be skipped.

(17) The information processing apparatus according to (15) or (16), wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in an intermediate buffer is updated while the data stored in the input-pixel registers is slid.

(18) An image processing apparatus comprising:
an image acquisition unit that acquires an image generated by an imaging device; and
the information processing apparatus according to any one of (15) to (17), which extracts features of a person shown in the image and peripheral features indicating the shape, position, or type of an object around the person shown in the image.

According to the processing method of the present invention, the size of the registers to be used is selected based on the kernel size, a kernel division amount is determined from the number of registers of the selected size and the kernel size, each kernel portion divided by the determined division amount is stored in the registers, and convolution operations are processed in parallel. In this way, convolution operations can be performed efficiently.

A block diagram showing an information processing apparatus according to an embodiment of the present invention.
A schematic diagram showing a 5 × 5 kernel stored in 64-bit registers and in 128-bit registers.
A schematic diagram showing an 11 × 11 kernel stored in 128-bit registers.
A schematic diagram explaining the processing when the division amount of an 11 × 11 kernel is two rows.
A diagram showing how the windows overlap for a stride of 4 in CNN processing.
A flowchart of the CNN processing executed by the information processing apparatus.
A flowchart continuing from FIG. 6A.
A schematic diagram showing the data storage state of the registers.
A schematic diagram showing the data storage state of the registers.
A flowchart according to a modification.
A block diagram showing an image processing apparatus according to an embodiment.
A diagram showing the functional blocks of the image processing apparatus.

Embodiments of the present invention are described below with reference to the accompanying drawings. In the description of the drawings, the same elements are denoted by the same reference numerals, and redundant description is omitted. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.

FIG. 1 is a block diagram showing an information processing apparatus according to an embodiment of the present invention. As shown in the figure, the information processing apparatus 100 includes a SIMD processor 10, a general-purpose processor 20, and a memory 30, which are connected to one another by buses such as a data bus, a command bus, and an address bus. The information processing apparatus 100 is, for example, an information processing apparatus for an embedded system in a device such as a surveillance camera.

The SIMD processor 10 is a processor that executes SIMD (single instruction, multiple data) instructions, which operate on a plurality of data items with one instruction. An example is NEON, which supports the SIMD extension instructions of ARM Holdings. SIMD-type processing (hereinafter simply “SIMD processing”) is also called vector data processing.

The SIMD processor 10 includes a SIMD register file 11, an ALU (Arithmetic Logic Unit) 12, a MUL (Multiplier Unit) 13, a shifter 14, and an LS (Load/Store Unit) 15.

The SIMD register file 11 can be composed of a plurality of SIMD registers (hereinafter simply “registers”) that are 256, 128, or 64 bits long. In this embodiment, the SIMD register file 11 can be used as sixteen 128-bit registers or as thirty-two 64-bit registers (hereinafter also “128-bit registers” and “64-bit registers”, respectively). For 32-bit single-precision floating-point numbers or integers, a 128-bit register can hold four operands at one time and a 64-bit register can hold two. Such data is also called vector data.

The ALU 12 is an arithmetic unit that performs addition, subtraction, and logical operations. The MUL 13 is an arithmetic unit that performs multiplication and division. The shifter 14 is an arithmetic unit that performs shift processing, moving bits to the left or right. The LS 15 loads data from the memory 30 into the registers and stores data from the registers into the memory 30.

The general-purpose processor 20 is a processor that executes general scalar data processing other than SIMD processing. The general-purpose processor 20 includes an ALU 22, a MUL 23, and an LS 24. These have the same functions as the ALU 12, the MUL 13, and the LS 15, so their description is omitted.

The memory 30 includes RAM such as DRAM (Dynamic Random Access Memory) and ROM such as NAND flash memory. It may further include an auxiliary storage device such as an HDD. The memory 30 temporarily stores the image data to be processed, and also stores a program for the CNN processing method executed by the information processing apparatus 100 (described later), a control program for controlling the device in which the information processing apparatus 100 is incorporated, various data, and the like.

(Kernel storage)
In this embodiment, because of the hardware resource constraints shown in FIG. 1, the registers of the SIMD register file 11 can run short during CNN processing if an entire kernel (also called a filter) is to be stored at once, depending on the kernel size. In this embodiment, therefore, the kernel is divided and stored part by part in the 128-bit registers or 64-bit registers into which the SIMD register file 11 is partitioned. The size of the kernel applied differs depending on the processing.

FIG. 2 is a schematic diagram showing a kernel 5 rows high and 5 columns wide (hereinafter, “5 × 5 size” and the like) stored in 64-bit registers and in 128-bit registers. For SIMD processing, the kernel is stored row by row, each row occupying one or more registers. A 5 × 5 kernel consists of 25 data items; when each item is a 32-bit single-precision floating-point number, one 128-bit register can hold four consecutive items.

When the five data items of one row are stored in 128-bit registers, two 128-bit registers are required, as shown in the figure; the second register holds only one item, leaving three slots empty. When the row is stored in 64-bit registers, each of which holds two kernel data items, three 64-bit registers are required, and the third has one empty slot.

Comparing the two, for a 5 × 5 kernel the 64-bit registers give higher storage efficiency and make more effective use of the registers.
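The row-by-row packing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper name `pack_row`, the use of `None` to mark an empty slot, and the sample values are all assumptions made for this example.

```python
from math import ceil

LANE_BITS = 32  # each kernel value is a 32-bit single-precision float


def pack_row(row, register_bits):
    """Split one kernel row into register-sized groups, padding unused slots with None."""
    lanes = register_bits // LANE_BITS
    count = ceil(len(row) / lanes)
    padded = list(row) + [None] * (count * lanes - len(row))
    return [padded[i * lanes:(i + 1) * lanes] for i in range(count)]


row = [0.1, 0.2, 0.3, 0.4, 0.5]  # one row of a 5 x 5 kernel (arbitrary values)
for bits in (64, 128):
    regs = pack_row(row, bits)
    empty = sum(slot is None for reg in regs for slot in reg)
    print(f"{bits}-bit: {len(regs)} registers per row, {empty} empty slot(s)")
```

For the five-element row, the sketch reports three 64-bit registers with one empty slot and two 128-bit registers with three empty slots, matching the comparison in the text.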

Similarly, for an 11 × 11 kernel, the last register of each row has one empty slot whether 64-bit or 128-bit registers are selected. FIG. 3 is a schematic diagram showing an 11 × 11 kernel stored in 128-bit registers; the third register (k02) has one empty slot. In this case, the storage efficiency of 64-bit and 128-bit registers is the same. Considering SIMD processing, however, a 128-bit register holds more data than a 64-bit register, so more data can be transferred in one cycle (one load or store instruction), and the transfer efficiency is therefore higher.

In this embodiment, the size of the SIMD register file limits the number of registers that can be allocated to kernel storage. With 128-bit registers, for example, 16 registers are available. Storing an 11 × 11 kernel in 128-bit registers all at once would require 33 (= 3 × 11) registers, so 16 registers are not enough. In the method of this embodiment, described later, the kernel division amount is therefore determined according to conditions such as the kernel size, the kernel is divided into a plurality of portions by the determined division amount, and the convolution operation is performed for each portion. This allows CNN processing to be performed efficiently even with limited resources such as those of an embedded system (see FIG. 4).
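Dividing the kernel by rows does not change the result, because a product-sum over the whole window is the sum of the product-sums over each group of rows. The sketch below illustrates this in plain scalar Python; the function names and the integer test data are assumptions made for the example, and no SIMD instructions are modeled.

```python
def split_rows(kernel, n):
    """Divide a kernel (list of rows) into consecutive groups of at most n rows."""
    return [kernel[r:r + n] for r in range(0, len(kernel), n)]


def window_sum(window, kernel):
    """Product-sum of a window with a kernel of the same shape (reference result)."""
    return sum(p * c for prow, krow in zip(window, kernel) for p, c in zip(prow, krow))


def window_sum_divided(window, kernel, n):
    """Same product-sum, accumulated one n-row kernel group at a time."""
    total = 0
    top = 0
    for part in split_rows(kernel, n):
        rows = window[top:top + len(part)]
        total += window_sum(rows, part)  # partial result for this group
        top += len(part)
    return total


w = 11
kernel = [[(r * w + c) % 7 - 3 for c in range(w)] for r in range(w)]
window = [[(r + 2 * c) % 5 for c in range(w)] for r in range(w)]
groups = split_rows(kernel, 2)
print([len(g) for g in groups])
print(window_sum(window, kernel) == window_sum_divided(window, kernel, 2))
```

With a division amount of two rows, the 11 rows fall into five groups of two rows and one final group of one row.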

(Shared use of calculation results)
In this embodiment, the stride in the CNN processing (AlexNet) is 4. FIG. 5 shows the overlap between the input pixel ranges before and after a stride of four pixels. As shown in FIG. 5, when a convolution operation is performed on i × i input pixels using an 11 × 11 kernel, the 11 × 11 input pixels starting from the upper left are multiplied by the kernel data at the corresponding positions and summed to calculate the data of one output pixel, as shown in FIG. 5(a). The thick frame is a window indicating the range of input pixels that undergo the product-sum operation with the kernel.

The next output pixel is calculated by performing the product-sum operation between the kernel and the input pixels shifted by the stride setting, that is, by four pixels as shown in FIG. 5(b). Because the previous window and the current window partly overlap, the input pixels stored in the registers can be shared. The amount of overlap depends on the skip setting and the kernel size. In the following example, the number of pixels by which the kernel is strided equals the number of pixels shifted for the convolution operation (the number of stored data items q, described later). The stride is preferably equal to, or an integer multiple of, the number of pixels shifted for the convolution operation. In the method of this embodiment described below, the input pixels in this shared region are stored in registers and used in the convolution operations for a plurality of output pixels, which makes the processing more efficient.
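The overlap between two horizontally adjacent windows can be checked with a small sketch. The function names are assumptions made for this example. Note that the literal overlap is w − y columns; the description later counts w + 1 − y shared pixels and states that the added 1 is an adjustment to the number of data items held in a register.

```python
def window_columns(start, w):
    """Input columns covered by a window of width w starting at column start."""
    return set(range(start, start + w))


def shared_columns(w, y):
    """Columns common to two horizontally adjacent windows when the stride is y."""
    return sorted(window_columns(0, w) & window_columns(y, w))


w, y = 11, 4  # kernel width and stride used in this embodiment
common = shared_columns(w, y)
print(common)       # [4, 5, 6, 7, 8, 9, 10]
print(len(common))  # 7, i.e. w - y
```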

(Processing method)
The procedure of the CNN processing performed by the information processing apparatus 100 is described below with reference to FIGS. 6A to 8. FIGS. 6A and 6B are flowcharts of the CNN processing executed by the information processing apparatus 100.

(Step S111)
First, the information processing apparatus 100 selects the register size based on the kernel size. As described above, the selection first (1) chooses the size with the higher storage efficiency. If the storage efficiency does not differ, it then (2) chooses the larger register size from the viewpoint of transfer efficiency. For example, if the kernel size is 5 × 5, 64-bit registers are selected from the viewpoint of storage efficiency. If the kernel size is 11 × 11, 128-bit registers are selected from the viewpoints of storage efficiency and transfer efficiency.
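The two-stage selection rule of step S111 can be sketched as below. This is an illustrative reading of the rule, not the patented implementation: the function names, the candidate sizes passed as a tuple, and the definition of storage efficiency as filled slots divided by allocated slots per kernel row are assumptions made for this example.

```python
from fractions import Fraction
from math import ceil

LANE_BITS = 32  # assumed element width: 32-bit single-precision float


def storage_efficiency(w, register_bits):
    """Fraction of register slots actually filled by one kernel row of width w."""
    lanes = register_bits // LANE_BITS
    registers = ceil(w / lanes)
    return Fraction(w, registers * lanes)


def select_register_bits(w, candidates=(64, 128)):
    """Prefer higher storage efficiency; break ties with the larger register."""
    return max(candidates, key=lambda bits: (storage_efficiency(w, bits), bits))


print(select_register_bits(5))   # 64
print(select_register_bits(11))  # 128
```

Exact fractions are used so that the tie for the 11 × 11 kernel (11/12 for both sizes) is detected reliably before the larger register is preferred.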

(Step S112)
Next, the kernel division amount is determined using the number of usable registers of the size selected in step S111 and the kernel size.

The kernel division amount, that is, the number of rows per division n, is set to the maximum integer satisfying the following expression (1):
k − (x + j) ≥ (b + 1)n   (1)
where j = (w + 1 − y) / b and b = w / q, with j and b each rounded up to the next integer, and
w: kernel size
k: number of registers
q: number of data items stored in one register
x: number of registers required for the convolution operation
y: number of pixels to be skipped.
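Expression (1) can be evaluated directly, as in the following sketch. The function name `division_rows` is an assumption, as is the clamp to zero when no registers remain. The sample arguments are the values used for the 11 × 11 example of this embodiment (w = 11, k = 16, q = 4, x = 4, y = 4).

```python
from math import ceil


def division_rows(w, k, q, x, y):
    """Largest integer n with k - (x + j) >= (b + 1) * n, following expression (1)."""
    b = ceil(w / q)            # register-wide calculations needed per kernel row
    j = ceil((w + 1 - y) / b)  # intermediate buffers
    m = k - (x + j)            # registers left for kernel rows and input pixels
    return max(m // (b + 1), 0)  # clamp at 0 (assumption) if nothing is left


print(division_rows(w=11, k=16, q=4, x=4, y=4))  # 2
```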

(About expression (1))
k z-bit registers, selected based on the kernel size w × w, are used.

If the smallest data width is 32 bits (single-precision floating point), a z-bit register holds z / 32 = q data items.

x registers are needed for the multiplications and additions of the convolution operation. The number of registers needed for these operations is described later.

For a kernel of size w × w, one row requires w / q = b calculations using z-bit registers (b is rounded up to an integer).

When the convolution is performed while skipping the input pixels y pixels at a time, (w + 1 − y) pixels can be shared.

Therefore, considering that a z-bit register holds q data items, (w + 1 − y) / b = j intermediate buffers must be prepared (j is rounded up to an integer).
The number of remaining registers is then k − (x + j) = m.
When the kernel is divided into groups of n rows, the kernel requires bn registers and the input pixel data requires n registers.
The maximum integer n satisfying m ≥ (b + 1)n is therefore obtained.

(Specific example of this embodiment, where n = 2)
For example, when 128-bit registers are selected in step S111 for a kernel size of 11 × 11 (w = 11), the number of registers k that can be allocated to SIMD processing is 16. The number of data items q stored in one 128-bit register is four, as 32-bit single-precision floating-point numbers.

At least three registers (x) are needed for the multiplications and additions of the convolution operation (described later). A 128-bit register holds four (= 128 / 32) 32-bit data items.

For a kernel size of 11 × 11, one row requires 11 / 4 = 2.75 calculations using 128-bit registers, which rounds up to three (b = 3).

Further, when the convolution is performed while skipping the input pixels four pixels at a time, eight pixels (w + 1 − y = 11 + 1 − 4) can be used in common, so three intermediate buffers suffice (j = 8 / b ≈ 2.67, rounded up to j = 3). The "1" added here is an adjustment value for matching the number of data items stored in a register at one time (q: 4).

The number of remaining registers is 16 − (4 + 3) = 9.
When the kernel is divided into groups of n rows,
16 − (4 + 3) ≧ (b + 1) n = 4n
and the maximum integer n that satisfies 9 ≧ 4n is n = 2.
In summary, of the 16 (k) 128-bit (z) registers, those with a fixed use are 6 (3n) for kernel storage, 2 (n) for input pixels, 3 (j) for intermediate buffers, and 4 (x) for multiplication and addition, for a total of 15.
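The tally above can be checked with a few lines of Python. This is a sketch only; `register_budget` and the role names are hypothetical labels for the four uses listed in the text.

```python
def register_budget(k, b, n, j, x):
    """Tally registers by role and return (roles, used, spare)."""
    roles = {
        "kernel": b * n,            # b chunks for each of n kernel rows
        "input_pixels": n,          # one register per input row
        "intermediate_buffers": j,
        "multiply_add": x,
    }
    used = sum(roles.values())
    return roles, used, k - used


roles, used, spare = register_budget(k=16, b=3, n=2, j=3, x=4)
print(roles["kernel"], roles["input_pixels"], used, spare)  # 6 2 15 1
```

With n = 3 the same tally would need 19 registers, which exceeds the 16 available and is why n = 2 is the maximum here.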

(Case in which x = 3 registers are required for the multiplication and addition of the convolution operation)
When calculating (a + b) × (c + d) = e, a and b are stored in registers R0 and R1, respectively. The result x of a + b is stored in register R0. Then c and d are stored in registers R1 and R2, respectively. The result y of c + d is stored in register R1. The result e of x * y is stored in register R0. Thus three registers, R0 to R2, are required.

(Another example: case in which x = 4 registers are required for the multiplication and addition of the convolution operation)
When calculating (a + b) × (c + d) = e, a and b are stored in registers R0 and R1, respectively. The result x of a + b is stored in register R2. Then c and d are stored in registers R0 and R1, respectively. The result y of c + d is stored in register R3. The result e of x * y is stored in register R3. Thus four registers, R0 to R3, are required. As described above, these numbers are only examples; the number of registers needed for multiplication and addition varies with the number of variables and with register availability.
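The two register sequences described above can be replayed on a toy register model to confirm that both produce e and to count the registers touched. The interpreter `run` and the tuple-based instruction format are assumptions made purely for illustration; they do not correspond to any real instruction set.

```python
def run(program, inputs):
    """Execute (dest, op, src...) steps; return final registers and names used."""
    regs = {}
    for dest, op, *src in program:
        if op == "load":
            regs[dest] = inputs[src[0]]
        elif op == "add":
            regs[dest] = regs[src[0]] + regs[src[1]]
        elif op == "mul":
            regs[dest] = regs[src[0]] * regs[src[1]]
    return regs, set(regs)


three_regs = [
    ("R0", "load", "a"), ("R1", "load", "b"),
    ("R0", "add", "R0", "R1"),   # x = a + b goes to R0
    ("R1", "load", "c"), ("R2", "load", "d"),
    ("R1", "add", "R1", "R2"),   # y = c + d goes to R1
    ("R0", "mul", "R0", "R1"),   # e = x * y goes to R0
]
four_regs = [
    ("R0", "load", "a"), ("R1", "load", "b"),
    ("R2", "add", "R0", "R1"),   # x goes to R2
    ("R0", "load", "c"), ("R1", "load", "d"),
    ("R3", "add", "R0", "R1"),   # y goes to R3
    ("R3", "mul", "R2", "R3"),   # e goes to R3
]

values = {"a": 1, "b": 2, "c": 3, "d": 4}   # made-up operands
regs3, used3 = run(three_regs, values)
regs4, used4 = run(four_regs, values)
print(regs3["R0"], len(used3), regs4["R3"], len(used4))  # 21 3 21 4
```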

(Step S113)
The n rows of the kernel determined by the above procedure are stored in registers. The following processing is described on the assumption that n = 2 has been determined under the conditions of the embodiment described above.

FIGS. 7 and 8 are schematic diagrams showing the data storage state of the registers. FIG. 7(a) shows the state in which two rows of the kernel have been divided and stored in bn (= 6) registers in step S113. As shown in FIG. 3, the data of the two kernel rows are stored in the registers in groups of four consecutive values. In the figure, each register is labeled with a reference sign k00 to k12 corresponding to the data it stores. Of these, the third registers, k02 and k12, hold kernel data in only three of their four storage areas, leaving an empty area for one data item (see FIG. 3). This empty area is filled with zero.
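The packing of step S113 can be illustrated as below. This is a sketch under assumptions: `pack_rows` is a hypothetical helper, Python lists stand in for 128-bit registers, and the kernel values are made up.

```python
def pack_rows(rows, q):
    """Split each row into q-wide register images, zero-filling the tail."""
    packed = []
    for row in rows:
        padded = list(row) + [0] * (-len(row) % q)
        packed.append([padded[i:i + q] for i in range(0, len(padded), q)])
    return packed


# Two kernel rows of 11 made-up values each.
kernel_rows = [list(range(1, 12)), list(range(101, 112))]
regs = pack_rows(kernel_rows, q=4)
print(regs[0])  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 0]]
```

Here `regs[r][c]` plays the role of register k{r}{c} in FIG. 7(a); the trailing zero in the third register of each row is the filled empty area.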

(Step S114)
Next, zero padding is performed, in which the value 0 is inserted at the right edge and the bottom edge of the input pixels. The padding width is, for example, 1. The padding width is set as appropriate according to the required size of the output pixels.
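A minimal sketch of this padding, assuming the input is held as a list of rows; the helper name `zero_pad_right_bottom` is hypothetical.

```python
def zero_pad_right_bottom(image, width=1):
    """Append `width` zero columns on the right and zero rows at the bottom."""
    cols = len(image[0]) + width
    padded = [list(row) + [0] * width for row in image]
    padded += [[0] * cols for _ in range(width)]
    return padded


print(zero_pad_right_bottom([[1, 2], [3, 4]]))
# [[1, 2, 0], [3, 4, 0], [0, 0, 0]]
```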

(Step S115)
Starting from the top-left corner, 2 × 4 input pixels, corresponding to n (= 2) determined in step S112 and to the number of data items that one register can store (four), are stored in n (= 2) registers. FIG. 7(b) shows the state in which the 2 × 4 input pixels have been stored in two registers in step S115. In this figure as well, each register is labeled with a reference sign, d00 or d10, corresponding to the data it stores.

(Step S116)
A product-sum operation is performed on the data stored in the input pixel registers (d00, d10) and the data of the registers storing the left four kernel values (k00, k10), and the result S is stored in the register for intermediate buffer 1. In this product-sum operation (and likewise below), SIMD processing uses the four pixels of data held in each input register and kernel register (d00, k00) to process the convolution operation four pixels at a time in parallel. FIG. 7(c) shows the state in which the product-sum operation has been performed in step S116 and the result S has been stored in the register.
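The product-sum operation of step S116 might look as follows when each register is modeled as a four-element list. This is a sketch, not the patented implementation: it assumes that the result S is kept as four per-lane partial sums (the text does not state when the lanes are summed), and the function name and data values are made up.

```python
def multiply_accumulate(acc, data_regs, kernel_regs):
    """Return acc + sum of lane-wise products of the paired registers."""
    out = list(acc)
    for d, k in zip(data_regs, kernel_regs):
        for lane in range(len(out)):
            out[lane] += d[lane] * k[lane]
    return out


d00, d10 = [1, 2, 3, 4], [5, 6, 7, 8]   # made-up input pixels, two rows
k00, k10 = [1, 0, 1, 0], [2, 2, 2, 2]   # made-up left four kernel values
S = multiply_accumulate([0, 0, 0, 0], [d00, d10], [k00, k10])
print(S)  # [11, 12, 17, 16]
```

On a real SIMD unit each inner loop over lanes would correspond to a single multiply-add instruction on whole registers.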

(Step S121)
Next, after skipping four pixels (the number q of stored data items), the next 2 × 4 input pixels are stored in the input pixel registers. Here, for example, the data d01 and d11 (see FIG. 7(b)) are stored in the two input pixel registers, respectively.

(Step S122)
A product-sum operation is performed on the data stored in the input pixel registers and the data of the registers storing the left four kernel values, and the result A is stored in the register for intermediate buffer 2. FIG. 8(a) shows the state in which the product-sum operation has been performed in step S122 and the result A has been stored in the register.

(Step S123)
A product-sum operation is performed on the data stored in the input pixel registers and the data of the registers storing the central four kernel values, and the result B is added to the register for intermediate buffer 1 and stored. FIG. 8(b) shows the state in which the product-sum operation has been performed in step S123 and the result B has been stored in the register.

(Step S131)
Next, after skipping four pixels, the next 2 × 4 input pixels are stored in the input pixel registers. Here, for example, the data d02 and d12 (see FIG. 7(b)) are stored in the two input pixel registers, respectively.

(Step S132)
A product-sum operation is performed on the data stored in the input pixel registers and the data of the registers storing the left four kernel values, and the result C is stored in the register for intermediate buffer 3. FIG. 8(c) shows the state in which the product-sum operation has been performed in step S132 and the result C has been stored in the register.

(Step S133)
A product-sum operation is performed on the data stored in the input pixel registers and the data of the registers storing the central four kernel values, and the result D is added to the register for intermediate buffer 2 and stored. FIG. 8(d) shows the state in which the product-sum operation has been performed in step S133 and the result D has been stored in the register.

(Step S134)
A product-sum operation is performed on the data stored in the input pixel registers and the data of the registers storing the right four kernel values, and the result E is added to the register for intermediate buffer 1 and stored. FIG. 8(e) shows the state in which the product-sum operation has been performed in step S134 and the result E has been stored in the register. At this point the input pixel registers hold four data items, but the fourth one, that is, the 12th data item, which lies outside the window (see FIG. 5), is multiplied by the empty fourth kernel entry (value zero) before being added, so it does not affect the result of the product-sum operation.

(Step S141)
With the processing so far, the product-sum result data for two rows of one output pixel has been stored in the register for intermediate buffer 1, so the stored data is transferred to a predetermined address in the memory 30. If there are registers to spare, this transfer to the memory 30 may instead be performed once the data for two output pixels is ready, by going through step S143 (described later) and executing the processing from step S131 onward. This improves processing efficiency.

(Step S142)
Since the register for intermediate buffer 1 has served its purpose, the buffer data is slid and updated in the following order. First, the register for intermediate buffer 1 is updated with the data stored in the register for intermediate buffer 2. Next, the register for intermediate buffer 2 is updated with the data stored in the register for intermediate buffer 3. Finally, the register for intermediate buffer 3 is updated with zero.

(Step S143)
If the processing has not been completed up to the last column of the input pixels (after zero padding) (NO), the process returns to step S131, and the subsequent processing is executed with the next input pixels in the window, reached by skipping four pixels in step S131. If the processing has been completed up to the last column (YES), the process proceeds to step S144.
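The column loop of steps S115 to S143 can be sketched for one group of rows as follows. This is an illustration under stated assumptions, not the patented implementation: registers are modeled as lists of q values, the skip amount is assumed equal to q (4 in the example), the lanes of intermediate buffer 1 are assumed to be summed when the result is emitted in step S141, and all names and data are made up. A direct computation is included only to check the result.

```python
def mac(acc, data_regs, kernel_regs):
    """In-place lane-wise acc += d * k over the paired row registers."""
    for d, k in zip(data_regs, kernel_regs):
        for lane in range(len(acc)):
            acc[lane] += d[lane] * k[lane]


def convolve_row_group(rows, kernel_rows, q=4):
    """Outputs for one group of rows with b sliding intermediate buffers."""
    w = len(kernel_rows[0])
    b = -(-w // q)                                    # chunks per kernel row
    kpad = [list(r) + [0] * (b * q - w) for r in kernel_rows]
    kseg = [[r[s * q:(s + 1) * q] for r in kpad] for s in range(b)]
    buffers = [[0] * q for _ in range(b)]             # buffers 1..b
    outputs = []
    for t in range(len(rows[0]) // q):
        data = [r[t * q:(t + 1) * q] for r in rows]   # S115 / S121 / S131
        for s in range(min(t, b - 1) + 1):
            # Kernel chunk s meets input block t in the buffer that holds
            # the partial sum of output t - s.
            mac(buffers[min(t, b - 1) - s], data, kseg[s])
        if t >= b - 1:
            outputs.append(sum(buffers[0]))           # S141 (lane sum assumed)
            buffers = buffers[1:] + [[0] * q]         # S142: slide, zero last
    return outputs


def direct(rows, kernel_rows, q=4):
    """Reference: plain strided product-sum over the same rows."""
    w = len(kernel_rows[0])
    count = len(rows[0]) // q - (-(-w // q) - 1)
    return [sum(rows[r][q * p + c] * kernel_rows[r][c]
                for r in range(len(rows)) for c in range(w))
            for p in range(count)]


rows = [[(7 * i + 3) % 11 for i in range(20)],
        [(5 * i + 1) % 13 for i in range(20)]]        # made-up input rows
kernel = [[i + 1 for i in range(11)],
          [11 - i for i in range(11)]]                # made-up kernel rows
print(convolve_row_group(rows, kernel) == direct(rows, kernel))  # True
```

Each input block is loaded once and multiplied with every kernel chunk it overlaps, which is the reuse that the intermediate buffers make possible.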

(Step S144)
If the processing has not been completed up to the last row of the kernel, that is, up to the 11th row (NO), the process proceeds to step S145. If the processing up to the last row has been completed (YES), the calculation of one output pixel ends (END). Thereafter, the window is strided in the row direction and the calculation is repeated for all output pixels.

(Step S145)
The next two rows of the kernel are stored in the six kernel registers, overwriting their contents. For example, if rows 1 and 2 have been processed, the kernel data of rows 3 and 4 are stored in the registers. Because the rows are stored two at a time, on the sixth pass only the data of the final, 11th row is stored in the registers. In this case, the three registers for the unused row may all be filled with zeros before the product-sum calculation.
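The grouping of kernel rows in step S145 can be sketched as below. The generator name `kernel_row_groups` is hypothetical and the kernel values are made up; the zero-filled row follows the option mentioned in the text for the sixth pass.

```python
def kernel_row_groups(kernel, n):
    """Yield the kernel n rows at a time, zero-filling the last group."""
    width = len(kernel[0])
    for start in range(0, len(kernel), n):
        group = [list(r) for r in kernel[start:start + n]]
        group += [[0] * width for _ in range(n - len(group))]
        yield group


kernel = [[r * 11 + c for c in range(11)] for r in range(11)]  # made-up values
groups = list(kernel_row_groups(kernel, n=2))
print(len(groups), groups[5][1] == [0] * 11)  # 6 True
```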

(Step S146)
The 2 × 4 input pixels at the left end of the next two rows are stored in the input pixel registers. The process then proceeds to step S116 and executes the subsequent processing.

According to the method of this embodiment described above, the kernel division amount is determined, the kernel is divided by that amount, and the calculation is performed on the divided kernel. This makes it possible to perform CNN processing efficiently even under limited hardware resources with constrained power consumption, as in an embedded system. In particular, by keeping the registers that store the kernel fixed and updating the data in the intermediate buffer registers while shifting it, calculation results can be shared (see FIG. 5). This makes the CNN processing still more efficient.

(Other examples)
The configurations of the information processing apparatus and the processing method described above are the main configurations used to explain the features of the embodiment; they are not limited to the above and can be modified in various ways within the scope of the claims. Nor are the configurations of, and the processing executed by, a general information processing apparatus excluded.

(Modification 1)
In the embodiment described above, the division amount is determined even when a 5 × 5 kernel is used, and the kernel is divided and stored in the registers based on that amount. Depending on the kernel size, however, the kernel may instead be stored in the registers all at once without being divided. FIG. 9 is a flowchart according to the modification.

(Step S211)
Here, the kernel size used for the CNN processing is determined. If the kernel size is 5 × 5 or smaller, the process proceeds to step S212. If it exceeds 5 × 5, the process proceeds to step S111 in FIG. 6A.

(Step S212)
All of the kernel data is stored in registers. For example, for a 5 × 5 kernel, 64-bit registers are used and the entire kernel is stored in a total of 15 registers, three per row. The data in these registers is kept fixed until the processing ends.
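The branch of step S211 and the register count of step S212 can be sketched as follows. The function names are hypothetical, and the 32-bit data width is an assumption carried over from the 11 × 11 example.

```python
from math import ceil


def needs_division(w, threshold=5):
    """S211: kernels larger than threshold x threshold take the divided path."""
    return w > threshold


def whole_kernel_registers(w, register_bits, data_bits=32):
    """S212: return (q, registers per row, total) for an undivided kernel."""
    q = register_bits // data_bits    # data items per register
    per_row = ceil(w / q)
    return q, per_row, per_row * w


print(needs_division(5), whole_kernel_registers(w=5, register_bits=64))
# False (2, 3, 15)
```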

(Step S213)
Thereafter, by a known procedure, the data in the 5 × 5 window of the input pixels is stored in registers in sequence, and the convolution operation is executed by performing a product-sum operation with the data of the registers storing the kernel, with the convolution operations processed two pixels at a time in parallel.

(Step S214)
As soon as the convolution operation for one pixel is completed, the result is transferred to the memory and the process ends. Thereafter, the calculation is repeated for all output pixels.

Thus, in the modification, when the kernel is small, storing it in the registers all at once without dividing it reduces the number of load and store transfers to and from the registers, and as a result the CNN processing can be performed efficiently.

(Incorporation into an image processing apparatus)
The information processing apparatus 100 of this embodiment may be applied to an image processing apparatus 60 such as a surveillance camera. FIG. 10 is a block diagram showing the configuration of an action recognition system that recognizes the actions of people in captured video. As shown in the figure, the action recognition system includes an imaging device 50 and the image processing apparatus 60 according to this embodiment.

The imaging device 50 is an ordinary camera or a wide-angle camera, and generates image data by AD-converting the image signal produced by the camera's image sensor. The imaging device 50 can capture moving images in which frame-by-frame image data is generated continuously.

The image processing apparatus 60 includes an image acquisition unit 70 and the information processing apparatus 100 described above. The image acquisition unit 70 acquires the image data D1 of the moving image generated by the imaging device 50.

FIG. 11 is a diagram mainly showing the functional blocks of the information processing apparatus. The information processing apparatus 100 includes a human region detection unit 110, a human body feature extraction unit 120, a peripheral feature extraction unit 130, a peripheral feature filter unit 140, an action determination unit 150, and a learning unit 160.

The human region detection unit 110 detects, from the image of the image data D1, a delimited human region that contains a person. Any method may be used for this detection; for example, the unit detects a difference image from the moving image and detects the human region from that difference image. The human region detection unit 110 may also use other methods such as a trained neural network, template matching, a combination of HOG (Histograms of Oriented Gradients) features and an SVM (Support Vector Machine), or background subtraction.

The human body feature extraction unit 120 acquires the image data D1 and the data D2 indicating the human region from the human region detection unit 110, applies predetermined arithmetic processing to the image of the human region, and extracts the posture features of the person shown in the image. The human body feature extraction unit 120 then outputs the extracted posture feature data D3 to the peripheral feature filter unit 140 and the action determination unit 150.

The human body feature extraction unit 120 extracts the person's posture features from the image using, for example, trained CNN processing. The CNN processing constituting the human body feature extraction unit 120 is, for example, one trained with teacher data indicating the correspondence between images of a human body and the coordinates (two-dimensional positions or estimated three-dimensional positions) of the joint positions of the human body in those images (generally also called R-CNN).

The human body feature extraction unit 120 includes, for example, a preprocessing unit, a convolution processing unit, and first and second fully connected units. Based on the data D2 indicating the human region, the preprocessing unit normalizes the image, for example by cutting the human region out of the full image and converting it to a predetermined size and aspect ratio.

The convolution processing unit consists of a plurality of feature extraction layers connected hierarchically. In each feature extraction layer, the convolution processing unit applies the convolution operation processing described above, activation processing, and pooling processing to the input data received from the preceding layer.

The first fully connected unit is, for example, a multilayer perceptron that fully connects a plurality of feature values. The first fully connected unit fully connects the plurality of intermediate calculation results obtained from the convolution processing unit and generates the data D3 indicating the person's posture features. It then outputs the data D3 to the second fully connected unit and the peripheral feature filter unit 140.

The peripheral feature extraction unit 130 acquires the image data D1 and the data D2 indicating the human region from the human region detection unit 110, applies predetermined arithmetic processing to the image around the human region, and extracts the peripheral features of objects around the person shown in the image. The peripheral feature extraction unit 130 then outputs the extracted peripheral feature data D4 to the peripheral feature filter unit 140.

Like the human body feature extraction unit 120, the peripheral feature extraction unit 130 extracts the peripheral features from the image using, for example, CNN processing. The CNN processing constituting the peripheral feature extraction unit 130 is, for example, one trained with teacher data indicating the correspondence between an image of an object and that object's shape, type, or the positions of its parts. More preferably, it is one trained with teacher data indicating the correspondence between images of the surroundings of a human body, including the human body, and the coordinates of the shapes and positional relationships of the objects in those images.

The peripheral feature filter unit 140 acquires the posture feature data D3 from the human region detection unit 110 and the peripheral feature data D4 from the peripheral feature extraction unit 130. The peripheral feature filter unit 140 then filters the peripheral features based on the data Da on the importance of the peripheral features, which is set in association with the posture features, and outputs the filtered peripheral feature data D4a to the action determination unit 150.

The action determination unit 150 acquires the posture feature data D3 from the human body feature extraction unit 120 and the filtered peripheral feature data D4a from the peripheral feature filter unit 140. Based on the time-series data of the posture feature data D3 and the filtered peripheral feature data D4a, the action determination unit 150 determines the action class of the person shown in the image. The action determination unit 150 according to this embodiment may perform the time-series analysis using a hierarchical LSTM (Long Short-Term Memory), which is a kind of recurrent neural network. A hierarchical LSTM can recognize relationships over long time intervals (for example, one minute earlier) in addition to relationships over short time intervals (for example, the immediately preceding image frame).
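The data flow among units 110 to 150 described above can be summarized in a short sketch. Everything here is a stand-in: `recognize_action`, the dictionary keys, and the callables are hypothetical, and the sketch follows the flow in which the posture feature data D3 is produced by the human body feature extraction unit 120.

```python
def recognize_action(frame, units, history):
    """Pass one frame through stand-ins for units 110 to 150."""
    d1 = frame                                            # image data D1
    d2 = units["human_region_detection"](d1)              # 110: human region D2
    d3 = units["human_body_feature_extraction"](d1, d2)   # 120: posture features D3
    d4 = units["peripheral_feature_extraction"](d1, d2)   # 130: peripheral features D4
    d4a = units["peripheral_feature_filter"](d3, d4)      # 140: filtered features D4a
    history.append((d3, d4a))                             # time series of D3 and D4a
    return units["action_determination"](history)         # 150: action class
```

In this arrangement the learning unit 160 would adjust the parameters inside each callable; it is not part of the per-frame path.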

The learning unit 160 performs machine learning using teacher data so that the human body feature extraction unit 120, the peripheral feature extraction unit 130, the peripheral feature filter unit 140, and the action determination unit 150 can execute the processing described above.

For example, the learning unit 160 adjusts the network parameters (for example, weighting coefficients and biases) of the convolution processing unit and the first and second fully connected units of the human body feature extraction unit 120, using teacher data in which normalized images of human regions are associated with posture features (for example, joint positions).

The learning unit 160 also adjusts, for example, the network parameters (for example, weighting coefficients and biases) of the convolution processing unit and the fully connected unit of the peripheral feature extraction unit 130, using teacher data in which normalized images of objects around a person are associated with features of those objects (for example, their positional relationship to the human body).

The learning unit 160 also adjusts, for example, the network parameters (for example, weighting coefficients and biases) of the fully connected unit of the peripheral feature filter unit 140, using teacher data in which posture features are associated with the importance of peripheral objects.

The learning unit 160 also adjusts, for example, the network parameters (for example, weighting coefficients and biases) of the intermediate layer 61 and the fully connected unit of the action determination unit 150, using teacher data in which time-series data of posture features and peripheral features is associated with the correct action class.

The learning unit 160 may perform these learning processes using, for example, the known error backpropagation method. The learning unit 160 then stores the network parameters adjusted by the learning process in an external storage unit or the like.

As described above, the image processing apparatus 60 according to this embodiment sets the importance of the peripheral features in association with the posture features of the human body and filters the peripheral features, which makes it possible to extract only the peripheral objects related to the person's action. As a result, the image processing apparatus 60 can estimate a person's action class with high accuracy even in environments where the types, positions, or appearances of the peripheral objects vary widely.

In particular, the image processing apparatus 60 incorporating the information processing apparatus 100 according to this embodiment extracts, based on the time-series data of the posture features and the peripheral features, the temporal change in the positional relationship between the posture of the human body and the related peripheral objects, and estimates the action class from it, so it can estimate the action class with still higher accuracy.

The program for operating the information processing apparatus described above may be provided on a computer-readable recording medium such as a USB memory, a flexible disk, or a CD-ROM, or may be provided online via a network such as the Internet. In this case, the program code recorded on the computer-readable recording medium may be written in assembly language or machine language.

100 Information processing apparatus
10 SIMD processor
11 SIMD register file
12 ALU
13 MUL
14 Shifter
15 LS
20 General-purpose processor
21 Scalar register file
22 ALU
23 MUL
24 LS
30 Memory
60 Image processing apparatus

Claims

A processing method for controlling an information processing apparatus comprising a plurality of registers usable for SIMD and a processor that performs SIMD processing, the method including:
(a) selecting, from a kernel size, a size of the registers to be used;
(b) determining a kernel division amount from the number of registers of the selected size and the kernel size; and
(c) storing the kernel in the registers in units divided by the determined kernel division amount, and processing convolution operations in parallel.

The processing method according to claim 1, wherein, when the number of kernel rows stored in the registers at one time is n, n is set in step (b) to the maximum value satisfying the following formula:
k − (x + j) ≧ (b + 1) n
where j = (w + 1 − y) / b and b = w / q, with j and b each rounded up to the next integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers necessary for the convolution operation
y: number of pixels to be skipped.

3. The processing method according to claim 1, wherein, when the kernel size is 11 × 11, step (b) determines that the kernel is to be divided into units of two lines, and step (c) stores the kernel in the registers two lines at a time and, after each store, executes the convolution operation for those two lines.
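Processing a kernel a few lines at a time gives the same result as processing it whole, provided the partial sums are accumulated. The following pure-Python sketch illustrates this under stated assumptions: the function name `conv2d_in_row_chunks` is hypothetical, the output-sized accumulator stands in for an intermediate buffer, and ordinary loops stand in for SIMD registers and instructions.

```python
def conv2d_in_row_chunks(image, kernel, rows_per_chunk=2, stride=1):
    """Valid-mode CNN convolution (cross-correlation), a few kernel rows at a time.

    image and kernel are lists of rows; stride is the step between outputs.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    # Accumulator for partial sums (assumed stand-in for an intermediate buffer).
    acc = [[0] * out_w for _ in range(out_h)]
    for r0 in range(0, kh, rows_per_chunk):
        # Stand-in for loading the next group of kernel rows into registers.
        chunk = kernel[r0:r0 + rows_per_chunk]
        for oy in range(out_h):
            for ox in range(out_w):
                partial = 0
                for dr, kernel_row in enumerate(chunk):
                    image_row = image[oy * stride + r0 + dr]
                    for dc, weight in enumerate(kernel_row):
                        partial += weight * image_row[ox * stride + dc]
                # Add this chunk's contribution to the running result.
                acc[oy][ox] += partial
    return acc
```

For an 11 × 11 kernel and `rows_per_chunk=2`, the loop visits five chunks of two rows and a final chunk of one row. How the last, shorter chunk is handled in an actual implementation is not stated in the claim and is an assumption here.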

4. The processing method according to claim 3, wherein, when the kernel size is 11 × 11, step (a) selects a 128-bit register, and step (c) stores four input pixels in the 128-bit register and processes the convolution operations for the four pixels in parallel.

5. The processing method according to claim 1, wherein, when the kernel size is 5 × 5 or less, step (b) determines that the kernel is not to be divided, and step (c) stores the whole kernel in the registers at one time and executes the convolution operation after the kernel is stored.

6. The processing method according to claim 5, wherein, when the kernel size is 5 × 5 or less, step (a) selects a 64-bit register, and step (c) stores two input pixels in the 64-bit register and processes the convolution operations two pixels at a time in parallel.
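The two register choices named in the claims (128 bits holding four pixels, 64 bits holding two) both imply 32 bits per pixel. The sketch below illustrates register selection and lane-parallel multiply-accumulate on that basis. It is a sketch only: the names `select_register_bits`, `lanes`, and `row_mac` are hypothetical, the rule for kernel sizes other than 11 × 11 and 5 × 5 or less is an assumption, and the inner loop over lanes stands in for a single SIMD instruction.

```python
ELEMENT_BITS = 32  # assumption: 128 bits / 4 pixels = 64 bits / 2 pixels


def select_register_bits(kernel_size):
    """Hypothetical rule: 64-bit for kernels up to 5 x 5, 128-bit otherwise.

    Only the 11 x 11 and 5 x 5-or-less cases are taken from the claims.
    """
    return 64 if kernel_size <= 5 else 128


def lanes(register_bits):
    """Number of pixels held in one register of the given width."""
    return register_bits // ELEMENT_BITS


def row_mac(pixels, weights, register_bits, stride=1):
    """Multiply-accumulate one kernel row over one image row, `lanes` outputs per step."""
    n = lanes(register_bits)
    out_len = (len(pixels) - len(weights)) // stride + 1
    out = [0] * out_len
    for base in range(0, out_len, n):
        width = min(n, out_len - base)      # the last group may be partial
        acc = [0] * width                   # models one accumulator register
        for t, weight in enumerate(weights):
            # Each weight is applied to every lane; on SIMD hardware this
            # lane loop would be one vector multiply-accumulate.
            for lane in range(width):
                acc[lane] += weight * pixels[(base + lane) * stride + t]
        out[base:base + width] = acc
    return out
```

Calling `row_mac` with `register_bits=128` computes four output pixels per step, and with `register_bits=64` two per step; the numerical result is the same either way, only the grouping differs.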

7. The register that stores the kernel, among the plurality of registers, is fixed, and the result data stored in the intermediate buffer is updated while the data stored in the registers for the input pixels is slid. The processing method as described in any one of.

8. A program for controlling an information processing apparatus comprising a plurality of registers usable for SIMD and a processor that performs SIMD processing, the program causing the apparatus to perform a method including:
(a) selecting a size of the register to be used based on a kernel size;
(b) determining a kernel division amount from the number of registers of the selected size and the kernel size; and
(c) storing each portion of the kernel, divided by the determined kernel division amount, in the registers, and processing convolution operations in parallel.

9. The program according to claim 8, wherein, when the number of kernel rows stored in the registers at one time is n, step (b) sets n to the maximum value satisfying the following formula:
k − (x + j) ≥ (b + 1)n
where j = (w + 1 − y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers necessary for the convolution operation
y: number of pixels to be skipped.

10. The program according to claim 8, wherein, when the kernel size is 11 × 11, step (b) determines that the kernel is to be divided into units of two lines, and step (c) stores the kernel in the registers two lines at a time and, after each store, executes the convolution operation for those two lines.

11. The program according to claim 10, wherein, when the kernel size is 11 × 11, step (a) selects a 128-bit register, and step (c) stores four input pixels in the 128-bit register and processes the convolution operations for the four pixels in parallel.

12. The program according to any one of claims 8, 10 and 11, wherein, when the kernel size is 5 × 5 or less, step (b) determines that the kernel is not to be divided, and step (c) stores the whole kernel in the registers at one time and executes the convolution operation after the kernel is stored.

13. The program according to claim 12, wherein, when the kernel size is 5 × 5 or less, step (a) selects a 64-bit register, and step (c) stores two input pixels in the 64-bit register and processes the convolution operations two pixels at a time in parallel.

14. The register that stores the kernel, among the plurality of registers, is fixed, and the result data stored in the intermediate buffer is updated while the data stored in the registers for the input pixels is slid. The program as described in any one of.

15. An information processing apparatus including a plurality of registers usable for SIMD and a processor that performs SIMD processing, wherein the apparatus selects a size of the register to be used based on a kernel size, determines a kernel division amount from the number of registers of the selected size and the kernel size, stores each portion of the kernel, divided by the determined kernel division amount, in the registers, and processes convolution operations in parallel.

16. The information processing apparatus according to claim 15, wherein, when the number of kernel rows that are divided by the determined division amount and stored in the registers at one time is n, n is set to the maximum value satisfying the following formula:
k − (x + j) ≥ (b + 1)n
where j = (w + 1 − y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers necessary for the convolution operation
y: number of pixels to be skipped.

17. The register that stores the kernel, among the plurality of registers, is fixed, and the result data stored in the intermediate buffer is updated while the data stored in the registers for the input pixels is slid. The information processing apparatus described in 1.

18. An image processing apparatus comprising:
an image acquisition unit that acquires an image generated by an imaging device; and
the information processing apparatus according to any one of claims 15 to 17, which extracts a feature of a person shown in the image and a peripheral feature indicating a shape, a position, or a type of an object around the person shown in the image.