JP2017058738A

JP2017058738A - Information processing apparatus and image forming apparatus

Info

Publication number: JP2017058738A
Application number: JP2015180603A
Authority: JP
Inventors: 俊治綱島; Toshiharu Tsunashima
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2015-09-14
Filing date: 2015-09-14
Publication date: 2017-03-23
Anticipated expiration: 2035-09-14
Also published as: JP6701650B2

Abstract

PROBLEM TO BE SOLVED: To reduce the time needed to preload data stored in an external memory in an information processing apparatus comprising a plurality of cores.
SOLUTION: An image forming apparatus comprises a core 101 and a core 102, an L2 cache memory 121, and a main storage device 30 (external memory) that stores data. The core 101 includes first request means that requests first partial data, which is part of the data stored in the external memory, and the core 102 includes second request means that requests second partial data different from the first partial data.
SELECTED DRAWING: Figure 5

Description

The present invention relates to an information processing apparatus and an image forming apparatus.

In an information processing apparatus such as an image processing apparatus, processing is performed using data read from an external memory. However, because the core stalls (waits) during a memory access, frequent access to the external memory hinders faster processing. For example, Patent Document 1 describes a technique in which data in an external memory is copied (preloaded) in advance into a cache memory inside the processor, and the processor core reads the data from the cache memory to perform processing.

Patent Document 1: JP 2011-223145 A

In recent years, processors have become multi-core. Multi-core processors aim for higher speed by processing in parallel, but they have not been able to shorten the time needed to preload data stored in an external memory.

The present invention provides a technique for shortening the time needed to preload data stored in an external memory in an information processing apparatus having a plurality of cores.

The present invention provides an information processing apparatus comprising a first core, a second core that performs parallel processing with the first core, a cache memory shared by the first core and the second core, and an external memory storing data. The first core has first request means for requesting first partial data, which is a part of the data. The second core has second request means for requesting second partial data, which is a part of the data different from the first partial data. The cache memory has reading means for reading the first partial data and the second partial data from the external memory in response to the request from the first request means and the request from the second request means. The first core and the second core each perform processing using at least part of the first partial data and the second partial data stored in the cache memory.

The information processing apparatus may have N cores including the first core and the second core, and allocation means for dividing the data equally into N pieces of partial data, each consisting only of a portion with contiguous addresses, and allocating each piece of partial data to one of the N cores.

The information processing apparatus may have N cores including the first core and the second core, and allocation means for dividing the data equally into N pieces of partial data, each including portions with non-contiguous addresses, and allocating each piece of partial data to one of the N cores.

The external memory may include a DRAM, each of the N pieces of partial data may include a portion with contiguous addresses, and the data size of that contiguous-address portion may be no larger than the amount of data read per access when the reading means reads data from the DRAM.

The processing in the first core and the second core may be processing that converts an index corresponding to a pixel into a pixel value, and the data may be a table for converting the index into the pixel value.

The present invention also provides an information processing apparatus comprising a first core, a second core that performs parallel processing with the first core, a first cache memory dedicated to the first core, a second cache memory dedicated to the second core, a cache memory shared by the first core and the second core, and an external memory storing data. The first core has first request means for requesting first partial data, which is a part of the data. The second core has second request means for requesting second partial data, which is a part of the data different from the first partial data. The first cache memory has first acquisition means for acquiring the first partial data from the external memory in response to the request from the first request means. The second cache memory has second acquisition means for acquiring the second partial data from the external memory in response to the request from the second request means. The first core performs processing using the first partial data stored in the first cache memory, and the second core performs processing using the second partial data stored in the second cache memory.

Furthermore, the present invention provides an image forming apparatus comprising any one of the information processing apparatuses described above and image forming means for forming an image according to the result of the processing by the first core and the second core.

According to the information processing apparatus of claim 1, the time for preloading the data stored in the external memory can be shortened compared with the case where the first core and the second core each preload the entire data stored in the external memory.
According to the information processing apparatus of claim 2, the number of data requests from each core can be reduced to 1/N compared with the case where the first core and the second core each preload the entire data stored in the external memory.
According to the information processing apparatus of claim 3, the time for reading the data from the external memory can be shortened compared with the case where the data is simply divided into N equal parts.
According to the information processing apparatus of claim 4, the number of accesses to the external memory can be reduced compared with the case where the contiguous-address portions are fragmented into small data sizes.
According to the information processing apparatus of claim 5, the time for preloading a table used in image processing that converts an index into a pixel value can be shortened.
According to the information processing apparatus of claim 6, the time for preloading the data stored in the external memory can be shortened compared with the case where the first core and the second core each preload the entire data stored in the external memory.
According to the image forming apparatus of claim 7, the time for preloading the data stored in the external memory can be shortened compared with the case where the first core and the second core each preload the entire data stored in the external memory.

FIG. 1 is a diagram illustrating the cache memory configuration of a CPU 90 according to the related art.
FIG. 2 is a sequence chart illustrating preload processing by a single core.
FIG. 3 is a sequence chart illustrating preload processing by a plurality of cores.
FIG. 4 is a diagram illustrating the configuration of an image forming apparatus 1 according to an embodiment.
FIG. 5 is a diagram illustrating the functional configuration of the image forming apparatus 1 relating to the preloading of data.
FIG. 6 is a flowchart illustrating image processing in the image forming apparatus 1.
FIG. 7 is a schematic diagram showing an outline of LUT division.
FIG. 8 is a diagram showing a specific example of LUT division.
FIG. 9 is a diagram showing another specific example of LUT division.
FIG. 10 is a sequence chart illustrating preload processing in the image forming apparatus 1.
FIG. 11 is a schematic diagram showing an outline of LUT division according to a modification.

1. Overview
First, consider the following image processing as an example. An index value is calculated from a pixel value of the input image. The entry value indicated by the index value is obtained from a lookup table. An output pixel value is calculated from the obtained entry value. One way to speed up such image processing is to use a so-called multi-core CPU and have different cores process the pixels of different regions (for example, odd rows and even rows) in parallel.

Specifically, each core performs the following processing: (1) reading an input pixel, (2) calculating an index value, (3) reading an entry value from the lookup table, (4) calculating an output pixel value, and (5) storing the output pixel value. Of these, processes (1), (3), and (5) involve access to an external memory. An external memory is a memory formed on a chip different from the CPU; the main storage device (main memory) of a computer is one example. While the external memory is being accessed, instruction execution in the core stalls (enters a wait state). Because access to the external memory is relatively slow, frequent access to it hinders faster processing.
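The five per-pixel steps listed above can be sketched as follows. This is only an illustrative sketch: the function name `convert_pixels` and the callables `index_of` and `output_of` are hypothetical placeholders, since the text does not specify how the index or the output pixel value is calculated.

```python
def convert_pixels(input_pixels, lut, index_of, output_of):
    """Convert pixels through a lookup table, following steps (1) to (5).

    index_of and output_of are hypothetical stand-ins for the index and
    output calculations, which the text leaves unspecified.
    """
    output_pixels = []
    for pixel in input_pixels:           # (1) read an input pixel
        index = index_of(pixel)          # (2) calculate the index value
        entry = lut[index]               # (3) read the entry value from the LUT
        out = output_of(entry)           # (4) calculate the output pixel value
        output_pixels.append(out)        # (5) store the output pixel value
    return output_pixels


# Example with an inverting table and identity calculations (assumed values).
inverting_lut = [255 - i for i in range(256)]
result = convert_pixels([0, 1, 255], inverting_lut,
                        index_of=lambda p: p, output_of=lambda e: e)
```

In this sketch, step (3) is the table read that preloading the LUT into a cache is meant to keep away from the external memory.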

To address this problem, a technique is known in which a lookup table stored in the external memory is copied, that is, preloaded, into a cache memory before the image processing. Preloading the LUT (Look Up Table) into the cache memory eliminates the access to the external memory in process (3).

FIG. 1 is a diagram illustrating the cache memory configuration of a CPU 90 according to the related art. The CPU 90 has a plurality of cores, in this example the four cores 901 to 904. Here, a "core" of a processor is the part of the processor that executes instructions and performs operations. The CPU 90 further has cache memories 911 to 914 and a cache memory 921. The cache memories 911 to 914 are primary caches (so-called L1 caches) and are dedicated to the cores 901 to 904, respectively. The cache memory 921 is a secondary cache (so-called L2 cache) and is shared by the cores 901 to 904. Although the term "core" is sometimes used to include the L1 cache, here a "core" does not include the L1 cache.

A primary cache is the cache memory that the core accesses with the highest priority, and a secondary cache is the cache memory accessed with the next priority after the primary cache. The primary cache is faster and smaller than the secondary cache. When an access request to the main memory (external memory) occurs, the core first checks whether the data at the target address is stored in the primary cache. If the data at the target address (hereinafter simply the "target data") is stored in the primary cache, the core reads the data from the primary cache. The target data being present in a cache memory is called a "hit", and the proportion of accesses that hit is called the "hit rate". If the target data is not stored in the primary cache, the core checks whether it is stored in the secondary cache. If the target data is stored in the secondary cache, the core reads the data from the secondary cache. If the target data is not stored in the secondary cache, the core reads the data from the main storage device 30, which is the external memory.

When reading data from an address in the memory space, the core first issues a read request to its dedicated L1 cache. The L1 cache checks whether the data at the specified address is stored in the L1 cache. If it is, the L1 cache outputs that data to the requesting core. If it is not, the L1 cache issues a read request to the L2 cache. The L2 cache checks whether the data at the specified address is stored in the L2 cache. If it is, the L2 cache outputs that data to the requesting L1 cache. If it is not, the L2 cache issues a read request to the main storage device 30. On receiving the read request, the main storage device 30 outputs the requested data to the L2 cache. The L2 cache stores the data read from the main storage device 30 and also outputs it to the L1 cache that requested it.
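The request chain described above (core to L1 cache, L1 cache to L2 cache, L2 cache to main storage device) can be modeled with a small simulation. This is a simplified sketch, not the hardware behavior: the class name `CacheLevel` is hypothetical, capacity limits and eviction are ignored, and the L1 cache is assumed to keep a copy of whatever it forwards.

```python
class CacheLevel:
    """One cache level that forwards misses to the next level down.

    Assumptions for illustration: unlimited capacity, no eviction, and one
    stored value per address.
    """

    def __init__(self, name, backing):
        self.name = name
        self.backing = backing   # another CacheLevel, or a dict as main memory
        self.stored = {}
        self.misses = 0

    def read(self, address):
        if address in self.stored:            # hit: answer from this level
            return self.stored[address]
        self.misses += 1                      # miss: ask the next level
        if isinstance(self.backing, CacheLevel):
            value = self.backing.read(address)
        else:
            value = self.backing[address]     # read from main memory
        self.stored[address] = value          # keep a copy at this level
        return value


main_memory = {k: k * 10 for k in range(4)}   # made-up contents
l2 = CacheLevel("L2", main_memory)
l1_core1 = CacheLevel("L1#1", l2)
l1_core2 = CacheLevel("L1#2", l2)

first = l1_core1.read(0)    # misses L1#1 and L2, served by main memory
second = l1_core2.read(0)   # misses L1#2 but hits the shared L2
```

After these two reads the shared L2 level has recorded a single miss, which corresponds to the single external-memory read that both cores end up sharing.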

Next, preloading of the LUT using the CPU 90 of FIG. 1 is described. Before describing LUT preloading by multiple cores, LUT preloading by a single core is described first. The example here assumes that the data size of the LUT is larger than the storage capacity of one L1 cache and smaller than the storage capacity of the L2 cache.

FIG. 2 is a sequence chart illustrating related-art preload processing of the LUT by a single core (core #1, for example the core 901 in FIG. 1). In the following, the data (entry value) at address k of the LUT is written P[k]. The L1 cache corresponding to core #1 is written L1 cache #1 (L1#1 in the drawings). In this example, no LUT data is stored in the L1 cache or the L2 cache before the flow of FIG. 2 starts.

First, core #1 requests L1 cache #1 to read P[0] (step S801). L1 cache #1 requests the L2 cache to read P[0] (step S802). The L2 cache requests the external memory (main storage device 30) to read P[0] (step S803). The external memory outputs P[0], out of the data it stores, to the L2 cache (step S804). The L2 cache outputs P[0] to L1 cache #1 (step S805). L1 cache #1 outputs P[0] to core #1 (step S806).

Next, core #1 requests L1 cache #1 to read P[1] (step S807). L1 cache #1 requests the L2 cache to read P[1] (step S808). The L2 cache requests the external memory to read P[1] (step S809). The external memory outputs P[1], out of the data it stores, to the L2 cache (step S810). The L2 cache outputs P[1] to L1 cache #1 (step S811). L1 cache #1 outputs P[1] to core #1 (step S812).

The data from P[2] onward is processed in the same way. By preloading the LUT data sequentially in this manner, the LUT data is stored in the L2 cache.

FIG. 3 is a sequence chart illustrating related-art preload processing of the LUT by a plurality of cores (core #1 and core #2, for example the cores 901 and 902 in FIG. 1). In this example, each of the cores performs the preload in parallel. The L1 cache corresponding to core #2 is written L1 cache #2 (L1#2 in the drawings). In this example, no LUT data is stored in the L1 caches or the L2 cache before the flow of FIG. 3 starts.

First, core #1 requests L1 cache #1 to read P[0] (step S901). L1 cache #1 requests the L2 cache to read P[0] (step S902). The L2 cache requests the external memory (main storage device 30) to read P[0] (step S903). The external memory outputs P[0], out of the data it stores, to the L2 cache (step S906).

Core #2 requests L1 cache #2 to read P[0] (step S904). L1 cache #2 requests the L2 cache to read P[0] (step S905). Steps S904 to S905 by core #2 are performed in parallel with steps S901 to S902 by core #1, but for convenience they are described here as if they were performed after steps S901 to S902.

The L2 cache outputs P[0] to L1 cache #1 (step S907). L1 cache #1 outputs P[0] to core #1 (step S908). The L2 cache also outputs P[0] to L1 cache #2 (step S909). L1 cache #2 outputs P[0] to core #2 (step S910). This completes the preloading of P[0].

Next, core #1 requests L1 cache #1 to read P[1] (step S911). L1 cache #1 requests the L2 cache to read P[1] (step S912). The L2 cache requests the external memory (main storage device 30) to read P[1] (step S913). The external memory outputs P[1], out of the data it stores, to the L2 cache (step S916).

Core #2 requests L1 cache #2 to read P[1] (step S914). L1 cache #2 requests the L2 cache to read P[1] (step S915). Steps S914 to S915 by core #2 are performed in parallel with steps S911 to S912 by core #1, but for convenience they are described here as if they were performed after steps S911 to S912.

The L2 cache outputs P[1] to L1 cache #1 (step S917). L1 cache #1 outputs P[1] to core #1 (step S918). The L2 cache also outputs P[1] to L1 cache #2 (step S919). L1 cache #2 outputs P[1] to core #2 (step S920). This completes the preloading of P[1].

Comparing the processing of FIG. 3 with that of FIG. 2, the time required to preload P[0] and P[1] is no different from FIG. 2 even though multiple cores are used. This means that the processing of FIG. 3 does not bring out the performance of the multi-core configuration as far as data preloading is concerned. The present embodiment provides a technique for shortening the time required for preloading.
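The comparison between FIG. 2 and FIG. 3 can be illustrated with a rough counting model. The sketch below rests on assumptions that are not stated in the text: every core issues one request per round and waits for the data before issuing the next, and requests issued by different cores in the same round are served within that round. The function name `preload_rounds` is hypothetical.

```python
def preload_rounds(requests_per_core):
    """Return (rounds, external_reads) for a lockstep request model.

    requests_per_core holds one list of entry addresses per core, in the
    order that core requests them. A request for an address already held
    in the shared cache does not cause another external-memory read.
    """
    cached = set()
    external_reads = 0
    rounds = max(len(requests) for requests in requests_per_core)
    for step in range(rounds):
        wanted = {requests[step] for requests in requests_per_core
                  if step < len(requests)}
        new_addresses = wanted - cached
        external_reads += len(new_addresses)
        cached |= new_addresses
    return rounds, external_reads


K = 8  # illustrative table size
single_core = preload_rounds([list(range(K))])                  # FIG. 2 style
both_whole = preload_rounds([list(range(K)), list(range(K))])   # FIG. 3 style
split_halves = preload_rounds([list(range(K // 2)),
                               list(range(K // 2, K))])
```

Under this model, two cores that both walk the whole table need as many rounds as a single core, whereas cores that request different halves finish in half the rounds while the external memory is read the same number of times. The last case corresponds to the idea summarized in the abstract, where each core requests different partial data.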

2. Configuration
FIG. 4 is a diagram illustrating the configuration of an image forming apparatus 1 according to an embodiment. The image forming apparatus 1 is an example of an information processing apparatus having an image forming function, for example a so-called multifunction machine. The image forming apparatus 1 has a CPU 10, a memory controller 20, a main storage device (main memory) 30, an IO controller 40, an auxiliary storage device 41, an image reading unit 42, an image forming unit 43, and a communication unit 44.

The CPU 10 is a control device that controls each part of the image forming apparatus 1, and is an example of processing means including N cores (N is a natural number of 2 or more), each of which executes a different process. In this example, N = 4. The CPU 10 has cores 101 to 104, cache memories 111 to 114, and cache memories 121 and 122. The cache memories 111 to 114 are primary caches (L1 caches) and are dedicated to the cores 101 to 104, respectively. The cache memories 121 and 122 are secondary caches (L2 caches). The cache memory 121 is shared by the cores 101 and 102, and the cache memory 122 is shared by the cores 103 and 104.

The memory controller 20 controls the reading and writing of data to and from the main storage device 30. The main storage device 30 is the main memory and includes, for example, a DRAM (Dynamic Random Access Memory). The main storage device 30 functions as a work area when the CPU 10 executes a program, and is an example of storage means that stores various data.

The IO controller 40 is a device that connects peripheral devices to the CPU 10 and controls them. In this example, the auxiliary storage device 41, the image reading unit 42, the image forming unit 43, and the communication unit 44 are connected to the IO controller 40. The auxiliary storage device 41 is a non-volatile storage device that stores data and programs, and includes, for example, an HDD (Hard Disk Drive). The image reading unit 42 is a device that optically reads a document, and includes, for example, a so-called scanner. The image forming unit 43 is a device that forms an image on a medium (for example, paper), using, for example, electrophotographic or inkjet technology. The communication unit 44 is an interface for communicating with other devices.

FIG. 5 is a diagram illustrating the functional configuration of the image forming apparatus 1 relating to the preloading of data from the external memory. The auxiliary storage device 41 stores a program for running the OS (Operating System) of the image forming apparatus 1 (hereinafter the "OS program"). When the CPU 10 executes the OS program, an OS 50 is implemented on the image forming apparatus 1.

The OS 50 has allocation means 51. The allocation means 51 divides the data to be preloaded (the LUT in this example) into N pieces of data. Each piece of divided data is called "partial data". The allocation means 51 further allocates each piece of partial data to one of the N cores. The cores 101 to 104 each have request means. For example, the request means of the core 101 (an example of the first request means) requests the cache memory to read one of the N pieces of partial data (an example of the first partial data). The request means of the core 102 (an example of the second request means) requests the cache memory to read another of the N pieces of partial data (an example of the second partial data). The L1 caches are omitted from FIG. 5.

The cache memory 121 has reading means 1211. The reading means 1211 reads data from the main storage device 30 in response to a request from a core. Through the reading means 1211, the partial data requested by the cores 101 to 104 is stored in the cache memory 121. The request means of the cores 101 to 104 is part of the functions of the OS. That is, each core executing the OS program is an example of the request means. The cache memory 121 also has a controller (not shown) that controls the reading of data. This controller is an example of the reading means.

3. Operation
FIG. 6 is a flowchart illustrating image processing in the image forming apparatus 1. The flow of FIG. 6 starts, for example, when an application program instructs preloading of the LUT. In the following description, software such as the OS 50 is sometimes described as the agent of processing; this means that the CPU 10 executing that software performs the processing in cooperation with other hardware resources.

In step S100, the OS 50 generates a plurality of threads. Here, a "thread" is a unit of processing in a program. These threads include a process of allocating the partial data obtained by dividing the LUT to the cores, a process of having each core request reading of its partial data, a process of dividing the input image and allocating the divided images to the cores, and a process of having each core convert the index values of its allocated partial image into output pixel values.

FIG. 7 is a schematic diagram showing an outline of LUT division. Because N = 4 in the example of FIG. 4, the LUT is divided into four pieces of partial data. In this example, the LUT is divided into four equal parts. That is, the four pieces of partial data have the same data size and do not overlap one another.

FIG. 8 is a diagram illustrating a specific example of the division of the LUT. In this example, the LUT contains K entry values, P[0] to P[K-1]. The LUT is divided into four pieces of partial data (hereinafter "partial data #1 to #4"), each consisting only of a portion with consecutive addresses. For example, partial data #1 contains the K/4 entry values P[0] to P[K/4-1], partial data #2 contains the K/4 entry values P[K/4] to P[2K/4-1], partial data #3 contains the K/4 entry values P[2K/4] to P[3K/4-1], and partial data #4 contains the K/4 entry values P[3K/4] to P[K-1].
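The contiguous division of FIG. 8 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `contiguous_partition` and the value K = 1024 are assumptions chosen for the example, and the sketch assumes K is divisible by N.

```python
def contiguous_partition(k, n):
    """Split entry indices 0..k-1 into n equal, contiguous, non-overlapping ranges.

    Hypothetical helper: assumes k is divisible by n, as in the K/4 example.
    """
    if k % n != 0:
        raise ValueError("this sketch assumes K is divisible by N")
    size = k // n
    return [range(i * size, (i + 1) * size) for i in range(n)]


# Example with an assumed table size K = 1024 and N = 4 cores.
parts = contiguous_partition(1024, 4)
bounds = [(p[0], p[-1]) for p in parts]  # first and last index of each part
```

With these assumed values, partial data #1 covers P[0] to P[255], which corresponds to P[0] to P[K/4-1] in the text.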

FIG. 9 is a diagram illustrating another specific example of the division of the LUT. In this example, the LUT is divided into four pieces of partial data, each of which includes portions whose addresses are discontinuous. For example, partial data #1 contains a total of K/4 entry values: P[0] to P[15], P[64] to P[79], ..., P[K-64] to P[K-49]. Partial data #2 contains a total of K/4 entry values: P[16] to P[31], P[80] to P[95], ..., P[K-48] to P[K-33]. Partial data #3 contains a total of K/4 entry values: P[32] to P[47], P[96] to P[111], ..., P[K-32] to P[K-17]. Partial data #4 contains a total of K/4 entry values: P[48] to P[63], P[112] to P[127], ..., P[K-16] to P[K-1].
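The interleaved division of FIG. 9 can be sketched as a round-robin assignment of 16-entry blocks to the cores. The function name `interleaved_partition` and the value K = 256 are assumptions for illustration, and the sketch assumes K is a multiple of N times the block size.

```python
def interleaved_partition(k, n, block=16):
    """Assign consecutive blocks of `block` entries to n cores in round-robin order.

    Hypothetical helper: assumes k is a multiple of n * block.
    """
    if k % (n * block) != 0:
        raise ValueError("this sketch assumes K is a multiple of N * block")
    parts = [[] for _ in range(n)]
    for start in range(0, k, block):
        core = (start // block) % n  # block 0 -> core 0, block 1 -> core 1, ...
        parts[core].extend(range(start, start + block))
    return parts


# Example with an assumed table size K = 256 and N = 4 cores.
interleaved = interleaved_partition(256, 4)
```

For K = 256, the first part holds P[0] to P[15], P[64] to P[79], P[128] to P[143], and P[192] to P[207], which matches the pattern P[0] to P[15], P[64] to P[79], ..., P[K-64] to P[K-49].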

In this example, each piece of partial data contains multiple sets of 16 entry values with consecutive addresses. The data size of 16 entry values is equal to the cache line size. The cache line size is the maximum amount of data transferred (read) at one time between the L2 cache and the external memory. In a DRAM, for example, the memory cells are divided into blocks called "banks", and accessing memory cells that belong to different banks requires switching the bank being accessed. When the external memory includes a DRAM and the K/4 consecutive entry values illustrated in FIG. 8 are stored across multiple banks, the external memory (DRAM) must read the entry values while switching banks in order to serve the accesses issued in parallel by the cores.

Now consider an example in which the DRAM constituting the external memory has four banks. In the example of FIG. 8, P[0] to P[15] are first read in response to a request from the core 101, P[K/4] to P[K/4+15] in response to a request from the core 102, P[2K/4] to P[2K/4+15] in response to a request from the core 103, and P[3K/4] to P[3K/4+15] in response to a request from the core 104. However, because these entry values are stored in different banks, the DRAM must read the data while switching banks to serve the parallel requests. As a result, reading the data takes longer by the time required for bank switching.

In the example of FIG. 9, by contrast, P[0] to P[15] are first read in response to a request from the core 101, P[16] to P[31] in response to a request from the core 102, P[32] to P[47] in response to a request from the core 103, and P[48] to P[63] in response to a request from the core 104. Because these data are stored in the same bank, the DRAM can read them at high speed without switching banks.
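The difference between the two divisions can be illustrated with a toy model that counts which banks the first round of requests touches. The bank mapping used here is an assumption made only for this sketch: each of the four banks is taken to hold K/4 consecutive entries. Real DRAM address-to-bank mappings depend on the memory controller and may differ. The names `bank_of` and `first_round_banks` and the value K = 1024 are likewise hypothetical.

```python
def bank_of(index, k, banks=4):
    """Return the bank holding entry `index` under the assumed mapping,
    in which each bank stores k // banks consecutive entries."""
    return index // (k // banks)


def first_round_banks(first_indices, k, banks=4):
    """Set of banks touched by the first entry each core requests."""
    return {bank_of(i, k, banks) for i in first_indices}


K = 1024  # assumed table size for illustration

# FIG. 8 style: the cores start at P[0], P[K/4], P[2K/4], P[3K/4].
contiguous_first = [0, K // 4, 2 * K // 4, 3 * K // 4]
# FIG. 9 style: the cores start at P[0], P[16], P[32], P[48].
interleaved_first = [0, 16, 32, 48]

banks_fig8 = first_round_banks(contiguous_first, K)
banks_fig9 = first_round_banks(interleaved_first, K)
```

Under this assumed mapping the FIG. 8 style requests land in four different banks, while the FIG. 9 style requests all land in one bank, which is the situation the text describes.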

Refer again to FIG. 6. In steps S101, S111, S121, and S141, the cores 101 to 104 each preload the partial data assigned to them. That is, the core 101 preloads partial data #1, the core 102 preloads partial data #2, the core 103 preloads partial data #3, and the core 104 preloads partial data #4. The cores perform their preloads in parallel. As a result, the LUT is copied to the L2 cache.
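The parallel preload can be mimicked in software with one thread per core, each copying its own partial data into a shared store that stands in for the L2 cache. This is only a software analogy under stated assumptions: the dictionary `shared_cache`, the function `preload_in_parallel`, and the sample table are hypothetical, and Python threads do not model real cache hardware or timing.

```python
import threading


def preload_in_parallel(lut, partitions):
    """Copy each partition of `lut` into a shared dict, one thread per partition.

    The shared dict is a stand-in for the L2 cache shared by the cores.
    """
    shared_cache = {}
    lock = threading.Lock()

    def preload(indices):
        # Each worker reads only the entries assigned to it.
        local = {i: lut[i] for i in indices}
        with lock:
            shared_cache.update(local)

    threads = [threading.Thread(target=preload, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has finished its preload
    return shared_cache


# Assumed sample data: a 64-entry table split into four 16-entry parts.
lut = list(range(100, 164))
partitions = [range(0, 16), range(16, 32), range(32, 48), range(48, 64)]
cache = preload_in_parallel(lut, partitions)
```

After all workers finish, the shared store holds every entry of the table, just as the L2 cache holds the whole LUT once the four preloads complete.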

From this point on, the cores 101 to 104 perform their processing in parallel, but only the processing of the core 101 is described here. The processing of the cores 102 to 104 (steps S111 to S116, S121 to S126, and S141 to S146) is the same as that of the core 101, so its description is omitted. In step S102, the core 101 reads the data of the target pixel from the external memory. In step S103, the core 101 calculates an index value from the data of the target pixel. In step S104, the core 101 uses the LUT stored in the L2 cache to obtain the entry value corresponding to the calculated index value. In step S105, the core 101 calculates an output pixel value from the entry value. In step S106, the core 101 writes the output pixel value to the external memory.
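Steps S102 to S106 amount to a per-pixel table lookup, which can be sketched as below. The text does not specify how the index value or the output pixel value is computed, so `index_of` and `output_of` are hypothetical callables supplied by the caller, and the inverting table is an assumed example.

```python
def convert_pixels(pixels, lut, index_of, output_of):
    """Convert pixels through a lookup table, following steps S102 to S106.

    `index_of` and `output_of` are placeholders for computations that the
    description leaves unspecified.
    """
    output = []
    for pixel in pixels:                # S102: read the target pixel data
        index = index_of(pixel)         # S103: calculate the index value
        entry = lut[index]              # S104: look up the entry value in the LUT
        output.append(output_of(entry)) # S105, S106: compute and store the output
    return output


# Assumed example: an 8-bit inverting table with identity index and output steps.
invert_lut = [255 - i for i in range(256)]
converted = convert_pixels([0, 10, 255], invert_lut,
                           index_of=lambda p: p,
                           output_of=lambda e: e)
```

In the apparatus each core would run this loop over its own part of the image, with the lookups in S104 served from the preloaded cache rather than from the external memory.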

In step S107, the OS 50 waits until all threads have completed. When all the threads assigned to the cores 101 to 104 have completed, the OS 50 ends the flow of FIG. 6.

FIG. 10 is a sequence chart illustrating the preload process in the image forming apparatus 1. To simplify the description, only the processing of two cores, core #1 and core #2 (for example, the core 101 and the core 102), is shown.

First, the core #1 requests the L1 cache #1 to read P[0] (step S201). The L1 cache #1 requests the L2 cache to read P[0] (step S202). The L2 cache requests the external memory (main storage device 30) to read P[0] (step S203).

The following processing is performed in parallel with steps S201 to S203. The core #2 requests the L1 cache #2 to read P[1] (step S204). The L1 cache #2 requests the L2 cache to read P[1] (step S205). The L2 cache requests the external memory (main storage device 30) to read P[1] (step S206).

In response to the request from the core #1, the external memory outputs P[0], out of the data it stores, to the L2 cache (step S207). The L2 cache outputs P[0] to the L1 cache #1 (step S208). The L1 cache #1 outputs P[0] to the core #1 (step S209).

The following processing is performed in parallel with steps S207 to S209. In response to the request from the core #2, the external memory outputs P[1], out of the data it stores, to the L2 cache (step S210). The L2 cache outputs P[1] to the L1 cache #2 (step S211). The L1 cache #2 outputs P[1] to the core #2 (step S212). This completes the preloading of P[0] and P[1].
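The request and response path of FIG. 10 (core, L1 cache, L2 cache, external memory, and back) can be sketched with a small read-through model. The classes `Memory` and `CacheLevel` are hypothetical, the model works on single entries rather than cache lines, and it runs sequentially, so it shows only where the data ends up, not the parallel timing.

```python
class Memory:
    """Stand-in for the external memory (main storage device)."""

    def __init__(self, data):
        self.data = data

    def read(self, addr):
        return self.data[addr]


class CacheLevel:
    """Read-through cache: on a miss, fetch from the next level and keep a copy."""

    def __init__(self, name, backing):
        self.name = name
        self.backing = backing
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:
            # Miss: forward the request to the next level (L2 or external memory).
            self.lines[addr] = self.backing.read(addr)
        return self.lines[addr]


memory = Memory({0: "P[0]", 1: "P[1]"})
l2 = CacheLevel("L2", memory)      # shared by both cores
l1_1 = CacheLevel("L1#1", l2)      # private to core #1
l1_2 = CacheLevel("L1#2", l2)      # private to core #2

v0 = l1_1.read(0)  # core #1 requests P[0]
v1 = l1_2.read(1)  # core #2 requests P[1]
```

After both reads, the shared L2 level holds P[0] and P[1], while each L1 level holds only the entry its own core requested.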

Compared with the flow of FIG. 3, the flow of FIG. 10 shortens the time required to complete the preloading of P[0] and P[1].

4. Modifications
The present invention is not limited to the embodiment described above, and various modifications are possible. Several modifications are described below. Two or more of the following modifications may be used in combination.

4-1. Modification 1
FIG. 11 is a diagram showing an outline of the LUT division method according to Modification 1. The LUT division method is not limited to the examples described in the embodiment. In this example, the four pieces of partial data differ in data size and partly overlap one another. Furthermore, even when the four pieces of partial data are combined, the LUT stored in the main storage device 30 is not completely reproduced, and some entry values are missing. This is effective in the following case. For example, a software component such as an application program identifies the portion of the LUT that is used when image processing is performed on the target image. The OS 50 divides the LUT so as to cover the portion thus identified.

4-2. Modification 2
In Modification 1, a software component such as an application program may further identify, for each core, the portion of the LUT that is used in image processing. In this case, the OS 50 divides the LUT so that each piece of partial data includes the portion used by the corresponding core. In the example of FIG. 11, partial data #1 covers the entry values used for image processing of the area of the target image handled by the core 101. Similarly, partial data #2 covers the entry values used for the area handled by the core 102, partial data #3 those used for the area handled by the core 103, and partial data #4 those used for the area handled by the core 104. If the size of each piece of partial data is smaller than the capacity of the L1 cache, each core can read the entry values it needs directly from its L1 cache, which further speeds up the processing.
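One way to picture a per-core division is to round the entries each core uses out to whole 16-entry blocks and to check the result against the L1 capacity. This is a sketch under assumptions, not the method of the patent: the functions `per_core_partitions` and `fits_in_l1`, the sample index lists, the 4-byte entry size, and the 32 KiB L1 capacity are all hypothetical.

```python
def per_core_partitions(indices_per_core, line=16):
    """For each core, return the (first, last) index ranges of the
    line-sized blocks that cover the entries that core uses."""
    parts = []
    for indices in indices_per_core:
        blocks = sorted({i // line for i in indices})
        parts.append([(b * line, b * line + line - 1) for b in blocks])
    return parts


def fits_in_l1(part, entry_bytes, l1_bytes, line=16):
    """True if the partial data is smaller than the assumed L1 capacity."""
    return len(part) * line * entry_bytes < l1_bytes


# Assumed entries used by each of four cores for its image area.
used = [[0, 3, 17], [16, 40], [100], [5, 250]]
core_parts = per_core_partitions(used)
all_fit = all(fits_in_l1(p, entry_bytes=4, l1_bytes=32 * 1024) for p in core_parts)
```

With these sample inputs the resulting pieces differ in size, partly overlap (cores 1 and 2 both include P[16] to P[31]), and leave unused entries out, which are the properties the text attributes to FIG. 11.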

4-3. Modification 3
The data stored in the main storage device 30 and the processing that uses it are not limited to those exemplified in the embodiment. The data stored in the main storage device 30 may be, for example, code (instructions) executed by each core. In this case, a core reads the code stored at a specified address and executes the code it has read. This code is preloaded into the cache memory.

4-4. Other Modifications
The configuration of the CPU 10 is not limited to that illustrated in FIG. 2. The number of cores and the hierarchical structure of the cache memory are merely examples. The CPU 10 need only have at least two cores that share the second-level cache memory (L2 cache). The CPU 10 may have an L3 cache below the L2 cache.

The CPU 10 is also not limited to one in which a plurality of cores and the cache memory are physically mounted on a single chip. The present invention may be applied to an information processing apparatus in which a plurality of CPU chips share one cache memory.

Furthermore, the "plurality of cores" in the embodiment is not limited to a plurality of physically distinct cores. One physical core may be used in a time-division manner as a plurality of logical (pseudo) cores.

The information processing apparatus according to the present invention is not limited to the image forming apparatus 1 illustrated in FIG. 2. The information processing apparatus may be any apparatus that uses the CPU 10 to execute a plurality of processes in parallel. For example, the information processing apparatus may be a personal computer, a smartphone, or a tablet terminal.

DESCRIPTION OF SYMBOLS
1: image forming apparatus, 10: CPU, 20: memory controller, 30: main storage device, 40: IO controller, 41: auxiliary storage device, 42: image reading unit, 43: image forming unit, 44: communication unit, 50: OS, 51: assigning means, 90: CPU, 101 to 104: cores, 111 to 114: cache memories (L1), 121: cache memory (L2), 901 to 904: cores, 911 to 914: cache memories (L1), 921: cache memory (L2)

Claims

An information processing apparatus comprising:
a first core;
a second core that performs processing in parallel with the first core;
a cache memory shared by the first core and the second core; and
an external memory that stores data, wherein
the first core includes first request means for requesting first partial data that is a part of the data,
the second core includes second request means for requesting second partial data that is a part of the data different from the first partial data,
the cache memory includes reading means for reading the first partial data and the second partial data from the external memory in response to a request from the first request means and a request from the second request means, and
the first core and the second core each perform processing using at least a part of the first partial data and the second partial data stored in the cache memory.

The information processing apparatus according to claim 1, further comprising:
N cores including the first core and the second core; and
assigning means that equally divides the data into N pieces of partial data, each consisting only of a portion having consecutive addresses, and assigns each piece of partial data to one of the N cores.

The information processing apparatus according to claim 1, further comprising:
N cores including the first core and the second core; and
assigning means that equally divides the data into N pieces of partial data, each including portions whose addresses are discontinuous, and assigns each piece of partial data to one of the N cores.

The information processing apparatus according to claim 2, wherein
the external memory includes a DRAM,
each of the N pieces of partial data includes a portion where addresses are continuous, and
the data size of the portion where the addresses are continuous is equal to or less than the amount of data read at one time when the reading means reads data from the DRAM.

The information processing apparatus according to any one of claims 1 to 4, wherein
the processing in the first core and the second core is processing for converting an index corresponding to a pixel into a pixel value, and
the data is a table for converting the index into the pixel value.

An information processing apparatus comprising:
a first core;
a second core that performs processing in parallel with the first core;
a first cache memory dedicated to the first core;
a second cache memory dedicated to the second core;
a cache memory shared by the first core and the second core; and
an external memory that stores data, wherein
the first core includes first request means for requesting first partial data that is a part of the data,
the second core includes second request means for requesting second partial data that is a part of the data different from the first partial data,
the first cache memory includes first acquisition means for acquiring the first partial data from the external memory in response to a request from the first request means,
the second cache memory includes second acquisition means for acquiring the second partial data from the external memory in response to a request from the second request means,
the first core performs processing using the first partial data stored in the first cache memory, and
the second core performs processing using the second partial data stored in the second cache memory.

An image forming apparatus comprising:
the information processing apparatus according to any one of claims 1 to 6; and
an image forming unit that forms an image according to a result of the processing by the first core and the second core.