JP2012208615A

JP2012208615A - Instruction execution analysis device, instruction execution analysis method, and program

Info

Publication number: JP2012208615A
Application number: JP2011072473A
Authority: JP
Inventors: Toshihisa Kamemaru; 敏久亀丸
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2011-03-29
Filing date: 2011-03-29
Publication date: 2012-10-25
Anticipated expiration: 2031-03-29
Also published as: JP5597584B2

Abstract

PROBLEM TO BE SOLVED: To improve the usage efficiency of cache memory. SOLUTION: An instruction execution analysis device performs a simulation in which a main memory device is accessed and instructions are executed. In the simulation, it extracts, as an inefficient block, a memory block in which memory accesses are concentrated on part of the data (inefficient block extraction process 2), and extracts, as an inefficient instruction, an instruction causing memory access to the inefficient block (inefficient instruction extraction process 3). This prevents data with a high access frequency and data with a low access frequency from being intermixed in the same block of cache memory, and improves the usage efficiency of the cache memory.

Description

The present invention relates to a technique for improving the utilization efficiency of a cache memory.

An information processing apparatus is composed of a processor device, a main memory device, and an input/output device. Where execution performance is particularly important, the apparatus uses a processor device that operates at a high frequency and has a cache memory.
In such a case, the hit rate of the cache memory affects performance.
In general, the hit rate is 95% or higher, and fast systems achieve a hit rate close to 99%.
In other words, it is not uncommon for the performance of the entire system to drop significantly when the cache hit rate drops by only a few percent.
Therefore, as in Patent Document 1 for example, it has been proposed to collect information on cache memory behavior, centered on the hit rate, and to perform tuning.

The method described in Patent Document 1 has a simulator that simulates instructions, and it simulates the cache memory as well as the instructions.
According to the method described in Patent Document 1, for a program having a large array, the number of accesses and the number of cache misses (number of accesses minus number of hits) can be displayed for each element of the array.
The method described in Patent Document 1 also detects lines with large numbers of accesses and cache misses, and improves the source program so that the cache is used effectively.
That is, because an array with many cache misses incurs a long delay, the compiler is made to generate code that executes other instructions in the meantime.

Patent Document 1: Japanese Patent Laid-Open No. 08-263372

However, the information obtained by the method of Patent Document 1 is the number of accesses and the number of misses for the data arrays of a program, and it is unclear how these affect the cache as a whole.
For example, if the number of cache misses is displayed as 0, the cache appears to be functioning effectively. However, if, among the elements placed in the same cache block, only one element has an access count of 10,000 and a miss count of 1, while the other elements each have an access count of 1 and a miss count of 0, the cache cannot be said to be used effectively.
Because an element accessed only once is rarely used, nothing is lost if it is not placed in the cache memory.
Rather, it would be more efficient for frequently used data of another array to use that cache memory.
Put differently, the presence of such an array can be expected to increase the number of cache misses of other arrays.
That is, array elements that are accessed efficiently with few cache misses may be adversely affecting other arrays, but with the method of Patent Document 1 it is not easy to trace back to such a cause.
One technique for using cache memory effectively is to group together only frequently accessed data so that it is placed in cache blocks, and to group infrequently accessed data separately so that it is not placed in cache blocks.
A mixture of frequently and infrequently accessed data is one of the factors that lower cache use efficiency.
In particular, in recent high-performance processors the cache block size tends to grow as cache capacity increases, so this problem has become increasingly prominent.
Furthermore, with the method of Patent Document 1 it is not easy to determine which part of the program gives rise to such a cause, so efficient program correction is difficult.

A main object of the present invention is to solve the problems described above: to prevent data with a high access frequency and data with a low access frequency from being mixed in the same block of the cache memory, and thereby to improve the use efficiency of the cache memory.

An instruction execution analysis apparatus according to the present invention includes:
an inefficient block extraction unit that analyzes the status of memory access to a main memory device in a simulation in which memory access to the main memory device is performed and instructions are executed, and that, based on the analysis result, extracts as an inefficient block, from among a plurality of memory blocks in the main memory device, a memory block whose memory accesses are biased toward part of the data in the memory block; and
an inefficient instruction extraction unit that extracts, as an inefficient instruction, an instruction that satisfies a predetermined condition from among the instructions that caused memory access to the inefficient block extracted by the inefficient block extraction unit.

According to the present invention, a memory block whose memory accesses in the simulation are biased toward part of the data in the memory block is extracted as an inefficient block, and an instruction that caused memory access to the inefficient block is extracted as an inefficient instruction. This makes it possible to prevent data with a high access frequency and data with a low access frequency from being mixed in the same block of the cache memory, and to improve the use efficiency of the cache memory.

FIG. 1 is a flowchart showing a processing procedure according to the first embodiment.
FIG. 2 is a flowchart showing inefficient block extraction processing according to the first embodiment.
FIG. 3 is a flowchart showing the processing that calculates the numbers of bytes with high and low access frequency according to the first embodiment.
FIG. 4 is a flowchart showing inefficient instruction extraction processing according to the first embodiment.
FIG. 5 is a diagram showing an output example of inefficient blocks and inefficient instructions according to the first embodiment.
FIG. 6 is a diagram showing an example of a source program according to the second embodiment.
FIG. 7 is a diagram showing an example of a source program into which messages have been inserted according to the second embodiment.
FIG. 8 is a flowchart showing a processing procedure according to the second embodiment.
FIG. 9 is a diagram showing an example of a correspondence table between a source program and binary code according to the second embodiment.
FIG. 10 is a diagram showing a configuration example of an instruction execution analysis apparatus according to the first embodiment.
FIG. 11 is a diagram showing a configuration example of an instruction execution analysis apparatus according to the second embodiment.
FIG. 12 is a diagram showing a hardware configuration example of the instruction execution analysis apparatus according to the first and second embodiments.

The first and second embodiments describe a technique for cases in which, for example in a high-performance information processing apparatus, frequently accessed and infrequently accessed data are mixed in blocks of the cache memory, so that the cache memory cannot be used efficiently and the apparatus cannot deliver its inherent performance. The technique identifies the cause and indicates the locations in the source program to be corrected to improve performance.
More specifically, the first embodiment presents the core of the technique, that is, everything up to extracting inefficient instructions from the binary code, and the second embodiment presents an example in which the core is actually applied, that is, an example of indicating correction locations in a source program.

Embodiment 1.
The core part of the above technique will be described with reference to FIGS. 1 to 4.
FIG. 1 shows the overall flow of the instruction execution analysis method according to the present embodiment.
FIGS. 2 and 4 describe the important components of the flow shown in FIG. 1.
FIG. 3 gives additional description of a specific element of FIG. 2.
FIG. 10 shows a configuration example of an instruction execution analysis apparatus that implements the instruction execution analysis method shown in FIG. 1.

FIG. 1 is the overall flow of the instruction execution analysis method according to the present embodiment.

Reference numeral 1 denotes the input data, which consists of binary code and an execution environment.
The execution environment refers to initial data and data that changes due to external factors during execution.
Reference numeral 2 denotes the inefficient block extraction process, and 3 denotes the inefficient instruction extraction process.
An inefficient block is a cache block in which only part of the data is used frequently while the rest is hardly used, so that the use efficiency of the block as a whole is low.
An inefficient instruction is an instruction that causes an inefficient block to arise.
As shown in FIG. 2, the inefficient block extraction process 2 is broadly divided into a process that measures the number of memory accesses and a process that determines inefficient blocks.
As shown in FIG. 4, the inefficient instruction extraction process 3 is broadly divided into a process that measures the number of accesses to inefficient blocks and a process that determines inefficient instructions.
Details of FIGS. 2 and 4 will be described later.
Reference numeral 4 denotes the output data, which consists of a list of inefficient instructions and information on the inefficient instructions.
FIG. 5 shows a specific output example.
Details of FIG. 5 will be described later.

FIG. 10 shows a configuration example of an instruction execution analysis apparatus 100 that implements the instruction execution analysis method shown in FIG. 1.
The instruction execution analysis apparatus 100 is, for example, a tool (an instruction execution analysis tool) implemented on an information processing apparatus.

In the instruction execution analysis apparatus 100, the simulation unit 101 simulates instruction execution using the binary code and execution environment 200.
More specifically, the simulation unit 101 emulates a processor device such as a CPU (Central Processing Unit), performs memory access to a main memory device (for example, the main memory device of the information processing apparatus on which the instruction execution analysis apparatus 100 is implemented), and simulates execution of the instructions of the binary code.
The main memory device is divided into a plurality of memory blocks in correspondence with the number of cache blocks of the cache memory included in the processor device.
The simulation unit 101 performs steps 203, 204, 206, and 207 of FIG. 2 and steps 303, 304, 307, and 308 of FIG. 4, which are described later.

When a start instruction 300 is given, the inefficient block extraction unit 102 performs the inefficient block extraction process 2 of FIG. 1.
More specifically, the inefficient block extraction unit 102 analyzes the status of memory access to the main memory device in the simulation by the simulation unit 101 and, based on the analysis result, extracts as an inefficient block, from among the plurality of memory blocks in the main memory device, a memory block whose memory accesses are biased toward part of the data in the memory block.

The inefficient instruction extraction unit 103 performs the inefficient instruction extraction process 3 of FIG. 1 and outputs a list of inefficient instructions and information related to the inefficient instructions 400.
More specifically, the inefficient instruction extraction unit 103 extracts, as an inefficient instruction, an instruction that satisfies a predetermined condition from among the instructions that caused memory access to the inefficient blocks extracted by the inefficient block extraction unit 102.
A display device (not shown) is connected to the instruction execution analysis apparatus 100, and the inefficient instruction extraction unit 103 outputs the extracted inefficient instructions and their addresses to the display device, as in FIG. 5, as the list of inefficient instructions and information related to the inefficient instructions 400.

In the inefficient block extraction unit 102, the memory access measurement unit 1021 performs steps 201, 202, and 205 of FIG. 2 (the memory access count measurement), described later.
The inefficient block determination unit 1022 performs steps 208 to 212 of FIG. 2 (the inefficient block determination), described later.

In the inefficient instruction extraction unit 103, the inefficient block access measurement unit 1031 performs steps 301, 302, 305, and 306 of FIG. 4 (the measurement of the number of accesses to inefficient blocks), described later.
The inefficient instruction determination unit 1032 performs steps 309 to 312 of FIG. 4 (the inefficient instruction determination), described later.

FIG. 2 shows the flow of the inefficient block extraction process 2 of FIG. 1.
It consists of steps 201 to 212 and can be divided into two parts: the first half, 201 to 207, measures the number of memory accesses, and the second half, 208 to 212, determines the inefficient blocks.
In the following, the instruction execution analysis method shown in FIG. 1 is described as executed by the instruction execution analysis apparatus 100 shown in FIG. 10.

The purpose of the first half (201 to 207) is to simulate instruction execution with a simulator (the simulation unit 101 in FIG. 10) and to measure the number of memory accesses.

In 201, the memory access measurement unit 1021 initializes the memory access counter MC[Addr] for all addresses (Addr), that is, clears it to zero.

In 202, the memory access measurement unit 1021 sets the instruction pointer (IP) to its initial value. That is, IP points to the address of the first instruction to be executed.

In 203, the simulation unit 101 fetches an instruction (Inst) from the address indicated by IP.
Because an instruction is not necessarily one byte long, Inst receives as many bytes as are needed to form the instruction.

In 204, the simulation unit 101 simulates the instruction (Inst).
The simulation unit 101 implements in software the same internal resources as a real processor device (arithmetic units, registers, cache memory, and so on), stores intermediate state in those resources, and has them carry out the various operations, thereby reproducing the same state as the real processor device.

In 205, if executing the instruction (Inst) involves memory access to the main memory device, the memory access measurement unit 1021 updates the access count of the corresponding address (Addr).
Specifically, 1 is added to MC[Addr].

In 206, the simulation unit 101 determines whether instruction execution has finished. Various end conditions are possible, such as the number of executed instructions, the number of execution cycles, execution of a specific instruction, or IP reaching a specific address, but this specification does not fix one.

In 207, the simulation unit 101 updates the instruction pointer (IP) to point to the next instruction.
If the executed instruction is not a branch instruction, the new IP is the old IP plus the instruction length.
If it is a branch instruction, IP points to the branch destination address.

The loop from 203 to 207 is repeated until the end condition of 206 is satisfied. When it is satisfied, the simulation ends and the measurement of the number of memory accesses is complete.
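
The measurement loop of steps 201 to 207 can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary-based instruction format, the `halt` and `jmp` opcodes, and the `max_steps` limit are all assumptions made for the example, and the execution of step 204 is reduced to a precomputed list of touched addresses.

```python
from collections import defaultdict


def measure_memory_accesses(program, entry_ip, max_steps=1000):
    """Count accesses per byte address (MC[Addr]) over a toy simulation."""
    mc = defaultdict(int)              # 201: every MC[Addr] starts at zero
    ip = entry_ip                      # 202: IP points at the first instruction
    for _ in range(max_steps):         # 206: this sketch stops on a step limit...
        inst = program[ip]             # 203: fetch the instruction at IP
        if inst["op"] == "halt":       # 206: ...or on a specific instruction
            break
        for addr in inst.get("mem", ()):
            mc[addr] += 1              # 205: one count per accessed address
        if inst["op"] == "jmp":
            ip = inst["target"]        # 207: a branch moves IP to its target
        else:
            ip += inst["len"]          # 207: otherwise advance by instruction length
    return mc


# Hypothetical three-instruction program keyed by instruction address.
program = {
    0: {"op": "load", "len": 4, "mem": [100]},
    4: {"op": "load", "len": 4, "mem": [101]},
    8: {"op": "halt", "len": 1},
}
mc = measure_memory_accesses(program, entry_ip=0)
```

A real simulator would derive the touched addresses from register and memory state while executing each instruction; the fixed `mem` lists stand in for that step here.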

Next, the second half (208 to 212), the inefficient block determination, is described.

In 208, the inefficient block determination unit 1022 initializes the block number (N) to 0.

In 209, the inefficient block determination unit 1022 calculates the number of bytes in the block with a high access frequency and the number with a low access frequency.
The number of bytes with a high access frequency is stored in the array BA_High[N], and the number with a low access frequency in the array BA_Low[N].
Details are described with reference to FIG. 3.

In 210, the inefficient block determination unit 1022 determines whether the block is an inefficient block.
A block is judged inefficient when it contains bytes with a high access frequency and also many bytes with a low access frequency.
Specifically, the condition is BA_High[N] ≥ 1 and BA_Low[N] ≥ BlockSize × JudgeRatio.
JudgeRatio is a parameter that adjusts the criterion and can take a value from 0 to 1. For example, if 0.5 is specified, a block is judged inefficient when at least half of its bytes are accessed infrequently and at least one byte is accessed frequently.
If the block is inefficient, the inefficient block determination unit 1022 sets 1 in the array NE[N], which indicates that block N is inefficient (not efficient); otherwise it sets 0.
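
The judgment of step 210 can be written as a small predicate. The sketch assumes the second comparison applies to BA_Low[N], which is what the JudgeRatio = 0.5 example describes; the function name and the default ratio are illustrative choices, not taken from the specification.

```python
def is_inefficient_block(ba_high, ba_low, block_size, judge_ratio=0.5):
    """Step 210: at least one hot byte and enough cold bytes in the block."""
    # Assumption: the cold-byte count (BA_Low) is compared with
    # BlockSize * JudgeRatio, following the worked example in the text.
    return ba_high >= 1 and ba_low >= block_size * judge_ratio


# NE[N] would then be 1 for inefficient blocks and 0 otherwise.
ne_value = 1 if is_inefficient_block(ba_high=1, ba_low=40, block_size=64) else 0
```

Raising `judge_ratio` toward 1 flags only blocks that are almost entirely cold apart from a few hot bytes.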

In 211, the inefficient block determination unit 1022 checks whether the above processing has been performed for all blocks. If so, the processing is complete; otherwise, in 212, it adds 1 to the block number N, returns to step 209, and repeats 209 to 212.

Through the above processing, the array NE[], which indicates the inefficient blocks, is obtained for all blocks.

The method used in 209 to calculate the numbers of bytes in a block with high and low access frequency is now described with reference to FIG. 3.
The purpose of step 209 of FIG. 2 is to obtain BA_High[N] and BA_Low[N] for block N.
Step 209 of FIG. 2 consists of seven steps, 2091 to 2097.

In 2091, the inefficient block determination unit 1022 initializes BA_High[N] and BA_Low[N] to 0.
In 2092, the inefficient block determination unit 1022 obtains the thresholds ThH and ThL.
ThH is the threshold for judging whether a byte has a high access frequency, and ThL is the threshold for judging whether a byte has a low access frequency.
These thresholds can be supplied as external parameters when the behavior of the program is well understood, but in general it is difficult to have external information precise enough to fix them.
The thresholds are therefore derived from the results of the simulation itself.
Specifically, for each block, ThH is set to a value sufficiently larger than the average memory access count (AverageMC), and ThL to a value sufficiently smaller than it.
In this example, ThH is four times the average memory access count and ThL is one half of it.
AverageMC is obtained by dividing the total memory access count of the memory block (the sum of MC[Addr] for Addr ∈ block N) by the number of bytes in the block (BlockSize).
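
The threshold calculation of step 2092 can be sketched as below. The factors 4 and 1/2 come from the example in the text; the function name, the dictionary representation of MC, and the assumption that block N covers addresses N × BlockSize through (N + 1) × BlockSize − 1 are illustrative.

```python
def access_thresholds(mc, block_no, block_size, high_factor=4.0, low_factor=0.5):
    """Step 2092: derive ThH and ThL from the block's average access count."""
    base = block_no * block_size  # assumed first address of block N
    total = sum(mc.get(addr, 0) for addr in range(base, base + block_size))
    average_mc = total / block_size          # AverageMC
    return high_factor * average_mc, low_factor * average_mc


# Hypothetical 8-byte block: one hot byte, seven bytes touched once.
mc = {0: 793, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1}
th_h, th_l = access_thresholds(mc, block_no=0, block_size=8)
```

Here the block total is 800 accesses over 8 bytes, so AverageMC is 100, ThH is 400, and ThL is 50.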

In 2093, the inefficient block determination unit 1022 initializes the address (Addr) and the counter (I), the variables used in the loop below.

In 2094, the inefficient block determination unit 1022 compares the access count of each byte against the thresholds.
If the memory access count MC[Addr] is greater than or equal to ThH, 1 is added to BA_High[N].
If the memory access count MC[Addr] is less than or equal to ThL, 1 is added to BA_Low[N].

In 2095, the inefficient block determination unit 1022 adds 1 to the counter (I).

In 2096, the inefficient block determination unit 1022 checks whether the loop has finished.
That is, when the counter (I) equals the number of bytes in the block (BlockSize), this process ends.
Otherwise, in 2097, the inefficient block determination unit 1022 adds 1 to the address (Addr), returns to 2094, and repeats 2094 to 2097 until the end condition is satisfied.

When the loop ends, BA_High[N] and BA_Low[N] hold the numbers of bytes with high and low access frequency, respectively.
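
The per-byte classification loop of steps 2091 to 2097 might look like the following sketch. The thresholds are passed in as arguments; the starting address of the block (N × BlockSize) and the dictionary form of MC are assumptions for the example.

```python
def count_high_low_bytes(mc, block_no, block_size, th_h, th_l):
    """Steps 2091-2097: count hot and cold bytes within one block."""
    ba_high = 0                          # 2091: BA_High[N] = 0
    ba_low = 0                           # 2091: BA_Low[N] = 0
    addr = block_no * block_size         # 2093: assumed first address of block N
    for _ in range(block_size):          # 2095/2096: counter I runs up to BlockSize
        count = mc.get(addr, 0)
        if count >= th_h:                # 2094: high access frequency
            ba_high += 1
        if count <= th_l:                # 2094: low access frequency
            ba_low += 1
        addr += 1                        # 2097: move to the next byte
    return ba_high, ba_low


# Hypothetical 8-byte block: one hot byte, seven bytes touched once.
mc = {0: 793, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1}
ba_high, ba_low = count_high_low_bytes(mc, 0, 8, th_h=400, th_l=50)
```

One consequence of deriving the thresholds from the block average is worth noting: a block that is never accessed has an average of 0, so every byte satisfies both comparisons. A practical tool would probably skip such blocks, although the text does not say so.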

FIG. 4 shows the flow of the inefficient instruction extraction process 3 of FIG. 1.
It consists of steps 301 to 312 and can be divided into two parts: the first half, 301 to 308, measures the number of accesses to inefficient blocks, and the second half, 309 to 312, determines the inefficient instructions.
The purpose of the first half (301 to 308) is to simulate instruction execution with a simulator (the simulation unit 101 in FIG. 10) and to measure the number of accesses to inefficient blocks.

In 301, the inefficient block access measurement unit 1031 initializes the per-instruction inefficient access counter NE_InstAC[IP] for all instruction pointers (IP), that is, clears it to zero.

In 302, the inefficient block access measurement unit 1031 sets the instruction pointer (IP) to its initial value.
That is, IP points to the address of the first instruction to be executed.

In 303, the simulation unit 101 fetches an instruction (Inst) from the address indicated by IP.

In 304, the simulation unit 101 simulates the instruction (Inst) with its internal simulator.

In 305, if executing the instruction (Inst) involves memory access to the main memory device, the inefficient block access measurement unit 1031 checks whether the corresponding address (Addr) lies in an inefficient block.
Specifically, the integer quotient of Addr / BlockSize is the block number N, so reading the array NE[N] obtained in the inefficient block extraction shows whether the block is inefficient.
When NE[N] = 1, the block is inefficient and the variable NoEffeciency is set to 1; when NE[N] = 0, the block is not inefficient and NoEffeciency is set to 0.
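
The lookup in step 305 is an integer division followed by an array read. A minimal sketch, with hypothetical function names and NE held as a Python list indexed by block number:

```python
def block_number(addr, block_size):
    """Block number N is the integer quotient of Addr / BlockSize."""
    return addr // block_size


def touches_inefficient_block(addr, ne, block_size):
    """Step 305: read NE[N] for the block that contains addr."""
    return ne[block_number(addr, block_size)] == 1


# Hypothetical result of the block extraction: block 1 is inefficient.
ne = [0, 1]
hit = touches_inefficient_block(70, ne, block_size=64)    # address 70 is in block 1
miss = touches_inefficient_block(10, ne, block_size=64)   # address 10 is in block 0
```

When BlockSize is a power of two, as cache block sizes usually are, the division is equivalent to a right shift of the address.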

３０６で、非効率ブロックアクセス計測部１０３１が、非効率ブロックのアクセスなら命令毎の非効率アクセスカウンタＮＥ＿ＩｎｓｔＡＣ［ＩＰ］に１を加えて更新する。 In 306, the inefficient block access measuring unit 1031 updates the inefficient access counter NE_InstAC [IP] for each instruction by adding 1 if the inefficient block is accessed.

３０７で、シミュレーション部１０１が、命令実行が終了したかどうかを判断する。
終了の条件は実行命令数、実行サイクル数、特定命令の実行、特定のアドレスにＩＰが到達するなど様々な条件が考えられるが、本明細書では特定はしない。 In 307, the simulation unit 101 determines whether the instruction execution is completed.
Various conditions such as the number of execution instructions, the number of execution cycles, the execution of a specific instruction, and the arrival of an IP at a specific address can be considered as termination conditions, but are not specified in this specification.

In 308, the simulation unit 101 updates the instruction pointer (IP) to point to the next instruction.
If the executed instruction is not a branch instruction, the new IP is the old IP plus the instruction length.
If it is a branch instruction, IP points to the branch destination address.

The loop from 303 to 308 is repeated until the termination condition of 307 is satisfied.
When the termination condition of 307 is satisfied, the simulation ends and the measurement of the number of memory accesses is complete.
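
The loop from 303 to 308 can be sketched as follows. This is an illustrative toy model in Python, not the simulator of the embodiment: each instruction is a hypothetical dictionary holding its length, an optional memory address it accesses, and an optional branch target, and the termination condition is assumed to be a limit on the number of executed instructions (one of the conditions the text mentions).

```python
from collections import defaultdict


def measure_inefficient_accesses(program, ne, block_size, start_ip, max_instructions):
    """Count, per instruction address, the accesses made to inefficient blocks."""
    ne_inst_ac = defaultdict(int)   # per-instruction counter NE_InstAC[IP]
    ip = start_ip
    executed = 0
    while True:
        inst = program[ip]                        # 303: fetch Inst from IP
        mem_addr = inst.get("mem")                # 304: toy "simulation" of Inst
        executed += 1
        if mem_addr is not None:                  # 305: memory access present?
            if ne[mem_addr // block_size] == 1:   # block number N = Addr / BlockSize
                ne_inst_ac[ip] += 1               # 306: update the counter
        if executed >= max_instructions:          # 307: assumed end condition
            break
        branch = inst.get("branch")               # 308: update IP
        ip = branch if branch is not None else ip + inst["length"]
    return dict(ne_inst_ac)


# Hypothetical three-instruction loop; block 2 is assumed to be inefficient.
program = {
    0: {"length": 4, "mem": 0x40},   # accesses block 2 (0x40 // 32)
    4: {"length": 4, "mem": 0x00},   # accesses block 0
    8: {"length": 4, "branch": 0},   # jumps back to address 0
}
counts = measure_inefficient_accesses(program, ne=[0, 0, 1], block_size=32,
                                      start_ip=0, max_instructions=9)
```

Running nine instructions executes the loop body three times, so only the instruction at address 0 accumulates a count.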

Next, the inefficient block determination processing in the second half (309 to 312) will be described.

In 309, the inefficient instruction determination unit 1032 sets Addr to the minimum instruction storage address.
This value is system-dependent; in this embodiment it is set to 0.

In 310, the inefficient instruction determination unit 1032 determines whether each instruction is an inefficient instruction.
If the per-instruction inefficient access counter NE_InstAC[Addr] is greater than or equal to the threshold Th_NE_Inst, the instruction is judged to be inefficient and 1 is set in the result array NE_Inst[Addr]; otherwise 0 is set.
The threshold Th_NE_Inst is supplied from outside the system according to the purpose.
To list every instruction that accesses an inefficient block, set Th_NE_Inst = 1; to narrow the results further, set Th_NE_Inst to a larger value.

In 311, the inefficient instruction determination unit 1032 determines whether the loop has ended.
That is, it checks whether the maximum instruction storage address has been reached; if so, the processing ends.
Otherwise, in 312 the instruction storage address is incremented by 1 and the flow returns to 310, repeating the processing of 310 to 312.

When the above processing is complete, two arrays have been obtained: NE_Inst[], which indicates whether each instruction is an inefficient instruction, and NE_Inst_AC[], which holds the number of times each instruction accessed an inefficient block.
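
Steps 309 to 312 can be sketched as a simple scan over the instruction addresses. The Python below is a minimal illustration under stated assumptions: the counter is held in a dictionary (addresses without an entry count as 0), and the function name and parameters are hypothetical.

```python
def extract_inefficient_instructions(ne_inst_ac, max_addr, th_ne_inst, min_addr=0):
    """Return NE_Inst: 1 for addresses whose count reaches the threshold, else 0."""
    ne_inst = {}
    addr = min_addr                                       # 309: start at the minimum
    while True:
        count = ne_inst_ac.get(addr, 0)
        ne_inst[addr] = 1 if count >= th_ne_inst else 0   # 310: threshold test
        if addr >= max_addr:                              # 311: maximum reached?
            break
        addr += 1                                         # 312: next address
    return ne_inst


# Hypothetical counters: address 0 accessed inefficient blocks 3 times, address 4 once.
ne_inst_ac = {0: 3, 4: 1}
all_hits = extract_inefficient_instructions(ne_inst_ac, max_addr=8, th_ne_inst=1)
narrowed = extract_inefficient_instructions(ne_inst_ac, max_addr=8, th_ne_inst=2)
```

With a threshold of 1 both instructions are reported; raising it to 2 keeps only the instruction at address 0, which matches the narrowing effect described for Th_NE_Inst.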

FIG. 5 shows an output example.
For each Addr with NE_Inst[Addr] = 1, it displays the address Addr, the instruction Inst = Mem[Addr], and the number of accesses to inefficient blocks Ne_Inst_AC[Addr].

As described above, in this embodiment, a memory block in which the memory accesses during the simulation are concentrated on part of the data in the block is extracted as an inefficient block, and an instruction that caused a memory access to an inefficient block is extracted as an inefficient instruction.
Then, as illustrated in FIG. 5, each extracted inefficient instruction is displayed together with its address. This makes it possible to prevent frequently accessed data and rarely accessed data from being mixed in the same block of the cache memory, and thus to improve the utilization efficiency of the cache memory.

As described above, this embodiment has described an instruction execution analysis apparatus having the following means.
(a) Means for inputting the binary code of a program for a specific processor and an execution environment.
The execution environment refers to initial data and data that changes during execution due to external factors.
(b) Means for simulating the execution of the processor's instructions.
The processor is assumed to have a cache memory to shorten execution time when accessing memory.
(c) Means for extracting inefficient blocks in the memory accesses of instructions.
An inefficient block is a cache block in which only part of the data is used frequently and the rest is hardly used, so that the utilization efficiency of the block as a whole is low.
(d) Means for extracting inefficient instructions.
An inefficient instruction is an instruction that causes such an inefficient block.

This embodiment has also described, as a way of implementing the inefficient block extraction means:
(a) means for simulating the execution of the processor's instructions;
(b) means for measuring the number of memory accesses;
(c) means for counting, among the bytes in a block, those accessed with high frequency;
(d) means for counting, among the bytes in a block, those accessed with low frequency; and
(e) means for determining whether the block is inefficient using the number of high-frequency-access bytes and the number of low-frequency-access bytes.

This embodiment has also described an instruction execution analysis apparatus having the following means.
(a) Means for calculating a threshold for determining whether an access is high-frequency.
(b) Means for calculating a threshold for determining whether an access is low-frequency.
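
The block-level decision summarized above (per-byte access counts, high and low thresholds, and a decision based on the number of high-frequency and low-frequency bytes) can be sketched as follows. The concrete rule is not spelled out in this part of the text, so everything specific here is an assumption made for illustration: the thresholds are taken as fixed multiples of the average per-byte access count (`high_factor`, `low_factor`), and a block is treated as inefficient when it contains at least one high-frequency byte while at least half of its bytes are low-frequency.

```python
def extract_inefficient_blocks(access_counts, block_size,
                               high_factor=2.0, low_factor=0.1):
    """Return NE, one 0/1 flag per block, from per-byte access counts.

    Assumed rule (not taken from the specification): thresholds are multiples
    of the average per-byte count, and a block is inefficient when it has at
    least one "hot" byte and at least half of its bytes are "cold".
    """
    average = sum(access_counts) / len(access_counts)
    th_high = average * high_factor     # assumed high-frequency threshold
    th_low = average * low_factor       # assumed low-frequency threshold
    ne = []
    for start in range(0, len(access_counts), block_size):
        block = access_counts[start:start + block_size]
        n_high = sum(1 for c in block if c >= th_high)
        n_low = sum(1 for c in block if c <= th_low)
        ne.append(1 if n_high >= 1 and n_low >= len(block) / 2 else 0)
    return ne


# Two hypothetical 4-byte blocks: one skewed, one evenly used.
ne = extract_inefficient_blocks([100, 0, 0, 0, 10, 10, 10, 10], block_size=4)
```

In this example the first block has one heavily used byte and three unused ones and is flagged, while the evenly accessed second block is not.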

This embodiment has also described, as a way of implementing the inefficient instruction extraction means:
(a) means for simulating the execution of the processor's instructions;
(b) means for measuring, for each instruction, the number of accesses to inefficient blocks; and
(c) means for comparing, for every instruction, the number of accesses to inefficient blocks with a threshold.

Embodiment 2.
Next, a technique for using the instruction execution analysis method of Embodiment 1 more effectively will be described.
When a user actually applies the instruction execution analysis method according to Embodiment 1, it is preferable that instructions and hints be written directly in the source program rather than presented in a format such as that of FIG. 5.

For example, consider a source program such as that in FIG. 6.
The program is written in C syntax, and only the parts relevant to this embodiment are shown.
In FIG. 6, the first line defines the program name and the second line defines the variables and arrays.
There are three arrays, a, b, and c, each with 10000 elements.
Lines 3 to 6 form a loop over the control variable i, in which the sum of array b and array c is assigned to array a.
As shown on line 4, the index j of array b is obtained by dropping the ones digit of the variable i, so that j takes only multiples of 10.
Consequently, the number of accesses to array b differs from element to element, and array b forms inefficient blocks.
Arrays a and c, on the other hand, form efficient blocks.
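
The skew in the accesses to array b can be checked with a short count. FIG. 6 is not reproduced here, so the sketch below assumes the loop runs i from 0 to 9999 and computes the index as j = (i / 10) * 10 with integer division, which is one way to drop the ones digit; the Python merely models the access pattern and is not the program of FIG. 6.

```python
from collections import Counter

N = 10000  # number of elements in each array, as stated in the text


def access_counts_b(n=N):
    """Count how often each element of b is read by the assumed loop."""
    counts = Counter()
    for i in range(n):
        j = (i // 10) * 10   # assumed form of line 4: drop the ones digit of i
        counts[j] += 1
    return counts


counts = access_counts_b()
touched = len(counts)        # number of distinct elements of b that are read
```

Under this assumption only one element in ten is ever read, and each of those is read ten times, so every cache block holding part of b is mostly filled with unused data.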

FIG. 7 is an example of a source program marked with inefficient instructions.
Compared with the program in FIG. 6, the message /* Reference to array b[j] is inefficient */ has been inserted as the sixth line.
From this message, the user can see that the memory reference b[j] in the expression on line 5 is inefficient and that this array should be improved.
For example, it is easy to see that defining array b with the minimum necessary size (b[1000]) and setting j = int(i / 10) raises the utilization efficiency of array b[j].
In general, reducing the data size in this way improves the utilization efficiency of the cache memory and the performance of the system.
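
The suggested rewrite can be illustrated by comparing the two indexing schemes. This sketch assumes that the compact array is built from the elements of b that the original loop actually reads; the function names and test data are hypothetical.

```python
def compute_original(b, c):
    """a[i] = b[j] + c[i] with j restricted to multiples of 10."""
    return [b[(i // 10) * 10] + c[i] for i in range(len(c))]


def compute_compact(b_small, c):
    """Same result using the compact array and j = int(i / 10)."""
    return [b_small[i // 10] + c[i] for i in range(len(c))]


n = 10000
b = list(range(n))                              # hypothetical contents of b
c = [1] * n                                     # hypothetical contents of c
b_small = [b[k * 10] for k in range(n // 10)]   # only the elements actually used
```

Both versions produce the same array a, but the compact version reads b from 1000 contiguous elements instead of from 10000 elements of which nine in ten are never used.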

FIG. 8 illustrates a flow for realizing the above.
In the figure, 11 is a source program, for example that of FIG. 6.
12 is a compiler and linker.
13 is binary code.
14 is one of the linker outputs, a table that associates the source program with the binary code (hereinafter, the correspondence table), for example that of FIG. 9.
15 is an execution environment.
2 is the inefficient block extraction process.
3 is the inefficient instruction extraction process.
4 is a list of inefficient instructions.
5 is the process of linking inefficient instructions to the source program.
6 is a source program marked with inefficient instructions, for example that of FIG. 7.

FIG. 11 shows a configuration example of the instruction execution analysis apparatus 100 according to this embodiment.
In FIG. 11, elements with the same reference numerals as in FIG. 10 are the same as in FIG. 10, and their description is omitted.
The source program processing unit 104 performs the process 5 of FIG. 8, which links inefficient instructions to the source program, and inserts a message such as that shown in FIG. 7 into the source program.
That is, the source program processing unit 104 identifies, as the inefficient instruction description position, the position in the source program 600 (the source of the binary code containing the instructions to be simulated) where an inefficient instruction is written, and inserts at that position a message notifying that the instruction written there is an inefficient instruction.

The source program / binary code correspondence table 500 corresponds to the correspondence table 14 of FIG. 8 and is, for example, the one shown in FIG. 9.

The source program 600 corresponds to the source program 11 of FIG. 8.

The source program 700 marked with inefficient instructions corresponds to the marked source program 6 of FIG. 8 and is, for example, the one shown in FIG. 7.

The processing according to this embodiment is described below with reference to FIGS. 8 and 11.

When the source program 11 is processed by the compiler and linker 12, the executable binary code 13 and the correspondence table 14 are generated.
When the binary code 13 and the execution environment 15 are input to the inefficient block extraction process 2 (inefficient block extraction unit 102), a list of inefficient blocks is output as described in Embodiment 1.
When the list of inefficient blocks is input to the inefficient instruction extraction process 3 (inefficient instruction extraction unit 103), the inefficient instruction list 4 is output as described in Embodiment 1.
When the inefficient instruction list 4 and the correspondence table 14 are input to the process 5 of linking inefficient instructions to the source program (source program processing unit 104), the source program 6 marked with inefficient instructions, that is, a source program such as that of FIG. 7, is output.
Specifically, the correspondence table 14 is as shown in FIG. 9 and lists information such as the line number in the source program, the address of the corresponding instruction, the instruction itself, and the variable the instruction accesses.
The source program processing unit 104 looks up each instruction address listed in the inefficient instruction list (FIG. 5) in the correspondence table (FIG. 9) and retrieves its line number and variable name, thereby finding the source line to be corrected and the variable accessed by the inefficient instruction.
In this example, the first row of the table in FIG. 5 has instruction address = 1000H, so instruction address = 1000H is looked up in the correspondence table of FIG. 9, which shows that the source program line number is 5 and the variable name is array b[j].
Based on this information, the source program processing unit 104 inserts the message after line 5 of FIG. 6, the inefficient instruction description position, and thereby generates FIG. 7.
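
The linking step described above can be sketched as a table lookup followed by a line insertion. In the Python below, only the entry mapping instruction address 1000H to line 5 and variable b[j] comes from the text; the source lines are a hypothetical reconstruction of FIG. 6, and the table layout, field names, and function name are assumptions for illustration.

```python
def mark_source(source_lines, inefficient_addrs, table):
    """Insert a message after each source line linked to an inefficient instruction."""
    by_addr = {row["addr"]: row for row in table}
    messages = {}
    for addr in inefficient_addrs:
        row = by_addr.get(addr)
        if row is None:
            continue  # address not present in the correspondence table
        text = f"/* Reference to array {row['var']} is inefficient */"
        messages.setdefault(row["line"], []).append(text)
    marked = []
    for lineno, line in enumerate(source_lines, start=1):
        marked.append(line)
        marked.extend(messages.get(lineno, []))   # message goes after the line
    return marked


# Hypothetical stand-in for FIG. 6 (six lines, loop on lines 3 to 6).
src = [
    "sample()",
    "int i, j, a[10000], b[10000], c[10000];",
    "for (i = 0; i < 10000; i++) {",
    "    j = (i / 10) * 10;",
    "    a[i] = b[j] + c[i];",
    "}",
]
# Correspondence table with the one entry given in the text (address 1000H).
table = [{"line": 5, "addr": 0x1000, "var": "b[j]"}]
marked = mark_source(src, [0x1000], table)
```

The message lands as the sixth line of the output, directly after the expression on line 5, which mirrors the relationship between FIG. 6 and FIG. 7 described in the text.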

As described above, according to this embodiment, the locations to be corrected to improve the utilization efficiency of the cache memory can be presented to the user in the source program.

As described above, this embodiment has described an instruction execution analysis apparatus having the following means in addition to the configuration of Embodiment 1.
(a) Means for linking inefficient instructions to the source program using the correspondence table between the source program and the binary code (a linker output).
(b) Means for marking the variable names in the source program that are linked to inefficient instructions.

Finally, an example of a hardware configuration for realizing the instruction execution analysis apparatus 100 of Embodiments 1 and 2 will be described.
FIG. 12 is a diagram showing an example of hardware resources for realizing the instruction execution analysis apparatus 100 of Embodiments 1 and 2.
Note that the configuration in FIG. 12 is only one example of a hardware configuration; the instruction execution analysis apparatus 100 is not limited to it and may be realized by other configurations.

In FIG. 12, the instruction execution analysis apparatus 100 includes a CPU 911 (Central Processing Unit; also called a central processing device, processing device, arithmetic device, microprocessor, microcomputer, or processor) that executes programs.
The CPU 911 is connected via a bus 912 to, for example, a ROM (Read Only Memory) 913, a RAM (Random Access Memory) 914, a communication board 915, a display device 901, a keyboard 902, a mouse 903, and a magnetic disk device 920, and controls these hardware devices.
The CPU 911 may also be connected to an FDD 904 (Flexible Disk Drive), a compact disk device 905 (CDD), a printer device 906, and a scanner device 907. Instead of the magnetic disk device 920, a storage device such as an SSD (Solid State Drive), an optical disk device, or a memory card (registered trademark) read/write device may be used.
The RAM 914 is an example of a volatile memory. The storage media of the ROM 913, the FDD 904, the CDD 905, and the magnetic disk device 920 are examples of nonvolatile memory. These are examples of storage devices.
The communication board 915, the keyboard 902, the mouse 903, the scanner device 907, the FDD 904, and the like are examples of input devices.
The communication board 915, the display device 901, the printer device 906, and the like are examples of output devices.

The communication board 915 is connected to a network.
For example, the communication board 915 is connected to a LAN (local area network), the Internet, a WAN (wide area network), a SAN (storage area network), or the like.

The magnetic disk device 920 stores an operating system 921 (OS), a window system 922, a program group 923, and a file group 924.
The CPU 911 executes the programs of the program group 923 while using the operating system 921 and the window system 922.

The RAM 914 temporarily stores at least part of the programs of the operating system 921 and of the application programs to be executed by the CPU 911.
The RAM 914 also stores various data needed for processing by the CPU 911.

The ROM 913 stores a BIOS (Basic Input Output System) program, and the magnetic disk device 920 stores a boot program.
When the instruction execution analysis apparatus 100 starts up, the BIOS program in the ROM 913 and the boot program on the magnetic disk device 920 are executed, and together they start the operating system 921.

The program group 923 stores programs that execute the functions described as “... unit” and “... means” in Embodiments 1 and 2. The programs are read and executed by the CPU 911.

The file group 924 stores, as items of “... file” or “... database”, the information, data, signal values, variable values, and parameters that represent the results of the processing described in Embodiments 1 and 2 as “determination of ...”, “simulation of ...”, “calculation of ...”, “search of ...”, “measurement of ...”, “update of ...”, “setting of ...”, “processing of ...”, “insertion of ...”, “selection of ...”, “input of ...”, “output of ...”, and so on.
The “... file” and “... database” items are stored on a recording medium such as a disk or memory.
The information, data, signal values, variable values, and parameters stored on a storage medium such as a disk or memory are read into main memory or cache memory by the CPU 911 via a read/write circuit.
The information, data, signal values, variable values, and parameters thus read are used for CPU operations such as extraction, search, reference, comparison, operation, calculation, processing, editing, output, printing, and display.
During these CPU operations, the information, data, signal values, variable values, and parameters are temporarily stored in main memory, registers, cache memory, buffer memory, and the like.
The arrows in the flowcharts described in Embodiments 1 and 2 mainly indicate the input and output of data and signals.
Data and signal values are recorded on recording media such as the memory of the RAM 914, the flexible disk of the FDD 904, the compact disk of the CDD 905, the magnetic disk of the magnetic disk device 920, and other optical disks, mini disks, and DVDs.
Data and signals are transmitted online via the bus 912, signal lines, cables, and other transmission media.

What is described as “... unit” or “... means” in Embodiments 1 and 2 may be a “... circuit”, “... device”, or “... equipment”, and may also be a “... step”, “... procedure”, or “... process”.
That is, the “instruction execution analysis method” according to the present invention can be realized by the steps, procedures, and processes shown in the flowcharts described in Embodiments 1 and 2.
What is described as “... unit” or “... means” may also be realized by firmware stored in the ROM 913.
Alternatively, it may be implemented by software alone, by hardware alone (elements, devices, boards, wiring, and the like), by a combination of software and hardware, or further in combination with firmware.
Firmware and software are stored as programs on recording media such as magnetic disks, flexible disks, optical disks, compact disks, mini disks, and DVDs.
The programs are read and executed by the CPU 911.
That is, the programs cause a computer to function as the “... unit” and “... means” of Embodiments 1 and 2, or cause a computer to execute the procedures and methods of the “... unit” and “... means” of Embodiments 1 and 2.

As described above, the instruction execution analysis apparatus 100 of Embodiments 1 and 2 is realized by a computer comprising a CPU as a processing device; a memory, a magnetic disk, and the like as storage devices; a keyboard, a mouse, a communication board, and the like as input devices; and a display device, a communication board, and the like as output devices.
As described above, the functions indicated as “... unit” and “... means” are realized using these processing, storage, input, and output devices.

Description of symbols: 100 instruction execution analysis apparatus, 101 simulation unit, 102 inefficient block extraction unit, 103 inefficient instruction extraction unit, 104 source program processing unit, 1021 memory access measurement unit, 1022 inefficient block determination unit, 1031 inefficient block access measuring unit, 1032 inefficient instruction determination unit.

Claims

1. An instruction execution analysis apparatus comprising:
an inefficient block extraction unit that analyzes the status of memory access to a main memory device in a simulation in which instructions are executed with memory access to the main memory device and, based on the analysis result, extracts as an inefficient block, from among a plurality of memory blocks in the main memory device, a memory block whose memory accesses are concentrated on part of the data in the memory block; and
an inefficient instruction extraction unit that extracts, as an inefficient instruction, an instruction satisfying a predetermined condition from among the instructions that caused memory access to the inefficient block extracted by the inefficient block extraction unit.

2. The instruction execution analysis apparatus according to claim 1, wherein the instruction execution analysis apparatus is connected to a display device, and the inefficient instruction extraction unit outputs the extracted inefficient instruction to the display device together with the address of the inefficient instruction.

3. The instruction execution analysis apparatus according to claim 1, wherein the inefficient block extraction unit extracts inefficient blocks from among a number of memory blocks corresponding to the number of cache blocks of a predetermined cache memory.

4. The instruction execution analysis apparatus according to claim 1, wherein the inefficient block extraction unit extracts, for each memory block and from among the data in the memory block, high-access data with a high access frequency during the simulation and low-access data with a low access frequency during the simulation, and extracts inefficient blocks based on the number of high-access data and the number of low-access data in each memory block.

5. The instruction execution analysis apparatus according to claim 4, wherein the inefficient block extraction unit derives an extraction criterion for extracting high-access data and an extraction criterion for extracting low-access data based on the average number of memory accesses in the plurality of memory blocks during the simulation.

6. The instruction execution analysis apparatus according to claim 1, wherein the inefficient instruction extraction unit analyzes the status of memory access to the main memory device in the simulation in which instructions are executed with memory access to the main memory device, identifies, based on the analysis result, the instructions that caused memory access to the inefficient block, and extracts as an inefficient instruction, from among the identified instructions, an instruction that caused a predetermined number of memory accesses to the inefficient block.

7. The instruction execution analysis apparatus according to any one of claims 1 to 6, further comprising a source program processing unit that identifies, as an inefficient instruction description position, the position in the source program of the binary code containing the instructions executed in the simulation at which the inefficient instruction extracted by the inefficient instruction extraction unit is written, and inserts, at the inefficient instruction description position in the source program, a notification message notifying that the instruction written at that position is an inefficient instruction.

8. The instruction execution analysis apparatus according to claim 7, wherein the source program processing unit inserts, at the inefficient instruction description position in the source program, a notification message notifying that the setting of the variable included in the instruction written at that position is inefficient.

9. An instruction execution analysis method comprising:
an inefficient block extraction step in which a computer analyzes the status of memory access to a main memory device in a simulation in which instructions are executed with memory access to the main memory device and, based on the analysis result, extracts as an inefficient block, from among a plurality of memory blocks in the main memory device, a memory block whose memory accesses are concentrated on part of the data in the memory block; and
an inefficient instruction extraction step in which the computer extracts, as an inefficient instruction, an instruction satisfying a predetermined condition from among the instructions that caused memory access to the inefficient block extracted in the inefficient block extraction step.

10. A program that causes a computer to execute:
an inefficient block extraction step of analyzing the status of memory access to a main memory device in a simulation in which instructions are executed with memory access to the main memory device and, based on the analysis result, extracting as an inefficient block, from among a plurality of memory blocks in the main memory device, a memory block whose memory accesses are concentrated on part of the data in the memory block; and
an inefficient instruction extraction step of extracting, as an inefficient instruction, an instruction satisfying a predetermined condition from among the instructions that caused memory access to the inefficient block extracted in the inefficient block extraction step.