JP5874433B2

JP5874433B2 - Trace coupling device and program

Info

Publication number: JP5874433B2
Application number: JP2012034559A
Authority: JP
Inventors: 藤本　博昭
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2012-02-20
Filing date: 2012-02-20
Publication date: 2016-03-02
Anticipated expiration: 2032-02-20
Also published as: JP2013171410A

Description

The present invention relates to a trace combining device and a program that combine the traces of a plurality of processors in order to evaluate, by performance simulation, a system equipped with a plurality of processors.

Performance simulators have conventionally been used to estimate performance when a processor architecture is being studied. When a program such as a benchmark test is executed, trace data containing every instruction processed by the processor is collected. The trace data and the architecture parameters (cache size, cache protocol, pipeline specification, and so on) are then input to the performance simulator, which runs a performance simulation based on the trace data and outputs performance information for the processor under evaluation, such as the cache hit rate and the CPI (clock cycles per instruction). The optimum architecture is determined by trying various parameters in this way.

In recent years, simulations have also been performed for cases in which a plurality of cores operate simultaneously, as in multiprocessor and multicore systems. To simulate a single core, that is, one CPU (Central Processing Unit), the performance simulation can be run by sequentially inputting a single set of trace data. In the multicore case, however, there have been two problems. First, when the processors can execute instructions in parallel, processing continues during inter-processor communication, so the trace data may change dynamically. Second, when a simulator for a single processor is used, collecting the trace data of each processor for every run makes the time required for the simulation enormous.

It has been proposed, for example, to estimate the effective processing execution time within each processor from the execution trace data collected for that processor, to simulate the multiprocessing system using the execution times estimated for each processor, and to evaluate the performance of the multiprocessing system from the simulation results.

Japanese Patent Laid-Open No. 11-096130

In the conventional technique described above, however, the processing execution time is estimated separately for each processor's trace data, so as many trace data transfer paths are needed as there are processors. Consequently, in the performance simulation, the configuration for transferring trace data depends on the number of processors.

Accordingly, an object of the present invention is to provide a trace combining device and a program that combine the traces of a plurality of processors in order to evaluate, by performance simulation, a system equipped with a plurality of processors.

The disclosed technique is configured as a trace combining device having: a storage unit that stores a plurality of trace data, each corresponding to one of a plurality of processors operating in one system; and a trace combining processing unit that combines the traces of the processors by selecting one of the plurality of trace data stored in the storage unit in a predetermined order, reading the traced instructions from each trace data one at a time, rearranging them in segments delimited by synchronization instructions, and appending them to transfer trace data in the storage unit.

As means for solving the above problem, the technique may also take the form of a program that causes a computer to function as the trace combining device, a recording medium on which the program is recorded, or a trace combining method.

In the disclosed technique, the instructions contained in the plurality of trace data, each corresponding to one of the plurality of processors, are rearranged in segments delimited by the synchronization instructions, creating a single set of transfer trace data in which the plurality of trace data are combined. Using transfer trace data created in this way in a performance simulation ensures that the instructions following a synchronization instruction are input only after the synchronization instructions for all processors have completed.
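The effect described above can be illustrated with a minimal Python sketch. It is only one plausible reading of "rearranging in segments delimited by synchronization instructions" and does not reproduce the patent's own flowchart; the function name `combine_traces`, the string representation of instructions, and the `"sync"` marker are assumptions made for illustration.

```python
def combine_traces(traces, is_sync=lambda instr: instr == "sync"):
    """Merge per-CPU traces into one list of (cpu, instruction) pairs.

    Hypothetical sketch: within each segment the CPUs are visited in a fixed
    order, one instruction per visit. A CPU leaves the segment once its
    synchronization instruction has been emitted (or its trace is exhausted),
    so no post-sync instruction is emitted before every CPU has emitted its
    sync for that segment.
    """
    positions = [0] * len(traces)
    combined = []
    while any(pos < len(t) for pos, t in zip(positions, traces)):
        # CPUs that still have instructions take part in this segment.
        active = [cpu for cpu, t in enumerate(traces) if positions[cpu] < len(t)]
        while active:
            still_active = []
            for cpu in active:
                instr = traces[cpu][positions[cpu]]
                combined.append((cpu, instr))
                positions[cpu] += 1
                if not is_sync(instr) and positions[cpu] < len(traces[cpu]):
                    still_active.append(cpu)
            active = still_active
    return combined


example = combine_traces([["a0", "sync", "a1"], ["b0", "b1", "sync", "b2"]])
```

In this example both `"sync"` entries appear in the combined list before `a1` and `b2`, which is the ordering property the paragraph above describes.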

FIG. 1 is a diagram for explaining the basic configuration for performance simulation.
FIG. 2 is a diagram showing the hardware configuration of a computer device that runs the performance simulator of FIG. 1 as software.
FIG. 3 is a diagram for explaining the outline of the functional configuration of the computer device of FIG. 2.
FIG. 4 is a diagram showing the hardware configuration of a system that implements the performance simulator of FIG. 1 in hardware.
FIG. 5 is a diagram for explaining the outline of the configuration of the system of FIG. 4.
FIG. 6 is a diagram showing a configuration example for inputting the trace data of a plurality of cores through one path.
FIG. 7 is a diagram showing a configuration example in which instructions are rearranged at transfer time.
FIG. 8 is a diagram showing a configuration example having a buffer for each core.
FIG. 9 is a diagram showing a configuration example in which instructions are rearranged in advance to create one trace.
FIG. 10 is a diagram for explaining instruction execution timing on an actual machine.
FIG. 11 is a diagram for explaining execution timing when simply rearranged trace data is used.
FIG. 12 is a diagram for comparing the deviation in execution timing between the actual machine and a performance simulation using the rearranged trace data of FIG. 11.
FIG. 13 is a diagram for explaining execution timing when the program includes synchronization instructions.
FIG. 14 is a diagram for explaining a case in which the performance simulator stops operating.
FIG. 15 is a diagram for explaining the outline of the trace data creation method.
FIG. 16 is a diagram showing a functional configuration example for creating transfer trace data.
FIG. 17 is a flowchart for explaining the trace combining processing performed by the trace combining processing unit.
FIG. 18 is a diagram for explaining the buffer output processing in step S17 of FIG. 17.
FIG. 19 is a diagram showing an example of execution timing delimited by synchronization instructions.
FIG. 20 is a diagram for comparing the deviation in execution timing between the actual machine and a performance simulation using the transfer trace data of FIG. 19.

Embodiments of the present invention are described below with reference to the drawings. The inventor analyzed the drop in processing performance that occurs when trace data containing synchronization instructions is used in a performance simulation for evaluating a system in which a plurality of processors (CPUs (Central Processing Units)) operate simultaneously, such as a multiprocessor or multicore system. The present embodiment describes how to create trace data that improves this processing performance even when the trace data contains synchronization instructions.

First, the basic configuration for performance simulation is described with reference to FIG. 1. In the system 1000 shown in FIG. 1, trace data 2 and architecture setting information 4 are input to the performance simulator 5, and running the performance simulator 5 causes it to output a performance information report 8 for the processor under evaluation.

The trace data 2 is a data file containing instructions 3 in execution order, obtained in advance by executing a program on an actual machine, an ISS (Instruction Set Simulator), or the like. Each instruction 3 is data indicating a PC (Program Counter) value, an instruction, an address, and data. The architecture setting information 4 indicates parameters of the processor under evaluation, such as the cache size, cache protocol, and pipeline specification.
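As a rough illustration of these two inputs, the following Python sketch models one trace entry and a set of architecture parameters. The patent does not specify a file format, so the whitespace-separated hexadecimal text layout, the class and function names, and every parameter value below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    """One traced instruction: PC, instruction, address, data (assumed layout)."""
    pc: int
    instruction: str
    address: int
    data: int


def parse_record(line: str) -> TraceRecord:
    """Parse a line such as '0x1000 load 0x2000 0x2a' (hypothetical format)."""
    pc, instruction, address, data = line.split()
    return TraceRecord(int(pc, 16), instruction, int(address, 16), int(data, 16))


# Illustrative architecture settings only; the keys and values are assumptions.
architecture_settings = {
    "cache_size_kib": 32,
    "cache_protocol": "MESI",
    "pipeline_stages": 5,
}

record = parse_record("0x1000 load 0x2000 0x2a")
```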

The performance simulator 5 simulates the timing of the pipeline, memory accesses, and so on without actually executing the instructions. As described later, the performance simulator 5 is implemented in software or hardware.

The performance information report 8 presents the performance evaluation results, including the CPI (clock cycles per instruction) and the cache hit rate.
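The two metrics named here follow their usual definitions: CPI is the total number of clock cycles divided by the number of instructions executed, and the cache hit rate is the number of hits divided by the number of accesses. A minimal Python sketch, with function names chosen only for illustration:

```python
def cycles_per_instruction(total_cycles: int, instructions: int) -> float:
    """CPI = total clock cycles / number of instructions executed."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return total_cycles / instructions


def cache_hit_rate(hits: int, accesses: int) -> float:
    """Hit rate = cache hits / total cache accesses."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return hits / accesses
```

For example, a run that takes 20 cycles to complete 10 instructions has a CPI of 2.0.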

FIG. 2 shows the hardware configuration of the computer device 100a that executes the software in a system 1000a in which the performance simulator 5 of FIG. 1 is implemented as software (a program). In other words, FIG. 2 shows the hardware configuration that allows the computer device 100a to operate as the performance simulator 5.

The computer device 100a has a CPU 11, a ROM (Read-Only Memory) 12, a RAM (Random Access Memory) 13, a hard disk drive 14, an input device 15, an output device 16, a communication I/F 17, and a drive 18, all of which are connected to a bus B.

The CPU 11 controls the computer device 100a according to a program stored in the ROM 12 or the RAM 13. The ROM 12 and the RAM 13, which serve as main storage, store the programs executed by the CPU 11, the data needed for processing by the CPU 11, the data obtained from that processing, and so on. Part of the RAM 13 is allocated as a work area for processing by the CPU 11. The hard disk drive 14, which serves as auxiliary storage, stores data such as the programs that execute various processes.

The storage unit 30 comprises the storage areas of the ROM 12, the RAM 13, the hard disk drive 14, and so on. It stores the trace data 2, the architecture setting information 4, the performance information report 8, and other data.

The input device 15 includes a mouse, a keyboard, and the like, and is used by the user to enter the various information that the computer device 100a needs for processing. The output device 16 includes a display device such as an LCD (Liquid Crystal Display) or CRT (Cathode Ray Tube), a printer, and the like. Under the control of the CPU 11, it displays the necessary information and/or outputs information in response to user instructions. The performance information report 8 is displayed and/or output on the output device 16.

The communication I/F 17 is an interface for connecting to, for example, the Internet or a LAN (Local Area Network), and controls communication with external devices.

The program that implements the processing performed by the computer device 100a is provided to the computer device 100a on a storage medium 19 such as a CD-ROM (Compact Disc Read-Only Memory). That is, when the storage medium 19 holding the program is set in the drive 18, the drive 18 reads the program from the storage medium 19, and the program is installed on the hard disk drive 14 via the bus B. When the program is started, the CPU 11 begins processing according to the program installed on the hard disk drive 14.

The medium that stores the program is not limited to a CD-ROM; any computer-readable medium may be used. Besides a CD-ROM, the computer-readable storage medium may be a portable recording medium such as a DVD or a USB memory, or a semiconductor memory such as a flash memory.

The program that implements the processing performed by the computer device 100a may also be provided from an external device via the communication I/F 17. Alternatively, the program may be provided to an external device so that each process described later is carried out on that external device. Communication through the communication I/F 17 is not restricted to either wireless or wired communication.

FIG. 3 is a diagram for explaining the outline of the functional configuration of the computer device 100a of FIG. 2. FIG. 3 uses four CPUs as an example of multiple cores, but the number is not limited to four. The same applies to the subsequent figures.

In FIG. 3, when CPU0 to CPU3 are the evaluation targets, trace data 200 to 203 corresponding to CPU0 to CPU3 are prepared and stored in the storage unit 30. Trace data 200 is for CPU0, trace data 201 is for CPU1, trace data 202 is for CPU2, and trace data 203 is for CPU3.

The computer device 100a implements a performance simulation unit 50a by having the CPU 11 execute a predetermined program. The performance simulation unit 50a is a processing unit that performs the function of the performance simulator 5.

The performance simulation unit 50a contains CPU0 to CPU3 and so on as CPU models expressed in a hardware description language or the like. Only CPU0 to CPU3 are shown as models in this example, but the performance simulation unit 50a also includes models of the memory, buses, and other components.

When the trace data 200 to 203 are input from the storage unit 30 to the performance simulation unit 50a, the performance simulation unit 50a evaluates the performance of each of CPU0 to CPU3. The performance information report 8, which presents the results of this evaluation, is output to and stored in the storage unit 30.

Next, FIG. 4 shows the hardware configuration of a system 1000b in which the performance simulator 5 of FIG. 1 is implemented in hardware such as a dedicated device. The system 1000b shown in FIG. 4 has a computer device 100b and a performance simulation device 50b. In the hardware configuration of the computer device 100b shown in FIG. 4, hardware identical to that of the computer device 100a shown in FIG. 2 is given the same reference numerals, and its description is omitted.

Unlike the computer device 100a shown in FIG. 2, the computer device 100b is a host PC (Personal Computer) that acts as the host device for the performance simulation device 50b. In addition to the hardware configuration of the computer device 100a, it has a performance simulation device I/F 20 for connecting to and communicating with the performance simulation device 50b.

The trace data 2, the architecture setting information 4, and so on are input from the storage unit 30 of the computer device 100b to the performance simulation device 50b via the performance simulation device I/F 20, and the performance simulation device 50b performs the performance evaluation.

The performance information report 8, which presents the results of the performance evaluation executed by the performance simulation device 50b, is stored in the storage unit 30 of the computer device 100b via the performance simulation device I/F 20.

The performance simulation device 50b is a device such as an FPGA (Field-Programmable Gate Array). It receives the necessary information from the computer device 100b acting as the host device, evaluates the performance of the processor under evaluation, and outputs the performance information report 8 presenting the results.

FIG. 5 is a diagram for explaining the outline of the configuration of the system 1000b of FIG. 4. In FIG. 5, the performance simulation device 50b has the performance simulator 5, an I/F 51b, and an I/F 52b.

The performance simulator 5 performs the simulation while reading the instructions 3 of the trace data 200 to 203 in step with the operation of the corresponding CPU0 to CPU3, each of which is expressed as a model in a hardware description language.

The I/F 51b is an interface that receives the trace data 200 to 203 from the computer device 100b, the host PC, and inputs them to the performance simulator 5. The corresponding CPU0 to CPU3 are simulated according to the trace data 200 to 203 input from the I/F 51b.

The I/F 52b is an interface for outputting the performance information report 8, which presents the results of the evaluation of CPU0 to CPU3 by the performance simulator 5, to an external device. When the external device is the computer device 100b, the performance information report 8 is transferred to the computer device 100b via the I/F 52b and stored in the storage unit 30. The performance information report 8 stored in the storage unit 30 is output to the output device 16 as a result display under the control of the CPU 11. Alternatively, when the I/F 52b is connected to an external result-display device other than the computer device 100b, the performance information report 8 may be displayed on that external device.

In this configuration of the system 1000b, a separate path to the computer device 100b (host PC) is provided for each of CPU0 to CPU3, and data is transferred independently over each path. Consequently, as the number of cores increases, the resources required for transfer paths such as PCI Express also increase, which raises the cost.

It is therefore conceivable to input the trace data 200 to 203, corresponding to the respective cores, through a single path. FIG. 6 is a diagram showing a configuration example for inputting the trace data of a plurality of cores through one path.

In the system 1000c shown in FIG. 6, the trace data 200 to 203 corresponding to CPU0 to CPU3, which represent the plurality of cores, are input to the I/F 51b through a single path. For this purpose, the performance simulation device 50b additionally has a trace receiving unit 6.

For the performance simulation device 50b, trace data 2b is prepared and stored in the storage unit 30. Each instruction 3b in the trace data 2b indicates a CPU number identifying one of CPU0 to CPU3 of the performance simulator 5, a PC (Program Counter) value, an instruction, an address, data, and so on.

The trace receiving unit 6 has a distribution processing unit 6b and buffers 0 to 3, which correspond to CPU0 to CPU3 of the performance simulator 5. The trace receiving unit 6 receives the trace data 2b transferred serially from the computer device 100b, the host PC, and writes each instruction 3b to the buffer for the CPU identified by its CPU number. When the target buffer is full, the unit reads no further trace data 2b and waits until that buffer has space.

The distribution processing unit 6b stores each instruction 3b input via the I/F 51b in the one of buffers 0 to 3 that corresponds to its CPU number. For example, when the CPU number of an instruction 3b specifies CPU0, the instruction 3b is stored in buffer 0 and supplied to CPU0 of the performance simulator 5.
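The behaviour of the trace receiving unit 6 and the distribution processing unit 6b can be sketched in Python as follows. This is a software illustration under assumptions, not the hardware design: the function name `distribute`, the `(cpu_number, payload)` tuple layout, and the fixed `capacity` are hypothetical.

```python
from collections import deque


def distribute(stream, buffers, capacity):
    """Route (cpu_number, payload) entries to per-CPU buffers in stream order.

    Stops at the first entry whose target buffer is full, mirroring the rule
    that no further trace data is read until that buffer has space.
    Returns the number of entries consumed from the stream.
    """
    consumed = 0
    for cpu_number, payload in stream:
        if len(buffers[cpu_number]) >= capacity:
            break  # stall: later entries wait, even those for other CPUs
        buffers[cpu_number].append(payload)
        consumed += 1
    return consumed


buffers = [deque() for _ in range(4)]
stream = [(0, "a"), (1, "b"), (0, "c"), (0, "d"), (2, "e")]
consumed = distribute(stream, buffers, capacity=2)
```

With a capacity of two, the fourth entry finds buffer 0 full, so the entry for CPU2 behind it is not delivered even though buffer 2 is empty. Because the stream is serial, the order of the combined trace directly affects when each CPU model receives its instructions.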

Three configurations, described below, are conceivable for transferring the trace data 200 to 203 through one path: a configuration that rearranges the instructions 3b of the cores (CPU0 to CPU3) at transfer time (FIG. 7); a configuration that has buffers for storing each core's trace data in advance (FIG. 8); and a configuration that rearranges the trace data instructions 3b across the cores (CPU0 to CPU3) in advance to create a single trace (FIG. 9).

FIG. 7 is a diagram showing a configuration example in which instructions are rearranged at transfer time. In the configuration shown in FIG. 7, the computer device 100b, the host PC, has a transfer processing unit 33. The computer device 100b does not rearrange the instructions 3b in advance. Instead, in response to the request that the performance simulator 5 sends at each instruction fetch by CPU0 to CPU3, the transfer processing unit 33 retrieves the instruction 3b from the corresponding trace data one instruction at a time.

With this configuration, the simulation can run with the same timing as the actual machine, because the performance simulator 5 is halted from the time it issues a request until a response containing the instruction 3b arrives. However, because a transfer to and from the computer device 100b (host PC) takes place for every instruction, the processing by the transfer processing unit 33 becomes a bottleneck and sufficient execution speed cannot be obtained.

FIG. 8 is a diagram showing a configuration example having a buffer for each core. In the configuration shown in FIG. 8, the trace receiving unit 6 of the performance simulation device 50b has buffers 0 to 3, each large enough to hold an entire trace, corresponding to CPU0 to CPU3 of the performance simulator 5.

The trace data 200 to 203 corresponding to CPU0 to CPU3 are transferred from the computer device 100b (host PC) via the I/F 51b and stored in buffers 0 to 3 of the trace receiving unit 6 before the performance simulator 5 starts operating. However, trace data is generally large (on the order of gigabytes, for example), so holding all of it on the performance simulation device 50b side is considered difficult.

FIG. 9 is a diagram showing a configuration example in which the instructions are rearranged in advance to create a single trace. In the configuration shown in FIG. 9, before the performance simulator 5 starts operating, the computer device 100b (host PC) rearranges the instructions 3b of CPU0 to CPU3 from the trace data 200 to 203 into a single set of rearranged trace data 210 and keeps it ready in the storage unit 30.

When the performance simulator 5 runs, the transfer processing unit 33 of the computer device 100b (host PC) only has to transfer the instructions 3b in order from the head of the single rearranged trace data 210, which prevents the transfer speed from dropping.

Next, an example of the rearranged trace data 210 in which the instructions 3 of the cores are simply interleaved is described. FIG. 10 is a diagram for explaining instruction execution timing on an actual machine. In FIG. 10, each of the four CPUs, CPU0 to CPU3, executes ten instructions in order from CPUn-0 to CPUn-9, and the position of each instruction indicates the execution timing 9 on the actual machine.

For CPU0, for example, the figure shows instruction CPU0-0 executing in clock cycle 1, CPU0-1 in cycle 3, CPU0-2 in cycle 4, CPU0-3 in cycle 6, and so on. Gaps between executions have various causes, such as cache misses, waits on data dependencies, and taken branches.

Trace data carries no timing information between instructions. In the execution timing 9 of the actual machine, on the other hand, even for the same ten instructions the delay of each instruction differs from core to core, and so does the instruction fetch timing.

FIG. 11 illustrates an example of simple rearrangement, in which one instruction at a time is taken from each CPU in execution order, going from CPU0 to CPU3 (the direction of the dotted arrows), and this is repeated. FIG. 11 is a diagram for explaining execution timing when such simply rearranged trace data is used.
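The simple rearrangement described here amounts to a round-robin interleave of the per-CPU traces. A minimal Python sketch, in which the function name and the `(cpu, instruction)` pair layout are assumptions for illustration:

```python
def interleave_round_robin(traces):
    """Take one instruction at a time from CPU0, CPU1, ... and repeat.

    Returns a list of (cpu, instruction) pairs. Traces of unequal length are
    handled by skipping CPUs whose trace is already exhausted.
    """
    merged = []
    longest = max((len(trace) for trace in traces), default=0)
    for k in range(longest):
        for cpu, trace in enumerate(traces):
            if k < len(trace):
                merged.append((cpu, trace[k]))
    return merged


merged = interleave_round_robin([["CPU0-0", "CPU0-1"], ["CPU1-0", "CPU1-1"]])
```

The result ignores when each instruction actually executed on the real machine, which is the source of the timing deviation discussed next.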

In FIG. 11, when the rearranged trace data 210 is used, a delay that depends on the trace order produced by the rearrangement occurs in addition to the delays of actual instruction execution shown in FIG. 10. For example, in the execution timing 9 of the actual machine shown in FIG. 10, the CPU2-2 instruction executes before the CPU1-2 instruction. In FIG. 11, however, the input trace order is CPU1-2 -> CPU2-2, so the CPU2-2 instruction cannot be input until the CPU1-2 instruction has been executed.

Moreover, delays accumulate according to the trace order, so the error between the execution timing 9 of the actual machine and the execution timing in the performance simulation grows.

FIG. 12 is a diagram showing the deviation in execution timing between the actual machine and a performance simulation that uses the rearranged trace data of FIG. 11. Comparing execution timing on the actual machine and in the performance simulator 5, FIG. 12(A) shows the deviation for CPU0, FIG. 12(B) for CPU1, FIG. 12(C) for CPU2, and FIG. 12(D) for CPU3.

At the point where ten instructions have completed, the performance simulation by the performance simulator 5 finishes later than the actual machine by d0 for CPU0 in FIG. 12(A), by d1 for CPU1 in FIG. 12(B), by d2 for CPU2 in FIG. 12(C), and by d3 for CPU3 in FIG. 12(D).

In this example, execution timing deviates on each of CPU0 to CPU3, but the performance simulation itself runs to completion.

With regard to such deviations in execution timing, multiprocessor programs often use synchronization instructions, such as barrier synchronization instructions, that synchronize multiple CPUs. In that case, transferring a simply rearranged trace not only shifts the timing but can also cause the performance simulation to stop partway through. An example is described with reference to FIG. 13.

FIG. 13 is a diagram for explaining execution timing when the program contains synchronization instructions. In the execution timing shown in FIG. 13, the CPU0-4, CPU1-4, CPU2-2, and CPU3-6 instructions are synchronization instructions. A synchronization instruction must complete at the same time on all of the designated CPUs. In other words, the instruction that follows the synchronization instruction is executed simultaneously on all the CPUs so that they are synchronized. This synchronization guarantees that data is correct when it is shared through the same memory area.

In the execution timing shown in FIG. 13, suppose, for example, that CPU1 writes data to a shared memory area with the CPU1-3 instruction, and that CPU0, CPU2, and CPU3 are to read the shared memory area with the CPU0-5, CPU2-3, and CPU3-7 instructions only after that write has definitely completed. The synchronization instructions CPU0-4, CPU1-4, CPU2-2, and CPU3-6 are inserted for this purpose.

In this example, after CPU1-4, the last synchronization instruction to be executed, CPU0 to CPU3 execute their next instructions CPU0-5, CPU1-5, CPU2-3, and CPU3-7.

In such a case, with the simple rearrangement method described with FIG. 11, the synchronization instructions put CPU0 to CPU3 into a waiting state as shown in FIG. 13, and the performance simulator 5 may furthermore stop partway through.

FIG. 14 is a diagram for explaining a case in which the performance simulator 5 stops. In FIG. 14, instructions from the rearranged trace data 210 are input to the trace receiving unit 6 of the performance simulation apparatus 50b in the order CPU0-0, CPU1-0, CPU2-0, CPU3-0, CPU0-1, and so on. Each instruction is distributed to the buffer for the corresponding CPU according to its CPU number and then input to the performance simulator 5. To keep the explanation simple, FIG. 14 assumes a buffer size of one instruction.

After executing the CPU2-2 instruction, CPU2 does not execute its next instruction (CPU2-3) until all the other synchronization instructions (CPU0-4, CPU1-4, and CPU3-6) have been executed.

Because the rearranged trace data 210 repeats one instruction per CPU in a predetermined CPU order, the CPU2-2 instruction is followed by the CPU3-2, CPU0-3, and CPU1-3 instructions in that order. When the CPU2-3 instruction then enters buffer 2, CPU2 is waiting for the other CPUs 0, 1, and 3 to execute their synchronization instructions (CPU0-4, CPU1-4, and CPU3-6), so CPU2 stalls without fetching the instruction.

After that, the CPU3-3, CPU0-4, and CPU1-4 instructions are input, but the CPU2-4 instruction then cannot be placed in its buffer. At that point execution on CPU0 to CPU3 stops, and the entire performance simulation by the performance simulator 5 halts.
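The stall described with FIG. 14 can be reproduced with a small model. The following Python sketch is a simplification made for illustration, not the performance simulator 5 itself: timing is ignored, each CPU executes at most one buffered instruction per step, and the name `simulate` and the tuple representation `(cpu, index)` are assumptions of this example.

```python
from collections import deque

def simulate(stream, sync_instructions, n_cpus):
    """Feed a merged trace through one-instruction buffers; return (executed, undelivered)."""
    pending = deque(stream)          # merged trace still waiting at the trace receiver
    buffers = [None] * n_cpus        # per-CPU buffer holding at most one instruction
    waiting = [False] * n_cpus       # True while a CPU waits at the synchronization point
    executed = []
    progress = True
    while progress:
        progress = False
        # Receiver: deliver strictly in trace order; stop when the target buffer is full.
        while pending and buffers[pending[0][0]] is None:
            cpu, index = pending.popleft()
            buffers[cpu] = (cpu, index)
            progress = True
        # Each CPU that is not waiting executes the instruction in its buffer.
        for cpu in range(n_cpus):
            if buffers[cpu] is not None and not waiting[cpu]:
                instruction, buffers[cpu] = buffers[cpu], None
                executed.append(instruction)
                progress = True
                if instruction in sync_instructions:
                    waiting[cpu] = True
        # The synchronization completes once every CPU has reached it.
        if all(waiting):
            waiting = [False] * n_cpus
            progress = True
    return executed, list(pending)

# Simple rearrangement of ten instructions per CPU, as in FIG. 11.
stream = [(cpu, index) for index in range(10) for cpu in range(4)]
sync_instructions = {(0, 4), (1, 4), (2, 2), (3, 6)}   # as in FIG. 13
executed, undelivered = simulate(stream, sync_instructions, 4)
print(len(executed), undelivered[0])
```

Under these assumptions the run stops after 17 instructions with CPU2-4 at the head of the undelivered trace, because buffer 2 still holds CPU2-3 while the synchronization instruction CPU3-6 is queued behind CPU2-4.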

As described above, the inventor analyzed the mechanism by which performance simulation stops when trace data containing synchronization instructions is simply rearranged and used. The inventor then arrived at the trace data creation method according to the present embodiment, described below.

The trace data creation method according to the present embodiment is now described. FIG. 15 is a diagram for explaining an outline of the trace data creation method. The trace data creation method is performed by the CPU 11 of the computer device 100a or 100b.

In FIG. 15, instructions are taken one at a time, in traced order, from the trace data 200 to 203 of CPU0 to CPU3 in a predetermined order and appended to the transfer trace data 300. When an instruction taken out is a synchronization instruction, the trace data of the CPU in which it appeared is skipped from then on, and no further instructions from that trace data are appended to the transfer trace data 300. Once all the synchronization instructions have been appended to the transfer trace data 300, appending of the instructions that follow the synchronization instructions resumes from the trace data 200 to 203 in the predetermined order.

In the example of FIG. 15, the section based on the first synchronization instruction CPU2-2 runs from CPU0-0 to CPU3-6. It includes the first synchronization instruction CPU2-2 and ends when the remaining synchronization instructions CPU0-4, CPU1-4, and CPU3-6 have all been appended to the transfer trace data 300. Its end is indicated by the break 7 immediately after CPU3-6, the last synchronization instruction corresponding to CPU2-2.

Immediately after the last synchronization instruction CPU3-6, the instructions that follow the synchronization instructions, including those that were skipped, are taken one at a time from the trace data 200 to 203 in the predetermined order and appended to the transfer trace data 300.

Accordingly, in the example of FIG. 15, no CPU2 instruction appears after the first synchronization instruction CPU2-2 until synchronization instructions have appeared for all the other CPUs 0, 1, and 3. When the synchronization instruction CPU0-4 appears after CPU2-2, no CPU0 instruction appears until synchronization instructions have appeared for the remaining CPUs 1 and 3.

The same applies to CPU1 and CPU3. Once the synchronization instructions CPU1-4 and CPU3-6 have appeared, appending of instructions to the transfer trace data 300 resumes in the predetermined order. After the last synchronization instruction CPU3-6, the instructions from CPU0-5 through CPU2-9 are appended to the transfer trace data 300 and are therefore executed after the synchronization instructions.

By rearranging in this way each time a synchronization instruction appears in the traces, transfer trace data 300 can be created that is delimited by synchronization instructions and with which the performance simulation runs normally without stopping.
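The outline of FIG. 15 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the embodiment's implementation: instructions are plain strings, `is_sync` is a caller-supplied predicate, and the names `combine_with_skipping`, `position`, and `held` are hypothetical. The error raised when a trace ends without reaching its synchronization instruction is also an addition of this sketch.

```python
def combine_with_skipping(traces, is_sync):
    """Round-robin merge that skips a CPU's trace after its synchronization instruction."""
    n_cpus = len(traces)
    position = [0] * n_cpus        # next unread instruction in each CPU's trace
    held = [False] * n_cpus        # True once this CPU's synchronization instruction is appended
    merged = []
    while any(position[c] < len(traces[c]) for c in range(n_cpus)):
        appended = False
        for cpu in range(n_cpus):  # predetermined order: CPU0, CPU1, ...
            if held[cpu] or position[cpu] >= len(traces[cpu]):
                continue           # skip this CPU's trace for now
            instruction = traces[cpu][position[cpu]]
            position[cpu] += 1
            merged.append(instruction)
            appended = True
            if is_sync(instruction):
                held[cpu] = True
                if all(held):      # every synchronization instruction has been appended
                    held = [False] * n_cpus
        if not appended:
            raise ValueError("a trace ended before reaching its synchronization instruction")
    return merged

# Ten instructions per CPU with the synchronization instructions of FIG. 15.
traces = [[f"CPU{cpu}-{i}" for i in range(10)] for cpu in range(4)]
sync = {"CPU0-4", "CPU1-4", "CPU2-2", "CPU3-6"}
merged = combine_with_skipping(traces, sync.__contains__)
print(merged[19], merged[20], merged[-1])
```

With this input the first section ends at CPU3-6, the next instruction appended is CPU0-5, and the last is CPU2-9, in line with the description of FIG. 15.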

FIG. 16 is a diagram showing an example functional configuration for creating the transfer trace data. The functional configuration shown in FIG. 16 is implemented in the computer device 100a or 100b. The computer device 100a or 100b has a multi-core compatible ISS unit 35 and a trace combination processing unit 36 as processing units for creating the transfer trace data 300. The multi-core compatible ISS unit 35 and the trace combination processing unit 36 are realized by processing performed when the CPU 11 executes the corresponding programs.

The multi-core compatible ISS unit 35 reads the multi-thread program 34 stored in the storage unit 30 and performs an instruction-set-level simulation. As a result, trace data TD-1, TD-2, TD-3, ..., TD-n are output to and stored in the storage unit 30.

The trace combination processing unit 36 reads instructions one at a time, in the predetermined order, from the trace data TD-1, TD-2, TD-3, ..., TD-n stored in the storage unit 30, creates a single transfer trace data 300 by the trace data creation method described with FIG. 15, and stores it in the storage unit 30. To simplify processing, each trace data TD-1, TD-2, TD-3, ..., TD-n may be stored in the storage unit 30 under a file name that includes the corresponding CPU number.

The trace combination processing by the trace combination processing unit 36 is performed before the performance simulation is executed. It may be realized in software, by the CPU 11 executing a trace combining program as described above, or in hardware. It may also be realized on a device other than the computer devices 100a and 100b.

FIG. 17 is a flowchart for explaining the trace combination processing by the trace combination processing unit. In FIG. 17, following the predetermined order for selecting trace data, the trace combination processing unit 36 reads one instruction 3b, in traced order, from the trace data for the CPU number 73 stored in the storage unit 30 (step S11).

The predetermined order of selection specifies, by CPU number, the order in which trace data is selected when reading an instruction 3b. For example, by including the CPU number in the file name of each trace data, trace data can be selected in CPU-number order, either ascending or descending. When the predetermined order is CPU0, CPU1, CPU2, ..., CPUn (0, 1, 2, ..., n), the trace data for CPU0, CPU1, CPU2, ..., CPUn are selected in that order.
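One way to derive the predetermined order from file names is sketched below in Python. The file naming pattern `trace_cpu<N>.bin` and the function name `order_by_cpu_number` are hypothetical; the description only states that the file name may include the CPU number.

```python
import re

def order_by_cpu_number(file_names, descending=False):
    """Sort trace file names by the CPU number embedded in each name."""
    def cpu_number(name):
        match = re.search(r"cpu(\d+)", name)
        if match is None:
            raise ValueError(f"no CPU number in file name: {name}")
        return int(match.group(1))   # numeric compare, so cpu10 sorts after cpu2
    return sorted(file_names, key=cpu_number, reverse=descending)

names = ["trace_cpu10.bin", "trace_cpu2.bin", "trace_cpu0.bin", "trace_cpu1.bin"]
ordered = order_by_cpu_number(names)
print(ordered)
```

Sorting on the parsed integer rather than on the raw string avoids placing CPU10 before CPU2.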

The trace combination processing unit 36 determines whether the read instruction 3b is a synchronization instruction (step S12). If it is, the unit sets the synchronization flag for the CPU number 73 in the flag table 71 (step S13) and outputs the instruction 3b to the transfer trace data 300 for storage (step S14).

The flag table 71 is stored in the storage unit 30 and needs at least as many bits as there are cores, so that each CPU number is associated with a synchronization flag. A synchronization flag reads "1" once it is set. That is, a synchronization flag of "1" for a CPU number indicates that a synchronization instruction has been read for that CPU, and a flag of "0" indicates that no synchronization instruction has been read yet.
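A flag table with one bit per core could look like the following Python sketch. The class name `FlagTable` and its method names are assumptions for illustration; the description only requires one synchronization flag per CPU number.

```python
class FlagTable:
    """One synchronization flag per CPU, packed into a single integer."""

    def __init__(self, n_cpus):
        self.n_cpus = n_cpus
        self.bits = 0                       # all flags start at "0"

    def set(self, cpu):
        self.bits |= 1 << cpu               # flag for this CPU becomes "1"

    def is_set(self, cpu):
        return bool((self.bits >> cpu) & 1)

    def all_set(self):
        return self.bits == (1 << self.n_cpus) - 1

    def clear_all(self):
        self.bits = 0

table = FlagTable(4)
table.set(2)                                # e.g. CPU2 read its synchronization instruction
print(table.is_set(2), table.is_set(0), table.all_set())
```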

If, on the other hand, it determines in step S12 that the read instruction is not a synchronization instruction, the trace combination processing unit 36 refers to the flag table 71 stored in the storage unit 30 and determines whether the synchronization flag for the CPU number 73 is set (step S13-2). If the flag is set, it saves the read instruction 3b in the synchronization buffer 72 for the CPU number 73 (step S14-2).

The synchronization buffers 72 are working buffer areas prepared in the storage unit 30, one per CPU number. When an instruction 3b with the same CPU number as an already-read synchronization instruction is read from the trace data, it is held in the synchronization buffer 72 for that CPU number until the synchronization instructions of all the other CPUs have been detected.

If it determines in step S13-2 that the synchronization flag is not set, the trace combination processing unit 36 outputs the instruction 3b to the transfer trace data 300 for storage (step S14-4).

After step S14, S14-2, or S14-4, the trace combination processing unit 36 refers to the flag table 71 and determines whether the synchronization flag is set for every CPU (step S15). If it is, the unit clears all the synchronization flags (step S16), setting every flag to "0".

The trace combination processing unit 36 then executes the buffer output processing described with FIG. 18 (step S17). After the buffer output processing, it proceeds to step S18.

If it determines in step S15 that no synchronization flag is set, or that at least one CPU's synchronization flag is still unset, the trace combination processing unit 36 determines whether the CPU number 73 is the last number in the predetermined order (step S18).

If the CPU number 73 is the last number, the trace combination processing unit 36 initializes the CPU number 73 (step S19), setting the CPU number 73 stored in the storage unit 30 to "0", and proceeds to step S20. Otherwise, it increments the CPU number 73 by 1 (step S19-2) and proceeds to step S20.

The trace combination processing unit 36 then determines whether all the traces have been read (step S20). If not, it returns to step S11 and repeats the processing described above based on the CPU number 73 stored in the storage unit 30.

If it determines in step S20 that all the traces have been read, the trace combination processing unit 36 ends the processing.

FIG. 18 is a diagram for explaining the buffer output processing in step S17 of FIG. 17. In FIG. 18, the trace combination processing unit 36 initializes the CPU number 73-2 used for buffer output processing (step S51) and reads one instruction 3b from the synchronization buffer for the CPU number 73-2 (step S52).

The trace combination processing unit 36 outputs the read instruction to the transfer trace data 300 (step S53). It then determines whether the CPU number 73-2 is the last number in the predetermined order (step S54).

If the CPU number 73-2 is the last number, the trace combination processing unit 36 initializes the CPU number 73-2 (step S55), setting the CPU number 73-2 stored in the storage unit 30 to "0", and proceeds to step S56. Otherwise, it increments the CPU number 73-2 by 1 (step S55-2) and proceeds to step S56.

The trace combination processing unit 36 then determines whether all the synchronization buffers 72 are empty (step S56). If they are, it ends the buffer output processing. If any synchronization buffer 72 still holds an instruction, it returns to step S52 and repeats the processing described above.
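The flowcharts of FIGS. 17 and 18 can be followed step by step in a short Python sketch. This is an illustration under assumptions, not the embodiment's program: the names `combine_traces` and `output_buffers` are hypothetical, exhausted traces and empty buffers are simply skipped, leftover buffered instructions are flushed at the end, and a guard is added for a case the flowchart does not address (a second synchronization instruction read for a CPU whose flag is still set).

```python
from collections import deque

def output_buffers(buffers, transfer):
    """Buffer output processing (S51 to S56): one instruction per CPU in turn."""
    while any(buffers):                       # S56: repeat while any buffer holds an instruction
        for buffer in buffers:                # S54, S55, S55-2: cycle through the CPU numbers
            if buffer:
                transfer.append(buffer.popleft())   # S52, S53

def combine_traces(traces, is_sync):
    """Trace combination processing (S11 to S20) with a flag table and synchronization buffers."""
    n_cpus = len(traces)
    position = [0] * n_cpus                        # read position in each trace
    flags = [False] * n_cpus                       # flag table 71
    buffers = [deque() for _ in range(n_cpus)]     # synchronization buffers 72
    transfer = []                                  # transfer trace data 300
    cpu = 0                                        # CPU number 73
    while any(position[c] < len(traces[c]) for c in range(n_cpus)):   # S20
        if position[cpu] < len(traces[cpu]):
            instruction = traces[cpu][position[cpu]]    # S11
            position[cpu] += 1
            if is_sync(instruction):                    # S12
                if flags[cpu]:
                    # Guard added for this sketch; the flowchart does not cover this case.
                    raise ValueError("synchronization instruction read while flag is set")
                flags[cpu] = True                       # S13
                transfer.append(instruction)            # S14
            elif flags[cpu]:                            # S13-2
                buffers[cpu].append(instruction)        # S14-2
            else:
                transfer.append(instruction)            # S14-4
            if all(flags):                              # S15
                flags = [False] * n_cpus                # S16
                output_buffers(buffers, transfer)       # S17
        cpu = (cpu + 1) % n_cpus                        # S18, S19, S19-2
    output_buffers(buffers, transfer)   # assumption: flush anything still buffered
    return transfer

traces = [[f"CPU{cpu}-{i}" for i in range(10)] for cpu in range(4)]
sync = {"CPU0-4", "CPU1-4", "CPU2-2", "CPU3-6"}
transfer = combine_traces(traces, sync.__contains__)
print(transfer[19:28])
```

Because this variant keeps reading a flagged CPU's trace into its buffer instead of skipping it, the order after the break differs from the outline of FIG. 15: right after CPU3-6 the buffered instructions CPU0-5, CPU1-5, CPU2-3, CPU0-6, CPU1-6, CPU2-4, CPU2-5, and CPU2-6 are output, and normal reading then resumes.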

The execution timing when the transfer trace data 300 created by the trace combination processing of the trace combination processing unit 36 is used is described with reference to FIGS. 19 and 20. FIGS. 19 and 20 assume that the transfer trace data 300 is used in the configuration example shown in FIG. 6. In the transfer trace data 300, the data structure of an instruction 3b is the same as that of an instruction 3b in the trace data 2b.

FIG. 19 is a diagram showing an example of execution timing delimited by synchronization instructions. In FIG. 19, when the transfer trace data 300 created by the trace combination processing unit 36 is input to the trace receiving unit 6 of the performance simulation apparatus 50b, each instruction 3b is stored in the buffer corresponding to its CPU number.

Each of CPU0 to CPU3 simulated by the performance simulator 5 reads instructions 3b from its corresponding buffer, buffer 0 to buffer 3, and executes them. When CPU2 executes the first synchronization instruction CPU2-2, it waits for the other CPUs 0, 1, and 3 to execute their synchronization instructions.

Next, when CPU0 executes the synchronization instruction CPU0-4, it waits for the other CPUs 1 and 3 to execute theirs. When CPU1 then executes the synchronization instruction CPU1-4, it waits for CPU3. Finally, when CPU3 executes the synchronization instruction CPU3-6, all the synchronization instructions have been executed, and instruction fetch on CPU0 to CPU3 proceeds without stalling. That point is indicated by the break 7. From the break 7 onward, the traced instructions 3b are executed in sequence by CPU0 to CPU3.

FIG. 20 is a diagram for comparing the deviation in execution timing between the actual machine and a performance simulation that uses the transfer trace data of FIG. 19. Comparing execution timing on the actual machine and in the performance simulator 5, FIG. 20(A) shows the deviation for CPU0, FIG. 20(B) for CPU1, FIG. 20(C) for CPU2, and FIG. 20(D) for CPU3.

At the point where ten instructions have completed, the performance simulation by the performance simulator 5 finishes later than the actual machine by d20 for CPU0 in FIG. 20(A), by d21 for CPU1 in FIG. 20(B), by d22 for CPU2 in FIG. 20(C), and by d23 for CPU3 in FIG. 20(D).

In each of FIGS. 20(A) to 20(D), the performance simulator 5 lags by the delay of the break 7b in the simulated execution timing relative to the break 7a in the execution timing of the actual machine. Nevertheless, by using the transfer trace data 300 created by the trace combination processing unit 36, the performance simulation runs without stopping partway through. The synchronized execution timing across CPU0 to CPU3 is also reproduced, as shown by the break 7b.

As described above, in evaluating the performance of a system equipped with multiple processors (CPUs), the instructions 3b traced on CPU0 to CPUn are rearranged into a single trace data 300 so that they are delimited at each synchronization instruction. As a result, even for traces that contain synchronization instructions, instruction fetch by each processor after a synchronization instruction proceeds without stalling, and the performance simulation runs without stopping partway through.

In the present embodiment, transfer trace data for a performance simulation that evaluates a multiprocessor system is created by delimiting the multiple traces at each synchronization instruction and combining them, so that the next instructions are input only after the synchronization instructions of all the processors.
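The property stated here can be checked mechanically. The following Python sketch is one possible check, assuming that each entry of the combined trace is a `(cpu, label)` pair and that `is_sync` is a caller-supplied predicate; the function name `is_delimited_by_sync` is hypothetical.

```python
def is_delimited_by_sync(merged, is_sync, n_cpus):
    """Return True if no CPU has an instruction between its synchronization
    instruction and the point where all CPUs have reached theirs."""
    reached = set()                      # CPUs whose synchronization instruction has appeared
    for cpu, label in merged:
        if cpu in reached:
            return False                 # this CPU should still be held back
        if is_sync(label):
            reached.add(cpu)
            if len(reached) == n_cpus:   # break: every CPU reached the synchronization
                reached.clear()
    return True

def is_sync(label):
    return label == "sync"

good = [(0, "a0"), (1, "a1"), (0, "sync"), (1, "b1"), (1, "sync"), (0, "c0"), (1, "c1")]
bad = [(0, "a0"), (1, "a1"), (0, "sync"), (1, "b1"), (0, "c0"), (1, "sync"), (1, "c1")]
print(is_delimited_by_sync(good, is_sync, 2), is_delimited_by_sync(bad, is_sync, 2))
```

In the second list CPU0's instruction c0 appears before CPU1 has reached its synchronization instruction, which is the ordering that a simple rearrangement can produce.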

The transfer trace data 300 according to the present embodiment is applicable to performance simulations performed with the configurations of FIG. 2 and FIGS. 6 to 9. The computer devices 100a and 100b correspond to the trace combining device described below.

The present invention is not limited to the specifically disclosed embodiments, and various modifications and changes can be made without departing from the scope of the claims.

The following additional notes are further disclosed with respect to the embodiments including the above example.
(Appendix 1)
A trace combining device comprising: a storage unit that stores a plurality of trace data corresponding respectively to a plurality of processors operating in one system; and
a trace combination processing unit that selects one of the plurality of trace data stored in the storage unit according to a predetermined order, reads the traced instructions from each trace data one at a time while rearranging them so that they are delimited at each synchronization instruction, and appends them to transfer trace data in the storage unit, thereby combining the traces of the processors.
(Appendix 2)
The trace combining device according to Appendix 1, wherein the trace combination processing unit comprises:
an instruction reading unit that reads an instruction from the trace data in the storage unit selected according to the predetermined order;
a synchronization instruction determination unit that determines whether the instruction read by the instruction reading unit is a synchronization instruction;
an instruction storage unit that, when the synchronization instruction determination unit determines that the instruction is not a synchronization instruction, stores the instruction in a synchronization buffer associated with the processor on which the instruction was traced; and
a buffer output unit that, when synchronization instructions have been read from the plurality of trace data corresponding to all of the processors, reads instructions one by one from the synchronization buffers selected according to the predetermined order and outputs them to the transfer trace data, adding them thereto.
(Appendix 3)
The trace combining device according to Appendix 2, wherein processing by the instruction reading unit is resumed when the buffer output unit finishes outputting to the transfer trace data.
(Appendix 4)
The trace combining device according to Appendix 3, further comprising:
a flag setting unit that, when the synchronization instruction determination unit determines that the instruction is a synchronization instruction, sets, in a flag table that is stored in the storage unit and holds a synchronization flag for each of the plurality of processors, the synchronization flag corresponding to the processor on which the instruction was traced;
a flag determination unit that, when the synchronization instruction determination unit determines that the instruction is not a synchronization instruction, refers to the flag table to determine whether the synchronization flag corresponding to the processor on which the instruction was traced is set; and
a flag-unset addition unit that, when the flag determination unit determines that the synchronization flag is not set, outputs the instruction to the transfer trace data, adding it thereto.
(Appendix 5)
The trace combining device according to Appendix 4, further comprising an unset instruction storage unit that, when the synchronization instruction determination unit determines that the instruction is not a synchronization instruction and the flag determination unit determines that the synchronization flag is set, stores the instruction in the synchronization buffer associated with the processor on which the instruction was traced.
(Appendix 6)
The trace combining device according to Appendix 4 or 5, further comprising a post-setting instruction storage unit that, when the synchronization instruction determination unit determines that the instruction is a synchronization instruction and the flag setting unit sets, in the flag table, the synchronization flag corresponding to the processor on which the instruction was traced, outputs the instruction to the transfer trace data and stores it there.
(Appendix 7)
A trace combining method executed by a computer, comprising:
selecting trace data according to a predetermined order from a plurality of trace data that are stored in a storage unit and each correspond to one of a plurality of processors operating in one system; and
combining the traces of the processors by reading the traced instructions from each trace data one by one, rearranging them in segments delimited by synchronization instructions, and adding them to transfer trace data in the storage unit.
(Appendix 8)
A program that causes a computer to execute processing comprising:
selecting trace data according to a predetermined order from a plurality of trace data that are stored in a storage unit and each correspond to one of a plurality of processors operating in one system; and
combining the traces of the processors by reading the traced instructions from each trace data one by one, rearranging them in segments delimited by synchronization instructions, and adding them to transfer trace data in the storage unit.
(Appendix 9)
A computer-readable storage medium storing a program that causes a computer to execute processing comprising:
selecting trace data according to a predetermined order from a plurality of trace data that are stored in a storage unit and each correspond to one of a plurality of processors operating in one system; and
combining the traces of the processors by reading the traced instructions from each trace data one by one, rearranging them in segments delimited by synchronization instructions, and adding them to transfer trace data in the storage unit.
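
The instruction routing described in Appendices 2 to 6 can be sketched in Python as follows. This is an illustrative sketch only and not the implementation of the embodiment. The function name combine_traces, the is_sync predicate, the (CPU number, instruction) tuples, and the use of ascending CPU number as the predetermined order are all assumptions. Three further behaviors are assumed because the appendices do not specify them: a trace whose synchronization flag is already set pauses at its next synchronization instruction until the buffers are flushed, a trace that has been read to the end is treated as having reached synchronization, and all flags are cleared after the flush.

```python
from collections import deque


def combine_traces(traces, is_sync):
    """Merge per-processor traces into one transfer trace (hypothetical sketch).

    traces  : list of instruction lists, one per processor (index = CPU number)
    is_sync : predicate returning True for a synchronization instruction
    """
    n = len(traces)
    pos = [0] * n                          # next unread index in each trace
    flags = [False] * n                    # flag table: one synchronization flag per CPU
    buffers = [deque() for _ in range(n)]  # one synchronization buffer per CPU
    transfer = []                          # transfer trace data as (cpu, instruction)

    def done(cpu):
        return pos[cpu] >= len(traces[cpu])

    while not all(done(c) for c in range(n)):
        # Assumed "predetermined order": ascending CPU number, one instruction each.
        for cpu in range(n):
            if done(cpu):
                continue
            inst = traces[cpu][pos[cpu]]
            if is_sync(inst):
                if flags[cpu]:
                    continue               # assumption: hold the next sync until the flush
                flags[cpu] = True          # flag setting unit
                transfer.append((cpu, inst))   # sync instruction goes straight out
            elif flags[cpu]:
                buffers[cpu].append(inst)      # flag set: keep in the synchronization buffer
            else:
                transfer.append((cpu, inst))   # flag not set: add to the transfer trace
            pos[cpu] += 1

        # Buffer output: once every CPU has reached a sync (or run out of
        # instructions), flush the buffers in CPU order and resume reading.
        if all(flags[c] or done(c) for c in range(n)):
            for cpu in range(n):
                while buffers[cpu]:
                    transfer.append((cpu, buffers[cpu].popleft()))
            flags = [False] * n            # assumption: flags are cleared after the flush

    return transfer


# Hypothetical two-CPU example; "SYNC" stands in for a synchronization instruction.
example = [["a0", "SYNC", "a1", "a2"],
           ["b0", "b1", "b2", "SYNC", "b3"]]
for cpu, inst in combine_traces(example, lambda inst: inst == "SYNC"):
    print(cpu, inst)
```

In this example, the instructions that CPU 0 executes after its synchronization instruction (a1, a2) are held in its synchronization buffer and appear in the merged trace only after CPU 1 has also reached its synchronization instruction. The order of instructions within each CPU is preserved.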

２、２ｂトレースデータ
３、３ｂ命令
４アーキテクチャ設定情報
５性能シミュレータ
６トレース受信部
７、７ａ、７ｂ区切り
８性能情報レポート
９実機の実行タイミング
１１ＣＰＵ
１２ＲＯＭ
１３ＲＡＭ
１４ハードディスクドライブ
１５入力装置
１６出力装置
１７通信Ｉ／Ｆ
１８ドライブ
１９記憶媒体
２０性能シミュレーション装置Ｉ／Ｆ
３０記憶部
３３転送処理部
３４マルチスレッドプログラム
３５マルチコア対応ＩＳＳ部
３６トレース結合処理部
５０ａ性能シミュレーション部
５０ｂ性能シミュレーション装置
５１ｂ、５２ｂＩ／Ｆ
７１フラグテーブル
７２同期用バッファ
７３ＣＰＵ番号
１００ａ、１００ｂコンピュータ装置
２００、２０１、２０２、２０３トレースデータ
２１０並べ替え済トレースデータ
３００転送用トレースデータ
2, 2b Trace data
3, 3b Instruction
4 Architecture setting information
5 Performance simulator
6 Trace reception unit
7, 7a, 7b Delimiter
8 Performance information report
9 Execution timing on the actual machine
11 CPU
12 ROM
13 RAM
14 Hard disk drive
15 Input device
16 Output device
17 Communication I/F
18 Drive
19 Storage medium
20 Performance simulation device I/F
30 Storage unit
33 Transfer processing unit
34 Multi-thread program
35 Multi-core compatible ISS unit
36 Trace combination processing unit
50a Performance simulation unit
50b Performance simulation device
51b, 52b I/F
71 Flag table
72 Synchronization buffer
73 CPU number
100a, 100b Computer device
200, 201, 202, 203 Trace data
210 Rearranged trace data
300 Transfer trace data

Claims

1. A trace combining device comprising:
a storage unit that stores a plurality of trace data, each corresponding to one of a plurality of processors operating in one system; and
a trace combination processing unit that selects one of the plurality of trace data stored in the storage unit according to a predetermined order, reads the traced instructions from each trace data one by one, rearranges them in segments delimited by synchronization instructions, and adds them to transfer trace data, thereby combining the traces of the processors.

2. The trace combining device according to claim 1, wherein the trace combination processing unit comprises:
an instruction reading unit that reads an instruction from the trace data in the storage unit selected according to the predetermined order;
a synchronization instruction determination unit that determines whether the instruction read by the instruction reading unit is a synchronization instruction;
a flag setting unit that, when the synchronization instruction determination unit determines that the instruction is a synchronization instruction, sets, in a flag table that is stored in the storage unit and holds a synchronization flag for each of the plurality of processors, the synchronization flag corresponding to the processor on which the instruction was traced;
a flag determination unit that, when the synchronization instruction determination unit determines that the instruction is not a synchronization instruction, refers to the flag table to determine whether the synchronization flag corresponding to the processor on which the instruction was traced is set;
an instruction storage unit that, when the flag determination unit determines that the synchronization flag is set, stores the instruction in a synchronization buffer associated with the processor on which the instruction was traced; and
a buffer output unit that, when synchronization instructions have been read from the plurality of trace data corresponding to all of the processors, reads instructions one by one from the synchronization buffers selected according to the predetermined order and outputs them to the transfer trace data, adding them thereto.

3. The trace combining device according to claim 2, wherein processing by the instruction reading unit is resumed when the buffer output unit finishes outputting to the transfer trace data.

4. The trace combining device, wherein the trace combination processing unit comprises a flag-unset addition unit that, when the flag determination unit determines that the synchronization flag is not set, outputs the instruction to the transfer trace data, adding it thereto.

5. A program that causes a computer to execute processing comprising:
selecting trace data according to a predetermined order from a plurality of trace data that are stored in a storage unit and each correspond to one of a plurality of processors operating in one system; and
combining the traces of the processors by reading the traced instructions from each trace data one by one, rearranging them in segments delimited by synchronization instructions, and adding them to transfer trace data in the storage unit.