CN102508776A

CN102508776A - Method for automatically constructing evaluation stimuli for a multi-thread interleaved double-precision short-vector architecture

Info

Publication number: CN102508776A
Application number: CN2011103428031A
Authority: CN
Inventors: 李春江; 杜云飞; 易会战; 杨灿群; 黄春; 陈娟; 赵克佳; 王锋; 彭林; 左克
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2011-11-03
Filing date: 2011-11-03
Publication date: 2012-06-20
Anticipated expiration: 2031-11-03
Also published as: CN102508776B

Abstract

The invention discloses a method for automatically constructing evaluation stimuli for a multi-thread interleaved double-precision short-vector architecture, comprising the steps of: inputting the vector operation type and vector length to be evaluated; automatically creating an empty assembly language file to serve as the evaluation stimulus; and writing the following content into the assembly language file: a multi-thread initialization code segment, an evaluation stimulus control structure, a multi-thread vector operation program segment, an evaluation stimulus synchronization structure, and a data segment with its initialization statements. Together these constitute a complete evaluation stimulus for the multi-thread interleaved double-precision short-vector architecture. The invention enables automatic batch construction of evaluation stimuli, which shortens construction time, saves cost, improves the efficiency of stimulus development, and greatly facilitates processor verification and performance evaluation.

Description

Method for automatically constructing evaluation stimuli for a multi-thread interleaved double-precision short-vector architecture

Technical field

The invention relates to the fields of processors and evaluation, and in particular to a method for automatically constructing evaluation stimuli for processors with a multi-thread interleaved double-precision short-vector architecture.

Background

As the integration density of processor chips increases, implementing double-precision short-vector units in the processor core to support data-intensive scientific and engineering computing has become an important trend. Adding a double-precision short-vector unit to a multi-threaded processor core can greatly improve the processor's double-precision floating-point performance. Such a unit requires short-vector registers with a longer word length (Intel's AVX already supports 256-bit short vectors holding four double-precision values) and a corresponding vector instruction set that supports double-precision computation.

FIG. 1 is a schematic diagram of a multi-threaded processor core extended with a double-precision short-vector unit. The core is based on OpenSPARC T2 and adds a vector processing unit (VPU) that supports short-vector operations on four double-precision values and can be used concurrently by multiple threads. The core supports eight hardware threads using round-robin multithreading, with every four hardware threads forming a group. In each clock cycle the processor selects one thread from each group of four and executes its current instruction, which may be a vector instruction or a scalar instruction. When an instruction of one thread stalls the pipeline, for example because of a cache miss, the processor fetches and executes instructions from other threads, hiding the latency and keeping the pipeline full. The functional units of the processor core shown in FIG. 1 are briefly described as follows:

1) Trap logic unit (TLU): updates machine state and handles exceptions and interrupts. The TLU has been extended for the added VPU to support VPU state updates and exception handling.

2) Instruction fetch unit (IFU): fetches one instruction from each thread group every clock cycle and issues it to the appropriate execution unit (EXU0/1, FGU, LSU, or VPU) according to the instruction type.

3) Integer execution units (EXU0/1): execute integer instructions. The processor contains two integer execution units (numbered 0 and 1), and each group of four threads shares one of them.

4) Floating-point and graphics unit (FGU): executes scalar floating-point instructions and instructions that support image processing.

5) Load/store unit (LSU): executes all memory access instructions.

6) Memory management unit (MMU): works with the LSU to perform address translation and memory management for memory accesses.

7) Vector processing unit (VPU): executes short-vector instructions that operate on four double-precision values.

8) Communication unit (Gasket): handles communication between the processor core and the level-2 cache or other processor cores.

To implement the multi-thread interleaved double-precision short-vector processor, a VPU was added to the original processor core. To work with the VPU, the TLU, IFU, LSU, and MMU were all extended to support double-precision short-vector operations. In line with these changes to the core, the processor implements a short-vector instruction set that includes vector load/store instructions, vector compute instructions, vector compare instructions, vector shift instructions, and state manipulation instructions.

The processor core with the added VPU supports up to eight hardware threads, and concurrent use of the VPU by multiple threads forms the multi-thread interleaved double-precision short-vector architecture. The instruction stream of each thread that uses the vector unit contains several types of instructions. On this processor, instructions are executed as follows:

In each clock cycle the IFU takes two instructions, from two different threads, out of the current instructions of the eight hardware threads, and decides from the instruction type which functional unit each is sent to. If both are integer instructions, they can be issued to the two integer execution units in the same cycle. If both are memory access instructions, both are vector floating-point instructions, or both are scalar floating-point instructions, one is issued first and the other in the next clock cycle. When several threads that use the VPU run on the processor at the same time, vector load/store instructions and vector compute instructions from different threads execute concurrently on the LSU and the VPU.
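The issue rule described above can be illustrated with a minimal Python sketch. The instruction-kind labels (`int`, `mem`, `vec`, `fp`) and the function name are illustrative assumptions, and the sketch models only the pairing rule, not the actual pipeline.

```python
# Functional unit required by each instruction kind. The unit names come from
# the description; the kind labels are illustrative assumptions.
UNIT_FOR_KIND = {
    "int": "EXU",  # two integer units exist (EXU0 and EXU1), one per thread group
    "mem": "LSU",
    "vec": "VPU",
    "fp": "FGU",
}


def cycles_to_issue(kind_a: str, kind_b: str) -> int:
    """Cycles needed to issue two instructions, one from each thread group."""
    unit_a = UNIT_FOR_KIND[kind_a]
    unit_b = UNIT_FOR_KIND[kind_b]
    # Two integer instructions go to EXU0 and EXU1 in the same cycle.
    if unit_a == "EXU" and unit_b == "EXU":
        return 1
    # Two instructions that need the same single unit are issued one after the other.
    if unit_a == unit_b:
        return 2
    # Instructions for different units (for example LSU and VPU) proceed together.
    return 1
```

For example, `cycles_to_issue("vec", "vec")` gives 2 because both instructions need the single VPU, while a vector load paired with a vector compute instruction, `cycles_to_issue("mem", "vec")`, gives 1.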

This multi-thread interleaved double-precision short-vector architecture hides the latency of long-latency instructions and improves overall processor performance.

Compared with traditional SIMD extensions aimed at streaming media, the double-precision short-vector unit uses different registers, different data paths, and an entirely different instruction set. Processor verification and performance evaluation for this kind of architecture therefore both require a large number of evaluation stimuli. The evaluation stimuli used in processor verification and performance evaluation are assembly language programs written for the processor architecture. During verification, stimuli are loaded onto the processor test platform and run to check the correctness of the design; processor performance can also be estimated from the execution time of a stimulus and the amount of computation it contains. These assembly language programs are usually written by hand by development and test engineers, which is laborious and time-consuming. Because processors differ in instruction set architecture and in how short vectors are added, existing evaluation stimuli for multi-threaded use of short-vector functional units cannot be inherited or reused.

Summary of the invention

The technical problem to be solved by the invention is: in view of the problems in the prior art, to provide a method for automatically constructing evaluation stimuli for a multi-thread interleaved double-precision short-vector architecture that is easy to use, reduces manual workload, and shortens the time required.

To solve the above technical problem, the invention adopts the following technical solution:

A method for automatically constructing evaluation stimuli for a multi-thread interleaved double-precision short-vector architecture, characterized by comprising the following steps:

(1) Input the vector operation type and vector length to be evaluated;

(2) Automatically create an empty assembly language file to serve as the evaluation stimulus;

(3) Write the following content into the assembly language file:

(3.1) A multi-thread initialization code segment;

(3.2) An evaluation stimulus control structure, including: a code segment that starts multi-threaded execution mode by setting the multi-thread enable register so that the processor enters the multi-threaded working state; and a thread selection and jump code segment that reads each thread's private thread number register and jumps to the code of each thread according to the thread number;

(3.3) A multi-thread vector operation program segment, including: a main-thread vector operation code segment, used to allocate computing tasks to the threads, compute the operand start addresses and vector lengths, read the source operand vector and destination operand vector, and perform short-vector operations in a loop; and a slave-thread vector operation code segment, used to read the source operand vector and destination operand vector and perform short-vector operations;

(3.4) An evaluation stimulus synchronization structure, including: a main-thread synchronization code segment, used to check and wait until all threads have completed their vector operations; and a slave-thread synchronization code segment, used to indicate to the main thread whether this thread has completed its vector operation;

(3.5) A data segment and data segment initialization statements, where the data segment is a multi-thread shared data segment containing the source operand vector and destination operand vector shared by the threads;

(4) Use the assembly language file obtained in step (3) as the automatically generated evaluation stimulus for the multi-thread interleaved double-precision short-vector architecture.
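Steps (1) to (4) can be pictured as a small generator program. The following Python sketch is only an outline under stated assumptions: the function names are hypothetical, each section is emitted as a placeholder comment rather than real instructions, and the `!` comment prefix assumes SPARC assembler syntax.

```python
def build_stimulus(op_type: str, vector_length: int) -> str:
    """Return placeholder text for the sections written in step (3), in order."""
    sections = [
        ("3.1", "multi-thread initialization code segment"),
        ("3.2", "control structure: enable multi-threading, jump on thread number"),
        ("3.3", f"vector operation segments for '{op_type}' (main and slave threads)"),
        ("3.4", "synchronization structure (main and slave threads)"),
        ("3.5", f"shared data segment: {vector_length} source and destination elements"),
    ]
    # One comment line per section stands in for the generated code segment.
    lines = [f"! step ({step}): {description}" for step, description in sections]
    return "\n".join(lines) + "\n"


def write_stimulus(path: str, op_type: str, vector_length: int) -> None:
    """Steps (2) to (4): create an empty file, then append the generated sections."""
    with open(path, "w", encoding="utf-8"):
        pass  # step (2): the file exists and is empty
    with open(path, "a", encoding="utf-8") as out:
        out.write(build_stimulus(op_type, vector_length))  # step (3)
```

A call such as `write_stimulus("foo.s", "vadd", 64)` would leave a five-line skeleton in `foo.s`; a real generator would replace each placeholder with the corresponding code segment.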

As further improvements of the invention:

In step (3.3), the multi-thread vector operation program segment is created as follows:

(3.3.1) According to the input vector operation type and vector length, allocate the computing task of each thread (the main thread and all slave threads), then determine the start position and length of the vector that each thread operates on;

(3.3.2) Each thread computes the source operand address and loop count from its thread number and the vector length, and sets the base address register and loop count register;

(3.3.3) Each thread computes the destination operand address from its thread number and sets the destination operand base address register;

(3.3.4) According to each thread's computing task, insert assembly instructions for vector load, vector operation, or result write-back into the text segment of the assembly language program, forming the main-thread vector operation code segment and the slave-thread vector operation code segments.
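The task allocation in steps (3.3.1) to (3.3.3) can be sketched as follows. The patent does not fix a partitioning scheme, so this Python sketch assumes a contiguous block distribution in units of one 4-element short vector, and assumes the vector length is a multiple of 4; the function and field names are hypothetical.

```python
LANES = 4             # double-precision values per short vector (from the description)
BYTES_PER_DOUBLE = 8


def partition(vector_length: int, num_threads: int = 8) -> list:
    """Assign each thread a contiguous slice of the operand vectors."""
    if vector_length % LANES != 0:
        raise ValueError("this sketch assumes vector_length is a multiple of 4")
    total_chunks = vector_length // LANES
    base, extra = divmod(total_chunks, num_threads)
    plan = []
    start_chunk = 0
    for tid in range(num_threads):
        # The first `extra` threads take one additional short vector each.
        chunks = base + (1 if tid < extra else 0)
        plan.append({
            "thread": tid,
            "start_element": start_chunk * LANES,
            "element_count": chunks * LANES,
            # Offset added to the source and destination base addresses.
            "byte_offset": start_chunk * LANES * BYTES_PER_DOUBLE,
            # One loop iteration processes one 4-element short vector.
            "loop_count": chunks,
        })
        start_chunk += chunks
    return plan
```

With `partition(64, 8)` every thread receives 8 elements and a loop count of 2, and thread 1 starts at byte offset 64. A generator would turn each entry into the instructions that set that thread's base address and loop count registers.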

In step (3.5), the shared data segment is constructed as follows:

(3.5.1) Use a random number generator for double-precision floating-point data to generate the double-precision vector used as the source operand, with the vector length specified by the user; convert the double-precision values in the vector to hexadecimal to form the source operand vector;

(3.5.2) Reserve storage for the destination operand according to the input vector length, to serve as the destination operand vector.
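A possible rendering of steps (3.5.1) and (3.5.2) is sketched below in Python. The labels `src_vec` and `dst_vec`, the value range of the random data, and the use of the `.xword` and `.skip` directives are assumptions for illustration; the 32-byte alignment follows the `align 32` setting given in the embodiment.

```python
import random
import struct


def double_to_hex(value: float) -> str:
    """IEEE 754 binary64 bit pattern of `value` as 16 hexadecimal digits."""
    return struct.pack(">d", value).hex()


def source_vector_directives(length: int, seed: int = 0, label: str = "src_vec") -> list:
    """Data-segment lines holding `length` random doubles in hexadecimal form."""
    rng = random.Random(seed)  # fixed seed so a stimulus can be regenerated
    lines = [".align 32", f"{label}:"]
    for _ in range(length):
        lines.append(f"    .xword 0x{double_to_hex(rng.uniform(-1.0, 1.0))}")
    return lines


def destination_vector_directives(length: int, label: str = "dst_vec") -> list:
    """Data-segment lines reserving 8 bytes per destination element."""
    return [".align 32", f"{label}:", f"    .skip {length * 8}"]
```

For instance, `double_to_hex(1.0)` yields `3ff0000000000000`, the bit pattern that would be written into the data segment for the value 1.0.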

The data segment further includes a lock variable and a thread count variable used by the evaluation stimulus synchronization structure. The synchronization structure uses the lock variable to ensure that only one thread updates the thread count variable at a time, and uses the thread count variable to determine, and guarantee, that the main thread continues with subsequent operations only after all threads have completed their own operations.

After step (3.4) is completed, the following are written into the assembly language file: a main-thread result comparison code segment for verifying the correctness of the main thread's vector operation results, a slave-thread result comparison code segment for verifying the correctness of the slave threads' vector operation results, and an error reporting code segment executed when a comparison fails. In step (3.5), the data segment further includes a vector of correct results that is read by the main-thread and slave-thread result comparison code segments.
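The role of the correct-result vector can be illustrated with a short Python sketch of the check that the comparison code segments perform. The operation names are hypothetical, only element-wise binary operations are covered, and comparing bit patterns rather than numeric values is an assumption of this sketch.

```python
import operator
import struct

# Hypothetical names for element-wise vector operation types.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def bits(value: float) -> int:
    """64-bit pattern of a double, so that 0.0 and -0.0 compare as different."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def expected_results(op: str, a: list, b: list) -> list:
    """Precompute the correct-result vector stored in the data segment."""
    return [OPS[op](x, y) for x, y in zip(a, b)]


def first_mismatch(computed: list, expected: list):
    """Index of the first differing element, or None if all elements match."""
    for index, (c, e) in enumerate(zip(computed, expected)):
        if bits(c) != bits(e):
            return index  # the stimulus would branch to its error-reporting code
    return None
```

In a generated stimulus the expected values would be computed once by the generator and emitted into the data segment, and the comparison itself would run as assembly code on the processor under test.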

Compared with the prior art, the invention has the following advantages:

1. The method adopts the idea of component-based programming: basic assembly language code segments serve as the building blocks of an assembly language program, and evaluation stimuli for the multi-thread interleaved double-precision short-vector architecture are assembled from them automatically. This supports rapid development of evaluation stimuli for this class of architecture and reduces manual workload.

2. The invention can be implemented as a program. Its input is the vector operation type and vector length to be evaluated or verified, and its output is an evaluation stimulus in the form of an assembly language program in which multiple threads use short-vector instructions to complete a computing task. Running the program repeatedly with different operation types and vector lengths constructs evaluation stimuli automatically in batches, which shortens construction time, saves cost, improves the efficiency of stimulus development, and greatly facilitates processor verification and performance evaluation.
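Batch construction as described in advantage 2 amounts to running the generator over every combination of operation type and vector length. The Python sketch below only enumerates the jobs; the operation names and the file naming scheme are assumptions.

```python
from itertools import product


def batch_jobs(op_types: list, vector_lengths: list) -> list:
    """One (operation, length, output file) job per combination of inputs."""
    return [
        (op, length, f"{op}_{length}.s")  # hypothetical naming scheme
        for op, length in product(op_types, vector_lengths)
    ]
```

Calling `batch_jobs(["vadd", "vmul"], [64, 128])` produces four jobs, each of which would be passed to the stimulus generator in turn.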

Description of the drawings

FIG. 1 is a schematic diagram of the processor core of the multi-thread interleaved double-precision short-vector architecture.

FIG. 2 is a schematic diagram of the overall flow of a specific embodiment of the invention.

FIG. 3 is a schematic flow chart of the evaluation stimulus constructed in a specific embodiment of the invention.

Detailed description

The invention is described in further detail below with reference to the drawings and a specific embodiment.

As shown in FIG. 2, the method of the invention is used to construct evaluation stimuli for evaluating and verifying the multi-thread interleaved double-precision short-vector architecture shown in FIG. 1. The steps are as follows:

1. Input the vector operation type and vector length to be evaluated.

2. Automatically create an empty assembly language source file foo.s.

3. Write assembly language program text with the following content into foo.s:

3.1 An assembly language code segment for multi-thread initialization. The invention uses a conventional initialization code segment; the initialization required is the same for every assembly language program constructed.

3.2 The evaluation stimulus control structure, which includes:

3.2.1 A code segment that starts multi-threaded execution mode by setting the multi-thread enable register so that the processor enters the multi-threaded working state;

3.2.2 A thread selection and jump code segment, which reads each thread's private thread number register and jumps to the code of each thread according to the thread number. The thread with thread number 0 is the main thread, and the other threads are slave threads.

3.3 The multi-thread vector operation program segment, which includes:

3.3.1 The main-thread vector operation code segment, used to allocate computing tasks to the threads, compute the operand start addresses and vector lengths, read the source operand vector and destination operand vector, and perform short-vector operations in a loop. It is constructed as follows:

a. The main thread computes the operand start address and vector length from its own thread number and the task allocation;

b. The main thread computes the source operand address and loop count from its thread number and the vector length, and sets the base address register and loop count register;

c. The main thread computes the destination operand address from its thread number and sets the destination operand base address register;

d. According to the main thread's computing task, assembly instructions for vector load, vector operation, or result write-back are inserted into the text segment of the assembly language program, forming the main-thread vector operation code segment.

3.3.2 The slave-thread vector operation code segment, used to read the source operand vector and destination operand vector and perform short-vector operations. It is constructed as follows:

a. The slave thread computes the operand start address and vector length from its own thread number and the task allocation;

b. The slave thread computes the source operand address and loop count from its thread number and the vector length, and sets the base address register and loop count register;

c. The slave thread computes the destination operand address from its thread number and sets the destination operand base address register;

d. According to each slave thread's computing task, assembly instructions for vector load, vector operation, or result write-back are inserted into the text segment of the assembly language program, forming the slave-thread vector operation code segment.

3.4 The evaluation stimulus verification structure, which includes:

3.4.1 A main-thread result comparison code segment, used to verify the correctness of the main thread's vector operation results;

3.4.2 A slave-thread result comparison code segment, used to verify the correctness of the slave threads' vector operation results;

3.4.3 An error reporting code segment, executed when a result comparison fails.

For test stimuli used for verification, a result comparison code segment that checks the correctness of the vector operation results must be inserted after the vector operation code segment of the main thread and of each slave thread. At run time, the result comparison code segment reads the values just computed by each thread and compares them with the corresponding precomputed correct results; if the values differ, the error reporting code is executed.

3.5 The evaluation stimulus synchronization structure, which includes:

3.5.1 The main-thread synchronization code segment, used to check and wait until all threads have completed their vector operations. If all threads have finished, the main thread reports and exits; otherwise it repeats the check in a loop until all threads have completed their vector operations.

3.5.2 The slave-thread synchronization code segment, used to indicate to the main thread whether this thread has completed its vector operation. After completing its double-precision short-vector operation, a slave thread increments the thread count variable by 1 and enters a busy-wait state.

3.6 The data segment and data segment initialization statements. The data segment includes the source operand vector and destination operand vector, the lock variable, the thread count variable, and the correct results used for verification; the source operand vector and destination operand vector form the multi-thread shared data segment. The data segment and its initialization statements are constructed as follows:

3.6.1 Construct the shared data segment:

a. Source operand vector. Use a random number generator for double-precision floating-point data to generate the double-precision vector used as the source operand, with the vector length specified by the user; convert the double-precision values to hexadecimal to form the source operand vector, and specify align 32 as the alignment of the data segment.

b. Destination operand vector. Reserve storage for the destination operand according to the input vector length, to serve as the destination operand vector; the alignment of the data segment is align 32.

c. Correct result vector. For test stimuli used for verification, a correct-result data segment must be created that holds the precomputed correct results; it is read by the result comparison code segments of the verification stimulus (created in step 3.4).

d. Create the lock variable. The lock variable ensures that only one thread updates the thread count variable at a time. In this embodiment, a 64-bit integer lock variable is created in the shared data area with an initial value of 0 and an alignment of align 8. A value of 0 means unlocked and a value of 1 means locked. At run time, each thread loops, using a compare-and-swap instruction to read the lock variable and test the lock state; a value of 0 means that no thread is using the shared data. The thread that acquires the lock first changes the lock variable to 1, then performs its read and write operations on the shared data, and when these are complete uses a compare-and-swap instruction to restore the lock variable to 0.

e. Create the thread count variable. The thread count variable is used to determine, and guarantee, that the main thread continues with subsequent operations only after all threads have completed their own operations. In this embodiment, a 64-bit integer thread count variable is created in the shared data area with an initial value of 0 and an alignment of align 8. At run time, a thread that holds the lock and has completed its vector operation increments the thread count variable by 1; when the main thread finds that the thread count equals the total number of threads, it concludes that all threads have completed their vector operations, reports, and exits.

4. Use the assembly language file obtained in step 3 as the automatically generated evaluation stimulus for the multi-thread interleaved double-precision short-vector architecture.

In the above steps, the order in which the vector operation program segment, the verification structure, and the synchronization structure of the evaluation stimulus are written into foo.s is not fixed. They may be written in the order of the steps above, or in the order shown in FIG. 2, in which the three structures of the main thread are written first and the corresponding structures of each slave thread are then written one by one. As shown in FIG. 3, the evaluation stimulus generated automatically in this embodiment executes as follows:

（1）多线程执行环境初始化。 (1) Multi-thread execution environment initialization.

（2）读取硬件线程相应的线程号寄存器，根据不同的线程号跳转到主线程或从线程的代码入口处；若线程号为0，则跳至步骤（3）转到主线程入口；否则跳到步骤（5），转入相应的从线程入口。 (2) Read the corresponding thread number register of the hardware thread, and jump to the code entry of the main thread or the slave thread according to different thread numbers; if the thread number is 0, skip to step (3) and go to the main thread entry; Otherwise, skip to step (5) and switch to the corresponding slave thread entry.

（3）主线程根据线程号和计算任务分配，完成双精度短向量操作；之后执行获取锁变量的代码，如果获得了锁变量，则将线程计数器加1，然后释放锁变量。 (3) The main thread completes the double-precision short vector operation according to the thread number and the calculation task allocation; then executes the code to acquire the lock variable, if the lock variable is acquired, the thread counter is incremented by 1, and then the lock variable is released.

（4）读取线程计数器，判断线程计数器与总线程数是否相等，若相等则表示所有线程均完成了向量操作，则程序终止，回复系统状态并退出；若二者不相等，则循环执行本步骤直至二者相等。 (4) Read the thread counter and judge whether the thread counter is equal to the total number of threads. If they are equal, it means that all threads have completed the vector operation, then the program terminates, returns to the system state and exits; if the two are not equal, execute this program in a loop steps until the two are equal.

(5) The corresponding slave thread completes its double-precision short-vector operation according to its thread number and assigned computing task. It then executes the code that acquires the lock variable; if the lock variable is acquired, it increments the thread counter by 1, releases the lock variable, and enters a busy-wait state.

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions that follow the idea of the present invention fall within its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for automatically constructing an evaluation stimulus for a multi-thread interleaved double-precision short-vector structure, characterized by comprising the following steps:

(1) Input the vector operation type and vector length to be evaluated;

(2) Automatically create an empty assembly language file to serve as the evaluation stimulus;

(3) Write the following content to the assembly language file:

(3.1) A multi-thread run initialization code segment;

(3.2) An evaluation stimulus control structure, including: a code segment that starts the multi-thread execution mode, used to set the multi-thread enable register so that the processor enters the multi-thread working state; and a thread selection and jump code segment, used to read each thread's private thread number register and jump to the corresponding thread according to the thread number;

(3.3) A multi-thread vector operation program segment, including: a main thread vector operation code segment, used to allocate the computing tasks of each thread, compute the operand start addresses and vector lengths, read the source operand vector and the destination operand vector, and perform short-vector operations in a loop; and a slave thread vector operation code segment, used to read the source operand vector and the destination operand vector and perform short-vector operations;

(3.4) An evaluation stimulus synchronization structure, including: a main thread synchronization code segment, used to check and wait until all threads have completed their vector operations; and a slave thread synchronization code segment, used to indicate that the slave thread has completed its vector operation;

(3.5) A data segment and data segment initialization statements, wherein the data segment is a data segment shared by multiple threads and contains a source operand vector and a destination operand vector shared by the threads;

(4) Use the assembly language file obtained in step (3) as the automatically generated evaluation stimulus for the multi-thread interleaved double-precision short-vector structure.
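As an illustration only, steps (1) to (4) can be read as a small generator driver. The sketch below assumes a hypothetical `#` comment syntax for the target assembler and emits section markers instead of real instructions, because the instruction set is not specified here; `SECTIONS`, `build_stimulus`, and `write_stimulus` are illustrative names and not part of the claimed method.

```python
# Sections written in step (3), in the order listed in the claim.
SECTIONS = [
    "multi-thread run initialization code segment",
    "evaluation stimulus control structure",
    "multi-thread vector operation program segment",
    "evaluation stimulus synchronization structure",
    "data segment and data segment initialization statements",
]


def build_stimulus(op_type, vector_length):
    """Return the text of the stimulus file, with one placeholder marker per section."""
    lines = [f"# evaluation stimulus: op={op_type}, length={vector_length}"]
    for title in SECTIONS:
        lines.append(f"# ---- {title} ----")
    return "\n".join(lines) + "\n"


def write_stimulus(path, op_type, vector_length):
    """Create an empty file (step 2), then write the content into it (step 3)."""
    with open(path, "w") as handle:
        handle.write("")
    with open(path, "a") as handle:
        handle.write(build_stimulus(op_type, vector_length))
```

A real generator would replace each marker with the instructions of the corresponding segment for the processor under evaluation.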

2. The method for automatically constructing an evaluation stimulus for a multi-thread interleaved double-precision short-vector structure according to claim 1, characterized in that, in step (3.3), the multi-thread vector operation program segment is created by the following steps:

(3.3.1) According to the input vector operation type and vector length, allocate the computing task of each thread and determine the start position and length of the vector on which each thread operates;

(3.3.2) Each thread computes the source operand address and the loop count according to its thread number and the vector length, and sets the base address register and the loop count register;

(3.3.3) Each thread calculates the destination operand address according to the thread number, and sets the destination operand base address register;

(3.3.4) According to the computing task of each thread, assembly instructions for vector reading, operation, or result write-back are inserted into the text segment of the assembly language program to form the main thread vector operation code segment and the slave thread vector operation code segment.
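The task allocation and address calculation in steps (3.3.1) to (3.3.3) might look like the following sketch. Several details are assumptions rather than part of the claim: the vector is split into contiguous blocks with the remainder spread over the first threads, each element is an 8-byte double, and one short-vector instruction is assumed to process 4 doubles (`lanes=4`). The function names are hypothetical.

```python
def partition(vector_length, num_threads):
    """Return one (start, length) pair per thread, covering the vector contiguously."""
    base, extra = divmod(vector_length, num_threads)
    tasks = []
    start = 0
    for tid in range(num_threads):
        # Assumed policy: the first `extra` threads take one additional element.
        length = base + (1 if tid < extra else 0)
        tasks.append((start, length))
        start += length
    return tasks


def operand_addresses(src_base, dst_base, start, elem_size=8):
    """Return the source and destination base addresses for a thread's slice."""
    offset = start * elem_size
    return src_base + offset, dst_base + offset


def loop_count(length, lanes=4):
    """Return the number of short-vector iterations needed to cover `length` elements."""
    return -(-length // lanes)  # ceiling division
```

For example, `partition(10, 4)` gives the four threads slices of 3, 3, 2, and 2 elements; the generator would load the resulting addresses and loop counts into each thread's base address and loop count registers.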

3. The method for automatically constructing an evaluation stimulus for a multi-thread interleaved double-precision short-vector structure according to claim 2, characterized in that, in step (3.5), the shared data segment is constructed by the following steps:

(3.5.1) Use a random number generator for double-precision floating-point data to generate a double-precision vector to serve as the source operand, the length of the vector being specified by the user; convert the double-precision data in the vector to hexadecimal form to serve as the source operand vector;

(3.5.2) Reserve the destination operand storage space according to the input vector length as the destination operand vector.
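A minimal sketch of steps (3.5.1) and (3.5.2) follows. It assumes IEEE 754 binary64 encoding for the hexadecimal form and borrows GNU-assembler-style `.quad` and `.space` directives and `#` comments purely for illustration; the actual directives depend on the target assembler, and the value range and seed are arbitrary choices.

```python
import random
import struct


def double_to_hex(value):
    """Return the IEEE 754 binary64 bit pattern of `value` as 16 hex digits."""
    return struct.pack(">d", value).hex()


def build_data_segment(vector_length, seed=0):
    """Return the random source vector and the text of an illustrative data segment."""
    rng = random.Random(seed)  # seeded so that a stimulus can be regenerated
    source = [rng.uniform(-1.0e3, 1.0e3) for _ in range(vector_length)]
    lines = ["# source operand vector"]
    lines += [f".quad 0x{double_to_hex(value)}" for value in source]
    lines.append("# destination operand vector")
    lines.append(f".space {8 * vector_length}")  # 8 bytes reserved per double
    return source, "\n".join(lines)
```

Keeping the generated source values alongside the text makes it possible to compute the expected results on the host for later comparison.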

4. The method for automatically constructing an evaluation stimulus for a multi-thread interleaved double-precision short-vector structure according to claim 3, characterized in that the data segment further comprises a lock variable and a thread count variable used by the evaluation stimulus synchronization structure; through the lock variable, the synchronization structure ensures that only one thread at a time updates the thread count variable, and through the thread count variable, it checks and ensures that all threads have completed their respective operations before the main thread continues with subsequent operations.

5. The method for automatically constructing an evaluation stimulus for a multi-thread interleaved double-precision short-vector structure according to claim 1, 2, 3, or 4, characterized in that, after step (3.4) is completed, the following are also written into the assembly language file: a main thread result comparison code segment for verifying the correctness of the main thread's vector operation results, a slave thread result comparison code segment for verifying the correctness of the slave threads' vector operation results, and an error reporting code segment for the result comparison; and in step (3.5), the data segment further includes a vector of correct results that is read by the main thread result comparison code segment and the slave thread result comparison code segment.