WO2022111703A1 - Method, device and system for obtaining hardware performance data - Google Patents

Method, device and system for obtaining hardware performance data

Info

Publication number
WO2022111703A1
WO2022111703A1 (PCT/CN2021/134128)
Authority
WO
WIPO (PCT)
Prior art keywords
performance, performance data, data, hardware, hardware performance
Prior art date
Application number
PCT/CN2021/134128
Other languages
English (en)
French (fr)
Inventor
王石雨
许金超
Original Assignee
中科寒武纪科技股份有限公司 (Cambricon Technologies Corporation Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司 (Cambricon Technologies Corporation Limited)
Publication of WO2022111703A1 publication Critical patent/WO2022111703A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44505 Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451 User profiles; Roaming

Definitions

  • the present disclosure relates generally to the field of computers. More particularly, the present disclosure relates to methods, apparatus, compilers, heterogeneous systems, and computer-readable storage media for obtaining hardware performance data for object code.
  • performance is usually one of the metrics of greatest concern.
  • different implementations can differ greatly in performance. Therefore, during programming, software developers often use performance analysis tools to view the performance data of the code during its actual execution on hardware, such as cache misses ("cache miss") or branch misses ("branch miss"), as a reference for optimizing program performance.
  • a sampling-based approach to obtaining performance data has several drawbacks.
  • Secondly, such a sampling method cannot use the intermediate representation information of the program at the compilation stage, and it is difficult to obtain the performance data of a specific program segment in a targeted manner, such as the performance data of a specific function call.
  • the present disclosure proposes to insert instructions into the target code to effectively obtain hardware performance data, so as to accurately judge the performance of the target code, allowing program developers to optimize the code in a targeted manner to improve code performance.
  • the present disclosure discloses a method for obtaining hardware performance data of target code, comprising: inserting a performance read instruction for obtaining hardware performance data into the target code; and compiling the performance read instruction and the target code to generate an executable program, wherein the hardware performance data related to the execution of the target code is obtained by running the executable program.
  • the present disclosure discloses an apparatus for obtaining hardware performance data of object code, comprising: at least one processor; and at least one memory for storing computer program instructions which, when executed by the at least one processor, cause the apparatus to perform the aforementioned method and the various embodiments thereof described later.
  • the present disclosure discloses a computer-readable storage medium storing computer program instructions for obtaining hardware performance data of object code which, when executed by at least one processor, implement the foregoing method and the various embodiments thereof described later.
  • the present disclosure discloses a compiler for obtaining hardware performance data of object code, comprising: an insertion module configured to insert performance read instructions into the object code, wherein the performance read instructions are used to obtain hardware performance data; and a compilation module configured to compile the performance read instructions and the object code to generate an executable program, wherein execution of the executable program causes acquisition of the hardware performance data related to the execution of the object code.
  • the present disclosure discloses a heterogeneous system for obtaining hardware performance data of object code, including an interconnected host and device, wherein: the host includes the aforementioned compiler and is configured to perform master control and cooperative operations on the device to obtain the hardware performance data generated when the device executes the executable program; and the device includes one or more processor cores and is configured to: execute the executable program generated by the compiler; and send the hardware performance data to the host.
  • the present disclosure discloses a compiler configured to perform the above-described method and various embodiments thereof described later.
  • the present disclosure discloses a board including the above-described heterogeneous system and various embodiments thereof described later.
  • the solutions described above in the present disclosure provide users with higher degrees of freedom and flexibility.
  • a user can view any code fragment, and the collected hardware performance data will not be inaccurate due to low execution frequency or short execution time of the code fragment.
  • the solution of the present disclosure supports both manual and automatic insertion of performance reading instructions and also supports function-level insertion, thereby providing users with more testing methods. Based on this, users can not only view the hardware performance data of any interesting code segment, but also obtain function-level hardware performance data through compilation options during the compilation phase.
  • the solutions of the present disclosure can also be extended to obtain hardware performance data for specific operations such as data copying.
  • FIG. 1 is a simplified flowchart illustrating a method for obtaining hardware performance data according to an embodiment of the present disclosure
  • FIG. 2 is a detailed flowchart illustrating a method for obtaining hardware performance data according to an embodiment of the present disclosure
  • FIG. 3 is a flowchart illustrating a backend compilation process according to an embodiment of the present disclosure
  • FIG. 4 is a schematic block diagram illustrating a compiler for obtaining hardware performance data according to an embodiment of the present disclosure
  • FIG. 5 is a flowchart illustrating a calling performance read function according to an embodiment of the present disclosure
  • FIG. 6 is a schematic architecture diagram illustrating a heterogeneous system for acquiring hardware performance data according to an embodiment of the present disclosure
  • FIG. 7 is a block diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram illustrating a board according to an embodiment of the present disclosure.
  • the solution of the present disclosure proposes to insert performance read instructions into the target code (or the user program to be tested), and to compile the target code and the performance read instructions together to obtain an executable program for accurately obtaining hardware performance data.
  • when the solution of the present disclosure is applied to a heterogeneous system composed of a host ("host") and a device ("device"), the above executable program is obtained by performing a compilation process on the host side, and by executing the executable program on the device side, the solution of the present disclosure can accurately obtain the relevant hardware performance data when the target code runs on the device side.
  • software developers can make preliminary judgments about the performance of the target code, and make further program optimizations when necessary.
  • FIG. 1 is a simplified flowchart illustrating a method 100 for obtaining hardware performance data in accordance with an embodiment of the present disclosure.
  • at step S102, a performance read instruction for acquiring hardware performance data is inserted into the target code.
  • the target code of the present disclosure may be a program segment scheduled for performance testing or a code segment that is of interest to a user.
  • the hardware performance data of the present disclosure may include, for example, hardware-related data on program execution, such as cache misses ("cache miss") or branch misses ("branch miss"), obtained from multiple count registers of a special-purpose processor (e.g., a graphics processor GPU or an artificial intelligence processor) or a general-purpose processor (e.g., a general-purpose CPU).
  • the inserting step in step S102 may be inserting the performance read instruction by receiving user input.
  • the solution of the present disclosure can insert a performance reading instruction for at least one code segment (for example, a code segment of interest to the user) according to the user input.
  • the insertion step in step S102 may also automatically insert the performance read instruction by a compiler (described in detail later in conjunction with FIG. 4 ).
  • the aforementioned operation of inserting a performance read instruction can also be completed by means of function-level instrumentation.
  • the insert operation of the present disclosure may be to insert the performance read instruction for at least one function to obtain corresponding hardware performance data generated when one or more functions are executed on hardware.
  • options for manual insertion and automatic insertion may also be provided, so that the user may select either one according to preference.
  • the performance read instruction of the present disclosure may be an assembly instruction, and the assembly instruction includes one or more parameters within the assembly instruction for identifying the function name of the object code.
  • the performance read instruction of the present disclosure may be expressed as "readperf.begin/end", where the target code may be located between the start instruction “readperf.begin” and the end instruction “readperf.end”.
  • “readperf.begin” and “readperf.end” can be inserted at the beginning and end of the object code, respectively, so that the hardware performance data associated with the object code can be obtained when the performance read instruction is subsequently executed.
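As a toy illustration of this placement, the sketch below wraps a target code segment, modeled as a list of instruction text lines, with the begin/end markers. The `readperf.begin`/`readperf.end` mnemonics come from the disclosure; the list-of-lines representation, the helper name, and the `func_id` parameter are illustrative assumptions, not the patent's implementation.

```python
def instrument_segment(asm_lines, start, end, func_id=0):
    """Wrap asm_lines[start:end] with the begin/end performance read markers.

    Inserting the end marker first keeps the `start` index valid.
    """
    body = list(asm_lines)
    body.insert(end, f"readperf.end {func_id}")
    body.insert(start, f"readperf.begin {func_id}")
    return body

# Hypothetical target code fragment a user wants to measure.
code = ["load r0, [r1]", "add r0, r0, 1", "store [r1], r0"]
print(instrument_segment(code, 0, 3))
```

Running the sketch shows the whole fragment enclosed between the two markers, which mirrors the placement described above.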
  • the parameters included in the performance read instruction to identify the function name of the object code may have different bit widths.
  • a 64-bit integer parameter can be used to represent the function name.
  • the function name can be represented by two 32-bit integer parameters.
  • the function name may also be compressed through a compression algorithm (eg, a hash algorithm) to obtain the aforementioned one or more parameters representing the function name.
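A minimal sketch of this encoding is shown below: a function name is compressed into two 32-bit integer parameters. The idea of carrying the name as a 64-bit value, or as two 32-bit values derived by a compression (hash) algorithm, is from the disclosure; the CRC32-based digest used here is purely an illustrative stand-in for whatever hash the implementation actually uses.

```python
import zlib

def encode_func_name(name: str):
    """Compress a function name into two 32-bit integer parameters.

    Illustrative scheme: build a 64-bit digest from two CRC32 values,
    then split it into high and low 32-bit halves.
    """
    data = name.encode("utf-8")
    h64 = (zlib.crc32(data) << 32) | zlib.crc32(data[::-1])
    hi, lo = (h64 >> 32) & 0xFFFFFFFF, h64 & 0xFFFFFFFF
    return hi, lo

hi, lo = encode_func_name("my_kernel")  # "my_kernel" is a hypothetical name
print(hi, lo)
```

Both halves fit in a 32-bit instruction parameter, and the encoding is deterministic, so the runtime can later match records back to function names.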
  • at step S104, the method 100 compiles the performance read instruction and the object code to generate an executable program, wherein by running the executable program, hardware performance data related to the execution of the object code can be obtained.
  • for different compilers, the methods of generating the aforementioned object ("object") files may differ, so the solution of the present disclosure does not impose any restrictions on the compiler.
  • the object file can be linked, that is, the code in the object file can be linked to form an executable file to be executed by the processor.
  • performance enable instructions for enabling hardware performance counters may be placed for object code into which performance read instructions have been inserted.
  • the performance enable instruction enables relevant performance counters within the processor to count relevant events before executing the target code.
  • hardware performance data may be obtained from dedicated registers of the performance monitoring unit of the processor core (acting as counters).
  • the aforementioned one or more special registers may be used for counting and/or accumulating events that occur in the processor core (eg, cache or branch hits or misses, etc.).
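The enable/count/read pattern described above can be modeled in software. Real PMU counters are hardware special registers; the toy class below is only a stand-in that illustrates why the enable instruction must run before the target code: events that occur before enabling are never counted.

```python
class PerfCounter:
    """Toy software model of a PMU count register."""

    def __init__(self):
        self.enabled = False
        self.count = 0

    def enable(self):
        # Corresponds to the performance enable instruction: arm and reset.
        self.enabled = True
        self.count = 0

    def event(self):
        # E.g., a cache miss or branch miss observed by the hardware.
        if self.enabled:
            self.count += 1

    def read(self):
        # Corresponds to the performance read instruction.
        return self.count

cache_miss = PerfCounter()
cache_miss.event()        # ignored: counter not yet enabled
cache_miss.enable()
for _ in range(3):
    cache_miss.event()
print(cache_miss.read())  # prints 3
```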
  • the compiling process of the present disclosure includes executing a first compiling task and a second compiling task, wherein in the first compiling task, the target code into which the performance read instruction has been inserted may be compiled to generate assembly code.
  • in the second compiling task, the assembly code is compiled into an object file, and the above-mentioned performance enable instruction may be placed in the assembly code according to the performance read instruction present in it.
  • a corresponding performance reading function may also be generated according to the performance reading instruction, so as to convert the execution of the performance reading instruction into the execution of the performance reading function.
  • a performance read function (e.g., "readperf") can be generated for the region enclosed by the aforementioned "readperf.begin/end", so that the read operation of performance data can be performed by calling this function.
  • the method of the present disclosure may further include: identifying the processor core, and generating a data export instruction corresponding to that identification, for exporting the hardware performance data.
  • when the scheme of the present disclosure is applied to a device ("device") that supports heterogeneity, such as when the device has an intelligent processing unit ("Intelligence Processing Unit", abbreviated as "IPU") and a memory processing unit ("Memory Processing Unit", abbreviated as "MPU"), the hardware performance data can be transferred from the device to the outside, for example to the host ("host"), along different data transfer paths through the data export instruction.
  • the method for obtaining hardware performance data of an object code of the present disclosure has been described in detail above with reference to FIG. 1 .
  • the method 200 which may be a specific implementation manner of the method 100 shown in FIG. 1 , will be described in detail below with reference to FIG. 2 .
  • FIG. 2 is a detailed flowchart illustrating a method 200 for obtaining hardware performance data according to an embodiment of the present disclosure. Since the method 200 can be regarded as an embodiment of the method 100 shown in FIG. 1 , the descriptions made in conjunction with FIG. 1 are also applicable to the method 200 , and the same technical content will not be repeated hereinafter.
  • the method 200 may receive a user program written or input by the user, wherein the user program may include the aforementioned target code, which includes one or more code fragments of interest to the user.
  • the method 200 may determine whether to perform a function-level automatic instrumentation operation on the object code in the user program.
  • some code may be inserted or modified at certain positions in the target code (that is, an "instrumentation" operation), so as to obtain certain program status during the running of the target code and analyze it.
  • function-level instrumentation inserts or modifies code within a function. According to the solution of the present disclosure, the acquisition of hardware performance data at the kernel function ("kernel") level can be achieved through the instrumentation operation.
  • the method 200 may automatically insert the performance read instruction of the present disclosure, for example, automatically insert the assembly instruction "readperf".
  • the method 200 may receive a user's manual input to insert a performance read instruction in the object code.
  • the automatic input or manual input may be implemented by setting a compilation option of the compiler.
  • at step S210, the method 200 may perform a front-end compilation operation, that is, perform the first compilation task described in conjunction with FIG. 1 .
  • the assembly code can be obtained at step S212.
  • at step S214, the method 200 performs a back-end compilation operation, that is, performs the second compilation task described in conjunction with FIG. 1 .
  • an object (“object”) file including an instruction segment and a data segment may be generated in step S216.
  • the method 200 may perform a linking operation to collect and combine the different parts of code and data into a single entity, for example a load module or executable file, which can be loaded and executed directly, for example, by the operating system on the device. Finally, through the linking operation, an executable program can be obtained at step S220. As mentioned above, by running the executable program, the solution of the present disclosure can obtain hardware performance data related to the execution of the object code.
  • the solution for obtaining hardware performance data of the present disclosure has been described in detail above with reference to FIG. 1 and FIG. 2 .
  • the user can view any code fragment of interest, so that the obtained hardware performance data will not be inaccurate due to the low execution frequency or short execution time of the code fragment.
  • the disclosed solution for acquiring hardware performance data offers a higher degree of freedom and flexibility in use.
  • the solution of the present disclosure also supports the selection of compilation options in the compilation phase to achieve effective acquisition of hardware performance data at the function level.
  • hardware performance data related to the data copying operation can also be obtained.
  • FIG. 3 is a flowchart illustrating a backend compilation process 300 according to an embodiment of the present disclosure. It can be understood that the back-end compilation process 300 is a further refinement of the second compilation task described in FIG. 1 and step S214 shown in FIG. 2 , so the foregoing descriptions about the second compilation task and step S214 are also applicable to Backend compilation process 300 .
  • the process 300 obtains assembly code, such as the assembly code output at step S212 of FIG. 2 .
  • the process 300 may determine whether the assembly code includes a performance read instruction of the present disclosure, such as an assembly instruction readperf.
  • a performance enable instruction "perf_start" may be inserted, which is used to enable one or more hardware performance counters ("Performance Monitoring Counter", PMC for short) on a hardware performance monitoring unit ("Performance Monitoring Unit", PMU for short).
  • the process 300 can simultaneously generate a performance read function "readperf", which can be regarded as a built-in function of the compiler, so that at step S310 the process 300 can reduce the performance read instruction "readperf" in the assembly code to a call to the performance read function, which can be represented as "call readperf" as shown in FIG. 3 .
  • the performance read function "readperf" of the present disclosure may contain binary instructions for obtaining hardware performance-related data during program execution, such as binary instructions for obtaining cache misses ("cache miss") or branch misses ("branch miss").
  • the solution of the present disclosure can simplify the repetitive compilation process of the code by calling the built-in performance reading function, thereby improving the compilation efficiency of the code.
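The back-end decision sketched in FIG. 3 can be summarized as a small text-level pass: if the assembly contains the performance read pseudo-instruction, insert the enable instruction and lower each occurrence to a call; otherwise compile conventionally. Treating assembly as a list of mnemonic strings and placing `perf_start` at the top are illustrative simplifications, not the patent's actual code generation.

```python
def lower_readperf(asm_lines):
    """Sketch of the FIG. 3 back-end pass on textual assembly lines."""
    if not any(line.startswith("readperf") for line in asm_lines):
        # No performance read instruction: conventional compilation path.
        return list(asm_lines)
    lowered = ["perf_start"]  # enable counters before the target code runs
    for line in asm_lines:
        # Reduce each readperf pseudo-instruction to a built-in function call.
        lowered.append("call readperf" if line.startswith("readperf") else line)
    return lowered

print(lower_readperf(["readperf.begin 0", "add r0, r0, 1", "readperf.end 0"]))
```

Because the lowering targets one shared built-in function, every instrumented site compiles to the same short call sequence, which matches the stated goal of avoiding repeated compilation of the read logic.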
  • an object (“object”) file can be obtained at step S312 , that is, the object file generated at step S216 in FIG. 2 .
  • a conventional compilation operation can be performed on the assembly code, thereby obtaining at step S312 a conventional object file, that is, one without the function of obtaining hardware performance data.
  • the compilation process of the present disclosure is described in detail above with reference to FIGS. 1-3 . It can be understood that the compilation process of the present disclosure (including at least instruction insertion, compilation, and executable file generation) can be implemented by a compiler. To this end, a compiler capable of implementing the solution of the present disclosure will be described below with reference to FIG. 4 .
  • FIG. 4 is a schematic block diagram illustrating a compiler 400 for obtaining hardware performance data according to an embodiment of the present disclosure.
  • the compiler 400 of the present disclosure includes an insertion module 402 that is configured to insert performance read instructions for object code, wherein the performance read instructions are used to obtain hardware performance data.
  • the compiler 400 also includes a compiling module 404 configured to compile the performance read instructions and the object code to generate an executable program, wherein execution of the executable program causes acquisition of hardware performance data relevant to the execution of the object code.
  • the compilation module 404 may include a front-end compilation unit 404-1 and a back-end compilation unit 404-2, wherein the front-end compilation unit may be configured to perform the first compilation task as described in conjunction with FIGS. 1 and 2 , and the back-end compilation unit may be configured to perform the second compilation task as described in conjunction with FIG. 1 , FIG. 2 and FIG. 3 .
  • the compiler 400 of the present disclosure can be configured to execute the method steps described above in conjunction with FIGS. 1-3 , thereby generating an executable program that can be used to obtain hardware performance data. Since the foregoing descriptions with respect to FIG. 1 to FIG. 3 are also applicable to the operations performed by the compiler, the operations of the compiler 400 will not be repeated here.
  • the compiler of the present disclosure may be implemented directly as a front-end compilation unit (or module) and a back-end compilation unit (or module), where the front-end compilation unit may be used to receive automatically or manually inserted performance read instructions, and to compile the object code into assembly code.
  • the back-end compiling unit can be used to compile the assembly code output by the front-end compiling unit to generate an object file, which is then linked to generate the executable program of the present disclosure for obtaining hardware performance data.
  • the compiler disclosed in the present disclosure may be configured to perform the operations described in connection with FIGS. 1-4 to generate the executable program.
  • FIG. 5 is a flowchart illustrating a process 500 of invoking a performance read function in accordance with an embodiment of the present disclosure.
  • the call to the built-in performance read function may be performed by the compiler (or the aforementioned backend compilation unit) in the second compilation task, thereby implementing the process 500 shown here.
  • the process starts at step S502. At step S504, when the hash value of the function name is included in the performance read instruction, the corresponding hash value is written (or recorded) in the performance read function, which is a built-in function.
  • a zero is written in the performance read function if the corresponding hash value does not exist.
  • step S504 is an optional step, and in some application scenarios, the function hash value may not be recorded.
  • at step S506, the identifier (ID) (or type) of the processor core is determined; for example, in the figure it is determined whether the processor core executing the executable program is an MPU. According to the judgment, a data export instruction corresponding to the identifier can be generated, that is, an MPU data export instruction for the MPU or an IPU data export instruction for the IPU, wherein different data export instructions indicate different ways of migrating hardware performance data from the device to the host.
  • when it is determined to be an MPU at step S506, the compiler generates an MPU data export instruction that causes the operations shown in steps S508, S510 and S512 to be performed.
  • when the MPU data export instruction is executed on the device side, the hardware performance data can be read from the special registers ("spr") of the PMU of the MPU at step S508, and written (or passed) from the special registers to general-purpose registers ("gpr").
  • at steps S510 and S512, the hardware performance data may be written from the gpr to static random access memory ("sram"), and via it to global dynamic random access memory ("gdram"), which can be a type of double data rate synchronous dynamic random access memory ("DDR SDRAM").
  • when it is determined at step S506 that the identifier of the processor core is not an MPU (i.e., it is an IPU), the compiler generates an IPU data export instruction that causes the operations shown in steps S516, S518 and S520 to be performed.
  • when the IPU data export instruction is executed on the device side, the hardware performance data can be read from the special registers ("spr") of the PMU of the IPU at step S516, and written (or passed) from the special registers to general-purpose registers ("gpr").
  • at steps S518 and S520, hardware performance data may be written from the gpr to non-volatile random access memory ("nram"), and via it to global dynamic random access memory ("gdram").
  • the process 500 ends at step S514 after the performance read function is called and the corresponding data export instruction is generated.
  • the executable program of the present disclosure can obtain hardware performance data from the device in different data migration modes based on the data derivation instructions obtained by compiling the performance read function described above.
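The two migration paths of FIG. 5 can be summarized in a small lookup keyed by core type. The register and memory names (spr, gpr, sram, nram, gdram) are taken from the disclosure; modeling each path as an ordered list of hop names is an illustrative simplification.

```python
def export_path(core_type: str):
    """Return the data migration hops FIG. 5 associates with each core type."""
    if core_type == "MPU":
        # S508 -> S510 -> S512: special regs -> general regs -> sram -> gdram
        return ["spr", "gpr", "sram", "gdram"]
    if core_type == "IPU":
        # S516 -> S518 -> S520: special regs -> general regs -> nram -> gdram
        return ["spr", "gpr", "nram", "gdram"]
    raise ValueError(f"unknown core type: {core_type}")

print(export_path("MPU"))
print(export_path("IPU"))
```

The two paths differ only in the intermediate on-chip memory (sram for the MPU, nram for the IPU); both end at gdram, from which the host retrieves the data.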
  • the solution of the present disclosure can acquire the data of a single hardware PMU at a certain moment, and present it to the user through a function interface with a return value.
  • the solution of the present disclosure can obtain data of PMUs of all hardware. In this case, it can be implemented as a function stub with no return value, where hardware performance data can be obtained by generating files at runtime ("runtime").
  • FIG. 6 is a schematic architectural diagram illustrating a heterogeneous system 600 for acquiring hardware performance data according to an embodiment of the present disclosure.
  • the heterogeneous system 600 includes interconnected hosts 602 and devices 604 .
  • the host is configured to perform master control and cooperative operations on the device to obtain hardware performance data generated when the device executes the executable program of the present disclosure.
  • the host and the device interact through a bidirectional transmission path 606 . More specifically, the two can exchange hardware performance data over the transmission path through the runtime_api module shown in the figure and the driver (e.g., "driver_api", not shown in the figure). Alternatively, the two can exchange hardware performance data only through the driver.
  • the host 602 may include the compiler described above and be configured to perform the compilation operations as described in connection with FIGS. 1-5 to generate the executable program.
  • the device includes one or more processor cores, such as IPU 610 and MPU 608 as shown in the figure.
  • the host can make the device execute the executable program and, before or during execution, provide the device through the runtime (the runtime_api module in the figure) with the address of the target GDRAM where the hardware performance data is to be stored (i.e., storage address information) and/or the storage space size of the hardware performance data (i.e., storage space information).
  • the driver can determine the actual storage location of the hardware performance data on the GDRAM according to the above-mentioned information such as the address and storage space size of the target GDRAM. Meanwhile, the driver can copy the compiled executable program to the device, so that the device can execute the executable program.
  • when the device executes the executable program, it will execute the MPU or IPU data export instructions as discussed in connection with FIG. 5 to export and transfer the acquired hardware performance data to the aforementioned GDRAM address. After that, the host can obtain the hardware performance data from that GDRAM address through the "driver_api" and runtime_api modules.
  • the present disclosure also proposes to perform various transformations on the obtained hardware performance data for subsequent processing of the hardware performance data.
  • the runtime_api module can convert the collected hardware performance data into a binary file.
  • the host may utilize another parsing module (“cnperf-cli” as shown in the figure) to parse the aforementioned binary file into a file in .csv format for subsequent processing and/or visual display.
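A sketch of such a parsing step is shown below. The disclosure only says that the runtime produces a binary file which a parsing tool (e.g., "cnperf-cli") turns into .csv; the record layout used here, little-endian (function_hash: u64, cache_miss: u32, branch_miss: u32), is an invented example for illustration, not the patent's actual file format.

```python
import csv
import io
import struct

def perf_binary_to_csv(blob: bytes) -> str:
    """Parse a hypothetical binary dump of performance records into CSV text."""
    record = struct.Struct("<QII")  # assumed layout: u64 hash, u32 miss counts
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["function_hash", "cache_miss", "branch_miss"])
    for offset in range(0, len(blob), record.size):
        writer.writerow(record.unpack_from(blob, offset))
    return out.getvalue()

# One fabricated record: hash 0xDEADBEEF, 12 cache misses, 3 branch misses.
blob = struct.pack("<QII", 0xDEADBEEF, 12, 3)
print(perf_binary_to_csv(blob))
```

The resulting .csv text can then be loaded by spreadsheet or visualization tools for the subsequent processing the disclosure mentions.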
  • the method of the present disclosure may further insert a performance reading instruction into the executable program generated after compilation, and the inserted performance reading instruction may be a binary instruction executable by the device.
  • the driver running on the host can obtain an executable program generated after compilation, and the executable program includes at least one code fragment.
  • the driver can insert a binary instruction (that is, a performance read instruction) for obtaining performance data into the executable program; the insertion position can be the start and end positions of the target code segment, or the positions before and after a specific function, and is not specifically limited here.
  • the driver can copy the new executable program after inserting the performance reading instruction to the device, so as to obtain corresponding performance data by executing the new executable program by the device.
  • the executable program generated after compilation includes at least one null instruction (NOP instruction), and the at least one null instruction may be at the start position and end position of the target code segment, or may be at the position before and after a specific function.
  • the driver program can replace the null instruction at a specific position in the above executable program with a performance read instruction.
  • the user can also insert at least one performance reading instruction into the executable program according to the executable program generated after compilation.
  • the executable program includes at least one null instruction (NOP instruction), and the at least one null instruction may be at the start position and end position of the target code segment, or may be at the position before and after a specific function.
  • Users can replace the null instructions before and after a code fragment of interest or a specific function with binary performance read instructions to obtain a new executable program.
  • the new executable program can be delivered to the device through the driver. Thereafter, the device executes the new executable program and obtains corresponding performance data.
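The NOP-replacement flow above can be sketched in a few lines. The 4-byte instruction encodings below are invented for illustration; a real device ISA defines its own opcodes, and the offsets of the placeholder slots would come from the compiler's symbol information.

```python
# Hypothetical 4-byte encodings for a NOP placeholder and a
# "read performance counters" instruction (illustrative only).
NOP = b"\x00\x00\x00\x00"
READPERF = b"\x0f\x31\x00\x00"

def patch_nops(image: bytes, offsets: list) -> bytes:
    """Return a new executable image with READPERF written over each
    NOP placeholder slot, refusing to patch anything else."""
    buf = bytearray(image)
    for off in offsets:
        if buf[off:off + 4] != NOP:
            raise ValueError(f"no NOP placeholder at offset {off}")
        buf[off:off + 4] = READPERF
    return bytes(buf)

# e.g. placeholders at the start and end of a code fragment of interest:
image = b"\xaa" * 4 + NOP + b"\xbb" * 8 + NOP
patched = patch_nops(image, [4, 16])
```

The patched image is then delivered to the device unchanged except for the two slots, which is what lets this step happen after compilation without recompiling.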
  • FIG. 7 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure.
  • the combined processing apparatus may be implemented on the apparatus as previously described in this disclosure.
  • the combined processing device 700 includes a computing processing device 702 , an interface device 704 , other processing devices 706 and a storage device 708 .
  • one or more computing devices 710 may be included in the computing processing device, and the computing devices may be configured to perform various computing operations, such as various computing operations involved in machine learning in the field of artificial intelligence.
  • the computing processing devices of the present disclosure may be configured to perform user-specified operations.
  • the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor, including an IPU as previously described.
  • one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core.
  • when multiple computing devices are implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core, the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing processing apparatus of the present disclosure may interact with other processing apparatuses through an interface apparatus to jointly complete an operation specified by a user.
  • other processing devices of the present disclosure may include central processing units (Central Processing Unit, CPU), graphics processing units (Graphics Processing Unit, GPU), artificial intelligence processors and other general-purpose and/or special-purpose processors.
  • processors may include, but are not limited to, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing processing device of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when computing processing devices and other processing devices are considered together, the two can be viewed as forming a heterogeneous multi-core structure.
  • the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a related computing device for artificial intelligence such as neural network operations) and external data and control, performing basic controls including but not limited to data movement, and starting and/or stopping the computing device.
  • other processing apparatuses may also cooperate with the computing processing apparatus to jointly complete computing tasks.
  • the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices.
  • the computing and processing device may obtain input data from other processing devices via the interface device, and write the input data into the on-chip storage device (or memory) of the computing and processing device.
  • the computing and processing device may obtain control instructions from other processing devices via the interface device, and write them into a control cache on the computing and processing device chip.
  • the interface device can also read the data in the storage device of the computing processing device and transmit it to other processing devices.
  • the combined processing device of the present disclosure may also include a storage device.
  • the storage device is connected to the computing processing device and the other processing device, respectively.
  • a storage device may be used to store data of the computing processing device and/or the other processing device.
  • the data may be data that cannot be fully stored in an internal or on-chip storage device of a computing processing device or other processing device.
  • the present disclosure also discloses a chip (eg, chip 802 shown in FIG. 8).
  • the chip is a System on Chip (SoC) and integrates one or more combined processing devices as shown in FIG. 7 .
  • the chip can be connected with other related components through an external interface device (such as the external interface device 806 shown in FIG. 8 ).
  • the related components may be, for example, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface, other processing units (such as a video codec), and interface modules (such as a DRAM interface).
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above-mentioned chip package structure. The board will be described in detail below with reference to FIG. 8 .
  • FIG. 8 is a schematic structural diagram illustrating a board 800 according to an embodiment of the present disclosure, which may include the device 604 shown in FIG. 6 .
  • the board includes a storage device 804 for storing data, which includes one or more storage units 810 .
  • the storage device can be connected and data transferred with the control device 808 and the chip 802 described above through, for example, a bus.
  • the board also includes an external interface device 806, which is configured to provide a relay or transfer function for data between the chip (or a chip in a chip package structure) and an external device 812 (such as a server or computer, or the host shown in FIG. 6, etc.). For example, the data to be processed can be transmitted to the chip by the external device through the external interface device.
  • the calculation result of the chip may be transmitted back to the external device via the external interface device.
  • the external interface device may have different interface forms, for example, it may adopt a standard PCIE interface and the like.
  • control device in the board of the present disclosure may be configured to regulate the state of the chip.
  • control device may include a single-chip microcomputer (Micro Controller Unit, MCU) for regulating the working state of the chip.
  • an electronic device or apparatus, which may include one or more of the above-mentioned boards, one or more of the above-mentioned chips and/or one or more of the above-mentioned combined processing devices.
  • the electronic devices or apparatuses of the present disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the electronic equipment or apparatus of the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Further, the electronic device or apparatus of the present disclosure can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal scenarios.
  • the electronic device or apparatus with high computing power according to the solution of the present disclosure can be applied to cloud devices (e.g., cloud servers), while the electronic device or apparatus with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or edge device, to simulate the hardware resources of the terminal device and/or edge device, thereby completing unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solutions of the present disclosure are not limited by the order of the described actions. Accordingly, those of ordinary skill in the art, based on the disclosure or teachings of this disclosure, will appreciate that some of the steps may be performed in other orders or concurrently. Further, those skilled in the art can understand that the embodiments described in the present disclosure may be regarded as alternative embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different solutions, the present disclosure also places different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present disclosure by referring to the related descriptions of other embodiments.
  • units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units.
  • the aforementioned components or elements may be co-located or distributed over multiple network elements.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit physically exists independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include several instructions to cause a computer device (e.g., a personal computer, a server or a network device, etc.) to execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • the aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, a CD, or other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits, and the like.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), which can be, for example, a resistive random access memory (Resistive Random Access Memory, RRAM), a dynamic random access memory (Dynamic Random Access Memory, DRAM), a static random access memory (Static Random Access Memory, SRAM), an enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), a high bandwidth memory (High Bandwidth Memory, HBM), a hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
  • Clause A1. A method for obtaining hardware performance data of object code, comprising: inserting, for the object code, performance read instructions for obtaining hardware performance data; and
  • compiling the performance read instructions and object code to generate an executable program, wherein the hardware performance data related to the execution of the object code is obtained by running the executable program.
  • Clause A2 The method of Clause A1, wherein the inserting comprises:
  • the performance read instructions are automatically inserted by the compiler.
  • Clause A3. The method of clause A1 or A2, wherein the object code comprises at least one code fragment, and wherein inserting the performance read instruction by receiving user input comprises:
  • the performance read instructions are inserted for the at least one code segment based on the user input.
  • Clause A4 The method of any of clauses A1-A3, wherein the object code includes at least one function, and inserting the performance read instruction includes:
  • the performance read instruction is inserted for the at least one function according to a function-level instrumentation manner.
  • Clause A5. The method of any of clauses A1-A4, wherein the performance read instruction is an assembly instruction, and the assembly instruction includes within it one or more parameters identifying a function name of the object code.
  • Clause A6 The method according to any one of Clauses A1-A5, wherein the one or more parameters comprise values obtained by data-compressing the function name.
  • Clause A7. The method of any one of Clauses A1-A6, wherein compiling the performance read instructions and object code comprises: placing, for the object code into which the performance read instructions are inserted, performance enable instructions for enabling hardware performance counters.
  • Clause A8. The method of any one of Clauses A1-A7, wherein the compiling comprises performing a first compilation task and a second compilation task, wherein:
  • in performing the first compilation task, the object code into which the performance read instruction has been inserted is compiled to generate assembly code; and
  • in performing the second compilation task, the performance enable instruction is placed in the assembly code according to the performance read instruction in the assembly code.
  • Clause A9. The method of any one of Clauses A1-A8, wherein in performing the second compilation task, the method further comprises:
  • generating a corresponding performance read function according to the performance read instruction, so as to convert execution of the performance read instruction into invoking execution of the performance read function.
  • Clause A10. The method of any one of Clauses A1-A9, wherein during the compiling, the method further comprises:
  • generating, according to an identifier of a processor core that is to execute the executable program, a data export instruction corresponding to the identifier, for exporting the hardware performance data.
  • Clause A11. An apparatus for obtaining hardware performance data of object code, comprising: at least one processor; and
  • at least one memory for storing computer program instructions which, when executed by the at least one processor, cause the apparatus to perform the method according to any of clauses A1-A10.
  • Clause A12. A computer-readable storage medium storing computer program instructions for obtaining hardware performance data of object code which, when executed by at least one processor, implement the method according to any one of Clauses A1-A10.
  • Clause A13. A compiler for obtaining hardware performance data of object code, comprising:
  • an insertion module configured to insert performance read instructions for the object code, wherein the performance read instructions are used to obtain hardware performance data
  • a compilation module configured to compile the performance read instructions and object code to generate an executable program, wherein the execution of the executable program results in obtaining the hardware performance data related to the execution of the object code.
  • the performance read instructions are automatically inserted.
  • the performance read instructions are inserted for the at least one code segment based on the user input.
  • the performance read instruction is inserted for the at least one function according to a function-level instrumentation manner.
  • Clause A17 The compiler of any one of Clauses A13-A16, wherein the performance read instruction is an assembly instruction and the assembly instruction includes within it one or more parameters identifying a function name of the object code .
  • Clause A18 The compiler of any one of Clauses A13-A17, wherein the one or more parameters include a value obtained by data-compressing the function name.
  • Clause A19 The compiler of any one of Clauses A13-A18, wherein in compiling the performance read instructions and object code, the compilation module is configured to:
  • performance enable instructions for enabling hardware performance counters are placed for the object code into which the performance read instructions are inserted.
  • Clause A20. The compiler of any one of Clauses A13-A19, wherein the compilation module includes a front-end compilation unit for performing a first compilation task and a back-end compilation unit for performing a second compilation task, and wherein: in performing the first compilation task, the front-end compilation unit is configured to compile the object code into which the performance read instruction has been inserted to generate assembly code; and
  • in performing the second compilation task, the back-end compilation unit is configured to place the performance enable instruction in the assembly code according to the performance read instruction in the assembly code.
  • Clause A21. The compiler of any one of Clauses A13-A20, wherein in performing the second compilation task, the compilation module is further configured to: generate a corresponding performance read function according to the performance read instruction, so as to convert execution of the performance read instruction into invoking execution of the performance read function.
  • Clause A22. The compiler of any one of Clauses A13-A21, wherein during the compilation, the compilation module is further configured to:
  • generate, according to an identifier of a processor core that is to execute the executable program, a data export instruction corresponding to the identifier, for exporting the hardware performance data.
  • Clause A23. A heterogeneous system for obtaining hardware performance data of object code, comprising an interconnected host and device, wherein:
  • the host includes a compiler according to any of clauses A13-A22 and is configured to perform master control and cooperative operations on the device to obtain the hardware performance data generated when the device executes the executable program; and the device includes one or more processor cores and is configured to: execute the executable program generated by the compiler; and
  • send the hardware performance data to the host.
  • the hardware performance data is acquired from a performance monitoring unit of a processor core running the executable program.
  • the hardware performance data is transferred to the host along the transfer path.
  • Clause A26 The heterogeneous system of any of clauses A23-A25, wherein in communicating the hardware performance data to the host along the delivery path, the device is configured to:
  • the hardware performance data is passed from the special-purpose register to the general-purpose register in order to pass the hardware performance data to the host.
  • Clause A27 The heterogeneous system according to any one of Clauses A23-A26, wherein two-way information exchange is realized between the device and the host through an application programming interface.
  • Clause A28 The heterogeneous system of any of clauses A23-A27, wherein the device is configured to communicate the stored hardware performance data to the host through the application programming interface.
  • Clause A29. The heterogeneous system of any of clauses A23-A27, wherein prior to receiving the hardware performance data from the device, the host is configured to transmit, to the device through the application programming interface, storage address information and/or storage space information about the hardware performance data, and the device is configured to store the hardware performance data according to the storage address information and/or storage space information, and to communicate the stored hardware performance data to the host through the application programming interface.
  • Clause A30. The heterogeneous system of any of clauses A23-A27, wherein the host is configured to utilize the application programming interface to perform data format conversion on the hardware performance data, for subsequent processing of the hardware performance data.
  • Clause A33 An electronic device comprising a compiler according to any of clauses A13-A22 or clause A32.


Abstract

A method, device and system for obtaining hardware performance data. The device may be included in a computing processing apparatus of a combined processing apparatus, and the computing processing apparatus may include one or more data processing devices. The aforementioned combined processing apparatus may further include an interface apparatus and other processing apparatuses. The computing processing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user. The combined processing apparatus may further include a storage apparatus, which is connected to the device and the other processing apparatuses respectively and is used for storing data of the device and the other processing apparatuses. The method can effectively obtain hardware performance data related to execution of object code.

Description

Method, Device and System for Obtaining Hardware Performance Data
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to Chinese Patent Application No. 202011379859X, filed on November 30, 2020 and entitled "Method, Device and System for Obtaining Hardware Performance Data".
TECHNICAL FIELD
The present disclosure relates generally to the field of computers. More specifically, the present disclosure relates to a method, a device, a compiler, a heterogeneous system and a computer-readable storage medium for obtaining hardware performance data of object code.
BACKGROUND
In software development, performance is usually one of the metrics of greatest concern. When implementing the same functionality, different implementations can differ greatly in performance. Therefore, during programming, software developers often rely on performance analysis tools to inspect performance data of their code as it actually runs on hardware, such as cache misses or branch misses, as a reference for optimizing program performance.
At present, common performance analysis tools support obtaining hardware performance data in a sampling-based manner. However, obtaining performance data by sampling has several drawbacks. First, for code fragments that execute infrequently or for a short time, such sampling may be inaccurate. Second, such sampling cannot exploit the intermediate program representation available at the compilation stage, making it difficult to obtain performance data for specific program fragments in a targeted manner, such as performance data for a particular function call.
SUMMARY
To solve at least one or more of the above problems, the present disclosure proposes inserting instructions into object code to effectively obtain hardware performance data, so that the performance of the object code can be judged accurately and program developers can optimize the code in a targeted manner to improve its performance.
In a first aspect, the present disclosure discloses a method for obtaining hardware performance data of object code, comprising: inserting, for the object code, a performance read instruction for obtaining hardware performance data; and compiling the performance read instruction and the object code to generate an executable program, wherein the hardware performance data related to execution of the object code is obtained by running the executable program.
In a second aspect, the present disclosure discloses a device for obtaining hardware performance data of object code, comprising: at least one processor; and at least one memory storing computer program instructions which, when executed by the at least one processor, cause the device to perform the foregoing method and the embodiments thereof described later.
In a third aspect, the present disclosure discloses a computer-readable storage medium storing computer program instructions for obtaining hardware performance data of object code which, when executed by at least one processor, implement the foregoing method and the embodiments thereof described later.
In a fourth aspect, the present disclosure discloses a compiler for obtaining hardware performance data of object code, comprising: an insertion module configured to insert a performance read instruction for the object code, wherein the performance read instruction is used to obtain hardware performance data; and a compilation module configured to compile the performance read instruction and the object code to generate an executable program, wherein running the executable program causes the hardware performance data related to execution of the object code to be obtained.
In a fifth aspect, the present disclosure discloses a heterogeneous system for obtaining hardware performance data of object code, comprising an interconnected host and device, wherein: the host includes the foregoing compiler and is configured to perform master control and cooperative operations on the device to obtain the hardware performance data generated when the device executes the executable program; and the device includes one or more processor cores and is configured to: execute the executable program generated by the compiler; and send the hardware performance data to the host.
In a sixth aspect, the present disclosure discloses a compiler configured to perform the above method and the embodiments thereof described later.
In a seventh aspect, the present disclosure discloses a board card comprising the above heterogeneous system and the embodiments thereof described later.
By utilizing the solutions described above, software developers can perform performance testing on any desired object code, so the solutions of the present disclosure provide users with greater freedom and flexibility. For example, with the solutions of the present disclosure, a user can inspect any code fragment without the collected hardware performance data becoming inaccurate due to a fragment's low execution frequency or short execution time. In some application scenarios, the solutions of the present disclosure support both manual and automatic insertion of performance read instructions as well as function-level insertion, thereby providing users with more testing approaches. On this basis, a user can either inspect the hardware performance data of any code fragment of interest, or obtain function-level hardware performance data through compile options at the compilation stage. In some scenarios, the solutions of the present disclosure can also be extended to obtaining hardware performance data of specific operations such as data copying.
BRIEF DESCRIPTION OF THE DRAWINGS
The above features of the present invention can be better understood, and its numerous objects, features and advantages will be apparent to those skilled in the art, by reference to the accompanying drawings, in which identical reference numerals denote identical elements, and in which:
FIG. 1 is a simplified flowchart illustrating a method for obtaining hardware performance data according to an embodiment of the present disclosure;
FIG. 2 is a detailed flowchart illustrating a method for obtaining hardware performance data according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a back-end compilation process according to an embodiment of the present disclosure;
FIG. 4 is a schematic block diagram illustrating a compiler for obtaining hardware performance data according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating invocation of a performance read function according to an embodiment of the present disclosure;
FIG. 6 is a schematic architecture diagram illustrating a heterogeneous system for obtaining hardware performance data according to an embodiment of the present disclosure;
FIG. 7 is a structural diagram illustrating a combined processing apparatus according to an embodiment of the present disclosure; and
FIG. 8 is a schematic structural diagram illustrating a board card according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
To address at least the problem in the prior art that obtaining hardware performance data is extremely inefficient, the solution of the present disclosure proposes inserting a performance read instruction into object code (also called a user program to be tested) and compiling the object code together with the performance read instruction, so as to obtain an executable program for accurately obtaining hardware performance data. In one implementation scenario, when the solution of the present disclosure is applied to a heterogeneous system composed of a host and a device, by performing the compilation process on the host side to obtain the above executable program and executing that executable program on the device side, the solution of the present disclosure can accurately obtain the relevant hardware performance data of the object code as it runs on the device side. Thus, software developers can make accurate judgments about the performance of the object code and, when necessary, further optimize the program.
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
FIG. 1 is a simplified flowchart illustrating a method 100 for obtaining hardware performance data according to an embodiment of the present disclosure. As shown in FIG. 1, at step S102, a performance read instruction for obtaining hardware performance data is inserted for object code. Depending on the application scenario, the object code of the present disclosure may be a program fragment scheduled for performance testing or a code fragment of interest to the user. In one embodiment, the hardware performance data of the present disclosure may include hardware-related data of program execution, such as cache misses or branch misses, obtained from multiple count registers of a special-purpose processor (e.g., a graphics processor GPU or an artificial intelligence processor) or a general-purpose processor (e.g., a general-purpose CPU). Based on obtaining such data, a program developer can collect information about certain events during execution of the program code (such as the aforementioned various misses or CPU clocks), so that the developer can judge how the code runs and make adjustments and optimizations accordingly.
In one embodiment, the inserting in step S102 may be performed by receiving user input. In this insertion mode, the solution of the present disclosure can insert the performance read instruction for at least one code fragment (e.g., a code fragment of interest to the user) according to the user input. Unlike this manual insertion mode, in another embodiment the inserting in step S102 may also be performed automatically by a compiler (described later in detail in conjunction with FIG. 4). Depending on the application scenario, the foregoing operation of inserting the performance read instruction can also be accomplished by function-level instrumentation. In other words, the insertion operation of the present disclosure may insert the performance read instruction for at least one function, so as to obtain the corresponding hardware performance data generated when one or more functions execute on hardware. Further, in some scenarios, options for manual and automatic insertion can be provided, so that the user can choose manual or automatic insertion according to preference.
In one embodiment, the performance read instruction of the present disclosure may be an assembly instruction, and the assembly instruction includes within it one or more parameters identifying the function name of the object code. In one application scenario, the performance read instruction of the present disclosure may be expressed as "readperf.begin/end", where the object code may lie between the start instruction "readperf.begin" and the end instruction "readperf.end". In other words, "readperf.begin" and "readperf.end" may be inserted at the start and end of the object code respectively, so that hardware performance data associated with the object code is obtained when the performance read instructions are subsequently executed. Depending on the number of operation bits supported by the hardware, the parameters included in the performance read instruction to identify the function name of the object code may have different bit widths. For example, when the hardware architecture supports 64-bit integer operations, the function name can be represented by one 64-bit integer parameter. Conversely, when the hardware architecture does not support 64-bit operations, the function name can be represented by two 32-bit integer parameters. In one implementation scenario, the function name can also be compressed by a compression algorithm (e.g., a hash algorithm) to obtain the aforementioned one or more parameters representing the function name.
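To make the marker format concrete, the following is a small sketch of wrapping a fragment with "readperf.begin/end" and compressing the function name into instruction parameters. The choice of blake2b and the hexadecimal parameter syntax are illustrative assumptions; the disclosure only specifies that a compression such as a hash algorithm may be used and that one 64-bit or two 32-bit parameters carry the name.

```python
import hashlib

def name_params(func_name: str, word_bits: int = 64) -> list:
    """Compress a function name into instruction parameters: one 64-bit
    value, or two 32-bit halves when the ISA lacks 64-bit operations.
    (blake2b is an illustrative choice of hash.)"""
    digest = hashlib.blake2b(func_name.encode(), digest_size=8).digest()
    value = int.from_bytes(digest, "little")
    if word_bits == 64:
        return [value]
    return [value & 0xFFFFFFFF, value >> 32]  # low word, high word

def instrument(fragment: str, func_name: str) -> str:
    """Wrap a code fragment with readperf.begin/readperf.end markers."""
    params = ", ".join(hex(p) for p in name_params(func_name))
    return (f"readperf.begin {params}\n"
            f"{fragment}\n"
            f"readperf.end {params}")

asm = instrument("    call my_kernel", "my_kernel")
```

Splitting the 64-bit value into two 32-bit words keeps the same identifier usable on both architectures, which is why the high/low halves must reassemble to the original value.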
After the insertion operation of step S102 is completed, the method proceeds to step S104. At this step, method 100 compiles the performance read instruction and the object code to generate an executable program, and by running the executable program, hardware performance data related to execution of the object code can be obtained. In one implementation scenario, after compiling the performance read instruction and the object code, an object file can be generated, which includes binary code directly recognizable by the processor. The manner of generating the object file may differ depending on the compiler chosen, so the solution of the present disclosure places no restriction on the compiler. After the object file is generated, it can be linked, i.e., the code in the object file is connected to form an executable file executed by the processor.
Regarding the compilation process mentioned above, in one embodiment, a performance enable instruction for enabling hardware performance counters may be placed for the object code into which the performance read instruction has been inserted. Through this performance enable instruction, the relevant performance counters in the processor can be enabled before the object code executes, so as to count relevant events. For example, hardware performance data can be obtained from special-purpose registers (acting as counters) of the performance monitoring unit of the processor core. Depending on the scenario, the aforementioned one or more special-purpose registers may be used to count and/or accumulate events occurring within the processor core (such as cache or branch hits or misses).
In one embodiment, the compilation process of the present disclosure includes performing a first compilation task and a second compilation task. In the first compilation task, the object code with the performance read instruction inserted may be compiled to generate assembly code. Then, in the second compilation task, the assembly code is compiled into an object file, and the aforementioned performance enable instruction may be placed in the assembly code according to the performance read instruction in the assembly code. Additionally, in the second compilation task, a corresponding performance read function may be generated according to the performance read instruction, so as to convert execution of the performance read instruction into execution of a call to the performance read function. For example, a performance read function .readperf region can be generated for the aforementioned "readperf.begin/end", so that the performance data read operation can be performed by calling this function.
In some application scenarios, when the object code needs to be tested on one or more kinds of processor cores to obtain relevant hardware performance data, the method of the present disclosure may further include generating, according to an identifier of the processor core that is to execute the executable program, a data export instruction corresponding to the identifier, for exporting the hardware performance data. In one embodiment, when the solution of the present disclosure is applied to a device supporting heterogeneity, for example when the device has an intelligence processing unit ("IPU") and a memory processing unit ("MPU"), the data export instruction allows the hardware performance data to be transferred outward from the device, e.g., to the host, along different data transfer paths.
The method of the present disclosure for obtaining hardware performance data of object code has been described above in detail with reference to FIG. 1. A method 200, which can serve as a specific implementation of the method 100 shown in FIG. 1, will be described in detail below with reference to FIG. 2.
FIG. 2 is a detailed flowchart illustrating a method 200 for obtaining hardware performance data according to an embodiment of the present disclosure. Since method 200 can be regarded as an implementation of the method 100 shown in FIG. 1, the description made in conjunction with FIG. 1 also applies to method 200, and the same technical content will not be repeated below.
As shown in FIG. 2, at step S202, method 200 may receive a user program written or input by a user, where the user program may include object code as described above, which includes one or more code fragments of interest to the user. Next, at step S204, method 200 may determine whether to perform a function-level automatic instrumentation operation on the object code in the user program. As known to those skilled in the art, in order to analyze program state and performance, some code can be inserted or modified at certain positions in the object code (i.e., an "instrumentation" operation), so that certain program states are obtained and analyzed while the object code runs. Further, function-level instrumentation inserts or modifies code within functions. According to the solution of the present disclosure, kernel-function ("kernel") level hardware performance data can be obtained through the instrumentation operation.
When it is determined at step S204 to perform function-level automatic instrumentation, then at step S206 method 200 may automatically insert the performance read instruction of the present disclosure, for example by automatically inserting the assembly instruction "readperf". Alternatively, when it is determined at step S204 not to perform function-level automatic instrumentation, then at step S208 method 200 may receive the user's manual input to insert the performance read instruction into the object code. In one embodiment, this automatic or manual input can be implemented by setting compile options of the compiler. When manual input is selected, the user can choose to insert performance read instructions before and after the object code of interest, so that hardware performance data of a specific fragment can be obtained.
After the performance read instruction is inserted, method 200 proceeds to step S210. At this step, method 200 may perform a front-end compilation operation, i.e., the first compilation task described in conjunction with FIG. 1. After the front-end compilation operation, assembly code can be obtained at step S212. Next, method 200 performs a back-end compilation operation at step S214, i.e., the second compilation task described in conjunction with FIG. 1. After the back-end compilation operation, an object file including an instruction segment and a data segment can be generated at step S216. After the object file is obtained, then at step S218, method 200 may perform a linking operation to collect and combine different portions of code and data into one entity, for example called a load module or an executable file, which can be directly executed by, for example, the operating system on the device. Finally, through the execution of method 200, an executable program can be obtained at step S220. As described above, by executing this executable program, the solution of the present disclosure can obtain hardware performance data related to execution of the object code.
The solution of the present disclosure for obtaining hardware performance data has been described above in detail in conjunction with FIGS. 1 and 2. Using the methods shown in FIGS. 1 and 2, a user can inspect any code fragment of interest, so the obtained hardware performance data will not be inaccurate because a code fragment executes infrequently or for a short time. In other words, the present solution for obtaining hardware performance data provides greater freedom and flexibility in use. In addition, the solution of the present disclosure also supports effectively obtaining function-level hardware performance data at the compilation stage through the selection of compile options. In one application scenario, when the performance read instruction of the present disclosure is inserted into a code fragment used for data copying, hardware performance data related to the data copy operation can also be obtained.
To better understand the back-end compilation operation described in conjunction with FIGS. 1 and 2, the back-end compilation process will be described in detail below with reference to FIG. 3.
FIG. 3 is a flowchart illustrating a back-end compilation process 300 according to an embodiment of the present disclosure. It can be understood that the back-end compilation process 300 is a further refinement of the second compilation task described in FIG. 1 and step S214 shown in FIG. 2, so the foregoing descriptions of the second compilation task and step S214 also apply to the back-end compilation process 300.
As shown in FIG. 3, at step S302, process 300 obtains assembly code, for example the assembly code output at step S212 of FIG. 2. Next, at step S304, process 300 may determine whether the assembly code includes the performance read instruction of the present disclosure, for example the assembly instruction readperf. When it is determined at step S304 that a performance read instruction exists, a performance enable instruction "perf_start" can be inserted, which is used to enable one or more hardware performance counters ("Performance Monitoring Counter", PMC) on the hardware performance monitoring unit ("Performance Monitoring Unit", PMU).
Next, at step S308, process 300 may also generate a performance read function "readperf", which can be regarded as a built-in function of the compiler, so that at step S310 process 300 can reduce the performance read instruction "readperf" in the assembly code to a call to the performance read function, which can be expressed as "call readperf" as shown in FIG. 3. In one embodiment, the performance read function "readperf" of the present disclosure may contain binary instructions for obtaining hardware performance-related data during program execution, such as binary instructions for obtaining cache misses or branch misses. By calling this built-in performance read function, the solution of the present disclosure can avoid repeatedly compiling the same code, thereby improving compilation efficiency. Finally, after the compilation operation, an object file can be obtained at step S312, i.e., the object file generated at step S216 in FIG. 2. When it is determined at step S304 that the performance read instruction of the present disclosure does not exist in the assembly code, a conventional compilation operation can be performed on the assembly code, so that conventional object code, i.e., object code without the function of obtaining hardware performance data, is obtained at step S312.
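The back-end pass just described (place a "perf_start" enable when a readperf instruction is found, then reduce the instruction itself to a call of the built-in read function) can be modeled in a few lines. The assembly mnemonics and formatting below are illustrative only.

```python
def backend_pass(asm_lines: list) -> list:
    """Toy model of the second compilation task: on finding a readperf
    instruction, place perf_start ahead of it (once), and rewrite the
    instruction as a call to the built-in performance read function."""
    out, enabled = [], False
    for line in asm_lines:
        if line.strip().startswith("readperf"):
            if not enabled:
                out.append("    perf_start        # enable PMU counters")
                enabled = True
            out.append("    call readperf     # was: " + line.strip())
        else:
            out.append(line)
    return out

result = backend_pass(["    mov r0, r1",
                       "    readperf.begin 0x1",
                       "    add r0, r0, r2",
                       "    readperf.end 0x1"])
```

A real back end would of course emit machine encodings rather than text, but the control flow, scan for the marker, enable once, rewrite each marker as a call, is the same.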
The compilation process of the present disclosure has been described in detail above in conjunction with FIGS. 1-3. It can be understood that the compilation process of the present disclosure (including at least instruction insertion and compilation, up to generation of the executable file) can be implemented by a compiler. To this end, a compiler capable of executing the solution of the present disclosure will be described below with reference to FIG. 4.
FIG. 4 is a schematic block diagram illustrating a compiler 400 for obtaining hardware performance data according to an embodiment of the present disclosure. As shown in FIG. 4, the compiler 400 of the present disclosure includes an insertion module 402 configured to insert a performance read instruction for object code, where the performance read instruction is used to obtain hardware performance data. Further, the compiler 400 also includes a compilation module 404 configured to compile the performance read instruction and the object code to generate an executable program, where running the executable program causes hardware performance data related to execution of the object code to be obtained. In one embodiment, the compilation module 404 may include a front-end compilation unit 404-1 and a back-end compilation unit 404-2, where the front-end compilation unit may be configured to perform the first compilation task as described in conjunction with FIGS. 1 and 2, and the back-end compilation unit may be configured to perform the second compilation task as described in conjunction with FIGS. 1, 2 and 3.
From the above description, those skilled in the art can understand that the compiler 400 of the present disclosure can be configured to perform the method steps described above in conjunction with FIGS. 1-3, thereby generating an executable program that can be used to obtain hardware performance data. Since the foregoing descriptions of FIGS. 1-3 also apply to the operations performed by the compiler, the operations of the compiler 400 are not repeated here.
Further, the above structural block diagram of the compiler 400 of the present disclosure is merely exemplary rather than limiting, and those skilled in the art may, based on the teachings of the present disclosure, conceive of other structures for implementation. For example, in one embodiment, the compiler of the present disclosure may be directly implemented as a front-end compilation unit (or module) and a back-end compilation unit (or module), where the front-end compilation unit may be used to receive automatically or manually inserted performance read instructions and compile the object code into assembly code. Correspondingly, the back-end compilation unit may be used to compile the assembly code output by the front-end compilation unit to generate an object file, which is then linked to generate the executable program of the present disclosure for obtaining hardware performance data. In short, the compiler disclosed herein can be configured to perform the operations described in conjunction with FIGS. 1-4 to generate the executable program.
Fig. 5 is a flowchart illustrating a process 500 for calling the performance-read function according to an embodiment of the present disclosure. As noted above, the call to the built-in performance-read function may be performed by the compiler (or the aforementioned back-end compilation unit) during the second compilation task, thereby implementing the process 500 shown here.
As shown in Fig. 5, the process starts at step S502, and at step S504, when the performance-read instruction includes a hash value of a function name, the corresponding hash value is written (or recorded) into the performance-read function serving as the built-in function. If no corresponding hash value exists, zero is written into the performance-read function instead. As noted above, identifying the function under test by a corresponding hash value effectively reduces the data overhead of carrying function names. When there is no hash value, the operation can be regarded as targeting a non-function-level code fragment in order to obtain the associated hardware performance data. It will be appreciated that step S504 is optional, and in some application scenarios the function hash value need not be recorded.
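The step of recording a function-name hash (or zero when no function name applies) can be sketched as follows. The disclosure does not fix a particular hash algorithm; the 32-bit FNV-1a hash below is an assumption chosen only to illustrate mapping names to a compact fixed-width tag:

```python
def function_name_tag(name):
    """Map a function name to a compact fixed-width tag.
    A missing name (non-function-level code fragment) is recorded as 0,
    mirroring step S504.  FNV-1a is an illustrative choice of hash."""
    if not name:
        return 0
    h = 0x811C9DC5                       # FNV-1a 32-bit offset basis
    for byte in name.encode("utf-8"):
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h
```

Any fixed-width hash with a low collision rate would serve the same purpose: the runtime records only the tag, and the host maps tags back to names when presenting results.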
Next, at step S506, the identifier (ID) (or type) of the processor core is determined, for example by judging, as shown in the figure, whether the processor core that will execute the executable program is an MPU. Based on this judgment, a data-export instruction corresponding to the identifier can be generated, i.e., an MPU data-export instruction corresponding to an MPU and an IPU data-export instruction corresponding to an IPU, where the different data-export instructions indicate different ways of migrating the hardware performance data from the device to the host.
Specifically, when the determination at step S506 is MPU, the compiler generates an MPU data-export instruction enabling the operations shown in steps S508, S510 and S512. As shown, when the MPU data-export instruction is executed on the device side, the hardware performance data may be obtained at step S508 from the special-purpose register ("spr") of the MPU's PMU and written (or passed) from the special-purpose register into a general-purpose register ("gpr"). Next, at steps S510 and S512, the hardware performance data may be written from the gpr into static random-access memory ("sram") and from there into global dynamic random-access memory ("gdram"), which may be a type of double-data-rate synchronous dynamic random-access memory ("DDR SDRAM").
Similarly, when it is determined at step S506 that the processor-core identifier is not MPU (i.e., it is an IPU), the compiler generates an IPU data-export instruction enabling the operations shown in steps S516, S518 and S520. As shown, when the IPU data-export instruction is executed on the device side, the hardware performance data may be obtained at step S516 from the special-purpose register ("spr") of the IPU's PMU and written (or passed) from the special-purpose register into a general-purpose register ("gpr"). Next, at steps S518 and S520, the hardware performance data may be written from the gpr into non-volatile random-access memory ("nram") and from there into global dynamic random-access memory ("gdram").
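The two export routes differ only in the on-chip staging buffer (sram for MPU, nram for IPU); both start at the PMU's special-purpose register and end at host-visible gdram. A minimal sketch of that selection, with the path represented simply as a list of memory names from the figure:

```python
def export_path(core_type):
    """Return the memory path along which performance data is staged on
    its way to host-visible GDRAM, per steps S508-S512 / S516-S520:
    spr -> gpr -> per-core staging buffer -> gdram."""
    staging = {"MPU": "sram", "IPU": "nram"}
    if core_type not in staging:
        raise ValueError("unknown core type: %r" % core_type)
    return ["spr", "gpr", staging[core_type], "gdram"]
```

The compiler's role at step S506 is precisely this dispatch: it bakes the core-appropriate route into the generated data-export instruction.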
After the performance-read function has been called and the corresponding data-export instruction has been generated, process 500 ends at step S514. It can be seen that, when running on the device side, the disclosed executable program can obtain hardware performance data from the device via different data-migration routes, based on the data-export instruction obtained by compiling the above performance-read function. In some application scenarios, the disclosed scheme can obtain the PMU data of a single piece of hardware at a given moment and present it to the user through a function interface with a return value. In other scenarios, the disclosed scheme can obtain the PMU data of all hardware; in that case, it can be implemented with function stubs without return values, where the hardware performance data is obtained via files generated at runtime.
Fig. 6 is a schematic architecture diagram of a heterogeneous system 600 for obtaining hardware performance data according to an embodiment of the present disclosure. As shown in Fig. 6, the heterogeneous system 600 includes an interconnected host 602 and device 604. In one embodiment, the host is configured to perform master-control and cooperative operations on the device so as to obtain the hardware performance data produced when the device executes the disclosed executable program. As shown, the two interact through a bidirectional transmission path 606. More specifically, they may interact regarding hardware performance data via the transmission path, through the runtime_api module shown in the figure and a driver (e.g. "driver_api", not shown). Optionally, they may also interact regarding hardware performance data through the driver (e.g. "driver_api", not shown) alone.
According to the disclosed scheme, host 602 may include the compiler described above and be configured to perform the compilation operations described in connection with Figs. 1-5, thereby generating said executable program. In one embodiment, the device includes one or more processor cores, such as the IPU 610 and MPU 608 shown in the figure.
In the operation of obtaining hardware performance data, the host may cause the device to execute the executable program and, before or during execution, provide the device side via the runtime (e.g. the runtime_api module in the figure) with the address of the target GDRAM in which the hardware performance data is to be stored (i.e., storage-address information) and/or the size of the storage space for the hardware performance data (i.e., storage-space information). Thereafter, a driver running on the host (e.g. "driver_api", not shown) may determine the actual storage location of the hardware performance data in GDRAM based on the aforementioned target GDRAM address, storage-space size and related information. Meanwhile, the driver may copy the compiled executable program onto the device so that the device can execute it. When the device executes the executable program, it carries out the MPU or IPU data-export instructions discussed in connection with Fig. 5, exporting the obtained hardware performance data and transferring it to the aforementioned GDRAM address. Thereafter, the host may obtain the hardware performance data from that GDRAM address through "driver_api" and the runtime_api module.
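The host-side sequence just described (pass the GDRAM address and size, copy the program over, run it, read the exported data back) can be sketched as below. The device object and its method names are stand-ins invented for illustration; they are not the actual runtime_api/driver_api interfaces, which the disclosure does not specify at this level:

```python
class InMemoryDevice:
    """Minimal in-memory stand-in for the device side, for illustration."""
    def __init__(self):
        self.gdram = {}
        self.buf = None
        self.program = None
    def set_perf_buffer(self, addr, size):
        self.buf = (addr, size)           # storage-address / storage-space info
    def load(self, program):
        self.program = program            # driver copies executable to device
    def run(self):
        # pretend the data-export instructions wrote counter values to gdram
        addr, size = self.buf
        self.gdram[addr] = [0] * size
        self.gdram[addr][0] = 42          # e.g. a cache-miss count
    def read_gdram(self, addr, size):
        return self.gdram[addr][:size]

def collect_perf_data(device, program, gdram_addr, size):
    """Host-side orchestration sketch of the flow around Fig. 6."""
    device.set_perf_buffer(gdram_addr, size)    # runtime: address/size info
    device.load(program)                        # driver: copy executable over
    device.run()                                # device executes, exports data
    return device.read_gdram(gdram_addr, size)  # host reads data from GDRAM
```

The essential contract is that the host chooses where the data lands before execution, so it knows exactly where to read afterwards.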
Depending on the application scenario, the present disclosure further proposes various conversions of the obtained hardware performance data for subsequent processing of that data. For example, the runtime_api module may convert the collected hardware performance data into a binary file. As another example, the host may use a further parsing module (shown as "cnperf-cli" in the figure) to parse the aforementioned binary file into a .csv-format file for subsequent processing and/or visualization.
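The binary-to-.csv parsing step can be sketched as below. The record layout (one packed little-endian unsigned 64-bit value per counter, records laid end to end) is an assumption for illustration; the disclosure does not specify the binary format that runtime_api or cnperf-cli actually uses:

```python
import csv
import io
import struct

def perf_records_to_csv(blob, field_names):
    """Parse a flat binary blob of fixed-width counter records into CSV
    text: one header row of counter names, then one row per record."""
    n = len(field_names)
    rec_size = 8 * n                      # assumed: one u64 per counter
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(field_names)
    for off in range(0, len(blob), rec_size):
        writer.writerow(struct.unpack_from("<%dQ" % n, blob, off))
    return out.getvalue()
```

A tool like the "cnperf-cli" module named in the figure would additionally know the real record schema and write the result to a file rather than a string.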
In other embodiments, the disclosed method may also insert the performance-read instruction into the executable program generated after compilation, in which case the inserted performance-read instruction may be a binary instruction that the device can execute.
Optionally, a driver running on the host may obtain the compiled executable program, which contains at least one code fragment. The driver may insert into the executable program a binary instruction for obtaining performance data (i.e., a performance-read instruction); the insertion positions may be the start and end of a target code fragment, or the positions before and after a particular function, without specific limitation here. The driver may then copy the new executable program, into which the performance-read instructions have been inserted, to the device, so that the device executes the new executable program and obtains the corresponding performance data. Further optionally, the compiled executable program contains at least one no-operation instruction (NOP instruction), which may be located at the start and end of a target code fragment, or before and after a particular function. In this way, when the driver inserts performance-read instructions into the executable program, it can replace the NOP instructions at those specific positions with performance-read instructions.
Optionally, a user may also insert at least one performance-read instruction into the compiled executable program. The executable program contains at least one NOP instruction, which may be located at the start and end of a target code fragment, or before and after a particular function. The user may replace the NOP instructions around a code fragment or particular function of interest with binary performance-read instructions, obtaining a new executable program. The new executable program may then be delivered to the device via the driver; thereafter the device executes the new executable program and obtains the corresponding performance data.
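The NOP-replacement mechanism described above can be sketched as a simple byte-patching step. The opcode values below are placeholders: real encodings are device-specific and not given by the disclosure, and real instructions would typically span more than one byte:

```python
NOP = 0x00        # placeholder opcode for the reserved NOP slot
READPERF = 0x9F   # placeholder opcode for the binary performance-read instruction

def patch_nops(image, positions):
    """Replace reserved NOP slots at the given byte offsets of an
    executable image with the binary performance-read instruction,
    refusing to overwrite anything that is not a NOP slot."""
    out = bytearray(image)
    for pos in positions:
        if out[pos] != NOP:
            raise ValueError("offset %d does not hold a NOP slot" % pos)
        out[pos] = READPERF
    return bytes(out)
```

Reserving NOP slots at compile time is what makes post-compilation insertion safe: the patch never changes the size or layout of the executable, only the contents of slots set aside for it.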
It will be appreciated that the process of reading the performance data is consistent with the reading process in the above embodiments; for details, see Fig. 6 and the related description above, which is not repeated here.
Fig. 7 is a structural diagram of a combined processing apparatus according to an embodiment of the present disclosure. The combined processing apparatus may be implemented on the device of the present disclosure described above. As shown in Fig. 7, the combined processing apparatus 700 includes a computing processing apparatus 702, an interface apparatus 704, another processing apparatus 706 and a storage apparatus 708. Depending on the application scenario, the computing processing apparatus may include one or more computing apparatuses 710, which may be configured to perform various computing operations, for example the various operations involved in machine learning in the field of artificial intelligence.
In different embodiments, the disclosed computing processing apparatus may be configured to perform user-specified operations. In exemplary applications, the computing processing apparatus may be implemented as a single-core artificial-intelligence processor or a multi-core artificial-intelligence processor, including the IPU described above. Similarly, the one or more computing apparatuses included in the computing processing apparatus may be implemented as an artificial-intelligence processor core or as part of the hardware structure of such a core. When multiple computing apparatuses are implemented as artificial-intelligence processor cores or parts of their hardware structure, the disclosed computing processing apparatus may, considered on its own, be regarded as having a single-core structure or a homogeneous multi-core structure.
In exemplary operation, the disclosed computing processing apparatus may interact with the other processing apparatus through the interface apparatus to jointly complete user-specified operations. Depending on the implementation, the other processing apparatus of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU) or an artificial-intelligence processor. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices and discrete hardware components, and their number may be determined according to actual needs. As noted above, considered on its own, the disclosed computing processing apparatus may be regarded as having a single-core structure or a homogeneous multi-core structure; however, when the computing processing apparatus and the other processing apparatus are considered together, the two may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing apparatus may serve as the interface between the disclosed computing processing apparatus (which may be embodied as an apparatus for artificial-intelligence operations such as neural-network operations) and external data and control, performing basic control including, but not limited to, data movement and starting and/or stopping the computing apparatus. In other embodiments, the other processing apparatus may also cooperate with the computing processing apparatus to jointly complete computational tasks.
In one or more embodiments, the interface apparatus may be used to transfer data and control instructions between the computing processing apparatus and the other processing apparatus. For example, the computing processing apparatus may obtain input data from the other processing apparatus via the interface apparatus and write it into an on-chip storage apparatus (or memory) of the computing processing apparatus. Further, the computing processing apparatus may obtain control instructions from the other processing apparatus via the interface apparatus and write them into an on-chip control cache of the computing processing apparatus. Alternatively or optionally, the interface apparatus may also read data from the storage apparatus of the computing processing apparatus and transfer it to the other processing apparatus.
Additionally or optionally, the disclosed combined processing apparatus may further include a storage apparatus. As shown in the figure, the storage apparatus is connected to the computing processing apparatus and the other processing apparatus respectively. In one or more embodiments, the storage apparatus may be used to store data of the computing processing apparatus and/or the other processing apparatus, for example data that cannot be fully held in the internal or on-chip storage of the computing processing apparatus or the other processing apparatus.
In some embodiments, the present disclosure also discloses a chip (e.g. chip 802 shown in Fig. 8). In one implementation, the chip is a system-on-chip (SoC) integrating one or more combined processing apparatuses as shown in Fig. 7. The chip may be connected to other related components through an external interface apparatus (e.g. external interface apparatus 806 shown in Fig. 8). Such related components may be, for example, a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface. In some application scenarios, other processing units (e.g. a video codec) and/or interface modules (e.g. a DRAM interface) may be integrated on the chip. In some embodiments, the present disclosure also discloses a chip package structure including the above chip. In some embodiments, the present disclosure also discloses a board card including the above chip package structure. The board card is described in detail below with reference to Fig. 8.
Fig. 8 is a schematic structural diagram of a board card 800 according to an embodiment of the present disclosure, which may include the device 604 shown in Fig. 6. As shown in Fig. 8, the board card includes a storage device 804 for storing data, which includes one or more storage units 810. The storage device may be connected to, and exchange data with, a control device 808 and the chip 802 described above via, for example, a bus. Further, the board card includes an external interface apparatus 806 configured for data relay or forwarding between the chip (or the chip in the chip package structure) and an external device 812 (e.g. a server or computer, or the host shown in Fig. 6). For example, data to be processed may be passed from the external device to the chip through the external interface apparatus. As another example, computation results of the chip may be transferred back to the external device via the external interface apparatus. Depending on the application scenario, the external interface apparatus may take different interface forms; for example, it may use a standard PCIe interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a microcontroller unit (MCU) for regulating the working state of the chip.
From the above description in connection with Figs. 7 and 8, those skilled in the art will understand that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above board cards, one or more of the above chips and/or one or more of the above combined processing apparatuses.
Depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a vehicle, a household appliance and/or a medical device. The vehicle includes an aircraft, a ship and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical device includes a nuclear magnetic resonance scanner, a B-mode ultrasound scanner and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and healthcare. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge and terminal application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, electronic devices or apparatuses with high computing power according to the disclosed scheme may be applied to cloud devices (e.g. cloud servers), while electronic devices or apparatuses with low power consumption may be applied to terminal devices and/or edge devices (e.g. smartphones or webcams). In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or edge device are mutually compatible, so that, based on the hardware information of the terminal device and/or edge device, suitable hardware resources can be matched from the hardware resources of the cloud device to simulate those of the terminal device and/or edge device, thereby achieving unified management, scheduling and collaborative work across the terminal-cloud or cloud-edge-terminal continuum.
It should be noted that, for brevity, the present disclosure describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the disclosed scheme is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching herein, those skilled in the art will understand that certain steps therein may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described herein may be regarded as optional embodiments, i.e., the actions or modules involved are not necessarily required for implementing one or more of the disclosed schemes. In addition, depending on the scheme, the descriptions of different embodiments herein have different emphases. In view of this, those skilled in the art will understand that, for parts not detailed in one embodiment of the present disclosure, reference may also be made to the related descriptions of other embodiments.
Regarding specific implementation, based on the disclosure and teaching herein, those skilled in the art will understand that several embodiments disclosed herein may also be implemented in other ways not disclosed in this document. For example, the units in the electronic device or apparatus embodiments described above are divided herein on the basis of logical function, while other divisions are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connection relationships between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the schemes described in the disclosed embodiments. Furthermore, in some scenarios, multiple units in the disclosed embodiments may be integrated into one unit, or each unit may exist physically on its own.
In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented as a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the disclosed scheme is embodied in the form of a software product (e.g. a computer-readable storage medium), the software product may be stored in a memory and may include several instructions causing a computer device (e.g. a personal computer, a server or a network device) to perform some or all of the steps of the methods described in the disclosed embodiments. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a flash disk, a read-only memory (ROM), a random-access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
In some other implementation scenarios, the above integrated units may also be implemented in the form of hardware, i.e., as concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of the circuits may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (e.g. computing apparatuses or other processing apparatuses) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs and ASICs. Further, the aforementioned storage units or storage apparatuses may be any appropriate storage medium (including magnetic or magneto-optical storage media, etc.), for example resistive random-access memory (RRAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random-access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), ROM and RAM.
The foregoing may be better understood in light of the following clauses:
Clause A1. A method for obtaining hardware performance data of target code, comprising:
inserting, for the target code, a performance-read instruction for obtaining hardware performance data; and
compiling the performance-read instruction and the target code to generate an executable program, wherein the hardware performance data related to execution of the target code is obtained by running the executable program.
Clause A2. The method of clause A1, wherein the inserting comprises:
inserting the performance-read instruction by receiving user input; or
inserting the performance-read instruction automatically by a compiler.
Clause A3. The method of clause A1 or A2, wherein the target code comprises at least one code fragment, and wherein inserting the performance-read instruction by receiving user input comprises:
inserting the performance-read instruction for the at least one code fragment according to the user input.
Clause A4. The method of any of clauses A1-A3, wherein the target code comprises at least one function, and inserting the performance-read instruction comprises:
inserting the performance-read instruction for the at least one function by way of function-level instrumentation.
Clause A5. The method of any of clauses A1-A4, wherein the performance-read instruction is an assembly instruction that includes one or more parameters identifying a function name of the target code.
Clause A6. The method of any of clauses A1-A5, wherein the one or more parameters include a value obtained by data-compressing the function name.
Clause A7. The method of any of clauses A1-A6, wherein compiling the performance-read instruction and the target code comprises:
during compilation, placing, for the target code into which the performance-read instruction has been inserted, a performance-enable instruction for enabling a hardware performance counter.
Clause A8. The method of any of clauses A1-A7, wherein during the compilation the method comprises performing a first compilation task and a second compilation task, wherein:
in performing the first compilation task, the method comprises compiling the target code with the performance-read instruction inserted, to generate assembly code; and
in performing the second compilation task, the method comprises placing the performance-enable instruction in the assembly code according to the performance-read instruction in the assembly code.
Clause A9. The method of any of clauses A1-A8, wherein in performing the second compilation task, the method further comprises:
generating a corresponding performance-read function according to the performance-read instruction, so as to convert execution of the performance-read instruction into a call to the performance-read function.
Clause A10. The method of any of clauses A1-A9, wherein during compilation the method further comprises:
generating, according to an identifier of a processor core that is to execute the executable program, a data-export instruction corresponding to the identifier, for exporting the hardware performance data.
Clause A11. A device for obtaining hardware performance data of target code, comprising:
at least one processor; and
at least one memory storing computer program instructions which, when executed by the at least one processor, cause the device to perform the method of any of clauses A1-A10.
Clause A12. A computer-readable storage medium storing computer program instructions for obtaining hardware performance data of target code, which, when executed by at least one processor, implement the method of any of clauses A1-A10.
Clause A13. A compiler for obtaining hardware performance data of target code, comprising:
an insertion module configured to insert, for the target code, a performance-read instruction, wherein the performance-read instruction is used to obtain hardware performance data; and
a compilation module configured to compile the performance-read instruction and the target code to generate an executable program, wherein execution of the executable program causes the hardware performance data related to execution of the target code to be obtained.
Clause A14. The compiler of clause A13, wherein the insertion module is configured to:
receive user input to insert the performance-read instruction; or
insert the performance-read instruction automatically.
Clause A15. The compiler of clause A13 or A14, wherein the target code comprises at least one code fragment, and wherein the insertion module is configured to:
insert the performance-read instruction for the at least one code fragment according to the user input.
Clause A16. The compiler of any of clauses A13-A15, wherein the target code comprises at least one function, and the insertion module is configured to:
insert the performance-read instruction for the at least one function by way of function-level instrumentation.
Clause A17. The compiler of any of clauses A13-A16, wherein the performance-read instruction is an assembly instruction that includes one or more parameters identifying a function name of the target code.
Clause A18. The compiler of any of clauses A13-A17, wherein the one or more parameters include a value obtained by data-compressing the function name.
Clause A19. The compiler of any of clauses A13-A18, wherein in compiling the performance-read instruction and the target code, the compilation module is configured to:
during compilation, place, for the target code into which the performance-read instruction has been inserted, a performance-enable instruction for enabling a hardware performance counter.
Clause A20. The compiler of any of clauses A13-A19, wherein the compilation module comprises a front-end compilation unit for performing a first compilation task and a back-end compilation unit for performing a second compilation task, and wherein: in performing the first compilation task, the front-end compilation unit is configured to compile the target code with the performance-read instruction inserted, to generate assembly code; and
in performing the second compilation task, the back-end compilation unit is configured to place the performance-enable instruction in the assembly code according to the performance-read instruction in the assembly code.
Clause A21. The compiler of any of clauses A13-A20, wherein in performing the second compilation task, the back-end compilation unit is further configured to:
generate a corresponding performance-read function according to the performance-read instruction, so as to convert execution of the performance-read instruction into a call to the performance-read function.
Clause A22. The compiler of any of clauses A13-A21, wherein during compilation the compilation module is further configured to:
generate, according to an identifier of a processor core that is to execute the executable program, a data-export instruction corresponding to the identifier, for exporting the hardware performance data.
Clause A23. A heterogeneous system for obtaining hardware performance data of target code, comprising an interconnected host and device, wherein:
the host comprises the compiler of any of clauses A13-A22 and is configured to perform master-control and cooperative operations on the device so as to obtain the hardware performance data produced when the device executes the executable program; and the device comprises one or more processor cores and is configured to:
execute the executable program generated by the compiler; and
send the hardware performance data to the host.
Clause A24. The heterogeneous system of clause A23, wherein in executing the performance-read function, the device is configured to:
obtain the hardware performance data from a performance monitoring unit of the processor core running the executable program.
Clause A25. The heterogeneous system of any of clauses A23-A24, wherein the device is configured to perform the following operations according to the data-export instruction:
selecting a transfer path corresponding to the identifier; and
transferring the hardware performance data to the host along the transfer path.
Clause A26. The heterogeneous system of any of clauses A23-A25, wherein in transferring the hardware performance data to the host along the transfer path, the device is configured to:
obtain the hardware performance data from a special-purpose register of the performance monitoring unit of the processor core; and
transfer the hardware performance data from the special-purpose register to a general-purpose register, so as to transfer the hardware performance data to the host.
Clause A27. The heterogeneous system of any of clauses A23-A26, wherein the device and the host implement bidirectional information exchange through an application programming interface.
Clause A28. The heterogeneous system of any of clauses A23-A27, wherein the device is configured to transfer the stored hardware performance data to the host through the application programming interface.
Clause A29. The heterogeneous system of any of clauses A23-A27, wherein before receiving the hardware performance data from the device, the host is configured to transfer to the device, through the application programming interface, storage-address information and/or storage-space information regarding the hardware performance data, and the device is configured to store the hardware performance data according to the storage-address information and/or storage-space information and to transfer the stored hardware performance data to the host through the application programming interface.
Clause A30. The heterogeneous system of any of clauses A23-A27, wherein the host is configured to use the application programming interface to perform data-format conversion on the hardware performance data, for subsequent processing of the hardware performance data.
Clause A31. A board card comprising the heterogeneous system of any of clauses A23-A30.
Clause A32. A compiler configured to perform the method of any of clauses A1-A10.
Clause A33. An electronic device comprising the compiler of any of clauses A13-A22 or clause A32.
While multiple embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Those skilled in the art may conceive of many modifications, changes and substitutions without departing from the idea and spirit of the present disclosure. It should be understood that, in practicing the present disclosure, various alternatives to the embodiments described herein may be employed. The appended claims are intended to define the scope of protection of the present disclosure and therefore to cover equivalents and alternatives within the scope of those claims.

Claims (20)

  1. A method for obtaining hardware performance data of target code, comprising:
    inserting, for the target code, a performance-read instruction for obtaining hardware performance data; and
    compiling the performance-read instruction and the target code to generate an executable program, wherein the hardware performance data related to execution of the target code is obtained by running the executable program.
  2. The method of claim 1, wherein the inserting comprises:
    inserting the performance-read instruction by receiving user input; or
    inserting the performance-read instruction automatically by a compiler.
  3. The method of claim 2, wherein the target code comprises at least one code fragment, and wherein inserting the performance-read instruction by receiving user input comprises:
    inserting the performance-read instruction for the at least one code fragment according to the user input.
  4. The method of claim 2, wherein the target code comprises at least one function, and inserting the performance-read instruction comprises:
    inserting the performance-read instruction for the at least one function by way of function-level instrumentation.
  5. The method of any of claims 1-4, wherein the performance-read instruction is an assembly instruction that includes one or more parameters identifying a function name of the target code.
  6. The method of claim 5, wherein the one or more parameters include a value obtained by data-compressing the function name.
  7. The method of claim 1, wherein compiling the performance-read instruction and the target code comprises:
    during compilation, placing, for the target code into which the performance-read instruction has been inserted, a performance-enable instruction for enabling a hardware performance counter.
  8. The method of claim 7, wherein during the compilation the method comprises performing a first compilation task and a second compilation task, wherein:
    in performing the first compilation task, the method comprises compiling the target code with the performance-read instruction inserted, to generate assembly code; and
    in performing the second compilation task, the method comprises placing the performance-enable instruction in the assembly code according to the performance-read instruction in the assembly code.
  9. The method of claim 8, wherein in performing the second compilation task, the method further comprises:
    generating a corresponding performance-read function according to the performance-read instruction, so as to convert execution of the performance-read instruction into a call to the performance-read function.
  10. The method of claim 1, wherein during compilation the method further comprises:
    generating, according to an identifier of a processor core that is to execute the executable program, a data-export instruction corresponding to the identifier, for exporting the hardware performance data.
  11. A device for obtaining hardware performance data of target code, comprising:
    at least one processor; and
    at least one memory storing computer program instructions which, when executed by the at least one processor, cause the device to perform the method of any of claims 1-10.
  12. A computer-readable storage medium storing computer program instructions for obtaining hardware performance data of target code, which, when executed by at least one processor, implement the method of any of claims 1-10.
  13. A compiler configured to perform the method of any of claims 1-10.
  14. A heterogeneous system for obtaining hardware performance data of target code, comprising an interconnected host and device, wherein:
    the host comprises the compiler of claim 13 and is configured to perform master-control and cooperative operations on the device so as to obtain the hardware performance data produced when the device executes the executable program; and
    the device comprises one or more processor cores and is configured to:
    execute the executable program generated by the compiler; and
    send the hardware performance data to the host.
  15. The heterogeneous system of claim 14, wherein in executing the performance-read function, the device is configured to:
    obtain the hardware performance data from a performance monitoring unit of the processor core running the executable program.
  16. The heterogeneous system of claim 15, wherein the device is configured to perform the following operations according to the data-export instruction:
    selecting a transfer path corresponding to the identifier; and
    transferring the hardware performance data to the host along the transfer path.
  17. The heterogeneous system of claim 16, wherein in transferring the hardware performance data to the host along the transfer path, the device is configured to:
    obtain the hardware performance data from a special-purpose register of the performance monitoring unit of the processor core; and
    transfer the hardware performance data from the special-purpose register to a general-purpose register, so as to transfer the hardware performance data to the host.
  18. The heterogeneous system of any of claims 14-17, wherein the stored hardware performance data is transferred from the device to the host through an application programming interface.
  19. The heterogeneous system of claim 18, wherein before receiving the hardware performance data from the device, the host is configured to transfer to the device, through the application programming interface, storage-address information and/or storage-space information regarding the hardware performance data, and the device is configured to store the hardware performance data according to the storage-address information and/or storage-space information and to transfer the stored hardware performance data to the host through the application programming interface, wherein the host is configured to use the application programming interface to perform data-format conversion on the hardware performance data, for subsequent processing of the hardware performance data.
  20. A board card comprising the heterogeneous system of any of claims 14-19.
PCT/CN2021/134128 2020-11-30 2021-11-29 Method, device and system for obtaining hardware performance data WO2022111703A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011379859.XA CN114579131A (zh) 2020-11-30 2020-11-30 Method, device and system for obtaining hardware performance data
CN202011379859.X 2020-11-30

Publications (1)

Publication Number Publication Date
WO2022111703A1 true WO2022111703A1 (zh) 2022-06-02

Family

ID=81754057

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/134128 WO2022111703A1 (zh) 2020-11-30 2021-11-29 用于获取硬件性能数据的方法、设备和系统

Country Status (2)

Country Link
CN (1) CN114579131A (zh)
WO (1) WO2022111703A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822256B (zh) * 2023-08-29 2023-11-10 Changzhou Xingyu Automotive Lighting Systems Co., Ltd. Method for debugging and verifying lane-line fitting deviation problems using scenario simulation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101408862A (zh) * 2007-10-12 2009-04-15 Li Zhou Embedded system testing method
US20110138125A1 (en) * 2009-12-04 2011-06-09 International Business Machines Corporation Event tracking hardware
CN103853648A (zh) * 2014-02-21 2014-06-11 Beijing Shenzhou Aerospace Software Technology Co., Ltd. Hardware-assisted testing apparatus and method for embedded-software performance evaluation
CN110908876A (zh) * 2018-09-18 2020-03-24 Alibaba Group Holding Ltd. Method and apparatus for obtaining hardware performance data


Also Published As

Publication number Publication date
CN114579131A (zh) 2022-06-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21897203; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21897203; Country of ref document: EP; Kind code of ref document: A1)