WO2020062086A1 - Method and device for selecting processor - Google Patents


Info

Publication number: WO2020062086A1
Authority: WIPO (PCT)
Prior art keywords: processor, program, processors, target, information
Application number: PCT/CN2018/108459
Other languages: French (fr), Chinese (zh)
Inventors: 刘恺 (Liu Kai), 周小超 (Zhou Xiaochao), 庞俊 (Pang Jun)
Original assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority application: CN201880094887.1A (CN112292667B)
Priority application: PCT/CN2018/108459 (WO2020062086A1)
Publication of WO2020062086A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; program switching, e.g. by interrupt
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • the present application relates to the field of computers, and, more particularly, to a method and apparatus for selecting a processor, and a computer device.
  • coprocessors can provide more parallel computing capability and increase computing speed.
  • coprocessors also have a better energy-efficiency and power-consumption ratio than general-purpose processors. Therefore, a coprocessor can make up for a CPU's lack of computing power and can reduce the overall energy consumption of the system.
  • examples of such workloads include neural networks (NN) and machine learning (ML).
  • compiling is the process of converting a program written in one programming language (the source language) into another language (the target language).
  • the source language may be the language used by the user when writing the target program, and the target language may be the language used by the processor that the user wishes to select to run the target program in a heterogeneous system.
  • a compiler can include a front end, an intermediate representation, and a back end.
  • the front end mainly implements the conversion from the source program to the intermediate representation; that is, the user first describes the computation of an operator in a domain-specific language (DSL), and this description serves as the source program input to the front end.
  • the intermediate representation (IR) is then input to the back end.
  • the back-end code generator completes the conversion from the IR to specific target code according to a specified target processor (for example, a general-purpose processor or a coprocessor).
  • in existing schemes, the programmer is required to manually specify, at an early stage (for example, while describing the computation of the operator in the DSL), the target processor on which the operator runs.
  • the target processor specified to the compiler may not be the most suitable processor to run the target program, which leads to a reduction in processing efficiency.
  • the process of manually specifying the target processor increases the workload of the programmer.
  • the present application provides a method and device for selecting a processor, which can improve the processing efficiency of computer equipment and reduce the burden on programmers.
  • a method for selecting a processor is provided: hardware information of each of at least two processors is obtained, where the hardware information is used to indicate the instruction set corresponding to each processor; program information of a target program to be executed is obtained, where the program information is used to indicate the instructions in the target program; and, based on the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program is determined from the at least two processors.
  • the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program.
  • in the method for selecting a processor, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among a variety of processors. The selected processor therefore matches the target program, and there is no need to manually specify a processor, thereby improving the processing efficiency of the computer device and reducing the burden on the programmer.
  • the "instruction set corresponding to the processor" can be understood as the functions that the processor can process, and the hardware information can be used to indicate a function (for example, a function name) that the processor can process; the "instructions in the target program" can be understood as the functions included in the target program, and the program information is used to indicate the functions (for example, function names) included in the target program.
  • determining, from the at least two processors, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority of each of the at least two processors; based on the program information and the hardware information, determining in order, from the highest priority to the lowest, whether each of the at least two processors meets the preset condition; and taking the first processor that meets the preset condition as the target processor.
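The priority-ordered selection described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the names (`select_target`, the `priority` and `instruction_set` fields) and the example hardware information are all assumptions.

```python
# Illustrative sketch of priority-ordered processor selection.
# The preset condition here: the processor's instruction set must
# include every instruction appearing in the target program.

def select_target(processors, program_instructions):
    """Return the first processor, from highest to lowest priority,
    that satisfies the preset condition, or None if none does."""
    for proc in sorted(processors, key=lambda p: p["priority"], reverse=True):
        if set(program_instructions) <= proc["instruction_set"]:
            return proc
    return None

# Hypothetical hardware information; the CPU gets the lowest priority
# so that a coprocessor is preferred whenever it can run the program.
processors = [
    {"name": "NPU", "priority": 3, "instruction_set": {"matmul", "conv2d"}},
    {"name": "GPU", "priority": 2, "instruction_set": {"matmul", "conv2d", "sort"}},
    {"name": "CPU", "priority": 1, "instruction_set": {"matmul", "conv2d", "sort", "io"}},
]

print(select_target(processors, ["matmul", "conv2d"])["name"])  # NPU
```

Because the CPU's instruction set is a superset in this example, it acts as the fallback when no coprocessor qualifies.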
  • the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; and because the CPU's power consumption is high, setting the CPU's priority to the lowest allows a coprocessor to be selected as the target processor whenever possible, further improving the effect and practicality of the present application.
  • the at least two processors include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processor (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  • the ASIC can perform calculations by software.
  • the hardware information is also used to indicate the size of the processor's available memory space, the program information is also used to indicate the memory space required by the target program, and the preset condition further includes that the processor's available memory space is greater than or equal to the memory space required by the target program.
  • the available memory space of the processor may refer to a specified proportion (for example, 90%) of the processor's total memory space, or to a specified proportion of the processor's total free memory space.
  • "the preset condition also includes that the available memory space of the processor is greater than or equal to the memory space required by the target program" may mean that the preset condition is met as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.
  • alternatively, it may mean that the processor's available memory space needs to be greater than or equal to the memory space required by all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
  • because the preset conditions further include that the available memory space of the processor is greater than or equal to the memory space required by the target program, it can be ensured that the selected target processor can support the running of the target program, thereby further improving the practicality of the present application.
  • obtaining the program information of the target program includes: determining the memory space that the target program needs to occupy according to the data dimensions of the target program. This makes it easy to determine the memory space that the target program needs to occupy.
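As a rough sketch of how required memory might be derived from data dimensions, one can multiply each tensor's dimensions together and by the element size. The function name, the example shapes, and the 4-byte (float32) element size below are assumptions for illustration, not details from the patent.

```python
# Hypothetical estimate of the memory a target program needs to occupy,
# derived from the data dimensions (shapes) of its tensors.

def required_memory_bytes(shapes, element_size=4):
    """Sum, over every tensor shape, the product of its dimensions
    multiplied by the size of one element (default: 4-byte float32)."""
    total = 0
    for shape in shapes:
        count = 1
        for dim in shape:
            count *= dim  # number of elements in this tensor
        total += count * element_size
    return total

# e.g. a (1024, 1024) matrix plus a (1024,) vector of float32 values
print(required_memory_bytes([(1024, 1024), (1024,)]))  # 4198400
```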
  • obtaining the program information of the target program includes: determining the program information according to an intermediate representation (IR) of the target program, where the IR of the target program is determined according to domain-specific language (DSL) code of the target program.
  • the DSL code may be handled by a front-end compiler in the computer device, and the IR may be determined by an intermediate compiler in the computer device. Thereby, the program information can be easily obtained.
  • the hardware information of each of the at least two processors is obtained as follows: the hardware information of each processor is obtained according to registration information of that processor, where the registration information is used to register the processor with the computer device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
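A minimal sketch of such registration, assuming the hardware description (instruction set, available memory) is supplied when the processor registers with the computer device. The class and field names are hypothetical, not from the patent.

```python
# Hypothetical registry: each processor registers with the computer
# device, and its hardware description is stored for later queries
# by the selection logic.

class ProcessorRegistry:
    def __init__(self):
        self._table = {}

    def register(self, name, instruction_set, available_memory):
        """Record the hardware description supplied at registration time
        (obtained offline, or read from the processor's driver)."""
        self._table[name] = {
            "instruction_set": set(instruction_set),
            "available_memory": available_memory,
        }

    def hardware_info(self, name):
        """Return the stored hardware information for one processor."""
        return self._table[name]

registry = ProcessorRegistry()
registry.register("NPU", ["matmul", "conv2d"], 8 << 30)  # 8 GiB
print(registry.hardware_info("NPU")["available_memory"])  # 8589934592
```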
  • the computer device includes at least two back-end compilers, the at least two back-end compilers correspond one-to-one to the at least two processors, and each back-end compiler is configured to convert the IR into code recognized by its corresponding processor.
  • the method further includes: inputting the IR of the target program to a target back-end compiler corresponding to the target processor.
  • the IR of the target program may be an IR on which IR optimization has already been performed.
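The one-to-one dispatch from target processor to back-end compiler can be sketched as a simple lookup table. The compiler functions here are stand-ins, not real back-end APIs.

```python
# Hypothetical dispatch of the (optimized) IR to the back-end compiler
# that corresponds one-to-one with the selected target processor.

def compile_for_cpu(ir):
    return f"cpu-code({ir})"

def compile_for_gpu(ir):
    return f"gpu-code({ir})"

# One back-end compiler per processor, keyed by processor name.
BACKENDS = {"CPU": compile_for_cpu, "GPU": compile_for_gpu}

def dispatch(ir, target_processor):
    """Input the IR to the back-end compiler of the target processor."""
    return BACKENDS[target_processor](ir)

print(dispatch("ir-of-target-program", "GPU"))  # gpu-code(ir-of-target-program)
```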
  • a method for selecting a processor includes: acquiring hardware information of each of at least two processors, where the hardware information is used to indicate the size of the processor's available memory space; acquiring program information of a target program, where the program information is used to indicate the memory space that the target program needs to occupy; and, based on the program information and the hardware information, determining from the at least two processors a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the processor's available memory space is greater than or equal to the memory space required by the target program.
  • the available memory space of the processor may refer to a specified proportion (for example, 90%) of the processor's total memory space, or to a specified proportion of the processor's total free memory space.
  • the at least two processors include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processor (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  • "the preset condition also includes that the available memory space of the processor is greater than or equal to the memory space required by the target program" may mean that the preset condition is met as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.
  • alternatively, it may mean that the processor's available memory space needs to be greater than or equal to the memory space required by all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
  • obtaining the program information of the target program includes: determining the memory space that the target program needs to occupy according to the data dimensions of the target program. This makes it easy to determine the memory space that the target program needs to occupy.
  • the hardware information is also used to indicate the instruction set corresponding to the processor, the program information is also used to indicate the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instructions in the target program.
  • the "instruction set corresponding to the processor" can be understood as the functions that the processor can process, and the hardware information can be used to indicate a function (for example, a function name) that the processor can process; the "instructions in the target program" can be understood as the functions included in the target program, and the program information is used to indicate the functions (for example, function names) included in the target program.
  • determining, from the at least two processors, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority of each of the at least two processors; based on the program information and the hardware information, determining in order, from the highest priority to the lowest, whether each of the at least two processors meets the preset condition; and taking the first processor that meets the preset condition as the target processor.
  • the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; and because the CPU's power consumption is high, setting the CPU's priority to the lowest allows a coprocessor to be selected as the target processor whenever possible, further improving the effect and practicality of the present application.
  • obtaining the program information of the target program includes: determining the program information according to an intermediate representation (IR) of the target program, where the IR of the target program is determined according to domain-specific language (DSL) code of the target program.
  • the DSL code may be handled by a front-end compiler in the computer device, and the IR may be determined by an intermediate compiler in the computer device.
  • the hardware information of each of the at least two processors is obtained as follows: the hardware information of each processor is obtained according to registration information of that processor, where the registration information is used to register the processor with the computer device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
  • the computer device includes at least two back-end compilers, the at least two back-end compilers correspond one-to-one to the at least two processors, and each back-end compiler is configured to convert the IR into code recognized by its corresponding processor.
  • the method further includes: inputting the IR of the target program to a target back-end compiler corresponding to the target processor.
  • the IR of the target program may be an IR on which IR optimization has already been performed.
  • an apparatus for selecting a processor includes: an identification module, configured to obtain hardware information of each of at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor; an analysis module, configured to obtain program information of a target program to be executed, where the program information is used to indicate the instructions in the target program; and a selection module, configured to determine, from the at least two processors and based on the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program.
  • with the processor selection apparatus provided in the present application, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among a variety of processors. The selected processor therefore matches the target program, and there is no need to manually specify a processor, thereby improving the processing efficiency of the computer device and reducing the burden on the programmer.
  • the "instruction set corresponding to the processor" can be understood as the functions that the processor can process, and the hardware information can be used to indicate a function (for example, a function name) that the processor can process; the "instructions in the target program" can be understood as the functions included in the target program, and the program information is used to indicate the functions (for example, function names) included in the target program.
  • the selection module is configured to determine a priority of each of the at least two processors and, based on the program information and the hardware information, determine in order, from the highest priority to the lowest, whether each of the at least two processors meets the preset condition, taking the first processor that meets the preset condition as the target processor.
  • by setting a priority for each processor, personalized processing can be achieved, and different processing scenarios can be handled flexibly.
  • the efficiency of determining the target processor can be improved, and the time for determining the target processor can be shortened.
  • the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; and because the CPU's power consumption is high, setting the CPU's priority to the lowest allows a coprocessor to be selected as the target processor whenever possible, further improving the effect and practicality of the present application.
  • the at least two processors include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processor (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  • the ASIC can perform calculations by software.
  • the hardware information is also used to indicate the size of the processor's available memory space, the program information is also used to indicate the memory space required by the target program, and the preset condition further includes that the processor's available memory space is greater than or equal to the memory space required by the target program.
  • the available memory space of the processor may refer to a specified proportion (for example, 90%) of the processor's total memory space, or to a specified proportion of the processor's total free memory space.
  • "the preset condition also includes that the available memory space of the processor is greater than or equal to the memory space required by the target program" may mean that the preset condition is met as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.
  • alternatively, it may mean that the processor's available memory space needs to be greater than or equal to the memory space required by all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
  • because the preset conditions further include that the available memory space of the processor is greater than or equal to the memory space required by the target program, it can be ensured that the selected target processor can support the running of the target program, thereby further improving the practicality of the present application.
  • the analysis unit is configured to determine a memory space required by the target program according to a data dimension of the target program. Therefore, it is possible to easily determine the memory space that the target program needs to occupy.
  • the analysis module is configured to determine the program information according to the intermediate representation (IR) of the target program, where the IR is determined according to domain-specific language (DSL) code of the target program.
  • the identification unit is configured to obtain hardware information of each processor according to registration information of each processor, and the registration information is used to register the processor in the computing device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
  • the computer device includes at least two back-end compilers, the at least two back-end compilers correspond one-to-one to the at least two processors, and each back-end compiler is configured to convert the IR into code recognized by its corresponding processor.
  • the selection unit is used to input the IR of the target program to a target back-end compiler corresponding to the target processor.
  • the IR of the target program may be an IR on which IR optimization has already been performed.
  • an apparatus for selecting a processor includes: an identification unit, configured to obtain hardware information of each of at least two processors, where the hardware information is used to indicate the size of the processor's available memory space; an analysis unit, configured to obtain program information of a target program, where the program information is used to indicate the memory space that the target program needs to occupy; and a selection unit, configured to determine, from the at least two processors, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy.
  • the available memory space of the processor may refer to a specified proportion (for example, 90%) of the processor's total memory space, or to a specified proportion of the processor's total free memory space.
  • "the preset condition also includes that the available memory space of the processor is greater than or equal to the memory space required by the target program" may mean that the preset condition is met as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.
  • alternatively, it may mean that the processor's available memory space needs to be greater than or equal to the memory space required by all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
  • the analysis unit is configured to determine a memory space required by the target program according to a data dimension of the target program. Therefore, it is possible to easily determine the memory space that the target program needs to occupy.
  • the at least two processors include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processor (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  • the hardware information is also used to indicate the instruction set corresponding to the processor, the program information is also used to indicate the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instructions in the target program.
  • the "instruction set corresponding to the processor" can be understood as the functions that the processor can process, and the hardware information can be used to indicate a function (for example, a function name) that the processor can process; the "instructions in the target program" can be understood as the functions included in the target program, and the program information is used to indicate the functions (for example, function names) included in the target program.
  • the selection module is configured to determine a priority of each of the at least two processors and, based on the program information and the hardware information, determine in order, from the highest priority to the lowest, whether each of the at least two processors meets the preset condition, taking the first processor that meets the preset condition as the target processor.
  • the priority of each processor may be determined according to the parallel computing capability of each of the at least two processors.
  • the at least two processors include a central processing unit CPU, and the CPU has the lowest priority among the at least two processors.
  • obtaining the program information of the target program includes: obtaining domain-specific language (DSL) code of the target program; determining an intermediate representation (IR) according to the DSL code; and determining the program information according to the IR.
  • the program information can be easily obtained.
  • the identification unit is configured to obtain hardware information of each processor according to registration information of each processor, and the registration information is used to register the processor in the computing device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
  • the computer device includes at least two back-end compilers, the at least two back-end compilers correspond one-to-one to the at least two processors, and each back-end compiler is configured to convert the IR into code recognized by its corresponding processor.
  • the selection unit is used to input the IR of the target program to a target back-end compiler corresponding to the target processor.
  • the IR of the target program may be an IR on which IR optimization has already been performed.
  • a compiling device configured in a computer device including at least two processors.
  • the device includes a front-end compilation unit, an intermediate compilation unit, a selection unit, and a plurality of back-end compilation units, where the back-end compilation units correspond one-to-one to the processors.
  • the front-end compilation unit is used to obtain the DSL code corresponding to the target program; the intermediate compilation unit is used to determine the IR according to the DSL code; and the selection unit is used to determine the program information of the target program according to the IR, obtain the hardware information of each of the at least two processors, determine from the at least two processors, based on the program information and the hardware information, a target processor for executing the target program, and send the IR to the back-end compilation unit corresponding to the target processor.
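The compiling-device pipeline above (DSL to IR, selection, back end) can be sketched end to end as follows. Every function and data structure here is an illustrative placeholder under the assumptions stated in the comments, not the patent's implementation.

```python
# End-to-end sketch: front end turns DSL into an IR plus program
# information; the selector matches that against hardware information;
# the matching back end turns the IR into target code.

def front_end(dsl_source):
    """Pretend front-end + intermediate stage: DSL -> IR + program info."""
    return {"ir": f"ir({dsl_source})", "instructions": ["matmul"]}

def select(program_info, hardware):
    """Pick the first processor (assumed already in priority order)
    whose instruction set covers the program's instructions."""
    for name, instruction_set in hardware:
        if set(program_info["instructions"]) <= instruction_set:
            return name
    return None

def back_end(ir, target):
    """Pretend back-end compiler for the selected target processor."""
    return f"{target.lower()}-code({ir})"

hardware = [("NPU", {"matmul", "conv2d"}), ("CPU", {"matmul", "conv2d", "io"})]
info = front_end("C = A @ B")
target = select(info, hardware)
print(target, back_end(info["ir"], target))  # NPU npu-code(ir(C = A @ B))
```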
  • the program information is used to indicate an instruction in the target program
  • the hardware information is used to indicate a corresponding instruction set of a processor
  • the target processor is a processor that satisfies a preset condition among the at least two processors
  • the The preset condition includes that the instruction set corresponding to the processor includes instructions in the target program; and / or the hardware information is used to indicate the size of the available memory space of the processor, and the program information is used to indicate the memory space required by the target program.
  • the preset condition includes that the available memory space of the processor is greater than or equal to the memory space required by the target program.
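The two preset conditions above (instruction-set coverage and sufficient available memory) can be sketched as a simple check. The dictionary fields and example opcode names below are assumptions chosen for illustration; the application does not fix a concrete data representation.

```python
# Sketch of the preset-condition check described above (hypothetical data
# structures; field names are assumptions, not from the application).

def meets_preset_condition(program_info, hardware_info):
    """Return True if a processor satisfies both preset conditions.

    program_info:  {"instructions": set of opcode names the program uses,
                    "required_memory": memory space the program needs}
    hardware_info: {"instruction_set": set of opcodes the processor supports,
                    "available_memory": the processor's free memory space}
    """
    # Condition 1: the processor's instruction set includes the program's instructions.
    covers_isa = program_info["instructions"] <= hardware_info["instruction_set"]
    # Condition 2: available memory is greater than or equal to what the program needs.
    enough_mem = hardware_info["available_memory"] >= program_info["required_memory"]
    return covers_isa and enough_mem

program = {"instructions": {"vadd", "vmul"}, "required_memory": 512}
gpu = {"instruction_set": {"vadd", "vmul", "vfma"}, "available_memory": 4096}
dsp = {"instruction_set": {"mac", "vadd"}, "available_memory": 4096}

print(meets_preset_condition(program, gpu))  # True: ISA and memory both match
print(meets_preset_condition(program, dsp))  # False: "vmul" is not supported
```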
  • the hardware information of each processor and the program information of the target program are obtained in advance, and based on them a processor whose hardware information matches the program information is selected from the multiple processors. This matches the selected processor to the target program without the processor having to be specified manually, thereby improving the processing efficiency of the computer equipment and reducing the burden on the programmer.
  • the selection unit is configured to determine a priority of each of the at least two processors; based on the program information and the hardware information, check in order of priority from high to low whether each processor meets the preset condition; and use the first processor that meets the preset condition as the target processor.
  • by assigning a priority to each processor, personalized processing can be achieved, and different processing scenarios can be handled flexibly.
  • the efficiency of determining the target processor can be improved, and the time for determining the target processor can be shortened.
  • the priority of each processor may be determined according to the parallel computing capability of each of the at least two processors.
  • the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that there is a processor capable of processing the target program among the at least two processors, and, because the power consumption of the CPU is high, setting the CPU's priority to the lowest allows a coprocessor to be selected as the target processor first. The effect and practicability of the present application can therefore be further improved.
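The priority-ordered selection described above can be sketched minimally as follows: processors are tried from highest to lowest priority, and the first one meeting the preset condition becomes the target, with the CPU given the lowest priority so that a qualifying coprocessor is preferred. All names and priority values here are illustrative assumptions.

```python
# Minimal sketch of priority-ordered target-processor selection.
# Processor names, priority values, and the predicate shape are assumptions.

def select_target_processor(processors, can_run):
    """processors: list of (name, priority) pairs; higher priority is tried first.
    can_run: predicate telling whether a processor meets the preset condition."""
    for name, _prio in sorted(processors, key=lambda p: p[1], reverse=True):
        if can_run(name):
            return name  # first processor meeting the preset condition
    return None          # no processor satisfies the preset condition

# The CPU gets the lowest priority; the NPU/GPU are preferred when they qualify.
processors = [("cpu", 0), ("gpu", 2), ("npu", 3)]

# Suppose the NPU lacks an instruction the program needs, but the GPU qualifies.
supported = {"cpu": True, "gpu": True, "npu": False}
print(select_target_processor(processors, supported.__getitem__))  # "gpu"
```

When no coprocessor qualifies, the scan naturally falls through to the CPU, matching the "CPU lowest priority" behavior described above.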
  • the at least two processors include at least two of the following processors: a CPU, a graphics processor GPU, a field programmable gate array FPGA, an application specific integrated circuit ASIC, a neural network processor NPU, an image processing unit IPU, or Digital Signal Processing DSP.
  • the identification unit is configured to obtain hardware information of each processor according to registration information of each processor, and the registration information is used to register the processor in the computing device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
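The registration flow above might be sketched as follows. The registry structure and field names are hypothetical; they stand in for hardware description information obtained either offline before installation or from the driver information when the processor is installed.

```python
# Hedged sketch of obtaining hardware information from registration information.
# All field names ("instruction_set", "memory") are illustrative assumptions.

registry = {}

def register_processor(name, hardware_description):
    """Record a processor's hardware description at registration time."""
    registry[name] = hardware_description

def hardware_info(name):
    """Later, the selection unit reads hardware information back from the registry."""
    return registry[name]

# e.g. a description obtained from the driver when the GPU is installed
register_processor("gpu0", {"instruction_set": {"vadd", "vmul"}, "memory": 8 << 30})
print(hardware_info("gpu0")["memory"])  # 8589934592 (8 GiB)
```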
  • a computer device which includes multiple processors, a compiler, and a selection device, where the selection device executes the method in the first aspect and any possible implementation manner thereof, or the method in the second aspect and any of its possible implementation manners.
  • the compiler includes a front-end compiler, an intermediate compiler, and a back-end compiler.
  • a chip or chipset including at least one processor and at least one memory control unit.
  • the processor executes the method in the first aspect and any possible implementation manner thereof, or the second aspect and any of its possible implementation manners.
  • the chip or chipset may include a smart chip.
  • the smart chip may include at least two processors.
  • a computer system including a processor and a memory.
  • the processor includes at least two processors and a memory control unit, and the processor executes the method in the first aspect and any possible implementation manner. Or, the method in the second aspect and any one of the possible implementation manners.
  • the computing system further includes a system bus for connecting the processor (specifically, a memory control unit) and a memory.
  • a computer program product includes a computer program (also referred to as code or instructions).
  • the processor executes the method in the foregoing first aspect and any one of its possible implementations, or the method in the foregoing second aspect and any of its possible implementations.
  • a computer-readable medium stores a computer program (also referred to as code, or instructions) that, when run on a processor or a processor in a chip, causes the processor to execute the method in the foregoing first aspect and any one of its possible implementations, or the method in the foregoing second aspect and any of its possible implementations.
  • FIG. 1 is a schematic hardware structural diagram of a computer device (or computer system) to which the method and apparatus according to an embodiment of the present application are applied.
  • FIG. 2 is a schematic diagram of an example of a lexical analysis process of the present application.
  • FIG. 3 is a schematic diagram of an example of a syntax analysis process of the present application.
  • FIG. 4 is a schematic diagram of an example of an intermediate code generation and optimization process of the present application.
  • FIG. 5 is a schematic flowchart of an example of a method for selecting a processor according to the present application.
  • FIG. 6 is a schematic flowchart of another example of a method for selecting a processor according to the present application.
  • FIG. 7 is a schematic diagram of an example of a compilation method of the present application.
  • FIG. 8 is a schematic configuration diagram of an example of a processor selection device of the present application.
  • FIG. 9 is a schematic configuration diagram of an example of a compiler device of the present application.
  • a computing device can also be referred to as a computer system. From a logical hierarchical perspective, a computing device can include a hardware layer, an operating system layer running on the hardware layer, and an application layer running on the operating system layer.
  • the hardware layer includes hardware such as a processor, a memory, and a memory control unit. The functions and structure of the hardware are described in detail later.
  • the operating system may be any one or more computer operating systems that implement business processing through processes, such as a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a windows operating system.
  • the application layer contains applications such as browsers, address books, word processing software, and instant messaging software.
  • the computer system may be a handheld device such as a smart phone, or a terminal device such as a personal computer; this is not particularly limited in the present application, as long as the system can read the program code of the method of the embodiment and run that code to carry out the method according to the embodiment of the present application.
  • the execution subject of the method in the embodiment of the present application may be a computer system, or a functional module in the computer system, such as a processor, that can call a program and execute the program.
  • a program or program code refers to a set of ordered instructions (or codes) used to implement some relatively independent function.
  • a process is a running process of a program and its data on a computer device.
  • the program usually adopts a modular design, that is, the function of the program is broken down into multiple smaller functional modules.
  • the program contains at least one function.
  • a function is a code segment that implements a functional module. Therefore, functions are the basic unit of program function modularity, and can also be regarded as subroutines.
  • FIG. 1 is a schematic structural diagram of a computing device 100 according to an embodiment of the present application. The computing device shown in FIG. 1 may be used to perform the methods of the embodiments of the present application.
  • the computing device 100 may include: at least two processors 110, and a memory 120.
  • the computing device 100 may further include a system bus, where the processor 110 and the memory 120 are respectively connected to the system bus.
  • the processor 110 can access the memory 120 through the system bus.
  • the processor 110 can read and write data or execute code in the memory 120 through the system bus.
  • the function of the processor 110 is mainly to interpret instructions (or codes) of a computer program and to process data in computer software.
  • the instructions of the computer program and data in the computer software may be stored in the memory 120 or the cache unit 116.
  • the processor 110 may be an integrated circuit chip or a component therein, and has a signal processing capability.
  • the processor 110 may fetch instructions from a memory or a cache memory, place them in an instruction register, and decode the instructions. It breaks down instructions into a series of micro-operations, and then issues various control commands to execute a series of micro-operations to complete the execution of an instruction.
  • An instruction is a basic command by which the computer specifies the type of operation to perform and its operands. An instruction is composed of one or more bytes, including an opcode field, one or more fields related to operand addresses, and some status words and feature codes that characterize the state of the machine. Some instructions directly contain the operand itself.
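As a toy illustration of the opcode-plus-operand-fields layout described above: this 16-bit format, with a 4-bit opcode and two 6-bit register fields, is invented purely for illustration and does not correspond to any real instruction set.

```python
# Made-up 16-bit instruction format: [4-bit opcode | 6-bit src | 6-bit dst].
# Purely illustrative; not any real ISA.

def encode(opcode, src, dst):
    """Pack the opcode field and two operand-address fields into one word."""
    return (opcode << 12) | (src << 6) | dst

def decode(word):
    """Split a 16-bit instruction word back into its fields."""
    opcode = (word >> 12) & 0xF   # top 4 bits: the opcode field
    src = (word >> 6) & 0x3F      # next 6 bits: source operand field
    dst = word & 0x3F             # low 6 bits: destination operand field
    return opcode, src, dst

word = encode(0x3, 5, 9)   # e.g. "ADD r9, r5" in this toy format
print(decode(word))        # (3, 5, 9)
```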
  • the processor 110 may include a memory control unit 114 and at least one processing unit 112.
  • the processing unit 112 may also be referred to as a core or kernel, and is the most important component of the processor.
  • the processing unit 112 may be manufactured from monocrystalline silicon using a particular production process.
  • the calculation, command receiving, command storing, and data processing of the processor 110 are all performed by the core.
  • the processing unit 112 can run the program instructions independently, and use the ability of parallel computing to accelerate the running speed of the program.
  • Various processors 110 have a fixed logical structure.
  • the processor 110 includes a logical unit such as a first-level cache, a second-level cache, an execution unit, an instruction-level unit, and a bus interface.
  • the memory control unit 114 is configured to control data interaction between the memory 120 and the processing unit 112. Specifically, the memory control unit 114 may receive a memory access request from the processing unit 112 and control access to the memory based on the memory access request.
  • the memory control unit may be a device such as a memory management unit (MMU).
  • each memory control unit 114 may address the memory 120 through a system bus.
  • an arbiter (not shown) may be configured in the system bus, and the arbiter may be responsible for processing and coordinating competing accesses of the plurality of processing units 112.
  • the processing unit 112 and the memory control unit 114 may be connected through a connection line inside the chip, such as an address line, to implement communication between the processing unit 112 and the memory control unit 114.
  • each processor 110 may further include a cache unit 116, where the cache unit 116 is a buffer (called a cache) for data exchange.
  • when the processing unit 112 wants to read data, it first looks for the required data in the cache unit 116; if found, the data is used directly, and if not, the data is fetched from the memory 120. Since the cache unit 116 runs much faster than the memory 120, the role of the cache unit 116 is to help the processing unit 112 run faster.
  • the memory 120 may provide a running space for a process in the computing device 100.
  • the memory 120 may store a computer program (specifically, a program code) for generating a process, and the memory 120 may store a process Data generated during operation, for example, intermediate data, or process data.
  • the memory may also be called internal memory; its function is to temporarily hold the operational data of the processor 110 and the data exchanged with external storage such as a hard disk. While the computer is running, the processor 110 transfers the data to be operated on into the memory 120, and the processing unit 112 transmits the result out after the operation is completed.
  • the memory 120 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • the volatile memory may be a random access memory (RAM), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus RAM (DR RAM).
  • the structure of the computing device 100 listed above is only an exemplary description, and the present application is not limited thereto.
  • the computing device 100 in the embodiment of the present application may include various hardware in a computer system in the prior art.
  • the computing device 100 may further include other memories besides the memory 120, for example, disk storage.
  • a virtualization technology may be applied on the computing device 100.
  • the computer device 100 can run multiple virtual machines at the same time, each virtual machine can run at least one operating system, and each operating system runs multiple programs.
  • Virtual machine refers to a complete computer system with complete hardware system functions and running in a completely isolated environment simulated by software.
  • the processor 110 may include a plurality of categories. For example, different kinds of processors may use different kinds of instructions. As another example, different types of processors may have different computing capabilities. As another example, different kinds of processors can be used to handle different types of calculations.
  • the various processors may include a general purpose processor and a coprocessor. The above-mentioned various processors are respectively described in detail below.
  • a general-purpose processor can also be referred to as a central processing unit (CPU), which is a very large-scale integrated circuit or a component thereof, and is a computing core (Core) and a control core (Control Unit) of a computer. Its function is mainly to interpret computer instructions and process data in computer software.
  • the central processing unit mainly includes the arithmetic logic unit (ALU), the cache memory (Cache), and the data bus, control bus, and status bus that connect them. Together with internal memory and input/output (I/O) equipment, these are called the three core components of an electronic computer.
  • the CPU includes an arithmetic logic unit, a register unit, a control unit, and the like.
  • the arithmetic logic unit is the operational logic component; it can perform fixed-point or floating-point arithmetic operations, shift operations, and logical operations, as well as address operations and conversions.
  • Registers include general purpose registers, special purpose registers, and control registers.
  • General-purpose registers can be divided into fixed-point and floating-point registers; they temporarily store operands and intermediate (or final) operation results during instruction execution.
  • the control unit is mainly responsible for decoding the instructions and sending out control signals to complete each operation to be performed by each instruction.
  • Microcode is maintained in the micro memory, and each microcode corresponds to a basic micro operation, also called microinstruction; each instruction is composed of different sequences of microcode, and this microcode sequence constitutes a microprogram.
  • when the central processor decodes an instruction, it sends out timing control signals and, in the order of the given sequence, executes in micro-cycles the micro operations determined by the microcode, thereby completing the execution of the instruction.
  • Simple instructions are composed of three to five micro operations, while complex instructions are composed of dozens or even hundreds of micro operations.
  • a coprocessor is a chip, or part of a chip, used to offload specific processing tasks from the system's main microprocessor.
  • a coprocessor is a processor developed to assist the central processing unit in performing processing tasks that the CPU cannot perform, or performs slowly and inefficiently.
  • to that end, various auxiliary processors came into being. It should be noted that, since the integer arithmetic unit and the floating-point arithmetic unit are already integrated in current computers, the floating-point processor is no longer an auxiliary processor.
  • the coprocessor built into the CPU is not necessarily an auxiliary processor. Of course, the coprocessor can also exist independently.
  • the coprocessor can be used for specific processing tasks, for example, a mathematical coprocessor can control digital processing; a graphics coprocessor can handle video rendering.
  • the coprocessor can be attached to a general-purpose processor.
  • a coprocessor extends the general-purpose processor core processing capabilities by extending the instruction set or providing configuration registers.
  • One or more coprocessors can be connected to a general-purpose processor core through a coprocessor interface.
  • the coprocessor can also expand the instruction set by providing a new set of specialized instructions.
  • the coprocessor may include, but is not limited to, at least one of the following processors:
  • Graphics processing unit (GPU): also known as the display core, visual processor, or display chip, the GPU is a microprocessor dedicated to graphics and image arithmetic.
  • the purpose of the GPU is to convert and drive the display information required by the computer system, and provide line scanning signals to the display to control the correct display of the display. It is an important component that connects the display and the main board of the personal computer, and is also an important device for "human-machine dialogue".
  • the processor of a graphics card is sometimes called a graphics processor (GPU).
  • GPUs have 2D or 3D graphics acceleration capabilities. If the CPU wants to draw a two-dimensional graphic, it only needs to send an instruction to the GPU, such as "draw a rectangle with length and width a × b at coordinate position (x, y)"; the GPU can quickly compute and draw the corresponding graphic at the specified position on the display, notify the CPU that drawing is finished, and then wait for the CPU's next graphics instruction. With a GPU, the CPU is freed from graphics processing tasks and can perform other system tasks, which greatly improves the overall performance of the computer. Because the GPU generates a lot of heat, a radiator or fan is usually installed above it.
  • the GPU is the "brain" of the graphics card.
  • the GPU determines the grade and most of the performance of the graphics card.
  • the GPU is also the basis for the difference between a 2D graphics card and a 3D graphics card.
  • a 2D display chip mainly relies on the processing power of the CPU when processing 3D images and special effects, which is called soft acceleration.
  • a 3D display chip integrates the three-dimensional image and special-effects processing functions into the display chip itself, the so-called "hardware acceleration" function.
  • the display chip is generally the largest chip (and also has the most pins) on the display card.
  • the GPU is no longer limited to 3D graphics processing.
  • the development of GPU general computing technology has attracted a lot of attention in the industry.
  • practice has also proven that, in floating-point computing, parallel computing, and other such workloads, the GPU can provide dozens or even hundreds of times the performance of the CPU.
  • the GPU enables computer equipment to reduce its dependence on the CPU and share some of the work that was originally performed by the CPU.
  • the field programmable gate array (FPGA) is a programmable device developed further from products such as Programmable Array Logic (PAL), Generic Array Logic (GAL), and the Complex Programmable Logic Device (CPLD).
  • the FPGA appears as a semi-custom circuit in the field of the Application Specific Integrated Circuit (ASIC); it not only remedies the shortcomings of fully custom circuits but also overcomes the drawback of the limited gate count of the original programmable devices. System designers can connect the logic blocks inside the FPGA through editable connections as needed, as if a circuit test board had been placed inside a chip. The logic blocks and connections of a finished FPGA can still be changed by the designer after it leaves the factory, so the FPGA can implement the required logic functions.
  • the FPGA uses a logic cell array (LCA, Logic Cell Array), which includes three parts: a configurable logic module (CLB, Configurable Logic Block), an input output module (IOB, Input Output Block), and an internal connection (Interconnect).
  • through different programming methods, the FPGA can take on different structures compared with traditional logic circuits and gate arrays (such as PAL, GAL, and CPLD devices).
  • the FPGA uses small lookup tables (16×1 RAM) to implement combinational logic. Each lookup table is connected to the input of a D flip-flop, and the flip-flop in turn drives other logic circuits or I/O.
  • this forms a basic logic cell module that can implement both combinational logic and sequential logic functions.
  • the logic of the FPGA is implemented by loading programming data into internal static storage cells.
  • the values stored in the storage cells determine the logic functions of the logic cells and the connection modes among the modules and between the modules and the I/O, and ultimately determine the functions the FPGA can realize; the FPGA allows unlimited reprogramming.
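The 16×1 lookup-table mechanism described above can be mimicked in software: any 4-input combinational function is realized by storing its 16-entry truth table in a small memory and indexing that memory with the input bits. A sketch:

```python
# Software model of an FPGA lookup table (LUT): a 16x1 RAM indexed by 4 inputs.
# The truth table contents are the "programming data" loaded into the cell.

def make_lut(truth_table):
    """truth_table: list of 16 bits, indexed by the 4 input bits (a, b, c, d)."""
    assert len(truth_table) == 16
    def lut(a, b, c, d):
        index = (a << 3) | (b << 2) | (c << 1) | d  # inputs form the RAM address
        return truth_table[index]
    return lut

# "Program" the LUT to compute a 4-input AND: only entry 0b1111 holds a 1.
and4 = make_lut([0] * 15 + [1])
print(and4(1, 1, 1, 1))  # 1
print(and4(1, 0, 1, 1))  # 0
```

Reprogramming the cell is just loading a different truth table, which mirrors how the stored values determine the logic function of the cell.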
  • the FPGA does not include an instruction set
  • the method 200 described below may not be used to determine whether the FPGA can be used as a target processor.
  • a method 300 described later can be used to determine whether the FPGA can be used as a target processor.
  • Neural network processors adopt a "data-driven parallel computing" architecture and are particularly good at processing massive multimedia data such as video and images.
  • NPU can be used for deep learning.
  • deep learning is actually a type of multilayer large-scale artificial neural network. It is modeled after a biological neural network and consists of several artificial neuron nodes interconnected. Neurons are connected one by one through synapses. Synapses record the weight of the connections between neurons. Each neuron can be abstracted into a stimulus function whose input is determined by the output of the neuron connected to it and the synapses that connect the neuron.
  • the basic operation of deep learning is the processing of neurons and synapses.
  • the traditional processor instruction set was developed for general-purpose computing; its basic operations are arithmetic operations (addition, subtraction, multiplication, and division) and logical operations (AND, OR, NOT). Completing the processing of a single neuron often takes hundreds or even thousands of instructions, so the processing efficiency for deep learning is not high.
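The neuron abstraction described earlier can be written out directly: a neuron's output is a stimulus (activation) function applied to the weighted sum of the outputs of the neurons connected to it, with the synapses holding the connection weights. The sigmoid used here is one common choice of stimulus function, not one mandated by the application.

```python
import math

# One neuron: output = stimulus(sum of connected outputs x synapse weights).
# The sigmoid activation is an illustrative assumption.

def neuron_output(inputs, weights, bias=0.0):
    """inputs: outputs of connected neurons; weights: synapse weights."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))   # sigmoid as the stimulus function

out = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.5])
print(round(out, 3))  # sigmoid(1.0) ≈ 0.731
```

An NPU instruction, as described below, would process a whole group of such neurons at once rather than one multiply-accumulate at a time.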
  • NPU instructions directly face the processing of large-scale neurons and synapses.
  • one instruction can complete the processing of a group of neurons, and a series of specialized on-chip supports are provided for the transmission of neuron and synapse data.
  • the storage and processing in the neural network are integrated, and both are represented by synaptic weights.
  • an application specific integrated circuit (ASIC) is an integrated circuit made for a specific user or a specific electronic system.
  • the universality and mass production of digital integrated circuits has greatly reduced the cost of electronic products and promoted the popularization of computer communications and electronic products.
  • however, it has also created a contradiction between general-purpose and special-purpose applications, and a disconnect between system design and circuit production.
  • moreover, the larger the scale of an integrated circuit, the more difficult it is to change it for the special requirements that arise when building a system.
  • ASICs featuring user participation in design have emerged, which can realize the optimized design of the entire system, with superior performance and strong confidentiality.
  • ASICs can be used to execute software programs, or they can perform calculations through hardware logic instead of software programs.
  • an ASIC executing a software program may include one or more processor cores to execute instructions and have a corresponding instruction set.
  • Digital signal processing is a theory and technology that represents and processes signals digitally.
  • digital signal processing and analog signal processing are both subsets of signal processing.
  • the purpose of digital signal processing is to measure or filter continuous analog signals in the real world. Therefore, before performing digital signal processing, the signal needs to be converted from the analog domain to the digital domain, which is usually achieved by an analog-to-digital converter. And the output of digital signal processing often needs to be transformed into the analog domain, which is realized by a digital-to-analog converter.
  • DSP is a special-purpose chip for digital signal processing. It is a new device that is accompanied by the development of microelectronics, digital signal processing technology, and computer technology.
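As a toy example of the digitize-then-process pipeline described above, the following applies a 3-tap moving-average filter (a simple FIR low-pass) to already-sampled data; in a full pipeline the samples would come from an analog-to-digital converter and the filtered output would go to a digital-to-analog converter. The filter choice is purely illustrative.

```python
# Minimal digital-signal-processing sketch: ADC samples -> digital filter.
# A 3-tap moving average is used as an illustrative FIR low-pass filter.

def moving_average(samples, taps=3):
    """Average each sample with the previous (taps - 1) samples."""
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - taps + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

# Noisy sampled signal (as would come from an analog-to-digital converter)
samples = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(moving_average(samples))
```

The rapidly alternating input is smoothed toward its mean, which is exactly the filtering role the DSP plays on continuous real-world signals after conversion to the digital domain.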
  • an image processing unit can also be called an image signal processor, which can be used to process the output signal of a front-end image sensor so as to match image sensors from different manufacturers. It can also provide comprehensive support for end-to-end data-stream signal processing from image input (camera sensor, TV signal input, etc.) to display devices (e.g., an LCD screen, TV output, or an external image processing unit).
  • processors in this application may further include programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like.
  • the above-mentioned structure including multiple processors may be referred to as a heterogeneous architecture or a heterogeneous system architecture.
  • Heterogeneous computing mainly refers to the calculation method of the system composed of different types of instruction sets and architecture computing units.
  • here, heterogeneity refers to combining computing units of different instruction-set types and different architectures, such as CPUs, DSPs, GPUs, ASICs, coprocessors, and FPGAs, into a mixed system that performs specialized calculations; this method is called "heterogeneous computing".
  • heterogeneous computing has great potential, especially in the field of artificial intelligence.
  • AI imposes ultra-high requirements on computing power.
  • heterogeneous computing represented by GPU has become a new generation of computing architecture to accelerate AI innovation.
  • when multiple processors work together, the CPU can devote most of its resources to cache and logic control (that is, non-computing units) and only a small part to computation. This shows that the CPU is suited to running serial programs characterized by intensive branching, irregular data structures, and recursion.
  • dedicated computing modules are added to the system as accelerators, such as graphics processing units (GPUs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), and other programmable logic units.
  • the heterogeneous system architecture (HSA) introduces a new system architecture and execution standards with the goal of optimizing heterogeneous computing.
  • the ultimate goal is to perform collaborative operations through heterogeneous architectures of the cores (including CPU, GPU, DSP, and other processors) within the SoC. In this way, the performance of each architecture in the entire SoC can be maximized.
  • Heterogeneous system architecture enables multiple processors to implement unified memory addressing.
  • Heterogeneous computing is a special form of parallel and distributed computing. It can be carried out either by a single independent computer that supports both the single instruction, multiple data (SIMD) method and the multiple instruction, multiple data (MIMD) method, or by a group of independent computers interconnected by a high-speed network. It coordinates the use of machines of different performance and structure to meet different computing needs, and enables code (or code segments) to be executed in a way that maximizes overall performance.
  • Heterogeneous computing technology is a parallel and distributed computing technology that enables the type of parallelism (code type) of computing tasks to best match the type of computing that the machine can effectively support (that is, machine capabilities) and makes the best use of various computing resources.
  • the above chip with a heterogeneous system architecture may be called an artificial intelligence (AI) chip, or an accelerated processing unit (APU).
  • a processor for executing a target process may be selected from the foregoing multiple processors.
  • the processor executes the target program by executing the code of the target program.
  • different types of processors may have different instruction-set architectures (ISA).
  • ISA instruction-set architectures
  • different types of processors may have different instruction sets.
  • the instruction set is stored or integrated in the processor in the form of hardware; it is a hardware-level program that guides and optimizes processor operations. The processor can run more efficiently through the instruction set.
  • Compiling is the process of converting a program written in one programming language (source language) to another language (target language).
  • source language programming language
  • target language target language
  • the compiler used by the above compilation technology may include, but is not limited to, the following structures:
  • the front-end compiler is used to implement the conversion from a source program (or source code) to an intermediate representation (IR); that is, the user first describes the computing operator using a domain description language (DSL) as the input to the front-end compiler.
  • the processing of the front-end compiler mainly includes lexical analysis, syntax analysis and semantic analysis.
  • Syntax analysis further obtains an Abstract Syntax Tree (AST) from the sequence of tokens.
  • AST Abstract Syntax Tree
  • Semantic analysis identifies the types of variables, the scope of operations, etc.
  • the front-end compiler may also be referred to as a front-end compilation device or a front-end compilation unit.
  • the intermediate compiler is used for code generation and optimization.
  • machine-independent optimizations such as: constant combination, extraction of common subexpressions, unrolling and merging of loops, code extraction (moving constant calculations out of loops), etc.
  • Machine-related optimizations such as: the use of registers (putting commonly used quantities into registers to reduce the number of memory accesses), and storage strategies (arranging the cache and parallel storage architecture according to the memory-access requirements of the algorithm to reduce access conflicts).
  • the intermediate compiler may also be referred to as an intermediate compilation device or an intermediate compilation unit.
  • The back-end compiler is mainly used for object code generation. There may be multiple back-end compilers, corresponding to multiple kinds of processors; each back-end compiler is used to convert the input optimized IR into object code (or an instruction or a function) that can run on the corresponding processor, where the object code may be instruction code or assembly code.
  • object code or an instruction or a function
  • the back-end compiler may also be referred to as a back-end compilation device or a back-end compilation unit.
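The three-stage structure described above (front-end compiler → intermediate compiler → one backend compiler per processor) can be sketched as follows. This is an illustrative sketch only; all function names, the token-based "analysis", and the backend names are hypothetical stand-ins, not from the patent.

```python
def front_end_compile(source_code):
    """Stand-in for lexical, syntax, and semantic analysis: source -> IR."""
    tokens = source_code.split()                     # lexical analysis (stand-in)
    ast = {"op": tokens[0], "args": tokens[1:]}      # syntax analysis (stand-in)
    return ast                                       # serves as the IR here

def intermediate_optimize(ir):
    """Stand-in for machine-independent and machine-related IR optimizations."""
    return ir  # e.g. constant combination, loop unrolling (omitted)

# One backend per processor type, each converting IR into that
# processor's object code (hypothetical backends).
BACKENDS = {
    "cpu": lambda ir: f"cpu-code({ir['op']})",
    "gpu": lambda ir: f"gpu-code({ir['op']})",
}

def back_end_compile(ir, processor):
    """The backend corresponding to the chosen processor emits object code."""
    return BACKENDS[processor](ir)

ir = intermediate_optimize(front_end_compile("add x y"))
print(back_end_compile(ir, "cpu"))  # cpu-code(add)
```

The key structural point is that the IR is processor-independent, so the backend can be chosen after the front-end and intermediate stages have run.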
  • FIG. 5 is a schematic flowchart of an example of a method 200 for selecting a processor according to the present application.
  • the execution body of the method 200 (hereinafter, referred to as processing node #A for ease of understanding and description) may be any processor among multiple processors in a computing device, such as a central processing unit.
  • processing node #A may be a virtual machine running in a computing device.
  • the processing node #A may be the above-mentioned back-end compiler, or may be a device independent of the above-mentioned back-end compiler, which is not particularly limited in this application.
  • the method 200 selects a target processor based on instructions. Since the FPGA does not include an instruction set, the method 200 may not be used to determine whether the FPGA can be used as a target processor.
  • the method 200 can be used to determine whether the ASIC can be used as a target processor.
  • the method 200 may not be used to determine whether the ASIC can serve as a target processor.
  • the processing node #A may obtain hardware information of each of the two processors included in the computing device 100.
  • the manufacturer of the computing device 100 may pre-configure the hardware information of the processors included in the computing device 100 in the computing device 100 when the computing device 100 leaves the factory, so that the processing node #A may obtain the hardware information of each of the two processors included in the computing device 100 based on the relevant information of this factory configuration.
  • the manufacturer of the computing device 100 may save the hardware information of each processor included in the computing device 100 on the server, so that the processing node #A is connected to the server through the network in advance in S210.
  • alternatively, the hardware information of each of the two processors included in the computing device 100 may be input to the processing node #A.
  • each processor may be installed in a hot-pluggable manner, and a driver of each processor may complete registration of each processor during hot-plugging.
  • the processing node #A may obtain the hardware information of each of the two processors included in the computing device 100 based on the registration information of each processor or related information in a driver.
  • the computer device 100 may have a processor registration information collection function, so as to be able to identify which heterogeneous hardware is supported in the computer device 100, and, according to the identified hardware, register the backend corresponding to each processor at system startup. Therefore, the processing node can determine the hardware information of each processor according to the registration information of the backend corresponding to each processor.
  • hardware information of a processor may include information of an instruction set corresponding to the processor.
  • the hardware information of a processor may include information about the names of instructions that the processor can execute.
  • the hardware information of a processor may include information about the names of functions that the processor can execute.
  • the processing node #A can determine program information of a program (that is, an example of a target program, which is described as: program #A) that needs to be currently run.
  • program information may be determined according to the IR of the program #A.
  • the front-end compiler can obtain the source program code of program #A (denoted as: code #A).
  • the compiler may provide, for example, a domain description language interface (DSL interface) for developers to call, so as to write the operator in the DSL (that is, an example of code #A); thereafter, the intermediate compiler may convert the code #A (for example, the DSL) corresponding to the program #A into the IR of the program #A. And, in this application, the intermediate compiler may also optimize the IR of the program #A. Therefore, the processing node #A can determine the program information of the program #A from the IR (for example, the optimized IR) of the program #A.
  • DSL interface domain description language interface
  • the processing node #A may itself serve as the front-end compiler and the intermediate compiler of the code #A. In this case, the processing node #A can directly obtain the IR of the program #A.
  • the front-end compilation and the intermediate compilation of the code #A may be implemented by the processing node #B. In this case, the processing node #A may also communicate with the processing node #B, so that the processing node #B may send the IR of the program #A to the processing node #A.
  • the program information of the program #A may include instructions (denoted as: instruction #A) included in the code (for example, optimized IR) of the program #A.
  • the instruction #A may include one instruction or multiple instructions, which is not particularly limited in this application.
  • the program information of the program #A may include the name of the instruction in the IR of the program #A.
  • the program information of the program #A may include the names of functions in the IR of the program #A.
  • the processing node #A may determine a target processor (denoted as processor # 1) from a plurality of processors based on the program information of the program #A and the hardware information of each processor.
  • the processor # 1 may be a processor whose corresponding instruction set in the multiple processors includes the instruction #A.
  • the processor # 1 may be a processor among the multiple processors that meets the constraint #A.
  • the constraint condition #A includes that the instruction set corresponding to the processor includes the instruction #A.
  • the processing node #A may determine the priority of each processor in the multiple processors.
  • the processing node #A may determine the priority of each processor according to the parallel computing capability of each of the multiple processors; that is, in this application, the priority of a processor with high parallel computing capability is higher than that of a processor with low parallel computing capability.
  • Parallel computing is defined relative to serial computing; it refers to executing multiple instructions at one time.
  • Its purpose is to increase computing speed, and to solve large and complex computing problems by expanding the problem-solving scale.
  • the so-called parallel computing can be divided into parallel in time and parallel in space.
  • Temporal parallelism refers to pipeline technology
  • spatial parallelism refers to the use of multiple processors to perform calculations concurrently.
  • the processing node #A may determine the priority of each processor according to the types of the multiple processors. For example, in this application, the priority of a special-purpose processor is higher than that of a general-purpose processor, and, optionally, the general-purpose processor may be the processor with the lowest priority among the multiple processors. Therefore, the processing node #A may sequentially determine, for example in descending order of priority, whether each processor satisfies the above constraint condition #A. And, optionally, the processing node #A may determine the first processor satisfying the constraint condition #A as the processor # 1.
  • the processing node #A may stop determining other processors after determining the processor # 1.
  • different processors will have different instruction sets. The instructions used to implement the same function will be different on different chips.
  • the instruction set of processor #a is intrin #a, and the instruction set of processor #b is intrin #b.
  • processor #b is the processor with the lowest default priority.
  • processor #b may be a general-purpose processor, and processor #a is a dedicated processor; that is, processor #b may also be able to implement the functions, but its parallel computing power is not as good as that of the dedicated processor #a.
  • the IR description of program #A is obtained after IR processing.
  • the analysis shows that the program uses two instructions, mul (multiplication) and add (addition). Thereafter, the processing node #A preferentially determines whether the above instructions belong to intrin #a; if the determination is "yes", it selects processor #a as the processor # 1; if the determination is "no", it selects processor #b as the processor # 1.
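The selection rule in this example can be sketched as follows: the processing node walks the processors in priority order (dedicated processor first) and picks the first one whose instruction set contains every instruction used by the target program, falling back to the general-purpose processor. The processor names and instruction sets below are illustrative assumptions, not from the patent.

```python
# Hypothetical processors, ordered from highest to lowest priority.
processors = [
    {"name": "processor_a", "intrin": {"mul", "add", "fma"}},            # dedicated
    {"name": "processor_b", "intrin": {"mul", "add", "sub",
                                       "div", "mod"}},                   # general-purpose
]

def select_by_instructions(program_instructions, processors):
    """Constraint #A: the processor's instruction set must include
    every instruction appearing in the program's IR."""
    for proc in processors:                          # high priority first
        if program_instructions <= proc["intrin"]:   # subset test
            return proc["name"]                      # stop at the first match
    return processors[-1]["name"]                    # general-purpose fallback

# Program #A's IR uses mul and add, so the dedicated processor is chosen.
print(select_by_instructions({"mul", "add"}, processors))  # processor_a
```

If the program used an instruction outside intrin #a (say `div`), the scan would fall through to processor #b, matching the "no" branch described above.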
  • the hardware information obtained by the processing node #A in S210 further includes information about the size of the currently available memory space of each processor.
  • the size of the currently available memory space of the processor may be, for example, 90% of the free space (or memory capacity) of the processor.
  • the program information obtained by the processing node #A at S220 further includes information on the size of the memory space required for the operation of the program #A.
  • the processor # 1 may be a processor whose corresponding instruction set of the multiple processors includes the instruction #A, and the currently available memory space is greater than or equal to the memory space required for the operation of the program #A.
  • the processor # 1 may be a processor among the multiple processors that meets the constraint #A and the constraint #B.
  • the constraint #B includes: the current free space of the processor is greater than or equal to the memory space required for the operation of the program #A (denoted as: space #A).
  • the processing node #A may determine the space #A according to the data dimension of the program #A (or the code of the program #A).
  • the data dimension of the program #A can be understood as the shape of the tensor of the program #A.
  • a tensor is a multilinear mapping defined on the Cartesian product of some vector spaces and some dual spaces. Its coordinates are a quantity with n^r components in an n-dimensional space.
  • r is called the rank or order of the tensor (which has nothing to do with the rank and order of the matrix).
  • a data structure such as a tensor can be used to represent all data. That is, in this application, a tensor may correspond to an n-dimensional array or list. A tensor has a static type and dynamic dimensions. Tensors can flow between nodes in a computation graph.
  • the processing node #A may perform a shape analysis on the IR of the program #A, thereby determining the dimensions of the IR (specifically, of the IR tensors) of the program #A, and then estimating the size of the memory space required for the program #A to run.
  • the method and process of estimating the size of the memory space based on the dimensions of the data may be similar to the prior art.
  • detailed descriptions thereof are omitted.
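As one possible sketch of the shape-analysis step above, the required memory (space #A) can be estimated by summing element counts over the tensor shapes found in the IR. The 4-byte element size and the shapes below are assumptions for illustration; the patent leaves the estimation method to existing techniques.

```python
from math import prod

def estimate_memory_bytes(tensor_shapes, element_size=4):
    """Estimate required memory as the sum of (element count x element size)
    over every tensor in the program's IR."""
    return sum(prod(shape) * element_size for shape in tensor_shapes)

# e.g. inputs x, y, z plus two intermediates, each a 256x256 tensor
# of 4-byte floats (hypothetical figures)
shapes = [(256, 256)] * 5
print(estimate_memory_bytes(shapes))  # 1310720
```

The resulting figure would then be compared against each processor's currently available memory (e.g. 90% of its free space) when checking constraint #B.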
  • processor #b is the processor with the lowest default priority.
  • processor #b may be a general-purpose processor
  • processor #a is a dedicated processor; that is, processor #b may also be able to implement the functions, but its parallel computing power is not as good as that of the dedicated processor #a.
  • m = mul (x, y)
  • s = add (m, z)
  • the IR description of program #A is obtained after IR processing. The analysis uses calculations of two instructions, mul (multiplication) and add (addition).
  • the processing node #A may determine the size of the memory space that the program #A needs to occupy (for example, let the size of the memory space be X). And, the processing node #A can determine the size of the currently available memory space of each processor. Let the current memory space size of processor #a be Y. Thereafter, the processing node #A first determines whether the above instructions belong to intrin #a; if the determination is "YES", it is further determined whether X is less than or equal to Y. If the determination is "YES", processor #a is selected as the processor # 1; if the determination is "NO", processor #b is selected as the processor # 1.
  • processing node #A may control the intermediate compiler to send the IR of the program #A to the backend corresponding to the processor # 1. Therefore, the backend corresponding to the processor # 1 can convert the IR of the program #A into code that the processor # 1 can recognize and process.
  • According to the method for selecting a processor, by obtaining hardware information of each processor and program information of a target program in advance, and selecting, from the multiple processors based on the hardware information and the program information, a processor whose hardware information matches the program information, the selected processor can match the target program, and there is no need to manually specify the processor, thereby improving the processing efficiency of the computer equipment and reducing the burden on the programmer.
  • FIG. 6 is a schematic flowchart of an example of a method 300 for selecting a processor according to the present application.
  • the execution body of the method 300 (hereinafter, referred to as processing node #B for ease of understanding and description) may be any processor among multiple processors in a computing device, such as a central processing unit.
  • processing node #B may be a virtual machine running in a computing device.
  • the processing node #B may be the above-mentioned back-end compiler, or may be a device independent of the above-mentioned back-end compiler, which is not particularly limited in this application.
  • the processing node #B can obtain hardware information of each of the two processors included in the computing device 100.
  • the manufacturer of the computing device 100 may pre-configure the hardware information of the processors included in the computing device 100 in the computing device 100 when the computing device 100 leaves the factory, so that the processing node #B may obtain the hardware information of each of the two processors included in the computing device 100 based on the relevant information of this factory configuration.
  • the manufacturer of the computing device 100 may save the hardware information of each processor included in the computing device 100 on a server, so that the processing node #B is connected to the server through the network in advance in S310.
  • alternatively, the hardware information of each of the two processors included in the computing device 100 may be input to the processing node #B.
  • each processor may be installed in a hot-pluggable manner, and a driver of each processor may complete registration of each processor during hot-plugging.
  • in S310, the processing node #B may obtain the hardware information of each of the two processors included in the computing device 100 based on the registration information of each processor or related information in a driver.
  • the computer device 100 may have a processor registration information collection function, so as to be able to identify which heterogeneous hardware is supported in the computer device 100, and, according to the identified hardware, register the backend corresponding to each processor at system startup. Therefore, the processing node can determine the hardware information of each processor according to the registration information of the backend corresponding to each processor.
  • the hardware information of a processor may include the size of the currently available memory space of the processor.
  • the size of the currently available memory space of the processor may be, for example, 90% of the free space (or memory capacity) of the processor.
  • the processing node #B can determine program information of a program (that is, an example of a target program, which is denoted as: program #B) that needs to be currently run.
  • program information may be determined according to the IR of the program #B.
  • the front-end compiler can obtain the source program code of program #B (denoted as: code #B).
  • the compiler may provide, for example, a domain description language interface (DSL interface) for developers to call, so as to write the operator in the DSL (that is, an example of code #B); thereafter, the intermediate compiler may convert the code #B (for example, the DSL) corresponding to the program #B into the IR of the program #B. And, in this application, the intermediate compiler may also optimize the IR of the program #B. Therefore, the processing node #B can determine the program information of the program #B from the IR (for example, the optimized IR) of the program #B.
  • DSL interface domain description language interface
  • the processing node #B may itself serve as the front-end compiler and the intermediate compiler of the code #B. In this case, the processing node #B can directly obtain the IR of the program #B.
  • the front-end compilation and the intermediate compilation of the code #B may be implemented by another processing node. In this case, the processing node #B may communicate with that processing node, so that that processing node may send the IR of the program #B to the processing node #B.
  • the program information of the program #B may include information on the size of a memory space (denoted as: space #B) required for the operation of the program #B.
  • the processing node #B may determine the space #B according to the data dimension of the program #B (or the code of the program #B).
  • the data dimension of the program #B can be understood as the shape (shape) of the tensor of the program #B.
  • a tensor is a multilinear mapping defined on the Cartesian product of some vector spaces and some dual spaces. Its coordinates are a quantity with n^r components in an n-dimensional space.
  • a data structure such as a tensor can be used to represent all data. That is, in this application, a tensor may correspond to an n-dimensional array or list. A tensor has a static type and dynamic dimensions. Tensors can flow between nodes in a computation graph. In this application, the dimensionality of a tensor is described as its order. It should be noted that the order of the tensor (sometimes called its order or degree, or its number of dimensions) is a quantitative description of the tensor's dimensionality.
  • the processing node #B may perform a shape analysis on the IR of the program #B, thereby determining the dimensions of the IR (specifically, of the IR tensors) of the program #B, and then estimating the size of the memory space required for the program #B to run.
  • the method and process of estimating the size of the memory space based on the dimensions of the data may be similar to the prior art.
  • detailed descriptions thereof are omitted.
  • the processing node #B may determine a target processor (denoted as processor # 2) from a plurality of processors based on the program information of the program #B and the hardware information of each processor.
  • the processor # 2 may be a processor in which the currently available memory space of the multiple processors is greater than or equal to the size of the memory space required for the running of the program #B.
  • the processor # 2 may be a processor among the multiple processors that meets the constraint #C.
  • the constraint #C includes: the current available space of the processor is greater than or equal to the size of the memory space required for the operation of the program #B.
  • the processing node #B may determine the priority of each processor in the plurality of processors.
  • the processing node #B may determine the priority of each processor according to the parallel computing capability of each of the multiple processors; that is, in this application, the priority of a processor with high parallel computing capability is higher than that of a processor with low parallel computing capability.
  • Parallel computing is defined relative to serial computing; it refers to executing multiple instructions at one time.
  • Its purpose is to increase computing speed, and to solve large and complex computing problems by expanding the problem-solving scale.
  • the so-called parallel computing can be divided into parallel in time and parallel in space.
  • Temporal parallelism refers to pipeline technology
  • spatial parallelism refers to the use of multiple processors to perform calculations concurrently.
  • the processing node #B may determine the priority of each processor according to the power consumption of each of the multiple processors; that is, in this application, the priority of a processor with high power consumption is lower than the priority of a processor with low power consumption. For example, for processor #a and processor #b, if the power consumption of processor #b is higher than that of processor #a, the processing node #B may consider that the priority of the processor #b is lower than the priority of the processor #a.
  • the processing node #B may determine the priority of each processor according to the types of the multiple processors. For example, in this application, the priority of a special-purpose processor is higher than that of a general-purpose processor, and, optionally, the general-purpose processor may be the processor with the lowest priority among the multiple processors. Therefore, the processing node #B may sequentially determine, for example in descending order of priority, whether each processor satisfies the above constraint condition #C. And, optionally, the processing node #B may determine the first processor satisfying the constraint condition #C as the processor # 2. In addition, the processing node #B may stop determining other processors after determining the processor # 2.
  • processor #b is the processor with the lowest default priority.
  • processor #b may be a general-purpose processor
  • processor #a is a dedicated processor; that is, processor #b may also be able to implement the functions, but its parallel computing power is not as good as that of the dedicated processor #a.
  • m = mul (x, y)
  • s = add (m, z)
  • the IR description of program #B is obtained after IR processing.
  • the analysis uses calculations of two instructions, mul (multiplication) and add (addition).
  • processing node #B can determine the size of the memory space that the program #B needs to occupy (for example, let the size of the memory space be W). And, processing node #B can determine the size of the currently available memory space of each processor. Let the current memory space size of processor #a be Z. Thereafter, the processing node #B judges whether the Z is greater than or equal to W; if it is determined as "YES", the processor #a is selected as the processor # 2. If the determination is "No", the processor #b is selected as the processor # 2.
  • the hardware information obtained by processing node #B in S310 further includes information of an instruction set corresponding to each processor.
  • the hardware information of a processor may include information about the names of instructions that the processor can execute.
  • the hardware information of a processor may include information about the names of functions that the processor can execute.
  • the program information obtained by the processing node #B at S320 further includes instructions (denoted as: instruction #B) included in the code (for example, optimized IR) of the program #B.
  • the instruction #B may include one instruction or multiple instructions, which is not particularly limited in this application.
  • the program information of the program #B may include the name of the instruction in the IR of the program #B.
  • the program information of the program #B may include the names of functions in the IR of the program #B.
  • the processor # 2 may be a processor in which the currently available memory space of the multiple processors is greater than or equal to the memory space required for the operation of the program #B, and the corresponding instruction set includes the instruction #B.
  • the processor # 2 may be a processor among the multiple processors that meets the constraint #C and the constraint #D.
  • the constraint condition #D includes that the instruction set corresponding to the processor includes the instruction #B.
  • processor #b is the processor with the lowest default priority.
  • processor #b may be a general-purpose processor, and processor #a is a dedicated processor; that is, processor #b may also be able to implement the functions, but its parallel computing power is not as good as that of the dedicated processor #a.
  • different processors will have different instruction sets. The instructions used to implement the same function will be different on different chips.
  • the instruction set of processor #a is intrin #a, and the instruction set of processor #b is intrin #b.
  • the IR description of program #B is obtained after IR processing.
  • the analysis uses calculations of two instructions, mul (multiplication) and add (addition).
  • processing node #B can determine the size of the memory space that the program #B needs to occupy (for example, let the size of the memory space be W). And, processing node #B can determine the size of the currently available memory space of each processor. Let the current memory space size of processor #a be Z. Thereafter, the processing node #B first determines whether Z is greater than or equal to W. If the determination is "Yes", it further determines whether all the above instructions belong to intrin #a; if the determination is "Yes", processor #a is selected as the processor # 2; if the determination is "No", processor #b is selected as the processor # 2.
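The ordering in this example of method 300 — constraint #C (memory, Z ≥ W) checked first, and then constraint #D (instruction set) — can be sketched as follows. All processor names, memory figures, and instruction sets are illustrative assumptions, not from the patent.

```python
# Hypothetical processors, highest priority first; free_mem is the
# currently available memory in some unit (illustrative figures).
processors = [
    {"name": "processor_a", "free_mem": 512, "intrin": {"mul", "add"}},
    {"name": "processor_b", "free_mem": 4096,
     "intrin": {"mul", "add", "sub", "div"}},
]

def select(required_mem, instructions, processors):
    """Pick the first processor satisfying constraint #C (memory)
    and then constraint #D (instruction set)."""
    for proc in processors:
        if proc["free_mem"] >= required_mem:        # constraint #C: Z >= W
            if instructions <= proc["intrin"]:      # constraint #D
                return proc["name"]
    return processors[-1]["name"]                   # general-purpose fallback

print(select(256, {"mul", "add"}, processors))   # processor_a
print(select(1024, {"mul", "add"}, processors))  # processor_b: fails #C on #a
```

This differs from the method 200 example only in which constraint is evaluated first; either ordering selects the highest-priority processor that satisfies both.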
  • the processing node #B can control the intermediate compiler to send the IR of the program #B to the backend corresponding to the processor # 2. Therefore, the backend corresponding to the processor # 2 can convert the IR of the program #B into code that the processor # 2 can recognize and process.
  • the method for selecting a processor of the present application can be applied to a compilation technique.
  • a compiling device for example, a front-end compiler
  • a compiling device may use a DSL interface for a developer to call a DSL corresponding to a write operator.
  • a compilation device eg, an intermediate compiler
  • the compilation device eg, an intermediate compiler
  • the compiling device selects the optimal compilation backend based on the backend hardware registration information obtained by the automatic hardware identification device and the analysis result of the IR analysis device.
  • the specific process of this step may be similar to the process described in the above method 200 or method 300.
  • the device for example, the selected back-end compiler
  • According to the method for selecting a processor, by obtaining hardware information of each processor and program information of a target program in advance, and selecting, from the multiple processors based on the hardware information and the program information, a processor whose hardware information matches the program information, the selected processor can match the target program, and the labor time can be reduced.
  • FIG. 8 is a schematic diagram of a logical architecture of a processor selection apparatus 500 according to an embodiment of the present application.
  • the processor selection device may be configured on a computing device including multiple processors, or the processor selection device itself is one of the multiple processors.
  • the apparatus 500 for selecting a processor may include a recognition unit 510, an analysis unit 520, and a selection unit 530.
  • the identification unit 510 may be configured to execute the method in S210 or S310. That is, the identification unit 510 may obtain hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor, and/or the hardware information is used to indicate the size of the available memory space of the processor. The specific processing procedure of the identification unit 510 may be similar to the processing procedure described in the above S210 or S310; to avoid redundant description, its detailed description is omitted here.
  • the analysis unit 520 may be configured to execute the method in S220 or S320, that is, the analysis unit 520 may obtain program information of a target program, where the program information is used to indicate instructions in the target program, and / or, the The program information is used to indicate the memory space that the target program needs to occupy, and the specific processing process of the analysis unit 520 may be similar to the processing process described in the above S220 or S320. To avoid redundant descriptions, detailed descriptions are omitted here.
  • The selection unit 530 may be configured to execute the method in S230 or S330. That is, the selection unit 530 determines, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, where the target processor is a processor among the at least two processors that satisfies a preset condition, and the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program and/or that the processor's available memory space is greater than or equal to the memory space required by the target program. The specific processing of the selection unit 530 may be similar to that described in S230 or S350 above; to avoid redundancy, its detailed description is omitted here.
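For illustration only (not part of the claimed apparatus), the preset condition described above, that the processor's instruction set covers the program's instructions and/or its available memory is at least the program's requirement, can be sketched as follows. The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class HardwareInfo:
    # Functions (instructions) the processor can handle, e.g. {"conv2d", "relu"}
    instruction_set: set
    # Available memory space, in bytes
    available_memory: int

@dataclass
class ProgramInfo:
    # Functions the target program contains
    instructions: set
    # Memory the target program needs to occupy, in bytes
    required_memory: int

def satisfies_preset_condition(hw: HardwareInfo, prog: ProgramInfo) -> bool:
    """True if the processor's instruction set includes every instruction
    in the target program and its available memory is sufficient."""
    return (prog.instructions <= hw.instruction_set
            and hw.available_memory >= prog.required_memory)
```

A processor passing this check is a candidate target processor in the sense used above.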
  • The selection unit 530 can also control the intermediate compiler to send the IR of the target program to the back-end compiler corresponding to the target processor.
  • the actions and functions of the identification unit 510, analysis unit 520, and selection unit 530 may be implemented by the same virtual machine or the same processor. Alternatively, the actions and functions of the identification unit 510, analysis unit 520, and selection unit 530 may be implemented by different multiple virtual machines or multiple processors, respectively.
  • According to the processor selection apparatus provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the multiple processors on that basis. The selected processor thus matches the target program, and no manual specification of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on programmers.
  • FIG. 9 is a schematic diagram of a logical architecture of a compiling device 600 to which the embodiment of the present application is applied.
  • the compilation apparatus 600 may include a front-end compilation unit 610, an intermediate compilation unit 620, a selection unit 630, and a plurality of back-end compilation units 640.
  • the multiple back-end compiling units 640 correspond to multiple processors (or computing units, computing platforms, or processing units).
  • the selection unit 630 may include an identification module 632, an analysis module 634, and a selection module 636.
  • The front-end compilation unit 610 may provide a DSL interface that developers call to write the DSL corresponding to an operator.
  • the actions performed by the front-end compilation unit 610 may be similar to the actions performed by the aforementioned front-end compiler.
  • The intermediate compilation unit 620 is communicatively connected to the front-end compilation unit 610 and is configured to obtain the DSL from the front-end compilation unit 610, generate an intermediate representation (IR) from the DSL, and optimize the IR.
  • The actions performed by the intermediate compilation unit 620 may be similar to those performed by the intermediate compiler described above; to avoid redundancy, the description is omitted here.
  • The identification module 632 may be configured to execute the method in S210 or S310. That is, the identification module 632 may obtain hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor and/or the size of the processor's available memory space. The specific processing of the identification module 632 may be similar to that described in S210 or S310 above; to avoid redundancy, its detailed description is omitted here.
  • the analysis module 634 is communicatively connected to the intermediate compilation unit 620, and is configured to obtain the IR from the intermediate compilation unit 620, and can further be used to execute the method in S220 or S320, that is, the analysis module 634 can obtain the program information of the target program.
  • The program information is used to indicate the instructions in the target program and/or the memory space that the target program needs to occupy. The specific processing of the analysis module 634 may be similar to that described in S220 or S320 above; to avoid redundancy, its detailed description is omitted here.
  • the selection module 636 may be communicatively connected to the identification module 632 and the analysis module 634, and is configured to obtain hardware information from the identification module 632 and program information from the analysis module 634.
  • The selection module 636 may be used to execute the method in S230 or S330. That is, the selection module 636 determines, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, where the target processor is a processor among the at least two processors that satisfies a preset condition, and the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program and/or that the processor's available memory space is greater than or equal to the memory space required by the target program. The specific processing of the selection module 636 may be similar to that described in S230 or S350 above; to avoid redundancy, its detailed description is omitted here.
  • the selection module 636 can also control the intermediate compiler to send the IR of the target program to the back-end compilation unit 640 corresponding to the target processor.
  • the back-end compilation unit 640 can convert the IR into code that can be executed on the corresponding processor.
  • the actions performed by the back-end compilation unit 640 may be similar to the actions performed by the back-end compiler described above. Here, to avoid repetition, the description is omitted.
  • The hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the multiple processors on that basis. The selected processor thus matches the target program, and no manual specification of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on programmers.
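As a rough, purely illustrative sketch of the flow through the compilation apparatus 600 (the interfaces here are hypothetical; real front ends, IR optimizers, and code generators are far more involved), each processor can key one back-end compilation unit, and the selection step routes the IR accordingly:

```python
def compile_for_target(ir, target_processor, backend_units):
    """Route the (optimized) IR to the back-end compilation unit that
    corresponds to the selected target processor (one back end per
    processor, as in the one-to-one correspondence described above)."""
    backend = backend_units[target_processor]
    return backend(ir)

# Hypothetical back ends: each converts the IR into code for its processor.
backend_units = {
    "cpu": lambda ir: f"cpu-code({ir})",
    "gpu": lambda ir: f"gpu-code({ir})",
}
```

Once the selection module has determined the target processor, this dispatch replaces the manual choice a programmer would otherwise make.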
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • When the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • Based on this understanding, the part of the technical solution of this application that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of this application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.


Abstract

The present application provides a method and a device for selecting a processor. The method comprises: acquiring hardware information of each of at least two processors, the hardware information being used for indicating an instruction set corresponding to each processor; acquiring program information of a target program, the program information being used for indicating instructions in the target program; and determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and can be used for executing the target program, wherein the preset condition comprises that the instruction set corresponding to the processor comprises the instructions in the target program. The processing efficiency of a computer device can thereby be improved and the burden on programmers reduced.

Description

Method and device for selecting a processor

Technical field

The present application relates to the field of computers and, more particularly, to a method and apparatus for selecting a processor, and to a computer device.

Background

With the development of computer technology, heterogeneous architectures have emerged: a computer device includes a general-purpose processor (for example, a central processing unit, CPU) and a coprocessor (for example, a graphics processing unit, GPU). A coprocessor can provide more parallel computing capability and increase computing speed. In addition to providing stronger thread-level or data-level parallel computing capability, a coprocessor has a better energy-efficiency ratio than a general-purpose processor. The coprocessor can therefore make up for the limited computing power of the CPU and reduce the overall energy consumption of the system.

Heterogeneous architectures are now commonly used in fields such as neural networks (NN) and machine learning (ML). In these fields, compilation techniques have been proposed to make it easier for programmers to write software for such systems; for example, a compilation technique can be used to generate operator code.

Compilation is the process of converting a program written in one programming language (the source language) into another language (the target language). The source language may be the language in which the user writes the target program, and the target language may be the language used by the processor on which, within the heterogeneous system, the user wishes to run the target program. A compiler can be structured as a front end, an intermediate representation, and a back end. The front end mainly converts the source program into the intermediate representation: the user first describes the computation of an operator in a domain-specific language (DSL), which serves as the front-end source program; after the front end's processing and optimization steps, an intermediate representation (IR) is passed to the back end. The back end's code generator then converts the IR into concrete target code for a specified target processor (for example, a general-purpose processor or a coprocessor).

In this technique, however, the programmer must manually specify, at an early stage (for example, while describing the operator's computation in the DSL), the target processor on which the operator will run. Because of factors such as hardware instruction support, data alignment, operation efficiency, and cooperation with neighboring operators, the target processor specified by the programmer is not necessarily the processor best suited to run the target program, which reduces processing efficiency. Moreover, manually specifying the target processor adds to the programmer's workload.
Summary of the invention

The present application provides a method and device for selecting a processor, which can improve the processing efficiency of a computer device and reduce the burden on programmers.

In a first aspect, a method for selecting a processor is provided: obtaining hardware information of each of at least two processors, the hardware information being used to indicate the instruction set corresponding to each processor; obtaining program information of a target program to be executed, the program information being used to indicate the instructions in the target program; and determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program.

According to the method for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the multiple processors on that basis. The selected processor thus matches the target program, and no manual specification of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on programmers.

Here, the "instruction set corresponding to the processor" can be understood as the functions that the processor is able to process, and the hardware information can indicate those functions (for example, by function name). Likewise, the "instructions in the target program" can be understood as the functions included in the target program, and the program information indicates those functions (for example, by function name).

Optionally, determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority for each of the at least two processors; and, based on the program information and the hardware information, judging in descending order of priority whether each of the at least two processors satisfies the preset condition, and taking the first processor that satisfies the preset condition as the target processor. Setting priorities for the processors allows personalized processing and flexible handling of different scenarios, improves the efficiency of determining the target processor, and shortens the time this determination takes.
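The priority-ordered search described above can be sketched as follows, purely for illustration. `satisfies` stands in for the preset-condition check and is assumed to be supplied by the caller; the names are not part of the claims.

```python
def select_target_processor(processors, priorities, satisfies):
    """Examine processors from highest to lowest priority and return the
    first one that satisfies the preset condition, or None if none does."""
    for proc in sorted(processors, key=lambda p: priorities[p], reverse=True):
        if satisfies(proc):
            return proc
    return None
```

With the CPU given the lowest priority (as suggested below), a coprocessor that satisfies the condition is preferred over the CPU.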
Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.

Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; moreover, because the CPU's power consumption is relatively high, giving the CPU the lowest priority increases the likelihood that a coprocessor is selected as the target processor, further improving the effect and practicality of this application.

Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).

The ASIC may perform its computation by means of software.

Optionally, the hardware information is further used to indicate the size of the processor's available memory space, the program information is further used to indicate the memory space that the target program needs to occupy, and the preset condition further includes that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy. The available space of a processor may refer to a specified proportion of the processor's total memory space; for example, the proportion may be 90%. Alternatively, it may refer to a specified proportion of the processor's total free memory space.
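The "specified proportion" reading of available memory can be illustrated with a small helper; the 90% figure is the example given above, and the function name is hypothetical.

```python
def available_memory(total_memory_bytes, percent=90):
    """Available space as a specified proportion (here a percentage) of the
    processor's total (or total free) memory space; 90% is the example
    proportion mentioned in the text."""
    return total_memory_bytes * percent // 100
```

Keeping a margin below the full capacity leaves room for other workloads already resident on the processor.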
In addition, "the preset condition further includes that the processor's available memory space is greater than or equal to the memory space required by the target program" may mean that the preset condition is satisfied as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.

Alternatively, it may mean that the processor's available memory space must be greater than or equal to the memory space required by all programs to be executed on the target processor, including the target program, for the preset condition to be satisfied.

Making the preset condition further include that the processor's available memory space is greater than or equal to the memory space required by the target program ensures that the selected target processor can support the running of the target program, further improving the practicality of this application.

Optionally, obtaining the program information of the target program includes: determining the memory space that the target program needs to occupy according to the data dimensions of the target program. The required memory space can thus be determined easily.
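Deriving the required memory from the program's data dimensions might look like the following sketch, assuming tensor-like data with a known per-element size; the 4-byte default and the function name are illustrative assumptions, not part of the disclosure.

```python
from math import prod

def required_memory(shapes, element_size_bytes=4):
    """Estimate the memory a program needs from the dimensions (shapes)
    of its data: sum over tensors of (product of dims) * element size.
    element_size_bytes=4 assumes 32-bit elements (an illustrative choice)."""
    return sum(prod(shape) * element_size_bytes for shape in shapes)
```

For example, a program operating on a 2x3 tensor and a length-4 vector of 32-bit values would need 40 bytes under this estimate.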
Optionally, obtaining the program information of the target program includes: determining the program information according to the intermediate representation (IR) of the target program, where the IR of the target program is determined according to the domain-specific language (DSL) code of the target program.

The DSL code may be determined by a front-end compiler in the computer device, and the IR by an intermediate compiler in the computer device. The program information can thus be obtained easily.

Optionally, obtaining the hardware information of each of the at least two processors includes: obtaining the hardware information of each processor according to the registration information of each processor, the registration information being used for registering the processor with the computing device. The registration information may include hardware description information, which may be obtained offline by the computer device before the processor is installed, or obtained from the processor's driver information when the processor is installed.
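The registration mechanism can be pictured as a simple registry that the identification step later consults. The structure below is illustrative only; real registration information (for example, from driver metadata) would carry far richer hardware descriptions.

```python
# Hypothetical in-memory registry of processor hardware descriptions.
registry = {}

def register_processor(name, instruction_set, available_memory):
    """Record the hardware description carried in a processor's
    registration (or driver) information."""
    registry[name] = {
        "instruction_set": set(instruction_set),
        "available_memory": available_memory,
    }

def hardware_info(name):
    """Look up a registered processor's hardware information."""
    return registry[name]
```

The identification step then amounts to reading this registry rather than probing hardware at selection time.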
Optionally, the computer device includes at least two back-end compilers in one-to-one correspondence with the at least two processors, each back-end compiler being used to convert an IR into code that the corresponding processor can recognize.

In this case, the method further includes: inputting the IR of the target program to the target back-end compiler corresponding to the target processor. The IR of the target program may be an IR that has undergone IR optimization.
In a second aspect, a method for selecting a processor is provided. The method includes: obtaining hardware information of each of at least two processors, the hardware information being used to indicate the size of the processor's available memory space; obtaining program information of a target program, the program information being used to indicate the memory space that the target program needs to occupy; and determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy.

According to the method for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the multiple processors on that basis. The selected processor thus matches the target program, and no manual specification of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on programmers. The available space of a processor may refer to a specified proportion of the processor's total memory space; for example, the proportion may be 90%. Alternatively, it may refer to a specified proportion of the processor's total free memory space.

Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).

In addition, "the preset condition further includes that the processor's available memory space is greater than or equal to the memory space required by the target program" may mean that the preset condition is satisfied as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.

Alternatively, it may mean that the processor's available memory space must be greater than or equal to the memory space required by all programs to be executed on the target processor, including the target program, for the preset condition to be satisfied.

Optionally, obtaining the program information of the target program includes: determining the memory space that the target program needs to occupy according to the data dimensions of the target program. The required memory space can thus be determined easily.

Optionally, the hardware information is further used to indicate the instruction set corresponding to the processor, the program information is further used to indicate the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instructions in the target program. Here, the "instruction set corresponding to the processor" can be understood as the functions that the processor is able to process, and the hardware information can indicate those functions (for example, by function name). Likewise, the "instructions in the target program" can be understood as the functions included in the target program, and the program information indicates those functions (for example, by function name).

Optionally, determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority for each of the at least two processors; and, based on the program information and the hardware information, judging in descending order of priority whether each of the at least two processors satisfies the preset condition, and taking the first processor that satisfies the preset condition as the target processor.

Setting priorities for the processors allows personalized processing and flexible handling of different scenarios, improves the efficiency of determining the target processor, and shortens the time this determination takes.

Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.

Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; moreover, because the CPU's power consumption is relatively high, giving the CPU the lowest priority increases the likelihood that a coprocessor is selected as the target processor, further improving the effect and practicality of this application.

Optionally, obtaining the program information of the target program includes: determining the program information according to the intermediate representation (IR) of the target program, where the IR of the target program is determined according to the domain-specific language (DSL) code of the target program. The DSL code may be determined by a front-end compiler in the computer device, and the IR by an intermediate compiler in the computer device. The program information can thus be obtained easily.

Optionally, obtaining the hardware information of each of the at least two processors includes: obtaining the hardware information of each processor according to the registration information of each processor, the registration information being used for registering the processor with the computing device. The registration information may include hardware description information, which may be obtained offline by the computer device before the processor is installed, or obtained from the processor's driver information when the processor is installed.

Optionally, the computer device includes at least two back-end compilers in one-to-one correspondence with the at least two processors, each back-end compiler being used to convert an IR into code that the corresponding processor can recognize.

In this case, the method further includes: inputting the IR of the target program to the target back-end compiler corresponding to the target processor. The IR of the target program may be an IR that has undergone IR optimization.
第三方面,提供了一种选择处理器的装置,其特征在于,该装置包括:识别模块,用于获取至少两种处理器中每种处理器的硬件信息,该硬件信息用于指示处理器对应的指令集;分析模块,用于获取待执行的目标程序的程序信息,该程序信息用于指示该目标程序中的指令;选择模块,用于根据该程序信息和该硬件信息,从该至少两种处理器中确定满足预设条件且能够用于执行该目标程序的目标处理器,该预设条件包括处理器对应的指令集包括该目标程序中的指令。According to a third aspect, an apparatus for selecting a processor is provided, which is characterized in that the apparatus includes: an identification module for acquiring hardware information of each of the at least two processors, and the hardware information is used to instruct the processor Corresponding instruction set; analysis module, for obtaining program information of the target program to be executed, the program information is used to indicate instructions in the target program; selection module, for obtaining information from the at least the program information and the hardware information Among the two types of processors, a target processor that satisfies a preset condition and can be used to execute the target program is determined, and the preset condition includes an instruction set corresponding to the processor including an instruction in the target program.
According to the apparatus for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and based on this information a processor whose hardware information matches the program information is selected from the multiple processors. The selected processor therefore matches the target program, and no manual designation of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on the programmer.
Herein, the "instruction set corresponding to the processor" may be understood as the functions that the processor can process, and the hardware information may indicate the functions (for example, function names) that the processor can process. Likewise, the "instructions in the target program" may be understood as the functions included in the target program, and the program information indicates the functions (for example, function names) included in the target program.
Optionally, the selection module is configured to determine a priority of each of the at least two processors, determine in descending order of priority, based on the program information and the hardware information, whether each processor satisfies the preset condition, and use the first processor that satisfies the preset condition as the target processor. By setting priorities for the processors, personalized processing can be achieved and different processing scenarios can be handled flexibly; moreover, the efficiency of determining the target processor can be improved and the time required to determine it can be shortened.
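The priority-ordered scan described above can be sketched as follows. Only the "scan in descending priority and take the first match" logic comes from the text; the processor records and the condition callback are illustrative stand-ins.

```python
# Hedged sketch of priority-ordered target-processor selection.

def select_target_processor(processors, satisfies_condition):
    """Return the first processor, in descending priority order, that
    satisfies the preset condition, or None if none does."""
    for proc in sorted(processors, key=lambda p: p["priority"], reverse=True):
        if satisfies_condition(proc):
            return proc
    return None

processors = [
    {"name": "CPU", "priority": 0, "ok": True},   # lowest-priority fallback
    {"name": "GPU", "priority": 2, "ok": False},
    {"name": "NPU", "priority": 3, "ok": True},
]
target = select_target_processor(processors, lambda p: p["ok"])
print(target["name"])  # NPU
```

Because the scan stops at the first match, high-priority coprocessors are tried before the CPU, consistent with giving the CPU the lowest priority.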
Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; moreover, because the power consumption of the CPU is relatively high, setting the priority of the CPU to the lowest increases the probability that a coprocessor is selected as the target processor, which further improves the effect and practicability of this application.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
The ASIC may perform computation by means of software.
Optionally, the hardware information further indicates the size of the available memory space of the processor, the program information further indicates the memory space that the target program needs to occupy, and the preset condition further includes that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy. The available memory space of the processor may refer to a specified proportion of the total memory space of the processor; for example, the specified proportion may be 90%. Alternatively, the available memory space may refer to a specified proportion of the total free memory space of the processor. In addition, this part of the preset condition may mean that the condition is satisfied as long as the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy; alternatively, it may mean that the available memory space of the processor needs to be greater than or equal to the memory space occupied by all programs executed by the target processor, including the target program. Making the preset condition further include this memory requirement ensures that the selected target processor can support the running of the target program, which further improves the practicability of this application.
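Both readings of the memory condition above can be sketched in a few lines. The 90% proportion is the example given in the text; the concrete numbers below are illustrative.

```python
# Sketch of the memory variant of the preset condition, in both
# interpretations: counting only the target program, or counting every
# program already assigned to the processor.

def available_memory(total_memory, proportion=0.9):
    """The 'available' space: a specified proportion of total memory."""
    return proportion * total_memory

def condition_program_only(total_memory, program_memory):
    return available_memory(total_memory) >= program_memory

def condition_all_programs(total_memory, program_memory, scheduled_memory):
    return available_memory(total_memory) >= program_memory + scheduled_memory

print(condition_program_only(1000, 850))       # True  (850 fits in 900)
print(condition_all_programs(1000, 850, 100))  # False (950 exceeds 900)
```

The stricter all-programs reading prevents overcommitting a processor that already hosts other workloads.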
Optionally, the analysis unit is configured to determine, according to the data dimensions of the target program, the memory space that the target program needs to occupy. In this way, the memory space that the target program needs to occupy can be determined easily.
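A minimal sketch of estimating required memory from data dimensions: sum, over the program's tensors, the product of each tensor's dimensions times an assumed element size. The shapes and the 4-byte (float32) element size are assumptions for illustration.

```python
# Estimate a program's memory footprint from the dimensions of its data.
from math import prod

def required_memory_bytes(tensor_shapes, element_size=4):
    """Sum of (product of dimensions) * bytes per element over all tensors."""
    return sum(prod(shape) * element_size for shape in tensor_shapes)

# e.g. two float32 matrices of shape 1024 x 1024
print(required_memory_bytes([(1024, 1024), (1024, 1024)]))  # 8388608
```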
Optionally, the analysis module is configured to determine the program information according to an intermediate representation (IR) of the target program, where the IR is determined according to domain-specific language (DSL) code of the target program. In this way, the program information can be obtained easily.
Optionally, the identification unit is configured to obtain the hardware information of each processor according to registration information of the processor, where the registration information is used to register the processor in the computing device. The registration information may include hardware description information. The hardware description information may be obtained offline by the computer device before the processor is installed, or may be obtained by the computer device from driver information of the processor when the processor is installed.
Optionally, the computer device includes at least two back-end compilers in one-to-one correspondence with the at least two processors, and each back-end compiler is configured to convert an IR into code that the corresponding processor can recognize.
In this case, the selection unit is configured to input the IR of the target program into a target back-end compiler corresponding to the target processor. The IR of the target program may be an IR that has undergone IR optimization.
According to a fourth aspect, an apparatus for selecting a processor is provided. The apparatus includes: an identification unit, configured to obtain hardware information of each of at least two processors, where the hardware information indicates the size of the available memory space of the processor; an analysis unit, configured to obtain program information of a target program, where the program information indicates the memory space that the target program needs to occupy; and a selection unit, configured to determine, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy.
According to the apparatus for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and based on this information a processor whose hardware information matches the program information is selected from the multiple processors. The selected processor therefore matches the target program, and no manual designation of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on the programmer. The available memory space of the processor may refer to a specified proportion of the total memory space of the processor; for example, the specified proportion may be 90%. Alternatively, the available memory space may refer to a specified proportion of the total free memory space of the processor. In addition, the preset condition may be satisfied as long as the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy; alternatively, it may require that the available memory space of the processor be greater than or equal to the memory space occupied by all programs executed by the target processor, including the target program.
Optionally, the analysis unit is configured to determine, according to the data dimensions of the target program, the memory space that the target program needs to occupy. In this way, the memory space that the target program needs to occupy can be determined easily.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Optionally, the hardware information further indicates an instruction set corresponding to the processor, the program information further indicates the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instructions in the target program. Herein, the "instruction set corresponding to the processor" may be understood as the functions that the processor can process, and the hardware information may indicate the functions (for example, function names) that the processor can process. Likewise, the "instructions in the target program" may be understood as the functions included in the target program, and the program information indicates the functions (for example, function names) included in the target program.
Optionally, the selection module is configured to determine a priority of each of the at least two processors, determine in descending order of priority, based on the program information and the hardware information, whether each processor satisfies the preset condition, and use the first processor that satisfies the preset condition as the target processor.
Optionally, the priority of each processor is determined according to the parallel computing capability of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors.
This ensures that at least one of the processors can process the target program; moreover, because the power consumption of the CPU is relatively high, setting the priority of the CPU to the lowest increases the probability that a coprocessor is selected as the target processor, which further improves the effect and practicability of this application.
Optionally, the obtaining of the program information of the target program includes: obtaining the domain-specific language (DSL) code of the target program; determining an intermediate representation (IR) according to the DSL code; and determining the program information according to the IR. In this way, the program information can be obtained easily.
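The last step of that pipeline, determining program information from the IR, can be sketched as follows. The `dst = op(args)` IR syntax here is invented for the sketch; an IR actually produced from DSL code would differ.

```python
# Illustrative extraction of program information (used function names)
# from a toy three-address IR.
import re

def program_info_from_ir(ir_text):
    """Collect the instruction (function) names used in a toy IR."""
    return {"used_functions": set(re.findall(r"=\s*(\w+)\(", ir_text))}

ir = """
t0 = conv2d(x, w)
t1 = relu(t0)
y  = matmul(t1, v)
"""
info = program_info_from_ir(ir)
print(sorted(info["used_functions"]))  # ['conv2d', 'matmul', 'relu']
```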
Optionally, the identification unit is configured to obtain the hardware information of each processor according to registration information of the processor, where the registration information is used to register the processor in the computing device. The registration information may include hardware description information. The hardware description information may be obtained offline by the computer device before the processor is installed, or may be obtained by the computer device from driver information of the processor when the processor is installed.
Optionally, the computer device includes at least two back-end compilers in one-to-one correspondence with the at least two processors, and each back-end compiler is configured to convert an IR into code that the corresponding processor can recognize.
In this case, the selection unit is configured to control inputting of the IR of the target program into a target back-end compiler corresponding to the target processor. The IR of the target program may be an IR that has undergone IR optimization.
According to a fifth aspect, a compiling apparatus is provided. The compiling apparatus is configured in a computer device including at least two processors, and includes: multiple back-end compilation units in one-to-one correspondence with the multiple processors, each configured to convert a received IR into code that the corresponding processor can recognize; a front-end compilation unit, configured to obtain the DSL corresponding to a target program; an intermediate compilation unit, configured to determine an IR according to the DSL; and a selection unit, configured to determine program information of the target program according to the IR, obtain hardware information of each of the at least two processors, determine, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, and send the IR to the back-end compilation unit corresponding to the target processor.
The program information indicates the instructions in the target program, the hardware information indicates an instruction set corresponding to the processor, and the target processor is a processor that satisfies a preset condition among the at least two processors, where the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program; and/or the hardware information indicates the size of the available memory space of the processor, the program information indicates the memory space that the target program needs to occupy, and the preset condition includes that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy.
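The end-to-end flow of the fifth aspect (DSL, then IR, then processor selection, then dispatch to the matching back end) can be sketched as a small driver. The DSL, IR, and back-end "code generators" below are toy stand-ins; only the control flow mirrors the text, and all processor names and supported-function sets are assumptions.

```python
# Minimal self-contained sketch of the compiling apparatus' control flow.

def front_end(dsl_source):
    # "front-end compilation unit": obtain the DSL of the target program
    return dsl_source.strip().splitlines()

def intermediate(tokens):
    # "intermediate compilation unit": determine the IR from the DSL
    return [("op", t) for t in tokens]

def make_backend(processor_name):
    # one back-end compilation unit per processor, converting IR into
    # code the corresponding processor can recognize
    def backend(ir):
        return [f"{processor_name}:{op_name}" for _, op_name in ir]
    return backend

def selection_unit(ir, processors, supported):
    # determine program info from the IR, then pick the first processor
    # (assumed ordered by priority) whose instruction set covers it
    used = {op_name for _, op_name in ir}
    for proc in processors:
        if used <= supported[proc]:
            return proc
    raise RuntimeError("no processor satisfies the preset condition")

backends = {p: make_backend(p) for p in ("NPU", "GPU", "CPU")}
supported = {"NPU": {"matmul"}, "GPU": {"matmul", "relu"}, "CPU": {"matmul", "relu", "print"}}

ir = intermediate(front_end("matmul\nrelu"))
target = selection_unit(ir, ["NPU", "GPU", "CPU"], supported)
print(target, backends[target](ir))  # GPU ['GPU:matmul', 'GPU:relu']
```

Note that the selection unit never touches the back ends directly; it only routes the IR, which is why adding a new processor only requires registering one more back-end unit and its hardware information.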
According to the compiling apparatus provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and based on this information a processor whose hardware information matches the program information is selected from the multiple processors. The selected processor therefore matches the target program, and no manual designation of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on the programmer.
Optionally, the selection unit is configured to determine a priority of each of the at least two processors, determine in descending order of priority, based on the program information and the hardware information, whether each processor satisfies the preset condition, and use the first processor that satisfies the preset condition as the target processor. By setting priorities for the processors, personalized processing can be achieved and different processing scenarios can be handled flexibly; moreover, the efficiency of determining the target processor can be improved and the time required to determine it can be shortened.
Optionally, the priority of each processor is determined according to the parallel computing capability of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; moreover, because the power consumption of the CPU is relatively high, setting the priority of the CPU to the lowest increases the probability that a coprocessor is selected as the target processor, which further improves the effect and practicability of this application.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Optionally, the identification unit is configured to obtain the hardware information of each processor according to registration information of the processor, where the registration information is used to register the processor in the computing device. The registration information may include hardware description information. The hardware description information may be obtained offline by the computer device before the processor is installed, or may be obtained by the computer device from driver information of the processor when the processor is installed.
According to a sixth aspect, a computer device is provided, including multiple processors, a compiler, and a selection apparatus, where the selection apparatus performs the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof. For example, the compiler includes a front-end compiler, an intermediate compiler, and a back-end compiler.
According to a seventh aspect, a chip or chipset is provided, including at least one processor and at least one memory control unit, where the processor performs the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof. The chip or chipset may include a smart chip, and the smart chip may include at least two processors.
According to an eighth aspect, a computer system is provided, including a processor and a memory, where the processor includes at least two processors and a memory control unit, and the processor performs the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof.
Optionally, the computer system further includes a system bus, and the system bus is configured to connect the processor (specifically, the memory control unit) and the memory.
According to a ninth aspect, a computer program product is provided. The computer program product includes a computer program (which may also be referred to as code or instructions); when the computer program is run by a processor, or by a processor in a chip, the processor is enabled to perform the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof.
According to a tenth aspect, a computer-readable medium is provided. The computer-readable medium stores a computer program (which may also be referred to as code or instructions); when the computer program is run on a processor, or on a processor in a chip, the processor is enabled to perform the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of a hardware structure of a computer device (or computer system) to which the methods and apparatuses of the embodiments of this application are applicable.
FIG. 2 is a schematic diagram of an example of a lexical analysis process of this application.
FIG. 3 is a schematic diagram of an example of a syntax analysis process of this application.
FIG. 4 is a schematic diagram of an example of an intermediate code generation and optimization process of this application.
FIG. 5 is a schematic flowchart of an example of a method for selecting a processor of this application.
FIG. 6 is a schematic flowchart of another example of a method for selecting a processor of this application.
FIG. 7 is a schematic diagram of an example of a compilation method of this application.
FIG. 8 is a schematic structural diagram of an example of an apparatus for selecting a processor of this application.
FIG. 9 is a schematic structural diagram of an example of a compiling apparatus of this application.
DETAILED DESCRIPTION
The technical solutions in this application are described below with reference to the accompanying drawings. First, a computing device 100 that performs the methods of the embodiments of this application is described in detail with reference to FIG. 1.
A computing device may also be referred to as a computer system. From a logical layering perspective, a computing device may include a hardware layer, an operating system layer running on the hardware layer, and an application layer running on the operating system layer. The hardware layer includes hardware such as a processor, memory, and a memory control unit; the functions and structure of this hardware are described in detail later. The operating system may be any one or more computer operating systems that implement service processing through processes, for example, a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a Windows operating system. The application layer includes applications such as a browser, an address book, word processing software, and instant messaging software. In the embodiments of this application, the computer system may be a handheld device such as a smartphone, or a terminal device such as a personal computer; this is not particularly limited in this application, provided that the computer system can read and run the program code of the methods of the embodiments. The execution body of the methods in the embodiments of this application may be a computer system, or may be a functional module in the computer system, such as a processor, that can invoke and execute a program.
In this application, a program or program code refers to an ordered set of instructions (or code) used to implement a relatively independent function. A process is one run of a program and its data on a computer device. A program usually adopts a modular design, that is, the functionality of the program is decomposed into multiple smaller functional modules. A program contains at least one function, and a function is a code segment that implements a functional module. A function is therefore the basic unit of modular program functionality and may also be regarded as a subroutine.
FIG. 1 is a schematic structural diagram of a computing device 100 according to an embodiment of this application. The computing device shown in FIG. 1 is configured to perform the methods of the embodiments of this application. The computing device 100 may include at least two processors 110 and a memory 120.
Optionally, the computing device 100 may further include a system bus, and the processor 110 and the memory 120 are each connected to the system bus. The processor 110 can access the memory 120 through the system bus; for example, the processor 110 can read and write data or execute code in the memory 120 through the system bus. The function of the processor 110 is mainly to interpret the instructions (or code) of a computer program and to process the data in computer software. The instructions of the computer program and the data in the computer software may be stored in the memory 120 or in a cache unit 116. In this embodiment of this application, the processor 110 may be an integrated circuit chip, or a component thereof, with signal processing capability.
In this application, the processor 110 may fetch an instruction from the memory or a cache, place it in an instruction register, and decode the instruction. The processor decomposes the instruction into a series of micro-operations and then issues various control commands to execute the series of micro-operations, thereby completing the execution of one instruction. An instruction is a basic command by which the computer specifies the type of operation to perform and its operands. An instruction consists of one or more bytes, including an opcode field, one or more fields related to operand addresses, and some status words and feature codes that characterize the state of the machine. Some instructions also directly contain the operand itself.
By way of example and not limitation, in this application the processor 110 may include a memory control unit 114 and at least one processing unit 112.
处理单元112也可以称为核心(core)或内核，是处理器最重要的组成部分。处理单元112可以是由单晶硅以一定的生产工艺制造出来的，处理器110的计算、接受命令、存储命令、处理数据都由核心执行。处理单元112可以分别独立地运行程序指令，利用并行计算的能力加快程序的运行速度。各种处理器110都具有固定的逻辑结构，例如，处理器110包括一级缓存、二级缓存、执行单元、指令级单元和总线接口等逻辑单元。The processing unit 112, which may also be called a core (or kernel), is the most important component of the processor. The processing unit 112 may be manufactured from monocrystalline silicon using a particular production process; the processor 110's computation, command reception, command storage, and data processing are all performed by the cores. The processing units 112 can each run program instructions independently, using the capability of parallel computing to speed up program execution. Every kind of processor 110 has a fixed logical structure; for example, the processor 110 includes logical units such as a level-1 cache, a level-2 cache, an execution unit, an instruction-level unit, and a bus interface.
内存控制单元114用于控制内存120与处理单元112之间的数据交互。具体地说,内存控制单元114可以从处理单元112接收内存访问请求,并基于该内存访问请求控制针对内存的访问。作为示例而非限定,在本申请实施例中,内存控制单元可以是内存管理单元(memory management unit,MMU)等器件。The memory control unit 114 is configured to control data interaction between the memory 120 and the processing unit 112. Specifically, the memory control unit 114 may receive a memory access request from the processing unit 112 and control access to the memory based on the memory access request. By way of example and not limitation, in the embodiment of the present application, the memory control unit may be a device such as a memory management unit (MMU).
在本申请实施例中,各内存控制单元114可以通过系统总线进行内存120的寻址。并且在系统总线中可以配置仲裁器(未图示),该仲裁器可以负责处理和协调多个处理单元112的竞争访问。In the embodiment of the present application, each memory control unit 114 may address the memory 120 through a system bus. In addition, an arbiter (not shown) may be configured in the system bus, and the arbiter may be responsible for processing and coordinating competing accesses of the plurality of processing units 112.
在本申请实施例中,处理单元112和内存控制单元114可以通过芯片内部的连接线,例如地址线,通信连接,从而实现处理单元112和内存控制单元114之间的通信。In the embodiment of the present application, the processing unit 112 and the memory control unit 114 may be connected through a connection line inside the chip, such as an address line, to implement communication between the processing unit 112 and the memory control unit 114.
可选地，每个处理器110还可以包括缓存单元116，其中，缓存单元116是数据交换的缓冲区(称作cache)。当处理单元112要读取数据时，会首先从缓存单元116中查找需要的数据，如果找到了则直接执行，找不到的话再从内存120中找。由于缓存单元116的运行速度比内存120快得多，故缓存单元116的作用就是帮助处理单元112更快地运行。Optionally, each processor 110 may further include a cache unit 116, which is a buffer for data exchange (called a cache). When the processing unit 112 needs to read data, it first looks for the required data in the cache unit 116; if the data is found, it is used directly, and if not, the data is fetched from the memory 120. Because the cache unit 116 runs much faster than the memory 120, the role of the cache unit 116 is to help the processing unit 112 run faster.
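The cache-first lookup order just described can be modeled in a few lines. This is a minimal sketch, not the cache unit 116 itself: the dictionaries standing in for the cache and main memory, and the address-to-value mapping, are invented for illustration.

```python
# Illustrative model of the cache-first lookup: the processing unit checks
# the cache before falling back to main memory. Contents are made up.

main_memory = {addr: addr * 10 for addr in range(1024)}  # slow backing store (simulated)
cache = {}                                               # fast cache unit (simulated)

def read(addr):
    if addr in cache:               # cache hit: use the cached copy directly
        return cache[addr]
    value = main_memory[addr]       # cache miss: fetch from memory
    cache[addr] = value             # fill the cache for subsequent reads
    return value

read(42)                 # miss: loaded from memory and cached
assert 42 in cache       # now resident in the cache
read(42)                 # hit: served from the cache without touching memory
```

The speed benefit comes from the hit path avoiding the slower memory access entirely.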
内存(memory)120可以为计算设备100中的进程提供运行空间，例如，内存120中可以保存用于生成进程的计算机程序(具体地说，是程序的代码)，并且，内存120中可以保存进程运行期间产生的数据，例如，中间数据，或过程数据。内存也可以称为内存储器，其作用是用于暂时存放处理器110中的运算数据，以及与硬盘等外部存储器交换的数据。只要计算机在运行中，处理器110就会把需要运算的数据调到内存120中进行运算，当运算完成后处理单元112再将结果传送出来。The memory 120 may provide running space for the processes in the computing device 100. For example, the memory 120 may store the computer program (specifically, the program's code) used to create a process, and may also store the data generated while the process runs, such as intermediate data or process data. Memory may also be called internal memory; its role is to temporarily hold the operational data of the processor 110 and the data exchanged with external storage such as a hard disk. Whenever the computer is running, the processor 110 transfers the data to be operated on into the memory 120 for computation, and after the computation is completed the processing unit 112 transmits the result out.
作为示例而非限定，在本申请实施例中，内存120可以是易失性存储器或非易失性存储器，或可包括易失性和非易失性存储器两者。其中，非易失性存储器可以是只读存储器(read-only memory，ROM)、可编程只读存储器(programmable ROM，PROM)、可擦除可编程只读存储器(erasable PROM，EPROM)、电可擦除可编程只读存储器(electrically EPROM，EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory，RAM)，其用作外部高速缓存。通过示例性但不是限制性说明，许多形式的RAM可用，例如静态随机存取存储器(static RAM，SRAM)、动态随机存取存储器(dynamic RAM，DRAM)、同步动态随机存取存储器(synchronous DRAM，SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM，DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM，ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM，SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM，DR RAM)。应注意，本文描述的系统和方法的内存120旨在包括但不限于这些和任意其它适合类型的存储器。By way of example and not limitation, in the embodiments of the present application, the memory 120 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of illustrative but non-limiting description, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory 120 of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
应理解，以上列举的计算设备100的结构仅为示例性说明，本申请并未限定于此，本申请实施例的计算设备100可以包括现有技术中计算机系统中的各种硬件，例如，该计算设备100还可以包括除内存120以外的其他存储器，例如，磁盘存储器等。It should be understood that the structure of the computing device 100 listed above is only an exemplary description, and the present application is not limited thereto. The computing device 100 in the embodiments of the present application may include various hardware found in prior-art computer systems; for example, the computing device 100 may further include storage other than the memory 120, such as disk storage.
在本申请实施例中，可以在计算设备100上应用虚拟化技术。通过虚拟化技术，计算设备100中可以同时运行多个虚拟机，每个虚拟机上可以运行至少一个操作系统，每一个操作系统都运行多个程序。虚拟机(virtual machine)是指通过软件模拟的具有完整硬件系统功能的、运行在一个完全隔离环境中的完整计算机系统。In the embodiments of the present application, virtualization technology may be applied on the computing device 100. Through virtualization, the computing device 100 can run multiple virtual machines at the same time; each virtual machine can run at least one operating system, and each operating system runs multiple programs. A virtual machine is a complete computer system that is simulated by software, has complete hardware system functions, and runs in a fully isolated environment.
在本申请中，处理器110可以包括多个种类。例如，不同种类的处理器可以使用不同种类的指令。再例如，不同种类的处理器可以具有不同的计算能力。再例如，不同种类的处理器可以用于处理不同类型的计算。作为示例而非限定，在本申请中，该多种处理器可以包括通用处理器和协处理器。下面分别对上述各种处理器进行详细说明。In this application, the processors 110 may be of multiple kinds. For example, different kinds of processors may use different kinds of instructions. As another example, different kinds of processors may have different computing capabilities. As yet another example, different kinds of processors may be used to handle different types of computation. By way of example and not limitation, in this application the multiple kinds of processors may include general-purpose processors and coprocessors. The various processors above are each described in detail below.
A.通用处理器A. General Purpose Processor
通用处理器也可以称为中央处理器(central processing unit，CPU)，是一块超大规模的集成电路或其中的部件，是一台计算机的运算核心(Core)和控制核心(Control Unit)。它的功能主要是解释计算机指令以及处理计算机软件中的数据。中央处理器主要包括运算器(算术逻辑运算单元，ALU，Arithmetic Logic Unit)和高速缓冲存储器(Cache)及实现它们之间联系的数据(Data)、控制及状态的总线(Bus)。它与内部存储器(Memory)和输入/输出(I/O)设备合称为电子计算机三大核心部件。例如，CPU包括运算逻辑部件、寄存器部件和控制部件等。A general-purpose processor may also be called a central processing unit (CPU). It is a very-large-scale integrated circuit, or a component of one, and is the computing core and control unit of a computer. Its main function is to interpret computer instructions and process the data in computer software. A central processing unit mainly includes an arithmetic logic unit (ALU), a cache, and the data, control, and status buses that connect them. Together with the internal memory and the input/output (I/O) devices, it is counted among the three core components of an electronic computer. For example, a CPU includes arithmetic-logic components, register components, and a control component.
逻辑部件(logic components)是运算逻辑部件。可以执行定点或浮点算术运算操作、移位操作以及逻辑操作，也可执行地址运算和转换。The logic components are the arithmetic-logic components. They can perform fixed-point or floating-point arithmetic operations, shift operations, and logical operations, and can also perform address calculation and translation.
寄存器包括通用寄存器、专用寄存器和控制寄存器。通用寄存器又可分定点数和浮点数两类,它们用来保存指令执行过程中临时存放的寄存器操作数和中间(或最终)的操作结果。Registers include general purpose registers, special purpose registers, and control registers. General-purpose registers can be divided into fixed-point and floating-point numbers. They are used to store register operands temporarily stored during instruction execution and intermediate (or final) operation results.
控制部件(control unit)主要是负责对指令译码，并且发出为完成每条指令所要执行的各个操作的控制信号。其结构有两种：一种是以微存储为核心的微程序控制方式；一种是以逻辑硬布线结构为主的控制方式。微存储中保持微码，每一个微码对应于一个最基本的微操作，又称微指令；各条指令是由不同序列的微码组成，这种微码序列构成微程序。中央处理器在对指令译码以后，即发出一定时序的控制信号，按给定序列的顺序以微周期为节拍执行由这些微码确定的若干个微操作，即可完成某条指令的执行。简单指令是由3～5个微操作组成，复杂指令则要由几十个微操作甚至几百个微操作组成。The control unit is mainly responsible for decoding instructions and issuing the control signals for the operations to be performed to complete each instruction. It has two kinds of structure: a microprogram control scheme centered on micro-storage, and a control scheme based mainly on hard-wired logic. The micro-storage holds microcode, where each microcode word corresponds to one elementary micro-operation, also called a microinstruction; each instruction is composed of a different sequence of microcode words, and such a microcode sequence constitutes a microprogram. After decoding an instruction, the central processing unit issues control signals with a certain timing and, following the given sequence with one micro-cycle per beat, executes the micro-operations determined by the microcode, thereby completing the execution of the instruction. A simple instruction consists of three to five micro-operations, while a complex instruction may consist of dozens or even hundreds of micro-operations.
B.协处理器B. Coprocessor
协处理器(coprocessor)，一种芯片或芯片中的部件，用于减轻系统微处理器的特定处理任务。协处理器，这是一种协助中央处理器完成其无法执行或执行效率、效果低下的处理工作而开发和应用的处理器。这种中央处理器无法执行的工作有很多，比如设备间的信号传输、接入设备的管理等；而执行效率、效果低下的有图形处理、声频处理等。为了进行这些处理，各种辅助处理器就诞生了。需要说明的是，由于现在的计算机中，整数运算器与浮点运算器已经集成在一起，因此浮点处理器已经不算是辅助处理器。而内建于CPU中的协处理器，可以不算是辅助处理器。当然，协处理器也可以是独立存在。A coprocessor is a chip, or a component of a chip, used to offload specific processing tasks from the system's microprocessor. A coprocessor is a processor developed and applied to assist the central processing unit with processing work that the CPU cannot perform, or performs inefficiently or poorly. There are many tasks the central processing unit cannot perform, such as signal transmission between devices and management of access devices; tasks it performs inefficiently or poorly include graphics processing and audio processing. Various auxiliary processors were created to handle such processing. It should be noted that, because the integer arithmetic unit and the floating-point arithmetic unit are integrated together in today's computers, the floating-point processor is no longer regarded as an auxiliary processor, and a coprocessor built into the CPU may likewise not be regarded as an auxiliary processor. Of course, a coprocessor may also exist independently.
在本申请中,协处理器可以用于特定处理任务,例如,数学协处理器可以控制数字处理;图形协处理器可以处理视频绘制。协处理器可以附属于通用处理器。一个协处理器通过扩展指令集或提供配置寄存器来扩展通用处理器内核处理功能。一个或多个协处理器可以通过协处理器接口与通用处理器内核相连。例如,协处理器也能通过提供一组专门的新指令来扩展指令集。作为示例而非限定,协处理器可以包括但不限于以下至少一种处理器:In this application, the coprocessor can be used for specific processing tasks, for example, a mathematical coprocessor can control digital processing; a graphics coprocessor can handle video rendering. The coprocessor can be attached to a general-purpose processor. A coprocessor extends the general-purpose processor core processing capabilities by extending the instruction set or providing configuration registers. One or more coprocessors can be connected to a general-purpose processor core through a coprocessor interface. For example, the coprocessor can also expand the instruction set by providing a new set of specialized instructions. By way of example and not limitation, the coprocessor may include, but is not limited to, at least one of the following processors:
B1.图形处理器B1. Graphics Processor
图形处理器(graphics processing unit，GPU)，又称显示核心、视觉处理器、显示芯片，是一种专门在个人电脑、工作站、游戏机和一些移动设备(如平板电脑、智能手机等)上进行图像运算工作的微处理器。GPU的用途是将计算机系统所需要的显示信息进行转换驱动，并向显示器提供行扫描信号，控制显示器的正确显示，是连接显示器和个人电脑主板的重要元件，也是“人机对话”的重要设备之一。例如，显卡的处理器有时被称为图形处理器(GPU)，它是显卡的“心脏”，与CPU类似，只不过GPU是专为执行复杂的数学和几何计算而设计的，这些计算是图形渲染所需的。某些最快速的GPU集成的晶体管数甚至超过了普通CPU。A graphics processing unit (GPU), also called the display core, visual processor, or display chip, is a microprocessor dedicated to image computation on personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). The purpose of the GPU is to convert and drive the display information required by the computer system and to provide line-scanning signals to the display so that the display works correctly. It is an important component connecting the display to the personal computer's mainboard, and one of the important devices for "human-machine dialogue". For example, the processor of a graphics card is sometimes called a graphics processing unit (GPU); it is the "heart" of the graphics card and is similar to a CPU, except that the GPU is designed to perform the complex mathematical and geometric calculations required for graphics rendering. Some of the fastest GPUs integrate even more transistors than an ordinary CPU.
时下的GPU多数拥有2D或3D图形加速功能。如果CPU想画一个二维图形，只需要发个指令给GPU，如“在坐标位置(x,y)处画个长和宽为a×b大小的长方形”，GPU就可以迅速计算出该图形的所有像素，并在显示器上指定位置画出相应的图形，画完后就通知CPU“我画完了”，然后等待CPU发出下一条图形指令。有了GPU，CPU就从图形处理的任务中解放出来，可以执行其他更多的系统任务，这样可以大大提高计算机的整体性能。例如，GPU会产生大量热量，所以它的上方通常安装有散热器或风扇。Most current GPUs have 2D or 3D graphics acceleration capabilities. If the CPU wants to draw a two-dimensional figure, it only needs to send an instruction to the GPU, such as "draw a rectangle of size a×b at coordinate position (x, y)"; the GPU can then quickly calculate all the pixels of the figure, draw the corresponding figure at the specified position on the display, notify the CPU "I have finished drawing" when done, and then wait for the CPU to issue the next graphics instruction. With a GPU, the CPU is freed from graphics-processing tasks and can perform more of the other system tasks, which can greatly improve the overall performance of the computer. Incidentally, a GPU generates a lot of heat, so a heat sink or fan is usually mounted above it.
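The division of labor just described, where the CPU issues one short command and the GPU does the per-pixel work, can be sketched as a producer/consumer pair. This is a toy model only: the command tuple format and the queue standing in for the CPU-GPU interface are invented for illustration.

```python
# Toy model of the CPU->GPU split: the "CPU" enqueues a high-level drawing
# command; the "GPU" function computes every pixel. Command format is made up.

from queue import Queue

def gpu_execute(command):
    op, x, y, a, b = command
    assert op == "rect"
    # compute all pixel coordinates of an a x b rectangle at (x, y)
    return [(x + i, y + j) for i in range(a) for j in range(b)]

command_queue = Queue()
command_queue.put(("rect", 10, 20, 3, 2))   # CPU side: one short instruction
pixels = gpu_execute(command_queue.get())   # GPU side: the heavy per-pixel work
print(len(pixels))  # 6 pixels for a 3x2 rectangle
```

The asymmetry is the point: the command is a few words, while the work it triggers grows with the number of pixels.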
GPU是显示卡的“大脑”，GPU决定了该显卡的档次和大部分性能，同时GPU也是2D显示卡和3D显示卡的区别依据。2D显示芯片在处理3D图像与特效时主要依赖CPU的处理能力，称为软加速。3D显示芯片是把三维图像和特效处理功能集中在显示芯片内，也就是所谓的“硬件加速”功能。显示芯片一般是显示卡上最大的芯片(也是引脚最多的)。目前，GPU已经不再局限于3D图形处理了，GPU通用计算技术发展已经引起业界不少的关注，事实也证明在浮点运算、并行计算等部分计算方面，GPU可以提供数十倍乃至于上百倍于CPU的性能。GPU使计算机设备削减了对CPU的依赖，并分担部分原本CPU执行的工作。The GPU is the "brain" of the graphics card; it determines the card's grade and most of its performance, and it is also the basis for distinguishing 2D graphics cards from 3D graphics cards. A 2D display chip mainly relies on the CPU's processing power when handling 3D images and special effects, which is called software acceleration. A 3D display chip concentrates the three-dimensional image and special-effects processing inside the display chip itself, the so-called "hardware acceleration" function. The display chip is generally the largest chip on the graphics card (and also the one with the most pins). Today the GPU is no longer limited to 3D graphics processing; the development of general-purpose GPU computing has attracted considerable attention in the industry, and it has been demonstrated that in areas such as floating-point and parallel computation the GPU can deliver tens or even hundreds of times the performance of the CPU. The GPU lets computing devices reduce their dependence on the CPU and takes over part of the work originally performed by the CPU.
B2.现场可编程门阵列专用集成电路B2. Field Programmable Gate Array Application Specific Integrated Circuit
现场可编程门阵列(field programmable gate array，FPGA)是在例如可编程阵列逻辑(PAL，Programmable Array Logic)、通用阵列逻辑(GAL，Generic Array Logic)、复杂可编程逻辑器件(CPLD，Complex Programmable Logic Device)等可编程器件的基础上进一步发展的产物。现场可编程门阵列专用集成电路是作为专用集成电路(ASIC，Application Specific Integrated Circuit)领域中的一种半定制电路而出现的，既解决了定制电路的不足，又克服了原有可编程器件门电路数有限的缺点。系统设计师可以根据需要通过可编辑的连接把FPGA内部的逻辑块连接起来，就好像一个电路试验板被放在了一个芯片里。一个出厂后的成品FPGA的逻辑块和连接可以按照设计者而改变，所以FPGA可以完成所需要的逻辑功能。A field programmable gate array (FPGA) is a further development of programmable devices such as programmable array logic (PAL), generic array logic (GAL), and complex programmable logic devices (CPLD). It appeared as a kind of semi-custom circuit in the field of application-specific integrated circuits (ASIC): it remedies the shortcomings of fully custom circuits while overcoming the limited gate count of earlier programmable devices. System designers can connect the logic blocks inside an FPGA through editable interconnections as needed, as if a circuit breadboard had been placed inside a chip. The logic blocks and interconnections of a finished, shipped FPGA can be changed by the designer, so the FPGA can implement the required logic functions.
FPGA采用了逻辑单元阵列(LCA，Logic Cell Array)，内部包括可配置逻辑模块(CLB，Configurable Logic Block)、输入输出模块(IOB，Input Output Block)和内部连线(Interconnect)三个部分。FPGA作为可编程器件，通过不同的编程方式，与传统逻辑电路和门阵列(如PAL、GAL及CPLD器件)相比，FPGA可具有不同的结构。FPGA利用小型查找表(16×1 RAM)来实现组合逻辑，每个查找表连接到一个D触发器的输入端，触发器再来驱动其他逻辑电路或驱动I/O，由此构成了既可实现组合逻辑功能又可实现时序逻辑功能的基本逻辑单元模块，这些模块间利用金属连线互相连接或连接到I/O模块。FPGA的逻辑是通过向内部静态存储单元加载编程数据来实现的，存储在存储器单元中的值决定了逻辑单元的逻辑功能以及各模块之间或模块与I/O间的联接方式，并最终决定了FPGA所能实现的功能，FPGA允许无限次的编程。The FPGA uses a logic cell array (LCA), which internally comprises three parts: configurable logic blocks (CLB), input/output blocks (IOB), and interconnect. As a programmable device, an FPGA can take on different structures through different programming, in contrast to traditional logic circuits and gate arrays (such as PAL, GAL, and CPLD devices). The FPGA implements combinational logic with small lookup tables (16×1 RAM); each lookup table is connected to the input of a D flip-flop, and the flip-flop in turn drives other logic circuits or drives I/O. This forms a basic logic cell module that can implement both combinational and sequential logic functions; these modules are interconnected with one another, or connected to the I/O modules, by metal wiring. The FPGA's logic is realized by loading programming data into internal static storage cells: the values stored in the memory cells determine the logic functions of the logic cells and the connections between modules, or between modules and I/O, and ultimately determine the functions the FPGA can implement. An FPGA can be reprogrammed an unlimited number of times.
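The lookup-table mechanism above can be illustrated in software: a 4-input LUT is a 16×1 memory whose stored bits are the truth table of the function it implements. This is a minimal sketch of the principle, not FPGA configuration code; the AND example is an arbitrary choice.

```python
# Sketch of how an FPGA lookup table (LUT) realizes combinational logic:
# a k-input LUT is a 2^k x 1 memory whose contents define the truth table.
# Here a 4-input LUT (the 16x1 RAM mentioned above) is programmed as AND.

def make_lut(truth_table):
    """truth_table: list of 16 output bits, indexed by the 4 input bits."""
    def lut(a, b, c, d):
        index = (a << 3) | (b << 2) | (c << 1) | d   # inputs form the RAM address
        return truth_table[index]
    return lut

# Program the LUT: output 1 only when all four inputs are 1 (4-input AND).
and4 = make_lut([0] * 15 + [1])
assert and4(1, 1, 1, 1) == 1
assert and4(1, 0, 1, 1) == 0
# Loading a different truth table yields any other 4-input function, which
# is why an FPGA can be reconfigured without changing the hardware.
```

Reprogramming is just rewriting the stored bits, mirroring how the paragraph describes loading programming data into the static storage cells.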
需要说明的是,由于FPGA不包括指令集,因此,后述方法200可以不用于判定FPGA是否能够作为目标处理器。It should be noted that, because the FPGA does not include an instruction set, the method 200 described below may not be used to determine whether the FPGA can be used as a target processor.
但是,由于FPGA具有内存空间,因此,后述方法300可以用于判定FPGA是否能够作为目标处理器。However, since the FPGA has a memory space, a method 300 described later can be used to determine whether the FPGA can be used as a target processor.
B3.神经网络处理器B3. Neural Network Processor
神经网络处理器(neural-network processing unit，NPU)采用“数据驱动并行计算”的架构，特别擅长处理视频、图像类的海量多媒体数据。其中，NPU可以用于深度学习，从技术角度看，深度学习实际上是一类多层大规模人工神经网络。它模仿生物神经网络而构建，由若干人工神经元结点互联而成。神经元之间通过突触两两连接，突触记录了神经元间联系的权值强弱。每个神经元可抽象为一个激励函数，该函数的输入由与其相连的神经元的输出以及连接神经元的突触共同决定。为了表达特定的知识，使用者通常需要(通过某些特定的算法)调整人工神经网络中突触的取值、网络的拓扑结构等。该过程称为“学习”。在学习之后，人工神经网络可通过习得的知识来解决特定的问题。A neural-network processing unit (NPU) adopts a "data-driven parallel computing" architecture and is particularly good at processing massive multimedia data such as video and images. An NPU can be used for deep learning; from a technical perspective, deep learning is in fact a class of multilayer, large-scale artificial neural networks. Such a network is modeled on biological neural networks and is built from a number of interconnected artificial neuron nodes. Neurons are connected pairwise by synapses, and each synapse records the weight of the connection between the two neurons. Each neuron can be abstracted as an activation function whose input is jointly determined by the outputs of the neurons connected to it and by the connecting synapses. To express specific knowledge, the user usually needs to adjust (through certain algorithms) the values of the synapses in the artificial neural network, the topology of the network, and so on. This process is called "learning". After learning, the artificial neural network can use the acquired knowledge to solve specific problems.
深度学习的基本操作是神经元和突触的处理。而传统的处理器指令集是为了进行通用计算发展起来的，其基本操作为算术操作(加减乘除)和逻辑操作(与或非)，往往需要数百甚至上千条指令才能完成一个神经元的处理，深度学习的处理效率不高。与此相对，NPU指令直接面对大规模神经元和突触的处理，一条指令即可完成一组神经元的处理，并对神经元和突触数据在芯片上的传输提供了一系列专门的支持。另外，神经网络中存储和处理是一体化的，都是通过突触权重来体现。The basic operations of deep learning are the processing of neurons and synapses. The traditional processor instruction set was developed for general-purpose computing; its basic operations are arithmetic operations (addition, subtraction, multiplication, and division) and logical operations (AND, OR, NOT), and it often takes hundreds or even thousands of instructions to complete the processing of one neuron, so the efficiency of deep-learning processing is low. By contrast, NPU instructions directly address the processing of large-scale neurons and synapses: a single instruction can complete the processing of a group of neurons, and a series of dedicated support is provided for transmitting neuron and synapse data on the chip. In addition, storage and processing are integrated in a neural network, both being embodied in the synapse weights.
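The neuron abstraction described above, an activation function applied to the synapse-weighted sum of connected neurons' outputs, can be written out directly. This is a minimal illustration: the sigmoid activation and the particular weights and bias are arbitrary choices, not values from this application.

```python
# Minimal illustration of the neuron model: each neuron's output is an
# activation function of the synapse-weighted sum of its inputs.
# The sigmoid activation and the numeric values here are arbitrary.

import math

def neuron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))   # sigmoid activation

# "Learning" would adjust the synapse weights; the topology stays fixed here.
out = neuron([0.5, 0.8], weights=[0.4, -0.6], bias=0.1)
print(round(out, 3))
```

A conventional instruction set computes this one multiply-accumulate at a time, which is why an NPU instruction that processes a whole group of neurons at once is more efficient for this workload.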
B4.专用集成电路B4. Application Specific Integrated Circuits
专用集成电路(application specific integrated circuit,ASIC)是为特定用户或特定电子系统制作的集成电路。数字集成电路的通用性和大批量生产,使电子产品成本大幅度下降,推进了计算机通信和电子产品的普及,但同时也产生了通用与专用的矛盾,以及系统设计与电路制作脱节的问题。同时,集成电路规模越大,组建系统时就越难以针对特殊要求加以改变。为解决这些问题,就出现了以用户参加设计为特征的专用集成电路,它能实现整 机系统的优化设计,性能优越,保密性强。ASIC可以用于执行软件程序,也可以不执行软件程序而是通过硬件逻辑执行计算。例如,执行软件程序的ASIC中可以包括一个或多个处理器内核以运行指令,并具有对应的指令集。Application specific integrated circuit (ASIC) is an integrated circuit made for a specific user or a specific electronic system. The universality and mass production of digital integrated circuits has greatly reduced the cost of electronic products and promoted the popularization of computer communications and electronic products. However, it has also caused the contradiction between general and special applications, and the disconnection between system design and circuit production. At the same time, the larger the integrated circuit scale, the more difficult it is to change for special requirements when building a system. In order to solve these problems, ASICs featuring user participation in design have emerged, which can realize the optimized design of the entire system, with superior performance and strong confidentiality. ASICs can be used to execute software programs, or they can perform calculations through hardware logic instead of software programs. For example, an ASIC executing a software program may include one or more processor cores to execute instructions and have a corresponding instruction set.
B5.数字信号处理器B5. Digital Signal Processor
数字信号处理(digital signal processing)是将信号以数字方式表示并处理的理论和技术。数字信号处理与模拟信号处理是信号处理的子集。数字信号处理的目的是对真实世界的连续模拟信号进行测量或滤波。因此在进行数字信号处理之前需要将信号从模拟域转换到数字域，这通常通过模数转换器实现。而数字信号处理的输出经常也要变换到模拟域，这是通过数模转换器实现的。数字信号处理器(digital signal processor，DSP)是进行数字信号处理的专用芯片，是伴随着微电子学、数字信号处理技术、计算机技术的发展而产生的新器件。Digital signal processing is the theory and technique of representing and processing signals digitally. Digital signal processing and analog signal processing are both subsets of signal processing. The purpose of digital signal processing is to measure or filter continuous real-world analog signals. Therefore, before digital signal processing, a signal needs to be converted from the analog domain to the digital domain, which is usually achieved by an analog-to-digital converter; the output of digital signal processing also often has to be converted back into the analog domain, which is achieved by a digital-to-analog converter. A digital signal processor (DSP) is a chip dedicated to digital signal processing, a new kind of device that emerged with the development of microelectronics, digital signal processing technology, and computer technology.
B6.图像处理单元B6. Image processing unit
图像处理单元(image processing unit，IPU)也可以称为图像信号处理器(image signal processor)，是可以用来对前端图像传感器输出信号进行处理的单元，以匹配不同厂商的图像传感器。并且，可以用于提供从图像输入(摄像头传感器/电视信号输入等)到显示设备(例如，液晶显示屏、电视输出或外部图像处理单元等)端到端的数据流信号处理的全面支持。An image processing unit (IPU), which may also be called an image signal processor, is a unit that can be used to process the output signals of front-end image sensors so as to match image sensors from different manufacturers. It can also provide comprehensive support for end-to-end data-stream signal processing, from image input (camera sensor, TV signal input, and so on) to the display device (for example, a liquid-crystal display, TV output, or an external image processing unit).
应理解,以上列举的处理器仅为示例性说明,本申请并未限定于此,例如,本申请中的处理器还可以包括可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。作为示例而非限定,在本申请中,上述包括多种处理器的结构可以称为异构体系结构,或者,异构系统架构。It should be understood that the above-listed processors are merely exemplary descriptions, and the present application is not limited thereto. For example, the processors in this application may further include programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. By way of example and not limitation, in the present application, the above-mentioned structure including multiple processors may be referred to as a heterogeneous architecture or a heterogeneous system architecture.
在互联网行业，随着信息化的普及，数据量的暴增使得人们对存储空间又有了新要求，同时，机器学习、人工智能、无人驾驶、工业仿真等领域的崛起，使得通用处理器在处理海量计算、海量数据/图片时遇到越来越多的性能瓶颈，如并行度不高、带宽不够、时延高等。为了应对计算多元化的需求，越来越多的场景开始引入GPU、FPGA等硬件进行加速，异构计算应运而生。异构计算(heterogeneous computing)主要指不同类型的指令集和体系架构的计算单元组成的系统的计算方式。所谓的异构，就是CPU、DSP、GPU、ASIC、协处理器、FPGA等各种计算单元、使用不同的类型指令集、不同的体系架构的计算单元，组成一个混合的系统，执行计算的特殊方式，就叫做“异构计算”。特别是在人工智能领域，异构计算大有可为。众所周知，AI意味着对计算力的超高要求，目前以GPU为代表的异构计算已成为加速AI创新的新一代计算架构。In the Internet industry, with the spread of informatization, the explosion in data volume has placed new demands on storage. At the same time, the rise of fields such as machine learning, artificial intelligence, autonomous driving, and industrial simulation means that general-purpose processors encounter more and more performance bottlenecks when handling massive computation and massive data or images, such as limited parallelism, insufficient bandwidth, and high latency. To meet the demand for diversified computing, more and more scenarios have begun to introduce hardware such as GPUs and FPGAs for acceleration, and heterogeneous computing has emerged. Heterogeneous computing mainly refers to the way of computing of a system composed of computing units with different types of instruction sets and architectures. "Heterogeneous" means that various computing units, such as CPUs, DSPs, GPUs, ASICs, coprocessors, and FPGAs, using different types of instruction sets and different architectures, form one hybrid system; this special way of performing computation is what is called "heterogeneous computing". Heterogeneous computing holds particular promise in the field of artificial intelligence. As is well known, AI implies extremely high demands on computing power, and heterogeneous computing, represented today by the GPU, has become the new generation of computing architecture for accelerating AI innovation.
在异构系统架构(heterogeneous system architecture，HSA)中，多种处理器协同工作，即，CPU可以将大部分资源用于缓存和逻辑控制(即非计算单元)，将少部分资源用于计算。这体现了CPU适合运行具有分支密集型、不规则数据结构、递归等特点的串行程序。与传统多核心架构相结合，将专用的计算模块作为加速器加入系统，例如图形处理单元(GPU)、数字信号处理器(DSP)、现场可编程门阵列(FPGA)和其他可编程逻辑单元正在被利用作为加速器(即内核异构架构)，这已成为趋势。HSA以实现异构计算最佳化为目标推出新的系统架构和执行标准，最终目的是透过SoC内各核心(包括CPU、GPU、DSP和其他处理器)的异质架构之间进行协同运算，借此促使整颗SoC内各架构效能得到最大发挥。异构系统架构能够使多种处理器实现内存统一寻址。In a heterogeneous system architecture (HSA), multiple kinds of processors work together; that is, the CPU can devote most of its resources to caching and logic control (non-computing units) and only a small part to computation. This reflects that the CPU is suited to running serial programs characterized by dense branching, irregular data structures, recursion, and the like. Combining this with the traditional multi-core architecture, it has become a trend to add dedicated computing modules to the system as accelerators, with graphics processing units (GPUs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), and other programmable logic units being used as accelerators (that is, a heterogeneous-core architecture). HSA introduces a new system architecture and execution standard aimed at optimizing heterogeneous computing; the ultimate goal is cooperative computation among the heterogeneous architectures of the cores within an SoC (including the CPU, GPU, DSP, and other processors), so that the performance of each architecture in the whole SoC is maximized. A heterogeneous system architecture enables multiple kinds of processors to use unified memory addressing.
在异构计算系统上进行的并行计算通常称为异构计算。人们已从不同角度对异构计算进行定义，综合起来本实施例给出如下定义：异构计算是一种特殊形式的并行和分布式计算，它或是用能同时支持单指令多数据流(single instruction multiple data，SIMD)方式和多指令流多数据流(multiple instruction stream multiple data stream，MIMD)方式的单个独立计算机，或是用由高速网络互连的一组独立计算机来完成计算任务。它能协调地使用性能、结构各异的机器以满足不同的计算需求，并使代码(或代码段)能以获取最大总体性能方式来执行。Parallel computing performed on a heterogeneous computing system is usually called heterogeneous computing. Heterogeneous computing has been defined from various perspectives; taken together, this embodiment gives the following definition: heterogeneous computing is a special form of parallel and distributed computing that completes computing tasks either with a single standalone computer capable of supporting both the single-instruction multiple-data (SIMD) mode and the multiple-instruction-stream multiple-data-stream (MIMD) mode, or with a group of independent computers interconnected by a high-speed network. It can use machines of different performance and structure in a coordinated way to meet different computing needs, and enables code (or code segments) to be executed in the way that obtains the greatest overall performance.
异构计算技术是一种使计算任务的并行性类型(代码类型)与机器能有效支持的计算类型(即机器能力)最相匹配、最能充分利用各种计算资源的并行和分布计算技术。上述具有异构系统架构的芯片可以称为人工智能(artificial intelligence,AI)芯片,或者,加速处理器(accelerated processing unit,APU)。Heterogeneous computing technology is a parallel and distributed computing technology that enables the type of parallelism (code type) of computing tasks to best match the type of computing that the machine can effectively support (that is, machine capabilities) and makes the best use of various computing resources. The above chip with a heterogeneous system architecture may be called an artificial intelligence (AI) chip, or an accelerated processing unit (APU).
The method for selecting a processor provided in this application can select, from the multiple processors described above, a processor for executing a target process. As described above, in this application a processor runs a target program by executing the code of the target program. In this application, different types of processors may have different instruction-set architectures (ISAs); for example, different types of processors may have different instruction sets. An instruction set is a hard-wired program, stored in or integrated into the processor in hardware form, that guides and optimizes processor operations. A processor can run more efficiently through its instruction set.
To make it easier for programmers to write programs, compilation technology may be used in this application. Compilation is the process of converting a program written in one programming language (the source language) into another language (the target language). In this application, the compiler used by this compilation technology may include, but is not limited to, the following components:
A. Front-end compiler:
The front-end compiler implements the conversion from a source program (or source code) to an intermediate representation (IR); that is, the user first describes an operator's computation in a domain-specific language (DSL), which serves as the input of the front-end compiler. Front-end processing mainly includes lexical analysis, syntax analysis, and semantic analysis.
1) Lexical analysis derives the corresponding token sequence from the character sequence. For example, for the code (or instruction or function) "b=3+52*a", the front-end compiler can obtain the token sequence shown in FIG. 2.
2) Syntax analysis further derives an abstract syntax tree (AST) from the token sequence. For example, the token sequence above yields the syntax tree shown in FIG. 3.
3) Semantic analysis identifies the types of variables, the scopes of operations, and so on.
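By way of example and not limitation, the lexical-analysis step above can be sketched as a minimal tokenizer; the token names are illustrative assumptions and do not necessarily match those of FIG. 2.

```python
import re

# Lexical analysis: character sequence -> token sequence.
# Token categories are hypothetical, chosen only for illustration.
TOKEN_SPEC = [("NUM", r"\d+"), ("ID", r"[a-zA-Z_]\w*"), ("OP", r"[=+*]")]

def tokenize(src):
    tokens = []
    pos = 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, src[pos:])
            if m:
                tokens.append((name, m.group()))
                pos += len(m.group())
                break
        else:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
    return tokens

print(tokenize("b=3+52*a"))
# [('ID', 'b'), ('OP', '='), ('NUM', '3'), ('OP', '+'),
#  ('NUM', '52'), ('OP', '*'), ('ID', 'a')]
```

Syntax analysis would then group this token sequence into a tree such as the one in FIG. 3.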
In this application, the front-end compiler may also be called a front-end compilation device or a front-end compilation unit.
B. Intermediate compiler:
The intermediate compiler is used for code generation and optimization. Specifically, intermediate code is pseudo-code that can be regarded as a program for an abstract machine; it is simple and standardized, machine-independent, easy to optimize and convert, and organized as a syntax tree. For example, for the code (or instruction or function) "sum=(10+20)*(num+square)", the syntax tree shown in FIG. 4 can be obtained after code generation and optimization.
In this application, optimization of the intermediate code is performed on an equivalence basis, which saves storage space and makes the program run faster. Common optimizations fall into two categories: 1) machine-independent optimizations, such as constant folding, common-subexpression elimination, loop unrolling and merging, and code hoisting (moving loop-invariant computations out of loops); and 2) machine-dependent optimizations, such as register utilization (keeping frequently used values in registers to reduce the number of memory accesses) and storage strategies (arranging caches and parallel memory systems according to the algorithm's memory-access requirements to reduce access conflicts).
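By way of example and not limitation, one machine-independent optimization named above, constant folding, can be sketched over a small tree like the one in FIG. 4; the tuple-based node representation is a hypothetical illustration, not the embodiment's actual IR.

```python
# AST nodes are tuples: ('num', value), ('var', name), or (op, left, right).
def fold_constants(node):
    """Recursively replace subtrees whose operands are all constants
    with the constant result (an equivalence-preserving rewrite)."""
    if node[0] in ("num", "var"):  # leaves are returned unchanged
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "num" and right[0] == "num":
        ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
        return ("num", ops[op](left[1], right[1]))
    return (op, left, right)

# sum = (10 + 20) * (num + square): the left operand folds to 30;
# the right operand contains variables and is left unchanged.
ast = ("*", ("+", ("num", 10), ("num", 20)),
            ("+", ("var", "num"), ("var", "square")))
print(fold_constants(ast))
# ('*', ('num', 30), ('+', ('var', 'num'), ('var', 'square')))
```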
In this application, the intermediate compiler may also be called an intermediate compilation device or an intermediate compilation unit.
C. Back-end compiler:
The back-end compiler (backend) is mainly used for target-code generation. That is, there may be multiple back-end compilers, in one-to-one correspondence with the multiple types of processors; each back-end compiler converts the optimized IR it receives into target code (instructions or functions) that can run on the corresponding processor, where the target code may be instruction code or assembly code. As described above, one back-end compiler must be selected from the multiple back-end compilers to generate the target code. In the prior art this is done manually; by contrast, in the embodiments of this application the process can be completed automatically by a computer device. In addition, because the multiple back-end compilers correspond to the multiple types of processors, "selecting, from the multiple back-end compilers, one back-end compiler for generating the target code" can also be understood as the process of selecting, from the multiple types of processors, one processor for executing the target program.
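By way of example and not limitation, the one-to-one correspondence between back-end compilers and processor types can be pictured as a registry keyed by processor type; the function names and the string-based "target code" below are illustrative assumptions, not the embodiment's actual interfaces.

```python
# Hypothetical backend registry: processor type -> code generator.
backends = {}

def register_backend(processor_type, codegen):
    backends[processor_type] = codegen

# Each codegen turns optimized IR into target code for its own processor.
register_backend("cpu", lambda ir: f"cpu-asm({ir})")
register_backend("gpu", lambda ir: f"gpu-kernel({ir})")

def generate_target_code(processor_type, ir):
    """Dispatch the IR to the backend of the selected processor."""
    return backends[processor_type](ir)

print(generate_target_code("gpu", "mul;add"))  # gpu-kernel(mul;add)
```

Selecting a processor then amounts to choosing which registry entry receives the IR.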
In this application, the back-end compiler may also be called a back-end compilation device or a back-end compilation unit.
FIG. 5 is a schematic flowchart of an example of a method 200 for selecting a processor according to this application. By way of example and not limitation, the execution body of the method 200 (hereinafter called processing node #A for ease of understanding and description) may be any one of the multiple processors in the computing device, for example, a central processing unit. Alternatively, processing node #A may be a virtual machine running on the computing device. In addition, in this application, processing node #A may be the back-end compiler described above, or may be a device independent of the back-end compiler; this application does not specifically limit this.
It should be noted that the method 200 selects the target processor based on instructions. Because an FPGA does not include an instruction set, the method 200 need not be used to determine whether an FPGA can serve as the target processor.
Moreover, when an ASIC can be used to execute a software program, the method 200 can be used to determine whether the ASIC can serve as the target processor.
When an ASIC does not execute a software program but performs computation through hardware logic, the method 200 need not be used to determine whether the ASIC can serve as the target processor.
As shown in FIG. 5, in S210, processing node #A may obtain hardware information of each of the two types of processors included in the computing device 100. Optionally, in this application, the manufacturer of the computing device 100 may pre-configure the hardware information of the processors included in the computing device 100 before the computing device 100 leaves the factory; thus, in S210, processing node #A can obtain the hardware information of each of the two types of processors based on that factory configuration. Optionally, the manufacturer of the computing device 100 may store the hardware information of the processors included in the computing device 100 on a server; thus, in S210, processing node #A connects to the server over a network in advance and obtains from the server the hardware information of each of the two types of processors. Optionally, a user of the computing device 100 may input the hardware information of the processors included in the computing device 100 to processing node #A. Optionally, each processor may be installed in a hot-pluggable manner, and the driver of each processor may register the processor when it is hot-plugged; in this case, in S210, processing node #A can obtain the hardware information of each of the two types of processors based on the registration information of each processor or related information in its driver.
That is, in this application, the computer device 100 (or processing node #A) may have a processor-registration-information collection function, so that it can identify which heterogeneous hardware the computer device 100 supports and, based on the identified hardware, register the backend corresponding to each processor at system startup. The processing node can then determine the hardware information of each processor from the registration information of the backend corresponding to that processor.
In this application, the hardware information of a processor may include information about the instruction set corresponding to the processor. For example, the hardware information of a processor may include the names of the instructions that the processor can execute. As another example, the hardware information of a processor may include the names of the functions that the processor can execute.
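By way of example and not limitation, such hardware information could be represented as a mapping from each processor to the set of instruction names its instruction set supports; the processor names and instruction names below are invented for illustration only.

```python
# Hypothetical hardware information collected in S210: each processor
# maps to the set of instruction names its instruction set supports.
hardware_info = {
    "processor_a": {"mul", "add", "vmla"},        # special-purpose accelerator
    "processor_b": {"mul", "add", "sub", "div"},  # general-purpose CPU
}

def can_execute(processor, required_instructions):
    """True if the processor's instruction set includes every
    instruction the program needs."""
    return required_instructions <= hardware_info[processor]

print(can_execute("processor_a", {"mul", "add"}))  # True
print(can_execute("processor_a", {"div"}))         # False
```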
As shown in FIG. 5, in S220, processing node #A may determine program information of the program that currently needs to run (that is, an example of the target program, denoted program #A). By way of example and not limitation, the program information may be determined from the IR of program #A. For example, the front-end compiler may obtain the source code of program #A (denoted code #A). Specifically, the compiler may provide, for example, a domain-specific-language interface (DSL interface) through which developers write the DSL corresponding to an operator (an example of code #A); the intermediate compiler may then convert code #A (for example, the DSL) of program #A into the IR of program #A and, in this application, may also optimize that IR. Processing node #A can thus determine the program information of program #A from the IR (for example, the optimized IR) of program #A.
It should be noted that, in this application, processing node #A may itself serve as the front-end compiler and intermediate compiler for code #A, in which case processing node #A can obtain the IR of program #A directly. Alternatively, the front-end compiler and intermediate compiler for code #A may be implemented by a processing node #B, in which case processing node #A may communicate with processing node #B, so that processing node #B can send the IR of program #A to processing node #A.
In this application, the program information of program #A may include the instructions (denoted instruction #A) contained in the code of program #A (for example, the optimized IR). Instruction #A may consist of one instruction or multiple instructions; this application does not specifically limit this. For example, the program information of program #A may include the names of the instructions in the IR of program #A. As another example, the program information of program #A may include the names of the functions in the IR of program #A.
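By way of example and not limitation, collecting the instruction names used by a program from its IR could be sketched as follows; the textual IR format is an assumption made purely for illustration.

```python
import re

# Hypothetical textual IR for s = x*y + z, matching the DSL description
# m = mul(x, y); s = add(m, z) used later in this application.
ir_text = """
m = mul(x, y)
s = add(m, z)
"""

def instructions_used(ir):
    """Collect the set of instruction (function) names appearing in the IR."""
    return set(re.findall(r"(\w+)\(", ir))

print(instructions_used(ir_text))  # a set containing 'mul' and 'add'
```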
In S230, processing node #A may determine a target processor (denoted processor #1) from the multiple processors based on the program information of program #A and the hardware information of each processor. Processor #1 may be a processor, among the multiple processors, whose corresponding instruction set includes instruction #A. In other words, processor #1 may be a processor, among the multiple processors, that satisfies constraint #A, where constraint #A is that the instruction set corresponding to the processor includes instruction #A.
Optionally, in this application, processing node #A may determine the priority of each of the multiple processors. By way of example and not limitation, processing node #A may determine each processor's priority according to its parallel computing capability; that is, in this application, a processor with higher parallel computing capability has higher priority than a processor with lower parallel computing capability. For example, for processor #a and processor #b, if the parallel computing capability of processor #b is higher than that of processor #a, processing node #A may regard the priority of processor #b as higher than that of processor #a. Here, parallel computing is defined relative to serial computing: it is a form of computation in which multiple instructions can be executed at once, with the aim of increasing computation speed and solving large, complex computing problems by scaling up the problem size. Parallel computing can be divided into parallelism in time and parallelism in space: parallelism in time refers to pipelining, and parallelism in space refers to executing computations concurrently on multiple processors.
Optionally, in this application, processing node #A may determine the priority of each processor according to processor type. For example, in this application, a special-purpose processor has higher priority than a general-purpose processor; optionally, the general-purpose processor may be the lowest-priority processor among the multiple processors. Processing node #A can then check each processor against constraint #A in priority order, for example from highest to lowest priority, and, optionally, determine the first processor that satisfies constraint #A to be processor #1.
In addition, processing node #A may stop checking the remaining processors once processor #1 has been determined. For example, in this application, different processors have different instruction sets, and the instructions used to implement the same function differ between chips. Suppose the instruction set of processor #a is intrin#a and the instruction set of processor #b is intrin#b, and let the source-code expression of program #A be s=x*y+z. Suppose further that, ranked by priority, processor #b is the lowest-priority (default) processor: for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor; that is, processor #b can implement the function, but its parallel computing capability is inferior to that of special-purpose processor #a. Described in a DSL, the computation can be expressed as m=mul(x,y), s=add(m,z). In this application, after IR processing, the IR description of program #A is obtained, and analysis shows that the computation uses two kinds of instructions: mul (multiplication) and add (addition). Processing node #A then first determines whether these instructions all belong to intrin#a; if the determination is "yes", it selects processor #a as processor #1; if "no", it selects processor #b as processor #1.
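By way of example and not limitation, the priority-ordered check just described could be sketched as follows; the processor names, instruction sets, and priority order are the hypothetical ones from the example above.

```python
# Hypothetical instruction sets from the example.
intrin_a = {"mul", "add", "vmla"}        # special-purpose processor #a
intrin_b = {"mul", "add", "sub", "div"}  # general-purpose processor #b

# Processors ordered by priority, highest first; #b is the default.
priority_order = [("processor_a", intrin_a), ("processor_b", intrin_b)]

def select_processor(required):
    """Return the first processor whose instruction set covers `required`
    (constraint #A), stopping as soon as one is found."""
    for name, intrin in priority_order:
        if required <= intrin:
            return name
    return priority_order[-1][0]  # fall back to the lowest-priority processor

# IR analysis of s = x*y + z found the instructions mul and add.
print(select_processor({"mul", "add"}))  # processor_a
print(select_processor({"div"}))         # processor_b
```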
Optionally, the hardware information obtained by processing node #A in S210 further includes information about the size of each processor's currently available memory space. By way of example and not limitation, in this application, the currently available memory space of a processor may be, for example, 90% of the processor's free space (or memory capacity). Correspondingly, the program information obtained by processing node #A in S220 further includes information about the size of the memory space required to run program #A. In this case, processor #1 may be a processor, among the multiple processors, whose corresponding instruction set includes instruction #A and whose currently available memory space is greater than or equal to the memory space required to run program #A. In other words, processor #1 may be a processor, among the multiple processors, that satisfies both constraint #A and constraint #B, where constraint #B is that the processor's currently available memory space is greater than or equal to the memory space required to run program #A (denoted space #A).
By way of example and not limitation, in this application, processing node #A may determine space #A according to the data dimensions of program #A (or of the code of program #A). The data dimensions of program #A can be understood as the shape of the tensors of program #A. A tensor is a multilinear map defined on the Cartesian product of some vector spaces and some dual spaces; its coordinates form a quantity with |n| components in an |n|-dimensional space, where each component is a function of the coordinates, and under a coordinate transformation the components transform linearly according to certain rules. r is called the rank or order of the tensor (unrelated to the rank or order of a matrix). In the sense of isomorphism, a tensor of order zero (r=0) is a scalar, a tensor of order one (r=1) is a vector, and a tensor of order two (r=2) is a matrix. For example, in 3-dimensional space, a tensor with r=1 is the vector (x, y, z). Depending on the transformation rule, tensors are divided into three classes: covariant tensors (indices below), contravariant tensors (indices above), and mixed tensors (indices both above and below).
In this application, the tensor data structure can be used to represent all data; that is, in this application, a tensor may correspond to an n-dimensional array or list. A tensor has a static type and dynamic dimensions, and tensors can flow between the nodes of a graph.
In this application, the number of dimensions of a tensor is described as its order. It should be noted that the order of a tensor (sometimes expressed as its degree, or as being n-dimensional) is a quantitative description of the tensor's dimensionality. For example, in this application, processing node #A may perform shape analysis on the IR of program #A to determine the dimensions of the IR of program #A (specifically, of its tensors), and thereby estimate the size of the memory space required to run program #A. The method and process of estimating memory size based on data dimensions may be similar to the prior art; a detailed description is omitted here to avoid repetition.
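By way of example and not limitation, the shape-based estimate could be sketched as follows, assuming (purely for illustration) 4-byte elements and that shape analysis has already yielded the shapes of the program's tensors.

```python
from functools import reduce

def tensor_bytes(shape, element_size=4):
    """Memory for one tensor: the product of its dimensions
    times the assumed element size in bytes."""
    return reduce(lambda a, b: a * b, shape, 1) * element_size

def estimate_program_memory(tensor_shapes):
    """Estimate space #A as the sum over all tensors found by shape analysis."""
    return sum(tensor_bytes(s) for s in tensor_shapes)

# e.g. five tensors (x, y, z, m, s) each of shape (1024, 1024)
shapes = [(1024, 1024)] * 5
print(estimate_program_memory(shapes))  # 20971520 bytes (20 MiB)
```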
In this case, let the source-code expression of program #A be s=x*y+z, and suppose that, ranked by priority, processor #b is the lowest-priority (default) processor: for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor; that is, processor #b can implement the function, but its parallel computing capability is inferior to that of special-purpose processor #a. Described in a DSL, the computation can be expressed as m=mul(x,y), s=add(m,z). In this application, after IR processing, the IR description of program #A is obtained, and analysis shows that the computation uses two kinds of instructions: mul (multiplication) and add (addition). In addition, processing node #A can determine the size of the memory space that program #A needs to occupy (for example, let this size be X), and can determine the size of each processor's currently available memory space (let the currently available memory space of processor #a be Y). Processing node #A then first determines whether the instructions all belong to intrin#a; if the determination is "yes", it further determines whether X is less than or equal to Y. If that determination is "yes", it selects processor #a as processor #1; if "no", it selects processor #b as processor #1.
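By way of example and not limitation, the selection check extended with constraint #B could be sketched as follows; the processor names, instruction sets, and memory figures are hypothetical.

```python
# Hypothetical per-processor hardware information: instruction set and
# currently available memory (e.g. 90% of free memory), in bytes.
processors = [  # ordered by priority, highest first; #b is the fallback
    {"name": "processor_a", "intrin": {"mul", "add"}, "avail_mem": 16 << 20},
    {"name": "processor_b", "intrin": {"mul", "add", "div"}, "avail_mem": 256 << 20},
]

def select_processor(required_instrs, required_mem):
    """Return the first processor satisfying constraint #A (instructions)
    and constraint #B (available memory >= space #A)."""
    for p in processors:
        if required_instrs <= p["intrin"] and required_mem <= p["avail_mem"]:
            return p["name"]
    return processors[-1]["name"]

# X = 20 MiB needed but Y = 16 MiB on #a -> constraint #B fails, use #b.
print(select_processor({"mul", "add"}, 20 << 20))  # processor_b
# X = 8 MiB fits on #a -> both constraints hold, use #a.
print(select_processor({"mul", "add"}, 8 << 20))   # processor_a
```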
Processing node #A may then direct the intermediate compiler to send the IR of program #A to the backend corresponding to processor #1, so that the backend corresponding to processor #1 can convert the IR of program #A into code that processor #1 can recognize and process.
According to the method for selecting a processor provided in this application, the hardware information of each type of processor and the program information of the target program are obtained in advance, and based on this hardware information and program information, a processor whose hardware information matches the program information is selected from the multiple types of processors. This matches the selected processor to the target program without requiring the processor to be specified manually, thereby improving the processing efficiency of the computer device and reducing the burden on programmers.
FIG. 6 is a schematic flowchart of an example of a method 300 for selecting a processor according to this application. By way of example and not limitation, the execution body of the method 300 (hereinafter called processing node #B for ease of understanding and description) may be any one of the multiple processors in the computing device, for example, a central processing unit. Alternatively, processing node #B may be a virtual machine running on the computing device. In addition, in this application, processing node #B may be the back-end compiler described above, or may be a device independent of the back-end compiler; this application does not specifically limit this.
As shown in FIG. 6, in S310, processing node #B may obtain hardware information of each of the two types of processors included in the computing device 100. Optionally, in this application, the manufacturer of the computing device 100 may pre-configure the hardware information of the processors included in the computing device 100 before the computing device 100 leaves the factory; thus, in S310, processing node #B can obtain the hardware information of each of the two types of processors based on that factory configuration. Optionally, the manufacturer of the computing device 100 may store the hardware information of the processors included in the computing device 100 on a server; thus, in S310, processing node #B connects to the server over a network in advance and obtains from the server the hardware information of each of the two types of processors. Optionally, a user of the computing device 100 may input the hardware information of the processors included in the computing device 100 to processing node #B. Optionally, each processor may be installed in a hot-pluggable manner, and the driver of each processor may register the processor when it is hot-plugged; in this case, in S310, processing node #B can obtain the hardware information of each of the two types of processors based on the registration information of each processor or related information in its driver. That is, in this application, the computer device 100 (or processing node #B) may have a processor-registration-information collection function, so that it can identify which heterogeneous hardware the computer device 100 supports and, based on the identified hardware, register the backend corresponding to each processor at system startup. The processing node can then determine the hardware information of each processor from the registration information of the backend corresponding to that processor.
In this application, the hardware information of a processor may include the size of the processor's currently available memory space. By way of example and not limitation, in this application, the currently available memory space of a processor may be, for example, 90% of the processor's free space (or memory capacity).
As shown in FIG. 6, in S320, processing node #B may determine program information of the program that currently needs to run (that is, an example of the target program, denoted program #B). By way of example and not limitation, the program information may be determined from the IR of program #B. For example, the front-end compiler may obtain the source code of program #B (denoted code #B). Specifically, the compiler may provide, for example, a domain-specific-language interface (DSL interface) through which developers write the DSL corresponding to an operator (an example of code #B); the intermediate compiler may then convert code #B (for example, the DSL) of program #B into the IR of program #B and, in this application, may also optimize that IR. Processing node #B can thus determine the program information of program #B from the IR (for example, the optimized IR) of program #B.
It should be noted that, in this application, the processing node #B may itself serve as the front-end compiler and the intermediate compiler for the code #B; in this case, the processing node #B can directly obtain the IR of program #B. Alternatively, in this application, the front-end compiler and the intermediate compiler for the code #B may be implemented by a node other than the processing node #B; in this case, that node may communicate with the processing node #B and send the IR of program #B to the processing node #B. In this application, the program information of program #B may include information on the size of the memory space (denoted as space #B) required for running program #B.
By way of example and not limitation, in this application, the processing node #B may determine the space #B according to the data dimensions of program #B (that is, of the code of program #B). The data dimensions of program #B can be understood as the shapes of the tensors of program #B. A tensor is a multilinear map defined on the Cartesian product of some vector spaces and some dual spaces; in an |n|-dimensional space, it is a quantity with |n| components, where each component is a function of the coordinates, and under a coordinate transformation these components transform linearly according to certain rules. r is called the rank or order of the tensor (which is unrelated to the rank or order of a matrix). Up to isomorphism, a tensor of order zero (r = 0) is a scalar, a tensor of order one (r = 1) is a vector, and a tensor of order two (r = 2) is a matrix. For example, in a 3-dimensional space, a tensor with r = 1 is a vector (x, y, z). Depending on how they transform, tensors fall into three classes: covariant tensors (indices below), contravariant tensors (indices above), and mixed tensors (indices both above and below).
In this application, a data structure such as a tensor can be used to represent all data; that is, in this application, a tensor may correspond to an n-dimensional array or list. A tensor has a static type and a dynamically typed number of dimensions, and tensors can flow between the nodes of a graph. In this application, the number of dimensions of a tensor is described as its order. It should be noted that the order of a tensor (sometimes referred to as its degree, or as being n-dimensional) is a quantitative description of the tensor's dimensionality. For example, in this application, the processing node #B may perform a shape analysis on the IR of program #B, thereby determining the dimensions of the IR of program #B (specifically, of the tensors in the IR), and then estimate the size of the memory space required for running program #B. Here, the method and process of estimating the size of the memory space based on the dimensions of the data may be similar to those in the prior art; to avoid repetition, a detailed description thereof is omitted here.
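Although the application leaves the estimation method to the prior art, a simple shape-based estimate can be sketched as below: sum the element counts over all tensor shapes found in the IR and multiply by the element size. The shapes, the 4-byte element size, and the function name are illustrative assumptions, not part of the application.

```python
from functools import reduce
from operator import mul as multiply

def estimate_memory(tensor_shapes, dtype_bytes=4):
    """Estimate bytes needed from the shapes of the tensors in the IR.

    Sums the element count of every tensor, then multiplies by the
    assumed per-element size (4 bytes, e.g. float32).
    """
    total_elems = sum(reduce(multiply, shape, 1) for shape in tensor_shapes)
    return total_elems * dtype_bytes

# s = x*y + z with x, y, z and the intermediate m all of shape (1024, 1024)
shapes = [(1024, 1024)] * 4
print(estimate_memory(shapes))
```

The result would then play the role of W (the memory space program #B needs to occupy) in the comparison against each processor's available memory in S330.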
In S330, the processing node #B may determine a target processor (denoted as processor #2) from the multiple processors based on the program information of program #B and the hardware information of each processor. The processor #2 may be a processor, among the multiple processors, whose currently available memory space is greater than or equal to the size of the memory space required for running program #B. In other words, the processor #2 may be a processor, among the multiple processors, that satisfies constraint #C, where constraint #C includes: the processor's currently available memory space is greater than or equal to the size of the memory space required for running program #B.
Optionally, in this application, the processing node #B may determine the priority of each of the multiple processors. By way of example and not limitation, in this application, the processing node #B may determine the priority of each processor according to the parallel computing capability of each of the multiple processors; that is, in this application, a processor with a high parallel computing capability has a higher priority than a processor with a low parallel computing capability. For example, for processor #a and processor #b, if the parallel computing capability of processor #b is higher than that of processor #a, the processing node #B may consider the priority of processor #b to be higher than that of processor #a. Here, parallel computing is defined in contrast to serial computing: it is a form of computation in which multiple instructions can be executed at once, with the purpose of increasing computing speed and of solving large, complex computational problems by scaling up the problem size. Parallelism can be divided into temporal parallelism and spatial parallelism; temporal parallelism refers to pipelining, while spatial parallelism refers to using multiple processors to perform computations concurrently.
As another example, in this application, the processing node #B may determine the priority of each processor according to the power consumption of each of the multiple processors; that is, in this application, a processor with high power consumption has a lower priority than a processor with low power consumption. For example, for processor #a and processor #b, if the power consumption of processor #b is higher than that of processor #a, the processing node #B may consider the priority of processor #b to be lower than that of processor #a.
Optionally, in this application, the processing node #B may determine the priority of each processor according to the type of each of the multiple processors. For example, in this application, a special-purpose processor has a higher priority than a general-purpose processor; and, optionally, a general-purpose processor may be the processor with the lowest priority among the multiple processors. The processing node #B may then judge, in priority order (for example, from the highest priority to the lowest), whether each processor satisfies the above constraint #C. Optionally, the processing node #B may determine the first processor that satisfies constraint #C as the processor #2. In addition, after determining the processor #2, the processing node #B may stop judging the remaining processors.
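The priority-ordered, first-match selection described above can be sketched as follows. The tuple layout, field names, and the concrete memory figures are assumptions for illustration; the essential behavior is from the text: check processors in descending priority, return the first one satisfying constraint #C, and stop judging the rest.

```python
def select_processor(processors, required_bytes):
    """processors: list of (name, priority, available_bytes) tuples.

    Returns the name of the first processor, in descending priority
    order, whose available memory satisfies constraint #C, or None.
    """
    for name, _prio, avail in sorted(processors, key=lambda p: -p[1]):
        if avail >= required_bytes:   # constraint #C
            return name               # first match wins; stop judging
    return None

procs = [
    ("processor#a", 2, 4 * 1024**2),   # special-purpose, higher priority
    ("processor#b", 1, 64 * 1024**2),  # general-purpose, lowest priority
]
# processor#a lacks memory here, so the fallback processor#b is chosen
print(select_processor(procs, required_bytes=16 * 1024**2))
```

Note that because the loop returns on the first match, a lower-priority general-purpose processor is only reached when every higher-priority processor fails constraint #C, matching the fallback behavior in the worked example that follows.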
For example, suppose the source-code expression of program #B is s = x*y + z, and suppose that, in priority order, processor #b is by default the processor with the lowest priority; for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor, that is, processor #b can implement the function, but its parallel computing capability is inferior to that of the special-purpose processor #a. In a DSL, this computation can be expressed as m = mul(x, y), s = add(m, z). In this application, after IR processing, the IR description of program #B is obtained, and analysis shows that the computation uses two kinds of instructions: mul (multiplication) and add (addition). The processing node #B can determine the size of the memory space that program #B needs to occupy (for example, let this size be W), and can determine the size of each processor's currently available memory space; let the current available memory space of processor #a be Z. Thereafter, the processing node #B judges whether Z is greater than or equal to W: if the judgment is "yes", processor #a is selected as the processor #2; if the judgment is "no", processor #b is selected as the processor #2.
Optionally, the hardware information obtained by the processing node #B in S310 further includes information about the instruction set corresponding to each processor. For example, the hardware information of a processor may include the names of the instructions that the processor can execute; as another example, it may include the names of the functions that the processor can execute. Correspondingly, the program information obtained by the processing node #B in S320 further includes the instructions (denoted as instruction #B) included in the code of program #B (for example, in the optimized IR). The instruction #B may include one instruction or multiple instructions, which is not specifically limited in this application. For example, the program information of program #B may include the names of the instructions in the IR of program #B; as another example, it may include the names of the functions in the IR of program #B. In this case, the processor #2 may be a processor, among the multiple processors, whose currently available memory space is greater than or equal to the memory space required for running program #B and whose corresponding instruction set includes the instruction #B. In other words, the processor #2 may be a processor, among the multiple processors, that satisfies both constraint #C and constraint #D, where constraint #D includes: the processor's corresponding instruction set includes the instruction #B. In this case, suppose the source-code expression of program #B is s = x*y + z, and suppose that, in priority order, processor #b is by default the processor with the lowest priority; for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor, that is, processor #b can implement the function, but its parallel computing capability is inferior to that of the special-purpose processor #a. For example, in this application, different processors have different instruction sets, and the instructions used to implement the same function differ from chip to chip; suppose the instruction set of processor #a is intrin#a and the instruction set of processor #b is intrin#b. In a DSL, this computation can be expressed as m = mul(x, y), s = add(m, z). In this application, after IR processing, the IR description of program #B is obtained, and analysis shows that the computation uses two kinds of instructions: mul (multiplication) and add (addition). The processing node #B can determine the size of the memory space that program #B needs to occupy (for example, let this size be W), and can determine the size of each processor's currently available memory space; let the current available memory space of processor #a be Z. Thereafter, the processing node #B first judges whether Z is greater than or equal to W. If the judgment is "yes", it further judges whether all of the above instructions belong to intrin#a; if this judgment is "yes", processor #a is selected as the processor #2; if it is "no", processor #b is selected as the processor #2.
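The combined check of constraint #C (memory) and constraint #D (instruction-set coverage) can be sketched as below. The instruction-set contents (intrin#a, intrin#b) and the data layout are illustrative assumptions; the two-stage test mirrors the worked example: memory first, then set membership of every IR instruction.

```python
def select_with_isa(processors, required_bytes, program_instrs):
    """processors: list of (name, priority, available_bytes, instr_set).

    Returns the first processor, in descending priority order, that
    satisfies constraint #C (memory) and constraint #D (its instruction
    set covers every instruction used by the program's IR).
    """
    for name, _prio, avail, isa in sorted(processors, key=lambda p: -p[1]):
        if avail >= required_bytes and program_instrs <= isa:
            return name
    return None

intrin_a = {"mul", "add", "fma"}         # special-purpose processor #a
intrin_b = {"mul", "add", "sub", "div"}  # general-purpose processor #b
procs = [
    ("processor#a", 2, 32 * 1024**2, intrin_a),
    ("processor#b", 1, 64 * 1024**2, intrin_b),
]
# s = x*y + z uses the instructions mul and add
print(select_with_isa(procs, 16 * 1024**2, {"mul", "add"}))
```

Here the subset test `program_instrs <= isa` stands in for constraint #D; if processor #a lacked any needed instruction, the loop would fall through to the general-purpose processor #b, as in the example above.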
The processing node #B may control the intermediate compiler to send the IR of program #B to the backend corresponding to the processor #2, so that the backend corresponding to the processor #2 can convert the IR of program #B into code that the processor #2 can recognize and process. The method for selecting a processor of this application can be applied to compilation technology.
As shown in FIG. 7, in S410, a compiling device (for example, a front-end compiler) may provide a DSL interface through which a developer writes the DSL corresponding to an operator. In S420, the compiling device (for example, an intermediate compiler) may generate the intermediate expression IR from the DSL. In S430, the compiling device (for example, the intermediate compiler) may optimize the intermediate expression IR. In S440, the compiling device (for example, the processing node #A or the processing node #B described above) selects the optimal back-end compiler (backend) based on the backend hardware registration information obtained by the automatic hardware identification means and the analysis result of the automatic IR analysis means; the specific process of this step may be similar to the process described in the above method 200 or method 300, and, to avoid repetition, a detailed description thereof is omitted here. In S450, the compiling device (for example, the selected back-end compiler) compiles and generates operator code that can run on the processor corresponding to this backend.
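The five stages S410–S450 can be strung together in a toy sketch like the one below. Every function body here is a placeholder assumption (real front-end parsing, IR generation, and optimization are far richer); the sketch only shows the order in which the stages hand data to each other and where the backend selection of S440 sits in the pipeline.

```python
def frontend_parse(dsl_source):
    """S410: accept the developer-written DSL for an operator."""
    return {"ops": dsl_source.split(";")}

def to_ir(ast):
    """S420: generate a (toy) intermediate expression IR."""
    return {"instrs": [op.strip() for op in ast["ops"] if op.strip()]}

def optimize(ir):
    """S430: optimize the IR (identity here, for brevity)."""
    return ir

def select_backend(ir, backends):
    """S440: pick the first registered backend, in priority order,
    whose instruction set covers the instructions used by the IR."""
    used = {i.split("(")[0].split("=")[-1] for i in ir["instrs"]}
    for name, isa in backends:
        if used <= isa:
            return name
    return None

# registered backends, listed in priority order (names are assumptions)
backends = [("backend#a", {"mul", "add"}), ("backend#b", {"mul", "add", "div"})]
ir = optimize(to_ir(frontend_parse("m=mul(x,y); s=add(m,z)")))
print(select_backend(ir, backends))  # S450 would then emit operator code
```

In the full scheme, S450 would hand the IR to the selected backend, which converts it into code runnable on the corresponding processor.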
According to the method for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among multiple processors. In this way, the selected processor can be matched to the target program, and manual labor time can be reduced.
Based on the foregoing method, FIG. 8 is a schematic diagram of the logical architecture of an apparatus 500 for selecting a processor to which an embodiment of this application is applicable. The apparatus for selecting a processor may be configured on a computing device that includes multiple processors, or the apparatus for selecting a processor may itself be one of the multiple processors. As shown in FIG. 8, the apparatus 500 for selecting a processor may include an identification unit 510, an analysis unit 520, and a selection unit 530.
The identification unit 510 may be configured to perform the method in S210 or S310 above; that is, the identification unit 510 may obtain the hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor, and/or the hardware information is used to indicate the size of the processor's available memory space. The specific processing of the identification unit 510 may be similar to the processing described in S210 or S310 above; to avoid repetition, a detailed description thereof is omitted here.
The analysis unit 520 may be configured to perform the method in S220 or S320 above; that is, the analysis unit 520 may obtain the program information of the target program, where the program information is used to indicate the instructions in the target program, and/or the program information is used to indicate the memory space that the target program needs to occupy. The specific processing of the analysis unit 520 may be similar to the processing described in S220 or S320 above; to avoid repetition, a detailed description thereof is omitted here.
The selection unit 530 may be configured to perform the method in S230 or S330 above; that is, the selection unit 530 determines, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, where the target processor is a processor, among the at least two processors, that satisfies a preset condition, and the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program, and/or that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy. The specific processing of the selection unit 530 may be similar to the processing described in S230 or S330 above; to avoid repetition, a detailed description thereof is omitted here. In addition, the selection unit 530 may further control the intermediate compiler to send the IR of the target program to the back-end compiler (backend) corresponding to the target processor.
It should be noted that, in this application, the actions and functions of the identification unit 510, the analysis unit 520, and the selection unit 530 described above may be implemented by the same virtual machine or the same processor; alternatively, they may be implemented separately by multiple different virtual machines or multiple different processors.
For concepts, explanations, detailed descriptions, and other steps related to the apparatus 500 and the technical solutions provided in the embodiments of this application, refer to the descriptions of these contents in the foregoing method or other embodiments; details are not repeated here. According to the apparatus for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among multiple processors. In this way, the selected processor can be matched to the target program without manually specifying the processor, which can improve the processing efficiency of the computer device and reduce the burden on the programmer.
Based on the foregoing method, FIG. 9 is a schematic diagram of the logical architecture of a compiling apparatus 600 to which an embodiment of this application is applicable. As shown in FIG. 9, the compiling apparatus 600 may include a front-end compilation unit 610, an intermediate compilation unit 620, a selection unit 630, and multiple back-end compilation units 640, where the multiple back-end compilation units 640 correspond one-to-one to multiple processors (or computing units, computing platforms, or processing units). The selection unit 630 may include an identification module 632, an analysis module 634, and a selection module 636. The front-end compilation unit 610 may provide a DSL interface through which a developer writes the DSL corresponding to an operator; the actions performed by the front-end compilation unit 610 may be similar to those performed by the front-end compiler described above and, to avoid repetition, a description thereof is omitted here. The intermediate compilation unit 620 is communicatively connected to the front-end compilation unit 610 and is configured to obtain the DSL from the front-end compilation unit 610, generate the intermediate expression IR from the DSL, and optimize the intermediate expression IR; the actions performed by the intermediate compilation unit 620 may be similar to those performed by the intermediate compiler described above and, to avoid repetition, a description thereof is omitted here.
The identification module 632 may be configured to perform the method in S210 or S310 above; that is, the identification module 632 may obtain the hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor, and/or the hardware information is used to indicate the size of the processor's available memory space. The specific processing of the identification module 632 may be similar to the processing described in S210 or S310 above; to avoid repetition, a detailed description thereof is omitted here.
The analysis module 634 is communicatively connected to the intermediate compilation unit 620 and is configured to obtain the IR from the intermediate compilation unit 620, and may further be configured to perform the method in S220 or S320 above; that is, the analysis module 634 may obtain the program information of the target program, where the program information is used to indicate the instructions in the target program, and/or the program information is used to indicate the memory space that the target program needs to occupy. The specific processing of the analysis module 634 may be similar to the processing described in S220 or S320 above; to avoid repetition, a detailed description thereof is omitted here.
The selection module 636 may be communicatively connected to the identification module 632 and the analysis module 634, so as to obtain the hardware information from the identification module 632 and the program information from the analysis module 634, and may further be configured to perform the method in S230 or S330 above; that is, the selection module 636 determines, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, where the target processor is a processor, among the at least two processors, that satisfies a preset condition, and the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program, and/or that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy. The specific processing of the selection module 636 may be similar to the processing described in S230 or S330 above; to avoid repetition, a detailed description thereof is omitted here. In addition, the selection module 636 may further control the intermediate compiler to send the IR of the target program to the back-end compilation unit 640 corresponding to the target processor. The back-end compilation unit 640 can convert the IR into code that can run on the corresponding processor; the actions performed by the back-end compilation unit 640 may be similar to those performed by the back-end compiler described above and, to avoid repetition, a description thereof is omitted here.
According to the compiling apparatus provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among multiple processors. In this way, the selected processor can be matched to the target program without manually specifying the processor, which can improve the processing efficiency of the computer device and reduce the burden on the programmer.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered as going beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is merely a division by logical function, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (17)

  1. A method for selecting a processor, wherein the method comprises:
    obtaining hardware information of each of at least two processors, wherein the hardware information is used to indicate an instruction set corresponding to each processor;
    obtaining program information of a target program to be executed, wherein the program information is used to indicate instructions in the target program; and
    determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and is capable of executing the target program, wherein the preset condition comprises that an instruction set corresponding to a processor comprises the instructions in the target program.
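The selection condition of claim 1 — pick a processor whose instruction set covers every instruction in the target program — can be sketched as follows. This is an illustrative sketch only, not part of the claims; the dictionary-based data model and the example instruction names are hypothetical stand-ins for the "hardware information" and "program information" described above.

```python
# Illustrative sketch (not part of the claims): select a target processor
# whose instruction set covers the instructions used by the target program.

def select_processor(processors, program_instructions):
    """Return the first processor whose instruction set includes every
    instruction of the target program, or None if no processor qualifies."""
    required = set(program_instructions)
    for name, instruction_set in processors.items():
        if required <= set(instruction_set):  # preset condition: full coverage
            return name
    return None

# Hypothetical hardware information: processor name -> supported instructions.
processors = {
    "NPU": {"matmul", "conv2d", "relu"},
    "GPU": {"matmul", "conv2d", "relu", "sort"},
    "CPU": {"matmul", "conv2d", "relu", "sort", "branch"},
}
print(select_processor(processors, ["matmul", "relu"]))  # -> NPU
```

Note that when several processors satisfy the condition, this sketch simply takes the first one encountered; claims 2 and 3 refine that choice with an explicit priority order.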
  2. The method according to claim 1, wherein the determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and is capable of executing the target program comprises:
    determining a priority of each of the at least two processors; and
    determining, based on the program information and the hardware information and in descending order of the priorities of the at least two processors, whether each of the at least two processors satisfies the preset condition, and using the first processor that satisfies the preset condition as the target processor.
  3. The method according to claim 2, wherein the determining a priority of each of the at least two processors comprises:
    determining the priority of each processor according to at least one of a parallel computing capability or power consumption of each of the at least two processors.
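Claims 2 and 3 describe scanning the candidates in descending priority, where priority may be derived from parallel computing capability or power consumption. A minimal sketch follows; the scoring formula, field names, and numbers are arbitrary examples chosen for illustration, not specified by the claims.

```python
# Illustrative sketch: rank processors by a priority derived from parallel
# computing capability and power consumption (claim 3), then scan from highest
# to lowest priority and take the first one satisfying the preset condition
# (claim 2). The weighting below is an assumed example.

def priority(proc):
    # Higher parallel capability raises priority; higher power draw lowers it.
    return proc["parallel_ops"] - 0.5 * proc["power_watts"]

def select_by_priority(procs, required):
    ranked = sorted(procs, key=priority, reverse=True)  # high to low
    for proc in ranked:
        if set(required) <= proc["isa"]:  # preset condition from claim 1
            return proc["name"]
    return None

procs = [
    {"name": "CPU", "parallel_ops": 8,    "power_watts": 65,  "isa": {"matmul", "branch"}},
    {"name": "GPU", "parallel_ops": 1024, "power_watts": 150, "isa": {"matmul", "conv2d"}},
    {"name": "NPU", "parallel_ops": 512,  "power_watts": 10,  "isa": {"matmul", "conv2d"}},
]
print(select_by_priority(procs, ["matmul", "conv2d"]))  # -> GPU
```

With these example numbers the general-purpose CPU scores lowest, which matches the arrangement of claim 4 where the CPU has the lowest priority and so serves as the fallback.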
  4. The method according to claim 2 or 3, wherein the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors.
  5. The method according to any one of claims 1 to 4, wherein the hardware information is further used to indicate a size of available memory space of a processor,
    the program information is further used to indicate the memory space required by the target program, and
    the preset condition further comprises that the available memory space of a processor is greater than or equal to the memory space required by the target program.
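Claim 5 extends the preset condition: besides instruction-set coverage, the processor's available memory must be at least the memory the target program needs. A sketch of the combined predicate, with hypothetical field names:

```python
# Illustrative sketch of the extended preset condition per claim 5:
# instruction-set coverage AND sufficient available memory.

def satisfies(proc, program):
    covers_isa = set(program["instructions"]) <= proc["isa"]
    fits_memory = proc["free_mem_bytes"] >= program["mem_bytes"]
    return covers_isa and fits_memory

proc = {"isa": {"matmul", "conv2d"}, "free_mem_bytes": 256 * 1024 * 1024}
small = {"instructions": ["matmul"], "mem_bytes": 64 * 1024 * 1024}
huge  = {"instructions": ["matmul"], "mem_bytes": 512 * 1024 * 1024}
print(satisfies(proc, small), satisfies(proc, huge))  # -> True False
```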
  6. The method according to any one of claims 1 to 5, wherein the at least two processors include at least two of the following processors:
    a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  7. The method according to any one of claims 1 to 6, wherein the obtaining program information of a target program comprises:
    determining the program information according to an intermediate representation (IR) of the target program, wherein the IR of the target program is determined according to domain description language (DSL) code of the target program.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises:
    inputting the IR of the target program to a target back-end compiler corresponding to the target processor.
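Claims 7 and 8 together describe a pipeline: derive an IR from the program's DSL code, read the program information from that IR, select a target processor, and hand the IR to the back-end compiler for that processor. The toy sketch below illustrates the flow; the one-instruction-per-statement "front end", the back-end names, and all function names are hypothetical stand-ins.

```python
# Illustrative end-to-end flow for claims 7 and 8. All names are hypothetical.

def dsl_to_ir(dsl_code):
    # Toy "front end": one IR instruction per DSL statement, e.g.
    # "conv2d(x, w)" -> "conv2d".
    return [line.split("(")[0] for line in dsl_code.strip().splitlines()]

def program_info_from_ir(ir):
    # Program information per claim 7: the instructions used by the program.
    return {"instructions": set(ir)}

# Hypothetical mapping from target processor to its back-end compiler.
BACKENDS = {"GPU": "gpu-backend-compiler", "CPU": "cpu-backend-compiler"}

def compile_for_target(dsl_code, processors):
    ir = dsl_to_ir(dsl_code)
    info = program_info_from_ir(ir)
    for name, isa in processors.items():      # assume already priority-ordered
        if info["instructions"] <= isa:
            return name, BACKENDS[name], ir   # IR goes to the target's back end
    raise RuntimeError("no processor supports this program")

dsl = """conv2d(x, w)
relu(y)"""
target, backend, ir = compile_for_target(
    dsl, {"GPU": {"conv2d", "relu"}, "CPU": {"conv2d", "relu", "branch"}})
print(target, backend)  # -> GPU gpu-backend-compiler
```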
  9. An apparatus for selecting a processor, wherein the apparatus comprises:
    an identification module, configured to obtain hardware information of each of at least two processors, wherein the hardware information is used to indicate an instruction set corresponding to the processor;
    an analysis module, configured to obtain program information of a target program to be executed, wherein the program information is used to indicate instructions in the target program; and
    a selection module, configured to determine, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and is capable of executing the target program, wherein the preset condition comprises that an instruction set corresponding to a processor comprises the instructions in the target program.
  10. The apparatus according to claim 9, wherein the selection module is configured to determine a priority of each of the at least two processors, and to determine, based on the program information and the hardware information and in descending order of the priorities of the at least two processors, whether each of the at least two processors satisfies the preset condition, and to use the first processor that satisfies the preset condition as the target processor.
  11. The apparatus according to claim 10, wherein the selection module is configured to determine the priority of each processor according to at least one of a parallel computing capability or power consumption of each of the at least two processors.
  12. The apparatus according to claim 10 or 11, wherein the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors.
  13. The apparatus according to any one of claims 9 to 12, wherein the hardware information is further used to indicate a size of available memory space of a processor,
    the program information is further used to indicate the memory space required by the target program, and
    the preset condition further comprises that the available memory space of a processor is greater than or equal to the memory space required by the target program.
  14. The apparatus according to any one of claims 9 to 13, wherein the at least two processors include at least two of the following processors:
    a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  15. The apparatus according to any one of claims 9 to 14, wherein the analysis module is configured to determine the program information according to an intermediate representation (IR) of the target program, wherein the IR is determined according to domain description language (DSL) code of the target program.
  16. The apparatus according to any one of claims 9 to 15, wherein
    the selection module is further configured to provide the IR of the target program to a target back-end compiler corresponding to the target processor.
  17. A computer-readable storage medium, comprising a computer program that, when run on a computer device or a processor, causes the computer device or the processor to perform the method according to any one of claims 1 to 8.
PCT/CN2018/108459 2018-09-28 2018-09-28 Method and device for selecting processor WO2020062086A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880094887.1A CN112292667B (en) 2018-09-28 2018-09-28 Method and apparatus for selecting processor
PCT/CN2018/108459 WO2020062086A1 (en) 2018-09-28 2018-09-28 Method and device for selecting processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/108459 WO2020062086A1 (en) 2018-09-28 2018-09-28 Method and device for selecting processor

Publications (1)

Publication Number Publication Date
WO2020062086A1 (en)

Family

ID=69949840

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/108459 WO2020062086A1 (en) 2018-09-28 2018-09-28 Method and device for selecting processor

Country Status (2)

Country Link
CN (1) CN112292667B (en)
WO (1) WO2020062086A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951363A (en) * 2020-07-16 2020-11-17 广州玖的数码科技有限公司 Cloud computing chain-based rendering method and system and storage medium
CN113778984A (en) * 2021-08-16 2021-12-10 维沃移动通信(杭州)有限公司 Processing component selection method and device
CN115330587A (en) * 2022-02-22 2022-11-11 摩尔线程智能科技(北京)有限责任公司 Distributed storage interconnection structure, video card and memory access method of graphics processor
CN115391053A (en) * 2022-10-26 2022-11-25 北京云迹科技股份有限公司 Online service method and device based on CPU and GPU hybrid calculation
CN115600664A (en) * 2022-09-28 2023-01-13 美的集团(上海)有限公司(Cn) Operator processing method, electronic device and storage medium
CN117032999A (en) * 2023-10-09 2023-11-10 之江实验室 CPU-GPU cooperative scheduling method and device based on asynchronous running
CN117076330A (en) * 2023-10-12 2023-11-17 北京开源芯片研究院 Access verification method, system, electronic equipment and readable storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988194B (en) * 2021-03-29 2023-12-15 北京市商汤科技开发有限公司 Program optimization method and device based on equipment information, electronic equipment and storage medium
CN116450055B (en) * 2023-06-15 2023-10-27 支付宝(杭州)信息技术有限公司 Method and system for distributing storage area between multi-processing cards

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901207A (en) * 2010-07-23 2010-12-01 中国科学院计算技术研究所 Operating system of heterogeneous shared storage multiprocessor system and working method thereof
CN103167021A (en) * 2013-02-01 2013-06-19 浪潮(北京)电子信息产业有限公司 Resource allocation method and resource allocation device
US20150020206A1 (en) * 2013-07-10 2015-01-15 Raytheon BBN Technologies, Corp. Synthetic processing diversity with multiple architectures within a homogeneous processing environment
CN105138406A (en) * 2015-08-17 2015-12-09 浪潮(北京)电子信息产业有限公司 Task processing method, task processing device and task processing system



Also Published As

Publication number Publication date
CN112292667A (en) 2021-01-29
CN112292667B (en) 2022-04-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18935799; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18935799; Country of ref document: EP; Kind code of ref document: A1)