CN112292667A - Method and apparatus for selecting processor - Google Patents

Method and apparatus for selecting a processor

Info

Publication number: CN112292667A (application CN201880094887.1A)
Authority: CN (China)
Prior art keywords: processor, program, processors, target, information
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112292667B (en)
Inventors: 刘恺, 周小超, 庞俊
Current and original assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Events: publication of CN112292667A; application granted; publication of CN112292667B; anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/48 — Program initiating; program switching, e.g. by interrupt
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]

Abstract

The application provides a method and an apparatus for selecting a processor. The method includes: acquiring hardware information of each of at least two processors, where the hardware information indicates the instruction set corresponding to the processor; acquiring program information of a target program, where the program information indicates the instructions in the target program; and, according to the program information and the hardware information, determining from the at least two processors a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the instruction set corresponding to the processor contains the instructions in the target program. In this way, the processing efficiency of the computer device can be improved and the burden on the programmer reduced.

Description

Method and apparatus for selecting a processor

Technical Field
The present application relates to the field of computers, and more particularly, to a method and apparatus for selecting a processor and a computer device.
Background
As computer technology has evolved, heterogeneous architectures have emerged in which a computer device includes both a general-purpose processor (e.g., a central processing unit, CPU) and a coprocessor (e.g., a graphics processor). The coprocessor provides more parallel computing power and improves computing speed. Besides offering stronger thread-level or data-level parallelism, coprocessors also have better energy efficiency (performance per watt) than general-purpose processors. A coprocessor can therefore compensate for the limited computing power of the CPU while reducing the overall energy consumption of the system.
At present, heterogeneous architectures are widely used in fields such as neural networks (NN) and machine learning (ML). To make it easier for programmers to write software for such systems, compilation techniques have been proposed; for example, operator code can be generated automatically by a compiler.
Compilation is the process of converting a program written in one programming language (the source language) into another (the target language). The source language is the language the user writes the target program in, and the target language is the language of the processor in the heterogeneous system that the user wishes to select to run the target program. A compiler is typically structured as a front end, an intermediate representation, and a back end. The front end converts the source program into the intermediate representation: the user first describes the computation of an operator in a domain-specific language (DSL), which serves as the front-end source program; after the front-end processing steps it is optimized and passed to the back end as an intermediate representation (IR). The code generator in the back end then converts the IR into code for the designated target processor (e.g., a general-purpose processor or a coprocessor).
However, this technique requires the programmer to manually specify, at an early stage (for example, while describing the operator's computation in the DSL), the target processor on which the operator will run. Due to factors such as hardware instruction support, data alignment, computational efficiency, and coordination with surrounding operators, the processor specified by the programmer may not be the most suitable one for running the target program, which reduces processing efficiency. Moreover, manually designating a target processor increases the programmer's workload.
Disclosure of Invention
The application provides a method and an apparatus for selecting a processor, which can improve the processing efficiency of a computer device and reduce the burden on programmers.
In a first aspect, a method for selecting a processor is provided. The method includes: acquiring hardware information of each of at least two processors, where the hardware information indicates the instruction set corresponding to each processor; acquiring program information of a target program to be executed, where the program information indicates the instructions in the target program; and, according to the program information and the hardware information, determining from the at least two processors a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the instruction set corresponding to the processor contains the instructions in the target program.
According to this method, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected on that basis. The selected processor therefore matches the target program, and no processor needs to be specified manually, so the processing efficiency of the computer device can be improved and the burden on the programmer reduced.
Here, the "instruction set corresponding to the processor" may be understood as the set of functions the processor can process, so the hardware information may indicate the functions (for example, function names) the processor can process. Likewise, the "instructions in the target program" may be understood as the functions the target program contains, and the program information indicates those functions (for example, function names).
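As a minimal illustration of this matching step (all names here — `meets_condition`, `hardware_info`, `program_info` — are hypothetical, not from the patent), the preset condition of the first aspect can be sketched as a set-containment check:

```python
def meets_condition(processor_instruction_set, program_instructions):
    """Preset condition of the first aspect: every instruction
    (function) used by the target program must appear in the
    processor's corresponding instruction set."""
    return set(program_instructions) <= set(processor_instruction_set)

# Hypothetical hardware information for two processors:
# the functions each processor can process.
hardware_info = {
    "gpu": {"matmul", "conv2d", "relu"},
    "cpu": {"matmul", "conv2d", "relu", "fft", "sort"},
}
# Hypothetical program information: functions used by the target program.
program_info = {"matmul", "relu"}

# Processors that satisfy the preset condition for this program.
candidates = [p for p, isa in hardware_info.items()
              if meets_condition(isa, program_info)]
```

Both processors qualify here; the patent's further conditions (priority, memory) would then pick one among them.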
Optionally, determining, according to the program information and the hardware information, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority for each of the at least two processors; and, based on the program information and the hardware information, checking the processors in descending order of priority against the preset condition and taking the first processor that satisfies it as the target processor. Setting priorities allows the selection to be customized, adapts flexibly to different processing scenarios, improves the efficiency of determining the target processor, and shortens the time it takes.
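The priority-ordered selection described above can be sketched as follows (a hypothetical illustration; `select_target` and the sample priorities are not from the patent):

```python
def select_target(processors, priority, satisfies):
    """Check candidate processors from highest to lowest priority and
    return the first one that satisfies the preset condition."""
    for name in sorted(processors, key=lambda p: priority[p], reverse=True):
        if satisfies(name):
            return name
    return None  # no processor satisfies the preset condition

# Hypothetical example: the NPU has the highest priority, the CPU the
# lowest (as the patent suggests for CPUs).
priority = {"npu": 3, "gpu": 2, "cpu": 1}
# Whether each processor's instruction set covers the target program.
supported = {"npu": False, "gpu": True, "cpu": True}
target = select_target(priority, priority, lambda p: supported[p])
```

The NPU is checked first but fails the condition, so the GPU is chosen rather than the lowest-priority CPU.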
Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among them. This ensures that at least one processor capable of running the target program exists among the candidates; and since the CPU's power consumption is high, giving it the lowest priority increases the likelihood that a coprocessor is selected as the target processor, further improving the effect and practicality of the application.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Wherein the ASIC may perform the calculations by software.
Optionally, the hardware information further indicates the size of the processor's available memory space, the program information further indicates the memory space the target program needs to occupy, and the preset condition further includes that the processor's available memory space is greater than or equal to the memory space the target program needs. The available space of a processor may refer to a prescribed proportion (for example, 90%) of the processor's total memory space, or alternatively to a prescribed proportion of its total free memory space.
Here, "the preset condition further includes that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy" may mean that the condition is satisfied as long as the processor's available memory covers the target program alone.
Alternatively, it may mean that the processor's available memory must cover all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
Because the preset condition further requires that the processor's available memory space be at least the memory space the target program needs, the selected target processor is guaranteed to support running the target program, which further improves the practicality of the application.
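The two readings of the memory condition above can be captured in one check (a hypothetical sketch; `memory_condition` and the sample figures are illustrative only):

```python
def memory_condition(available, program_mem, other_mem=0, strict=False):
    """Memory part of the preset condition, under two interpretations:
    - strict=False: available memory only needs to cover the target
      program itself;
    - strict=True: available memory must cover all programs the target
      processor executes, including the target program."""
    needed = program_mem + (other_mem if strict else 0)
    return available >= needed

# Hypothetical figures: "available" memory taken as a prescribed 90%
# of an 8 GiB total, as in the proportion example above.
available = int(8 * 2**30 * 0.9)
ok = memory_condition(available, program_mem=2 * 2**30)
```

Under the strict reading, a processor already running other programs may fail the condition even though the target program alone would fit.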
Optionally, acquiring the program information of the target program includes: determining, according to the data dimensions of the target program, the memory space the target program needs to occupy. In this way, that memory space can be determined easily.
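Deriving the required memory from data dimensions can be sketched as multiplying out each tensor's shape (a hypothetical illustration; `required_memory` and the float32 element size are assumptions, not from the patent):

```python
from functools import reduce
from operator import mul

def required_memory(shapes, dtype_size=4):
    """Estimate the memory the target program needs from the data
    dimensions (shapes) of its operands; dtype_size is bytes per
    element (4 for float32)."""
    return sum(reduce(mul, shape, 1) * dtype_size for shape in shapes)

# A program operating on a 1024x1024 matrix and a 1024-element vector:
mem = required_memory([(1024, 1024), (1024,)])
```

The result can then be compared against each processor's available memory space as in the preset condition.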
Optionally, acquiring the program information of the target program includes: determining the program information according to an intermediate representation (IR) of the target program, where the IR is determined according to the domain-specific language (DSL) code of the target program.
The DSL code may be processed by a front-end compiler in the computer device, and the IR may be produced by an intermediate compiler in the computer device. The program information can thus be obtained easily.
Optionally, the obtaining hardware information of each of the at least two processors: hardware information of each processor is obtained according to registration information of each processor, and the registration information is used for registering the processor in the computing device. The registration information may include hardware description information. Also, the hardware description information may be obtained offline by the computer device prior to installation of the processor. Alternatively, the hardware description information may be acquired by the computer device from the driver information of the processor when the processor is installed.
Optionally, the computer device includes at least two backend compilers, the at least two backend compilers correspond to the at least two processors one to one, and each backend compiler is configured to convert the IR into a code that can be recognized by the corresponding processor.
In this case, the method further includes: the IR of the target program is input to a target back-end compiler corresponding to the target processor. Wherein, the IR of the target program may be the IR after the IR optimization processing.
In a second aspect, a method for selecting a processor is provided. The method includes: acquiring hardware information of each of at least two processors, where the hardware information indicates the size of the processor's available memory space; acquiring program information of a target program, where the program information indicates the memory space the target program needs to occupy; and, according to the program information and the hardware information, determining from the at least two processors a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the processor's available memory space is greater than or equal to the memory space the target program needs.
According to this method, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected on that basis. The selected processor matches the target program and no processor needs to be specified manually, so the processing efficiency of the computer device can be improved and the burden on the programmer reduced. The available space of a processor may refer to a prescribed proportion (for example, 90%) of the processor's total memory space, or alternatively to a prescribed proportion of its total free memory space.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Here, "the preset condition further includes that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy" may mean that the condition is satisfied as long as the processor's available memory covers the target program alone.
Alternatively, it may mean that the processor's available memory must cover all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
Optionally, acquiring the program information of the target program includes: determining, according to the data dimensions of the target program, the memory space the target program needs to occupy. In this way, that memory space can be determined easily.
Optionally, the hardware information further indicates the instruction set corresponding to the processor, the program information further indicates the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor contains the instructions in the target program. Here, the "instruction set corresponding to the processor" may be understood as the set of functions the processor can process, so the hardware information may indicate those functions (for example, function names); likewise, the "instructions in the target program" may be understood as the functions the target program contains, which the program information indicates (for example, by function names).
Optionally, determining, according to the program information and the hardware information, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority for each of the at least two processors; and, based on the program information and the hardware information, checking the processors in descending order of priority against the preset condition and taking the first processor that satisfies it as the target processor.
Setting priorities allows the selection to be customized, adapts flexibly to different processing scenarios, improves the efficiency of determining the target processor, and shortens the time it takes.
Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among them. This ensures that at least one processor capable of running the target program exists among the candidates; and since the CPU's power consumption is high, giving it the lowest priority increases the likelihood that a coprocessor is selected as the target processor, further improving the effect and practicality of the application.
Optionally, acquiring the program information of the target program includes: determining the program information according to an intermediate representation (IR) of the target program, where the IR is determined according to the domain-specific language (DSL) code of the target program. The DSL code may be processed by a front-end compiler in the computer device, and the IR may be produced by an intermediate compiler in the computer device. The program information can thus be obtained easily.
Optionally, the obtaining hardware information of each of the at least two processors: hardware information of each processor is obtained according to registration information of each processor, and the registration information is used for registering the processor in the computing device. The registration information may include hardware description information. Also, the hardware description information may be obtained offline by the computer device prior to installation of the processor. Alternatively, the hardware description information may be acquired by the computer device from the driver information of the processor when the processor is installed.
Optionally, the computer device includes at least two backend compilers, the at least two backend compilers correspond to the at least two processors one to one, and each backend compiler is configured to convert the IR into a code that can be recognized by the corresponding processor.
In this case, the method further includes: the IR of the target program is input to a target back-end compiler corresponding to the target processor. Wherein, the IR of the target program may be the IR after the IR optimization processing.
In a third aspect, an apparatus for selecting a processor is provided. The apparatus includes: an identification module, configured to acquire hardware information of each of at least two processors, where the hardware information indicates the instruction set corresponding to the processor; an analysis module, configured to acquire program information of a target program to be executed, where the program information indicates the instructions in the target program; and a selection module, configured to determine, according to the program information and the hardware information, a target processor from the at least two processors that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the instruction set corresponding to the processor contains the instructions in the target program.
According to this apparatus, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected on that basis. The selected processor matches the target program and no processor needs to be specified manually, so the processing efficiency of the computer device can be improved and the burden on the programmer reduced.
Here, the "instruction set corresponding to the processor" may be understood as the set of functions the processor can process, so the hardware information may indicate the functions (for example, function names) the processor can process. Likewise, the "instructions in the target program" may be understood as the functions the target program contains, and the program information indicates those functions (for example, function names).
Optionally, the selection module is configured to determine a priority for each of the at least two processors, check the processors in descending order of priority against the preset condition based on the program information and the hardware information, and take the first processor that satisfies the preset condition as the target processor. Setting priorities allows the selection to be customized, adapts flexibly to different processing scenarios, improves the efficiency of determining the target processor, and shortens the time it takes.
Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among them. This ensures that at least one processor capable of running the target program exists among the candidates; and since the CPU's power consumption is high, giving it the lowest priority increases the likelihood that a coprocessor is selected as the target processor, further improving the effect and practicality of the application.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Wherein the ASIC may perform the calculations by software.
Optionally, the hardware information further indicates the size of the processor's available memory space, the program information further indicates the memory space the target program needs to occupy, and the preset condition further includes that the processor's available memory space is greater than or equal to the memory space the target program needs. The available space of a processor may refer to a prescribed proportion (for example, 90%) of the processor's total memory space, or alternatively to a prescribed proportion of its total free memory space. Here, the memory part of the preset condition may mean that the condition is satisfied as long as the processor's available memory covers the target program alone; alternatively, it may mean that the processor's available memory must cover all programs executed by the target processor, including the target program. Because the preset condition requires that the processor's available memory space be at least the memory space the target program needs, the selected target processor is guaranteed to support running the target program, which further improves the practicality of the application.
Optionally, the analysis module is configured to determine, according to the data dimensions of the target program, the memory space the target program needs to occupy. In this way, that memory space can be determined easily.
Optionally, the analysis module is configured to determine the program information according to an intermediate representation (IR) of the target program, where the IR is determined according to the domain-specific language (DSL) code of the target program. The program information can thus be obtained easily.
Optionally, the identification module is configured to obtain the hardware information of each processor according to its registration information, where the registration information is used to register the processor with the computer device. The registration information may include hardware description information, which may be obtained offline by the computer device before the processor is installed, or acquired from the processor's driver information when the processor is installed.
Optionally, the computer device includes at least two backend compilers, the at least two backend compilers correspond to the at least two processors one to one, and each backend compiler is configured to convert the IR into a code that can be recognized by the corresponding processor.
In this case, the selection module is configured to input the IR of the target program to the target back-end compiler corresponding to the target processor. The IR of the target program here may be the IR after optimization.
In a fourth aspect, an apparatus for selecting a processor is provided. The apparatus includes: an identification unit, configured to acquire hardware information of each of at least two processors, where the hardware information indicates the size of the processor's available memory space; an analysis unit, configured to acquire program information of a target program, where the program information indicates the memory space the target program needs to occupy; and a selection unit, configured to determine, according to the program information and the hardware information, a target processor from the at least two processors that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the processor's available memory space is greater than or equal to the memory space the target program needs.
According to this apparatus, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected on that basis. The selected processor matches the target program and no processor needs to be specified manually, so the processing efficiency of the computer device can be improved and the burden on the programmer reduced. The available space of a processor may refer to a prescribed proportion (for example, 90%) of the processor's total memory space, or alternatively to a prescribed proportion of its total free memory space. Here, the memory part of the preset condition may mean that the condition is satisfied as long as the processor's available memory covers the target program alone; alternatively, it may mean that the processor's available memory must cover all programs executed by the target processor, including the target program.
Optionally, the analysis unit is configured to determine, according to the data dimension of the target program, a memory space that needs to be occupied by the target program. Thus, the memory space that the target program needs to occupy can be easily determined.
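As a sketch of how the analysis unit might derive the required memory from the data dimensions of the target program (the tensor names and element sizes below are illustrative assumptions, not part of this application):

```python
from math import prod

# Illustrative element sizes in bytes for assumed data types.
DTYPE_SIZE = {"float32": 4, "float64": 8, "int8": 1}

def required_memory(tensors):
    """Estimate the memory a program needs from its declared data dimensions:
    for each tensor, multiply the product of its dimensions by the element
    size, then sum over all tensors."""
    return sum(prod(shape) * DTYPE_SIZE[dtype]
               for shape, dtype in tensors.values())

needed = required_memory({
    "input":   ((1, 3, 224, 224), "float32"),   # 150528 elements x 4 bytes
    "weights": ((64, 3, 7, 7),    "float32"),   # 9408 elements x 4 bytes
})
```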
Optionally, the at least two processors comprise at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Optionally, the hardware information is further used to indicate an instruction set corresponding to the processor, the program information is further used to indicate an instruction in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instruction in the target program. Here, the "instruction set corresponding to the processor" may be understood as a function that the processor can process, and the hardware information may be used to indicate a function (for example, a function name) that the processor can process. Here, "instructions in the target program" may be understood as functions included in the target program, and the program information is used to indicate functions (for example, function names) included in the target program.
Optionally, the selecting unit is configured to determine a priority of each of the at least two processors, sequentially determine, in descending order of priority and based on the program information and the hardware information, whether each of the at least two processors satisfies the preset condition, and take the first processor that satisfies the preset condition as the target processor.
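A minimal sketch of this priority-ordered selection, assuming the memory-based preset condition; the priority values and processor names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Proc:
    name: str
    priority: int          # higher value = checked earlier
    available_memory: int  # in bytes

def select_by_priority(processors, required_memory):
    """Check the processors in descending order of priority and return the
    first one whose available memory satisfies the preset condition."""
    for p in sorted(processors, key=lambda q: q.priority, reverse=True):
        if p.available_memory >= required_memory:
            return p
    return None

procs = [
    Proc("CPU", priority=0, available_memory=16 * 2**30),  # CPU priority lowest
    Proc("GPU", priority=2, available_memory=8 * 2**30),
    Proc("NPU", priority=1, available_memory=1 * 2**30),
]
```

Because the scan stops at the first match, a small program lands on the highest-priority coprocessor, and the CPU is chosen only when no coprocessor qualifies.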
Optionally, the priority of each processor is determined according to the parallel computing capability of each of the at least two processors.
Optionally, the at least two processors include a central processing unit CPU, and the priority of the CPU among the at least two processors is lowest.
Accordingly, it can be ensured that a processor capable of processing the target program exists among the at least two processors. Moreover, since the power consumption of the CPU is high, setting the priority of the CPU to the lowest increases the possibility that a coprocessor is selected as the target processor, and thus the effectiveness and practicality of the present application can be further improved.
Optionally, the acquiring program information of the target program includes: acquiring domain-specific language (DSL) code of the target program; determining an intermediate representation (IR) from the DSL code; and determining the program information based on the IR. Thus, the program information can be easily obtained.
Optionally, the identification unit is configured to obtain hardware information of each processor according to registration information of each processor, where the registration information is used for registering the processor in the computing device. The registration information may include hardware description information. Also, the hardware description information may be obtained offline by the computer device prior to installation of the processor. Alternatively, the hardware description information may be acquired by the computer device from the driver information of the processor when the processor is installed.
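This registration-based lookup could be sketched as follows; the registry structure, processor name, and instruction names are all hypothetical:

```python
# Hypothetical registry kept by the computing device: when a processor is
# installed, its driver supplies hardware description information, which is
# stored as part of the registration information.
_registry = {}

def register_processor(name, hardware_description):
    """Record a processor's hardware description at registration time."""
    _registry[name] = hardware_description

def hardware_info(name):
    """The identification unit reads hardware information back from the
    registration information rather than probing the device directly."""
    return _registry[name]

register_processor("GPU0", {
    "available_memory": 8 * 2**30,
    "instruction_set": {"matmul", "conv2d", "relu"},   # illustrative names
})
```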
Optionally, the computer device includes at least two backend compilers, the at least two backend compilers correspond to the at least two processors one to one, and each backend compiler is configured to convert the IR into a code that can be recognized by the corresponding processor.
In this case, the selection unit is configured to control input of the IR of the target program to a target back-end compiler corresponding to the target processor. Wherein, the IR of the target program may be the IR after the IR optimization processing.
In a fifth aspect, a compiling apparatus is provided, configured in a computer device including at least two processors, the apparatus including: back-end compiling units in one-to-one correspondence with the processors, each configured to convert a received IR into code that can be recognized by the corresponding processor; a front-end compiling unit configured to acquire the DSL code corresponding to a target program; an intermediate compiling unit configured to determine the IR from the DSL code; and a selecting unit configured to determine program information of the target program according to the IR, acquire hardware information of each of the at least two processors, determine, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, and send the IR to the back-end compiling unit corresponding to the target processor.
Wherein the program information is used for indicating an instruction in the target program, the hardware information is used for indicating an instruction set corresponding to a processor, and the target processor is a processor of the at least two processors that meets a preset condition, where the preset condition includes that the instruction set corresponding to the processor includes the instruction in the target program; and/or the hardware information is used for indicating the size of the available memory space of the processor, the program information is used for indicating the memory space required to be occupied by the target program, and the preset condition comprises that the available memory space of the processor is larger than or equal to the memory space required to be occupied by the target program.
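A sketch of this combined preset condition, where the instruction-set check and the memory check can be enabled independently to reflect the "and/or" above (function names and instruction names are illustrative):

```python
def meets_preset_condition(hardware, program,
                           check_instructions=True, check_memory=True):
    """Evaluate the preset condition: the processor's instruction set must
    include every instruction (function) of the target program, and/or its
    available memory must be at least what the program needs to occupy."""
    if check_instructions and not program["instructions"] <= hardware["instruction_set"]:
        return False                     # some program instruction unsupported
    if check_memory and not hardware["available_memory"] >= program["memory"]:
        return False                     # not enough available memory
    return True

hw = {"instruction_set": {"matmul", "conv2d", "relu"},   # illustrative names
      "available_memory": 2**30}
prog_ok  = {"instructions": {"matmul", "relu"}, "memory": 2**20}
prog_bad = {"instructions": {"fft"},            "memory": 2**20}
```

Here the subset operator `<=` on Python sets expresses "the instruction set corresponding to the processor includes the instructions in the target program".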
According to the compiling apparatus provided by the application, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the processors based on the hardware information and the program information, so that the selected processor matches the target program and no processor needs to be manually specified; therefore, the processing efficiency of the computer device can be improved, and the burden on the programmer can be reduced.
Optionally, the selection unit is configured to determine a priority of each of the at least two processors; and determine, in descending order of priority and based on the program information and the hardware information, whether each processor meets the preset condition, taking the first processor that meets the preset condition as the target processor. By setting priorities for the processors, personalized processing can be realized, different processing scenarios can be flexibly handled, the efficiency of determining the target processor can be improved, and the time for determining the target processor can be shortened.
Optionally, the priority of each processor is determined according to the parallel computing capability of each of the at least two processors.
Optionally, the at least two processors include a central processing unit CPU, and the priority of the CPU among the at least two processors is lowest. Accordingly, it can be ensured that a processor capable of processing the target program exists among the at least two processors. Moreover, since the power consumption of the CPU is high, setting the priority of the CPU to the lowest increases the possibility that a coprocessor is selected as the target processor, and thus the effectiveness and practicality of the present application can be further improved.
Optionally, the at least two processors comprise at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Optionally, the identification unit is configured to obtain hardware information of each processor according to registration information of each processor, where the registration information is used for registering the processor in the computing device. The registration information may include hardware description information. Also, the hardware description information may be obtained offline by the computer device prior to installation of the processor. Alternatively, the hardware description information may be acquired by the computer device from the driver information of the processor when the processor is installed.
A sixth aspect provides a computer apparatus comprising a plurality of processors, a compiler, and a selection means, the selection means performing the method of the first aspect and any of its possible implementations, or the method of the second aspect and any of its possible implementations. For example, the compiler includes a front-end compiler, an intermediate compiler, and a back-end compiler.
In a seventh aspect, a chip or a chipset is provided, which includes at least one processor and at least one memory control unit, where the processor executes the method in the first aspect and any one of the possible implementations thereof, or the method in the second aspect and any one of the possible implementations thereof. Wherein the chip or chip set may comprise a smart chip. The smart chip may include at least two processors.
In an eighth aspect, there is provided a computer system comprising a processor and a memory, the processor comprising at least two processors and a memory control unit, the processor performing the method of the first aspect and any of its possible implementations, or the method of the second aspect and any of its possible implementations.
Optionally, the computer system further comprises a system bus for connecting the processor (in particular, the memory control unit) and the memory.
In a ninth aspect, there is provided a computer program product, the computer program product comprising: a computer program (also referred to as code, or instructions), which, when executed by a processor in a processor or chip, causes the processor to perform the method of the first aspect and any of its possible implementations, or the second aspect and any of its possible implementations.
A tenth aspect provides a computer-readable medium storing a computer program (which may also be referred to as code, or instructions) which, when run on a processor in a processor or chip, causes the processor to perform the method of the first aspect and any of its possible implementations, or the second aspect and any of its possible implementations.
Drawings
Fig. 1 is a schematic hardware configuration diagram of a computer device (or a computer system) to which the method and apparatus for selecting a processor according to the embodiment of the present application are applied.
Fig. 2 is a schematic diagram of an example of the lexical analysis process of the present application.
Fig. 3 is a schematic diagram of an example of a parsing process of the present application.
FIG. 4 is a schematic diagram of an example of an intermediate code generation and optimization process of the present application.
FIG. 5 is a schematic flow chart diagram of an example of a method of selecting a processor of the present application.
FIG. 6 is a schematic flow chart diagram of another example of a method of selecting a processor according to the present application.
Fig. 7 is a diagram showing an example of the compiling method of the present application.
Fig. 8 is a schematic configuration diagram of an example of the apparatus for selecting a processor according to the present application.
Fig. 9 is a schematic configuration diagram of an example of the compiling apparatus of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings. First, a computing device 100 for executing the method for selecting a processor according to the embodiment of the present application will be described in detail with reference to fig. 1.
A computing device, which may also be referred to as a computer system, may include, from a logical layering perspective, a hardware layer, an operating system layer running above the hardware layer, and an application layer running above the operating system layer. The hardware layer includes hardware such as a processor, a memory, and a memory control unit; the function and structure of this hardware are described in detail later. The operating system may be any one or more computer operating systems that implement business processing through processes (processes), such as a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a Windows operating system. The application layer includes application programs such as a browser, an address book, word-processing software, and instant-messaging software. In addition, in the embodiment of the present application, the computer system may be a handheld device such as a smartphone, or a terminal device such as a personal computer; the present application is not particularly limited, as long as the program code of the method of the embodiment of the present application can be read and executed. The execution subject of the method of the embodiment of the present application may be a computer system, or a functional module in the computer system, such as a processor, that can call and execute a program.
In this application, a program or program code refers to a collection of ordered sets of instructions (or code) that can be employed to achieve a relatively specialized functionality. A process is a running process of a program and its data on a computer device. Programs are usually designed in a modular manner, i.e., the functions of the programs are subdivided and broken down into a plurality of smaller functional modules. The program includes at least one function, and the function is a code segment for implementing a functional module. Thus, a function is a basic unit of program function modularization and can also be regarded as a subroutine.
Fig. 1 is a schematic architecture diagram of a computing device 100 according to an embodiment of the present application. The computing device shown in fig. 1 is used to perform the method of selecting a processor. The computing device 100 may include: at least two processors 110, and a memory 120.
Optionally, the computing device 100 may further include a system bus, wherein the processor 110 and the memory 120 are respectively connected to the system bus. The processor 110 can access the memory 120 through the system bus; for example, the processor 110 can read and write data or execute code in the memory 120 through the system bus. The main function of the processor 110 is to interpret the instructions (or code) of a computer program and to process the data in computer software. The instructions of the computer program and the data in the computer software may be stored in the memory 120 or in the cache unit 116. In the embodiment of the present application, the processor 110 may be an integrated circuit chip, or a component thereof, having signal processing capability.
In this application, the processor 110 may fetch an instruction from memory or cache, place it into an instruction register, and decode it. The processor decomposes the instruction into a series of micro-operations and then sends out various control commands to execute the micro-operation series, thereby completing the execution of one instruction. An instruction is a basic command by which a computer specifies the type of an operation to be executed and its operands. An instruction is composed of one or more bytes, including an opcode field, one or more fields for operand addresses, and some status words and feature codes that characterize the state of the machine. Some instructions also include the operands themselves directly.
By way of example, and not limitation, in the present application, processor 110 may include a memory control unit 114 and at least one processing unit 112.
The processing unit 112, which may also be referred to as a core (core), is the most important component of a processor. The processing unit 112 may be fabricated from single-crystal silicon in a certain manufacturing process; the calculation, command receiving, command storing, and data processing of the processor 110 are all executed by the core. Each processing unit 112 can independently execute program instructions, and the running speed of a program can be increased by using their parallel computing capability. Each kind of processor 110 has a fixed logic structure; for example, the processor 110 includes logic units such as a level-1 cache, a level-2 cache, an execution unit, an instruction-level unit, and a bus interface.
The memory control unit 114 is used for controlling data interaction between the memory 120 and the processing unit 112. In particular, memory control unit 114 may receive memory access requests from processing units 112 and control access to memory based on the memory access requests. By way of example and not limitation, in the embodiment of the present application, the memory control unit may be a Memory Management Unit (MMU) or other devices.
In the embodiment of the present application, each memory control unit 114 may address the memory 120 through the system bus. And an arbiter (not shown) may be configured in the system bus that may be responsible for handling and coordinating competing accesses by the plurality of processing units 112.
In the embodiment of the present application, the processing unit 112 and the memory control unit 114 may be communicatively connected through a connection line inside the chip, such as an address line, so as to implement communication between the processing unit 112 and the memory control unit 114.
Optionally, each processor 110 may further include a cache unit 116, where the cache unit 116 is a buffer for data exchange (referred to as a cache). When the processing unit 112 is going to read data, it first searches the cache unit 116 for the needed data; if the data is found, it is used directly, and if not, the memory 120 is searched. Since the cache unit 116 operates at a much faster speed than the memory 120, the function of the cache unit 116 is to help the processing unit 112 operate faster.
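The lookup order just described can be sketched as follows; the class and attribute names are illustrative, and hit/miss counters are added only to make the behavior visible:

```python
class CachedMemory:
    """Sketch of the lookup order described above: the processing unit checks
    the cache unit first and falls back to the slower memory on a miss."""

    def __init__(self, memory):
        self.memory = memory   # backing store, standing in for memory 120
        self.cache = {}        # standing in for cache unit 116
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.cache:          # found in the cache: use it directly
            self.hits += 1
            return self.cache[address]
        self.misses += 1                   # miss: go to memory and keep a copy
        value = self.memory[address]
        self.cache[address] = value
        return value

mem = CachedMemory({0x10: 42})
first = mem.read(0x10)    # first access misses and is fetched from memory
second = mem.read(0x10)   # second access hits the cache
```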
The memory 120 may provide an execution space for processes in the computing device 100; for example, the memory 120 may store the computer programs (specifically, the code of the programs) used to generate the processes, and may store data generated during the running of the processes, for example, intermediate data or process data. The memory may also be referred to as an internal memory, and its function is to temporarily store the operation data in the processor 110 and the data exchanged with an external memory such as a hard disk. As long as the computer is running, the processor 110 transfers the data on which it needs to operate into the memory 120 for the operation, and the processing unit 112 sends out the result after the operation is completed.
By way of example and not limitation, in the embodiments of the present application, the memory 120 may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-data-rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory 120 of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of storage.
It should be understood that the above-mentioned structure of the computing device 100 is only an exemplary illustration, and the present application is not limited thereto; the computing device 100 of the embodiment of the present application may include various hardware found in existing computer systems. For example, the computing device 100 may further include storage devices other than the memory 120, such as a disk storage device.
In embodiments of the present application, virtualization techniques may be applied on the computing device 100. By the virtualization technology, the computer device 100 can run a plurality of virtual machines simultaneously, each virtual machine can run at least one operating system, and each operating system runs a plurality of programs. A virtual machine (virtual machine) refers to a complete computer system with complete hardware system functions, which is simulated by software and runs in a completely isolated environment.
In the present application, the processors 110 may be of a plurality of kinds. For example, different kinds of processors may use different kinds of instructions. As another example, different kinds of processors may have different computing capabilities. As another example, different kinds of processors may be used to process different types of computation. By way of example and not limitation, in the present application the various processors may include a general-purpose processor and a coprocessor. The various processors are described in detail below.
A. General purpose processor
A general-purpose processor, which may also be referred to as a central processing unit (CPU), is an ultra-large-scale integrated circuit, or a component thereof, and is the arithmetic core (Core) and control unit (Control Unit) of a computer. Its functions are mainly to interpret computer instructions and to process data in computer software. The CPU mainly includes an arithmetic logic unit (ALU), a cache (Cache), and the data, control, and status bus (Bus) that connects them. Together with the internal memory (Memory) and the input/output (I/O) devices, it is called one of the three core components of an electronic computer. For example, the CPU includes an arithmetic logic unit, a register unit, a control unit, and the like.
The arithmetic logic component can perform fixed-point or floating-point arithmetic operations, shift operations, and logical operations, as well as address operations and translations.
The registers include general-purpose registers, special-purpose registers, and control registers. General-purpose registers may be divided into fixed-point and floating-point registers, and are used to hold register operands temporarily stored during instruction execution and intermediate (or final) operation results.
The control unit is primarily responsible for decoding instructions and issuing the control signals for the various operations to be performed to complete each instruction. Its structure has two types: one is a microprogram control mode with a micro-memory as its core; the other is a control mode based on hardwired logic. The micro-memory holds microcode, and each piece of microcode corresponds to a most basic micro-operation, also called a micro-instruction; each instruction is composed of a different sequence of microcode, and this sequence constitutes the microprogram. After the CPU decodes an instruction, it sends out control signals with a certain timing and, in the given order and with the micro-cycle as the beat, executes the several micro-operations determined by these pieces of microcode, thus completing the execution of the instruction. Simple instructions consist of 3 to 5 micro-operations, while complex instructions consist of tens or even hundreds of micro-operations.
B. Coprocessor
A coprocessor (coprocessor) is a chip, or a component within a chip, used to relieve the system microprocessor of certain processing tasks. A coprocessor is developed and applied to assist the central processing unit in completing processing work that the CPU cannot execute, or executes with low efficiency or poor effect. There are many tasks that the central processing unit cannot perform, such as signal transmission between devices and management of access devices; and it performs tasks such as graphics processing and audio processing with inferior efficiency and effect. In order to perform these kinds of processing, various auxiliary processors came into being. It should be noted that, since integer arithmetic units and floating-point arithmetic units are already integrated in present-day computers, a floating-point processor is not necessarily an auxiliary processor, and a coprocessor built into the CPU may not be an auxiliary processor either. Of course, a coprocessor may also be stand-alone.
In this application, coprocessors may be used for specific processing tasks; for example, a math coprocessor may handle digital processing, and a graphics coprocessor may handle video rendering. A coprocessor may be attached to a general-purpose processor, and extends the core processing functions of the general-purpose processor by extending the instruction set or providing configuration registers. One or more coprocessors may be connected to the general-purpose processor core through a coprocessor interface. For example, a coprocessor can also extend the instruction set by providing a specialized set of new instructions. By way of example and not limitation, a coprocessor may include, but is not limited to, at least one of the following:
B1. graphics processor
A graphics processing unit (GPU), also called a display core, a visual processor, or a display chip, is a microprocessor dedicated to image operations on personal computers, workstations, game machines, and some mobile devices (e.g., tablet computers, smartphones, etc.). The GPU is used to convert and drive the display information required by a computer system and to provide line-scanning signals to the display to control it correctly. It is an important element connecting the display and the personal computer mainboard, and is also one of the important devices for human-machine dialog. For example, the processor of a graphics card is sometimes referred to as the graphics processor (GPU); it is the "heart" of the graphics card, similar to a CPU, except that the GPU is designed specifically to perform the complex mathematical and geometric calculations required for graphics rendering. Some of the fastest GPUs integrate even more transistors than ordinary CPUs.
Most current GPUs have 2D or 3D graphics acceleration capabilities. If the CPU wants to draw a two-dimensional graphic, it only needs to send an instruction to the GPU, for example, "draw a rectangle of length a and width b at coordinate position (x, y)"; the GPU can then quickly calculate all the pixels of the graphic, draw the corresponding graphic at the specified position on the display, notify the CPU that drawing is complete, and wait for the CPU to send the next graphics instruction. With a GPU, the CPU is freed from graphics-processing tasks and can execute other system tasks, thereby greatly improving the overall performance of the computer. In addition, a GPU generates a large amount of heat, so a heat sink or fan is typically mounted above it.
The GPU is the brain of the graphics card; it determines the grade and most of the performance of the card, and is also the basis for distinguishing 2D graphics cards from 3D graphics cards. A 2D display chip mainly depends on the processing capability of the CPU when processing 3D images and special effects, which is called "soft acceleration". A 3D display chip concentrates the three-dimensional image and special-effect processing functions in the display chip itself, which is called "hardware acceleration". The display chip is typically the largest chip (and the one with the most pins) on the graphics card. At present, the GPU is no longer limited to 3D graphics processing; the development of general-purpose GPU computing technology has attracted much attention in the industry, and facts have proven that in floating-point operation, parallel computing, and some other computations, the GPU can provide tens or even hundreds of times the performance of the CPU. The GPU reduces the dependence of the computer device on the CPU and takes over part of the work originally performed by the CPU.
B2. Field programmable gate array
A field-programmable gate array (FPGA) is a product of further development on the basis of programmable devices such as programmable array logic (PAL), generic array logic (GAL), and complex programmable logic devices (CPLDs). The FPGA appeared as a semi-custom circuit in the field of application-specific integrated circuits (ASICs); it not only remedies the disadvantages of custom circuits but also overcomes the drawback of the limited number of gate circuits in the earlier programmable devices. The system designer can connect the logic blocks inside the FPGA through editable connections as needed, as if a circuit test board had been placed inside a chip. The logic blocks and connections of a finished FPGA after leaving the factory can be changed by the designer, so that the FPGA can complete the required logic function.
The FPGA adopts a logic cell array (LCA), which includes three parts: configurable logic blocks (CLBs), input/output blocks (IOBs), and an interconnect. FPGAs are programmable devices and, through different programming methods, can have structures different from those of conventional logic circuits and gate arrays (such as PAL, GAL, and CPLD devices). The FPGA uses small lookup tables (16 × 1 RAM) to realize combinational logic; each lookup table is connected to the input of a D flip-flop, and the flip-flops in turn drive other logic circuits or I/O circuits, thus forming basic logic unit modules that can realize both combinational and sequential logic functions, and these modules are connected to each other or to the I/O modules using metal interconnect wires. The logic of the FPGA is implemented by loading programming data into internal static memory cells; the values stored in the memory cells determine the logic function of the logic cells and the ways the modules are connected to each other or to the I/O, and ultimately the functions that the FPGA can implement. The FPGA allows an unlimited number of reprogrammings.
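The 16-entry lookup table mentioned above can realize any 4-input combinational function; the idea can be sketched in software as follows (the function chosen to "program" the table is arbitrary):

```python
def make_lut4(func):
    """Program a 16-entry lookup table (the 16 x 1 RAM above) so that it
    realizes an arbitrary 4-input combinational logic function."""
    table = [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
             for i in range(16)]

    def lut(a, b, c, d):
        # A hardware LUT does exactly this: the inputs form a RAM address.
        return table[(a << 3) | (b << 2) | (c << 1) | d]

    return lut

# "Load programming data": realize f(a, b, c, d) = (a XOR b) OR (c AND d).
lut = make_lut4(lambda a, b, c, d: (a ^ b) | (c & d))
```

Reprogramming amounts to rewriting the 16 stored bits, which mirrors how reloading an FPGA's configuration memory changes its logic function.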
It should be noted that, since the FPGA does not include an instruction set, the method 200 described below may not be used to determine whether the FPGA can be used as a target processor.
However, since the FPGA has memory space, the method 300 described below can be used to determine whether the FPGA can be the target processor.
B3. Neural network processor
Neural network processing units (NPUs) adopt a data-driven parallel computing architecture and are particularly good at processing the massive multimedia data of videos and images. The NPU can be used for deep learning; from a technical point of view, deep learning is actually a multilayer, large-scale artificial neural network, constructed by imitating biological neural networks and formed by interconnecting artificial neuron nodes. Neurons are connected pairwise through synapses, and a synapse records the weight, that is, the strength, of the connection between two neurons. Each neuron can be abstracted as a stimulus function whose input is determined by the outputs of the connected neurons and the synapses connecting them. In order to express specific knowledge, a user generally needs to adjust (through certain specific algorithms) the values of the synapses in the artificial neural network, the topology of the network, and so on. This process is called "learning". After learning, the artificial neural network can solve a particular problem through the learned knowledge.
The basic operations of deep learning are the processing of neurons and synapses. A conventional processor instruction set is designed for general-purpose computation; its basic operations are arithmetic (add, subtract, multiply, divide) and logic (AND, OR, NOT), so processing a single neuron often requires hundreds or even thousands of instructions, and deep-learning processing efficiency is therefore low. In contrast, NPU instructions directly target the processing of large-scale neurons and synapses: one instruction can complete the processing of a whole group of neurons, and the NPU provides a series of special supports for moving neuron and synapse data on chip. In addition, storage and processing in the neural network are integrated, both embodied in the synaptic weights.
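The contrast above, many scalar instructions per neuron versus one neuron-level operation, can be illustrated with a small sketch. NumPy's matrix-vector product stands in for the NPU's group-of-neurons instruction; the layer sizes and weights are made up for illustration:

```python
import numpy as np

def neuron_layer_scalar(weights, inputs, bias):
    """General-purpose style: one multiply-accumulate per synapse,
    issued as individual arithmetic instructions."""
    outputs = []
    for row in weights:                  # one neuron per weight row
        acc = 0.0
        for w, x in zip(row, inputs):    # one synapse at a time
            acc += w * x
        outputs.append(acc + bias)
    return outputs

def neuron_layer_vector(weights, inputs, bias):
    """NPU style: the whole group of neurons in one matrix-vector
    operation, standing in for a single neuron-level instruction."""
    return np.asarray(weights) @ np.asarray(inputs) + bias

weights = [[0.5, -1.0], [2.0, 0.25]]     # 2 neurons, 2 synapses each
inputs = [4.0, 8.0]
print(neuron_layer_scalar(weights, inputs, 1.0))   # [-5.0, 11.0]
print(neuron_layer_vector(weights, inputs, 1.0))
```

Both paths compute the same outputs; the vectorized path simply expresses the whole layer as one operation, which is the granularity the NPU instruction set works at.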
B4. Application specific integrated circuit
An Application Specific Integrated Circuit (ASIC) is an integrated circuit manufactured for a particular user or a particular electronic system. The universality and mass production of digital integrated circuits have greatly reduced the cost of electronic products and promoted the spread of computers, communications, and consumer electronics, but they have also created a tension between general-purpose and special-purpose design, and a gap between system design and circuit manufacture. Moreover, the larger the scale of an integrated circuit, the harder it is to change it to meet specific requirements when building a system. To solve these problems, application specific integrated circuits featuring user participation in the design appeared; they allow the whole system to be optimized, with superior performance and strong security. An ASIC may execute a software program, or may perform its calculations purely in hardware logic without executing one. For example, an ASIC that executes a software program may include one or more processor cores to execute instructions, and has a corresponding instruction set.
B5. Digital signal processor
Digital Signal Processing (DSP) is the theory and technique of representing and processing signals digitally; digital signal processing and analog signal processing are both subsets of signal processing. The purpose of digital signal processing is to measure or filter real-world continuous analog signals. A signal therefore has to be converted from the analog domain to the digital domain before digital processing, usually by an analog-to-digital converter, and the output of digital signal processing is often converted back to the analog domain by a digital-to-analog converter. A DSP chip is a dedicated chip for performing digital signal processing, a device that emerged with the development of microelectronics, digital signal processing techniques, and computer technology.
B6. Image processing unit
An Image Processing Unit (IPU), which may also be referred to as an image signal processor, can serve as the signal processing unit for a front-end image sensor, so as to match image sensors from different manufacturers. It can also provide full support for end-to-end data-stream signal processing, from image input (camera sensor, television signal input, etc.) to a display device (e.g., a liquid crystal display, TV output, or an external image processing unit).
It should be understood that the above listed processors are merely exemplary, and the present application is not limited thereto, for example, the processors in the present application may also include programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. By way of example, and not limitation, in the present application, the above-described architecture including various processors may be referred to as a heterogeneous architecture, or a heterogeneous system architecture.
In the internet industry, with the spread of informatization, the explosive growth of data volume has created new demands on storage, while the rise of machine learning, artificial intelligence, autonomous driving, industrial simulation, and similar fields means that general-purpose processors meet more and more performance bottlenecks when processing massive computation and massive data/images, such as low parallelism, insufficient bandwidth, and high latency. To meet the demand for computational diversity, more and more scenarios introduce hardware such as GPUs and FPGAs for acceleration, and heterogeneous computing has arisen accordingly. Heterogeneous Computing mainly refers to the computing mode of a system composed of computing units with different instruction-set types and different architectures: a mixed system formed from computing units such as CPUs, DSPs, GPUs, ASICs, coprocessors, and FPGAs. Heterogeneous computing is especially promising in the field of artificial intelligence. As is well known, AI places ultra-high requirements on computing power, and heterogeneous computing represented by the GPU has become a new-generation computing architecture for accelerating AI innovation.
In a Heterogeneous System Architecture (HSA), multiple processors work together. The CPU devotes most of its resources to caching and logic control (i.e., non-compute units) and only a small portion to computation, which makes it suitable for running serial programs characterized by branch-intensive code, irregular data structures, recursion, and the like. On top of the traditional multi-core architecture, it has become a trend to incorporate specialized computing modules into the system as accelerators, such as Graphics Processing Units (GPUs), Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs), and other programmable logic units (i.e., a heterogeneous multi-core architecture). With the goal of optimizing heterogeneous computation, the HSA defines a new system architecture and execution standard; its ultimate aim is that, through cooperation among the heterogeneous cores in an SoC (including CPU, GPU, DSP, and other processors), each architecture in the SoC performs to its fullest. A heterogeneous system architecture also enables uniform memory addressing by the multiple processors.
Parallel computing performed on heterogeneous computing systems is commonly referred to as heterogeneous computing. Heterogeneous computing has been defined from different perspectives; taken together, this embodiment gives the following definition: heterogeneous computing is a special form of parallel and distributed computing that accomplishes a computing task either with a single independent computer that supports both Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) modes, or with a set of independent computers interconnected by a high-speed network. It can coordinate the use of machines that are heterogeneous in performance and architecture to meet different computing needs, and enables code (or code segments) to execute in the way that achieves the best overall performance.
Heterogeneous computing is a parallel and distributed computing technique that best utilizes various computing resources by best matching the type of parallelism (code type) of a computing task to the type of computation that the machine can efficiently support (i.e., machine capabilities). The chip with the heterogeneous system architecture may be referred to as an Artificial Intelligence (AI) chip or an Accelerated Processing Unit (APU).
The method for selecting a processor according to the present application may select a processor for executing a target program from the various processors described above. As described above, in the present application, a processor runs the target program by executing the code of the target program. Heterogeneous processors may have different Instruction Set Architectures (ISAs); for example, they may have different instruction sets. An instruction set is stored or integrated in the processor in hardware form and is the hard-wired program that guides and optimizes the operation of the processor; through its instruction set, the processor can operate more efficiently.
To facilitate the programmer writing the program, compilation techniques may be used in this application. Compilation is the process of converting a program written in one programming language (the source language) to another (the target language). In the present application, the compiler used by the above-mentioned compiling technique may include, but is not limited to, the following structures:
A. a front-end compiler:
the front-end compiler implements the conversion from a source program (or source program code) to an intermediate representation (IR); that is, the user first describes the computation of an operator in a Domain Specific Language (DSL), which serves as the input of the front-end compiler. The processing of the front-end compiler mainly comprises lexical analysis, syntax analysis, and semantic analysis.
1) Lexical analysis obtains the corresponding token sequence from the character sequence. For example, for the code (or instruction or function) "b = 3 + 52 * a", the front-end compiler may obtain the token sequence shown in fig. 2.
2) Syntax analysis further derives an Abstract Syntax Tree (AST) from the token sequence. For example, the above token sequence may yield the syntax tree shown in fig. 3.
3) Semantic analysis identifies the type of variable, the scope of the operation, etc.
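The lexical-analysis step in 1) can be sketched as follows; the token categories here are illustrative, not necessarily those of fig. 2:

```python
import re

# Token categories for a tiny expression language; illustrative only.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Turn a character sequence into the corresponding token sequence."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":        # drop whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("b = 3 + 52 * a"))
# [('IDENT', 'b'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'),
#  ('NUMBER', '52'), ('OP', '*'), ('IDENT', 'a')]
```

Syntax analysis would then consume this token sequence to build the abstract syntax tree, and semantic analysis would annotate it with types and scopes.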
In this application, a front-end compiler may also be referred to as a front-end compiling apparatus or a front-end compiling unit.
B. An intermediate compiler:
the intermediate compiler is used for code generation and optimization. Specifically, the intermediate code is pseudo code that can be regarded as a program for an abstract machine, organized as a syntax tree. For example, for the code (or instruction or function) "sum = (10 + 20) × (num + square)", the syntax tree shown in fig. 4 can be obtained after code generation and optimization.
In the application, the optimization of the intermediate code is equivalence-preserving, so it can save storage space and make the program run faster. Common optimizations fall into two categories: 1) machine-independent optimizations, such as constant folding, common-subexpression elimination, loop unrolling and fusion, and code hoisting (moving loop-invariant computation out of the loop); 2) machine-dependent optimizations, such as register allocation (keeping frequently used quantities in registers to reduce the number of memory accesses) and storage strategy (arranging the Cache according to the access pattern of the algorithm, and providing a parallel storage system to reduce access conflicts).
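A minimal sketch of one machine-independent optimization, constant folding, applied to the syntax-tree example above; the tuple-based tree layout is an assumption made for illustration:

```python
import operator

# Constant folding on a tiny expression tree. A node is either a
# literal (int/float), a variable name (str), or ("op", left, right).
OPS = {"+": operator.add, "*": operator.mul}

def fold_constants(node):
    """Recursively replace constant subtrees with their value."""
    if isinstance(node, (int, float, str)):
        return node                      # literal or variable name
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)      # both sides constant: evaluate now
    return (op, left, right)

# sum = (10 + 20) * (num + square): the constant subtree folds to 30.
tree = ("*", ("+", 10, 20), ("+", "num", "square"))
print(fold_constants(tree))  # ('*', 30, ('+', 'num', 'square'))
```

The transformation is equivalence-preserving: the folded tree computes the same value as the original but with one fewer runtime addition.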
In this application, the intermediate compiler may also be referred to as an intermediate compilation means or an intermediate compilation unit.
C. A back-end compiler:
the back-end compiler (backend) is mainly used for object code generation. That is, there may be a plurality of back-end compilers corresponding one-to-one to a plurality of processors, each configured to convert the input optimized IR into object code (or an instruction or function) executable on its corresponding processor, where the object code may be instruction code or assembly code. As described above, a back-end compiler for generating the object code needs to be selected from the plurality of back-end compilers; in the prior art this selection is performed manually, whereas in the embodiments of the present application it can be performed automatically by a computer device. In addition, since the plurality of back-end compilers correspond to a plurality of types of processors, "selecting one back-end compiler from the plurality of back-end compilers for generating the object code" can also be understood as selecting, from the plurality of types of processors, one processor for executing the target program.
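A minimal sketch of that one-to-one correspondence; the class layout, the processor names, and the stub output are hypothetical:

```python
# One back-end compiler per processor type, each lowering the
# optimized IR to code for its own target.

class Backend:
    def __init__(self, processor_name):
        self.processor_name = processor_name

    def generate(self, ir):
        # A real backend would emit instruction or assembly code;
        # this stub just labels the IR with its target.
        return f"{self.processor_name}-code({ir})"

BACKENDS = {name: Backend(name) for name in ("CPU", "GPU", "NPU", "DSP")}

def compile_for(processor_name, ir):
    """Selecting a backend here is equivalent to selecting the
    processor that will execute the target program."""
    return BACKENDS[processor_name].generate(ir)

print(compile_for("GPU", "s = add(mul(x, y), z)"))
# GPU-code(s = add(mul(x, y), z))
```

Because the registry maps processors to backends one-to-one, the automatic selection described in the following methods only has to pick a key in this table.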
In this application, a back-end compiler may also be referred to as a back-end compiling apparatus or a back-end compiling unit.
FIG. 5 is a schematic flow chart diagram of an example of a method 200 of selecting a processor of the present application. By way of example, and not limitation, the execution body of the method 200 (hereinafter, referred to as processing node # a for ease of understanding and explanation) may be any of a plurality of processors in a computing device, e.g., a central processor. Alternatively, processing node # a may be a virtual machine running in a computing device. In the present application, the processing node # a may be the above-mentioned back-end compiler, or may be a device independent of the above-mentioned back-end compiler, and the present application is not particularly limited.
It should be noted that the method 200 is to select a target processor based on an instruction, and since the FPGA does not include an instruction set, the method 200 may not be used to determine whether the FPGA can be used as the target processor.
Also, when an ASIC is used to execute a software program, the method 200 may be used to determine whether the ASIC can serve as the target processor.
When the ASIC does not execute a software program but instead performs its calculations through hardware logic, the method 200 may not be used to determine whether the ASIC can serve as the target processor.
As shown in fig. 5, at S210 the processing node #A may acquire hardware information of each of the two types of processors included in the computing device 100. Optionally, the manufacturer of the computing device 100 may configure the hardware information of each processor included in the computing device 100 before the device leaves the factory, so that at S210 the processing node #A can acquire the hardware information of each of the two processors from the factory configuration. Alternatively, the manufacturer of the computing device 100 may store the hardware information of each processor on a server, so that at S210 the processing node #A connects to the server via a network in advance and acquires the hardware information of each of the two processors from the server. Alternatively, a user of the computing device 100 may input the hardware information of each processor to the processing node #A. Alternatively, each processor may be installed in a hot-plug manner, and the driver of each processor may complete its registration at hot-plug time; in this case, at S210 the processing node #A may acquire the hardware information of each of the two types of processors based on the registration information of each processor or related information in its driver.
That is, in the present application, the computing device 100 (or the processing node #A) may have a processor-registration-information collection function, so that it can identify which heterogeneous hardware the computing device 100 supports and, according to the identified hardware, register the backends corresponding to the respective processors at system startup. The processing node can then determine the hardware information of each processor from the registration information of the backend corresponding to that processor.
In this application, the hardware information of a processor may include information of an instruction set corresponding to the processor. For example, hardware information for a processor may include information on the names of instructions that the processor is capable of executing. As another example, hardware information of a processor may include information of names of functions that the processor is capable of executing.
As shown in fig. 5, at S220 the processing node #A may determine the program information of the program that currently needs to be run (i.e., an example of the target program, referred to as program #A). By way of example and not limitation, the program information may be determined from the IR of program #A. For example, a front-end compiler may obtain the source program code of program #A (denoted code #A). Specifically, the compiler may let the developer write the operator in a DSL invoked through, for example, a Domain Specific Language Interface (DSL Interface); thereafter, the intermediate compiler may convert the code #A (e.g., DSL) corresponding to program #A into the IR of program #A, and may also optimize that IR. Thus, the processing node #A can determine the program information of program #A from the IR (e.g., the optimized IR) of program #A.
Note that, in the present application, the processing node #A may itself serve as the front-end compiler and the intermediate compiler for code #A, in which case the processing node #A can obtain the IR of program #A directly. Alternatively, the front-end compiler and the intermediate compiler for code #A may be implemented by a processing node #B; in this case, the processing node #A may communicate with the processing node #B, so that the processing node #B can transmit the IR of program #A to the processing node #A.
In the present application, the program information of the program # A may include instructions (denoted as: instruction # A) included in the code (e.g., the optimized IR) of the program # A. The instruction # a may include one instruction or a plurality of instructions, and the present application is not particularly limited. For example, the program information of the program # a may include the name of the instruction in the IR of the program # a. As another example, the program information of the program # a may include the name of the function in the IR of the program # a.
At S230, the processing node # a may determine a target processor (denoted as processor #1) from among the plurality of processors based on the program information of the program # a and the hardware information of each processor. The processor #1 may be a processor of which the instruction set corresponding to the plurality of processors includes the instruction # a. Alternatively, the processor #1 may be a processor among the plurality of processors that satisfies the constraint # a. The constraint # A includes that the processor's corresponding instruction set includes the instruction # A.
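Constraint #A, that the target processor's instruction set must cover every instruction in the program's IR, can be sketched as follows; the instruction-set contents are made up for illustration:

```python
# Sketch of S230: pick a target processor whose instruction set
# covers all instructions found in the program's IR (constraint #A).

HARDWARE_INFO = {
    "processor_a": {"instruction_set": {"mul", "add", "mac"}},
    "processor_b": {"instruction_set": {"mul", "add", "sub", "div"}},
}

def select_target(program_instructions, hardware_info):
    """Return the first processor whose instruction set is a
    superset of the instructions used by the program."""
    for name, info in hardware_info.items():
        if set(program_instructions) <= info["instruction_set"]:
            return name
    return None  # no processor satisfies constraint #A

print(select_target(["mul", "add"], HARDWARE_INFO))  # processor_a
print(select_target(["div"], HARDWARE_INFO))         # processor_b
```

The subset test is the whole of constraint #A; the priority ordering discussed next only changes the order in which the candidates are examined.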
Optionally, in the present application, the processing node #A may determine a priority for each of the plurality of processors. By way of example and not limitation, the processing node #A may determine the priorities according to the parallel computing capability of each processor; that is, a processor with high parallel computing capability has a higher priority than one with low parallel computing capability. For example, for processor #a and processor #b, if the parallel computing capability of processor #b is higher than that of processor #a, the processing node #A may consider the priority of processor #b higher than that of processor #a. Parallel computing is so named in contrast to serial computing: it is a mode of computation that executes many operations at once, with the aim of increasing computation speed and solving large, complex problems by enlarging the scale at which problems are solved. Parallel computing can be divided into parallelism in time and parallelism in space; parallelism in time refers to pipelining, while parallelism in space refers to performing computations concurrently on multiple processors.
Alternatively, in the present application, the processing node #A may determine the priority of each processor according to the kinds of the plurality of processors. For example, a special-purpose processor has a higher priority than a general-purpose processor, and, optionally, the general-purpose processor may be the lowest-priority processor among the plurality. The processing node #A can then judge, in order of priority (e.g., from high to low), whether each processor satisfies the above constraint #A, and, optionally, may determine the first processor satisfying constraint #A as processor #1.
Further, the processing node #A may stop evaluating other processors once processor #1 is determined. For example, in the present application, different processors have different instruction sets, and the instructions implementing the same function differ between chips; assume the instruction set of processor #a is intrin #a and that of processor #b is intrin #b. Let the source code of program #A be the expression s = x × y + z. Also, processor #b is by default the lowest-priority processor; for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor, i.e., processor #b can perform the function but has less parallel computing capability than the special-purpose processor #a. Described in a DSL, the computation can be expressed as m = mul(x, y); s = add(m, z). After IR processing, the IR description of program #A is obtained, and analysis shows that the computation uses two instructions, mul (multiply) and add (sum). The processing node #A then first judges whether these instructions belong to intrin #a; if yes, processor #a is selected as processor #1; if no, processor #b is selected as processor #1.
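The worked example above, try the special-purpose processor first and fall back to the general-purpose one, can be sketched as:

```python
# Priority-ordered check: special-purpose processor #a is tried
# first; general-purpose processor #b is the lowest-priority fallback.

INTRIN_A = {"mul", "add"}  # assumed instruction set of processor #a

def select_by_priority(ir_instructions):
    """Select processor #a if intrin #a covers every IR instruction;
    otherwise fall back to the general-purpose processor #b."""
    if set(ir_instructions) <= INTRIN_A:
        return "processor_a"
    return "processor_b"

# s = x * y + z lowers to m = mul(x, y); s = add(m, z):
print(select_by_priority(["mul", "add"]))            # processor_a
print(select_by_priority(["mul", "add", "conv2d"]))  # processor_b
```

The general-purpose processor never needs an instruction-set check of its own here, which is why the search can stop as soon as the first match in priority order is found.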
Optionally, the hardware information obtained by the processing node #A at S210 further includes information on the size of the currently available memory space of each processor. By way of example and not limitation, the size of a processor's currently available memory space may be, for example, 90% of the processor's free space (or memory capacity). Correspondingly, the program information obtained by the processing node #A at S220 further includes information on the size of the memory space required for running program #A. In this case, processor #1 may be a processor, among the plurality, whose instruction set includes instruction #A and whose currently available memory space is greater than or equal to the memory space required for running program #A. In other words, processor #1 may be a processor that satisfies both constraint #A and constraint #B, where constraint #B requires that the currently available memory space of the processor be greater than or equal to the memory space required for running program #A (denoted space #A).
By way of example and not limitation, in the present application, the processing node #A may determine the space #A based on the data dimensions of program #A (or of the code of program #A). The data dimension of program #A may be understood as the shape of the tensors of program #A. A Tensor is a multilinear mapping defined on the Cartesian product of some vector spaces and some dual spaces; its coordinates form a quantity with n^r components in an n-dimensional space, where each component is a function of the coordinates, and when the coordinates are transformed, the components also transform linearly according to certain rules. r is called the rank or order of the tensor (unrelated to the rank or order of a matrix). In an isomorphic sense, a zeroth-order tensor (r = 0) is a Scalar, a first-order tensor (r = 1) is a Vector, and a second-order tensor (r = 2) is a Matrix. For example, in a 3-dimensional space, the tensor with r = 1 is the vector (x, y, z). According to their transformation rules, tensors are divided into three types: covariant tensors (indices below), contravariant tensors (indices above), and mixed tensors (indices both above and below).
In the present application, all data can be represented by a data structure such as a tensor; that is, a tensor can correspond to an n-dimensional array or list. A tensor has a static type and dynamic dimensions, and tensors can flow between the nodes of a computation graph.
In this application, the dimensionality of a tensor is described by its order. Note that the order of a tensor (sometimes also called its degree or n-dimension) is a quantitative description of the tensor's dimensionality. For example, in the present application, the processing node #A may determine the dimensions of the IR of program #A (specifically, of the tensors in the IR) by performing shape analysis on the IR, and from these estimate the size of the memory space required for running program #A. The method and process of estimating memory size from data dimensions may be similar to those in the prior art; a detailed description is omitted here to avoid redundancy.
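One simple way such an estimate can work is to multiply out each tensor's shape and sum the element counts; the element size and the set of shapes below are assumptions for illustration:

```python
from math import prod

def estimate_memory(tensor_shapes, bytes_per_element=4):
    """Estimate the memory a program needs from the shapes (dimensions)
    of the tensors its IR manipulates; float32 elements assumed."""
    return sum(prod(shape) for shape in tensor_shapes) * bytes_per_element

# e.g. two 1024x1024 operand matrices plus one 1024x1024 result:
shapes = [(1024, 1024), (1024, 1024), (1024, 1024)]
print(estimate_memory(shapes))  # 12582912 bytes (12 MiB)
```

The estimate is then the quantity X compared against each processor's available memory Y in the example that follows.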
In this case, let the source code of program #A be the expression s = x × y + z. Also, processor #b is by default the lowest-priority processor; for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor, i.e., processor #b can perform the function but has less parallel computing capability than the special-purpose processor #a. Described in a DSL, the computation can be expressed as m = mul(x, y); s = add(m, z). After IR processing, the IR description of program #A is obtained, and analysis shows that the computation uses two instructions, mul (multiply) and add (sum). In addition, the processing node #A may determine the size of the memory space that program #A needs to occupy (denote this size X), as well as the size of the currently available memory space of each processor; let the currently available memory size of processor #a be Y. The processing node #A then first judges whether the instructions belong to intrin #a; if yes, it further judges whether X is less than or equal to Y. If that is also yes, processor #a is selected as processor #1; otherwise, processor #b is selected as processor #1.
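Adding constraint #B to the instruction-set check gives the following sketch; the instruction sets and the memory figures are illustrative:

```python
# Combined check: instruction-set coverage (constraint #A) plus
# available memory (constraint #B), in priority order.

PROCESSORS = [  # ordered from highest to lowest priority
    {"name": "processor_a", "intrin": {"mul", "add"}, "free_mem": 4096},
    {"name": "processor_b", "intrin": None, "free_mem": 65536},  # general-purpose
]

def select_with_constraints(ir_instructions, required_mem):
    """Return the first processor, in priority order, whose instruction
    set covers the IR (None = covers everything) and whose available
    memory Y satisfies X <= Y."""
    for proc in PROCESSORS:
        covers = proc["intrin"] is None or set(ir_instructions) <= proc["intrin"]
        fits = required_mem <= proc["free_mem"]
        if covers and fits:
            return proc["name"]
    return None

print(select_with_constraints(["mul", "add"], 1024))  # processor_a
print(select_with_constraints(["mul", "add"], 8192))  # processor_b
```

With X = 8192 the instructions still belong to intrin #a, but X exceeds processor #a's available memory, so the selection falls through to the general-purpose processor #b, exactly the decision sequence in the paragraph above.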
The processing node #A may then control the intermediate compiler to transmit the IR of program #A to the backend corresponding to processor #1, so that this backend can convert the IR of program #A into code that processor #1 can recognize and process.
According to the method for selecting a processor of the present application, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the plurality of processors on that basis. The selected processor thus matches the target program without a processor having to be specified manually, which can improve the processing efficiency of the computer device and reduce the burden on the programmer.
FIG. 6 is a schematic flow chart diagram of an example of a method 300 of selecting a processor of the present application. By way of example, and not limitation, the execution body of method 300 (hereinafter, referred to as processing node # B for ease of understanding and explanation) may be any of a plurality of processors in a computing device, e.g., a central processor. Alternatively, processing node # B may be a virtual machine running in a computing device. In the present application, the processing node # B may be the above-mentioned back-end compiler, or may be a device independent of the above-mentioned back-end compiler, and the present application is not particularly limited.
As shown in fig. 6, at S310 the processing node #B may acquire hardware information of each of the two types of processors included in the computing device 100. Optionally, the manufacturer of the computing device 100 may configure the hardware information of each processor included in the computing device 100 before the device leaves the factory, so that at S310 the processing node #B can acquire the hardware information of each of the two processors from the factory configuration. Alternatively, the manufacturer of the computing device 100 may store the hardware information of each processor on a server, so that at S310 the processing node #B connects to the server via a network in advance and acquires the hardware information of each of the two processors from the server. Alternatively, a user of the computing device 100 may input the hardware information of each processor to the processing node #B. Alternatively, each processor may be installed in a hot-plug manner, and the driver of each processor may complete its registration at hot-plug time; in this case, at S310 the processing node #B may acquire the hardware information of each of the two processors based on the registration information of each processor or related information in its driver.
That is, in the present application, the computing device 100 (or the processing node #B) may have a processor-registration-information collection function, so that it can identify which heterogeneous hardware the computing device 100 supports and, according to the identified hardware, register the backends corresponding to the respective processors at system startup. The processing node can then determine the hardware information of each processor from the registration information of the backend corresponding to that processor.
In this application, the hardware information of a processor may include the size of the currently available memory space of the processor. By way of example and not limitation, in the present application, the size of the processor's currently available memory space may be, for example, 90% of the processor's free space (or memory capacity).
As shown in fig. 6, at S320 the processing node # B may acquire the program information of the program that currently needs to be run (i.e., an example of the target program, denoted as: program # B). By way of example and not limitation, in the present application, the program information may be determined from the IR of program # B. For example, a front-end compiler may obtain the source code of program # B (denoted as: code # B). Specifically, the front-end compiler may allow the developer to write an operator (i.e., an instance of code # B) in a domain-specific language by invoking, for example, a domain description language interface (DSL Interface); thereafter, the intermediate compiler may convert the code # B (e.g., the DSL) corresponding to program # B into the IR of program # B. Also, in the present application, the intermediate compiler may optimize the IR of program # B. Thus, the processing node # B can determine the program information of program # B from the IR (e.g., the optimized IR) of program # B.
Note that, in the present application, the processing node # B may itself serve as the front-end compiler and the intermediate compiler for code # B, in which case the processing node # B can directly obtain the IR of program # B. Alternatively, in the present application, the front-end compiler and the intermediate compiler for code # B may be implemented by another processing node, which may communicate with the processing node # B so as to transmit the IR of program # B to the processing node # B. In the present application, the program information of program # B may include information on the size of the memory space (denoted as: space # B) required for the operation of program # B.
By way of example and not limitation, in the present application, processing node # B may determine the space # B based on the data dimension of program # B (or of the code of program # B). The data dimension of program # B may be understood as the dimension (shape) of the tensors of program # B. A tensor is a multilinear mapping defined on the Cartesian product of vector spaces and dual spaces; its coordinates form a quantity with n^r components in n-dimensional space, where each component is a function of the coordinates, and when the coordinates are transformed, the components also transform linearly according to certain rules. Here r is called the rank or order of the tensor (unrelated to the rank of a matrix). In an isomorphic sense, a zeroth-order tensor (r = 0) is a scalar (Scalar), a first-order tensor (r = 1) is a vector (Vector), and a second-order tensor (r = 2) is a matrix (Matrix). For example, in a 3-dimensional space, the tensor with r = 1 is a vector: (x, y, z). Owing to differences in the transformation rules, tensors are divided into three types: covariant tensors (indices below), contravariant tensors (indices above), and mixed tensors (indices both above and below).
In the present application, all data can be represented by a data structure such as a tensor. That is, in the present application, a tensor can correspond to an n-dimensional array or list. A tensor has a static type and dynamic dimensions, and tensors may flow between nodes in a computation graph. In this application, the dimension of a tensor is described by its order. It should be noted that the order of a tensor (sometimes also called its degree, or n-dimension) is a quantitative description of the tensor's dimension. For example, in the present application, the processing node # B may determine the dimensions of the IR of program # B (specifically, of the tensors in the IR) by performing shape analysis on the IR of program # B, and thereby estimate the size of the memory space required for the operation of program # B. Here, the method and process for estimating the size of the memory space based on the data dimensions may be similar to those in the prior art; a detailed description thereof is omitted here to avoid redundancy.
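The shape-analysis estimate described above (dimensions of the IR's tensors mapped to required memory) can be sketched as follows, assuming a flat sum over tensor shapes and a uniform element size; the function name and the 4-byte default are illustrative assumptions, not from the patent.

```python
from functools import reduce
from operator import mul as _mul

def estimate_memory(shapes, element_size=4):
    """Estimate memory (bytes) required by a program from the shapes of
    the tensors appearing in its IR, assuming a fixed element size."""
    total_elements = sum(reduce(_mul, shape, 1) for shape in shapes)
    return total_elements * element_size

# Tensors for s = x * y + z with x, y, z, m (= x * y), s all of shape (128, 128):
shapes = [(128, 128)] * 5
required = estimate_memory(shapes)
```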
At S330, the processing node # B may determine a target processor (denoted as: processor # 2) from among the plurality of processors based on the program information of program # B and the hardware information of each processor. The processor # 2 may be a processor, among the plurality of processors, whose currently available memory space is greater than or equal to the size of the memory space required for the operation of program # B. In other words, the processor # 2 may be a processor satisfying the constraint # C among the plurality of processors, where the constraint # C includes: the currently available memory space of the processor is greater than or equal to the size of the memory space required for the operation of program # B.
Optionally, in the present application, the processing node # B may determine a priority of each of the plurality of processors. By way of example and not limitation, the processing node # B may determine the priorities according to the parallel computing capability of each processor; that is, in the present application, a processor with high parallel computing capability has a higher priority than a processor with low parallel computing capability. For example, for processor # a and processor # b, if the parallel computing capability of processor # b is higher than that of processor # a, the processing node # B may consider the priority of processor # b to be higher than that of processor # a. Parallel computing (also called parallel processing) is contrasted with serial computing: it executes multiple instructions at a time, with the aim of increasing computation speed and solving large, complex computational problems by enlarging the scale of problem solving. Parallel computing can be divided into temporal parallelism and spatial parallelism, where temporal parallelism refers to pipelining and spatial parallelism refers to performing computations concurrently on multiple processors.
For another example, in the present application, the processing node # B may determine the priorities according to the power consumption of each of the plurality of processors; that is, in the present application, a processor with high power consumption has a lower priority than a processor with low power consumption. For example, for processor # a and processor # b, if the power consumption of processor # b is higher than that of processor # a, the processing node # B may consider the priority of processor # b to be lower than that of processor # a.
Alternatively, in the present application, the processing node # B may determine the priority of each processor according to the kind of each of the plurality of processors. For example, in the present application, the priority of a special-purpose processor is higher than that of a general-purpose processor, and, optionally, the general-purpose processor may be the lowest-priority processor among the plurality of processors. Thus, the processing node # B can determine, in order of priority (for example, from high to low), whether each processor satisfies the above-described constraint condition # C. Optionally, the processing node # B may determine the first processor satisfying the constraint condition # C as the processor # 2, and may stop evaluating the other processors after the processor # 2 is determined.
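The priority-ordered search under constraint # C can be sketched as follows; the tuple layout and function name are illustrative assumptions, not from the patent.

```python
def select_target_processor(processors, required_memory):
    # processors: list of (name, priority, available_memory); a higher
    # priority value is checked earlier. Constraint #C: available memory
    # >= memory required by the target program. The first match is
    # returned and the search stops, mirroring the patent's "stop
    # determining the other processors" behavior.
    for name, _prio, available in sorted(processors, key=lambda p: -p[1]):
        if available >= required_memory:
            return name
    return None
```

For example, with an NPU (priority 2, 100 bytes free) and a CPU (priority 1, 1000 bytes free), a program needing 50 bytes selects the NPU, while one needing 500 bytes falls through to the CPU.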
For example, let the expression of the source code of program # B be s = x * y + z, and let processor # b be the lowest-priority processor by default; for example, processor # b may be a general-purpose processor and processor # a a special-purpose processor, i.e., processor # b can perform the functions but has less parallel computing capability than the special-purpose processor # a. Describing the computation in the DSL language can be expressed as m = mul(x, y), s = add(m, z). Then, in this application, after IR processing, the IR description of program # B is obtained, and analysis shows that the computation uses two instructions: mul (multiply) and add (sum). Also, the processing node # B may determine the size of the memory space that program # B needs to occupy (denoted as: W), and may determine the size of the currently available memory space of each processor; let the currently available memory size of processor # a be Z. Thereafter, the processing node # B determines whether Z is greater than or equal to W. If the determination is "yes", processor # a is selected as processor # 2; if the determination is "no", processor # b is selected as processor # 2.
Optionally, the hardware information acquired by the processing node # B at S310 further includes information of the instruction set corresponding to each processor. For example, the hardware information of a processor may include the names of the instructions that the processor is capable of executing. As another example, the hardware information of a processor may include the names of the functions that the processor is capable of executing. Correspondingly, the program information acquired by the processing node # B at S320 further includes the instructions (denoted as: instruction # B) included in the code (e.g., the optimized IR) of program # B. The instruction # B may include one instruction or a plurality of instructions; the present application is not particularly limited in this respect. For example, the program information of program # B may include the names of the instructions in the IR of program # B. As another example, the program information of program # B may include the names of the functions in the IR of program # B. In this case, the processor # 2 may be a processor, among the plurality of processors, whose currently available memory space is greater than or equal to the memory space required for the operation of program # B and whose corresponding instruction set includes the instruction # B. In other words, the processor # 2 may be a processor satisfying both the constraint # C and the constraint # D among the plurality of processors, where the constraint # D includes: the instruction set corresponding to the processor includes the instruction # B. In this case, let the expression of the source code of program # B be s = x * y + z, and let processor # b be the lowest-priority processor by default; for example, processor # b may be a general-purpose processor and processor # a a special-purpose processor, i.e., processor # b can perform the functions but has less parallel computing capability than the special-purpose processor # a.
For example, in the present application, different processors have different instruction sets, and the instructions for implementing the same function differ between chips; assume that the instruction set of processor # a is intrin # a and the instruction set of processor # b is intrin # b. Describing the computation in the DSL language can be expressed as m = mul(x, y), s = add(m, z). Then, in this application, after IR processing, the IR description of program # B is obtained, and analysis shows that the computation uses two instructions: mul (multiply) and add (sum). Also, the processing node # B may determine the size of the memory space that program # B needs to occupy (denoted as: W), and may determine the size of the currently available memory space of each processor; let the currently available memory size of processor # a be Z. Thereafter, the processing node # B first determines whether Z is greater than or equal to W. If yes, it further determines whether the instructions mul and add belong to intrin # a; if this determination is "yes", processor # a is selected as processor # 2; if the determination is "no", processor # b is selected as processor # 2.
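Combining constraint # C (memory) with constraint # D (instruction-set coverage), the selection of processor # 2 might look like the following sketch; the data layout and names are assumptions for illustration, with the general-purpose processor acting as the lowest-priority fallback.

```python
def select_with_instruction_check(processors, required_memory, program_instrs):
    # processors: list of (name, priority, available_memory, instruction_set).
    # Constraint #C: available memory >= required memory.
    # Constraint #D: the processor's instruction set covers the program's
    # instructions (e.g., mul and add from the IR analysis).
    for name, _prio, available, isa in sorted(processors, key=lambda p: -p[1]):
        if available >= required_memory and set(program_instrs) <= set(isa):
            return name
    return None
```

With processor # a (special-purpose, high priority, intrin # a = {mul, add}) and processor # b (general-purpose, supporting mul, add, and div), a program using mul and add lands on processor # a, while one using div falls back to processor # b.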
The processing node # B may control the intermediate compiler to send the IR of program # B to the backend corresponding to processor # 2. Thus, the backend corresponding to processor # 2 can convert the IR of program # B into code that processor # 2 can recognize and process. The method for selecting a processor provided in this application can be applied to compiling technology.
As shown in fig. 7, at S410, the compiling device (e.g., a front-end compiler) may allow the developer to write an operator in DSL through the DSL interface. At S420, the compiling device (e.g., an intermediate compiler) may generate the intermediate expression IR for the DSL. At S430, the compiling device (e.g., the intermediate compiler) may optimize the intermediate expression IR. At S440, the compiling device (e.g., the processing node # A or the processing node # B described above) selects an optimal back-end compiler (backend) based on the back-end hardware registration information acquired by the automatic hardware recognition device and the analysis result of the automatic IR analysis device. The specific process of this step may be similar to the process described in the method 200 or the method 300; a detailed description thereof is omitted here to avoid redundancy. At S450, the compiling device (e.g., the selected back-end compiler) compiles the IR and generates operator code that can run on the processor corresponding to this backend.
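Under simplifying assumptions (the "IR" here is just a list of (op, args) tuples parsed from assignment statements such as m = mul(x, y)), the S410–S450 flow can be sketched end to end; all names are illustrative, not from the patent.

```python
def compile_program(dsl_source, processors):
    # S420: produce a toy "IR" from DSL statements of the form
    # "lhs = op(arg1, arg2)".
    ir = []
    for stmt in dsl_source.strip().splitlines():
        lhs, rhs = (s.strip() for s in stmt.split("="))
        op, argstr = rhs.split("(", 1)
        ir.append((op.strip(),
                   tuple(a.strip() for a in argstr.rstrip(")").split(","))))
    instrs = {op for op, _args in ir}
    # S440: pick the highest-priority backend whose instruction set
    # covers the instructions found in the IR.
    for name, _prio, isa in sorted(processors, key=lambda p: -p[1]):
        if instrs <= set(isa):
            return name, ir  # S450 would hand the IR to this backend
    return None, ir

backend, ir = compile_program(
    "m = mul(x, y)\ns = add(m, z)",
    [("npu", 2, ("mul", "add")), ("cpu", 1, ("mul", "add", "div"))],
)
```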
According to the method for selecting a processor provided in this application, by acquiring the hardware information of each processor and the program information of the target program in advance, and selecting from the processors, based on the hardware information and the program information, a processor whose hardware information matches the program information, the selected processor can be matched with the target program and manual labor time can be reduced.
In light of the foregoing, fig. 8 is a schematic diagram of a logic architecture of an apparatus 500 for selecting a processor according to an embodiment of the present application. Wherein the means for selecting a processor may be configured on a computing device comprising a plurality of processors, or the means for selecting a processor may itself be one of the plurality of processors. As shown in fig. 8, the means 500 for selecting a processor may comprise a recognition unit 510, an analysis unit 520 and a selection unit 530.
The identifying unit 510 may be configured to execute the method in S210 or S310, that is, the identifying unit 510 may obtain hardware information of each of the at least two processors, where the hardware information is used to indicate an instruction set corresponding to the processor, and/or the hardware information is used to indicate a size of an available memory space of the processor, and a specific processing procedure of the identifying unit 510 may be similar to the processing procedure described in S210 or S310, and a detailed description thereof is omitted here to avoid redundancy.
The analysis unit 520 may be configured to execute the method in S220 or S320, that is, the analysis unit 520 may obtain program information of the target program, where the program information is used to indicate an instruction in the target program, and/or the program information is used to indicate a memory space that needs to be occupied by the target program, and a specific processing procedure of the analysis unit 520 may be similar to the processing procedure described in S220 or S320, and a detailed description thereof is omitted here to avoid redundancy.
The selecting unit 530 may be configured to execute the method in S230 or S330. That is, the selecting unit 530 determines, according to the program information and the hardware information, a target processor for executing the target program from the at least two processors, where the target processor is a processor of the at least two processors that meets a preset condition, the preset condition including that the instruction set corresponding to the processor includes the instructions in the target program, and/or that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy. The specific processing procedure of the selecting unit 530 may be similar to the processing procedure described in S230 or S330; a detailed description thereof is omitted here to avoid redundancy. In addition, the selecting unit 530 may further control the intermediate compiler to send the IR of the target program to the back-end compiler (backend) corresponding to the target processor.
It should be noted that, in the present application, the actions and functions of the recognition unit 510, the analysis unit 520, and the selection unit 530 may be implemented by the same virtual machine or the same processor. Alternatively, the actions and functions of the recognition unit 510, the analysis unit 520, and the selection unit 530 may be implemented by different virtual machines or processors, respectively.
For the concepts, explanations, details, and other steps related to the technical solutions of the embodiments of the present application that relate to the apparatus 500, please refer to the descriptions of the foregoing methods or other embodiments; they are not repeated herein. According to the apparatus for selecting a processor provided in this application, by acquiring the hardware information of each processor and the program information of the target program in advance, and selecting from the processors, based on the hardware information and the program information, a processor whose hardware information matches the program information, the selected processor can be matched with the target program without manually specifying a processor, so that the processing efficiency of the computer device can be improved and the burden on programmers can be reduced.
Fig. 9 is a schematic diagram of a logic architecture of a compiling apparatus 600 to which the embodiments of the present application can be applied according to the foregoing methods. As shown in fig. 9, the compiling apparatus 600 may include a front-end compiling unit 610, an intermediate compiling unit 620, a selecting unit 630, and a plurality of back-end compiling units 640. The back-end compiling units 640 correspond one-to-one to processors (or computing units, computing platforms, or processing units). The selecting unit 630 may include an identifying module 632, an analysis module 634, and a selection module 636. The front-end compiling unit 610 may allow the developer to write an operator in DSL through the DSL interface; the actions performed by the front-end compiling unit 610 may be similar to those performed by the front-end compiler described above, and a description thereof is omitted here to avoid redundancy. The intermediate compiling unit 620 is communicatively connected to the front-end compiling unit 610, and is configured to obtain the DSL from the front-end compiling unit 610, generate the intermediate expression IR from the DSL, and optimize the intermediate expression IR; the actions performed by the intermediate compiling unit 620 may be similar to those performed by the intermediate compiler described above, and a description thereof is omitted here to avoid redundancy.
The identifying module 632 may be configured to perform the method in S210 or S310. That is, the identifying module 632 may obtain hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor, and/or the hardware information is used to indicate the size of the available memory space of the processor. The specific processing procedure of the identifying module 632 may be similar to the processing procedure described in S210 or S310; a detailed description thereof is omitted here to avoid redundancy.
The analysis module 634 is communicatively connected to the intermediate compiling unit 620, and is configured to acquire the IR from the intermediate compiling unit 620, and further may be configured to execute the method in S220 or S320, that is, the analysis module 634 may acquire program information of the target program, where the program information is used to indicate an instruction in the target program, and/or the program information is used to indicate a memory space that the target program needs to occupy, and a specific processing procedure of the analysis module 634 may be similar to the processing procedure described in S220 or S320, and a detailed description thereof is omitted here to avoid redundant description.
The selecting module 636 may be communicatively connected to the identifying module 632 and the analysis module 634, and configured to obtain the hardware information from the identifying module 632 and the program information from the analysis module 634. The selecting module 636 may further be configured to execute the method in S230 or S330; that is, the selecting module 636 determines, according to the program information and the hardware information, a target processor for executing the target program from the at least two processors, where the target processor is a processor of the at least two processors that meets a preset condition, the preset condition including that the instruction set corresponding to the processor includes the instructions in the target program, and/or that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy. The specific processing procedure of the selecting module 636 may be similar to the processing procedure described in S230 or S330; a detailed description thereof is omitted here to avoid redundancy. In addition, the selecting module 636 may further control the intermediate compiler to send the IR of the target program to the back-end compiling unit 640 corresponding to the target processor. The back-end compiling unit 640 may convert the IR into code that can be executed on the corresponding processor; the actions performed by the back-end compiling unit 640 may be similar to those performed by the back-end compiler described above, and a description thereof is omitted here to avoid redundancy.
According to the compiling apparatus provided in this application, by acquiring the hardware information of each processor and the program information of the target program in advance, and selecting from the processors, based on the hardware information and the program information, a processor whose hardware information matches the program information, the selected processor can be matched with the target program without manually specifying a processor, so that the processing efficiency of the computer device can be improved and the burden on programmers can be reduced.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

  1. A method of selecting a processor, the method comprising:
    acquiring hardware information of each processor in at least two processors, wherein the hardware information is used for indicating an instruction set corresponding to each processor;
    acquiring program information of a target program to be executed, wherein the program information is used for indicating instructions in the target program;
    and according to the program information and the hardware information, determining a target processor which meets preset conditions and can be used for executing the target program from the at least two processors, wherein the preset conditions comprise that an instruction set corresponding to the processor comprises instructions in the target program.
  2. The method of claim 1, wherein determining a target processor that satisfies a preset condition and is available for executing the target program from the at least two processors according to the program information and the hardware information comprises:
    determining a priority for each of the at least two processors;
    and sequentially judging whether the at least two processors meet the preset conditions or not according to the sequence of the priorities of the at least two processors from high to low on the basis of the program information and the hardware information, and taking the first processor meeting the preset conditions as the target processor.
  3. The method of claim 2, wherein said determining the priority of each of the at least two processors comprises:
    determining a priority of each of the at least two processors based on at least one of parallel computing power or power consumption of the each processor.
  4. A method according to claim 2 or 3, wherein the at least two processors comprise central processing units, CPUs, and wherein the priority of the CPUs is the lowest of the at least two processors.
  5. The method of any of claims 1 to 4, wherein the hardware information is further to indicate a size of an available memory space of a processor,
    the program information is also used to indicate the memory space that the target program needs to occupy, and
    the preset condition further includes that the available memory space of the processor is larger than or equal to the memory space required to be occupied by the target program.
  6. The method of any of claims 1 to 5, wherein the at least two processors comprise at least two of:
    a CPU, a graphics processor GPU, an application-specific integrated circuit ASIC, a neural network processor NPU, an image processing unit IPU, or a digital signal processor DSP.
  7. The method of any one of claims 1 to 6, wherein the obtaining program information of the target program comprises:
    determining the program information according to an intermediate expression IR of a target program, wherein the IR of the target program is determined according to a DSL code of a domain description language of the target program.
  8. The method of any of claims 1 to 7, further comprising:
    inputting the IR of the target program to a target back-end compiler corresponding to the target processor.
  9. An apparatus for selecting a processor, the apparatus comprising:
    the identification module is used for acquiring hardware information of each processor in at least two processors, and the hardware information is used for indicating an instruction set corresponding to the processor;
    the analysis module is used for acquiring program information of a target program to be executed, wherein the program information is used for indicating an instruction in the target program;
    and the selection module is used for determining a target processor which meets preset conditions and can be used for executing the target program from the at least two processors according to the program information and the hardware information, wherein the preset conditions comprise that an instruction set corresponding to the processor comprises instructions in the target program.
  10. The apparatus of claim 9, wherein the selection module is configured to determine a priority of each of the at least two processors, and based on the program information and the hardware information, determine whether the at least two processors satisfy the preset condition in order of priority of the at least two processors from high to low, and take a first processor satisfying the preset condition as the target processor.
  11. The apparatus of claim 10, wherein the selection module is to determine the priority of each of the at least two processors based on at least one of parallel computing power or power consumption of the each processor.
  12. The apparatus of claim 10 or 11, wherein the at least two processors comprise Central Processing Units (CPUs), and wherein the CPUs are of lowest priority among the at least two processors.
  13. The apparatus of any of claims 9 to 12, wherein the hardware information is further to indicate a size of an available memory space of a processor,
    the program information is also used to indicate the memory space that the target program needs to occupy, and
    the preset condition further includes that the available memory space of the processor is larger than or equal to the memory space required to be occupied by the target program.
  14. The apparatus of any of claims 9 to 13, wherein the at least two processors comprise at least two of:
    a CPU, a graphics processor GPU, an application-specific integrated circuit ASIC, a neural network processor NPU, an image processing unit IPU, or a digital signal processor DSP.
  15. The apparatus of any of claims 9 to 14, wherein the analysis module is to determine the program information according to an intermediate expression, IR, of the target program, wherein the IR is determined according to a domain description language, DSL, code of the target program.
  16. The apparatus of any one of claims 9 to 15, wherein
    the selection module is further configured to provide the IR of the target program to a target back-end compiler corresponding to the target processor.
  17. A computer-readable storage medium comprising a computer program which, when run on a computer device or a processor, causes the computer device or the processor to perform the method of any one of claims 1 to 8.
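The selection logic recited in claims 9 to 13 can be sketched as follows. This is a minimal illustration only, not the patented implementation: all names (`ProcessorInfo`, `ProgramInfo`, `select_target_processor`) and the example processor data are hypothetical, and the preset condition is reduced to the two checks the claims name (instruction-set coverage and available memory).

```python
from dataclasses import dataclass

@dataclass
class ProcessorInfo:
    name: str               # illustrative label, e.g. "NPU", "GPU", "CPU"
    instruction_set: set    # instructions the processor supports (hardware information)
    available_memory: int   # available memory space in bytes (hardware information)
    priority: int           # higher value = higher priority; the CPU gets the lowest

@dataclass
class ProgramInfo:
    instructions: set       # instructions the target program uses (program information)
    required_memory: int    # memory space the program needs to occupy (program information)

def select_target_processor(processors, program):
    """Check processors in descending priority order and return the first one
    whose instruction set covers the program and whose memory suffices."""
    for proc in sorted(processors, key=lambda p: p.priority, reverse=True):
        covers_isa = program.instructions <= proc.instruction_set
        fits_memory = proc.available_memory >= program.required_memory
        if covers_isa and fits_memory:
            return proc
    return None  # no processor satisfies the preset condition

# Hypothetical example: the NPU has top priority but lacks the "add" instruction,
# so the GPU is chosen even though the CPU would also qualify.
procs = [
    ProcessorInfo("NPU", {"matmul", "conv"}, 1 << 30, priority=3),
    ProcessorInfo("GPU", {"matmul", "conv", "add"}, 2 << 30, priority=2),
    ProcessorInfo("CPU", {"matmul", "conv", "add", "div"}, 4 << 30, priority=1),
]
prog = ProgramInfo({"matmul", "add"}, 1 << 20)
target = select_target_processor(procs, prog)
```

Note how giving the CPU the lowest priority (claim 12) makes it a fallback: it can run anything, but is only chosen when no higher-priority accelerator satisfies the condition.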
CN201880094887.1A 2018-09-28 2018-09-28 Method and apparatus for selecting processor Active CN112292667B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/108459 WO2020062086A1 (en) 2018-09-28 2018-09-28 Method and device for selecting processor

Publications (2)

Publication Number Publication Date
CN112292667A true CN112292667A (en) 2021-01-29
CN112292667B CN112292667B (en) 2022-04-29

Family

ID=69949840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880094887.1A Active CN112292667B (en) 2018-09-28 2018-09-28 Method and apparatus for selecting processor

Country Status (2)

Country Link
CN (1) CN112292667B (en)
WO (1) WO2020062086A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951363A (en) * 2020-07-16 2020-11-17 广州玖的数码科技有限公司 Cloud computing chain-based rendering method and system and storage medium
CN113778984A (en) * 2021-08-16 2021-12-10 维沃移动通信(杭州)有限公司 Processing component selection method and device
CN115330587B (en) * 2022-02-22 2023-10-10 摩尔线程智能科技(北京)有限责任公司 Distributed storage interconnection structure of graphic processor, display card and memory access method
CN115600664B (en) * 2022-09-28 2024-03-08 美的集团(上海)有限公司 Operator processing method, electronic device and storage medium
CN115391053B (en) * 2022-10-26 2023-03-24 北京云迹科技股份有限公司 Online service method and device based on CPU and GPU hybrid calculation
CN117032999B (en) * 2023-10-09 2024-01-30 之江实验室 CPU-GPU cooperative scheduling method and device based on asynchronous running
CN117076330B (en) * 2023-10-12 2024-02-02 北京开源芯片研究院 Access verification method, system, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901207A (en) * 2010-07-23 2010-12-01 中国科学院计算技术研究所 Operating system of heterogeneous shared storage multiprocessor system and working method thereof
CN103167021A (en) * 2013-02-01 2013-06-19 浪潮(北京)电子信息产业有限公司 Resource allocation method and resource allocation device
US20150020206A1 (en) * 2013-07-10 2015-01-15 Raytheon BBN Technologies, Corp. Synthetic processing diversity with multiple architectures within a homogeneous processing environment
CN105138406A (en) * 2015-08-17 2015-12-09 浪潮(北京)电子信息产业有限公司 Task processing method, task processing device and task processing system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988194A (en) * 2021-03-29 2021-06-18 北京市商汤科技开发有限公司 Program optimization method and device based on equipment information, electronic equipment and storage medium
CN116450055A (en) * 2023-06-15 2023-07-18 支付宝(杭州)信息技术有限公司 Method and system for distributing storage area between multi-processing cards
CN116450055B (en) * 2023-06-15 2023-10-27 支付宝(杭州)信息技术有限公司 Method and system for distributing storage area between multi-processing cards

Also Published As

Publication number Publication date
CN112292667B (en) 2022-04-29
WO2020062086A1 (en) 2020-04-02

Similar Documents

Publication Publication Date Title
CN112292667B (en) Method and apparatus for selecting processor
Gschwend ZynqNet: An FPGA-accelerated embedded convolutional neural network
Chen et al. TVM: An automated end-to-end optimizing compiler for deep learning
EP2707797B1 (en) Automatic load balancing for heterogeneous cores
CN108804077A (en) For executing instruction and the logic of floating-point and integer operation for machine learning
CN112463216A (en) Programmable conversion hardware
Suh et al. Accelerating MATLAB with GPU computing: a primer with examples
Catanzaro et al. Ubiquitous parallel computing from Berkeley, Illinois, and Stanford
CN103870246A (en) Compiler-controlled region scheduling for simd execution of threads
US20190130268A1 (en) Tensor radix point calculation in a neural network
CN112446815A (en) Sparse matrix multiplication acceleration mechanism
US20180373514A1 (en) Application binary interface cross compilation
CN110326021A (en) The execution unit that acceleration in graphics processor calculates shares hybrid technology
JP2021093131A (en) Sparse matrix optimization mechanism
Jeon et al. Deep learning with GPUs
WO2023030507A1 (en) Compilation optimization method and apparatus, computer device and storage medium
Perkins cltorch: A hardware-agnostic backend for the Torch deep neural network library, based on OpenCL
Vaidya Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA: Effective techniques for processing complex image data in real time using GPUs
US20230195519A1 (en) Low power inference engine pipeline in a graphics processing unit
KR20230101851A (en) Highly parallel processing architecture using a compiler
Hesse Analysis and comparison of performance and power consumption of neural networks on cpu, gpu, tpu and fpga
CN115840894A (en) Method for processing multidimensional tensor data and related product thereof
CN115480743A (en) Compiling method and compiler for neural network and related product
Rościszewski et al. Optimizing throughput of Seq2Seq model training on the IPU platform for AI-accelerated CFD simulations
JP2021077347A (en) Dot product multiplier mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant