WO2020062086A1 - Method and device for selecting processor - Google Patents


Info

Publication number: WO2020062086A1
Authority: WIPO (PCT)
Prior art keywords: processor, program, processors, target, information
Application number: PCT/CN2018/108459
Other languages: French (fr), Chinese (zh)
Inventors: 刘恺 (Liu Kai), 周小超 (Zhou Xiaochao), 庞俊 (Pang Jun)
Original assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority application: CN201880094887.1A (CN112292667B)
Priority application: PCT/CN2018/108459 (WO2020062086A1)
Publication of WO2020062086A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; program switching, e.g. by interrupt
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • the present application relates to the field of computers, and, more particularly, to a method and apparatus for selecting a processor, and a computer device.
  • coprocessors can provide more parallel computing capability and increase computing speed.
  • coprocessors also have a better energy-efficiency and power-consumption ratio than general-purpose processors. Therefore, a coprocessor can make up for a CPU's lack of computing power and can reduce the overall energy consumption of the system.
  • examples of such workloads include neural networks (NN) and machine learning (ML).
  • compiling is the process of converting a program written in one programming language (the source language) into another language (the target language).
  • the source language may be the language used by the user when writing the target program, and the target language may be the language used by the processor that the user wishes to select to run the target program in a heterogeneous system.
  • a compiler can include a front end, an intermediate representation, and a back end.
  • the front end mainly implements the conversion from the source program to the intermediate representation; that is, the user first describes the computation of an operator in a domain-specific language (DSL), and this description serves as the source program input to the front end.
  • the intermediate representation (IR) is then input to the back end.
  • the back-end code generator completes the conversion from the IR to specific target code according to a specified target processor (for example, a general-purpose processor or a coprocessor).
  • in existing schemes, the programmer is required to manually specify, at an early stage (for example, while describing the computation of the operator in the DSL), the target processor on which the operator runs.
  • the target processor specified to the compiler may not be the most suitable processor to run the target program, which leads to a reduction in processing efficiency.
  • the process of manually specifying the target processor increases the workload of the programmer.
  • the present application provides a method and device for selecting a processor, which can improve the processing efficiency of computer equipment and reduce the burden on programmers.
  • a method for selecting a processor is provided: hardware information of each of at least two processors is obtained, where the hardware information is used to indicate the instruction set corresponding to each processor; program information of a target program to be executed is obtained, where the program information is used to indicate the instructions in the target program; and, based on the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program is determined from the at least two processors.
  • the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program.
  • in the method for selecting a processor, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among a variety of processors. The selected processor therefore matches the target program, and there is no need to manually specify a processor, thereby improving the processing efficiency of the computer device and reducing the burden on the programmer.
  • the "instruction set corresponding to the processor" can be understood as the functions that the processor can process, and the hardware information can be used to indicate a function (for example, a function name) that the processor can process; the "instructions in the target program" can be understood as the functions included in the target program, and the program information is used to indicate the functions (for example, function names) included in the target program.
  • determining, from the at least two processors, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority of each of the at least two processors; based on the program information and the hardware information, determining in order, from the highest priority to the lowest, whether each of the at least two processors meets the preset condition; and taking the first processor that meets the preset condition as the target processor.
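The priority-ordered selection described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the names (`select_target`, the `priority` and `instruction_set` fields) and the example hardware information are all assumptions.

```python
# Illustrative sketch of priority-ordered processor selection.
# The preset condition here: the processor's instruction set must
# include every instruction appearing in the target program.

def select_target(processors, program_instructions):
    """Return the first processor, from highest to lowest priority,
    that satisfies the preset condition, or None if none does."""
    for proc in sorted(processors, key=lambda p: p["priority"], reverse=True):
        if set(program_instructions) <= proc["instruction_set"]:
            return proc
    return None

# Hypothetical hardware information; the CPU gets the lowest priority
# so that a coprocessor is preferred whenever it can run the program.
processors = [
    {"name": "NPU", "priority": 3, "instruction_set": {"matmul", "conv2d"}},
    {"name": "GPU", "priority": 2, "instruction_set": {"matmul", "conv2d", "sort"}},
    {"name": "CPU", "priority": 1, "instruction_set": {"matmul", "conv2d", "sort", "io"}},
]

print(select_target(processors, ["matmul", "conv2d"])["name"])  # NPU
```

Because the CPU's instruction set is a superset in this example, it acts as the fallback when no coprocessor qualifies.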
  • the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; and because the CPU's power consumption is high, setting the CPU's priority to the lowest allows a coprocessor to be selected as the target processor whenever possible, further improving the effect and practicality of the present application.
  • the at least two processors include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processor (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  • the ASIC can perform calculations by software.
  • the hardware information is also used to indicate the size of the processor's available memory space, the program information is also used to indicate the memory space required by the target program, and the preset condition further includes that the processor's available memory space is greater than or equal to the memory space required by the target program.
  • the available memory space of the processor may refer to a specified proportion (for example, 90%) of the processor's total memory space, or to a specified proportion of the processor's total free memory space.
  • "the preset condition also includes that the available memory space of the processor is greater than or equal to the memory space required by the target program" may mean that the preset condition is met as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.
  • alternatively, it may mean that the processor's available memory space needs to be greater than or equal to the memory space required by all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
  • because the preset conditions further include that the available memory space of the processor is greater than or equal to the memory space required by the target program, it can be ensured that the selected target processor can support the running of the target program, thereby further improving the practicality of the present application.
  • obtaining the program information of the target program includes: determining the memory space that the target program needs to occupy according to the data dimensions of the target program. This makes it easy to determine the memory space that the target program needs to occupy.
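As a rough sketch of how required memory might be derived from data dimensions, one can multiply each tensor's dimensions together and by the element size. The function name, the example shapes, and the 4-byte (float32) element size below are assumptions for illustration, not details from the patent.

```python
# Hypothetical estimate of the memory a target program needs to occupy,
# derived from the data dimensions (shapes) of its tensors.

def required_memory_bytes(shapes, element_size=4):
    """Sum, over every tensor shape, the product of its dimensions
    multiplied by the size of one element (default: 4-byte float32)."""
    total = 0
    for shape in shapes:
        count = 1
        for dim in shape:
            count *= dim  # number of elements in this tensor
        total += count * element_size
    return total

# e.g. a (1024, 1024) matrix plus a (1024,) vector of float32 values
print(required_memory_bytes([(1024, 1024), (1024,)]))  # 4198400
```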
  • obtaining the program information of the target program includes: determining the program information according to an intermediate representation (IR) of the target program, where the IR of the target program is determined according to domain-specific language (DSL) code of the target program.
  • the DSL code may be handled by a front-end compiler in the computer device, and the IR may be determined by an intermediate compiler in the computer device. Thereby, the program information can be easily obtained.
  • the hardware information of each of the at least two processors is obtained as follows: the hardware information of each processor is obtained according to registration information of that processor, where the registration information is used to register the processor with the computer device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
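A minimal sketch of such registration, assuming the hardware description (instruction set, available memory) is supplied when the processor registers with the computer device. The class and field names are hypothetical, not from the patent.

```python
# Hypothetical registry: each processor registers with the computer
# device, and its hardware description is stored for later queries
# by the selection logic.

class ProcessorRegistry:
    def __init__(self):
        self._table = {}

    def register(self, name, instruction_set, available_memory):
        """Record the hardware description supplied at registration time
        (obtained offline, or read from the processor's driver)."""
        self._table[name] = {
            "instruction_set": set(instruction_set),
            "available_memory": available_memory,
        }

    def hardware_info(self, name):
        """Return the stored hardware information for one processor."""
        return self._table[name]

registry = ProcessorRegistry()
registry.register("NPU", ["matmul", "conv2d"], 8 << 30)  # 8 GiB
print(registry.hardware_info("NPU")["available_memory"])  # 8589934592
```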
  • the computer device includes at least two back-end compilers, the at least two back-end compilers correspond one-to-one to the at least two processors, and each back-end compiler is configured to convert the IR into code recognized by its corresponding processor.
  • the method further includes: inputting the IR of the target program to a target back-end compiler corresponding to the target processor.
  • the IR of the target program may be an IR on which IR optimization has already been performed.
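The one-to-one dispatch from target processor to back-end compiler can be sketched as a simple lookup table. The compiler functions here are stand-ins, not real back-end APIs.

```python
# Hypothetical dispatch of the (optimized) IR to the back-end compiler
# that corresponds one-to-one with the selected target processor.

def compile_for_cpu(ir):
    return f"cpu-code({ir})"

def compile_for_gpu(ir):
    return f"gpu-code({ir})"

# One back-end compiler per processor, keyed by processor name.
BACKENDS = {"CPU": compile_for_cpu, "GPU": compile_for_gpu}

def dispatch(ir, target_processor):
    """Input the IR to the back-end compiler of the target processor."""
    return BACKENDS[target_processor](ir)

print(dispatch("ir-of-target-program", "GPU"))  # gpu-code(ir-of-target-program)
```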
  • a method for selecting a processor includes: acquiring hardware information of each of at least two processors, where the hardware information is used to indicate the size of the processor's available memory space; acquiring program information of a target program, where the program information is used to indicate the memory space that the target program needs to occupy; and, based on the program information and the hardware information, determining from the at least two processors a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the processor's available memory space is greater than or equal to the memory space required by the target program.
  • the available memory space of the processor may refer to a specified proportion (for example, 90%) of the processor's total memory space, or to a specified proportion of the processor's total free memory space.
  • the at least two processors include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processor (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  • "the preset condition also includes that the available memory space of the processor is greater than or equal to the memory space required by the target program" may mean that the preset condition is met as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.
  • alternatively, it may mean that the processor's available memory space needs to be greater than or equal to the memory space required by all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
  • obtaining the program information of the target program includes: determining the memory space that the target program needs to occupy according to the data dimensions of the target program. This makes it easy to determine the memory space that the target program needs to occupy.
  • the hardware information is also used to indicate the instruction set corresponding to the processor, the program information is also used to indicate the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instructions in the target program.
  • the "instruction set corresponding to the processor" can be understood as the functions that the processor can process, and the hardware information can be used to indicate a function (for example, a function name) that the processor can process; the "instructions in the target program" can be understood as the functions included in the target program, and the program information is used to indicate the functions (for example, function names) included in the target program.
  • determining, from the at least two processors, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority of each of the at least two processors; based on the program information and the hardware information, determining in order, from the highest priority to the lowest, whether each of the at least two processors meets the preset condition; and taking the first processor that meets the preset condition as the target processor.
  • the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; and because the CPU's power consumption is high, setting the CPU's priority to the lowest allows a coprocessor to be selected as the target processor whenever possible, further improving the effect and practicality of the present application.
  • obtaining the program information of the target program includes: determining the program information according to an intermediate representation (IR) of the target program, where the IR of the target program is determined according to domain-specific language (DSL) code of the target program.
  • the DSL code may be handled by a front-end compiler in the computer device, and the IR may be determined by an intermediate compiler in the computer device.
  • the hardware information of each of the at least two processors is obtained as follows: the hardware information of each processor is obtained according to registration information of that processor, where the registration information is used to register the processor with the computer device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
  • the computer device includes at least two back-end compilers, the at least two back-end compilers correspond one-to-one to the at least two processors, and each back-end compiler is configured to convert the IR into code recognized by its corresponding processor.
  • the method further includes: inputting the IR of the target program to a target back-end compiler corresponding to the target processor.
  • the IR of the target program may be an IR on which IR optimization has already been performed.
  • an apparatus for selecting a processor includes: an identification module, configured to obtain hardware information of each of at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor; an analysis module, configured to obtain program information of a target program to be executed, where the program information is used to indicate the instructions in the target program; and a selection module, configured to determine, from the at least two processors and based on the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program.
  • with the processor selection apparatus provided in the present application, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among a variety of processors. The selected processor therefore matches the target program, and there is no need to manually specify a processor, thereby improving the processing efficiency of the computer device and reducing the burden on the programmer.
  • the "instruction set corresponding to the processor" can be understood as the functions that the processor can process, and the hardware information can be used to indicate a function (for example, a function name) that the processor can process; the "instructions in the target program" can be understood as the functions included in the target program, and the program information is used to indicate the functions (for example, function names) included in the target program.
  • the selection module is configured to determine a priority of each of the at least two processors and, based on the program information and the hardware information, determine in order, from the highest priority to the lowest, whether each of the at least two processors meets the preset condition, taking the first processor that meets the preset condition as the target processor.
  • by setting a priority for each processor, personalized processing can be achieved, and different processing scenarios can be handled flexibly.
  • the efficiency of determining the target processor can be improved, and the time for determining the target processor can be shortened.
  • the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; and because the CPU's power consumption is high, setting the CPU's priority to the lowest allows a coprocessor to be selected as the target processor whenever possible, further improving the effect and practicality of the present application.
  • the at least two processors include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processor (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  • the ASIC can perform calculations by software.
  • the hardware information is also used to indicate the size of the processor's available memory space, the program information is also used to indicate the memory space required by the target program, and the preset condition further includes that the processor's available memory space is greater than or equal to the memory space required by the target program.
  • the available memory space of the processor may refer to a specified proportion (for example, 90%) of the processor's total memory space, or to a specified proportion of the processor's total free memory space.
  • "the preset condition also includes that the available memory space of the processor is greater than or equal to the memory space required by the target program" may mean that the preset condition is met as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.
  • alternatively, it may mean that the processor's available memory space needs to be greater than or equal to the memory space required by all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
  • because the preset conditions further include that the available memory space of the processor is greater than or equal to the memory space required by the target program, it can be ensured that the selected target processor can support the running of the target program, thereby further improving the practicality of the present application.
  • the analysis unit is configured to determine a memory space required by the target program according to a data dimension of the target program. Therefore, it is possible to easily determine the memory space that the target program needs to occupy.
  • the analysis module is configured to determine the program information according to the intermediate representation (IR) of the target program, where the IR is determined according to domain-specific language (DSL) code of the target program.
  • the identification unit is configured to obtain hardware information of each processor according to registration information of each processor, and the registration information is used to register the processor in the computing device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
  • the computer device includes at least two back-end compilers, the at least two back-end compilers correspond one-to-one to the at least two processors, and each back-end compiler is configured to convert the IR into code recognized by its corresponding processor.
  • the selection unit is used to input the IR of the target program to a target back-end compiler corresponding to the target processor.
  • the IR of the target program may be an IR on which IR optimization has already been performed.
  • an apparatus for selecting a processor includes: an identification unit, configured to obtain hardware information of each of at least two processors, where the hardware information is used to indicate the size of the processor's available memory space; an analysis unit, configured to obtain program information of a target program, where the program information is used to indicate the memory space that the target program needs to occupy; and a selection unit, configured to determine, from the at least two processors, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy.
  • the available memory space of the processor may refer to a specified proportion (for example, 90%) of the processor's total memory space, or to a specified proportion of the processor's total free memory space.
  • "the preset condition also includes that the available memory space of the processor is greater than or equal to the memory space required by the target program" may mean that the preset condition is met as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.
  • alternatively, it may mean that the processor's available memory space needs to be greater than or equal to the memory space required by all programs executed by the target processor, including the target program, for the preset condition to be satisfied.
  • the analysis unit is configured to determine a memory space required by the target program according to a data dimension of the target program. Therefore, it is possible to easily determine the memory space that the target program needs to occupy.
  • the at least two processors include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processor (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  • the hardware information is also used to indicate the instruction set corresponding to the processor, the program information is also used to indicate the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instructions in the target program.
  • the "instruction set corresponding to the processor" can be understood as the functions that the processor can process, and the hardware information can be used to indicate a function (for example, a function name) that the processor can process; the "instructions in the target program" can be understood as the functions included in the target program, and the program information is used to indicate the functions (for example, function names) included in the target program.
  • the selection module is configured to determine a priority of each of the at least two processors and, based on the program information and the hardware information, determine in order, from the highest priority to the lowest, whether each of the at least two processors meets the preset condition, taking the first processor that meets the preset condition as the target processor.
  • the priority of each processor may be determined according to the parallel computing capability of each of the at least two processors.
  • the at least two processors include a central processing unit CPU, and the CPU has the lowest priority among the at least two processors.
  • obtaining the program information of the target program includes: obtaining domain-specific language (DSL) code of the target program; determining an intermediate representation (IR) according to the DSL code; and determining the program information according to the IR.
  • the program information can be easily obtained.
  • the identification unit is configured to obtain hardware information of each processor according to registration information of each processor, and the registration information is used to register the processor in the computing device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
  • the computer device includes at least two back-end compilers, the at least two back-end compilers correspond one-to-one to the at least two processors, and each back-end compiler is configured to convert the IR into code recognized by its corresponding processor.
  • the selection unit is used to input the IR of the target program to a target back-end compiler corresponding to the target processor.
  • the IR of the target program may be an IR on which IR optimization has already been performed.
  • a compiling device configured in a computer device including at least two processors.
  • the device includes a front-end compilation unit, an intermediate compilation unit, a selection unit, and a plurality of back-end compilation units, where the back-end compilation units correspond one-to-one to the processors.
  • the front-end compilation unit is used to obtain the DSL code corresponding to the target program; the intermediate compilation unit is used to determine the IR according to the DSL code; and the selection unit is used to determine the program information of the target program according to the IR, obtain the hardware information of each of the at least two processors, determine from the at least two processors, based on the program information and the hardware information, a target processor for executing the target program, and send the IR to the back-end compilation unit corresponding to the target processor.
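The compiling-device pipeline above (DSL to IR, selection, back end) can be sketched end to end as follows. Every function and data structure here is an illustrative placeholder under the assumptions stated in the comments, not the patent's implementation.

```python
# End-to-end sketch: front end turns DSL into an IR plus program
# information; the selector matches that against hardware information;
# the matching back end turns the IR into target code.

def front_end(dsl_source):
    """Pretend front-end + intermediate stage: DSL -> IR + program info."""
    return {"ir": f"ir({dsl_source})", "instructions": ["matmul"]}

def select(program_info, hardware):
    """Pick the first processor (assumed already in priority order)
    whose instruction set covers the program's instructions."""
    for name, instruction_set in hardware:
        if set(program_info["instructions"]) <= instruction_set:
            return name
    return None

def back_end(ir, target):
    """Pretend back-end compiler for the selected target processor."""
    return f"{target.lower()}-code({ir})"

hardware = [("NPU", {"matmul", "conv2d"}), ("CPU", {"matmul", "conv2d", "io"})]
info = front_end("C = A @ B")
target = select(info, hardware)
print(target, back_end(info["ir"], target))  # NPU npu-code(ir(C = A @ B))
```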
  • the program information is used to indicate an instruction in the target program
  • the hardware information is used to indicate a corresponding instruction set of a processor
  • the target processor is a processor that satisfies a preset condition among the at least two processors
  • the The preset condition includes that the instruction set corresponding to the processor includes instructions in the target program; and / or the hardware information is used to indicate the size of the available memory space of the processor, and the program information is used to indicate the memory space required by the target program.
  • the preset condition includes that the available memory space of the processor is greater than or equal to the memory space required by the target program.
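The two preset conditions above (instruction-set coverage and sufficient available memory) can be sketched as a simple check. The dictionary fields and example opcode names below are assumptions chosen for illustration; the application does not fix a concrete data representation.

```python
# Sketch of the preset-condition check described above (hypothetical data
# structures; field names are assumptions, not from the application).

def meets_preset_condition(program_info, hardware_info):
    """Return True if a processor satisfies both preset conditions.

    program_info:  {"instructions": set of opcode names the program uses,
                    "required_memory": memory space the program needs}
    hardware_info: {"instruction_set": set of opcodes the processor supports,
                    "available_memory": the processor's free memory space}
    """
    # Condition 1: the processor's instruction set includes the program's instructions.
    covers_isa = program_info["instructions"] <= hardware_info["instruction_set"]
    # Condition 2: available memory is greater than or equal to what the program needs.
    enough_mem = hardware_info["available_memory"] >= program_info["required_memory"]
    return covers_isa and enough_mem

program = {"instructions": {"vadd", "vmul"}, "required_memory": 512}
gpu = {"instruction_set": {"vadd", "vmul", "vfma"}, "available_memory": 4096}
dsp = {"instruction_set": {"mac", "vadd"}, "available_memory": 4096}

print(meets_preset_condition(program, gpu))  # True: ISA and memory both match
print(meets_preset_condition(program, dsp))  # False: "vmul" is not supported
```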
  • the hardware information of each processor and the program information of the target program are obtained in advance, and based on them a processor whose hardware information matches the program information is selected from the multiple processors. This matches the selected processor to the target program without the processor having to be specified manually, thereby improving the processing efficiency of the computer equipment and reducing the burden on the programmer.
  • the selection unit is configured to determine a priority of each of the at least two processors; based on the program information and the hardware information, check in order of priority from high to low whether each processor meets the preset condition; and use the first processor that meets the preset condition as the target processor.
  • by assigning a priority to each processor, personalized processing can be achieved, and different processing scenarios can be handled flexibly.
  • the efficiency of determining the target processor can be improved, and the time for determining the target processor can be shortened.
  • the priority of each processor may be determined according to the parallel computing capability of each of the at least two processors.
  • the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that there is a processor capable of processing the target program among the at least two processors, and, because the power consumption of the CPU is high, setting the CPU's priority to the lowest allows a coprocessor to be selected as the target processor first. The effect and practicability of the present application can therefore be further improved.
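The priority-ordered selection described above can be sketched minimally as follows: processors are tried from highest to lowest priority, and the first one meeting the preset condition becomes the target, with the CPU given the lowest priority so that a qualifying coprocessor is preferred. All names and priority values here are illustrative assumptions.

```python
# Minimal sketch of priority-ordered target-processor selection.
# Processor names, priority values, and the predicate shape are assumptions.

def select_target_processor(processors, can_run):
    """processors: list of (name, priority) pairs; higher priority is tried first.
    can_run: predicate telling whether a processor meets the preset condition."""
    for name, _prio in sorted(processors, key=lambda p: p[1], reverse=True):
        if can_run(name):
            return name  # first processor meeting the preset condition
    return None          # no processor satisfies the preset condition

# The CPU gets the lowest priority; the NPU/GPU are preferred when they qualify.
processors = [("cpu", 0), ("gpu", 2), ("npu", 3)]

# Suppose the NPU lacks an instruction the program needs, but the GPU qualifies.
supported = {"cpu": True, "gpu": True, "npu": False}
print(select_target_processor(processors, supported.__getitem__))  # "gpu"
```

When no coprocessor qualifies, the scan naturally falls through to the CPU, matching the "CPU lowest priority" behavior described above.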
  • the at least two processors include at least two of the following processors: a CPU, a graphics processor GPU, a field programmable gate array FPGA, an application specific integrated circuit ASIC, a neural network processor NPU, an image processing unit IPU, or Digital Signal Processing DSP.
  • the identification unit is configured to obtain hardware information of each processor according to registration information of each processor, and the registration information is used to register the processor in the computing device.
  • the registration information may include hardware description information.
  • the hardware description information may be obtained offline by the computer device before the processor is installed.
  • the hardware description information may be obtained by the computer device from the driver information of the processor when the processor is installed.
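The registration flow above might be sketched as follows. The registry structure and field names are hypothetical; they stand in for hardware description information obtained either offline before installation or from the driver information when the processor is installed.

```python
# Hedged sketch of obtaining hardware information from registration information.
# All field names ("instruction_set", "memory") are illustrative assumptions.

registry = {}

def register_processor(name, hardware_description):
    """Record a processor's hardware description at registration time."""
    registry[name] = hardware_description

def hardware_info(name):
    """Later, the selection unit reads hardware information back from the registry."""
    return registry[name]

# e.g. a description obtained from the driver when the GPU is installed
register_processor("gpu0", {"instruction_set": {"vadd", "vmul"}, "memory": 8 << 30})
print(hardware_info("gpu0")["memory"])  # 8589934592 (8 GiB)
```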
  • a computer device which includes multiple processors, a compiler, and a selection device, where the selection device executes the method in the first aspect and any possible implementation manner thereof, or the method in the second aspect and any of its possible implementation manners.
  • the compiler includes a front-end compiler, an intermediate compiler, and a back-end compiler.
  • a chip or chipset including at least one processor and at least one memory control unit.
  • the processor executes the method in the first aspect and any possible implementation manner thereof, or the second aspect and any of its possible implementation manners.
  • the chip or chipset may include a smart chip.
  • the smart chip may include at least two processors.
  • a computer system including a processor and a memory.
  • the processor includes at least two processors and a memory control unit, and the processor executes the method in the first aspect and any possible implementation manner. Or, the method in the second aspect and any one of the possible implementation manners.
  • the computing system further includes a system bus for connecting the processor (specifically, a memory control unit) and a memory.
  • a computer program product includes a computer program (also referred to as code or instructions).
  • the processor executes the method in the foregoing first aspect and any one of its possible implementations, or the method in the foregoing second aspect and any of its possible implementations.
  • a computer-readable medium stores a computer program (also referred to as code, or instructions) that, when run on a processor or a processor in a chip, causes the processor to execute the method in the foregoing first aspect and any one of its possible implementations, or the method in the foregoing second aspect and any of its possible implementations.
  • FIG. 1 is a schematic hardware structural diagram of a computer device (or computer system) to which the method and apparatus according to an embodiment of the present application are applied.
  • FIG. 2 is a schematic diagram of an example of a lexical analysis process of the present application.
  • FIG. 3 is a schematic diagram of an example of a syntax analysis process of the present application.
  • FIG. 4 is a schematic diagram of an example of an intermediate code generation and optimization process of the present application.
  • FIG. 5 is a schematic flowchart of an example of a method for selecting a processor according to the present application.
  • FIG. 6 is a schematic flowchart of another example of a method for selecting a processor according to the present application.
  • FIG. 7 is a schematic diagram of an example of a compilation method of the present application.
  • FIG. 8 is a schematic configuration diagram of an example of a processor selection device of the present application.
  • FIG. 9 is a schematic configuration diagram of an example of a compiler device of the present application.
  • a computing device can also be referred to as a computer system. From a logical hierarchical perspective, a computing device can include a hardware layer, an operating system layer running on the hardware layer, and an application layer running on the operating system layer.
  • the hardware layer includes hardware such as a processor, a memory, and a memory control unit. The functions and structure of the hardware are described in detail later.
  • the operating system may be any one or more computer operating systems that implement business processing through processes, such as a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a windows operating system.
  • the application layer contains applications such as browsers, address books, word processing software, and instant messaging software.
  • the computer system may be a handheld device such as a smart phone, or a terminal device such as a personal computer; this is not particularly limited in the present application, as long as the system can read the program code of the method of the embodiment and run that code to carry out the method according to the embodiment of the present application.
  • the execution subject of the method in the embodiment of the present application may be a computer system, or a functional module in the computer system, such as a processor, that can call a program and execute the program.
  • a program or program code refers to a set of ordered instructions (or codes) used to implement some relatively independent function.
  • a process is a running process of a program and its data on a computer device.
  • the program usually adopts a modular design, that is, the function of the program is broken down into multiple smaller functional modules.
  • the program contains at least one function.
  • a function is a code segment that implements a functional module. Therefore, functions are the basic unit of program function modularity, and can also be regarded as subroutines.
  • FIG. 1 is a schematic structural diagram of a computing device 100 according to an embodiment of the present application. The computing device shown in FIG. 1 may be used to perform the methods of the embodiments of the present application.
  • the computing device 100 may include: at least two processors 110, and a memory 120.
  • the computing device 100 may further include a system bus, where the processor 110 and the memory 120 are respectively connected to the system bus.
  • the processor 110 can access the memory 120 through the system bus.
  • the processor 110 can read and write data or execute code in the memory 120 through the system bus.
  • the function of the processor 110 is mainly to interpret instructions (or codes) of a computer program and to process data in computer software.
  • the instructions of the computer program and data in the computer software may be stored in the memory 120 or the cache unit 116.
  • the processor 110 may be an integrated circuit chip or a component therein, and has a signal processing capability.
  • the processor 110 may fetch instructions from a memory or a cache memory, place them in an instruction register, and decode the instructions. It breaks down instructions into a series of micro-operations, and then issues various control commands to execute a series of micro-operations to complete the execution of an instruction.
  • An instruction is a basic command by which the computer specifies the type of operation to perform and its operands. An instruction is composed of one or more bytes, including an opcode field, one or more fields related to operand addresses, and some status words and feature codes that characterize the state of the machine. Some instructions directly contain the operand itself.
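As a toy illustration of the opcode-plus-operand-fields layout described above: this 16-bit format, with a 4-bit opcode and two 6-bit register fields, is invented purely for illustration and does not correspond to any real instruction set.

```python
# Made-up 16-bit instruction format: [4-bit opcode | 6-bit src | 6-bit dst].
# Purely illustrative; not any real ISA.

def encode(opcode, src, dst):
    """Pack the opcode field and two operand-address fields into one word."""
    return (opcode << 12) | (src << 6) | dst

def decode(word):
    """Split a 16-bit instruction word back into its fields."""
    opcode = (word >> 12) & 0xF   # top 4 bits: the opcode field
    src = (word >> 6) & 0x3F      # next 6 bits: source operand field
    dst = word & 0x3F             # low 6 bits: destination operand field
    return opcode, src, dst

word = encode(0x3, 5, 9)   # e.g. "ADD r9, r5" in this toy format
print(decode(word))        # (3, 5, 9)
```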
  • the processor 110 may include a memory control unit 114 and at least one processing unit 112.
  • the processing unit 112 may also be referred to as a core or kernel, and is the most important component of the processor.
  • the processing unit 112 may be manufactured from monocrystalline silicon using a particular production process.
  • the calculation, command receiving, command storing, and data processing of the processor 110 are all performed by the core.
  • the processing unit 112 can run the program instructions independently, and use the ability of parallel computing to accelerate the running speed of the program.
  • Various processors 110 have a fixed logical structure.
  • the processor 110 includes a logical unit such as a first-level cache, a second-level cache, an execution unit, an instruction-level unit, and a bus interface.
  • the memory control unit 114 is configured to control data interaction between the memory 120 and the processing unit 112. Specifically, the memory control unit 114 may receive a memory access request from the processing unit 112 and control access to the memory based on the memory access request.
  • the memory control unit may be a device such as a memory management unit (MMU).
  • each memory control unit 114 may address the memory 120 through a system bus.
  • an arbiter (not shown) may be configured in the system bus, and the arbiter may be responsible for processing and coordinating competing accesses of the plurality of processing units 112.
  • the processing unit 112 and the memory control unit 114 may be connected through a connection line inside the chip, such as an address line, to implement communication between the processing unit 112 and the memory control unit 114.
  • each processor 110 may further include a cache unit 116, where the cache unit 116 is a buffer (called a cache) for data exchange.
  • when the processing unit 112 wants to read data, it first looks for the required data in the cache unit 116; if found, the data is used directly, and if not, the data is fetched from the memory 120. Since the cache unit 116 runs much faster than the memory 120, the role of the cache unit 116 is to help the processing unit 112 run faster.
  • the memory 120 may provide a running space for a process in the computing device 100.
  • the memory 120 may store a computer program (specifically, a program code) for generating a process, and the memory 120 may store a process Data generated during operation, for example, intermediate data, or process data.
  • the memory may also be called internal memory; its function is to temporarily hold the operational data of the processor 110 and the data exchanged with external storage such as a hard disk. While the computer is running, the processor 110 transfers the data to be operated on into the memory 120, and the processing unit 112 transmits the result out after the operation is completed.
  • the memory 120 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • the volatile memory may be a random access memory (RAM), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus RAM (DR RAM).
  • the structure of the computing device 100 listed above is only an exemplary description, and the present application is not limited thereto.
  • the computing device 100 in the embodiment of the present application may include various hardware in a computer system in the prior art.
  • the computing device 100 may further include other memories besides the memory 120, for example, disk storage.
  • a virtualization technology may be applied on the computing device 100.
  • the computer device 100 can run multiple virtual machines at the same time, each virtual machine can run at least one operating system, and each operating system runs multiple programs.
  • Virtual machine refers to a complete computer system with complete hardware system functions and running in a completely isolated environment simulated by software.
  • the processor 110 may include a plurality of categories. For example, different kinds of processors may use different kinds of instructions. As another example, different types of processors may have different computing capabilities. As another example, different kinds of processors can be used to handle different types of calculations.
  • the various processors may include a general purpose processor and a coprocessor. The above-mentioned various processors are respectively described in detail below.
  • a general-purpose processor can also be referred to as a central processing unit (CPU), which is a very large-scale integrated circuit or a component thereof, and is a computing core (Core) and a control core (Control Unit) of a computer. Its function is mainly to interpret computer instructions and process data in computer software.
  • the central processing unit mainly includes the arithmetic logic unit (ALU), the cache memory (Cache), and the data bus, control bus, and status bus that connect them. Together with internal memory and input/output (I/O) equipment, these are called the three core components of an electronic computer.
  • the CPU includes an arithmetic logic unit, a register unit, a control unit, and the like.
  • the arithmetic logic unit is the operational logic component; it can perform fixed-point or floating-point arithmetic operations, shift operations, and logical operations, as well as address operations and conversions.
  • Registers include general purpose registers, special purpose registers, and control registers.
  • General-purpose registers can be divided into fixed-point and floating-point registers; they temporarily store operands and intermediate (or final) operation results during instruction execution.
  • the control unit is mainly responsible for decoding the instructions and sending out control signals to complete each operation to be performed by each instruction.
  • Microcode is maintained in the micro memory, and each microcode corresponds to a basic micro operation, also called microinstruction; each instruction is composed of different sequences of microcode, and this microcode sequence constitutes a microprogram.
  • when the central processor decodes an instruction, it sends out timing control signals and, in the order of the given sequence, executes in micro-cycles the micro operations determined by the microcode, thereby completing the execution of the instruction.
  • Simple instructions are composed of three to five micro operations, while complex instructions are composed of dozens or even hundreds of micro operations.
  • a coprocessor is a chip, or part of a chip, used to offload specific processing tasks from the system's main microprocessor.
  • a coprocessor is a processor developed to assist the central processing unit in performing processing tasks that the CPU cannot perform, or performs slowly and inefficiently.
  • to that end, various auxiliary processors came into being. It should be noted that, since the integer arithmetic unit and the floating-point arithmetic unit are already integrated in current computers, the floating-point processor is no longer an auxiliary processor.
  • the coprocessor built into the CPU is not necessarily an auxiliary processor. Of course, the coprocessor can also exist independently.
  • the coprocessor can be used for specific processing tasks, for example, a mathematical coprocessor can control digital processing; a graphics coprocessor can handle video rendering.
  • the coprocessor can be attached to a general-purpose processor.
  • a coprocessor extends the general-purpose processor core processing capabilities by extending the instruction set or providing configuration registers.
  • One or more coprocessors can be connected to a general-purpose processor core through a coprocessor interface.
  • the coprocessor can also expand the instruction set by providing a new set of specialized instructions.
  • the coprocessor may include, but is not limited to, at least one of the following processors:
  • Graphics processing unit (GPU): also known as the display core, visual processor, or display chip, the GPU is a microprocessor dedicated to graphics and image arithmetic.
  • the purpose of the GPU is to convert and drive the display information required by the computer system, and provide line scanning signals to the display to control the correct display of the display. It is an important component that connects the display and the main board of the personal computer, and is also an important device for "human-machine dialogue".
  • the processor of a graphics card is sometimes called a graphics processor (GPU).
  • GPUs have 2D or 3D graphics acceleration capabilities. If the CPU wants to draw a two-dimensional graphic, it only needs to send an instruction to the GPU, such as "draw a rectangle with length and width a × b at coordinate position (x, y)"; the GPU can quickly compute and draw the corresponding graphic at the specified position on the display, notify the CPU that drawing is finished, and then wait for the CPU's next graphics instruction. With a GPU, the CPU is freed from graphics processing tasks and can perform other system tasks, which greatly improves the overall performance of the computer. Because the GPU generates a lot of heat, a radiator or fan is usually installed above it.
  • the GPU is the "brain" of the graphics card.
  • the GPU determines the grade and most of the performance of the graphics card.
  • the GPU is also the basis for the difference between a 2D graphics card and a 3D graphics card.
  • a 2D display chip mainly relies on the processing power of the CPU when processing 3D images and special effects, which is called soft acceleration.
  • a 3D display chip integrates the three-dimensional image and special-effects processing functions into the display chip itself, the so-called "hardware acceleration" function.
  • the display chip is generally the largest chip (and also has the most pins) on the display card.
  • the GPU is no longer limited to 3D graphics processing.
  • the development of GPU general computing technology has attracted a lot of attention in the industry.
  • practice has also proven that, in floating-point computing, parallel computing, and other such workloads, the GPU can provide dozens or even hundreds of times the performance of the CPU.
  • the GPU enables computer equipment to reduce its dependence on the CPU and share some of the work that was originally performed by the CPU.
  • the field programmable gate array (FPGA) is a programmable device developed further from products such as Programmable Array Logic (PAL), Generic Array Logic (GAL), and the Complex Programmable Logic Device (CPLD).
  • the FPGA appears as a semi-custom circuit in the field of the Application Specific Integrated Circuit (ASIC); it not only remedies the shortcomings of fully custom circuits but also overcomes the drawback of the limited gate count of the original programmable devices. System designers can connect the logic blocks inside the FPGA through editable connections as needed, as if a circuit test board had been placed inside a chip. The logic blocks and connections of a finished FPGA can still be changed by the designer after it leaves the factory, so the FPGA can implement the required logic functions.
  • the FPGA uses a logic cell array (LCA, Logic Cell Array), which includes three parts: a configurable logic module (CLB, Configurable Logic Block), an input output module (IOB, Input Output Block), and an internal connection (Interconnect).
  • through different programming methods, the FPGA can take on different structures compared with traditional logic circuits and gate arrays (such as PAL, GAL, and CPLD devices).
  • the FPGA uses small lookup tables (16×1 RAM) to implement combinational logic. Each lookup table is connected to the input of a D flip-flop, and the flip-flop in turn drives other logic circuits or I/O.
  • this forms a basic logic cell module that can implement both combinational logic and sequential logic functions.
  • the logic of the FPGA is implemented by loading programming data into internal static storage cells.
  • the values stored in the storage cells determine the logic functions of the logic cells and the connection modes among the modules and between the modules and the I/O, and ultimately determine the functions the FPGA can realize; the FPGA allows unlimited reprogramming.
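The 16×1 lookup-table mechanism described above can be mimicked in software: any 4-input combinational function is realized by storing its 16-entry truth table in a small memory and indexing that memory with the input bits. A sketch:

```python
# Software model of an FPGA lookup table (LUT): a 16x1 RAM indexed by 4 inputs.
# The truth table contents are the "programming data" loaded into the cell.

def make_lut(truth_table):
    """truth_table: list of 16 bits, indexed by the 4 input bits (a, b, c, d)."""
    assert len(truth_table) == 16
    def lut(a, b, c, d):
        index = (a << 3) | (b << 2) | (c << 1) | d  # inputs form the RAM address
        return truth_table[index]
    return lut

# "Program" the LUT to compute a 4-input AND: only entry 0b1111 holds a 1.
and4 = make_lut([0] * 15 + [1])
print(and4(1, 1, 1, 1))  # 1
print(and4(1, 0, 1, 1))  # 0
```

Reprogramming the cell is just loading a different truth table, which mirrors how the stored values determine the logic function of the cell.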
  • the FPGA does not include an instruction set
  • the method 200 described below may not be used to determine whether the FPGA can be used as a target processor.
  • a method 300 described later can be used to determine whether the FPGA can be used as a target processor.
  • Neural network processors adopt a "data-driven parallel computing" architecture and are particularly good at processing massive multimedia data such as video and images.
  • NPU can be used for deep learning.
  • deep learning is actually a type of multilayer large-scale artificial neural network. It is modeled after a biological neural network and consists of several artificial neuron nodes interconnected. Neurons are connected one by one through synapses. Synapses record the weight of the connections between neurons. Each neuron can be abstracted into a stimulus function whose input is determined by the output of the neuron connected to it and the synapses that connect the neuron.
  • the basic operation of deep learning is the processing of neurons and synapses.
  • the traditional processor instruction set was developed for general-purpose computing; its basic operations are arithmetic operations (addition, subtraction, multiplication, and division) and logical operations (AND, OR, NOT). Completing the processing of a single neuron often takes hundreds or even thousands of instructions, so the processing efficiency for deep learning is not high.
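The neuron abstraction described earlier can be written out directly: a neuron's output is a stimulus (activation) function applied to the weighted sum of the outputs of the neurons connected to it, with the synapses holding the connection weights. The sigmoid used here is one common choice of stimulus function, not one mandated by the application.

```python
import math

# One neuron: output = stimulus(sum of connected outputs x synapse weights).
# The sigmoid activation is an illustrative assumption.

def neuron_output(inputs, weights, bias=0.0):
    """inputs: outputs of connected neurons; weights: synapse weights."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))   # sigmoid as the stimulus function

out = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.5])
print(round(out, 3))  # sigmoid(1.0) ≈ 0.731
```

An NPU instruction, as described below, would process a whole group of such neurons at once rather than one multiply-accumulate at a time.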
  • NPU instructions directly face the processing of large-scale neurons and synapses.
  • one instruction can complete the processing of a group of neurons, and a series of specialized on-chip supports are provided for the transmission of neuron and synapse data.
  • the storage and processing in the neural network are integrated, and both are represented by synaptic weights.
  • an application specific integrated circuit (ASIC) is an integrated circuit made for a specific user or a specific electronic system.
  • the universality and mass production of digital integrated circuits has greatly reduced the cost of electronic products and promoted the popularization of computer communications and electronic products.
  • however, it has also created a contradiction between general-purpose and special-purpose applications, and a disconnect between system design and circuit production.
  • moreover, the larger the scale of an integrated circuit, the more difficult it is to change it for the special requirements that arise when building a system.
  • ASICs featuring user participation in design have emerged, which can realize the optimized design of the entire system, with superior performance and strong confidentiality.
  • ASICs can be used to execute software programs, or they can perform calculations through hardware logic instead of software programs.
  • an ASIC executing a software program may include one or more processor cores to execute instructions and have a corresponding instruction set.
  • Digital signal processing is a theory and technology that represents and processes signals digitally.
  • digital signal processing and analog signal processing are both subsets of signal processing.
  • the purpose of digital signal processing is to measure or filter continuous analog signals in the real world. Therefore, before performing digital signal processing, the signal needs to be converted from the analog domain to the digital domain, which is usually achieved by an analog-to-digital converter. And the output of digital signal processing often needs to be transformed into the analog domain, which is realized by a digital-to-analog converter.
  • DSP is a special-purpose chip for digital signal processing. It is a new device that is accompanied by the development of microelectronics, digital signal processing technology, and computer technology.
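As a toy example of the digitize-then-process pipeline described above, the following applies a 3-tap moving-average filter (a simple FIR low-pass) to already-sampled data; in a full pipeline the samples would come from an analog-to-digital converter and the filtered output would go to a digital-to-analog converter. The filter choice is purely illustrative.

```python
# Minimal digital-signal-processing sketch: ADC samples -> digital filter.
# A 3-tap moving average is used as an illustrative FIR low-pass filter.

def moving_average(samples, taps=3):
    """Average each sample with the previous (taps - 1) samples."""
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - taps + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

# Noisy sampled signal (as would come from an analog-to-digital converter)
samples = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(moving_average(samples))
```

The rapidly alternating input is smoothed toward its mean, which is exactly the filtering role the DSP plays on continuous real-world signals after conversion to the digital domain.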
  • an image processing unit can also be called an image signal processor, which can be used to process the output signal of a front-end image sensor so as to match image sensors from different manufacturers. It can also provide comprehensive support for end-to-end data-stream signal processing from image input (camera sensor, TV signal input, etc.) to display devices (e.g., an LCD screen, TV output, or an external image processing unit).
  • processors in this application may further include programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like.
  • the above-mentioned structure including multiple processors may be referred to as a heterogeneous architecture or a heterogeneous system architecture.
  • Heterogeneous computing mainly refers to the calculation method of the system composed of different types of instruction sets and architecture computing units.
  • here, heterogeneity refers to combining computing units of different instruction-set types and different architectures, such as CPUs, DSPs, GPUs, ASICs, coprocessors, and FPGAs, into a mixed system that performs specialized calculations; this method is called "heterogeneous computing".
  • heterogeneous computing has great potential, especially in the field of artificial intelligence.
  • AI imposes ultra-high requirements on computing power.
  • heterogeneous computing represented by GPU has become a new generation of computing architecture to accelerate AI innovation.
  • when multiple processors work together, the CPU can devote most of its resources to cache and logic control (that is, non-computing units) and only a small part to computation. This shows that the CPU is suited to running serial programs characterized by intensive branching, irregular data structures, and recursion.
  • dedicated computing modules are added to the system as accelerators, such as graphics processing units (GPUs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), and other programmable logic units.
  • the heterogeneous system architecture (HSA) introduces a new system architecture and execution standards with the goal of optimizing heterogeneous computing.
  • the ultimate goal is to perform collaborative operations through heterogeneous architectures of the cores (including CPU, GPU, DSP, and other processors) within the SoC. In this way, the performance of each architecture in the entire SoC can be maximized.
  • Heterogeneous system architecture enables multiple processors to implement unified memory addressing.
  • Heterogeneous computing is a special form of parallel and distributed computing. It can be carried out either by a single independent computer that supports both the single instruction, multiple data (SIMD) method and the multiple instruction, multiple data (MIMD) method, or by a group of independent computers interconnected by a high-speed network. It coordinates the use of machines of different performance and structure to meet different computing needs, and enables code (or code segments) to be executed in a way that maximizes overall performance.
  • Heterogeneous computing technology is a parallel and distributed computing technology that enables the type of parallelism (code type) of computing tasks to best match the type of computing that the machine can effectively support (that is, machine capabilities) and makes the best use of various computing resources.
  • the above chip with a heterogeneous system architecture may be called an artificial intelligence (AI) chip, or an accelerated processing unit (APU).
  • a processor for executing a target process may be selected from the foregoing multiple processors.
  • the processor executes the target program by executing the code of the target program.
  • different types of processors may have different instruction-set architectures (ISA).
  • ISA instruction-set architectures
  • different types of processors may have different instruction sets.
  • the instruction set is stored or integrated in the processor in the form of hardware; it is a hardware-level program that guides and optimizes processor operations. The processor can run more efficiently through the instruction set.
  • Compiling is the process of converting a program written in one programming language (source language) to another language (target language).
  • source language programming language
  • target language target language
  • the compiler used by the above compilation technology may include, but is not limited to, the following structures:
  • the front-end compiler is used to implement the conversion from a source program (or source code) to an intermediate representation (IR); that is, the user first describes the computing operator using a domain description language (DSL) as the input to the front-end compiler.
  • the processing of the front-end compiler mainly includes lexical analysis, syntax analysis and semantic analysis.
  • Syntax analysis further obtains an Abstract Syntax Tree (AST) from the sequence of tokens.
  • AST Abstract Syntax Tree
  • Semantic analysis identifies the types of variables, the scope of operations, etc.
  • the front-end compiler may also be referred to as a front-end compilation device or a front-end compilation unit.
  • the intermediate compiler is used for code generation and optimization.
  • machine-independent optimizations such as: constant combination, extraction of common subexpressions, unrolling and merging of loops, code extraction (moving constant calculations out of loops), etc.
  • Machine-related optimizations such as: the use of registers (putting commonly used quantities into registers to reduce the number of memory accesses), and storage strategies (arranging the cache and parallel storage architecture according to the memory-access requirements of the algorithm to reduce access conflicts).
  • the intermediate compiler may also be referred to as an intermediate compilation device or an intermediate compilation unit.
  • The back-end compiler is mainly used for object code generation. There may be multiple back-end compilers, corresponding to multiple kinds of processors; each back-end compiler is used to convert the input optimized IR into object code (or an instruction or a function) that can run on the corresponding processor, where the object code may be instruction code or assembly code.
  • object code or an instruction or a function
  • the back-end compiler may also be referred to as a back-end compilation device or a back-end compilation unit.
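The three-stage structure described above (front-end compiler → intermediate compiler → one backend compiler per processor) can be sketched as follows. This is an illustrative sketch only; all function names, the token-based "analysis", and the backend names are hypothetical stand-ins, not from the patent.

```python
def front_end_compile(source_code):
    """Stand-in for lexical, syntax, and semantic analysis: source -> IR."""
    tokens = source_code.split()                     # lexical analysis (stand-in)
    ast = {"op": tokens[0], "args": tokens[1:]}      # syntax analysis (stand-in)
    return ast                                       # serves as the IR here

def intermediate_optimize(ir):
    """Stand-in for machine-independent and machine-related IR optimizations."""
    return ir  # e.g. constant combination, loop unrolling (omitted)

# One backend per processor type, each converting IR into that
# processor's object code (hypothetical backends).
BACKENDS = {
    "cpu": lambda ir: f"cpu-code({ir['op']})",
    "gpu": lambda ir: f"gpu-code({ir['op']})",
}

def back_end_compile(ir, processor):
    """The backend corresponding to the chosen processor emits object code."""
    return BACKENDS[processor](ir)

ir = intermediate_optimize(front_end_compile("add x y"))
print(back_end_compile(ir, "cpu"))  # cpu-code(add)
```

The key structural point is that the IR is processor-independent, so the backend can be chosen after the front-end and intermediate stages have run.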
  • FIG. 5 is a schematic flowchart of an example of a method 200 for selecting a processor according to the present application.
  • the execution body of the method 200 (hereinafter, referred to as processing node #A for ease of understanding and description) may be any processor among multiple processors in a computing device, such as a central processing unit.
  • processing node #A may be a virtual machine running in a computing device.
  • the processing node #A may be the above-mentioned back-end compiler, or may be a device independent of the above-mentioned back-end compiler, which is not particularly limited in this application.
  • the method 200 selects a target processor based on instructions. Since the FPGA does not include an instruction set, the method 200 may not be used to determine whether the FPGA can be used as a target processor.
  • the method 200 can be used to determine whether the ASIC can be used as a target processor.
  • the method 200 may not be used to determine whether the ASIC can serve as a target processor.
  • the processing node #A may obtain hardware information of each of the two processors included in the computing device 100.
  • the manufacturer of the computing device 100 may pre-configure the hardware information of the processors included in the computing device 100 in the computing device 100 when the computing device 100 leaves the factory, so that the processing node #A may obtain the hardware information of each of the two processors included in the computing device 100 based on the relevant information of this factory configuration.
  • the manufacturer of the computing device 100 may save the hardware information of each processor included in the computing device 100 on the server, so that the processing node #A is connected to the server through the network in advance in S210.
  • alternatively, the hardware information of each of the two processors included in the computing device 100 may be input to the processing node #A.
  • each processor may be installed in a hot-pluggable manner, and a driver of each processor may complete registration of each processor during hot-plugging.
  • the processing node #A may obtain the hardware information of each of the two processors included in the computing device 100 based on the registration information of each processor or related information in a driver.
  • the computer device 100 may have a processor registration information collection function, so as to be able to identify which heterogeneous hardware is supported in the computer device 100, and, according to the identified hardware, register the backend corresponding to each processor at system startup. Therefore, the processing node can determine the hardware information of each processor according to the registration information of the backend corresponding to each processor.
  • hardware information of a processor may include information of an instruction set corresponding to the processor.
  • the hardware information of a processor may include information about the names of instructions that the processor can execute.
  • the hardware information of a processor may include information about the names of functions that the processor can execute.
  • the processing node #A can determine program information of a program (that is, an example of a target program, which is described as: program #A) that needs to be currently run.
  • program information may be determined according to the IR of the program #A.
  • the front-end compiler can obtain the source program code of program #A (denoted as: code #A).
  • the compiler may provide, for example, a domain description language interface (DSL interface) for developers to call, so as to write the operator in the DSL (that is, an example of code #A); thereafter, the intermediate compiler may convert the code #A (for example, the DSL) corresponding to the program #A into the IR of the program #A. And, in this application, the intermediate compiler may also optimize the IR of the program #A. Therefore, the processing node #A can determine the program information of the program #A from the IR (for example, the optimized IR) of the program #A.
  • DSL interface domain description language interface
  • the processing node #A may itself serve as the front-end compiler and the intermediate compiler of the code #A. In this case, the processing node #A can directly obtain the IR of the program #A.
  • the front-end compilation and the intermediate compilation of the code #A may be implemented by the processing node #B. In this case, the processing node #A may also communicate with the processing node #B, so that the processing node #B may send the IR of the program #A to the processing node #A.
  • the program information of the program #A may include instructions (denoted as: instruction #A) included in the code (for example, optimized IR) of the program #A.
  • the instruction #A may include one instruction or multiple instructions, which is not particularly limited in this application.
  • the program information of the program #A may include the name of the instruction in the IR of the program #A.
  • the program information of the program #A may include the names of functions in the IR of the program #A.
  • the processing node #A may determine a target processor (denoted as processor # 1) from a plurality of processors based on the program information of the program #A and the hardware information of each processor.
  • the processor # 1 may be a processor whose corresponding instruction set in the multiple processors includes the instruction #A.
  • the processor # 1 may be a processor among the multiple processors that meets the constraint #A.
  • the constraint condition #A includes that the instruction set corresponding to the processor includes the instruction #A.
  • the processing node #A may determine the priority of each processor in the multiple processors.
  • the processing node #A may determine the priority of each processor according to the parallel computing capability of each of the multiple processors; that is, in this application, the priority of a processor with high parallel computing capability is higher than that of a processor with low parallel computing capability.
  • Parallel computing is defined relative to serial computing; it refers to executing multiple instructions at one time.
  • Its purpose is to increase computing speed, and to solve large and complex computing problems by expanding the problem-solving scale.
  • the so-called parallel computing can be divided into parallel in time and parallel in space.
  • Temporal parallelism refers to pipeline technology
  • spatial parallelism refers to the use of multiple processors to perform calculations concurrently.
  • the processing node #A may determine the priority of each processor according to the types of the multiple processors. For example, in this application, the priority of a special-purpose processor is higher than that of a general-purpose processor, and, optionally, the general-purpose processor may be the processor with the lowest priority among the multiple processors. Therefore, the processing node #A may sequentially determine, for example in descending order of priority, whether each processor satisfies the above constraint condition #A. And, optionally, the processing node #A may determine the first processor satisfying the constraint condition #A as the processor # 1.
  • the processing node #A may stop determining other processors after determining the processor # 1.
  • different processors will have different instruction sets. The instructions used to implement the same function will be different on different chips.
  • the instruction set of processor #a is intrin #a, and the instruction set of processor #b is intrin #b.
  • processor #b is the processor with the lowest default priority.
  • processor #b may be a general-purpose processor, and processor #a is a dedicated processor; that is, processor #b may also be able to implement the functions, but its parallel computing power is not as good as that of the dedicated processor #a.
  • the IR description of program #A is obtained after IR processing.
  • the analysis shows that the program uses two instructions, mul (multiplication) and add (addition). Thereafter, the processing node #A preferentially determines whether the above instructions belong to intrin #a; if the determination is "yes", it selects processor #a as the processor # 1; if the determination is "no", it selects processor #b as the processor # 1.
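The selection rule in this example can be sketched as follows: the processing node walks the processors in priority order (dedicated processor first) and picks the first one whose instruction set contains every instruction used by the target program, falling back to the general-purpose processor. The processor names and instruction sets below are illustrative assumptions, not from the patent.

```python
# Hypothetical processors, ordered from highest to lowest priority.
processors = [
    {"name": "processor_a", "intrin": {"mul", "add", "fma"}},            # dedicated
    {"name": "processor_b", "intrin": {"mul", "add", "sub",
                                       "div", "mod"}},                   # general-purpose
]

def select_by_instructions(program_instructions, processors):
    """Constraint #A: the processor's instruction set must include
    every instruction appearing in the program's IR."""
    for proc in processors:                          # high priority first
        if program_instructions <= proc["intrin"]:   # subset test
            return proc["name"]                      # stop at the first match
    return processors[-1]["name"]                    # general-purpose fallback

# Program #A's IR uses mul and add, so the dedicated processor is chosen.
print(select_by_instructions({"mul", "add"}, processors))  # processor_a
```

If the program used an instruction outside intrin #a (say `div`), the scan would fall through to processor #b, matching the "no" branch described above.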
  • the hardware information obtained by the processing node #A in S210 further includes information about the size of the currently available memory space of each processor.
  • the size of the currently available memory space of the processor may be, for example, 90% of the free space (or memory capacity) of the processor.
  • the program information obtained by the processing node #A at S220 further includes information on the size of the memory space required for the operation of the program #A.
  • the processor # 1 may be a processor whose corresponding instruction set of the multiple processors includes the instruction #A, and the currently available memory space is greater than or equal to the memory space required for the operation of the program #A.
  • the processor # 1 may be a processor among the multiple processors that meets the constraint #A and the constraint #B.
  • the constraint #B includes: the current free space of the processor is greater than or equal to the memory space required for the operation of the program #A (denoted as: space #A).
  • the processing node #A may determine the space #A according to the data dimension of the program #A (or the code of the program #A).
  • the data dimension of the program #A can be understood as the shape of the tensor of the program #A.
  • a tensor is a multilinear mapping defined on the Cartesian product of some vector spaces and some dual spaces. Its coordinates are a quantity with n^r components in an n-dimensional space.
  • r is called the rank or order of the tensor (which has nothing to do with the rank and order of the matrix).
  • a data structure such as a tensor can be used to represent all data. That is, in this application, a tensor may correspond to an n-dimensional array or list. A tensor has a static type and dynamic dimensions. Tensors can flow between nodes in a computation graph.
  • the processing node #A may perform a shape analysis on the IR of the program #A, thereby determining the dimensions of the IR (specifically, of the IR tensors) of the program #A, and then estimating the size of the memory space required for the program #A to run.
  • the method and process of estimating the size of the memory space based on the dimensions of the data may be similar to the prior art.
  • detailed descriptions thereof are omitted.
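As one possible sketch of the shape-analysis step above, the required memory (space #A) can be estimated by summing element counts over the tensor shapes found in the IR. The 4-byte element size and the shapes below are assumptions for illustration; the patent leaves the estimation method to existing techniques.

```python
from math import prod

def estimate_memory_bytes(tensor_shapes, element_size=4):
    """Estimate required memory as the sum of (element count x element size)
    over every tensor in the program's IR."""
    return sum(prod(shape) * element_size for shape in tensor_shapes)

# e.g. inputs x, y, z plus two intermediates, each a 256x256 tensor
# of 4-byte floats (hypothetical figures)
shapes = [(256, 256)] * 5
print(estimate_memory_bytes(shapes))  # 1310720
```

The resulting figure would then be compared against each processor's currently available memory (e.g. 90% of its free space) when checking constraint #B.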
  • processor #b is the processor with the lowest default priority.
  • processor #b may be a general-purpose processor
  • processor #a is a dedicated processor; that is, processor #b may also be able to implement the functions, but its parallel computing power is not as good as that of the dedicated processor #a.
  • m = mul (x, y)
  • s = add (m, z)
  • the IR description of program #A is obtained after IR processing. The analysis uses calculations of two instructions, mul (multiplication) and add (addition).
  • the processing node #A may determine the size of the memory space that the program #A needs to occupy (for example, let the size of the memory space be X). And, the processing node #A can determine the size of the currently available memory space of each processor. Let the current memory space size of processor #a be Y. Thereafter, the processing node #A first determines whether the above instructions belong to intrin #a; if the determination is "YES", it is further determined whether X is less than or equal to Y. If the determination is "YES", processor #a is selected as the processor # 1; if the determination is "NO", processor #b is selected as the processor # 1.
  • processing node #A may control the intermediate compiler to send the IR of the program #A to the backend corresponding to the processor # 1. Therefore, the backend corresponding to the processor # 1 can convert the IR of the program #A into code that the processor # 1 can recognize and process.
  • According to the method for selecting a processor, by obtaining hardware information of each processor and program information of a target program in advance, and selecting, from the multiple processors based on the hardware information and the program information, a processor whose hardware information matches the program information, the selected processor can match the target program, and there is no need to manually specify the processor, thereby improving the processing efficiency of the computer equipment and reducing the burden on the programmer.
  • FIG. 6 is a schematic flowchart of an example of a method 300 for selecting a processor according to the present application.
  • the execution body of the method 300 (hereinafter, referred to as processing node #B for ease of understanding and description) may be any processor among multiple processors in a computing device, such as a central processing unit.
  • processing node #B may be a virtual machine running in a computing device.
  • the processing node #B may be the above-mentioned back-end compiler, or may be a device independent of the above-mentioned back-end compiler, which is not particularly limited in this application.
  • the processing node #B can obtain hardware information of each of the two processors included in the computing device 100.
  • the manufacturer of the computing device 100 may pre-configure the hardware information of the processors included in the computing device 100 in the computing device 100 when the computing device 100 leaves the factory, so that the processing node #B may obtain the hardware information of each of the two processors included in the computing device 100 based on the relevant information of this factory configuration.
  • the manufacturer of the computing device 100 may save the hardware information of each processor included in the computing device 100 on a server, so that the processing node #B is connected to the server through the network in advance in S310.
  • alternatively, the hardware information of each of the two processors included in the computing device 100 may be input to the processing node #B.
  • each processor may be installed in a hot-pluggable manner, and a driver of each processor may complete registration of each processor during hot-plugging.
  • in S310, the processing node #B may obtain the hardware information of each of the two processors included in the computing device 100 based on the registration information of each processor or related information in a driver.
  • the computer device 100 may have a processor registration information collection function, so as to be able to identify which heterogeneous hardware is supported in the computer device 100, and, according to the identified hardware, register the backend corresponding to each processor at system startup. Therefore, the processing node can determine the hardware information of each processor according to the registration information of the backend corresponding to each processor.
  • the hardware information of a processor may include the size of the currently available memory space of the processor.
  • the size of the currently available memory space of the processor may be, for example, 90% of the free space (or memory capacity) of the processor.
  • the processing node #B can determine program information of a program (that is, an example of a target program, which is denoted as: program #B) that needs to be currently run.
  • program information may be determined according to the IR of the program #B.
  • the front-end compiler can obtain the source program code of program #B (denoted as: code #B).
  • the compiler may provide, for example, a domain description language interface (DSL interface) for developers to call, so as to write the operator in the DSL (that is, an example of code #B); thereafter, the intermediate compiler may convert the code #B (for example, the DSL) corresponding to the program #B into the IR of the program #B. And, in this application, the intermediate compiler may also optimize the IR of the program #B. Therefore, the processing node #B can determine the program information of the program #B from the IR (for example, the optimized IR) of the program #B.
  • DSL interface domain description language interface
  • the processing node #B may itself serve as the front-end compiler and the intermediate compiler of the code #B. In this case, the processing node #B can directly obtain the IR of the program #B.
  • the front-end compilation and the intermediate compilation of the code #B may be implemented by another processing node. In this case, the processing node #B may communicate with that processing node, so that that processing node may send the IR of the program #B to the processing node #B.
  • the program information of the program #B may include information on the size of a memory space (denoted as: space #B) required for the operation of the program #B.
  • the processing node #B may determine the space #B according to the data dimension of the program #B (or the code of the program #B).
  • the data dimension of the program #B can be understood as the shape (shape) of the tensor of the program #B.
  • a tensor is a multilinear mapping defined on the Cartesian product of some vector spaces and some dual spaces. Its coordinates are a quantity with n^r components in an n-dimensional space.
  • a data structure such as a tensor can be used to represent all data. That is, in this application, a tensor may correspond to an n-dimensional array or list. A tensor has a static type and dynamic dimensions. Tensors can flow between nodes in a computation graph. In this application, the dimensionality of a tensor is described as its order. It should be noted that the order of the tensor (sometimes called its order or degree, or its number of dimensions) is a quantitative description of the tensor's dimensionality.
  • the processing node #B may perform a shape analysis on the IR of the program #B, thereby determining the dimensions of the IR (specifically, of the IR tensors) of the program #B, and then estimating the size of the memory space required for the program #B to run.
  • the method and process of estimating the size of the memory space based on the dimensions of the data may be similar to the prior art.
  • detailed descriptions thereof are omitted.
  • the processing node #B may determine a target processor (denoted as processor # 2) from a plurality of processors based on the program information of the program #B and the hardware information of each processor.
  • the processor # 2 may be a processor in which the currently available memory space of the multiple processors is greater than or equal to the size of the memory space required for the running of the program #B.
  • the processor # 2 may be a processor among the multiple processors that meets the constraint #C.
  • the constraint #C includes: the current available space of the processor is greater than or equal to the size of the memory space required for the operation of the program #B.
  • the processing node #B may determine the priority of each processor in the plurality of processors.
  • the processing node #B may determine the priority of each processor according to the parallel computing capability of each of the multiple processors; that is, in this application, the priority of a processor with high parallel computing capability is higher than that of a processor with low parallel computing capability.
  • Parallel computing is defined relative to serial computing; it refers to executing multiple instructions at one time.
  • Its purpose is to increase computing speed, and to solve large and complex computing problems by expanding the problem-solving scale.
  • the so-called parallel computing can be divided into parallel in time and parallel in space.
  • Temporal parallelism refers to pipeline technology
  • spatial parallelism refers to the use of multiple processors to perform calculations concurrently.
  • the processing node #B may determine the priority of each processor according to the power consumption of each of the multiple processors; that is, in this application, the priority of a processor with high power consumption is lower than the priority of a processor with low power consumption. For example, for processor #a and processor #b, if the power consumption of processor #b is higher than that of processor #a, the processing node #B may consider that the priority of the processor #b is lower than the priority of the processor #a.
  • the processing node #B may determine the priority of each processor according to the types of the multiple processors. For example, in this application, the priority of a special-purpose processor is higher than that of a general-purpose processor, and, optionally, the general-purpose processor may be the processor with the lowest priority among the multiple processors. Therefore, the processing node #B may sequentially determine, for example in descending order of priority, whether each processor satisfies the above constraint condition #C. And, optionally, the processing node #B may determine the first processor satisfying the constraint condition #C as the processor # 2. In addition, the processing node #B may stop determining other processors after determining the processor # 2.
  • processor #b is the processor with the lowest default priority.
  • processor #b may be a general-purpose processor
  • processor #a is a dedicated processor; that is, processor #b may also be able to implement the functions, but its parallel computing power is not as good as that of the dedicated processor #a.
  • m = mul (x, y)
  • s = add (m, z)
  • the IR description of program #B is obtained after IR processing.
  • the analysis uses calculations of two instructions, mul (multiplication) and add (addition).
  • processing node #B can determine the size of the memory space that the program #B needs to occupy (for example, let the size of the memory space be W). And, processing node #B can determine the size of the currently available memory space of each processor. Let the current memory space size of processor #a be Z. Thereafter, the processing node #B judges whether the Z is greater than or equal to W; if it is determined as "YES", the processor #a is selected as the processor # 2. If the determination is "No", the processor #b is selected as the processor # 2.
  • the hardware information obtained by processing node #B in S310 further includes information of an instruction set corresponding to each processor.
  • the hardware information of a processor may include information about the names of instructions that the processor can execute.
  • the hardware information of a processor may include information about the names of functions that the processor can execute.
  • the program information obtained by the processing node #B at S320 further includes instructions (denoted as: instruction #B) included in the code (for example, optimized IR) of the program #B.
  • the instruction #B may include one instruction or multiple instructions, which is not particularly limited in this application.
  • the program information of the program #B may include the name of the instruction in the IR of the program #B.
  • the program information of the program #B may include the names of functions in the IR of the program #B.
  • the processor # 2 may be a processor in which the currently available memory space of the multiple processors is greater than or equal to the memory space required for the operation of the program #B, and the corresponding instruction set includes the instruction #B.
  • the processor # 2 may be a processor among the multiple processors that meets the constraint #C and the constraint #D.
  • the constraint condition #D includes that the instruction set corresponding to the processor includes the instruction #B.
  • processor #b is the processor with the lowest default priority.
  • processor #b may be a general-purpose processor, and processor #a is a dedicated processor; that is, processor #b may also be able to implement the functions, but its parallel computing power is not as good as that of the dedicated processor #a.
  • different processors will have different instruction sets. The instructions used to implement the same function will be different on different chips.
  • the instruction set of processor #a is intrin #a, and the instruction set of processor #b is intrin #b.
  • the IR description of program #B is obtained after IR processing.
  • the analysis uses calculations of two instructions, mul (multiplication) and add (addition).
  • processing node #B can determine the size of the memory space that the program #B needs to occupy (for example, let the size of the memory space be W). And, processing node #B can determine the size of the currently available memory space of each processor. Let the current memory space size of processor #a be Z. Thereafter, the processing node #B first determines whether Z is greater than or equal to W. If the determination is "Yes", it further determines whether all the above instructions belong to intrin #a; if the determination is "Yes", processor #a is selected as the processor # 2; if the determination is "No", processor #b is selected as the processor # 2.
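The ordering in this example of method 300 — constraint #C (memory, Z ≥ W) checked first, and then constraint #D (instruction set) — can be sketched as follows. All processor names, memory figures, and instruction sets are illustrative assumptions, not from the patent.

```python
# Hypothetical processors, highest priority first; free_mem is the
# currently available memory in some unit (illustrative figures).
processors = [
    {"name": "processor_a", "free_mem": 512, "intrin": {"mul", "add"}},
    {"name": "processor_b", "free_mem": 4096,
     "intrin": {"mul", "add", "sub", "div"}},
]

def select(required_mem, instructions, processors):
    """Pick the first processor satisfying constraint #C (memory)
    and then constraint #D (instruction set)."""
    for proc in processors:
        if proc["free_mem"] >= required_mem:        # constraint #C: Z >= W
            if instructions <= proc["intrin"]:      # constraint #D
                return proc["name"]
    return processors[-1]["name"]                   # general-purpose fallback

print(select(256, {"mul", "add"}, processors))   # processor_a
print(select(1024, {"mul", "add"}, processors))  # processor_b: fails #C on #a
```

This differs from the method 200 example only in which constraint is evaluated first; either ordering selects the highest-priority processor that satisfies both.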
  • the processing node #B can control the intermediate compiler to send the IR of the program #B to the backend corresponding to the processor # 2. Therefore, the backend corresponding to the processor # 2 can convert the IR of the program #B into code that the processor # 2 can recognize and process.
  • the method for selecting a processor of the present application can be applied to a compilation technique.
  • a compiling device for example, a front-end compiler
  • a compiling device may use a DSL interface for a developer to call a DSL corresponding to a write operator.
  • a compilation device eg, an intermediate compiler
  • the compilation device eg, an intermediate compiler
  • the compiling device selects the optimal compilation backend based on the backend hardware registration information obtained by the automatic hardware identification device and the analysis result of the IR analysis device.
  • the specific process of this step may be similar to the process described in the above method 200 or method 300.
  • the device for example, the selected back-end compiler
  • According to the method for selecting a processor, by obtaining hardware information of each processor and program information of a target program in advance, and selecting, from the multiple processors based on the hardware information and the program information, a processor whose hardware information matches the program information, the selected processor can match the target program, and the labor time can be reduced.
  • FIG. 8 is a schematic diagram of a logical architecture of a processor selection apparatus 500 according to an embodiment of the present application.
  • the processor selection device may be configured on a computing device including multiple processors, or the processor selection device itself is one of the multiple processors.
  • the apparatus 500 for selecting a processor may include a recognition unit 510, an analysis unit 520, and a selection unit 530.
  • the identification unit 510 may be configured to execute the method in S210 or S310. That is, the identification unit 510 may obtain hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor, and/or the hardware information is used to indicate the size of the available memory space of the processor. The specific processing procedure of the identification unit 510 may be similar to the processing procedure described in the above S210 or S310; to avoid redundant description, its detailed description is omitted here.
  • the analysis unit 520 may be configured to execute the method in S220 or S320, that is, the analysis unit 520 may obtain program information of a target program, where the program information is used to indicate instructions in the target program, and / or, the The program information is used to indicate the memory space that the target program needs to occupy, and the specific processing process of the analysis unit 520 may be similar to the processing process described in the above S220 or S320. To avoid redundant descriptions, detailed descriptions are omitted here.
  • The selection unit 530 may be configured to execute the method in S230 or S330. That is, the selection unit 530 determines, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, where the target processor is a processor among the at least two processors that satisfies a preset condition, and the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program and/or that the processor's available memory space is greater than or equal to the memory space required by the target program. The specific processing of the selection unit 530 may be similar to that described in S230 or S350 above; to avoid redundancy, its detailed description is omitted here.
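For illustration only (not part of the claimed apparatus), the preset condition described above, that the processor's instruction set covers the program's instructions and/or its available memory is at least the program's requirement, can be sketched as follows. The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class HardwareInfo:
    # Functions (instructions) the processor can handle, e.g. {"conv2d", "relu"}
    instruction_set: set
    # Available memory space, in bytes
    available_memory: int

@dataclass
class ProgramInfo:
    # Functions the target program contains
    instructions: set
    # Memory the target program needs to occupy, in bytes
    required_memory: int

def satisfies_preset_condition(hw: HardwareInfo, prog: ProgramInfo) -> bool:
    """True if the processor's instruction set includes every instruction
    in the target program and its available memory is sufficient."""
    return (prog.instructions <= hw.instruction_set
            and hw.available_memory >= prog.required_memory)
```

A processor passing this check is a candidate target processor in the sense used above.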
  • The selection unit 530 can also control the intermediate compiler to send the IR of the target program to the back-end compiler corresponding to the target processor.
  • the actions and functions of the identification unit 510, analysis unit 520, and selection unit 530 may be implemented by the same virtual machine or the same processor. Alternatively, the actions and functions of the identification unit 510, analysis unit 520, and selection unit 530 may be implemented by different multiple virtual machines or multiple processors, respectively.
  • According to the processor selection apparatus provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the multiple processors on that basis. The selected processor thus matches the target program, and no manual specification of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on programmers.
  • FIG. 9 is a schematic diagram of a logical architecture of a compiling device 600 to which the embodiment of the present application is applied.
  • the compilation apparatus 600 may include a front-end compilation unit 610, an intermediate compilation unit 620, a selection unit 630, and a plurality of back-end compilation units 640.
  • the multiple back-end compiling units 640 correspond to multiple processors (or computing units, computing platforms, or processing units).
  • the selection unit 630 may include an identification module 632, an analysis module 634, and a selection module 636.
  • The front-end compilation unit 610 may provide a DSL interface that developers call to write the DSL corresponding to an operator.
  • the actions performed by the front-end compilation unit 610 may be similar to the actions performed by the aforementioned front-end compiler.
  • The intermediate compilation unit 620 is communicatively connected to the front-end compilation unit 610 and is configured to obtain the DSL from the front-end compilation unit 610, generate an intermediate representation (IR) from the DSL, and optimize the IR.
  • The actions performed by the intermediate compilation unit 620 may be similar to those performed by the intermediate compiler described above; to avoid redundancy, the description is omitted here.
  • The identification module 632 may be configured to execute the method in S210 or S310. That is, the identification module 632 may obtain hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor and/or the size of the processor's available memory space. The specific processing of the identification module 632 may be similar to that described in S210 or S310 above; to avoid redundancy, its detailed description is omitted here.
  • the analysis module 634 is communicatively connected to the intermediate compilation unit 620, and is configured to obtain the IR from the intermediate compilation unit 620, and can further be used to execute the method in S220 or S320, that is, the analysis module 634 can obtain the program information of the target program.
  • The program information is used to indicate the instructions in the target program and/or the memory space that the target program needs to occupy. The specific processing of the analysis module 634 may be similar to that described in S220 or S320 above; to avoid redundancy, its detailed description is omitted here.
  • the selection module 636 may be communicatively connected to the identification module 632 and the analysis module 634, and is configured to obtain hardware information from the identification module 632 and program information from the analysis module 634.
  • The selection module 636 may be used to execute the method in S230 or S330. That is, the selection module 636 determines, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, where the target processor is a processor among the at least two processors that satisfies a preset condition, and the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program and/or that the processor's available memory space is greater than or equal to the memory space required by the target program. The specific processing of the selection module 636 may be similar to that described in S230 or S350 above; to avoid redundancy, its detailed description is omitted here.
  • the selection module 636 can also control the intermediate compiler to send the IR of the target program to the back-end compilation unit 640 corresponding to the target processor.
  • the back-end compilation unit 640 can convert the IR into code that can be executed on the corresponding processor.
  • the actions performed by the back-end compilation unit 640 may be similar to the actions performed by the back-end compiler described above. Here, to avoid repetition, the description is omitted.
  • The hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the multiple processors on that basis. The selected processor thus matches the target program, and no manual specification of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on programmers.
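As a rough, purely illustrative sketch of the flow through the compilation apparatus 600 (the interfaces here are hypothetical; real front ends, IR optimizers, and code generators are far more involved), each processor can key one back-end compilation unit, and the selection step routes the IR accordingly:

```python
def compile_for_target(ir, target_processor, backend_units):
    """Route the (optimized) IR to the back-end compilation unit that
    corresponds to the selected target processor (one back end per
    processor, as in the one-to-one correspondence described above)."""
    backend = backend_units[target_processor]
    return backend(ir)

# Hypothetical back ends: each converts the IR into code for its processor.
backend_units = {
    "cpu": lambda ir: f"cpu-code({ir})",
    "gpu": lambda ir: f"gpu-code({ir})",
}
```

Once the selection module has determined the target processor, this dispatch replaces the manual choice a programmer would otherwise make.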
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • When the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • Based on this understanding, the part of the technical solution of this application that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of this application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.


Abstract

The present application provides a method and a device for selecting a processor. The method comprises: acquiring hardware information of each of at least two processors, the hardware information being used for indicating an instruction set corresponding to each processor; acquiring program information of a target program, the program information being used for indicating instructions in the target program; and determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and can be used for executing the target program, wherein the preset condition comprises that the instruction set corresponding to the processor comprises the instructions in the target program. The processing efficiency of a computer device can thereby be improved and the burden on programmers reduced.

Description

Method and device for selecting a processor

Technical field

The present application relates to the field of computers and, more particularly, to a method and apparatus for selecting a processor, and to a computer device.

Background

With the development of computer technology, heterogeneous architectures have emerged: a computer device includes a general-purpose processor (for example, a central processing unit, CPU) and a coprocessor (for example, a graphics processing unit, GPU). A coprocessor can provide more parallel computing capability and increase computing speed. In addition to providing stronger thread-level or data-level parallel computing capability, a coprocessor has a better energy-efficiency ratio than a general-purpose processor. The coprocessor can therefore make up for the limited computing power of the CPU and reduce the overall energy consumption of the system.

Heterogeneous architectures are now commonly used in fields such as neural networks (NN) and machine learning (ML). In these fields, compilation techniques have been proposed to make it easier for programmers to write software for such systems; for example, a compilation technique can be used to generate operator code.

Compilation is the process of converting a program written in one programming language (the source language) into another language (the target language). The source language may be the language in which the user writes the target program, and the target language may be the language used by the processor on which, within the heterogeneous system, the user wishes to run the target program. A compiler can be structured as a front end, an intermediate representation, and a back end. The front end mainly converts the source program into the intermediate representation: the user first describes the computation of an operator in a domain-specific language (DSL), which serves as the front-end source program; after the front end's processing and optimization steps, an intermediate representation (IR) is passed to the back end. The back end's code generator then converts the IR into concrete target code for a specified target processor (for example, a general-purpose processor or a coprocessor).

In this technique, however, the programmer must manually specify, at an early stage (for example, while describing the operator's computation in the DSL), the target processor on which the operator will run. Because of factors such as hardware instruction support, data alignment, operation efficiency, and cooperation with neighboring operators, the target processor specified by the programmer is not necessarily the processor best suited to run the target program, which reduces processing efficiency. Moreover, manually specifying the target processor adds to the programmer's workload.
Summary of the invention

The present application provides a method and device for selecting a processor, which can improve the processing efficiency of a computer device and reduce the burden on programmers.

In a first aspect, a method for selecting a processor is provided: obtaining hardware information of each of at least two processors, the hardware information being used to indicate the instruction set corresponding to each processor; obtaining program information of a target program to be executed, the program information being used to indicate the instructions in the target program; and determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program.

According to the method for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the multiple processors on that basis. The selected processor thus matches the target program, and no manual specification of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on programmers.

Here, the "instruction set corresponding to the processor" can be understood as the functions that the processor is able to process, and the hardware information can indicate those functions (for example, by function name). Likewise, the "instructions in the target program" can be understood as the functions included in the target program, and the program information indicates those functions (for example, by function name).

Optionally, determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority for each of the at least two processors; and, based on the program information and the hardware information, judging in descending order of priority whether each of the at least two processors satisfies the preset condition, and taking the first processor that satisfies the preset condition as the target processor. Setting priorities for the processors allows personalized processing and flexible handling of different scenarios, improves the efficiency of determining the target processor, and shortens the time this determination takes.
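The priority-ordered search described above can be sketched as follows, purely for illustration. `satisfies` stands in for the preset-condition check and is assumed to be supplied by the caller; the names are not part of the claims.

```python
def select_target_processor(processors, priorities, satisfies):
    """Examine processors from highest to lowest priority and return the
    first one that satisfies the preset condition, or None if none does."""
    for proc in sorted(processors, key=lambda p: priorities[p], reverse=True):
        if satisfies(proc):
            return proc
    return None
```

With the CPU given the lowest priority (as suggested below), a coprocessor that satisfies the condition is preferred over the CPU.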
Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.

Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; moreover, because the CPU's power consumption is relatively high, giving the CPU the lowest priority increases the likelihood that a coprocessor is selected as the target processor, further improving the effect and practicality of this application.

Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).

The ASIC may perform its computation by means of software.

Optionally, the hardware information is further used to indicate the size of the processor's available memory space, the program information is further used to indicate the memory space that the target program needs to occupy, and the preset condition further includes that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy. The available space of a processor may refer to a specified proportion of the processor's total memory space; for example, the proportion may be 90%. Alternatively, it may refer to a specified proportion of the processor's total free memory space.
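The "specified proportion" reading of available memory can be illustrated with a small helper; the 90% figure is the example given above, and the function name is hypothetical.

```python
def available_memory(total_memory_bytes, percent=90):
    """Available space as a specified proportion (here a percentage) of the
    processor's total (or total free) memory space; 90% is the example
    proportion mentioned in the text."""
    return total_memory_bytes * percent // 100
```

Keeping a margin below the full capacity leaves room for other workloads already resident on the processor.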
In addition, "the preset condition further includes that the processor's available memory space is greater than or equal to the memory space required by the target program" may mean that the preset condition is satisfied as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.

Alternatively, it may mean that the processor's available memory space must be greater than or equal to the memory space required by all programs to be executed on the target processor, including the target program, for the preset condition to be satisfied.

Making the preset condition further include that the processor's available memory space is greater than or equal to the memory space required by the target program ensures that the selected target processor can support the running of the target program, further improving the practicality of this application.

Optionally, obtaining the program information of the target program includes: determining the memory space that the target program needs to occupy according to the data dimensions of the target program. The required memory space can thus be determined easily.
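Deriving the required memory from the program's data dimensions might look like the following sketch, assuming tensor-like data with a known per-element size; the 4-byte default and the function name are illustrative assumptions, not part of the disclosure.

```python
from math import prod

def required_memory(shapes, element_size_bytes=4):
    """Estimate the memory a program needs from the dimensions (shapes)
    of its data: sum over tensors of (product of dims) * element size.
    element_size_bytes=4 assumes 32-bit elements (an illustrative choice)."""
    return sum(prod(shape) * element_size_bytes for shape in shapes)
```

For example, a program operating on a 2x3 tensor and a length-4 vector of 32-bit values would need 40 bytes under this estimate.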
Optionally, obtaining the program information of the target program includes: determining the program information according to the intermediate representation (IR) of the target program, where the IR of the target program is determined according to the domain-specific language (DSL) code of the target program.

The DSL code may be determined by a front-end compiler in the computer device, and the IR by an intermediate compiler in the computer device. The program information can thus be obtained easily.

Optionally, obtaining the hardware information of each of the at least two processors includes: obtaining the hardware information of each processor according to the registration information of each processor, the registration information being used for registering the processor with the computing device. The registration information may include hardware description information, which may be obtained offline by the computer device before the processor is installed, or obtained from the processor's driver information when the processor is installed.
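The registration mechanism can be pictured as a simple registry that the identification step later consults. The structure below is illustrative only; real registration information (for example, from driver metadata) would carry far richer hardware descriptions.

```python
# Hypothetical in-memory registry of processor hardware descriptions.
registry = {}

def register_processor(name, instruction_set, available_memory):
    """Record the hardware description carried in a processor's
    registration (or driver) information."""
    registry[name] = {
        "instruction_set": set(instruction_set),
        "available_memory": available_memory,
    }

def hardware_info(name):
    """Look up a registered processor's hardware information."""
    return registry[name]
```

The identification step then amounts to reading this registry rather than probing hardware at selection time.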
Optionally, the computer device includes at least two back-end compilers in one-to-one correspondence with the at least two processors, each back-end compiler being used to convert an IR into code that the corresponding processor can recognize.

In this case, the method further includes: inputting the IR of the target program to the target back-end compiler corresponding to the target processor. The IR of the target program may be an IR that has undergone IR optimization.
In a second aspect, a method for selecting a processor is provided. The method includes: obtaining hardware information of each of at least two processors, the hardware information being used to indicate the size of the processor's available memory space; obtaining program information of a target program, the program information being used to indicate the memory space that the target program needs to occupy; and determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy.

According to the method for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and a processor whose hardware information matches the program information is selected from the multiple processors on that basis. The selected processor thus matches the target program, and no manual specification of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on programmers. The available space of a processor may refer to a specified proportion of the processor's total memory space; for example, the proportion may be 90%. Alternatively, it may refer to a specified proportion of the processor's total free memory space.

Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).

In addition, "the preset condition further includes that the processor's available memory space is greater than or equal to the memory space required by the target program" may mean that the preset condition is satisfied as long as the processor's available memory space is greater than or equal to the memory space required by the target program alone.

Alternatively, it may mean that the processor's available memory space must be greater than or equal to the memory space required by all programs to be executed on the target processor, including the target program, for the preset condition to be satisfied.

Optionally, obtaining the program information of the target program includes: determining the memory space that the target program needs to occupy according to the data dimensions of the target program. The required memory space can thus be determined easily.

Optionally, the hardware information is further used to indicate the instruction set corresponding to the processor, the program information is further used to indicate the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instructions in the target program. Here, the "instruction set corresponding to the processor" can be understood as the functions that the processor is able to process, and the hardware information can indicate those functions (for example, by function name). Likewise, the "instructions in the target program" can be understood as the functions included in the target program, and the program information indicates those functions (for example, by function name).

Optionally, determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies the preset condition and can be used to execute the target program includes: determining a priority for each of the at least two processors; and, based on the program information and the hardware information, judging in descending order of priority whether each of the at least two processors satisfies the preset condition, and taking the first processor that satisfies the preset condition as the target processor.

Setting priorities for the processors allows personalized processing and flexible handling of different scenarios, improves the efficiency of determining the target processor, and shortens the time this determination takes.

Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.

Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; moreover, because the CPU's power consumption is relatively high, giving the CPU the lowest priority increases the likelihood that a coprocessor is selected as the target processor, further improving the effect and practicality of this application.

Optionally, obtaining the program information of the target program includes: determining the program information according to the intermediate representation (IR) of the target program, where the IR of the target program is determined according to the domain-specific language (DSL) code of the target program. The DSL code may be determined by a front-end compiler in the computer device, and the IR by an intermediate compiler in the computer device. The program information can thus be obtained easily.

Optionally, obtaining the hardware information of each of the at least two processors includes: obtaining the hardware information of each processor according to the registration information of each processor, the registration information being used for registering the processor with the computing device. The registration information may include hardware description information, which may be obtained offline by the computer device before the processor is installed, or obtained from the processor's driver information when the processor is installed.

Optionally, the computer device includes at least two back-end compilers in one-to-one correspondence with the at least two processors, each back-end compiler being used to convert an IR into code that the corresponding processor can recognize.

In this case, the method further includes: inputting the IR of the target program to the target back-end compiler corresponding to the target processor. The IR of the target program may be an IR that has undergone IR optimization.
第三方面,提供了一种选择处理器的装置,其特征在于,该装置包括:识别模块,用于获取至少两种处理器中每种处理器的硬件信息,该硬件信息用于指示处理器对应的指令集;分析模块,用于获取待执行的目标程序的程序信息,该程序信息用于指示该目标程序中的指令;选择模块,用于根据该程序信息和该硬件信息,从该至少两种处理器中确定满足预设条件且能够用于执行该目标程序的目标处理器,该预设条件包括处理器对应的指令集包括该目标程序中的指令。According to a third aspect, an apparatus for selecting a processor is provided, which is characterized in that the apparatus includes: an identification module for acquiring hardware information of each of the at least two processors, and the hardware information is used to instruct the processor Corresponding instruction set; analysis module, for obtaining program information of the target program to be executed, the program information is used to indicate instructions in the target program; selection module, for obtaining information from the at least the program information and the hardware information Among the two types of processors, a target processor that satisfies a preset condition and can be used to execute the target program is determined, and the preset condition includes an instruction set corresponding to the processor including an instruction in the target program.
According to the apparatus for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and based on this information a processor whose hardware information matches the program information is selected from the multiple processors. The selected processor therefore matches the target program, and no manual designation of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on the programmer.
Herein, the "instruction set corresponding to the processor" may be understood as the functions that the processor can process, and the hardware information may indicate the functions (for example, function names) that the processor can process. Likewise, the "instructions in the target program" may be understood as the functions included in the target program, and the program information indicates the functions (for example, function names) included in the target program.
Optionally, the selection module is configured to determine a priority of each of the at least two processors, determine in descending order of priority, based on the program information and the hardware information, whether each processor satisfies the preset condition, and use the first processor that satisfies the preset condition as the target processor. By setting priorities for the processors, personalized processing can be achieved and different processing scenarios can be handled flexibly; moreover, the efficiency of determining the target processor can be improved and the time required to determine it can be shortened.
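The priority-ordered scan described above can be sketched as follows. Only the "scan in descending priority and take the first match" logic comes from the text; the processor records and the condition callback are illustrative stand-ins.

```python
# Hedged sketch of priority-ordered target-processor selection.

def select_target_processor(processors, satisfies_condition):
    """Return the first processor, in descending priority order, that
    satisfies the preset condition, or None if none does."""
    for proc in sorted(processors, key=lambda p: p["priority"], reverse=True):
        if satisfies_condition(proc):
            return proc
    return None

processors = [
    {"name": "CPU", "priority": 0, "ok": True},   # lowest-priority fallback
    {"name": "GPU", "priority": 2, "ok": False},
    {"name": "NPU", "priority": 3, "ok": True},
]
target = select_target_processor(processors, lambda p: p["ok"])
print(target["name"])  # NPU
```

Because the scan stops at the first match, high-priority coprocessors are tried before the CPU, consistent with giving the CPU the lowest priority.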
Optionally, the priority of each processor is determined according to at least one of the parallel computing capability or the power consumption of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; moreover, because the power consumption of the CPU is relatively high, setting the priority of the CPU to the lowest increases the probability that a coprocessor is selected as the target processor, which further improves the effect and practicability of this application.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
The ASIC may perform computation by means of software.
Optionally, the hardware information further indicates the size of the available memory space of the processor, the program information further indicates the memory space that the target program needs to occupy, and the preset condition further includes that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy. The available memory space of the processor may refer to a specified proportion of the total memory space of the processor; for example, the specified proportion may be 90%. Alternatively, the available memory space may refer to a specified proportion of the total free memory space of the processor. In addition, this part of the preset condition may mean that the condition is satisfied as long as the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy; alternatively, it may mean that the available memory space of the processor needs to be greater than or equal to the memory space occupied by all programs executed by the target processor, including the target program. Making the preset condition further include this memory requirement ensures that the selected target processor can support the running of the target program, which further improves the practicability of this application.
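Both readings of the memory condition above can be sketched in a few lines. The 90% proportion is the example given in the text; the concrete numbers below are illustrative.

```python
# Sketch of the memory variant of the preset condition, in both
# interpretations: counting only the target program, or counting every
# program already assigned to the processor.

def available_memory(total_memory, proportion=0.9):
    """The 'available' space: a specified proportion of total memory."""
    return proportion * total_memory

def condition_program_only(total_memory, program_memory):
    return available_memory(total_memory) >= program_memory

def condition_all_programs(total_memory, program_memory, scheduled_memory):
    return available_memory(total_memory) >= program_memory + scheduled_memory

print(condition_program_only(1000, 850))       # True  (850 fits in 900)
print(condition_all_programs(1000, 850, 100))  # False (950 exceeds 900)
```

The stricter all-programs reading prevents overcommitting a processor that already hosts other workloads.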
Optionally, the analysis unit is configured to determine, according to the data dimensions of the target program, the memory space that the target program needs to occupy. In this way, the memory space that the target program needs to occupy can be determined easily.
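A minimal sketch of estimating required memory from data dimensions: sum, over the program's tensors, the product of each tensor's dimensions times an assumed element size. The shapes and the 4-byte (float32) element size are assumptions for illustration.

```python
# Estimate a program's memory footprint from the dimensions of its data.
from math import prod

def required_memory_bytes(tensor_shapes, element_size=4):
    """Sum of (product of dimensions) * bytes per element over all tensors."""
    return sum(prod(shape) * element_size for shape in tensor_shapes)

# e.g. two float32 matrices of shape 1024 x 1024
print(required_memory_bytes([(1024, 1024), (1024, 1024)]))  # 8388608
```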
Optionally, the analysis module is configured to determine the program information according to an intermediate representation (IR) of the target program, where the IR is determined according to domain-specific language (DSL) code of the target program. In this way, the program information can be obtained easily.
Optionally, the identification unit is configured to obtain the hardware information of each processor according to registration information of the processor, where the registration information is used to register the processor in the computing device. The registration information may include hardware description information. The hardware description information may be obtained offline by the computer device before the processor is installed, or may be obtained by the computer device from driver information of the processor when the processor is installed.
Optionally, the computer device includes at least two back-end compilers in one-to-one correspondence with the at least two processors, and each back-end compiler is configured to convert an IR into code that the corresponding processor can recognize.
In this case, the selection unit is configured to input the IR of the target program into a target back-end compiler corresponding to the target processor. The IR of the target program may be an IR that has undergone IR optimization.
According to a fourth aspect, an apparatus for selecting a processor is provided. The apparatus includes: an identification unit, configured to obtain hardware information of each of at least two processors, where the hardware information indicates the size of the available memory space of the processor; an analysis unit, configured to obtain program information of a target program, where the program information indicates the memory space that the target program needs to occupy; and a selection unit, configured to determine, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and can be used to execute the target program, where the preset condition includes that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy.
According to the apparatus for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and based on this information a processor whose hardware information matches the program information is selected from the multiple processors. The selected processor therefore matches the target program, and no manual designation of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on the programmer. The available memory space of the processor may refer to a specified proportion of the total memory space of the processor; for example, the specified proportion may be 90%. Alternatively, the available memory space may refer to a specified proportion of the total free memory space of the processor. In addition, the preset condition may be satisfied as long as the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy; alternatively, it may require that the available memory space of the processor be greater than or equal to the memory space occupied by all programs executed by the target processor, including the target program.
Optionally, the analysis unit is configured to determine, according to the data dimensions of the target program, the memory space that the target program needs to occupy. In this way, the memory space that the target program needs to occupy can be determined easily.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Optionally, the hardware information further indicates an instruction set corresponding to the processor, the program information further indicates the instructions in the target program, and the preset condition further includes that the instruction set corresponding to the processor includes the instructions in the target program. Herein, the "instruction set corresponding to the processor" may be understood as the functions that the processor can process, and the hardware information may indicate the functions (for example, function names) that the processor can process. Likewise, the "instructions in the target program" may be understood as the functions included in the target program, and the program information indicates the functions (for example, function names) included in the target program.
Optionally, the selection module is configured to determine a priority of each of the at least two processors, determine in descending order of priority, based on the program information and the hardware information, whether each processor satisfies the preset condition, and use the first processor that satisfies the preset condition as the target processor.
Optionally, the priority of each processor is determined according to the parallel computing capability of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors.
This ensures that at least one of the processors can process the target program; moreover, because the power consumption of the CPU is relatively high, setting the priority of the CPU to the lowest increases the probability that a coprocessor is selected as the target processor, which further improves the effect and practicability of this application.
Optionally, the obtaining of the program information of the target program includes: obtaining the domain-specific language (DSL) code of the target program; determining an intermediate representation (IR) according to the DSL code; and determining the program information according to the IR. In this way, the program information can be obtained easily.
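The last step of that pipeline, determining program information from the IR, can be sketched as follows. The `dst = op(args)` IR syntax here is invented for the sketch; an IR actually produced from DSL code would differ.

```python
# Illustrative extraction of program information (used function names)
# from a toy three-address IR.
import re

def program_info_from_ir(ir_text):
    """Collect the instruction (function) names used in a toy IR."""
    return {"used_functions": set(re.findall(r"=\s*(\w+)\(", ir_text))}

ir = """
t0 = conv2d(x, w)
t1 = relu(t0)
y  = matmul(t1, v)
"""
info = program_info_from_ir(ir)
print(sorted(info["used_functions"]))  # ['conv2d', 'matmul', 'relu']
```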
Optionally, the identification unit is configured to obtain the hardware information of each processor according to registration information of the processor, where the registration information is used to register the processor in the computing device. The registration information may include hardware description information. The hardware description information may be obtained offline by the computer device before the processor is installed, or may be obtained by the computer device from driver information of the processor when the processor is installed.
Optionally, the computer device includes at least two back-end compilers in one-to-one correspondence with the at least two processors, and each back-end compiler is configured to convert an IR into code that the corresponding processor can recognize.
In this case, the selection unit is configured to control inputting of the IR of the target program into a target back-end compiler corresponding to the target processor. The IR of the target program may be an IR that has undergone IR optimization.
According to a fifth aspect, a compiling apparatus is provided. The compiling apparatus is configured in a computer device including at least two processors, and includes: multiple back-end compilation units in one-to-one correspondence with the multiple processors, each configured to convert a received IR into code that the corresponding processor can recognize; a front-end compilation unit, configured to obtain the DSL corresponding to a target program; an intermediate compilation unit, configured to determine an IR according to the DSL; and a selection unit, configured to determine program information of the target program according to the IR, obtain hardware information of each of the at least two processors, determine, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, and send the IR to the back-end compilation unit corresponding to the target processor.
The program information indicates the instructions in the target program, the hardware information indicates an instruction set corresponding to the processor, and the target processor is a processor that satisfies a preset condition among the at least two processors, where the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program; and/or the hardware information indicates the size of the available memory space of the processor, the program information indicates the memory space that the target program needs to occupy, and the preset condition includes that the available memory space of the processor is greater than or equal to the memory space that the target program needs to occupy.
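The end-to-end flow of the fifth aspect (DSL, then IR, then processor selection, then dispatch to the matching back end) can be sketched as a small driver. The DSL, IR, and back-end "code generators" below are toy stand-ins; only the control flow mirrors the text, and all processor names and supported-function sets are assumptions.

```python
# Minimal self-contained sketch of the compiling apparatus' control flow.

def front_end(dsl_source):
    # "front-end compilation unit": obtain the DSL of the target program
    return dsl_source.strip().splitlines()

def intermediate(tokens):
    # "intermediate compilation unit": determine the IR from the DSL
    return [("op", t) for t in tokens]

def make_backend(processor_name):
    # one back-end compilation unit per processor, converting IR into
    # code the corresponding processor can recognize
    def backend(ir):
        return [f"{processor_name}:{op_name}" for _, op_name in ir]
    return backend

def selection_unit(ir, processors, supported):
    # determine program info from the IR, then pick the first processor
    # (assumed ordered by priority) whose instruction set covers it
    used = {op_name for _, op_name in ir}
    for proc in processors:
        if used <= supported[proc]:
            return proc
    raise RuntimeError("no processor satisfies the preset condition")

backends = {p: make_backend(p) for p in ("NPU", "GPU", "CPU")}
supported = {"NPU": {"matmul"}, "GPU": {"matmul", "relu"}, "CPU": {"matmul", "relu", "print"}}

ir = intermediate(front_end("matmul\nrelu"))
target = selection_unit(ir, ["NPU", "GPU", "CPU"], supported)
print(target, backends[target](ir))  # GPU ['GPU:matmul', 'GPU:relu']
```

Note that the selection unit never touches the back ends directly; it only routes the IR, which is why adding a new processor only requires registering one more back-end unit and its hardware information.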
According to the compiling apparatus provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and based on this information a processor whose hardware information matches the program information is selected from the multiple processors. The selected processor therefore matches the target program, and no manual designation of the processor is needed, which improves the processing efficiency of the computer device and reduces the burden on the programmer.
Optionally, the selection unit is configured to determine a priority of each of the at least two processors, determine in descending order of priority, based on the program information and the hardware information, whether each processor satisfies the preset condition, and use the first processor that satisfies the preset condition as the target processor. By setting priorities for the processors, personalized processing can be achieved and different processing scenarios can be handled flexibly; moreover, the efficiency of determining the target processor can be improved and the time required to determine it can be shortened.
Optionally, the priority of each processor is determined according to the parallel computing capability of each of the at least two processors.
Optionally, the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors. This ensures that at least one of the processors can process the target program; moreover, because the power consumption of the CPU is relatively high, setting the priority of the CPU to the lowest increases the probability that a coprocessor is selected as the target processor, which further improves the effect and practicability of this application.
Optionally, the at least two processors include at least two of the following: a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
Optionally, the identification unit is configured to obtain the hardware information of each processor according to registration information of the processor, where the registration information is used to register the processor in the computing device. The registration information may include hardware description information. The hardware description information may be obtained offline by the computer device before the processor is installed, or may be obtained by the computer device from driver information of the processor when the processor is installed.
According to a sixth aspect, a computer device is provided, including multiple processors, a compiler, and a selection apparatus, where the selection apparatus performs the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof. For example, the compiler includes a front-end compiler, an intermediate compiler, and a back-end compiler.
According to a seventh aspect, a chip or chipset is provided, including at least one processor and at least one memory control unit, where the processor performs the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof. The chip or chipset may include a smart chip, and the smart chip may include at least two processors.
According to an eighth aspect, a computer system is provided, including a processor and a memory, where the processor includes at least two processors and a memory control unit, and the processor performs the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof.
Optionally, the computer system further includes a system bus, and the system bus is configured to connect the processor (specifically, the memory control unit) and the memory.
According to a ninth aspect, a computer program product is provided. The computer program product includes a computer program (which may also be referred to as code or instructions); when the computer program is run by a processor, or by a processor in a chip, the processor is enabled to perform the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof.
According to a tenth aspect, a computer-readable medium is provided. The computer-readable medium stores a computer program (which may also be referred to as code or instructions); when the computer program is run on a processor, or on a processor in a chip, the processor is enabled to perform the method in the first aspect or any possible implementation thereof, or the method in the second aspect or any possible implementation thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of a hardware structure of a computer device (or computer system) to which the methods and apparatuses of the embodiments of this application are applicable.
FIG. 2 is a schematic diagram of an example of a lexical analysis process of this application.
FIG. 3 is a schematic diagram of an example of a syntax analysis process of this application.
FIG. 4 is a schematic diagram of an example of an intermediate code generation and optimization process of this application.
FIG. 5 is a schematic flowchart of an example of a method for selecting a processor of this application.
FIG. 6 is a schematic flowchart of another example of a method for selecting a processor of this application.
FIG. 7 is a schematic diagram of an example of a compilation method of this application.
FIG. 8 is a schematic structural diagram of an example of an apparatus for selecting a processor of this application.
FIG. 9 is a schematic structural diagram of an example of a compiling apparatus of this application.
DETAILED DESCRIPTION
The technical solutions in this application are described below with reference to the accompanying drawings. First, a computing device 100 that performs the methods of the embodiments of this application is described in detail with reference to FIG. 1.
A computing device may also be referred to as a computer system. From a logical layering perspective, a computing device may include a hardware layer, an operating system layer running on the hardware layer, and an application layer running on the operating system layer. The hardware layer includes hardware such as a processor, memory, and a memory control unit; the functions and structure of this hardware are described in detail later. The operating system may be any one or more computer operating systems that implement service processing through processes, for example, a Linux operating system, a Unix operating system, an Android operating system, an iOS operating system, or a Windows operating system. The application layer includes applications such as a browser, an address book, word processing software, and instant messaging software. In the embodiments of this application, the computer system may be a handheld device such as a smartphone, or a terminal device such as a personal computer; this is not particularly limited in this application, provided that the computer system can read and run the program code of the methods of the embodiments. The execution body of the methods in the embodiments of this application may be a computer system, or may be a functional module in the computer system, such as a processor, that can invoke and execute a program.
In this application, a program or program code refers to an ordered set of instructions (or code) used to implement a relatively independent function. A process is one run of a program and its data on a computer device. A program usually adopts a modular design, that is, the functionality of the program is decomposed into multiple smaller functional modules. A program contains at least one function, and a function is a code segment that implements a functional module. A function is therefore the basic unit of modular program functionality and may also be regarded as a subroutine.
FIG. 1 is a schematic structural diagram of a computing device 100 according to an embodiment of this application. The computing device shown in FIG. 1 is configured to perform the methods of the embodiments of this application. The computing device 100 may include at least two processors 110 and a memory 120.
Optionally, the computing device 100 may further include a system bus, and the processor 110 and the memory 120 are each connected to the system bus. The processor 110 can access the memory 120 through the system bus; for example, the processor 110 can read and write data or execute code in the memory 120 through the system bus. The function of the processor 110 is mainly to interpret the instructions (or code) of a computer program and to process the data in computer software. The instructions of the computer program and the data in the computer software may be stored in the memory 120 or in a cache unit 116. In this embodiment of this application, the processor 110 may be an integrated circuit chip, or a component thereof, with signal processing capability.
In this application, the processor 110 may fetch an instruction from the memory or a cache, place it in an instruction register, and decode the instruction. The processor decomposes the instruction into a series of micro-operations and then issues various control commands to execute the series of micro-operations, thereby completing the execution of one instruction. An instruction is a basic command by which the computer specifies the type of operation to perform and its operands. An instruction consists of one or more bytes, including an opcode field, one or more fields related to operand addresses, and some status words and feature codes that characterize the state of the machine. Some instructions also directly contain the operand itself.
By way of example and not limitation, in this application the processor 110 may include a memory control unit 114 and at least one processing unit 112.
处理单元112也可以称为核心(core)或内核，是处理器最重要的组成部分。处理单元112可以是由单晶硅以一定的生产工艺制造出来的，处理器110的计算、接受命令、存储命令、处理数据都由核心执行。处理单元112可以分别独立地运行程序指令，利用并行计算的能力加快程序的运行速度。各种处理器110都具有固定的逻辑结构，例如，处理器110包括一级缓存、二级缓存、执行单元、指令级单元和总线接口等逻辑单元。The processing unit 112, which may also be called a core (or kernel), is the most important component of the processor. The processing unit 112 may be manufactured from monocrystalline silicon using a particular production process; the processor 110's computation, command reception, command storage, and data processing are all performed by the cores. The processing units 112 can each run program instructions independently, using the capability of parallel computing to speed up program execution. Every kind of processor 110 has a fixed logical structure; for example, the processor 110 includes logical units such as a level-1 cache, a level-2 cache, an execution unit, an instruction-level unit, and a bus interface.
内存控制单元114用于控制内存120与处理单元112之间的数据交互。具体地说,内存控制单元114可以从处理单元112接收内存访问请求,并基于该内存访问请求控制针对内存的访问。作为示例而非限定,在本申请实施例中,内存控制单元可以是内存管理单元(memory management unit,MMU)等器件。The memory control unit 114 is configured to control data interaction between the memory 120 and the processing unit 112. Specifically, the memory control unit 114 may receive a memory access request from the processing unit 112 and control access to the memory based on the memory access request. By way of example and not limitation, in the embodiment of the present application, the memory control unit may be a device such as a memory management unit (MMU).
在本申请实施例中,各内存控制单元114可以通过系统总线进行内存120的寻址。并且在系统总线中可以配置仲裁器(未图示),该仲裁器可以负责处理和协调多个处理单元112的竞争访问。In the embodiment of the present application, each memory control unit 114 may address the memory 120 through a system bus. In addition, an arbiter (not shown) may be configured in the system bus, and the arbiter may be responsible for processing and coordinating competing accesses of the plurality of processing units 112.
在本申请实施例中,处理单元112和内存控制单元114可以通过芯片内部的连接线,例如地址线,通信连接,从而实现处理单元112和内存控制单元114之间的通信。In the embodiment of the present application, the processing unit 112 and the memory control unit 114 may be connected through a connection line inside the chip, such as an address line, to implement communication between the processing unit 112 and the memory control unit 114.
可选地，每个处理器110还可以包括缓存单元116，其中，缓存单元116是数据交换的缓冲区(称作cache)。当处理单元112要读取数据时，会首先从缓存单元116中查找需要的数据，如果找到了则直接执行，找不到的话再从内存120中找。由于缓存单元116的运行速度比内存120快得多，故缓存单元116的作用就是帮助处理单元112更快地运行。Optionally, each processor 110 may further include a cache unit 116, which is a buffer for data exchange (called a cache). When the processing unit 112 needs to read data, it first looks for the required data in the cache unit 116; if the data is found, it is used directly, and if not, the data is fetched from the memory 120. Because the cache unit 116 runs much faster than the memory 120, the role of the cache unit 116 is to help the processing unit 112 run faster.
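The cache-first lookup order just described can be modeled in a few lines. This is a minimal sketch, not the cache unit 116 itself: the dictionaries standing in for the cache and main memory, and the address-to-value mapping, are invented for illustration.

```python
# Illustrative model of the cache-first lookup: the processing unit checks
# the cache before falling back to main memory. Contents are made up.

main_memory = {addr: addr * 10 for addr in range(1024)}  # slow backing store (simulated)
cache = {}                                               # fast cache unit (simulated)

def read(addr):
    if addr in cache:               # cache hit: use the cached copy directly
        return cache[addr]
    value = main_memory[addr]       # cache miss: fetch from memory
    cache[addr] = value             # fill the cache for subsequent reads
    return value

read(42)                 # miss: loaded from memory and cached
assert 42 in cache       # now resident in the cache
read(42)                 # hit: served from the cache without touching memory
```

The speed benefit comes from the hit path avoiding the slower memory access entirely.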
内存(memory)120可以为计算设备100中的进程提供运行空间，例如，内存120中可以保存用于生成进程的计算机程序(具体地说，是程序的代码)，并且，内存120中可以保存进程运行期间产生的数据，例如，中间数据，或过程数据。内存也可以称为内存储器，其作用是用于暂时存放处理器110中的运算数据，以及与硬盘等外部存储器交换的数据。只要计算机在运行中，处理器110就会把需要运算的数据调到内存120中进行运算，当运算完成后处理单元112再将结果传送出来。The memory 120 may provide running space for the processes in the computing device 100. For example, the memory 120 may store the computer program (specifically, the program's code) used to create a process, and may also store the data generated while the process runs, such as intermediate data or process data. Memory may also be called internal memory; its role is to temporarily hold the operational data of the processor 110 and the data exchanged with external storage such as a hard disk. Whenever the computer is running, the processor 110 transfers the data to be operated on into the memory 120 for computation, and after the computation is completed the processing unit 112 transmits the result out.
作为示例而非限定，在本申请实施例中，内存120可以是易失性存储器或非易失性存储器，或可包括易失性和非易失性存储器两者。其中，非易失性存储器可以是只读存储器(read-only memory，ROM)、可编程只读存储器(programmable ROM，PROM)、可擦除可编程只读存储器(erasable PROM，EPROM)、电可擦除可编程只读存储器(electrically EPROM，EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory，RAM)，其用作外部高速缓存。通过示例性但不是限制性说明，许多形式的RAM可用，例如静态随机存取存储器(static RAM，SRAM)、动态随机存取存储器(dynamic RAM，DRAM)、同步动态随机存取存储器(synchronous DRAM，SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM，DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM，ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM，SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM，DR RAM)。应注意，本文描述的系统和方法的内存120旨在包括但不限于这些和任意其它适合类型的存储器。By way of example and not limitation, in the embodiments of the present application, the memory 120 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of illustrative but non-limiting description, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory 120 of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
应理解，以上列举的计算设备100的结构仅为示例性说明，本申请并未限定于此，本申请实施例的计算设备100可以包括现有技术中计算机系统中的各种硬件，例如，该计算设备100还可以包括除内存120以外的其他存储器，例如，磁盘存储器等。It should be understood that the structure of the computing device 100 listed above is only an exemplary description, and the present application is not limited thereto. The computing device 100 in the embodiments of the present application may include various hardware found in prior-art computer systems; for example, the computing device 100 may further include storage other than the memory 120, such as disk storage.
在本申请实施例中，可以在计算设备100上应用虚拟化技术。通过虚拟化技术，计算设备100中可以同时运行多个虚拟机，每个虚拟机上可以运行至少一个操作系统，每一个操作系统都运行多个程序。虚拟机(virtual machine)是指通过软件模拟的具有完整硬件系统功能的、运行在一个完全隔离环境中的完整计算机系统。In the embodiments of the present application, virtualization technology may be applied on the computing device 100. Through virtualization, the computing device 100 can run multiple virtual machines at the same time; each virtual machine can run at least one operating system, and each operating system runs multiple programs. A virtual machine is a complete computer system that is simulated by software, has complete hardware system functions, and runs in a fully isolated environment.
在本申请中，处理器110可以包括多个种类。例如，不同种类的处理器可以使用不同种类的指令。再例如，不同种类的处理器可以具有不同的计算能力。再例如，不同种类的处理器可以用于处理不同类型的计算。作为示例而非限定，在本申请中，该多种处理器可以包括通用处理器和协处理器。下面分别对上述各种处理器进行详细说明。In this application, the processors 110 may be of multiple kinds. For example, different kinds of processors may use different kinds of instructions. As another example, different kinds of processors may have different computing capabilities. As yet another example, different kinds of processors may be used to handle different types of computation. By way of example and not limitation, in this application the multiple kinds of processors may include general-purpose processors and coprocessors. The various processors above are each described in detail below.
A.通用处理器A. General Purpose Processor
通用处理器也可以称为中央处理器(central processing unit，CPU)，是一块超大规模的集成电路或其中的部件，是一台计算机的运算核心(Core)和控制核心(Control Unit)。它的功能主要是解释计算机指令以及处理计算机软件中的数据。中央处理器主要包括运算器(算术逻辑运算单元，ALU，Arithmetic Logic Unit)和高速缓冲存储器(Cache)及实现它们之间联系的数据(Data)、控制及状态的总线(Bus)。它与内部存储器(Memory)和输入/输出(I/O)设备合称为电子计算机三大核心部件。例如，CPU包括运算逻辑部件、寄存器部件和控制部件等。A general-purpose processor may also be called a central processing unit (CPU). It is a very-large-scale integrated circuit, or a component of one, and is the computing core and control unit of a computer. Its main function is to interpret computer instructions and process the data in computer software. A central processing unit mainly includes an arithmetic logic unit (ALU), a cache, and the data, control, and status buses that connect them. Together with the internal memory and the input/output (I/O) devices, it is counted among the three core components of an electronic computer. For example, a CPU includes arithmetic-logic components, register components, and a control component.
逻辑部件(logic components)是运算逻辑部件。可以执行定点或浮点算术运算操作、移位操作以及逻辑操作，也可执行地址运算和转换。The logic components are the arithmetic-logic components. They can perform fixed-point or floating-point arithmetic operations, shift operations, and logical operations, and can also perform address calculation and translation.
寄存器包括通用寄存器、专用寄存器和控制寄存器。通用寄存器又可分定点数和浮点数两类,它们用来保存指令执行过程中临时存放的寄存器操作数和中间(或最终)的操作结果。Registers include general purpose registers, special purpose registers, and control registers. General-purpose registers can be divided into fixed-point and floating-point numbers. They are used to store register operands temporarily stored during instruction execution and intermediate (or final) operation results.
控制部件(control unit)主要是负责对指令译码，并且发出为完成每条指令所要执行的各个操作的控制信号。其结构有两种：一种是以微存储为核心的微程序控制方式；一种是以逻辑硬布线结构为主的控制方式。微存储中保持微码，每一个微码对应于一个最基本的微操作，又称微指令；各条指令是由不同序列的微码组成，这种微码序列构成微程序。中央处理器在对指令译码以后，即发出一定时序的控制信号，按给定序列的顺序以微周期为节拍执行由这些微码确定的若干个微操作，即可完成某条指令的执行。简单指令是由3～5个微操作组成，复杂指令则要由几十个微操作甚至几百个微操作组成。The control unit is mainly responsible for decoding instructions and issuing the control signals for the operations to be performed to complete each instruction. It has two kinds of structure: a microprogram control scheme centered on micro-storage, and a control scheme based mainly on hard-wired logic. The micro-storage holds microcode, where each microcode word corresponds to one elementary micro-operation, also called a microinstruction; each instruction is composed of a different sequence of microcode words, and such a microcode sequence constitutes a microprogram. After decoding an instruction, the central processing unit issues control signals with a certain timing and, following the given sequence with one micro-cycle per beat, executes the micro-operations determined by the microcode, thereby completing the execution of the instruction. A simple instruction consists of three to five micro-operations, while a complex instruction may consist of dozens or even hundreds of micro-operations.
B.协处理器B. Coprocessor
协处理器(coprocessor)，一种芯片或芯片中的部件，用于减轻系统微处理器的特定处理任务。协处理器，这是一种协助中央处理器完成其无法执行或执行效率、效果低下的处理工作而开发和应用的处理器。这种中央处理器无法执行的工作有很多，比如设备间的信号传输、接入设备的管理等；而执行效率、效果低下的有图形处理、声频处理等。为了进行这些处理，各种辅助处理器就诞生了。需要说明的是，由于现在的计算机中，整数运算器与浮点运算器已经集成在一起，因此浮点处理器已经不算是辅助处理器。而内建于CPU中的协处理器，可以不算是辅助处理器。当然，协处理器也可以是独立存在。A coprocessor is a chip, or a component of a chip, used to offload specific processing tasks from the system's microprocessor. A coprocessor is a processor developed and applied to assist the central processing unit with processing work that the CPU cannot perform, or performs inefficiently or poorly. There are many tasks the central processing unit cannot perform, such as signal transmission between devices and management of access devices; tasks it performs inefficiently or poorly include graphics processing and audio processing. Various auxiliary processors were created to handle such processing. It should be noted that, because the integer arithmetic unit and the floating-point arithmetic unit are integrated together in today's computers, the floating-point processor is no longer regarded as an auxiliary processor, and a coprocessor built into the CPU may likewise not be regarded as an auxiliary processor. Of course, a coprocessor may also exist independently.
在本申请中,协处理器可以用于特定处理任务,例如,数学协处理器可以控制数字处理;图形协处理器可以处理视频绘制。协处理器可以附属于通用处理器。一个协处理器通过扩展指令集或提供配置寄存器来扩展通用处理器内核处理功能。一个或多个协处理器可以通过协处理器接口与通用处理器内核相连。例如,协处理器也能通过提供一组专门的新指令来扩展指令集。作为示例而非限定,协处理器可以包括但不限于以下至少一种处理器:In this application, the coprocessor can be used for specific processing tasks, for example, a mathematical coprocessor can control digital processing; a graphics coprocessor can handle video rendering. The coprocessor can be attached to a general-purpose processor. A coprocessor extends the general-purpose processor core processing capabilities by extending the instruction set or providing configuration registers. One or more coprocessors can be connected to a general-purpose processor core through a coprocessor interface. For example, the coprocessor can also expand the instruction set by providing a new set of specialized instructions. By way of example and not limitation, the coprocessor may include, but is not limited to, at least one of the following processors:
B1.图形处理器B1. Graphics Processor
图形处理器(graphics processing unit，GPU)，又称显示核心、视觉处理器、显示芯片，是一种专门在个人电脑、工作站、游戏机和一些移动设备(如平板电脑、智能手机等)上进行图像运算工作的微处理器。GPU的用途是将计算机系统所需要的显示信息进行转换驱动，并向显示器提供行扫描信号，控制显示器的正确显示，是连接显示器和个人电脑主板的重要元件，也是“人机对话”的重要设备之一。例如，显卡的处理器有时被称为图形处理器(GPU)，它是显卡的“心脏”，与CPU类似，只不过GPU是专为执行复杂的数学和几何计算而设计的，这些计算是图形渲染所需的。某些最快速的GPU集成的晶体管数甚至超过了普通CPU。A graphics processing unit (GPU), also called the display core, visual processor, or display chip, is a microprocessor dedicated to image computation on personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). The purpose of the GPU is to convert and drive the display information required by the computer system and to provide line-scanning signals to the display so that the display works correctly. It is an important component connecting the display to the personal computer's mainboard, and one of the important devices for "human-machine dialogue". For example, the processor of a graphics card is sometimes called a graphics processing unit (GPU); it is the "heart" of the graphics card and is similar to a CPU, except that the GPU is designed to perform the complex mathematical and geometric calculations required for graphics rendering. Some of the fastest GPUs integrate even more transistors than an ordinary CPU.
时下的GPU多数拥有2D或3D图形加速功能。如果CPU想画一个二维图形，只需要发个指令给GPU，如“在坐标位置(x,y)处画个长和宽为a×b大小的长方形”，GPU就可以迅速计算出该图形的所有像素，并在显示器上指定位置画出相应的图形，画完后就通知CPU“我画完了”，然后等待CPU发出下一条图形指令。有了GPU，CPU就从图形处理的任务中解放出来，可以执行其他更多的系统任务，这样可以大大提高计算机的整体性能。例如，GPU会产生大量热量，所以它的上方通常安装有散热器或风扇。Most current GPUs have 2D or 3D graphics acceleration capabilities. If the CPU wants to draw a two-dimensional figure, it only needs to send an instruction to the GPU, such as "draw a rectangle of size a×b at coordinate position (x, y)"; the GPU can then quickly calculate all the pixels of the figure, draw the corresponding figure at the specified position on the display, notify the CPU "I have finished drawing" when done, and then wait for the CPU to issue the next graphics instruction. With a GPU, the CPU is freed from graphics-processing tasks and can perform more of the other system tasks, which can greatly improve the overall performance of the computer. Incidentally, a GPU generates a lot of heat, so a heat sink or fan is usually mounted above it.
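The division of labor just described, where the CPU issues one short command and the GPU does the per-pixel work, can be sketched as a producer/consumer pair. This is a toy model only: the command tuple format and the queue standing in for the CPU-GPU interface are invented for illustration.

```python
# Toy model of the CPU->GPU split: the "CPU" enqueues a high-level drawing
# command; the "GPU" function computes every pixel. Command format is made up.

from queue import Queue

def gpu_execute(command):
    op, x, y, a, b = command
    assert op == "rect"
    # compute all pixel coordinates of an a x b rectangle at (x, y)
    return [(x + i, y + j) for i in range(a) for j in range(b)]

command_queue = Queue()
command_queue.put(("rect", 10, 20, 3, 2))   # CPU side: one short instruction
pixels = gpu_execute(command_queue.get())   # GPU side: the heavy per-pixel work
print(len(pixels))  # 6 pixels for a 3x2 rectangle
```

The asymmetry is the point: the command is a few words, while the work it triggers grows with the number of pixels.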
GPU是显示卡的“大脑”，GPU决定了该显卡的档次和大部分性能，同时GPU也是2D显示卡和3D显示卡的区别依据。2D显示芯片在处理3D图像与特效时主要依赖CPU的处理能力，称为软加速。3D显示芯片是把三维图像和特效处理功能集中在显示芯片内，也就是所谓的“硬件加速”功能。显示芯片一般是显示卡上最大的芯片(也是引脚最多的)。目前，GPU已经不再局限于3D图形处理了，GPU通用计算技术发展已经引起业界不少的关注，事实也证明在浮点运算、并行计算等部分计算方面，GPU可以提供数十倍乃至于上百倍于CPU的性能。GPU使计算机设备削减了对CPU的依赖，并分担部分原本CPU执行的工作。The GPU is the "brain" of the graphics card; it determines the card's grade and most of its performance, and it is also the basis for distinguishing 2D graphics cards from 3D graphics cards. A 2D display chip mainly relies on the CPU's processing power when handling 3D images and special effects, which is called software acceleration. A 3D display chip concentrates the three-dimensional image and special-effects processing inside the display chip itself, the so-called "hardware acceleration" function. The display chip is generally the largest chip on the graphics card (and also the one with the most pins). Today the GPU is no longer limited to 3D graphics processing; the development of general-purpose GPU computing has attracted considerable attention in the industry, and it has been demonstrated that in areas such as floating-point and parallel computation the GPU can deliver tens or even hundreds of times the performance of the CPU. The GPU lets computing devices reduce their dependence on the CPU and takes over part of the work originally performed by the CPU.
B2.现场可编程门阵列专用集成电路B2. Field Programmable Gate Array Application Specific Integrated Circuit
现场可编程门阵列(field programmable gate array，FPGA)是在例如可编程阵列逻辑(PAL，Programmable Array Logic)、通用阵列逻辑(GAL，Generic Array Logic)、复杂可编程逻辑器件(CPLD，Complex Programmable Logic Device)等可编程器件的基础上进一步发展的产物。现场可编程门阵列专用集成电路是作为专用集成电路(ASIC，Application Specific Integrated Circuit)领域中的一种半定制电路而出现的，既解决了定制电路的不足，又克服了原有可编程器件门电路数有限的缺点。系统设计师可以根据需要通过可编辑的连接把FPGA内部的逻辑块连接起来，就好像一个电路试验板被放在了一个芯片里。一个出厂后的成品FPGA的逻辑块和连接可以按照设计者而改变，所以FPGA可以完成所需要的逻辑功能。A field programmable gate array (FPGA) is a further development of programmable devices such as programmable array logic (PAL), generic array logic (GAL), and complex programmable logic devices (CPLD). It appeared as a kind of semi-custom circuit in the field of application-specific integrated circuits (ASIC): it remedies the shortcomings of fully custom circuits while overcoming the limited gate count of earlier programmable devices. System designers can connect the logic blocks inside an FPGA through editable interconnections as needed, as if a circuit breadboard had been placed inside a chip. The logic blocks and interconnections of a finished, shipped FPGA can be changed by the designer, so the FPGA can implement the required logic functions.
FPGA采用了逻辑单元阵列(LCA，Logic Cell Array)，内部包括可配置逻辑模块(CLB，Configurable Logic Block)、输入输出模块(IOB，Input Output Block)和内部连线(Interconnect)三个部分。FPGA作为可编程器件，通过不同的编程方式，与传统逻辑电路和门阵列(如PAL、GAL及CPLD器件)相比，FPGA可具有不同的结构。FPGA利用小型查找表(16×1 RAM)来实现组合逻辑，每个查找表连接到一个D触发器的输入端，触发器再来驱动其他逻辑电路或驱动I/O，由此构成了既可实现组合逻辑功能又可实现时序逻辑功能的基本逻辑单元模块，这些模块间利用金属连线互相连接或连接到I/O模块。FPGA的逻辑是通过向内部静态存储单元加载编程数据来实现的，存储在存储器单元中的值决定了逻辑单元的逻辑功能以及各模块之间或模块与I/O间的联接方式，并最终决定了FPGA所能实现的功能，FPGA允许无限次的编程。The FPGA uses a logic cell array (LCA), which internally comprises three parts: configurable logic blocks (CLB), input/output blocks (IOB), and interconnect. As a programmable device, an FPGA can take on different structures through different programming, in contrast to traditional logic circuits and gate arrays (such as PAL, GAL, and CPLD devices). The FPGA implements combinational logic with small lookup tables (16×1 RAM); each lookup table is connected to the input of a D flip-flop, and the flip-flop in turn drives other logic circuits or drives I/O. This forms a basic logic cell module that can implement both combinational and sequential logic functions; these modules are interconnected with one another, or connected to the I/O modules, by metal wiring. The FPGA's logic is realized by loading programming data into internal static storage cells: the values stored in the memory cells determine the logic functions of the logic cells and the connections between modules, or between modules and I/O, and ultimately determine the functions the FPGA can implement. An FPGA can be reprogrammed an unlimited number of times.
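The lookup-table mechanism above can be illustrated in software: a 4-input LUT is a 16×1 memory whose stored bits are the truth table of the function it implements. This is a minimal sketch of the principle, not FPGA configuration code; the AND example is an arbitrary choice.

```python
# Sketch of how an FPGA lookup table (LUT) realizes combinational logic:
# a k-input LUT is a 2^k x 1 memory whose contents define the truth table.
# Here a 4-input LUT (the 16x1 RAM mentioned above) is programmed as AND.

def make_lut(truth_table):
    """truth_table: list of 16 output bits, indexed by the 4 input bits."""
    def lut(a, b, c, d):
        index = (a << 3) | (b << 2) | (c << 1) | d   # inputs form the RAM address
        return truth_table[index]
    return lut

# Program the LUT: output 1 only when all four inputs are 1 (4-input AND).
and4 = make_lut([0] * 15 + [1])
assert and4(1, 1, 1, 1) == 1
assert and4(1, 0, 1, 1) == 0
# Loading a different truth table yields any other 4-input function, which
# is why an FPGA can be reconfigured without changing the hardware.
```

Reprogramming is just rewriting the stored bits, mirroring how the paragraph describes loading programming data into the static storage cells.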
需要说明的是,由于FPGA不包括指令集,因此,后述方法200可以不用于判定FPGA是否能够作为目标处理器。It should be noted that, because the FPGA does not include an instruction set, the method 200 described below may not be used to determine whether the FPGA can be used as a target processor.
但是,由于FPGA具有内存空间,因此,后述方法300可以用于判定FPGA是否能够作为目标处理器。However, since the FPGA has a memory space, a method 300 described later can be used to determine whether the FPGA can be used as a target processor.
B3.神经网络处理器B3. Neural Network Processor
神经网络处理器(neural-network processing unit，NPU)采用“数据驱动并行计算”的架构，特别擅长处理视频、图像类的海量多媒体数据。其中，NPU可以用于深度学习，从技术角度看，深度学习实际上是一类多层大规模人工神经网络。它模仿生物神经网络而构建，由若干人工神经元结点互联而成。神经元之间通过突触两两连接，突触记录了神经元间联系的权值强弱。每个神经元可抽象为一个激励函数，该函数的输入由与其相连的神经元的输出以及连接神经元的突触共同决定。为了表达特定的知识，使用者通常需要(通过某些特定的算法)调整人工神经网络中突触的取值、网络的拓扑结构等。该过程称为“学习”。在学习之后，人工神经网络可通过习得的知识来解决特定的问题。A neural-network processing unit (NPU) adopts a "data-driven parallel computing" architecture and is particularly good at processing massive multimedia data such as video and images. An NPU can be used for deep learning; from a technical perspective, deep learning is in fact a class of multilayer, large-scale artificial neural networks. Such a network is modeled on biological neural networks and is built from a number of interconnected artificial neuron nodes. Neurons are connected pairwise by synapses, and each synapse records the weight of the connection between the two neurons. Each neuron can be abstracted as an activation function whose input is jointly determined by the outputs of the neurons connected to it and by the connecting synapses. To express specific knowledge, the user usually needs to adjust (through certain algorithms) the values of the synapses in the artificial neural network, the topology of the network, and so on. This process is called "learning". After learning, the artificial neural network can use the acquired knowledge to solve specific problems.
深度学习的基本操作是神经元和突触的处理。而传统的处理器指令集是为了进行通用计算发展起来的，其基本操作为算术操作(加减乘除)和逻辑操作(与或非)，往往需要数百甚至上千条指令才能完成一个神经元的处理，深度学习的处理效率不高。与此相对，NPU指令直接面对大规模神经元和突触的处理，一条指令即可完成一组神经元的处理，并对神经元和突触数据在芯片上的传输提供了一系列专门的支持。另外，神经网络中存储和处理是一体化的，都是通过突触权重来体现。The basic operations of deep learning are the processing of neurons and synapses. The traditional processor instruction set was developed for general-purpose computing; its basic operations are arithmetic operations (addition, subtraction, multiplication, and division) and logical operations (AND, OR, NOT), and it often takes hundreds or even thousands of instructions to complete the processing of one neuron, so the efficiency of deep-learning processing is low. By contrast, NPU instructions directly address the processing of large-scale neurons and synapses: a single instruction can complete the processing of a group of neurons, and a series of dedicated support is provided for transmitting neuron and synapse data on the chip. In addition, storage and processing are integrated in a neural network, both being embodied in the synapse weights.
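The neuron abstraction described above, an activation function applied to the synapse-weighted sum of connected neurons' outputs, can be written out directly. This is a minimal illustration: the sigmoid activation and the particular weights and bias are arbitrary choices, not values from this application.

```python
# Minimal illustration of the neuron model: each neuron's output is an
# activation function of the synapse-weighted sum of its inputs.
# The sigmoid activation and the numeric values here are arbitrary.

import math

def neuron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))   # sigmoid activation

# "Learning" would adjust the synapse weights; the topology stays fixed here.
out = neuron([0.5, 0.8], weights=[0.4, -0.6], bias=0.1)
print(round(out, 3))
```

A conventional instruction set computes this one multiply-accumulate at a time, which is why an NPU instruction that processes a whole group of neurons at once is more efficient for this workload.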
B4.专用集成电路B4. Application Specific Integrated Circuits
专用集成电路(application specific integrated circuit,ASIC)是为特定用户或特定电子系统制作的集成电路。数字集成电路的通用性和大批量生产,使电子产品成本大幅度下降,推进了计算机通信和电子产品的普及,但同时也产生了通用与专用的矛盾,以及系统设计与电路制作脱节的问题。同时,集成电路规模越大,组建系统时就越难以针对特殊要求加以改变。为解决这些问题,就出现了以用户参加设计为特征的专用集成电路,它能实现整 机系统的优化设计,性能优越,保密性强。ASIC可以用于执行软件程序,也可以不执行软件程序而是通过硬件逻辑执行计算。例如,执行软件程序的ASIC中可以包括一个或多个处理器内核以运行指令,并具有对应的指令集。Application specific integrated circuit (ASIC) is an integrated circuit made for a specific user or a specific electronic system. The universality and mass production of digital integrated circuits has greatly reduced the cost of electronic products and promoted the popularization of computer communications and electronic products. However, it has also caused the contradiction between general and special applications, and the disconnection between system design and circuit production. At the same time, the larger the integrated circuit scale, the more difficult it is to change for special requirements when building a system. In order to solve these problems, ASICs featuring user participation in design have emerged, which can realize the optimized design of the entire system, with superior performance and strong confidentiality. ASICs can be used to execute software programs, or they can perform calculations through hardware logic instead of software programs. For example, an ASIC executing a software program may include one or more processor cores to execute instructions and have a corresponding instruction set.
B5.数字信号处理器B5. Digital Signal Processor
数字信号处理(digital signal processing)是将信号以数字方式表示并处理的理论和技术。数字信号处理与模拟信号处理是信号处理的子集。数字信号处理的目的是对真实世界的连续模拟信号进行测量或滤波。因此在进行数字信号处理之前需要将信号从模拟域转换到数字域，这通常通过模数转换器实现。而数字信号处理的输出经常也要变换到模拟域，这是通过数模转换器实现的。数字信号处理器(digital signal processor，DSP)是进行数字信号处理的专用芯片，是伴随着微电子学、数字信号处理技术、计算机技术的发展而产生的新器件。Digital signal processing is the theory and technique of representing and processing signals digitally. Digital signal processing and analog signal processing are both subsets of signal processing. The purpose of digital signal processing is to measure or filter continuous real-world analog signals. Therefore, before digital signal processing, a signal needs to be converted from the analog domain to the digital domain, which is usually achieved by an analog-to-digital converter; the output of digital signal processing also often has to be converted back into the analog domain, which is achieved by a digital-to-analog converter. A digital signal processor (DSP) is a chip dedicated to digital signal processing, a new kind of device that emerged with the development of microelectronics, digital signal processing technology, and computer technology.
B6.图像处理单元B6. Image processing unit
图像处理单元(image processing unit，IPU)也可以称为图像信号处理器(image signal processor)，是可以用来对前端图像传感器输出信号进行处理的单元，以匹配不同厂商的图像传感器。并且，可以用于提供从图像输入(摄像头传感器/电视信号输入等)到显示设备(例如，液晶显示屏、电视输出或外部图像处理单元等)端到端的数据流信号处理的全面支持。An image processing unit (IPU), which may also be called an image signal processor, is a unit that can be used to process the output signals of front-end image sensors so as to match image sensors from different manufacturers. It can also provide comprehensive support for end-to-end data-stream signal processing, from image input (camera sensor, TV signal input, and so on) to the display device (for example, a liquid-crystal display, TV output, or an external image processing unit).
应理解,以上列举的处理器仅为示例性说明,本申请并未限定于此,例如,本申请中的处理器还可以包括可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。作为示例而非限定,在本申请中,上述包括多种处理器的结构可以称为异构体系结构,或者,异构系统架构。It should be understood that the above-listed processors are merely exemplary descriptions, and the present application is not limited thereto. For example, the processors in this application may further include programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. By way of example and not limitation, in the present application, the above-mentioned structure including multiple processors may be referred to as a heterogeneous architecture or a heterogeneous system architecture.
在互联网行业，随着信息化的普及，数据量的暴增使得人们对存储空间又有了新要求，同时，机器学习、人工智能、无人驾驶、工业仿真等领域的崛起，使得通用处理器在处理海量计算、海量数据/图片时遇到越来越多的性能瓶颈，如并行度不高、带宽不够、时延高等。为了应对计算多元化的需求，越来越多的场景开始引入GPU、FPGA等硬件进行加速，异构计算应运而生。异构计算(heterogeneous computing)主要指不同类型的指令集和体系架构的计算单元组成的系统的计算方式。所谓的异构，就是CPU、DSP、GPU、ASIC、协处理器、FPGA等各种计算单元、使用不同的类型指令集、不同的体系架构的计算单元，组成一个混合的系统，执行计算的特殊方式，就叫做“异构计算”。特别是在人工智能领域，异构计算大有可为。众所周知，AI意味着对计算力的超高要求，目前以GPU为代表的异构计算已成为加速AI创新的新一代计算架构。In the Internet industry, with the spread of informatization, the explosion in data volume has placed new demands on storage. At the same time, the rise of fields such as machine learning, artificial intelligence, autonomous driving, and industrial simulation means that general-purpose processors encounter more and more performance bottlenecks when handling massive computation and massive data or images, such as limited parallelism, insufficient bandwidth, and high latency. To meet the demand for diversified computing, more and more scenarios have begun to introduce hardware such as GPUs and FPGAs for acceleration, and heterogeneous computing has emerged. Heterogeneous computing mainly refers to the way of computing of a system composed of computing units with different types of instruction sets and architectures. "Heterogeneous" means that various computing units, such as CPUs, DSPs, GPUs, ASICs, coprocessors, and FPGAs, using different types of instruction sets and different architectures, form one hybrid system; this special way of performing computation is what is called "heterogeneous computing". Heterogeneous computing holds particular promise in the field of artificial intelligence. As is well known, AI implies extremely high demands on computing power, and heterogeneous computing, represented today by the GPU, has become the new generation of computing architecture for accelerating AI innovation.
在异构系统架构(heterogeneous system architecture，HSA)中，多种处理器协同工作，即，CPU可以将大部分资源用于缓存和逻辑控制(即非计算单元)，将少部分资源用于计算。这体现了CPU适合运行具有分支密集型、不规则数据结构、递归等特点的串行程序。与传统多核心架构相结合，将专用的计算模块作为加速器加入系统，例如图形处理单元(GPU)、数字信号处理器(DSP)、现场可编程门阵列(FPGA)和其他可编程逻辑单元正在被利用作为加速器(即内核异构架构)，这已成为趋势。HSA以实现异构计算最佳化为目标推出新的系统架构和执行标准，最终目的是透过SoC内各核心(包括CPU、GPU、DSP和其他处理器)的异质架构之间进行协同运算，借此促使整颗SoC内各架构效能得到最大发挥。异构系统架构能够使多种处理器实现内存统一寻址。In a heterogeneous system architecture (HSA), multiple kinds of processors work together; that is, the CPU can devote most of its resources to caching and logic control (non-computing units) and only a small part to computation. This reflects that the CPU is suited to running serial programs characterized by dense branching, irregular data structures, recursion, and the like. Combining this with the traditional multi-core architecture, it has become a trend to add dedicated computing modules to the system as accelerators, with graphics processing units (GPUs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), and other programmable logic units being used as accelerators (that is, a heterogeneous-core architecture). HSA introduces a new system architecture and execution standard aimed at optimizing heterogeneous computing; the ultimate goal is cooperative computation among the heterogeneous architectures of the cores within an SoC (including the CPU, GPU, DSP, and other processors), so that the performance of each architecture in the whole SoC is maximized. A heterogeneous system architecture enables multiple kinds of processors to use unified memory addressing.
在异构计算系统上进行的并行计算通常称为异构计算。人们已从不同角度对异构计算进行定义，综合起来本实施例给出如下定义：异构计算是一种特殊形式的并行和分布式计算，它或是用能同时支持单指令多数据流(single instruction multiple data，SIMD)方式和多指令流多数据流(multiple instruction stream multiple data stream，MIMD)方式的单个独立计算机，或是用由高速网络互连的一组独立计算机来完成计算任务。它能协调地使用性能、结构各异的机器以满足不同的计算需求，并使代码(或代码段)能以获取最大总体性能方式来执行。Parallel computing performed on a heterogeneous computing system is usually called heterogeneous computing. Heterogeneous computing has been defined from various perspectives; taken together, this embodiment gives the following definition: heterogeneous computing is a special form of parallel and distributed computing that completes computing tasks either with a single standalone computer capable of supporting both the single-instruction multiple-data (SIMD) mode and the multiple-instruction-stream multiple-data-stream (MIMD) mode, or with a group of independent computers interconnected by a high-speed network. It can use machines of different performance and structure in a coordinated way to meet different computing needs, and enables code (or code segments) to be executed in the way that obtains the greatest overall performance.
异构计算技术是一种使计算任务的并行性类型(代码类型)与机器能有效支持的计算类型(即机器能力)最相匹配、最能充分利用各种计算资源的并行和分布计算技术。上述具有异构系统架构的芯片可以称为人工智能(artificial intelligence,AI)芯片,或者,加速处理器(accelerated processing unit,APU)。Heterogeneous computing technology is a parallel and distributed computing technology that enables the type of parallelism (code type) of computing tasks to best match the type of computing that the machine can effectively support (that is, machine capabilities) and makes the best use of various computing resources. The above chip with a heterogeneous system architecture may be called an artificial intelligence (AI) chip, or an accelerated processing unit (APU).
The method for selecting a processor provided in this application can select, from the multiple processors described above, a processor for executing a target process. As described above, in this application a processor runs a target program by executing the code of the target program. In this application, different types of processors may have different instruction-set architectures (ISAs); for example, different types of processors may have different instruction sets. An instruction set is a hard-wired program, stored in or integrated into the processor in hardware form, that guides and optimizes processor operations. A processor can run more efficiently through its instruction set.
To make it easier for programmers to write programs, compilation technology may be used in this application. Compilation is the process of converting a program written in one programming language (the source language) into another language (the target language). In this application, the compiler used by this compilation technology may include, but is not limited to, the following components:
A. Front-end compiler:
The front-end compiler implements the conversion from a source program (or source code) to an intermediate representation (IR); that is, the user first describes an operator's computation in a domain-specific language (DSL), which serves as the input of the front-end compiler. Front-end processing mainly includes lexical analysis, syntax analysis, and semantic analysis.
1) Lexical analysis derives the corresponding token sequence from the character sequence. For example, for the code (or instruction or function) "b=3+52*a", the front-end compiler can obtain the token sequence shown in FIG. 2.
2) Syntax analysis further derives an abstract syntax tree (AST) from the token sequence. For example, the token sequence above yields the syntax tree shown in FIG. 3.
3) Semantic analysis identifies the types of variables, the scopes of operations, and so on.
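By way of example and not limitation, the lexical-analysis step above can be sketched as a minimal tokenizer; the token names are illustrative assumptions and do not necessarily match those of FIG. 2.

```python
import re

# Lexical analysis: character sequence -> token sequence.
# Token categories are hypothetical, chosen only for illustration.
TOKEN_SPEC = [("NUM", r"\d+"), ("ID", r"[a-zA-Z_]\w*"), ("OP", r"[=+*]")]

def tokenize(src):
    tokens = []
    pos = 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, src[pos:])
            if m:
                tokens.append((name, m.group()))
                pos += len(m.group())
                break
        else:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
    return tokens

print(tokenize("b=3+52*a"))
# [('ID', 'b'), ('OP', '='), ('NUM', '3'), ('OP', '+'),
#  ('NUM', '52'), ('OP', '*'), ('ID', 'a')]
```

Syntax analysis would then group this token sequence into a tree such as the one in FIG. 3.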
In this application, the front-end compiler may also be called a front-end compilation device or a front-end compilation unit.
B. Intermediate compiler:
The intermediate compiler is used for code generation and optimization. Specifically, intermediate code is pseudo-code that can be regarded as a program for an abstract machine; it is simple and standardized, machine-independent, easy to optimize and convert, and organized as a syntax tree. For example, for the code (or instruction or function) "sum=(10+20)*(num+square)", the syntax tree shown in FIG. 4 can be obtained after code generation and optimization.
In this application, optimization of the intermediate code is performed on an equivalence basis, which saves storage space and makes the program run faster. Common optimizations fall into two categories: 1) machine-independent optimizations, such as constant folding, common-subexpression elimination, loop unrolling and merging, and code hoisting (moving loop-invariant computations out of loops); and 2) machine-dependent optimizations, such as register utilization (keeping frequently used values in registers to reduce the number of memory accesses) and storage strategies (arranging caches and parallel memory systems according to the algorithm's memory-access requirements to reduce access conflicts).
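By way of example and not limitation, one machine-independent optimization named above, constant folding, can be sketched over a small tree like the one in FIG. 4; the tuple-based node representation is a hypothetical illustration, not the embodiment's actual IR.

```python
# AST nodes are tuples: ('num', value), ('var', name), or (op, left, right).
def fold_constants(node):
    """Recursively replace subtrees whose operands are all constants
    with the constant result (an equivalence-preserving rewrite)."""
    if node[0] in ("num", "var"):  # leaves are returned unchanged
        return node
    op, left, right = node
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "num" and right[0] == "num":
        ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
        return ("num", ops[op](left[1], right[1]))
    return (op, left, right)

# sum = (10 + 20) * (num + square): the left operand folds to 30;
# the right operand contains variables and is left unchanged.
ast = ("*", ("+", ("num", 10), ("num", 20)),
            ("+", ("var", "num"), ("var", "square")))
print(fold_constants(ast))
# ('*', ('num', 30), ('+', ('var', 'num'), ('var', 'square')))
```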
In this application, the intermediate compiler may also be called an intermediate compilation device or an intermediate compilation unit.
C. Back-end compiler:
The back-end compiler (backend) is mainly used for target-code generation. That is, there may be multiple back-end compilers, in one-to-one correspondence with the multiple types of processors; each back-end compiler converts the optimized IR it receives into target code (instructions or functions) that can run on the corresponding processor, where the target code may be instruction code or assembly code. As described above, one back-end compiler must be selected from the multiple back-end compilers to generate the target code. In the prior art this is done manually; by contrast, in the embodiments of this application the process can be completed automatically by a computer device. In addition, because the multiple back-end compilers correspond to the multiple types of processors, "selecting, from the multiple back-end compilers, one back-end compiler for generating the target code" can also be understood as the process of selecting, from the multiple types of processors, one processor for executing the target program.
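By way of example and not limitation, the one-to-one correspondence between back-end compilers and processor types can be pictured as a registry keyed by processor type; the function names and the string-based "target code" below are illustrative assumptions, not the embodiment's actual interfaces.

```python
# Hypothetical backend registry: processor type -> code generator.
backends = {}

def register_backend(processor_type, codegen):
    backends[processor_type] = codegen

# Each codegen turns optimized IR into target code for its own processor.
register_backend("cpu", lambda ir: f"cpu-asm({ir})")
register_backend("gpu", lambda ir: f"gpu-kernel({ir})")

def generate_target_code(processor_type, ir):
    """Dispatch the IR to the backend of the selected processor."""
    return backends[processor_type](ir)

print(generate_target_code("gpu", "mul;add"))  # gpu-kernel(mul;add)
```

Selecting a processor then amounts to choosing which registry entry receives the IR.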
In this application, the back-end compiler may also be called a back-end compilation device or a back-end compilation unit.
FIG. 5 is a schematic flowchart of an example of a method 200 for selecting a processor according to this application. By way of example and not limitation, the execution body of the method 200 (hereinafter called processing node #A for ease of understanding and description) may be any one of the multiple processors in the computing device, for example, a central processing unit. Alternatively, processing node #A may be a virtual machine running on the computing device. In addition, in this application, processing node #A may be the back-end compiler described above, or may be a device independent of the back-end compiler; this application does not specifically limit this.
It should be noted that the method 200 selects the target processor based on instructions. Because an FPGA does not include an instruction set, the method 200 need not be used to determine whether an FPGA can serve as the target processor.
Moreover, when an ASIC can be used to execute a software program, the method 200 can be used to determine whether the ASIC can serve as the target processor.
When an ASIC does not execute a software program but performs computation through hardware logic, the method 200 need not be used to determine whether the ASIC can serve as the target processor.
As shown in FIG. 5, in S210, processing node #A may obtain hardware information of each of the two types of processors included in the computing device 100. Optionally, in this application, the manufacturer of the computing device 100 may pre-configure the hardware information of the processors included in the computing device 100 before the computing device 100 leaves the factory; thus, in S210, processing node #A can obtain the hardware information of each of the two types of processors based on that factory configuration. Optionally, the manufacturer of the computing device 100 may store the hardware information of the processors included in the computing device 100 on a server; thus, in S210, processing node #A connects to the server over a network in advance and obtains from the server the hardware information of each of the two types of processors. Optionally, a user of the computing device 100 may input the hardware information of the processors included in the computing device 100 to processing node #A. Optionally, each processor may be installed in a hot-pluggable manner, and the driver of each processor may register the processor when it is hot-plugged; in this case, in S210, processing node #A can obtain the hardware information of each of the two types of processors based on the registration information of each processor or related information in its driver.
That is, in this application, the computer device 100 (or processing node #A) may have a processor-registration-information collection function, so that it can identify which heterogeneous hardware the computer device 100 supports and, based on the identified hardware, register the backend corresponding to each processor at system startup. The processing node can then determine the hardware information of each processor from the registration information of the backend corresponding to that processor.
In this application, the hardware information of a processor may include information about the instruction set corresponding to the processor. For example, the hardware information of a processor may include the names of the instructions that the processor can execute. As another example, the hardware information of a processor may include the names of the functions that the processor can execute.
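By way of example and not limitation, such hardware information could be represented as a mapping from each processor to the set of instruction names its instruction set supports; the processor names and instruction names below are invented for illustration only.

```python
# Hypothetical hardware information collected in S210: each processor
# maps to the set of instruction names its instruction set supports.
hardware_info = {
    "processor_a": {"mul", "add", "vmla"},        # special-purpose accelerator
    "processor_b": {"mul", "add", "sub", "div"},  # general-purpose CPU
}

def can_execute(processor, required_instructions):
    """True if the processor's instruction set includes every
    instruction the program needs."""
    return required_instructions <= hardware_info[processor]

print(can_execute("processor_a", {"mul", "add"}))  # True
print(can_execute("processor_a", {"div"}))         # False
```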
As shown in FIG. 5, in S220, processing node #A may determine program information of the program that currently needs to run (that is, an example of the target program, denoted program #A). By way of example and not limitation, the program information may be determined from the IR of program #A. For example, the front-end compiler may obtain the source code of program #A (denoted code #A). Specifically, the compiler may provide, for example, a domain-specific-language interface (DSL interface) through which developers write the DSL corresponding to an operator (an example of code #A); the intermediate compiler may then convert code #A (for example, the DSL) of program #A into the IR of program #A and, in this application, may also optimize that IR. Processing node #A can thus determine the program information of program #A from the IR (for example, the optimized IR) of program #A.
It should be noted that, in this application, processing node #A may itself serve as the front-end compiler and intermediate compiler for code #A, in which case processing node #A can obtain the IR of program #A directly. Alternatively, the front-end compiler and intermediate compiler for code #A may be implemented by a processing node #B, in which case processing node #A may communicate with processing node #B, so that processing node #B can send the IR of program #A to processing node #A.
In this application, the program information of program #A may include the instructions (denoted instruction #A) contained in the code of program #A (for example, the optimized IR). Instruction #A may consist of one instruction or multiple instructions; this application does not specifically limit this. For example, the program information of program #A may include the names of the instructions in the IR of program #A. As another example, the program information of program #A may include the names of the functions in the IR of program #A.
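By way of example and not limitation, collecting the instruction names used by a program from its IR could be sketched as follows; the textual IR format is an assumption made purely for illustration.

```python
import re

# Hypothetical textual IR for s = x*y + z, matching the DSL description
# m = mul(x, y); s = add(m, z) used later in this application.
ir_text = """
m = mul(x, y)
s = add(m, z)
"""

def instructions_used(ir):
    """Collect the set of instruction (function) names appearing in the IR."""
    return set(re.findall(r"(\w+)\(", ir))

print(instructions_used(ir_text))  # a set containing 'mul' and 'add'
```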
In S230, processing node #A may determine a target processor (denoted processor #1) from the multiple processors based on the program information of program #A and the hardware information of each processor. Processor #1 may be a processor, among the multiple processors, whose corresponding instruction set includes instruction #A. In other words, processor #1 may be a processor, among the multiple processors, that satisfies constraint #A, where constraint #A is that the instruction set corresponding to the processor includes instruction #A.
Optionally, in this application, processing node #A may determine the priority of each of the multiple processors. By way of example and not limitation, processing node #A may determine each processor's priority according to its parallel computing capability; that is, in this application, a processor with higher parallel computing capability has higher priority than a processor with lower parallel computing capability. For example, for processor #a and processor #b, if the parallel computing capability of processor #b is higher than that of processor #a, processing node #A may regard the priority of processor #b as higher than that of processor #a. Here, parallel computing is defined relative to serial computing: it is a form of computation in which multiple instructions can be executed at once, with the aim of increasing computation speed and solving large, complex computing problems by scaling up the problem size. Parallel computing can be divided into parallelism in time and parallelism in space: parallelism in time refers to pipelining, and parallelism in space refers to executing computations concurrently on multiple processors.
Optionally, in this application, processing node #A may determine the priority of each processor according to processor type. For example, in this application, a special-purpose processor has higher priority than a general-purpose processor; optionally, the general-purpose processor may be the lowest-priority processor among the multiple processors. Processing node #A can then check each processor against constraint #A in priority order, for example from highest to lowest priority, and, optionally, determine the first processor that satisfies constraint #A to be processor #1.
In addition, processing node #A may stop checking the remaining processors once processor #1 has been determined. For example, in this application, different processors have different instruction sets, and the instructions used to implement the same function differ between chips. Suppose the instruction set of processor #a is intrin#a and the instruction set of processor #b is intrin#b, and let the source-code expression of program #A be s=x*y+z. Suppose further that, ranked by priority, processor #b is the lowest-priority (default) processor: for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor; that is, processor #b can implement the function, but its parallel computing capability is inferior to that of special-purpose processor #a. Described in a DSL, the computation can be expressed as m=mul(x,y), s=add(m,z). In this application, after IR processing, the IR description of program #A is obtained, and analysis shows that the computation uses two kinds of instructions: mul (multiplication) and add (addition). Processing node #A then first determines whether these instructions all belong to intrin#a; if the determination is "yes", it selects processor #a as processor #1; if "no", it selects processor #b as processor #1.
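By way of example and not limitation, the priority-ordered check just described could be sketched as follows; the processor names, instruction sets, and priority order are the hypothetical ones from the example above.

```python
# Hypothetical instruction sets from the example.
intrin_a = {"mul", "add", "vmla"}        # special-purpose processor #a
intrin_b = {"mul", "add", "sub", "div"}  # general-purpose processor #b

# Processors ordered by priority, highest first; #b is the default.
priority_order = [("processor_a", intrin_a), ("processor_b", intrin_b)]

def select_processor(required):
    """Return the first processor whose instruction set covers `required`
    (constraint #A), stopping as soon as one is found."""
    for name, intrin in priority_order:
        if required <= intrin:
            return name
    return priority_order[-1][0]  # fall back to the lowest-priority processor

# IR analysis of s = x*y + z found the instructions mul and add.
print(select_processor({"mul", "add"}))  # processor_a
print(select_processor({"div"}))         # processor_b
```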
Optionally, the hardware information obtained by processing node #A in S210 further includes information about the size of each processor's currently available memory space. By way of example and not limitation, in this application, the currently available memory space of a processor may be, for example, 90% of the processor's free space (or memory capacity). Correspondingly, the program information obtained by processing node #A in S220 further includes information about the size of the memory space required to run program #A. In this case, processor #1 may be a processor, among the multiple processors, whose corresponding instruction set includes instruction #A and whose currently available memory space is greater than or equal to the memory space required to run program #A. In other words, processor #1 may be a processor, among the multiple processors, that satisfies both constraint #A and constraint #B, where constraint #B is that the processor's currently available memory space is greater than or equal to the memory space required to run program #A (denoted space #A).
By way of example and not limitation, in this application, processing node #A may determine space #A according to the data dimensions of program #A (or of the code of program #A). The data dimensions of program #A can be understood as the shape of the tensors of program #A. A tensor is a multilinear map defined on the Cartesian product of some vector spaces and some dual spaces; its coordinates form a quantity with |n| components in an |n|-dimensional space, where each component is a function of the coordinates, and under a coordinate transformation the components transform linearly according to certain rules. r is called the rank or order of the tensor (unrelated to the rank or order of a matrix). In the sense of isomorphism, a tensor of order zero (r=0) is a scalar, a tensor of order one (r=1) is a vector, and a tensor of order two (r=2) is a matrix. For example, in 3-dimensional space, a tensor with r=1 is the vector (x, y, z). Depending on the transformation rule, tensors are divided into three classes: covariant tensors (indices below), contravariant tensors (indices above), and mixed tensors (indices both above and below).
In this application, the tensor data structure can be used to represent all data; that is, in this application, a tensor may correspond to an n-dimensional array or list. A tensor has a static type and dynamic dimensions, and tensors can flow between the nodes of a graph.
In this application, the number of dimensions of a tensor is described as its order. It should be noted that the order of a tensor (sometimes expressed as its degree, or as being n-dimensional) is a quantitative description of the tensor's dimensionality. For example, in this application, processing node #A may perform shape analysis on the IR of program #A to determine the dimensions of the IR of program #A (specifically, of its tensors), and thereby estimate the size of the memory space required to run program #A. The method and process of estimating memory size based on data dimensions may be similar to the prior art; a detailed description is omitted here to avoid repetition.
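By way of example and not limitation, the shape-based estimate could be sketched as follows, assuming (purely for illustration) 4-byte elements and that shape analysis has already yielded the shapes of the program's tensors.

```python
from functools import reduce

def tensor_bytes(shape, element_size=4):
    """Memory for one tensor: the product of its dimensions
    times the assumed element size in bytes."""
    return reduce(lambda a, b: a * b, shape, 1) * element_size

def estimate_program_memory(tensor_shapes):
    """Estimate space #A as the sum over all tensors found by shape analysis."""
    return sum(tensor_bytes(s) for s in tensor_shapes)

# e.g. five tensors (x, y, z, m, s) each of shape (1024, 1024)
shapes = [(1024, 1024)] * 5
print(estimate_program_memory(shapes))  # 20971520 bytes (20 MiB)
```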
In this case, let the source-code expression of program #A be s=x*y+z, and suppose that, ranked by priority, processor #b is the lowest-priority (default) processor: for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor; that is, processor #b can implement the function, but its parallel computing capability is inferior to that of special-purpose processor #a. Described in a DSL, the computation can be expressed as m=mul(x,y), s=add(m,z). In this application, after IR processing, the IR description of program #A is obtained, and analysis shows that the computation uses two kinds of instructions: mul (multiplication) and add (addition). In addition, processing node #A can determine the size of the memory space that program #A needs to occupy (for example, let this size be X), and can determine the size of each processor's currently available memory space (let the currently available memory space of processor #a be Y). Processing node #A then first determines whether the instructions all belong to intrin#a; if the determination is "yes", it further determines whether X is less than or equal to Y. If that determination is "yes", it selects processor #a as processor #1; if "no", it selects processor #b as processor #1.
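By way of example and not limitation, the selection check extended with constraint #B could be sketched as follows; the processor names, instruction sets, and memory figures are hypothetical.

```python
# Hypothetical per-processor hardware information: instruction set and
# currently available memory (e.g. 90% of free memory), in bytes.
processors = [  # ordered by priority, highest first; #b is the fallback
    {"name": "processor_a", "intrin": {"mul", "add"}, "avail_mem": 16 << 20},
    {"name": "processor_b", "intrin": {"mul", "add", "div"}, "avail_mem": 256 << 20},
]

def select_processor(required_instrs, required_mem):
    """Return the first processor satisfying constraint #A (instructions)
    and constraint #B (available memory >= space #A)."""
    for p in processors:
        if required_instrs <= p["intrin"] and required_mem <= p["avail_mem"]:
            return p["name"]
    return processors[-1]["name"]

# X = 20 MiB needed but Y = 16 MiB on #a -> constraint #B fails, use #b.
print(select_processor({"mul", "add"}, 20 << 20))  # processor_b
# X = 8 MiB fits on #a -> both constraints hold, use #a.
print(select_processor({"mul", "add"}, 8 << 20))   # processor_a
```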
Processing node #A may then direct the intermediate compiler to send the IR of program #A to the backend corresponding to processor #1, so that the backend corresponding to processor #1 can convert the IR of program #A into code that processor #1 can recognize and process.
According to the method for selecting a processor provided in this application, the hardware information of each type of processor and the program information of the target program are obtained in advance, and based on this hardware information and program information, a processor whose hardware information matches the program information is selected from the multiple types of processors. This matches the selected processor to the target program without requiring the processor to be specified manually, thereby improving the processing efficiency of the computer device and reducing the burden on programmers.
FIG. 6 is a schematic flowchart of an example of a method 300 for selecting a processor according to this application. By way of example and not limitation, the execution body of the method 300 (hereinafter called processing node #B for ease of understanding and description) may be any one of the multiple processors in the computing device, for example, a central processing unit. Alternatively, processing node #B may be a virtual machine running on the computing device. In addition, in this application, processing node #B may be the back-end compiler described above, or may be a device independent of the back-end compiler; this application does not specifically limit this.
As shown in FIG. 6, in S310, processing node #B may obtain hardware information of each of the two types of processors included in the computing device 100. Optionally, in this application, the manufacturer of the computing device 100 may pre-configure the hardware information of the processors included in the computing device 100 before the computing device 100 leaves the factory; thus, in S310, processing node #B can obtain the hardware information of each of the two types of processors based on that factory configuration. Optionally, the manufacturer of the computing device 100 may store the hardware information of the processors included in the computing device 100 on a server; thus, in S310, processing node #B connects to the server over a network in advance and obtains from the server the hardware information of each of the two types of processors. Optionally, a user of the computing device 100 may input the hardware information of the processors included in the computing device 100 to processing node #B. Optionally, each processor may be installed in a hot-pluggable manner, and the driver of each processor may register the processor when it is hot-plugged; in this case, in S310, processing node #B can obtain the hardware information of each of the two types of processors based on the registration information of each processor or related information in its driver. That is, in this application, the computer device 100 (or processing node #B) may have a processor-registration-information collection function, so that it can identify which heterogeneous hardware the computer device 100 supports and, based on the identified hardware, register the backend corresponding to each processor at system startup. The processing node can then determine the hardware information of each processor from the registration information of the backend corresponding to that processor.
In this application, the hardware information of a processor may include the size of the processor's currently available memory space. By way of example and not limitation, in this application, the currently available memory space of a processor may be, for example, 90% of the processor's free space (or memory capacity).
As shown in FIG. 6, in S320, processing node #B may determine program information of the program that currently needs to run (that is, an example of the target program, denoted program #B). By way of example and not limitation, the program information may be determined from the IR of program #B. For example, the front-end compiler may obtain the source code of program #B (denoted code #B). Specifically, the compiler may provide, for example, a domain-specific-language interface (DSL interface) through which developers write the DSL corresponding to an operator (an example of code #B); the intermediate compiler may then convert code #B (for example, the DSL) of program #B into the IR of program #B and, in this application, may also optimize that IR. Processing node #B can thus determine the program information of program #B from the IR (for example, the optimized IR) of program #B.
It should be noted that, in this application, the processing node #B may itself serve as the front-end compiler and the intermediate compiler for the code #B; in this case, the processing node #B can directly obtain the IR of program #B. Alternatively, in this application, the front-end compiler and the intermediate compiler for the code #B may be implemented by a node other than the processing node #B; in this case, that node may communicate with the processing node #B and send the IR of program #B to the processing node #B. In this application, the program information of program #B may include information on the size of the memory space (denoted as space #B) required for running program #B.
By way of example and not limitation, in this application, the processing node #B may determine the space #B according to the data dimensions of program #B (that is, of the code of program #B). The data dimensions of program #B can be understood as the shapes of the tensors of program #B. A tensor is a multilinear map defined on the Cartesian product of some vector spaces and some dual spaces; in an |n|-dimensional space, it is a quantity with |n| components, where each component is a function of the coordinates, and under a coordinate transformation these components transform linearly according to certain rules. r is called the rank or order of the tensor (which is unrelated to the rank or order of a matrix). Up to isomorphism, a tensor of order zero (r = 0) is a scalar, a tensor of order one (r = 1) is a vector, and a tensor of order two (r = 2) is a matrix. For example, in a 3-dimensional space, a tensor with r = 1 is a vector (x, y, z). Depending on how they transform, tensors fall into three classes: covariant tensors (indices below), contravariant tensors (indices above), and mixed tensors (indices both above and below).
In this application, a data structure such as a tensor can be used to represent all data; that is, in this application, a tensor may correspond to an n-dimensional array or list. A tensor has a static type and a dynamically typed number of dimensions, and tensors can flow between the nodes of a graph. In this application, the number of dimensions of a tensor is described as its order. It should be noted that the order of a tensor (sometimes referred to as its degree, or as being n-dimensional) is a quantitative description of the tensor's dimensionality. For example, in this application, the processing node #B may perform a shape analysis on the IR of program #B, thereby determining the dimensions of the IR of program #B (specifically, of the tensors in the IR), and then estimate the size of the memory space required for running program #B. Here, the method and process of estimating the size of the memory space based on the dimensions of the data may be similar to those in the prior art; to avoid repetition, a detailed description thereof is omitted here.
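Although the application leaves the estimation method to the prior art, a simple shape-based estimate can be sketched as below: sum the element counts over all tensor shapes found in the IR and multiply by the element size. The shapes, the 4-byte element size, and the function name are illustrative assumptions, not part of the application.

```python
from functools import reduce
from operator import mul as multiply

def estimate_memory(tensor_shapes, dtype_bytes=4):
    """Estimate bytes needed from the shapes of the tensors in the IR.

    Sums the element count of every tensor, then multiplies by the
    assumed per-element size (4 bytes, e.g. float32).
    """
    total_elems = sum(reduce(multiply, shape, 1) for shape in tensor_shapes)
    return total_elems * dtype_bytes

# s = x*y + z with x, y, z and the intermediate m all of shape (1024, 1024)
shapes = [(1024, 1024)] * 4
print(estimate_memory(shapes))
```

The result would then play the role of W (the memory space program #B needs to occupy) in the comparison against each processor's available memory in S330.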
In S330, the processing node #B may determine a target processor (denoted as processor #2) from the multiple processors based on the program information of program #B and the hardware information of each processor. The processor #2 may be a processor, among the multiple processors, whose currently available memory space is greater than or equal to the size of the memory space required for running program #B. In other words, the processor #2 may be a processor, among the multiple processors, that satisfies constraint #C, where constraint #C includes: the processor's currently available memory space is greater than or equal to the size of the memory space required for running program #B.
Optionally, in this application, the processing node #B may determine the priority of each of the multiple processors. By way of example and not limitation, in this application, the processing node #B may determine the priority of each processor according to the parallel computing capability of each of the multiple processors; that is, in this application, a processor with a high parallel computing capability has a higher priority than a processor with a low parallel computing capability. For example, for processor #a and processor #b, if the parallel computing capability of processor #b is higher than that of processor #a, the processing node #B may consider the priority of processor #b to be higher than that of processor #a. Here, parallel computing is defined in contrast to serial computing: it is a form of computation in which multiple instructions can be executed at once, with the purpose of increasing computing speed and of solving large, complex computational problems by scaling up the problem size. Parallelism can be divided into temporal parallelism and spatial parallelism; temporal parallelism refers to pipelining, while spatial parallelism refers to using multiple processors to perform computations concurrently.
As another example, in this application, the processing node #B may determine the priority of each processor according to the power consumption of each of the multiple processors; that is, in this application, a processor with high power consumption has a lower priority than a processor with low power consumption. For example, for processor #a and processor #b, if the power consumption of processor #b is higher than that of processor #a, the processing node #B may consider the priority of processor #b to be lower than that of processor #a.
Optionally, in this application, the processing node #B may determine the priority of each processor according to the type of each of the multiple processors. For example, in this application, a special-purpose processor has a higher priority than a general-purpose processor; and, optionally, a general-purpose processor may be the processor with the lowest priority among the multiple processors. The processing node #B may then judge, in priority order (for example, from the highest priority to the lowest), whether each processor satisfies the above constraint #C. Optionally, the processing node #B may determine the first processor that satisfies constraint #C as the processor #2. In addition, after determining the processor #2, the processing node #B may stop judging the remaining processors.
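The priority-ordered, first-match selection described above can be sketched as follows. The tuple layout, field names, and the concrete memory figures are assumptions for illustration; the essential behavior is from the text: check processors in descending priority, return the first one satisfying constraint #C, and stop judging the rest.

```python
def select_processor(processors, required_bytes):
    """processors: list of (name, priority, available_bytes) tuples.

    Returns the name of the first processor, in descending priority
    order, whose available memory satisfies constraint #C, or None.
    """
    for name, _prio, avail in sorted(processors, key=lambda p: -p[1]):
        if avail >= required_bytes:   # constraint #C
            return name               # first match wins; stop judging
    return None

procs = [
    ("processor#a", 2, 4 * 1024**2),   # special-purpose, higher priority
    ("processor#b", 1, 64 * 1024**2),  # general-purpose, lowest priority
]
# processor#a lacks memory here, so the fallback processor#b is chosen
print(select_processor(procs, required_bytes=16 * 1024**2))
```

Note that because the loop returns on the first match, a lower-priority general-purpose processor is only reached when every higher-priority processor fails constraint #C, matching the fallback behavior in the worked example that follows.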
For example, suppose the source-code expression of program #B is s = x*y + z, and suppose that, in priority order, processor #b is by default the processor with the lowest priority; for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor, that is, processor #b can implement the function, but its parallel computing capability is inferior to that of the special-purpose processor #a. In a DSL, this computation can be expressed as m = mul(x, y), s = add(m, z). In this application, after IR processing, the IR description of program #B is obtained, and analysis shows that the computation uses two kinds of instructions: mul (multiplication) and add (addition). The processing node #B can determine the size of the memory space that program #B needs to occupy (for example, let this size be W), and can determine the size of each processor's currently available memory space; let the current available memory space of processor #a be Z. Thereafter, the processing node #B judges whether Z is greater than or equal to W: if the judgment is "yes", processor #a is selected as the processor #2; if the judgment is "no", processor #b is selected as the processor #2.
Optionally, the hardware information obtained by the processing node #B in S310 further includes information about the instruction set corresponding to each processor. For example, the hardware information of a processor may include the names of the instructions that the processor can execute; as another example, it may include the names of the functions that the processor can execute. Correspondingly, the program information obtained by the processing node #B in S320 further includes the instructions (denoted as instruction #B) included in the code of program #B (for example, in the optimized IR). The instruction #B may include one instruction or multiple instructions, which is not specifically limited in this application. For example, the program information of program #B may include the names of the instructions in the IR of program #B; as another example, it may include the names of the functions in the IR of program #B. In this case, the processor #2 may be a processor, among the multiple processors, whose currently available memory space is greater than or equal to the memory space required for running program #B and whose corresponding instruction set includes the instruction #B. In other words, the processor #2 may be a processor, among the multiple processors, that satisfies both constraint #C and constraint #D, where constraint #D includes: the processor's corresponding instruction set includes the instruction #B. In this case, suppose the source-code expression of program #B is s = x*y + z, and suppose that, in priority order, processor #b is by default the processor with the lowest priority; for example, processor #b may be a general-purpose processor and processor #a a special-purpose processor, that is, processor #b can implement the function, but its parallel computing capability is inferior to that of the special-purpose processor #a. For example, in this application, different processors have different instruction sets, and the instructions used to implement the same function differ from chip to chip; suppose the instruction set of processor #a is intrin#a and the instruction set of processor #b is intrin#b. In a DSL, this computation can be expressed as m = mul(x, y), s = add(m, z). In this application, after IR processing, the IR description of program #B is obtained, and analysis shows that the computation uses two kinds of instructions: mul (multiplication) and add (addition). The processing node #B can determine the size of the memory space that program #B needs to occupy (for example, let this size be W), and can determine the size of each processor's currently available memory space; let the current available memory space of processor #a be Z. Thereafter, the processing node #B first judges whether Z is greater than or equal to W. If the judgment is "yes", it further judges whether all of the above instructions belong to intrin#a; if this judgment is "yes", processor #a is selected as the processor #2; if it is "no", processor #b is selected as the processor #2.
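The combined check of constraint #C (memory) and constraint #D (instruction-set coverage) can be sketched as below. The instruction-set contents (intrin#a, intrin#b) and the data layout are illustrative assumptions; the two-stage test mirrors the worked example: memory first, then set membership of every IR instruction.

```python
def select_with_isa(processors, required_bytes, program_instrs):
    """processors: list of (name, priority, available_bytes, instr_set).

    Returns the first processor, in descending priority order, that
    satisfies constraint #C (memory) and constraint #D (its instruction
    set covers every instruction used by the program's IR).
    """
    for name, _prio, avail, isa in sorted(processors, key=lambda p: -p[1]):
        if avail >= required_bytes and program_instrs <= isa:
            return name
    return None

intrin_a = {"mul", "add", "fma"}         # special-purpose processor #a
intrin_b = {"mul", "add", "sub", "div"}  # general-purpose processor #b
procs = [
    ("processor#a", 2, 32 * 1024**2, intrin_a),
    ("processor#b", 1, 64 * 1024**2, intrin_b),
]
# s = x*y + z uses the instructions mul and add
print(select_with_isa(procs, 16 * 1024**2, {"mul", "add"}))
```

Here the subset test `program_instrs <= isa` stands in for constraint #D; if processor #a lacked any needed instruction, the loop would fall through to the general-purpose processor #b, as in the example above.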
The processing node #B may control the intermediate compiler to send the IR of program #B to the backend corresponding to the processor #2, so that the backend corresponding to the processor #2 can convert the IR of program #B into code that the processor #2 can recognize and process. The method for selecting a processor of this application can be applied to compilation technology.
As shown in FIG. 7, in S410, a compiling device (for example, a front-end compiler) may provide a DSL interface through which a developer writes the DSL corresponding to an operator. In S420, the compiling device (for example, an intermediate compiler) may generate the intermediate expression IR from the DSL. In S430, the compiling device (for example, the intermediate compiler) may optimize the intermediate expression IR. In S440, the compiling device (for example, the processing node #A or the processing node #B described above) selects the optimal back-end compiler (backend) based on the backend hardware registration information obtained by the automatic hardware identification means and the analysis result of the automatic IR analysis means; the specific process of this step may be similar to the process described in the above method 200 or method 300, and, to avoid repetition, a detailed description thereof is omitted here. In S450, the compiling device (for example, the selected back-end compiler) compiles and generates operator code that can run on the processor corresponding to this backend.
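The five stages S410–S450 can be strung together in a toy sketch like the one below. Every function body here is a placeholder assumption (real front-end parsing, IR generation, and optimization are far richer); the sketch only shows the order in which the stages hand data to each other and where the backend selection of S440 sits in the pipeline.

```python
def frontend_parse(dsl_source):
    """S410: accept the developer-written DSL for an operator."""
    return {"ops": dsl_source.split(";")}

def to_ir(ast):
    """S420: generate a (toy) intermediate expression IR."""
    return {"instrs": [op.strip() for op in ast["ops"] if op.strip()]}

def optimize(ir):
    """S430: optimize the IR (identity here, for brevity)."""
    return ir

def select_backend(ir, backends):
    """S440: pick the first registered backend, in priority order,
    whose instruction set covers the instructions used by the IR."""
    used = {i.split("(")[0].split("=")[-1] for i in ir["instrs"]}
    for name, isa in backends:
        if used <= isa:
            return name
    return None

# registered backends, listed in priority order (names are assumptions)
backends = [("backend#a", {"mul", "add"}), ("backend#b", {"mul", "add", "div"})]
ir = optimize(to_ir(frontend_parse("m=mul(x,y); s=add(m,z)")))
print(select_backend(ir, backends))  # S450 would then emit operator code
```

In the full scheme, S450 would hand the IR to the selected backend, which converts it into code runnable on the corresponding processor.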
According to the method for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among multiple processors. In this way, the selected processor can be matched to the target program, and manual labor time can be reduced.
Based on the foregoing method, FIG. 8 is a schematic diagram of the logical architecture of an apparatus 500 for selecting a processor to which an embodiment of this application is applicable. The apparatus for selecting a processor may be configured on a computing device that includes multiple processors, or the apparatus for selecting a processor may itself be one of the multiple processors. As shown in FIG. 8, the apparatus 500 for selecting a processor may include an identification unit 510, an analysis unit 520, and a selection unit 530.
The identification unit 510 may be configured to perform the method in S210 or S310 above; that is, the identification unit 510 may obtain the hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor, and/or the hardware information is used to indicate the size of the processor's available memory space. The specific processing of the identification unit 510 may be similar to the processing described in S210 or S310 above; to avoid repetition, a detailed description thereof is omitted here.
The analysis unit 520 may be configured to perform the method in S220 or S320 above; that is, the analysis unit 520 may obtain the program information of the target program, where the program information is used to indicate the instructions in the target program, and/or the program information is used to indicate the memory space that the target program needs to occupy. The specific processing of the analysis unit 520 may be similar to the processing described in S220 or S320 above; to avoid repetition, a detailed description thereof is omitted here.
The selection unit 530 may be configured to perform the method in S230 or S330 above; that is, the selection unit 530 determines, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, where the target processor is a processor, among the at least two processors, that satisfies a preset condition, and the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program, and/or that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy. The specific processing of the selection unit 530 may be similar to the processing described in S230 or S330 above; to avoid repetition, a detailed description thereof is omitted here. In addition, the selection unit 530 may further control the intermediate compiler to send the IR of the target program to the back-end compiler (backend) corresponding to the target processor.
It should be noted that, in this application, the actions and functions of the identification unit 510, the analysis unit 520, and the selection unit 530 described above may be implemented by the same virtual machine or the same processor; alternatively, they may be implemented separately by multiple different virtual machines or multiple different processors.
For concepts, explanations, detailed descriptions, and other steps related to the apparatus 500 and the technical solutions provided in the embodiments of this application, refer to the descriptions of these contents in the foregoing method or other embodiments; details are not repeated here. According to the apparatus for selecting a processor provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among multiple processors. In this way, the selected processor can be matched to the target program without manually specifying the processor, which can improve the processing efficiency of the computer device and reduce the burden on the programmer.
Based on the foregoing method, FIG. 9 is a schematic diagram of the logical architecture of a compiling apparatus 600 to which an embodiment of this application is applicable. As shown in FIG. 9, the compiling apparatus 600 may include a front-end compilation unit 610, an intermediate compilation unit 620, a selection unit 630, and multiple back-end compilation units 640, where the multiple back-end compilation units 640 correspond one-to-one to multiple processors (or computing units, computing platforms, or processing units). The selection unit 630 may include an identification module 632, an analysis module 634, and a selection module 636. The front-end compilation unit 610 may provide a DSL interface through which a developer writes the DSL corresponding to an operator; the actions performed by the front-end compilation unit 610 may be similar to those performed by the front-end compiler described above and, to avoid repetition, a description thereof is omitted here. The intermediate compilation unit 620 is communicatively connected to the front-end compilation unit 610 and is configured to obtain the DSL from the front-end compilation unit 610, generate the intermediate expression IR from the DSL, and optimize the intermediate expression IR; the actions performed by the intermediate compilation unit 620 may be similar to those performed by the intermediate compiler described above and, to avoid repetition, a description thereof is omitted here.
The identification module 632 may be configured to perform the method in S210 or S310 above; that is, the identification module 632 may obtain the hardware information of each of the at least two processors, where the hardware information is used to indicate the instruction set corresponding to the processor, and/or the hardware information is used to indicate the size of the processor's available memory space. The specific processing of the identification module 632 may be similar to the processing described in S210 or S310 above; to avoid repetition, a detailed description thereof is omitted here.
The analysis module 634 is communicatively connected to the intermediate compilation unit 620 and is configured to obtain the IR from the intermediate compilation unit 620, and may further be configured to perform the method in S220 or S320 above; that is, the analysis module 634 may obtain the program information of the target program, where the program information is used to indicate the instructions in the target program, and/or the program information is used to indicate the memory space that the target program needs to occupy. The specific processing of the analysis module 634 may be similar to the processing described in S220 or S320 above; to avoid repetition, a detailed description thereof is omitted here.
The selection module 636 may be communicatively connected to the identification module 632 and the analysis module 634, so as to obtain the hardware information from the identification module 632 and the program information from the analysis module 634, and may further be configured to perform the method in S230 or S330 above; that is, the selection module 636 determines, from the at least two processors according to the program information and the hardware information, a target processor for executing the target program, where the target processor is a processor, among the at least two processors, that satisfies a preset condition, and the preset condition includes that the instruction set corresponding to the processor includes the instructions in the target program, and/or that the processor's available memory space is greater than or equal to the memory space that the target program needs to occupy. The specific processing of the selection module 636 may be similar to the processing described in S230 or S330 above; to avoid repetition, a detailed description thereof is omitted here. In addition, the selection module 636 may further control the intermediate compiler to send the IR of the target program to the back-end compilation unit 640 corresponding to the target processor. The back-end compilation unit 640 can convert the IR into code that can run on the corresponding processor; the actions performed by the back-end compilation unit 640 may be similar to those performed by the back-end compiler described above and, to avoid repetition, a description thereof is omitted here.
According to the compiling apparatus provided in this application, the hardware information of each processor and the program information of the target program are obtained in advance, and, based on the hardware information and the program information, a processor whose hardware information matches the program information is selected from among multiple processors. In this way, the selected processor can be matched to the target program without manually specifying the processor, which can improve the processing efficiency of the computer device and reduce the burden on the programmer.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered as going beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is merely a division by logical function, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (17)

  1. A method for selecting a processor, wherein the method comprises:
    obtaining hardware information of each of at least two processors, wherein the hardware information is used to indicate an instruction set corresponding to each processor;
    obtaining program information of a target program to be executed, wherein the program information is used to indicate instructions in the target program; and
    determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and is capable of executing the target program, wherein the preset condition comprises that an instruction set corresponding to a processor comprises the instructions in the target program.
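The selection condition of claim 1 — pick a processor whose instruction set covers every instruction in the target program — can be sketched as follows. This is an illustrative sketch only, not part of the claims; the dictionary-based data model and the example instruction names are hypothetical stand-ins for the "hardware information" and "program information" described above.

```python
# Illustrative sketch (not part of the claims): select a target processor
# whose instruction set covers the instructions used by the target program.

def select_processor(processors, program_instructions):
    """Return the first processor whose instruction set includes every
    instruction of the target program, or None if no processor qualifies."""
    required = set(program_instructions)
    for name, instruction_set in processors.items():
        if required <= set(instruction_set):  # preset condition: full coverage
            return name
    return None

# Hypothetical hardware information: processor name -> supported instructions.
processors = {
    "NPU": {"matmul", "conv2d", "relu"},
    "GPU": {"matmul", "conv2d", "relu", "sort"},
    "CPU": {"matmul", "conv2d", "relu", "sort", "branch"},
}
print(select_processor(processors, ["matmul", "relu"]))  # -> NPU
```

Note that when several processors satisfy the condition, this sketch simply takes the first one encountered; claims 2 and 3 refine that choice with an explicit priority order.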
  2. The method according to claim 1, wherein the determining, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and is capable of executing the target program comprises:
    determining a priority of each of the at least two processors; and
    determining, based on the program information and the hardware information and in descending order of the priorities of the at least two processors, whether each of the at least two processors satisfies the preset condition, and using the first processor that satisfies the preset condition as the target processor.
  3. The method according to claim 2, wherein the determining a priority of each of the at least two processors comprises:
    determining the priority of each processor according to at least one of a parallel computing capability or power consumption of each of the at least two processors.
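Claims 2 and 3 describe scanning the candidates in descending priority, where priority may be derived from parallel computing capability or power consumption. A minimal sketch follows; the scoring formula, field names, and numbers are arbitrary examples chosen for illustration, not specified by the claims.

```python
# Illustrative sketch: rank processors by a priority derived from parallel
# computing capability and power consumption (claim 3), then scan from highest
# to lowest priority and take the first one satisfying the preset condition
# (claim 2). The weighting below is an assumed example.

def priority(proc):
    # Higher parallel capability raises priority; higher power draw lowers it.
    return proc["parallel_ops"] - 0.5 * proc["power_watts"]

def select_by_priority(procs, required):
    ranked = sorted(procs, key=priority, reverse=True)  # high to low
    for proc in ranked:
        if set(required) <= proc["isa"]:  # preset condition from claim 1
            return proc["name"]
    return None

procs = [
    {"name": "CPU", "parallel_ops": 8,    "power_watts": 65,  "isa": {"matmul", "branch"}},
    {"name": "GPU", "parallel_ops": 1024, "power_watts": 150, "isa": {"matmul", "conv2d"}},
    {"name": "NPU", "parallel_ops": 512,  "power_watts": 10,  "isa": {"matmul", "conv2d"}},
]
print(select_by_priority(procs, ["matmul", "conv2d"]))  # -> GPU
```

With these example numbers the general-purpose CPU scores lowest, which matches the arrangement of claim 4 where the CPU has the lowest priority and so serves as the fallback.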
  4. The method according to claim 2 or 3, wherein the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors.
  5. The method according to any one of claims 1 to 4, wherein the hardware information is further used to indicate a size of available memory space of a processor,
    the program information is further used to indicate the memory space required by the target program, and
    the preset condition further comprises that the available memory space of a processor is greater than or equal to the memory space required by the target program.
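Claim 5 extends the preset condition: besides instruction-set coverage, the processor's available memory must be at least the memory the target program needs. A sketch of the combined predicate, with hypothetical field names:

```python
# Illustrative sketch of the extended preset condition per claim 5:
# instruction-set coverage AND sufficient available memory.

def satisfies(proc, program):
    covers_isa = set(program["instructions"]) <= proc["isa"]
    fits_memory = proc["free_mem_bytes"] >= program["mem_bytes"]
    return covers_isa and fits_memory

proc = {"isa": {"matmul", "conv2d"}, "free_mem_bytes": 256 * 1024 * 1024}
small = {"instructions": ["matmul"], "mem_bytes": 64 * 1024 * 1024}
huge  = {"instructions": ["matmul"], "mem_bytes": 512 * 1024 * 1024}
print(satisfies(proc, small), satisfies(proc, huge))  # -> True False
```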
  6. The method according to any one of claims 1 to 5, wherein the at least two processors include at least two of the following processors:
    a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  7. The method according to any one of claims 1 to 6, wherein the obtaining program information of a target program comprises:
    determining the program information according to an intermediate representation (IR) of the target program, wherein the IR of the target program is determined according to domain description language (DSL) code of the target program.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises:
    inputting the IR of the target program to a target back-end compiler corresponding to the target processor.
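Claims 7 and 8 together describe a pipeline: derive an IR from the program's DSL code, read the program information from that IR, select a target processor, and hand the IR to the back-end compiler for that processor. The toy sketch below illustrates the flow; the one-instruction-per-statement "front end", the back-end names, and all function names are hypothetical stand-ins.

```python
# Illustrative end-to-end flow for claims 7 and 8. All names are hypothetical.

def dsl_to_ir(dsl_code):
    # Toy "front end": one IR instruction per DSL statement, e.g.
    # "conv2d(x, w)" -> "conv2d".
    return [line.split("(")[0] for line in dsl_code.strip().splitlines()]

def program_info_from_ir(ir):
    # Program information per claim 7: the instructions used by the program.
    return {"instructions": set(ir)}

# Hypothetical mapping from target processor to its back-end compiler.
BACKENDS = {"GPU": "gpu-backend-compiler", "CPU": "cpu-backend-compiler"}

def compile_for_target(dsl_code, processors):
    ir = dsl_to_ir(dsl_code)
    info = program_info_from_ir(ir)
    for name, isa in processors.items():      # assume already priority-ordered
        if info["instructions"] <= isa:
            return name, BACKENDS[name], ir   # IR goes to the target's back end
    raise RuntimeError("no processor supports this program")

dsl = """conv2d(x, w)
relu(y)"""
target, backend, ir = compile_for_target(
    dsl, {"GPU": {"conv2d", "relu"}, "CPU": {"conv2d", "relu", "branch"}})
print(target, backend)  # -> GPU gpu-backend-compiler
```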
  9. An apparatus for selecting a processor, wherein the apparatus comprises:
    an identification module, configured to obtain hardware information of each of at least two processors, wherein the hardware information is used to indicate an instruction set corresponding to the processor;
    an analysis module, configured to obtain program information of a target program to be executed, wherein the program information is used to indicate instructions in the target program; and
    a selection module, configured to determine, from the at least two processors according to the program information and the hardware information, a target processor that satisfies a preset condition and is capable of executing the target program, wherein the preset condition comprises that an instruction set corresponding to a processor comprises the instructions in the target program.
  10. The apparatus according to claim 9, wherein the selection module is configured to determine a priority of each of the at least two processors, and to determine, based on the program information and the hardware information and in descending order of the priorities of the at least two processors, whether each of the at least two processors satisfies the preset condition, and to use the first processor that satisfies the preset condition as the target processor.
  11. The apparatus according to claim 10, wherein the selection module is configured to determine the priority of each processor according to at least one of a parallel computing capability or power consumption of each of the at least two processors.
  12. The apparatus according to claim 10 or 11, wherein the at least two processors include a central processing unit (CPU), and the CPU has the lowest priority among the at least two processors.
  13. The apparatus according to any one of claims 9 to 12, wherein the hardware information is further used to indicate a size of available memory space of a processor,
    the program information is further used to indicate the memory space required by the target program, and
    the preset condition further comprises that the available memory space of a processor is greater than or equal to the memory space required by the target program.
  14. The apparatus according to any one of claims 9 to 13, wherein the at least two processors include at least two of the following processors:
    a CPU, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a neural network processing unit (NPU), an image processing unit (IPU), or a digital signal processor (DSP).
  15. The apparatus according to any one of claims 9 to 14, wherein the analysis module is configured to determine the program information according to an intermediate representation (IR) of the target program, wherein the IR is determined according to domain description language (DSL) code of the target program.
  16. The apparatus according to any one of claims 9 to 15, wherein
    the selection module is further configured to provide the IR of the target program to a target back-end compiler corresponding to the target processor.
  17. A computer-readable storage medium, comprising a computer program that, when run on a computer device or a processor, causes the computer device or the processor to perform the method according to any one of claims 1 to 8.
PCT/CN2018/108459 2018-09-28 2018-09-28 Method and device for selecting processor WO2020062086A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880094887.1A CN112292667B (en) 2018-09-28 2018-09-28 Method and apparatus for selecting processor
PCT/CN2018/108459 WO2020062086A1 (en) 2018-09-28 2018-09-28 Method and device for selecting processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/108459 WO2020062086A1 (en) 2018-09-28 2018-09-28 Method and device for selecting processor

Publications (1)

Publication Number Publication Date
WO2020062086A1 (en)

Family

ID=69949840

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/108459 WO2020062086A1 (en) 2018-09-28 2018-09-28 Method and device for selecting processor

Country Status (2)

Country Link
CN (1) CN112292667B (en)
WO (1) WO2020062086A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951363A (en) * 2020-07-16 2020-11-17 广州玖的数码科技有限公司 Cloud computing chain-based rendering method and system and storage medium
CN113778984A (en) * 2021-08-16 2021-12-10 维沃移动通信(杭州)有限公司 Processing component selection method and device
CN115330587A (en) * 2022-02-22 2022-11-11 摩尔线程智能科技(北京)有限责任公司 Distributed storage interconnection structure, video card and memory access method of graphics processor
CN115391053A (en) * 2022-10-26 2022-11-25 北京云迹科技股份有限公司 Online service method and device based on CPU and GPU hybrid calculation
CN115600664A (en) * 2022-09-28 2023-01-13 美的集团(上海)有限公司(Cn) Operator processing method, electronic device and storage medium
CN117032999A (en) * 2023-10-09 2023-11-10 之江实验室 CPU-GPU cooperative scheduling method and device based on asynchronous running
CN117076330A (en) * 2023-10-12 2023-11-17 北京开源芯片研究院 Access verification method, system, electronic equipment and readable storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988194B (en) * 2021-03-29 2023-12-15 北京市商汤科技开发有限公司 Program optimization method and device based on equipment information, electronic equipment and storage medium
CN116450055B (en) * 2023-06-15 2023-10-27 支付宝(杭州)信息技术有限公司 Method and system for distributing storage area between multi-processing cards

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901207A (en) * 2010-07-23 2010-12-01 中国科学院计算技术研究所 Operating system of heterogeneous shared storage multiprocessor system and working method thereof
CN103167021A (en) * 2013-02-01 2013-06-19 浪潮(北京)电子信息产业有限公司 Resource allocation method and resource allocation device
US20150020206A1 (en) * 2013-07-10 2015-01-15 Raytheon BBN Technologies, Corp. Synthetic processing diversity with multiple architectures within a homogeneous processing environment
CN105138406A (en) * 2015-08-17 2015-12-09 浪潮(北京)电子信息产业有限公司 Task processing method, task processing device and task processing system



Also Published As

Publication number Publication date
CN112292667A (en) 2021-01-29
CN112292667B (en) 2022-04-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18935799; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18935799; Country of ref document: EP; Kind code of ref document: A1)