CN116149797A - Heterogeneous scene-oriented AI unified computing method, device, equipment and medium - Google Patents

Heterogeneous scene-oriented AI unified computing method, device, equipment and medium

Info

Publication number: CN116149797A (application CN202310348238.2A; granted publication CN116149797B)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: computing, memory, unified, computing device, graph
Legal status: Granted; Active
Inventors: 鲍国庆 (Bao Guoqing), 石恒 (Shi Heng), 张亚林 (Zhang Yalin), 姚建国 (Yao Jianguo)
Current assignee: Shanghai Suiyuan Technology Co., Ltd.
Original assignee and applicant: Shanghai Enflame Technology Co., Ltd.


Classifications

    • G06F9/45508: Runtime interpretation or emulation, e.g. emulator loops, bytecode interpretation (under G06F9/455, Emulation; Interpretation; Software simulation)
    • G06F8/76: Adapting program code to run in a different environment; Porting (under G06F8/70, Software maintenance or management)
    • G06F9/5016: Allocation of resources to service a request, the resource being the memory (under G06F9/50, Allocation of resources, e.g. of the central processing unit [CPU])
    • G06F9/5022: Mechanisms to release resources (under G06F9/50)
    • G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals (under G06F9/50)
    • G06F9/541: Interprogram communication via adapters, e.g. between incompatible applications (under G06F9/54, Interprogram communication)
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management (under Y02D, Climate change mitigation technologies in information and communication technologies)

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an AI unified computing method, device, equipment and medium for heterogeneous scenarios, comprising the following steps: acquiring an AI computation graph, converting it into an intermediate representation, and splitting the AI computation graph in intermediate-representation form into a plurality of graph units; compiling each graph unit into a computing bytecode unit recognizable by at least one type of computing device and providing the computing bytecode units to the runtime system of the AI computing framework, a set of standard computing interfaces having been implemented in advance in each computing device; after the runtime system distributes and schedules each computing bytecode unit to a target computing device, it calls the standard computing interfaces to perform the computation, and, in response to memory access instructions, a unified memory manager performs unified memory allocation and reclamation on each target computing device. The technical solution of the embodiments of the invention shields the upper-layer AI computing framework from the device differences of different chip vendors, enabling different chip vendors to be compatible with and reuse the mainstream AI computing frameworks.

Description

Heterogeneous scene-oriented AI unified computing method, device, equipment and medium
Technical Field
Embodiments of the present invention relate to AI (Artificial Intelligence) computing technology, and in particular to an AI unified computing method, device, equipment and medium for heterogeneous scenarios.
Background
An artificial intelligence computing platform mainly comprises AI computing chips and an AI computing framework. The AI computing chip is the carrier that delivers AI computing power, while the AI computing framework is the basic software platform, built on top of the AI computing chips, that enables upper-layer AI applications.
Currently, in the field of AI applications, a considerable share of AI applications is built on top of a few mainstream AI computing frameworks. To reuse these frameworks, a computing chip vendor must therefore spend substantial effort adapting its chips to each of them. In addition, as AI technology develops, more and more computing scenarios call for heterogeneous or super-heterogeneous computing, yet the mainstream AI computing frameworks suffer from single-vendor hardware support, poor portability, and an inability to let multiple heterogeneous AI chips perform hybrid computation together, and therefore cannot meet the development needs of next-generation AI technology.
Disclosure of Invention
The invention provides an AI unified computing method, device, equipment and medium for heterogeneous scenarios, which shield the upper-layer AI computing framework from the device differences of different chip vendors and enable different chip vendors to be compatible with and reuse the mainstream AI computing frameworks.
In a first aspect, an embodiment of the present invention provides a heterogeneous-scenario-oriented AI unified computing method, executed by a unified computing abstraction layer configured between an AI computing framework and a heterogeneous AI computing platform, the method comprising:
acquiring an AI computation graph generated by the AI computing framework, converting it into an intermediate representation with an AI compiler, and splitting the AI computation graph in intermediate-representation form into a plurality of graph units;
compiling each graph unit into a computing bytecode unit recognizable by at least one type of computing device in the heterogeneous AI computing platform, and providing the computing bytecode units to the runtime system of the AI computing framework;
wherein a set of standard computing interfaces defined by the unified computing abstraction layer is implemented in advance in each computing device of the AI computing platform, and the runtime system, after distributing and scheduling each computing bytecode unit to a matching target computing device in the AI computing platform, calls the standard computing interfaces to perform the computation;
and, in response to memory access instructions sent by the runtime system to a unified memory manager during the computation, performing, by the unified memory manager, unified memory allocation and reclamation on each target computing device.
In a second aspect, an embodiment of the present invention provides a heterogeneous-scenario-oriented AI unified computing apparatus, configured in a unified computing abstraction layer between an AI computing framework and a heterogeneous AI computing platform, comprising:
a graph unit generating module, configured to acquire an AI computation graph generated by the AI computing framework, convert it into an intermediate representation with an AI compiler, and split the AI computation graph in intermediate-representation form into a plurality of graph units;
a graph unit compiling and distributing module, configured to compile each graph unit into a computing bytecode unit recognizable by at least one type of computing device in the heterogeneous AI computing platform, and to provide the computing bytecode units to the runtime system of the AI computing framework;
wherein a set of standard computing interfaces defined by the unified computing abstraction layer is implemented in advance in each computing device of the AI computing platform, and the runtime system, after distributing and scheduling each computing bytecode unit to a matching target computing device in the AI computing platform, calls the standard computing interfaces to perform the computation;
and a memory allocation and reclamation unit, configured to perform, through the unified memory manager, unified memory allocation and reclamation on each target computing device in response to memory access instructions sent by the runtime system to the unified memory manager during the computation.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, so that the at least one processor can execute the heterogeneous scene-oriented AI unified computing method according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the heterogeneous-scenario-oriented AI unified computing method according to any embodiment of the present invention.
In the technical solution of the invention, with the unified computing abstraction layer as the base, computing instructions issued by the upper-layer AI computing framework can be dispatched to different computing devices through an interface with consistent behavior, shielding the upper-layer framework from the computational differences between the computing devices of different chip vendors. In addition, the AI compiler turns the AI computation graph into computing bytecode units distributed across multiple homogeneous or heterogeneous computing devices, which copes with large-scale distributed scenarios and complex heterogeneous computing scenarios. Meanwhile, the unified memory management policy shields the upper-layer framework from the differences in device memory usage between chip vendors. Together, these measures finally enable different chip vendors to be compatible with and reuse the mainstream AI computing frameworks.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a comparison diagram of the AI computation and memory management schemes of computing chips from different vendors in the prior art;
FIG. 2 is a flowchart of a heterogeneous-scenario-oriented AI unified computing method according to a first embodiment of the present invention;
FIG. 3 is a flowchart of a heterogeneous-scenario-oriented AI unified computing method according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of heterogeneous computation and of the compiling, packing and distribution of heterogeneous computing bytecodes, to which the method of an embodiment of the present invention is applicable;
FIG. 5 is a schematic diagram of the runtime system performing computation by calling the standard computing interfaces, to which the method of an embodiment of the present invention is applicable;
FIG. 6 is a flowchart of a heterogeneous-scenario-oriented AI unified computing method according to a third embodiment of the present invention;
FIG. 7 is a schematic diagram of the unified device memory management scheme to which the method of an embodiment of the present invention is applicable;
FIG. 8 is a flowchart of unified memory allocation and reclamation performed by the unified memory manager on target computing devices according to the third embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a heterogeneous-scenario-oriented AI unified computing apparatus according to a fourth embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device implementing the heterogeneous-scenario-oriented AI unified computing method according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so termed may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. Furthermore, the terms "comprise", "include" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion: a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to it.
To facilitate understanding of the embodiments of the present invention, the relevant background of the AI computing chip (hereinafter also referred to as computing hardware) and the AI computing framework that make up a prior-art AI computing platform, and the way they interact, are first briefly described.
As previously described, the AI computing chip is the carrier that delivers AI computing power, and the AI computing framework is the basic software platform, built on the chip, that enables upper-layer AI applications. Beyond hardware design and manufacturing technology, the key factor in the adoption and spread of an AI computing chip is its ecosystem chain, i.e., the complete software and hardware ecosystem constructed around it. The current mainstream AI computing platforms are each built around the hardware of an individual vendor, an ecological form that is unfriendly to other hardware vendors; achieving compatibility with and reuse of the existing AI ecosystem is a necessary step toward full domestic substitution in the future.
Neural network algorithms, represented by deep learning, are a common form of AI computation. The corresponding computing tasks feature low control-logic complexity, massive parallel arithmetic and high storage demand, so the computing hardware must support many-core parallelism and low-latency, high-throughput linear algebra, and must efficiently process large volumes of unstructured data such as text, images, speech and video within a short time. The AI computing framework encapsulates model algorithms, data processing and computation scheduling, and offers developers a set of standard AI development interfaces, including algorithm libraries and tool chains, so that developers can efficiently design, train and validate algorithm models on top of it. In the model deployment phase, the AI computing framework additionally provides simple and efficient inference serving for end users.
The AI computing framework occupies a core position in the overall artificial intelligence stack and supports it in both directions: downward, it invokes the hardware computing resources while shielding their low-level differences and providing a reliable model execution environment; upward, it supports the construction of AI algorithm models, providing a standard development and deployment environment for algorithm engineering. The combination of AI computing chip and AI computing framework determines, to a certain extent, the main technical route of the AI industry, and their development drives the ecosystem chain and surrounding industries. As the value of AI computing frameworks continues to grow, they have become one of the focal points of innovation in the AI industry and have drawn great attention from academia and industry.
In the field of AI applications, although the industry is broad and latecomer advantages are evident, a considerable share of AI applications is built on the international mainstream AI computing frameworks, from hardware adaptation to operator development and from model library completion to algorithm model construction, and the top of the AI ecosystem chain has long been controlled by a few mainstream vendors. China currently faces a diversified, complex and fragmented AI software and hardware ecosystem, and there is an urgent need to advance the hardware adaptation of AI computing frameworks and the standardization of operator interfaces. However, the hardware architecture designs and interface specifications of the various vendors are hard to unify, and the mainstream AI computing frameworks are unfriendly to domestic AI computing chips, so domestic chips must be painstakingly adapted to the mainstream frameworks; their large-scale rollout has always been constrained by compatibility, making a fully autonomous and controllable software and hardware stack difficult to achieve.
On the other hand, heterogeneous or super-heterogeneous computing is an effective answer to the increasingly diverse and complex AI application scenarios of the future, yet the current mainstream AI computing frameworks, including the domestic frameworks that have risen in recent years, suffer from single-vendor hardware support, poor portability and an inability to compute effectively in heterogeneous scenarios. Research on a unified, heterogeneous-scenario-oriented AI computing method is therefore needed, so that an AI computing framework can compute flexibly and efficiently in complex and super-heterogeneous scenarios.
FIG. 1 shows a comparison of the AI computation and memory management schemes of computing chips from different vendors in the prior art.
As shown in the left part of FIG. 1, the upper-layer AI computing framework is strongly bound to the hardware products of a single hardware vendor; that is, the mainstream AI computing frameworks are built on NVIDIA GPU (Graphics Processing Unit) hardware and the accompanying software stack. After the AI computing framework issues the AI computation graph to the runtime system, the runtime system accesses the computing devices through the CUDA parallel programming interface in the NVIDIA software stack and the hardware abstraction layer (NV), and splits the AI computation graph into operator library calls that carry out the computation on the different GPUs; during the computation, the runtime system also invokes the memory manager (NV) through the CUDA parallel programming interface to manage the device memory of each GPU.
If another hardware vendor then wants to reuse the upper-layer AI computing framework rather than develop its own framework from scratch, each of its software and hardware products, as shown in the right part of FIG. 1 (hardware abstraction layer (TOPs), memory manager (TOPs), DSA (Domain Specific Architecture) computing chip and DSA device memory), must be adapted to the AI computing framework at great cost in manpower and material resources before the framework can be reused to meet DSA-based AI computing needs. This greatly increases the development and maintenance burden of the DSA chip, all the more when several mainstream AI computing frameworks must be adapted at the same time.
In addition, the prior-art memory management approach wraps the memory of the AI computing device in a shallow application programming interface so that the upper-layer AI computing framework operates on device memory directly; the memory management schemes of the various computing devices are mutually incompatible, device memory fragmentation arises easily, and the utilization efficiency of the computing devices drops severely.
In view of this, the embodiments of the present invention provide a new heterogeneous-scenario-oriented AI computing solution, in which a unified computing abstraction layer is deployed between the AI computing framework and the computing devices as middleware connecting the two, masking the AI ecosystem incompatibility caused by heterogeneous hardware differences. Upward, the unified computing abstraction layer provides the AI computing framework with a consistent computing call interface; downward, it is compatible with different computing devices through a pre-built unified heterogeneous hardware access layer.
Example 1
FIG. 2 is a flowchart of a heterogeneous-scenario-oriented AI unified computing method according to a first embodiment of the present invention. This embodiment is applicable to performing AI unified computing in a heterogeneous scenario with a single AI computing framework. The method may be performed by a unified computing abstraction layer configured between the AI computing framework and the heterogeneous AI computing platform, which may be deployed on a terminal, a server or a server cluster; this embodiment is not limited in this respect.
As shown in FIG. 2, the method includes:
s110, acquiring an AI calculation graph generated by an AI calculation frame, and splitting the AI calculation graph in an intermediate expression form into a plurality of graph units after the AI calculation graph is converted into the intermediate expression by an AI compiler.
As described above, the AI computing framework is the basic software platform, built on AI computing chips, that enables upper-layer AI applications. The current mainstream AI computing frameworks mainly include PyTorch, TensorFlow, MindSpore, PaddlePaddle, OneFlow, and the like.
AI computation mainly refers to computation based on a given AI model, which can be understood as a complex network system formed by a large number of widely interconnected neurons. An AI model can therefore be described by an AI computation graph, typically a directed acyclic graph composed of vertices and edges, where each vertex corresponds to a computation operator. A computation operator implements arithmetic logic such as addition, subtraction, multiplication, division or convolution, and has one or more inputs and one or more outputs.
In general, an AI computation graph is a high-level, programming-language-dependent representation, and one or more lowering steps are required to turn it into low-level executable code that a computing device can recognize. Specifically, the AI computation graph may first be translated by an AI Compiler into a lower-level intermediate representation (Intermediate Representation, IR) and then split into a plurality of graph units. Taking graph units as the unit of work, each graph unit can then be compiled into low-level executable code and distributed to one or more computing devices in the heterogeneous AI computing platform for execution.
The AI compiler may be, for example, a compiler implemented on the MLIR (Multi-Level Intermediate Representation) framework, which this embodiment does not limit. The intermediate representation is the form a source program takes between parsing and target machine code generation, i.e., intermediate code. In brief, it is closer to machine language than the source program and connects the compiler front end and back end.
In this embodiment, a graph unit can be understood as a subunit of the AI computation graph: each graph unit is a part of the AI computation graph in intermediate-representation form and carries part of its information. Accordingly, each graph unit may contain one or several computation operators.
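To make the graph unit concept concrete, the following is a minimal C++ sketch of how such a unit might be represented. The patent publishes no concrete data structures; every name here (GraphUnit, ComputeOp, OperatorKind, TensorShape) is an illustrative assumption, not the patent's API.

    // Illustrative sketch only; all names are hypothetical.
    #include <cstdint>
    #include <string>
    #include <vector>

    enum class OperatorKind { Add, Sub, Mul, Div, MatMul, Conv, Activation };

    struct TensorShape {
        std::vector<int64_t> dims;              // e.g. {2, 3} for a 2x3 matrix
    };

    // One computation operator inside a graph unit, with the input/output
    // data types and sizes derived from the intermediate representation.
    struct ComputeOp {
        OperatorKind kind;
        std::vector<TensorShape> inputShapes;   // one entry per input data item
        std::vector<TensorShape> outputShapes;  // one entry per output data item
    };

    // A graph unit: a fragment of the AI computation graph in
    // intermediate-representation form, carrying one or several operators.
    struct GraphUnit {
        std::string id;
        std::vector<ComputeOp> ops;
    };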
S120, compiling each graph unit into a computing bytecode unit recognizable by at least one type of computing device in the heterogeneous AI computing platform, and providing the computing bytecode units to the runtime system of the AI computing framework.
A set of standard computing interfaces defined by the unified computing abstraction layer is implemented in advance in each computing device of the AI computing platform; after distributing and scheduling each computing bytecode unit to a matching target computing device in the AI computing platform, the runtime system calls the standard computing interfaces to perform the computation.
Optionally, the heterogeneous AI computing platform may contain computing devices of several different types, a computing device being understood as a computing chip of a given type, such as a GPU, a DSA or a CPU (Central Processing Unit). Computing devices of the same type may be located in the same cabinet or spread across multiple cabinets at different physical locations.
A computing bytecode unit can be understood as the low-level executable code corresponding to a given graph unit: it implements the computing logic defined in that graph unit and can be recognized and directly executed by a computing device of a given type.
In this embodiment, based on the computational characteristics of the operators in each graph unit, it is first decided which type of computing device each graph unit should be assigned to; each graph unit is then compiled, according to the compilation rules matching that device type, into a computing bytecode unit recognizable by that type of device.
After the graph units have been compiled into computing bytecode units, the bytecode units may be handed to the runtime system (Runtime) of the AI computing framework. Upon receiving them, the runtime system distributes and schedules each computing bytecode unit to one or more computing devices of the same or different types for execution.
In this embodiment, after distributing and scheduling the computing bytecode units to matching target computing devices in the AI computing platform, the runtime system no longer drives the devices, as in the prior art, through the interfaces of some vendor-specific hardware abstraction layer or through a particular vendor's operator library interfaces; instead, it uniformly calls the standard computing interfaces defined by the unified computing abstraction layer to control each computing device.
Specifically, this embodiment further defines a unified hardware access layer (Unified Heterogeneous Hardware Interface, UHHI) within the unified computing abstraction layer. The UHHI abstracts, at a higher level, the various interfaces found in the hardware abstraction layers of different hardware vendors, forming a call interface that is consistent toward the upper-layer AI computing framework. Accordingly, any computing device in the AI computing platform can reuse the upper-layer AI computing framework as long as it implements the interface specification defined by the UHHI.
As a concrete example, for copying data from host memory to the memory of a computing device, the UHHI only needs to define one standardized call interface for the operation together with its input and output data forms. When each type of computing device implements this function, the concrete implementation may differ according to the device's hardware architecture; but because the UHHI performs a unified upper-layer abstraction, each device merely has to implement the corresponding interface function, so the differences between specific hardware are shielded inside the UHHI.
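The host-to-device copy example can be sketched as follows, assuming a hypothetical UHHI backend class. The names and signature are illustrative assumptions; the CUDA copy mentioned in a comment is only one plausible backend choice, not a documented mapping.

    // Hypothetical sketch: one standardized copy interface, many vendor backends.
    #include <cstddef>
    #include <cstdint>

    using DeviceMemHandle = std::uintptr_t;      // opaque, hardware-independent handle

    enum class UhhiStatus { Ok, OutOfMemory, InvalidHandle, BackendError };

    class UhhiBackend {
    public:
        virtual ~UhhiBackend() = default;
        // Fixed input/output form defined once by the UHHI; a GPU backend may
        // implement the body with a CUDA copy, a DSA backend with its own
        // driver call. The caller never sees the difference.
        virtual UhhiStatus copyHostToDevice(DeviceMemHandle dst,
                                            const void* src,
                                            std::size_t bytes) = 0;
    };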
S130, in response to memory access instructions sent by the runtime system to the unified memory manager during the computation, performing, by the unified memory manager, unified memory allocation and reclamation on each target computing device.
In this embodiment, when the runtime system calls the standard computing interfaces to control the computing devices that received computing bytecode units, memory management for those devices may be involved. Specifically, after a computing device finishes a computation, a block of memory space must be allocated in local or device memory to store the result; or, when a computing device is done using a piece of data, the memory space occupied by that data may need to be reclaimed, and so on.
In the prior art, the device Application Programming Interfaces (APIs) of different computing devices are called directly, so their memory management schemes are mutually incompatible and device memory fragmentation arises easily. To solve these problems, this embodiment uses a unified memory manager with a unified memory management policy to perform unified memory allocation and reclamation on each target computing device.
The memory access instructions include a memory allocation instruction for a storage space of a given size on any computing device in the heterogeneous AI computing platform, and a memory reclamation instruction for a given storage region on any computing device; either instruction is generated by the runtime system during the computation and sent to the unified memory manager in the unified computing abstraction layer for execution.
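As a minimal sketch of the two instruction forms just described; the field names are invented for illustration only.

    // Hypothetical encodings of the two memory access instructions.
    #include <cstddef>
    #include <cstdint>

    struct MemAllocInstr {
        int deviceId;            // target computing device in the platform
        std::size_t bytes;       // size of the storage space to allocate
    };

    struct MemReclaimInstr {
        int deviceId;
        std::uintptr_t region;   // handle of the storage region to reclaim
    };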
In the technical solution of this embodiment, with the unified computing abstraction layer as the base, computing instructions issued by the upper-layer AI computing framework can be dispatched to different computing devices through an interface with consistent behavior, shielding the upper-layer framework from the computational differences between the computing devices of different chip vendors. The AI compiler turns the AI computation graph into computing bytecode units distributed across multiple homogeneous or heterogeneous computing devices, coping with large-scale distributed and complex heterogeneous computing scenarios. Meanwhile, the unified memory management policy shields the upper-layer framework from the differences in device memory usage between chip vendors, finally enabling different chip vendors to be compatible with and reuse the mainstream AI computing frameworks.
Example 2
FIG. 3 is a flowchart of a heterogeneous-scenario-oriented AI unified computing method according to a second embodiment of the present invention. This embodiment builds on the previous one and refines the operations of acquiring the AI computation graph generated by the AI computing framework, converting it into an intermediate representation with the AI compiler, splitting the graph in intermediate-representation form into a plurality of graph units, and compiling each graph unit into a computing bytecode unit recognizable by at least one type of computing device in the heterogeneous AI computing platform. As shown in FIG. 3, the method includes:
S310, translating the AI computation graph into an intermediate representation with an AI compiler, and deriving from the intermediate representation the input/output data type and input/output data size of each computation operator in the AI computation graph.
As described above, the high-level AI computation graph can be lowered to an intermediate representation through the AI compiler's compilation process, and the intermediate representation can be recognized and interpreted by a computer. By parsing the intermediate representation, the input/output data type and input/output data size of every computation operator in the AI computation graph can be derived.
The input/output data types comprise an input data type and an output data type, and the input/output data sizes comprise an input data size and an output data size.
The input data type is the data type of each input data item of a computation operator. An input data item may be the output of a preceding operator or a preset constant operand, and its data type may be scalar, matrix, tensor, integer, floating point, etc. Similarly, the output data type is the data type of each output data item of a computation operator; an operator may have one or more input data items and one or more output data items.
The input data size is the data size of each input data item of a computation operator: for a scalar input it may be the word length of the data, e.g. 16 or 32 bits; for a tensor input it may be the tensor shape, e.g. 2×2 or 2×3. Similarly, the output data size is the data size of each output data item of a computation operator.
S320, splitting the AI computation graph in intermediate-representation form into a plurality of graph units according to the input/output data type and input/output data size of each computation operator in the AI computation graph.
In this embodiment, after the input/output data types and sizes of the operators have been derived, the AI computation graph may be split, according to one or more preset splitting rules, into graph units that can each be executed by a computing device.
Each graph unit contains at least one computation operator and records the operator type, input/output form and load form of each of its operators.
Specifically, the operator type summarizes the computational logic an operator performs and may include addition, subtraction, multiplication, convolution, activation based on a given activation function, and so on. The input/output form refers to the data sizes of the one or more input data items and the one or more output data items of a graph unit taken as a whole. The load form describes the source of each input data item of the graph unit, which is at least one of an operand and the output of a preceding graph unit.
In an optional implementation of this embodiment, the splitting rules may restrict the number of computation operators per graph unit, the operator types, or the operators' input/output data sizes, which this embodiment does not limit.
S330, determining the computing device type corresponding to each graph unit according to the computational characteristics of the operators in the graph unit and the hardware characteristics of each type of computing device in the AI computing platform.
A graph unit may contain one or more computation operators, and its computational characteristics can be determined from the number of operators it contains or from the attributes of their operator types. Combining these with the hardware characteristics of the various device types in the AI computing platform, the type of computing device suited to executing each graph unit can be determined.
For example, if graph unit A contains many computation operators with a large amount of repeated computation and a long computation time, its computational characteristic can be classified as compute-intensive, and its corresponding device type can be determined to be a GPU, which is well suited to intensive computation. If graph unit B contains few operators but their computational logic is relatively complex, its characteristic can be classified as complex-logic, and its corresponding device type can be a CPU or the like, which is well suited to small amounts of complex logic computation.
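A toy dispatch heuristic along the lines of this paragraph might look as follows; the threshold and the DSA fallback are invented for the example, and a real implementation would use a richer cost model.

    // Toy heuristic only: maps a graph unit's characteristics to a device type.
    #include <cstddef>

    enum class DeviceType { GPU, DSA, CPU };

    DeviceType pickDeviceType(std::size_t numOps, bool complexControlLogic) {
        if (complexControlLogic) return DeviceType::CPU;  // few ops, complex logic
        if (numOps > 32)         return DeviceType::GPU;  // compute-intensive unit
        return DeviceType::DSA;                           // assumed default accelerator
    }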
Through the above arrangement, a correspondence relationship between each graph unit in the AI computation graph and the matched computing device type can be established.
S340, compiling each graph unit into a computing bytecode unit matching its computing device type, according to the compilation rules matching that device type.
Because the computing bytecode formats executable by different types of computing devices differ, back-end compilers matching each device type can be built into the unified computing abstraction layer in advance, and each graph unit is compiled by the matching back end into a computing bytecode unit for its device type.
S350, providing the computing bytecode units to the runtime system of the AI computing framework, which, after distributing and scheduling each computing bytecode unit to a matching target computing device in the AI computing platform, calls the standard computing interfaces to perform the computation.
In this embodiment, after each graph unit has been compiled into its matching computing bytecode unit, the bytecode units may be packed individually and each data packet sent separately to the runtime system of the AI computing framework.
The runtime system may dispatch the computing bytecode units to multiple target computing devices of the same or different types within one cabinet, or across cabinets, to realize homogeneous and/or heterogeneous computing.
Specifically, depending on the concrete computing scenario of the AI computation graph, the packed computing bytecode units may all be sent to computing devices of the same type (a traditional computing scenario); to computing devices of different types (a heterogeneous or hybrid computing scenario); to cross-cabinet computing devices of the same type (a traditional distributed computing scenario); or to cross-cabinet computing devices of different types (a distributed hybrid computing scenario); and so on.
In this embodiment, a set of standard computing interfaces defined by the unified computing abstraction layer is implemented in advance in each computing device of the AI computing platform, and the runtime system, after distributing and scheduling each computing bytecode unit to a matching target computing device, calls the standard computing interfaces to perform the computation.
By way of example and not limitation, the unified hardware access layer (UHHI) of the unified computing abstraction layer may define standard computing interfaces for device calls such as the following:
    • a device interface, for computing device initialization and capability queries;
    • a module interface, for managing the loading and release of compute cores, the caching of compute core instances, and accelerated device invocation;
    • a compute kernel interface, for realizing a computation operator instance;
    • an event interface, for event synchronization under asynchronous operation;
    • a stream interface, for the sequential execution of multiple asynchronous compute core calls;
    • a device memory address interface, pointing to a block of memory allocated on the computing device and supporting jump access;
    • a device buffer interface, for allocating and reclaiming buffer areas on a computing device;
    • a memory interface, for memory allocation and for copying between the computing device and the host;
    • an exception handling interface, for converting error codes generated in the heterogeneous AI computing platform into platform-independent error codes presented as recognizable character strings.
On this basis, the UHHI standard computing interfaces may realize a high-level encapsulation of the function calls of different computing devices in programming languages such as C++ or Rust.
The UHHI standard computing interfaces use virtual data types that are independent of specific hardware; a virtual data type is instantiated only when the corresponding hardware platform implements the interface function. That is, each standard computing interface is built with computing-device-independent virtual data types, which may in particular be handle-style data types.
With this arrangement, the differences between different types of computing devices are shielded at the back end of each UHHI standard computing interface.
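The following condensed C++ sketch illustrates the interface families listed above, using opaque handle types so that no vendor-specific type leaks through, in the spirit of the handle-style virtual data types of this section. The shape of every declaration is an illustrative assumption, not the actual UHHI specification.

    // Illustrative UHHI-style abstract device; all names are hypothetical.
    #include <cstddef>
    #include <cstdint>
    #include <string>

    using ModuleHandle = std::uintptr_t;  // loaded compute-core container
    using KernelHandle = std::uintptr_t;  // one compute-operator instance
    using StreamHandle = std::uintptr_t;  // ordered asynchronous execution queue
    using EventHandle  = std::uintptr_t;  // synchronization point for async calls
    using DevPtr       = std::uintptr_t;  // device memory address

    class UnifiedDevice {
    public:
        virtual ~UnifiedDevice() = default;
        // Device interface: initialization and capability query.
        virtual bool init() = 0;
        virtual std::string capabilities() const = 0;
        // Module / compute kernel interfaces: load compute cores, launch them.
        virtual ModuleHandle loadModule(const void* bytecode, std::size_t n) = 0;
        virtual KernelHandle getKernel(ModuleHandle m, const std::string& name) = 0;
        virtual void launch(KernelHandle k, StreamHandle s) = 0;
        // Stream / event interfaces: sequential async execution and sync.
        virtual StreamHandle createStream() = 0;
        virtual EventHandle recordEvent(StreamHandle s) = 0;
        virtual void waitEvent(EventHandle e) = 0;
        // Memory interfaces: allocation, reclamation, host-to-device copy.
        virtual DevPtr allocate(std::size_t bytes) = 0;
        virtual void release(DevPtr p) = 0;
        virtual bool copyToDevice(DevPtr dst, const void* src, std::size_t n) = 0;
    };

Because every type above is an opaque handle, a vendor back end can bind it to whatever driver object it uses internally; the instantiation happens only in that back end, exactly as described above.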
S360, in response to memory access instructions sent by the runtime system to the unified memory manager during the computation, performing, by the unified memory manager, unified memory allocation and reclamation on each target computing device.
In the technical solution of this embodiment, the AI computation graph generated by the AI computing framework is acquired and, with the unified computing abstraction layer as the base, the computing instructions issued by the upper-layer framework can be dispatched to different computing devices through an interface with consistent behavior, shielding the upper-layer framework from the computational differences between the devices of different chip vendors. Furthermore, according to the computational characteristics of the operators in each graph unit and the hardware characteristics of the device types in the AI computing platform, each graph unit is compiled into a computing bytecode unit matching its device type and provided to the runtime system of the AI computing framework, which copes with large-scale distributed and complex heterogeneous computing scenarios. Meanwhile, the differences in device memory usage between chip vendors are shielded from the upper-layer framework, finally enabling different chip vendors to be compatible with and reuse the mainstream AI computing frameworks.
To describe the embodiment of the present invention more visually, FIG. 4 shows a schematic diagram of heterogeneous computation and of the compiling, packing and distribution of heterogeneous computing bytecodes to which the method of the embodiment is applicable.
As shown in FIG. 4, graph unit 1 is one graph unit of the AI computation graph in intermediate-representation form. Graph unit 1 comprises load forms W and Y, a matrix multiplication operator, and an output O1: its computation operator is a matrix multiplication whose inputs are W and Y and whose output is O1. The loads W and Y are matrices of input form (2×3) and (3×2) respectively, and the output form of O1 is a (2×2) matrix.
Graph unit 2 comprises load forms O1 and h, an addition operator, and an output O2: its computation operator is an addition whose inputs are O1 and h and whose output is O2. The loads O1 and h are, respectively, a matrix of input form (2×2) and a constant, and the output form of O2 is a (2×2) matrix. Here O1 is a placeholder load: within graph unit 2 it means that the input at the O1 position of graph unit 2 is the output at the O1 position of graph unit 1.
Graph unit 3 comprises a load form O2, an activation function operator, and an output O3: its computation operator is an activation function whose input is O2 and whose output is O3. The load O2 is a matrix of input form (2×2) and the output form of O3 is a (2×2) matrix; here O2 is the placeholder load of graph unit 3.
As this example shows, the load form of an operator's input information may be a matrix load, a constant load, a placeholder load standing for the output of an upstream function, and so on.
Further, the AI computation graph in FIG. 4 is split into the graph units of FIG. 4; each graph unit is then compiled and packed into a computing bytecode unit recognizable by at least one type of computing device in the heterogeneous AI computing platform, and the heterogeneous bytecode distributor distributes and schedules each unit to its corresponding computing device. For example, graph unit 1 in FIG. 4 is first distributed and scheduled to the GPU: its information is compiled into a computing bytecode unit the GPU can recognize, and the GPU computation yields the output O1 of form (2×2). The result O1 is then passed over a data connection to the DSA as its placeholder load; graph unit 2 is distributed and scheduled to the DSA, its information is compiled into a DSA-recognizable computing bytecode unit, and the DSA computation yields the output O2 of form (2×2). Similarly, O2 is passed over a data connection to the CPU as its placeholder load; graph unit 3 is distributed and scheduled to the CPU, its information is compiled into a CPU-recognizable computing bytecode unit, and the CPU computation yields the output O3 of form (2×2).
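Reusing the hypothetical UnifiedDevice sketch from the previous section, the FIG. 4 data flow could be driven roughly as follows. Bytecode, dispatch and the kernel name "main" are invented stand-ins for the packed computing bytecode units and the runtime's scheduling logic, not the patent's actual runtime.

    // Illustrative driver for the FIG. 4 pipeline.
    struct Bytecode { const void* data; std::size_t size; };

    // Runs one packed unit on one device and returns its output's device address.
    DevPtr dispatch(UnifiedDevice& dev, const Bytecode& unit, DevPtr placeholder) {
        ModuleHandle m = dev.loadModule(unit.data, unit.size);
        KernelHandle k = dev.getKernel(m, "main");
        StreamHandle s = dev.createStream();
        (void)placeholder;                       // real code would bind O1/O2 here
        dev.launch(k, s);
        dev.waitEvent(dev.recordEvent(s));       // synchronize before the handoff
        return dev.allocate(4 * sizeof(float));  // stand-in for the (2x2) output
    }

    void runFig4(UnifiedDevice& gpu, UnifiedDevice& dsa, UnifiedDevice& cpu,
                 const Bytecode& u1, const Bytecode& u2, const Bytecode& u3) {
        DevPtr o1 = dispatch(gpu, u1, 0);   // unit 1: W(2x3) x Y(3x2) -> O1 (2x2)
        DevPtr o2 = dispatch(dsa, u2, o1);  // unit 2: O1 + h -> O2, O1 as placeholder
        DevPtr o3 = dispatch(cpu, u3, o2);  // unit 3: activation(O2) -> O3
        (void)o3;                           // O3 is returned to the runtime system
    }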
In this embodiment, as shown in FIG. 5, the AI computation graph generated by the AI computing framework is acquired and converted into an intermediate representation by the AI compiler; the graph in intermediate-representation form is split into graph units, each compiled into a computing bytecode unit recognizable by at least one type of computing device in the heterogeneous AI computing platform; the heterogeneous bytecode distributor provides the bytecode units to the runtime system of the AI computing framework; the runtime system, through the unified heterogeneous hardware access layer, calls the standard computing interfaces implemented in advance on each computing device so that the devices complete the computation. Finally, the runtime system receives each device's result and queries the unified memory manager whether the storage address required by the result satisfies its storage conditions, while the unified memory manager, based on the memory access instructions and the memory usage of each device, performs unified memory allocation and reclamation on the target computing devices and feeds the query information back to the runtime system to complete the storage of the results.
Example 3
FIG. 6 is a flowchart of a heterogeneous-scenario-oriented AI unified computing method according to the third embodiment of the present invention. This embodiment builds on the previous ones and refines the operation in which, in response to memory access instructions sent by the runtime system during the computation, the unified memory manager performs unified memory allocation and reclamation on each target computing device. As shown in FIG. 6, the method includes:
S610, acquiring an AI computation graph generated by the AI computing framework, and, after the AI computation graph is translated into an intermediate representation by an AI compiler, splitting the AI computation graph in intermediate-representation form into a plurality of graph units.
S620, compiling each graph unit into a computation bytecode unit that can be recognized by at least one type of computing device in the heterogeneous AI computing platform, and providing each computation bytecode unit to the runtime system in the AI computing framework.
The AI computing platform includes a plurality of standard computing interfaces defined by the unified computing abstraction layer, and these standard computing interfaces are implemented in advance on each computing device of the AI computing platform; after distributing and scheduling each computation bytecode unit to the matched target computing devices in the AI computing platform, the runtime system calls the standard computing interfaces to carry out the computation.
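As a hedged sketch of what such a pre-implemented standard computing interface could look like, the Python below defines an abstract interface and one vendor backend; the names (`StandardComputeInterface`, `init_device`, `query_capability`, `launch`, `GpuBackend`) are assumptions for illustration, since the disclosure defines the interfaces only abstractly.

```python
from abc import ABC, abstractmethod
from typing import Any

class StandardComputeInterface(ABC):
    # Assumed shape of a standard computing interface defined by the unified
    # computing abstraction layer and implemented in advance on each device.

    @abstractmethod
    def init_device(self) -> None:
        """Initialize the computing device."""

    @abstractmethod
    def query_capability(self) -> dict:
        """Return the device's capability description."""

    @abstractmethod
    def launch(self, bytecode: bytes, inputs: Any) -> Any:
        """Execute one computation bytecode unit and return its result."""

class GpuBackend(StandardComputeInterface):
    # One vendor's pre-implemented backend; the runtime system sees only the
    # interface above, so vendor differences never reach the AI framework.
    def init_device(self) -> None:
        pass  # driver setup would happen here
    def query_capability(self) -> dict:
        return {"memory_bytes": 16 << 30, "supports_fp16": True}
    def launch(self, bytecode: bytes, inputs: Any) -> Any:
        return inputs  # placeholder for the real kernel launch
```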
S630, in response to the runtime system's call, during computation, to the memory allocation and reclamation interface defined by the unified computing abstraction layer, performing unified memory allocation and reclamation on each target computing device by the unified memory manager; the memory allocation and reclamation interface is a virtual application program interface.
In this embodiment, compared with the conventional form of device memory management, the unified device memory management policy uses the unified hardware access layer to apply a higher level of abstraction to device memory allocation and reclamation, without directly calling a device driver API, so as to unify the memory management entry points of different computing devices. Concretely, the higher-level abstraction implements the memory allocation and reclamation functions on the basis of a virtual application program interface predefined in the unified computing abstraction layer.
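As an illustration only, such a virtual application program interface could take the following shape in Python; `MemoryAllocAndReclaimAPI`, `allocate`, and `reclaim` are assumed names, and a real implementation would bind them to the unified hardware access layer rather than to any vendor driver API.

```python
from abc import ABC, abstractmethod

class MemoryAllocAndReclaimAPI(ABC):
    # Assumed form of the virtual application program interface predefined
    # in the unified computing abstraction layer: a single allocation and
    # reclamation entry point shared by all computing devices, so callers
    # never invoke a vendor device-driver API directly.

    @abstractmethod
    def allocate(self, device_id: str, size: int) -> int:
        """Allocate `size` bytes on device `device_id`; return a device pointer."""

    @abstractmethod
    def reclaim(self, device_id: str, ptr: int) -> None:
        """Reclaim the block previously allocated at `ptr` on `device_id`."""
```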
In this embodiment, as shown in fig. 7, the heterogeneous bytecode distributor in the unified computing abstraction layer provides each computation bytecode unit to the runtime system in the AI computing framework. The runtime system invokes, through the unified heterogeneous hardware access layer, the standard computing interfaces implemented in advance on each computing device, so that the computing devices complete the computation work. Finally, the runtime system receives the computation result of each computing device and queries the unified memory manager as to whether the storage address required by the computation result satisfies its storage condition; the unified memory manager performs unified memory allocation and reclamation on each target computing device through the memory allocation and reclamation interface, based on the memory access instruction and the memory usage of each computing device, and feeds the query information back to the runtime system to complete the storage of the computation result.
In this embodiment, in the process of calling the virtual application program interface under the unified device memory management policy, device memory allocation is monitored and a device memory pool is initialized. A corresponding record is made in the memory pool for each allocation and reclamation action, including the size of the allocated memory block, the target device, and the device memory pointer actually allocated, and an initial memory pool is built accordingly, so that the memory pools of heterogeneous devices can be managed in a unified way.
Correspondingly, a computing-device memory pool is configured in the unified memory manager, and memory allocation records are stored in the memory pool separately for the different computing devices in the heterogeneous AI computing platform; each memory allocation record includes the planned cache blocks in the computing device and the current occupancy state of each planned cache block.
Specifically, if the heterogeneous AI computing platform has 5 computing devices in total, 5 isolated storage spaces may be created in the device memory pool to store the memory allocation record of each computing device. Specifically, a memory allocation record may use the size of a memory block as its key; the corresponding value is a list of cache-block pointers, each carrying a flag indicating whether that memory block is occupied. When all cache blocks in the list are occupied, the node corresponding to the current key is in a full state; otherwise it is in a not-full state. In addition, a binary search tree is built with the keys (memory block sizes) as nodes so that a suitable node can be located quickly, i.e., a cache lookup.
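The record structure just described can be sketched as follows; this is a simplified illustration in which a sorted key list queried with `bisect` stands in for the binary search tree over block sizes, and all identifiers are assumptions.

```python
import bisect

class MemoryAllocationRecord:
    # Sketch of one per-device record in the device memory pool. Each key is
    # a memory-block size; its value is a list of [pointer, occupied] entries
    # for the planned cache blocks of that size.

    def __init__(self):
        self.blocks = {}   # size -> list of [device_ptr, occupied_flag]
        self.sizes = []    # keys kept sorted, standing in for the BST

    def add_block(self, size: int, ptr: int) -> None:
        if size not in self.blocks:
            bisect.insort(self.sizes, size)
            self.blocks[size] = []
        self.blocks[size].append([ptr, False])

    def is_full(self, size: int) -> bool:
        # A node is "full" when every cache block under its key is occupied.
        return all(occupied for _, occupied in self.blocks.get(size, []))

    def smallest_free_at_least(self, size: int):
        # Cache lookup: the smallest key >= size that still has a free block.
        start = bisect.bisect_left(self.sizes, size)
        for key in self.sizes[start:]:
            if not self.is_full(key):
                return key
        return None
```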
Optionally, in response to the call of the runtime system to the memory allocation and reclamation interface defined by the unified computing abstraction layer in the process of implementing computation, the unified memory manager performs unified memory allocation and reclamation on each target computing device, and may include:
In response to the runtime system's call, during computation, to the memory allocation and reclamation interface defined by the unified computing abstraction layer, the unified memory manager acquires a memory allocation request from the first computing device. The unified memory manager queries the device memory pool for the first memory allocation record matching the first computing device, and determines whether a target cache block matching the memory size required by the memory allocation request exists in the first computing device. When the unified memory manager determines that the target cache block exists, it allocates the target cache block to the first computing device and updates the first memory allocation record. When the unified memory manager determines that the target cache block does not exist, if it queries that remaining memory satisfying the memory allocation request exists in the first computing device, it plans a target cache block of the required memory size in the remaining memory, allocates it to the first computing device, and updates the first memory allocation record.
Here, the first computing device is a computing device that performs computation, obtains a computation result, and needs to store that computation result; further, a planned cache block is a cache block planned in the first computing device for storing the computation result; further, the occupancy state includes an occupied state and an unoccupied state: if cached data exists in the current cache block, the occupancy state of that cache block is occupied, otherwise it is unoccupied.
Further, after the unified memory manager determines that the target cache block does not exist, the method may further include: if the unified memory manager queries that no remaining memory satisfying the memory allocation request exists in the first computing device, searching the first memory allocation record for the smallest cache block that is larger than the required memory size and in an unoccupied state; if the unified memory manager finds the smallest cache block, allocating the smallest cache block to the first computing device and updating the first memory allocation record; if the smallest cache block is not found, the unified memory manager queries the first memory allocation record for unoccupied idle cache blocks, releases those cache blocks in the first computing device, and then updates the first memory allocation record; the unified memory manager then queries again whether remaining memory satisfying the memory allocation request exists in the first computing device; and if the unified memory manager queries that remaining memory satisfying the memory allocation request exists in the first computing device, planning a target cache block of the required memory size in the remaining memory, allocating it to the first computing device, and updating the first memory allocation record.
Further, after the unified memory manager queries again whether remaining memory satisfying the memory allocation request exists in the first computing device, the method may further include: if the unified memory manager queries that no remaining memory satisfying the memory allocation request exists in the first computing device, performing memory optimization on the first computing device according to the access frequency of each planned cache block in the first memory allocation record; after the unified memory manager completes the memory optimization of the first computing device, continuing to query whether remaining memory satisfying the memory allocation request exists in the first computing device; and if the unified memory manager queries that remaining memory satisfying the memory allocation request exists in the first computing device, planning a target cache block of the required memory size in the remaining memory, allocating it to the first computing device, and updating the first memory allocation record.
The optimization operation may be a memory move operation, that is: the cache blocks that are not frequently used are moved from the device memory to the host memory.
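A minimal sketch of this memory-move operation follows, assuming device memory and access counts are modeled as plain dictionaries; the function name and threshold are illustrative.

```python
def evict_cold_blocks(device_blocks: dict, access_count: dict, min_hits: int = 2):
    # Assumed form of the memory-move optimization: cache blocks accessed
    # fewer than `min_hits` times are moved from device memory (modeled as a
    # pointer -> data dict) into host memory, freeing device space for the
    # pending allocation request.
    host_memory = {}
    for ptr in list(device_blocks):
        if access_count.get(ptr, 0) < min_hits:
            host_memory[ptr] = device_blocks.pop(ptr)  # move device -> host
    return host_memory
```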
Further, after the unified memory manager continues to query whether remaining memory satisfying the memory allocation request exists in the first computing device, the method may further include: if the unified memory manager still finds no remaining memory satisfying the memory allocation request in the first computing device, acquiring an associated computing device matching the first computing device in the heterogeneous AI computing platform; the unified memory manager then performs the operation of querying the device memory pool for the associated memory allocation record matching the associated computing device, attempting to allocate memory in the associated computing device to the first computing device.
Here, the unified memory manager acquiring the associated computing device matching the first computing device in the heterogeneous AI computing platform includes: the unified memory manager acquiring an associated computing device that is located in the same cabinet as, and is of the same type as, the first computing device in the heterogeneous AI computing platform.
Optionally, in response to the runtime system's call, during computation, to the memory allocation and reclamation interface defined by the unified computing abstraction layer, the unified memory manager performing unified memory allocation and reclamation on each target computing device may include: in response to that call, the unified memory manager acquiring a memory release request for the second computing device, and obtaining the released memory block from the memory release request; the unified memory manager querying the device memory pool for the second memory allocation record matching the second computing device, and updating the current occupancy state of the released memory block to unoccupied in the second memory allocation record.
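The release path can be illustrated with the following sketch, which assumes the device memory pool is modeled as a dictionary of per-device allocation records; all names are illustrative.

```python
def release_block(device_memory_pool: dict, device_id: str, ptr: int) -> bool:
    # Sketch of the release path: look up the second computing device's
    # memory allocation record in the device memory pool and flip the
    # released block's current occupancy state to unoccupied; the block
    # stays planned so it can be reused without another driver-level call.
    record = device_memory_pool[device_id]  # size -> list of [ptr, occupied]
    for entries in record.values():
        for entry in entries:
            if entry[0] == ptr:
                entry[1] = False
                return True
    return False  # the released memory block was not found in the record
```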
Specifically, fig. 8 is a flowchart of a method for unified memory allocation and reclamation on target computing devices by the unified memory manager according to the third embodiment of the present invention, as described below:
First, in response to the runtime system's call, during computation, to the memory allocation and reclamation interface defined by the unified computing abstraction layer, the unified memory manager obtains a memory allocation request from the first computing device for a memory block of a set size. The unified memory manager then determines whether a free target cache block meeting the requirement exists in the first computing device: if it exists, the target cache block is directly allocated to the first computing device to store the corresponding computation result, and its state in the first memory allocation record corresponding to the first computing device is marked as occupied. If it does not exist, the unified memory manager searches the first computing device for the smallest cache block that is larger than the required memory size and in an unoccupied state; if such a block exists, it is allocated to the first computing device to store the corresponding computation result, and its state in the first memory allocation record is marked as occupied. If no such block exists, the memory of the first computing device is released, that is, cache blocks that are not fully occupied are released and the corresponding release states are updated in the first memory allocation record; the unified memory manager then queries again whether remaining memory satisfying the memory allocation request exists in the first computing device, and if a target cache block can be obtained, it is allocated to the first computing device to store the corresponding computation result and marked as occupied in the first memory allocation record. If not, the unified memory manager continues to query whether remaining memory satisfying the memory allocation request, namely unplanned cache space, exists in the first computing device; if a target cache block can be planned there, it is allocated in the first computing device to store the corresponding computation result and marked as occupied in the first memory allocation record. Finally, if no target cache block can be obtained, the associated computing device matching the first computing device in the heterogeneous AI computing platform is acquired, and an attempt is made to allocate memory in the associated computing device to the first computing device.
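Condensing the fig. 8 decision chain into code, the following self-contained toy walks the same fallback order (exact planned block, smallest larger free block, unplanned remaining memory, associated device); block release and memory optimization are omitted for brevity, and every name is illustrative.

```python
class ToyDevice:
    # Toy model of one computing device's memory state, used only to walk
    # through the fig. 8 decision chain.
    def __init__(self, unplanned: int, peer=None):
        self.unplanned = unplanned  # remaining memory not yet planned
        self.free = {}              # block size -> count of free planned blocks
        self.peer = peer            # associated device (same cabinet and type)

def allocate(dev: ToyDevice, size: int):
    # 1. A free planned target cache block of exactly the required size?
    if dev.free.get(size, 0) > 0:
        dev.free[size] -= 1
        return ("planned", size)
    # 2. Otherwise the smallest free planned block larger than the request.
    bigger = sorted(k for k, n in dev.free.items() if k > size and n > 0)
    if bigger:
        dev.free[bigger[0]] -= 1
        return ("planned", bigger[0])
    # 3. Otherwise plan a new block from the remaining unplanned memory
    #    (block release and memory optimization, omitted here, would run
    #    before this re-query in the full flow).
    if dev.unplanned >= size:
        dev.unplanned -= size
        return ("new", size)
    # 4. Finally fall back to the associated device in the same cabinet.
    if dev.peer is not None:
        return allocate(dev.peer, size)
    return None  # the allocation request cannot be satisfied

# e.g. allocate(ToyDevice(1024, peer=ToyDevice(4096)), 2048) succeeds on the peer
```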
According to the technical scheme of this embodiment, a unified computing abstraction layer is configured between the AI computing framework and the heterogeneous AI computing platform and serves as a base, so that the computation instructions issued by the upper-layer AI computing framework can be dispatched to different computing devices through interfaces with consistent behavior, shielding the upper-layer AI computing framework from the computational differences among computing devices of different chip manufacturers. In addition, the AI compiler converts the AI computation graph into computation bytecode units distributed among a plurality of homogeneous or heterogeneous computing devices, which can cope with large-scale distributed scenarios and complex heterogeneous computing scenarios. Meanwhile, in response to a memory access instruction sent by the runtime system to the unified memory manager during computation, the unified memory manager performs unified memory allocation and reclamation on each target computing device, which shields the upper-layer AI computing framework from the differences in device memory usage among chip manufacturers. Finally, compatibility and reuse of the mainstream AI computing frameworks by different chip manufacturers can be achieved.
Example IV
Fig. 9 is a diagram of an AI unified computing apparatus for heterogeneous scenarios according to a fourth embodiment of the present invention; the apparatus is implemented by a unified computing abstraction layer configured between an AI computing framework and a heterogeneous AI computing platform.
As shown in fig. 9, the apparatus includes:
the graph unit generating module 910, configured to acquire an AI computation graph generated by the AI computing framework and, after the AI computation graph is translated into an intermediate representation by an AI compiler, split the AI computation graph in intermediate-representation form into a plurality of graph units;
the graph unit compiling and distributing module 920 is configured to compile each graph unit into a computing bytecode unit that can be identified by at least one type of computing device in the heterogeneous AI computing platform, and provide each computing bytecode unit to a runtime system in the AI computing framework;
the AI computing platform comprises a plurality of standard computing interfaces defined by a unified computing abstract layer, wherein the standard computing interfaces defined by the unified computing abstract layer are realized in each computing device of the AI computing platform in advance; the runtime system calls each standard computing interface to implement computation after distributing and scheduling each computing byte code unit to matched target computing equipment in an AI computing platform;
the memory allocation and reclamation unit 930 is configured to respond to a memory access instruction sent by the runtime system to the unified memory manager in the process of performing computation, where the unified memory manager performs unified memory allocation and reclamation on each target computing device.
According to the technical scheme of this embodiment, the method is executed by a unified computing abstraction layer configured between the AI computing framework and the heterogeneous AI computing platform: an AI computation graph generated by the AI computing framework is acquired, the AI compiler translates the AI computation graph into an intermediate representation, the graph in intermediate-representation form is split into a plurality of graph units, each graph unit is compiled into a computation bytecode unit that can be recognized by at least one type of computing device in the heterogeneous AI computing platform, and each computation bytecode unit is provided to the runtime system in the AI computing framework. Finally, in response to a memory access instruction sent by the runtime system to the unified memory manager during computation, the unified memory manager performs unified memory allocation and reclamation on each target computing device. This realizes the computation of the AI computing framework in heterogeneous scenarios, mitigates the influence of heterogeneous scenarios on the AI computing framework, simplifies the difficulty of heterogeneous computing, and improves the computing performance of the AI computing framework.
On the basis of the above embodiment, the graph unit generation module 910 includes:
the computation graph translation unit, configured to translate the AI computation graph into an intermediate representation using the AI compiler, and to infer the input/output data type and input/output data size of each computation operator in the AI computation graph from the intermediate representation;
the computation graph splitting unit, configured to split the AI computation graph in intermediate-representation form into a plurality of graph units according to the input/output data types and input/output data sizes of the computation operators in the AI computation graph;
wherein each graph unit comprises at least one computation operator and records the operator type, input/output form, and load form of each computation operator.
Based on the above embodiment, the graph unit compiling and distributing module 920 includes:
the device type determining unit is used for determining the type of the computing device corresponding to each graph unit according to the computing characteristics of each computing operator in the graph unit and the hardware characteristics of each type of computing device in the AI computing platform;
and the graph unit compiling unit is used for compiling each graph unit into a calculation byte code unit matched with the type of the computing device according to the compiling rule matched with the type of the computing device of each graph unit.
Based on the above embodiments, the graph unit compiling and distributing module 920 is specifically configured to:
the runtime system dispatches each computation bytecode unit to multiple target computing devices of the same or different types in the same cabinet, or to multiple target computing devices of the same or different types in different cabinets, to achieve homogeneous and/or heterogeneous computing.
Based on the above embodiment, the memory allocation and reclamation unit 930 includes:
the memory allocation and reclamation unit, configured to respond to the runtime system's call, during computation, to the memory allocation and reclamation interface defined by the unified computing abstraction layer, with the unified memory manager performing unified memory allocation and reclamation on each target computing device; the memory allocation and reclamation interface is a virtual application program interface.
Based on the above embodiment, the memory allocation and reclamation unit further includes:
the first memory allocation request acquisition unit is used for responding to the call of the running system to the memory allocation and recovery interface defined by the unified calculation abstract layer in the calculation implementation process, and the unified memory manager acquires the memory allocation request of the first computing device;
The unified memory manager is configured with a memory pool of the computing equipment, and memory allocation records are stored in the memory pool of the computing equipment for different computing equipment in the heterogeneous AI computing platform respectively; the memory allocation record comprises a planned cache block in the computing equipment and the current occupied state of the planned cache block;
the first target cache block determining unit is used for inquiring a first memory allocation record matched with the first computing device in the device memory pool by the unified memory manager and determining whether a target cache block matched with the memory size required by the memory allocation request exists in the first computing device or not;
the first memory allocation recording unit is used for allocating the target cache block to the first computing equipment and updating the first memory allocation record when the unified memory manager determines that the target cache block exists;
and the first memory allocation record updating unit, configured to, when the unified memory manager determines that the target cache block does not exist and queries that remaining memory satisfying the memory allocation request exists in the first computing device, plan a target cache block of the required memory size in the remaining memory, allocate it to the first computing device, and update the first memory allocation record.
On the basis of the above embodiment, the first memory allocation record updating unit further includes:
the minimum cache block searching unit is used for searching a minimum cache block which is larger than the memory size and is in an unoccupied state in the first memory allocation record if the unified memory manager inquires that the first computing device does not have residual memory meeting the memory allocation request;
the minimum cache block allocation unit is used for allocating the minimum cache block to the first computing equipment and updating a first memory allocation record if the minimum cache block is found by the unified memory manager;
the cache block releasing unit, configured to, if the unified memory manager does not find the smallest cache block, query the first memory allocation record for unoccupied idle cache blocks, release those cache blocks in the first computing device, and update the first memory allocation record;
the remaining memory inquiry unit is used for inquiring whether the remaining memory meeting the memory allocation request exists in the first computing device again by the unified memory manager;
and the remaining memory planning unit, configured to, if the unified memory manager queries that remaining memory satisfying the memory allocation request exists in the first computing device, plan a target cache block of the required memory size in the remaining memory, allocate it to the first computing device, and update the first memory allocation record.
On the basis of the above embodiment, the remaining memory query unit further includes:
the memory optimization unit is used for performing memory optimization on the first computing equipment according to the access frequency of each planned cache block in the first memory allocation record if the unified memory manager inquires that the first computing equipment does not have residual memory meeting the memory allocation request;
the remaining-memory secondary query unit, configured to continue querying whether remaining memory satisfying the memory allocation request exists in the first computing device after the unified memory manager completes the memory optimization of the first computing device;
and configured to, if the unified memory manager queries that remaining memory satisfying the memory allocation request exists in the first computing device, plan a target cache block of the required memory size in the remaining memory, allocate it to the first computing device, and update the first memory allocation record.
On the basis of the above embodiment, the remaining memory secondary query unit further includes:
the associated computing device matching unit is used for acquiring associated computing devices matched with the first computing device from the heterogeneous AI computing platform if the unified memory manager does not inquire that the first computing device has residual memory meeting the memory allocation request;
and the associated memory allocation unit, configured to have the unified memory manager perform the operation of querying the device memory pool for the associated memory allocation record matching the associated computing device, attempting to allocate memory in the associated computing device to the first computing device.
On the basis of the above embodiment, the association computing device matching unit further includes:
and the associated computing device acquisition unit is used for acquiring associated computing devices which are in the same cabinet as the first computing device and belong to the same type in the heterogeneous AI computing platform by the unified memory manager.
Based on the above embodiment, the memory allocation and reclamation unit further includes:
the second memory block releasing unit, configured to respond to the runtime system's call, during computation, to the memory allocation and reclamation interface defined by the unified computing abstraction layer, with the unified memory manager acquiring a memory release request for the second computing device and obtaining the released memory block from the memory release request;
the second memory allocation unit, configured to have the unified memory manager query the device memory pool for the second memory allocation record matching the second computing device, and update the current occupancy state of the released memory block to unoccupied in the second memory allocation record.
The heterogeneous scenario-oriented AI unified computing apparatus provided by the embodiment of the present invention can execute the heterogeneous scenario-oriented AI unified computing method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example V
Fig. 10 shows a schematic structural diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 10, the electronic device 10 includes at least one processor 11, at least one AI computing unit 12 (which may be understood as an AI chip), and memories communicatively connected to the at least one processor 11 and the at least one AI computing unit 12, such as a read-only memory (ROM) 13 and a random access memory (RAM) 14, in which a computer program executable by the at least one processor 11 or the at least one AI computing unit 12 is stored. The processor 11 or the AI computing unit 12 may perform various appropriate actions and processes according to the computer program stored in the ROM 13 or loaded from the storage unit 19 into the RAM 14. The RAM 14 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the AI computing unit 12, the ROM 13, and the RAM 14 are connected to each other via a bus 15; an input/output (I/O) interface 16 is also connected to the bus 15. The AI computing unit 12 includes a memory module, which corresponds to the device memory of each computing device in the AI computing framework in the embodiments of the present invention.
Various components in the electronic device 10 are connected to the I/O interface 16, including: an input unit 17 such as a keyboard, a mouse, etc.; an output unit 18 such as various types of displays, speakers, and the like; a storage unit 19 such as a magnetic disk, an optical disk, or the like; and a communication unit 20 such as a network card, modem, wireless communication transceiver, etc. The communication unit 20 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The processor 11 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU). Some examples of the AI computing unit 12 include, but are not limited to, a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any other suitable processors, controllers, microcontrollers, and the like. The processor 11 or the AI computing unit 12 performs the respective methods and processes described above, such as the heterogeneous scenario-oriented AI unified computing method.
Accordingly, the method is performed by a unified computing abstraction layer configured between an AI computing framework and a heterogeneous AI computing platform, comprising:
Acquiring an AI computation graph generated by the AI computing framework, and, after the AI computation graph is translated into an intermediate representation by an AI compiler, splitting the AI computation graph in intermediate-representation form into a plurality of graph units;
compiling each graph unit into a calculation byte code unit which can be identified by at least one type of computing equipment in the heterogeneous AI computing platform, and providing each calculation byte code unit for a runtime system in the AI computing framework;
the AI computing platform comprises a plurality of standard computing interfaces defined by a unified computing abstract layer, wherein the standard computing interfaces defined by the unified computing abstract layer are realized in each computing device of the AI computing platform in advance; the runtime system calls each standard computing interface to implement computation after distributing and scheduling each computing byte code unit to matched target computing equipment in an AI computing platform;
and in response to a memory access instruction sent by the runtime system to the unified memory manager in the process of implementing the calculation, the unified memory manager performs unified memory allocation and reclamation on each target computing device.
In some embodiments, the heterogeneous scenario-oriented AI unified computing method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 19. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 13 and/or the communication unit 20. When the computer program is loaded into the RAM 14 and executed by the processor 11 or the AI computing unit 12, one or more steps of the heterogeneous scenario-oriented AI unified computing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the heterogeneous scenario-oriented AI unified computing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network; the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and virtual private server (VPS) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.

Claims (16)

1. An artificial intelligence (AI) unified computing method oriented to heterogeneous scenarios, executed by a unified computing abstraction layer configured between an AI computing framework and a heterogeneous AI computing platform, characterized by comprising the following steps:
acquiring an AI computation graph generated by the AI computing framework, and, after the AI computation graph is translated into an intermediate representation by an AI compiler, splitting the AI computation graph in intermediate-representation form into a plurality of graph units;
compiling each graph unit into a calculation byte code unit which can be identified by at least one type of computing equipment in the heterogeneous AI computing platform, and providing each calculation byte code unit for a runtime system in the AI computing framework;
the AI computing platform comprises a plurality of standard computing interfaces defined by a unified computing abstract layer, wherein the standard computing interfaces defined by the unified computing abstract layer are realized in each computing device of the AI computing platform in advance; the runtime system calls each standard computing interface to implement computation after distributing and scheduling each computing byte code unit to matched target computing equipment in an AI computing platform;
And in response to a memory access instruction sent by the runtime system to the unified memory manager in the process of implementing the calculation, the unified memory manager performs unified memory allocation and reclamation on each target computing device.
2. The method of claim 1, wherein, after the AI computation graph is translated into an intermediate representation using an AI compiler, splitting the AI computation graph in intermediate-representation form into a plurality of graph units comprises:
using the AI compiler to translate the AI computation graph into an intermediate representation, and inferring the input/output data type and input/output data size of each computation operator in the AI computation graph from the intermediate representation;
splitting the AI computation graph in intermediate-representation form into a plurality of graph units according to the input/output data types and input/output data sizes of the computation operators in the AI computation graph;
wherein each graph unit comprises at least one computation operator and records the operator type, input/output form, and load form of each computation operator.
3. The method of claim 2, wherein compiling each graph element into a computational bytecode element recognizable by at least one type of computing device in the heterogeneous AI computing platform comprises:
Determining the type of the computing device corresponding to each graph unit according to the computing characteristics of each computing operator in the graph unit and the hardware characteristics of each type of computing device in the AI computing platform;
each graph unit is compiled into a computing byte code unit matching the computing device type according to a compiling rule matching the computing device type of each graph unit.
4. The method of claim 1, wherein the manner in which the runtime system dispatches each compute bytecode unit to a matching target computing device in the AI computing platform comprises:
the runtime system dispatches each computation bytecode unit to multiple target computing devices of the same or different types in the same cabinet, or to multiple target computing devices of the same or different types in different cabinets, to achieve homogeneous and/or heterogeneous computing.
5. The method of claim 1, wherein the type of standard computing interface comprises at least one of:
a device interface for implementing computing device initialization and computing device capability queries;
the module interface is used for managing the loading of the computing core, the release of the computing core, the caching of the computing core instance and the acceleration of the calling of the computing equipment;
A compute kernel interface for implementing a compute operator operation instance;
an event interface for implementing event synchronization under asynchronous operation;
a flow interface for implementing sequential execution of the plurality of asynchronous compute core calls;
a device memory address interface for pointing to a block of memory allocated on the computing device and enabling a jump access;
a device buffer interface for allocating and reclaiming buffer areas on a computing device;
a memory interface for implementing memory allocation and copying between the computing device and the host; and
and the exception handling interface is used for converting error codes generated in the heterogeneous AI computing platform into platform-independent error codes and presenting them in the form of recognizable character strings.
6. The method of claim 5, wherein each of the standard computing interfaces is implemented using a computing device independent virtual data type.
7. The method of any of claims 1-6, wherein in response to a memory access instruction sent by the runtime system to the unified memory manager in performing the computation, the unified memory manager performs unified memory allocation and reclamation for each target computing device, comprising:
in response to the runtime system's call, during computation, to the memory allocation and reclamation interface defined by the unified computing abstraction layer, performing, by the unified memory manager, unified memory allocation and reclamation on each target computing device; wherein the memory allocation and reclamation interface is a virtual application program interface.
8. The method of claim 7, wherein in response to a call by the runtime system to the unified compute abstraction layer defined memory allocation and reclamation interface during the implementation of the computation, the unified memory manager performs unified memory allocation and reclamation for each target computing device, comprising:
responding to the call of the running system to the memory allocation and recovery interface defined by the unified computing abstraction layer in the process of implementing the computation, and acquiring a memory allocation request of the first computing device by the unified memory manager;
the unified memory manager is configured with a memory pool of the computing equipment, and memory allocation records are stored in the memory pool of the computing equipment for different computing equipment in the heterogeneous AI computing platform respectively; the memory allocation record comprises a planned cache block in the computing equipment and the current occupied state of the planned cache block;
The unified memory manager queries a first memory allocation record matched with the first computing device in the device memory pool, and determines whether a target cache block matched with the memory size required by the memory allocation request exists in the first computing device;
when the unified memory manager determines that the target cache block exists, the unified memory manager distributes the target cache block to the first computing equipment and updates a first memory distribution record;
when the unified memory manager determines that the target cache block does not exist, if the first computing device is queried that the residual memory meeting the memory allocation request exists, the target cache block which is planned to be in the memory size in the residual memory is allocated to the first computing device, and the first memory allocation record is updated.
9. The method of claim 8, wherein after the unified memory manager determines that the target cache block is not present, further comprising:
if the unified memory manager queries that no remaining memory satisfying the memory allocation request exists in the first computing device, searching the first memory allocation record for a smallest cache block that is larger than the required memory size and in an unoccupied state;
if the unified memory manager finds the minimum cache block, the unified memory manager distributes the minimum cache block to the first computing equipment and updates a first memory distribution record;
If the minimum cache block is not found, the unified memory manager queries an unoccupied idle cache block in a first memory allocation record, and updates the first memory allocation record after releasing the cache block in the first computing device;
the unified memory manager again queries whether the first computing device has residual memory meeting the memory allocation request;
and if the unified memory manager inquires that the residual memory meeting the memory allocation request exists in the first computing device, allocating the target cache block which is planned to be of the memory size from the residual memory to the first computing device, and updating a first memory allocation record.
10. The method of claim 9, further comprising, after the unified memory manager re-queries whether there is remaining memory in the first computing device that satisfies the memory allocation request:
if the unified memory manager inquires that the first computing device does not have residual memory meeting the memory allocation request, performing memory optimization on the first computing device according to the access frequency of each planned cache block in the first memory allocation record;
after the unified memory manager completes the memory optimization of the first computing device, continuously inquiring whether the first computing device has residual memory meeting the memory allocation request;
And if the unified memory manager inquires that the residual memory meeting the memory allocation request exists in the first computing device, allocating the target cache block which is planned to be of the memory size from the residual memory to the first computing device, and updating a first memory allocation record.
11. The method of claim 10, further comprising, after the unified memory manager continues to query whether there is remaining memory in the first computing device that satisfies the memory allocation request:
if the unified memory manager does not inquire that the residual memory meeting the memory allocation request exists in the first computing device, acquiring associated computing devices matched with the first computing device from the heterogeneous AI computing platform;
the unified memory manager performs an operation of querying a pool of device memory for an associated memory allocation record matching the associated computing device, attempting to allocate memory in the associated computing device to the first computing device.
12. The method of claim 11, wherein the unified memory manager obtains the associated computing device in the heterogeneous AI computing platform that matches the first computing device, comprising:
the unified memory manager obtains associated computing devices in the same cabinet and of the same type as the first computing device in the heterogeneous AI computing platform.
13. The method of claim 8, wherein in response to a call by the runtime system to the unified compute abstraction layer defined memory allocation and reclamation interface during the implementation of the computation, the unified memory manager performs unified memory allocation and reclamation for each target computing device, comprising:
in response to the runtime system's call, during computation, to the memory allocation and reclamation interface defined by the unified computing abstraction layer, acquiring, by the unified memory manager, a memory release request for the second computing device, and obtaining the released memory block from the memory release request;
the unified memory manager queries a second memory allocation record matched with the second computing device in the device memory pool, and updates the current occupied state of the released memory block to an unoccupied state in the second memory allocation record.
14. An artificial intelligence (AI) unified computing apparatus oriented to heterogeneous scenarios, implemented by a unified computing abstraction layer configured between an AI computing framework and a heterogeneous AI computing platform, characterized by comprising:
a graph unit generating module, configured to acquire an AI computation graph generated by the AI computing framework, and to split the AI computation graph in intermediate-representation form into a plurality of graph units after an AI compiler translates the AI computation graph into an intermediate representation;
a graph unit compiling and distributing module, configured to compile each graph unit into a computation bytecode unit that can be recognized by at least one type of computing device in the heterogeneous AI computing platform, and to provide each computation bytecode unit to a runtime system in the AI computing framework;
the AI computing platform comprises a plurality of standard computing interfaces defined by a unified computing abstract layer, wherein the standard computing interfaces defined by the unified computing abstract layer are realized in each computing device of the AI computing platform in advance; the runtime system calls each standard computing interface to implement computation after distributing and scheduling each computing byte code unit to matched target computing equipment in an AI computing platform;
the memory allocation and recovery unit is used for responding to a memory access instruction sent by the runtime system to the unified memory manager in the process of implementing calculation, and the unified memory manager performs unified memory allocation and recovery on each target computing device.
15. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform a heterogeneous scenario-oriented artificial intelligence AI unified computing method of any of claims 1-13.
16. A computer readable storage medium, wherein the computer readable storage medium stores computer instructions for causing a processor to implement the heterogeneous scenario-oriented artificial intelligence AI unified computing method of any of claims 1-13 when executed.
CN202310348238.2A 2023-04-04 2023-04-04 Heterogeneous scene-oriented AI unified computing method, device, equipment and medium Active CN116149797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310348238.2A CN116149797B (en) 2023-04-04 2023-04-04 Heterogeneous scene-oriented AI unified computing method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310348238.2A CN116149797B (en) 2023-04-04 2023-04-04 Heterogeneous scene-oriented AI unified computing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN116149797A true CN116149797A (en) 2023-05-23
CN116149797B CN116149797B (en) 2023-07-07

Family

ID=86340922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310348238.2A Active CN116149797B (en) 2023-04-04 2023-04-04 Heterogeneous scene-oriented AI unified computing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116149797B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463709A (en) * 2019-09-09 2021-03-09 上海登临科技有限公司 Configurable heterogeneous artificial intelligence processor
WO2022042113A1 (en) * 2020-08-28 2022-03-03 Oppo广东移动通信有限公司 Data processing method and apparatus, and electronic device and storage medium
US20220121498A1 (en) * 2020-10-15 2022-04-21 NEC Laboratories Europe GmbH Combination of multiple data processing and machine learning frameworks for a target hardware
CN112529175A (en) * 2020-11-05 2021-03-19 上海交通大学 Compiling method and system of neural network, computer storage medium and compiling device
WO2022198636A1 (en) * 2021-03-26 2022-09-29 珠海全志科技股份有限公司 Memory allocation method for ai processor, computer device, and computer-readable storage medium
CN113867950A (en) * 2021-09-26 2021-12-31 浪潮电子信息产业股份有限公司 Unified heterogeneous computing system and AI acceleration platform
EP4123514A2 (en) * 2021-12-10 2023-01-25 Beijing Baidu Netcom Science Technology Co., Ltd. Access method and apparatus, electronic device and computer storage medium
US20220222584A1 (en) * 2022-04-01 2022-07-14 Yamini Nimmagadda Heterogeneous compute-based artificial intelligence model partitioning
CN115686527A (en) * 2022-11-01 2023-02-03 深圳思谋信息科技有限公司 Compiling method and device based on operator, computer equipment and storage medium
CN115495095A (en) * 2022-11-18 2022-12-20 上海燧原科技有限公司 Whole program compiling method, device, equipment, medium and cluster of tensor program

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116560666A (en) * 2023-07-10 2023-08-08 上海燧原科技有限公司 AI front end unified computing method, device and medium based on multi-level code generation
CN116560666B (en) * 2023-07-10 2023-09-22 上海燧原科技有限公司 AI front end unified computing method, device and medium based on multi-level code generation
CN117331541A * 2023-10-27 2024-01-02 北京智源人工智能研究院 Compiling and running method and device for dynamic graph framework and heterogeneous chip

Also Published As

Publication number Publication date
CN116149797B (en) 2023-07-07

Similar Documents

Publication Title
CN116149797B (en) Heterogeneous scene-oriented AI unified computing method, device, equipment and medium
CN111124656B (en) Method, apparatus, and computer readable storage medium for assigning tasks to dedicated computing resources
CN108062246B (en) Resource scheduling method and device for deep learning framework
US9684493B2 (en) R-language integration with a declarative machine learning language
CN110033091B (en) Model-based prediction method and device
CN112698921B (en) Logic code operation method, device, computer equipment and storage medium
CN113504902B (en) Industrial APP integrated development system and related equipment
CN104657149A (en) Software framework implementation method of management module of storage system
CN113626128B (en) Audio-visual media micro-service third-party module access method, system and electronic equipment
CN113988299A (en) Deployment method and system of an inference server supporting multiple models and multiple chips, and electronic equipment
CN104317578A (en) Method and device for engine Lua script application and mutual calling between engine and Lua script
US9229980B2 (en) Composition model for cloud-hosted serving applications
CN115510358A (en) Method and system for realizing business workflow by combining process engine with dynamic form
CN105100180A (en) Cluster node dynamic loading method, device and system
CN113448650A (en) Live broadcast function plug-in loading method, device, equipment and storage medium
CN116932147A (en) Streaming job processing method and device, electronic equipment and medium
CN105653347A (en) Server, resource management method and virtual machine manager
CN111611089A (en) Asynchronous declarative microservice scheduling method
CN110196879B (en) Data processing method, device, computing equipment and storage medium
CN110018831A (en) Program processing method, device and related product
CN116192670A (en) Environment deployment method, device, equipment and medium
CN110442753A (en) Graph database automatic creation method and device based on OPC UA
US11435989B2 (en) Thread-local return structure for asynchronous state machine
US8510530B1 (en) Memory management for programs operating asynchronously
KR20010110097A (en) Archiving in workflow-management-systems

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306

Patentee after: Shanghai Suiyuan Technology Co.,Ltd.

Country or region after: China

Address before: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306

Patentee before: SHANGHAI ENFLAME TECHNOLOGY Co.,Ltd.

Country or region before: China