CN114625370A - Method, device and heterogeneous system for data layout between host and device

Info

Publication number
CN114625370A
CN114625370A
Authority
CN
China
Prior art keywords
data
layout
host
degree
comparison result
Prior art date
Legal status
Pending
Application number
CN202011463516.1A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority claimed from CN202011463516.1A
Publication of CN114625370A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The present disclosure provides a method, a device, and a heterogeneous system for data layout between a host and a device. The device may be included in the computing processing apparatus of a combined processing apparatus, and the computing processing apparatus may include one or more data processing devices. The combined processing apparatus may further include an interface apparatus and other processing apparatuses, with which the computing processing apparatus interacts to jointly complete computing operations specified by a user. The combined processing apparatus may also include a storage apparatus connected to the device and to the other processing apparatuses, respectively, for storing data of the device and of the other processing apparatuses. With the scheme of the present disclosure, data layout between a host and a device can be achieved effectively.

Description

Method, device and heterogeneous system for data layout between host and device
Technical Field
The present disclosure relates generally to the field of computers. More particularly, the present disclosure relates to a method, a compiler, and a machine learning heterogeneous system for data layout between a host and a device.
Background
With the continuous development of artificial intelligence technology, hardware architectures and their corresponding algorithms keep evolving to serve ever-expanding application fields and scenarios. In machine learning in particular, a variety of hardware architectures have emerged to accommodate the computational operations of different neural network models. On the operational side, how to lay out the data participating in an operation has become an important problem to solve. Several ways of laying out data are currently common. Manual layout, for example, requires the user to adapt to a specific hardware architecture and follow its software conventions, which increases labor costs. When layout is performed in a heterogeneous system comprising a host and a device, it is common for layout to be done automatically either on the host side at compile time or on the device side at run time. Automatic layout on the host side places high demands on the compiler, and such a layout approach performs poorly. Automatic layout on the device side brings a significant amount of work to the device, which may degrade the overall performance of the device.
In view of the above, a scheme for laying out data reasonably is needed.
Disclosure of Invention
To address at least one or more of the above problems, the present disclosure proposes a distributed data layout approach: the data layout task is reasonably divided between the host side and the device side through an effective mechanism, so that data layout is achieved effectively without degrading the performance of either the host or the device.
To this end, in a first aspect, the present disclosure discloses a method for data layout between a host and a device, comprising: traversing a plurality of data nodes in a computational graph to determine in-out degree information for each data node in the computational graph, wherein each of the data nodes has associated data; and determining, according to the in-out degree information, that a layout operation on the data is to be performed by one of the host and the device.
In a second aspect, the present disclosure discloses an apparatus for data layout between a host and a device, comprising: at least one processor; and at least one memory for storing computer program instructions that, when executed by the at least one processor, cause the apparatus to perform the aforementioned method and the embodiments thereof described later.
In a third aspect, the present disclosure discloses a computer-readable storage medium storing computer program instructions for data layout between a host and a device which, when executed by at least one processor, implement the aforementioned method and the embodiments thereof described later.
In a fourth aspect, the present disclosure discloses a compiler for data layout between a host and a device, comprising: a traversal unit configured to traverse a computational graph comprising a plurality of data nodes, each of the data nodes having associated data, to determine in-out degree information for each data node; and a data layout optimization unit configured to determine, according to the in-out degree information, that a layout operation on the data is to be performed by one of the host and the device.
In a fifth aspect, the present disclosure discloses a heterogeneous system for machine learning, comprising an interconnected host and device, wherein the host comprises: a compiler as described in the foregoing and in the embodiments below; and a first layout unit configured to perform data layout according to the host layout instruction or the tag information; and the device comprises: a second layout unit configured to perform data layout according to the device layout instructions received from the host.
In a sixth aspect, the present disclosure discloses an integrated circuit device comprising a compiler as described above and in a number of embodiments later herein.
In a seventh aspect, the present disclosure discloses a board card comprising the heterogeneous system described above and in its later embodiments.
With the above aspects of the present disclosure, flexible layout of data can be achieved. In particular, because the position of a data node in the computational graph and/or the size of the data associated with that node are taken into account, the data layout operation of the present disclosure is more reasonable and efficient than the prior art, and the data layout is thereby optimized. For example, for an intermediate data node in a computational graph, the present disclosure places the layout operation on its data at the device side. In contrast, for a data node located at the edge of the computational graph whose data volume is large, the layout operation on its associated data is arranged at the host side. Compared with the prior art, the automatic, distributed data layout of the present disclosure therefore reduces the labor cost of manual layout and relaxes the high performance requirements on the host-side compilation system. In addition, thanks to the cooperative layout on the host side, the device side is improved in terms of computation and performance, and the performance of the heterogeneous system comprising the host and the device is ultimately improved.
Drawings
The above features of the present disclosure may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art, by referencing the accompanying drawings, wherein like reference numerals refer to like elements and in which:
FIG. 1 is a simplified flow diagram illustrating a method for data layout between a host and a device in accordance with an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a computational graph structure according to an embodiment of the present disclosure;
FIG. 3 is a detailed flow diagram illustrating a method for data layout between a host and a device according to an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating the structure of a compiler in accordance with an embodiment of the present disclosure;
FIG. 5 is an architecture block diagram illustrating a heterogeneous system for machine learning in accordance with an embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure; and
FIG. 7 is a schematic diagram illustrating the structure of a board card according to an embodiment of the disclosure.
Detailed Description
In view of at least the various problems with data layout in the prior art, the present disclosure proposes a distributed layout scheme performed automatically in a heterogeneous system comprising a host and a device. Specifically, the layout operation for the data associated with a data node is assigned to one of the host and the device based at least on the in-out degree information of that data node in the computational graph. In one implementation scenario, when the in-out degree information indicates that the data node is located at an intermediate position of the computational graph (e.g., it is neither an initial node nor a final node), the data layout operation for that node may be arranged at the device side, so as to reduce copying of its associated data between the host and the device and to reduce IO overhead. Correspondingly, when the data node is determined to be located at an edge position of the computational graph (e.g., it is an initial node or a final node), the layout may be arranged to be completed at the host side. In one or more embodiments, the amount of data at the data node may also be considered when determining which of the host and the device performs the data layout operation. With this scheme, the data layout is significantly optimized.
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
Fig. 1 is a simplified flow diagram illustrating a method 100 for data layout between a host and a device according to an embodiment of the disclosure; together, the host and the device may constitute a heterogeneous system 500 as illustrated in fig. 5 (described in detail later in connection with fig. 5). As shown in fig. 1, at step S102, the method 100 traverses a plurality of data nodes in a computational graph, each of which has associated data, to determine in-out degree information for each data node in the computational graph. To facilitate an understanding of aspects of the present disclosure, the computational graph is first described below in conjunction with fig. 2.
Fig. 2 shows a computational graph 200 of an embodiment of the present disclosure. In the field of machine learning, a computational graph is a directed graph that describes a computational process and typically comprises a set of nodes and edges. Specifically, nodes may represent input start points, output end points, model parameters, and the like. Nodes may also represent various types of operations, including mathematical operations, variable reads and writes, data padding, and the like. In view of this, the nodes in the computational graph can be roughly divided into data nodes 202 and operation nodes 204, shown in the figure as circles and rectangles, respectively. Edges in the computational graph (such as the one indicated at 206) generally fall into two classes: edges that convey specific data, and edges that represent control dependencies between nodes.
The in-out degree information of a data node in the computational graph can be determined from its position in the graph. In one embodiment of the present disclosure, the in-out degree information may include the in-degree and out-degree of the data node in the computational graph. Here, the in-degree indicates the number of preceding operation nodes for which the data node serves as an output, and the out-degree indicates the number of succeeding operation nodes for which the data node serves as an input. Taking the data node 202 in the figure as an example: since it is an initial node in the computational graph, i.e., the number of its preceding operation nodes is 0, its in-degree is 0; and since the number of its succeeding operation nodes (i.e., operation node 204) is 1, its out-degree is 1. Similarly, since the numbers of preceding and succeeding operation nodes of the data node 208 are both 1, its in-degree and out-degree are both 1. Likewise, for the data node 210, since it is a final node in the computational graph, i.e., the number of its succeeding operation nodes is 0, its out-degree is 0; and since it has 1 preceding operation node, its in-degree is 1.
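To make the degree computation concrete, the following Python sketch counts the in-degree and out-degree of the data nodes of a small graph shaped like the one in fig. 2. It is an illustration only: the edge-list encoding and the label "op_x" (fig. 2 does not number the second operation node) are assumptions, not the disclosed data structure.

```python
from collections import defaultdict

# Directed edges alternating between data nodes and operation nodes,
# mirroring fig. 2: d202 -> op204 -> d208 -> op_x -> d210.
edges = [("d202", "op204"), ("op204", "d208"),
         ("d208", "op_x"), ("op_x", "d210")]
data_nodes = ["d202", "d208", "d210"]

in_degree = defaultdict(int)   # preceding op nodes of which the node is an output
out_degree = defaultdict(int)  # succeeding op nodes of which the node is an input
for src, dst in edges:
    out_degree[src] += 1
    in_degree[dst] += 1

for node in data_nodes:
    print(node, "in =", in_degree[node], "out =", out_degree[node])
# d202 in = 0 out = 1  (initial node)
# d208 in = 1 out = 1  (intermediate node)
# d210 in = 1 out = 0  (final node)
```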
Having described in-out degree information, in-degree, and out-degree in connection with fig. 2, the description now returns to the method 100 of fig. 1. After determining the in-out degree information for each data node in the computational graph, the method 100 proceeds to step S104. At this step, the method 100 determines, according to the in-out degree information, that the layout operation on the data is performed by one of the host and the device. In other words, the method 100 uses the in-out degree information of the data node to decide which of the host and the device performs the layout operation.
In one embodiment, when the out-degree and the in-degree of the data node in the computational graph are both greater than 0, it may be determined that the layout operation is performed by the device. As described above, when the out-degree and the in-degree are both greater than 0, the data node is located at an intermediate position in the computational graph, i.e., it is neither an initial node nor a final node. Since the computational graph is executed by the device, arranging the layout of the data associated with such an intermediate data node to be performed at the device reduces copying of that data between the host and the device and thereby reduces IO overhead. At the same time, considering the degree to which such a data node participates in the operations of the computational graph, performing its data layout at the device improves the efficiency of data processing and operation on the device and also relieves the workload of host-side compilation.
In one embodiment, when one of the above-mentioned out-degree and in-degree is equal to 0, it is determined that the layout operation is performed by the host. As described above, when one of the out-degree and the in-degree of a data node is equal to 0, the data node is implicitly an initial node (in-degree 0, such as the data node 202 in fig. 2) or a final node (out-degree 0, such as the data node 210 in fig. 2) of the computational graph. Given that such a data node is an initial or final node, the present disclosure proposes to arrange its data layout operation to be performed at the host, which both reduces the workload of device-side data layout and improves the overall performance of the heterogeneous system. The host can then copy the node's associated data, once the layout operation is completed, from the host side to the device side, so that the device can operate directly on the already laid-out data, improving the operational efficiency of the device.
In one embodiment, when one of the out-degree and the in-degree of the data node is equal to 0, the method may further compare the data amount of the node's data to a threshold in order to determine which of the host and the device performs the layout operation. Specifically, when one of the out-degree and the in-degree is equal to 0 and the data amount is greater than or equal to the threshold, it may be determined that the layout operation is performed by the host. Correspondingly, when one of the out-degree and the in-degree is equal to 0 and the data amount is less than the threshold, it may be determined that the layout operation is performed by the device. In this embodiment, the scheme of the present disclosure thus considers not only the position of the data node in the computational graph but also the amount of data at the node. In particular, even if the data node is an initial or final node, its layout is not automatically assigned to the host. Instead, the present disclosure counts and compares the data volume of the node: when the data volume is too large, e.g., exceeds a predetermined threshold, the host rather than the device is selected to perform the layout. Arranging the layout operation on the host side in this case effectively relieves the layout pressure on the device side and fully utilizes the processing capability of the host to speed up data processing and operation.
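The decision rule described in the preceding paragraphs can be summarized in a short sketch. The function name and the byte-valued default threshold below are assumptions for illustration; the disclosure does not fix a particular threshold or API.

```python
DEVICE, HOST = "device", "host"

def assign_layout(in_degree: int, out_degree: int,
                  data_bytes: int, threshold: int = 1 << 20) -> str:
    """Decide which side performs the layout operation for one data node."""
    if in_degree > 0 and out_degree > 0:
        return DEVICE   # intermediate node: lay out on the device side
    if data_bytes >= threshold:
        return HOST     # large initial/final node: lay out on the host side
    return DEVICE       # small initial/final node: still laid out on the device

assert assign_layout(1, 1, 10**9) == DEVICE  # intermediate, size not considered
assert assign_layout(0, 1, 4 << 20) == HOST  # large initial node (in-degree 0)
assert assign_layout(1, 0, 1024) == DEVICE   # small final node (out-degree 0)
```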
FIG. 3 is a detailed flow diagram illustrating a method 300 for data layout between a host and a device according to an embodiment of the present disclosure. For convenience of description, assume that the computational graph includes N data nodes in total, where for the nth data node (n = 1, 2, …, N) the out-degree is denoted "o", the in-degree is denoted "i", and the data amount at the node is denoted "s".
As shown in FIG. 3, at step S302, the method 300 initializes a variable n, indicating a particular data node, to 1. Next, at steps S304 and S306, the method 300 determines the in-degree i and the out-degree o of the 1st data node, respectively; the specific determination has been described in detail with reference to fig. 2 and is not repeated here. After obtaining the in-out degree information, flow proceeds to step S308, where the method 300 determines whether both the out-degree and the in-degree are greater than 0. When both are greater than 0, then at step S310 the method 300 determines that the data layout for the 1st data node is to be performed at the device side. In one implementation scenario, this may be accomplished by a compiler generating corresponding layout instructions. Next, at step S312, the method 300 determines whether n equals N, i.e., whether this is the last data node to be traversed in the computational graph. If the determination is "no", then at step S314 the method 300 increments the variable n by 1 (so that n becomes 2) and returns to step S304 to perform the operations shown in the flow for the 2nd data node.
When it is judged at step S308 that at least one of the out-degree and the in-degree of the 1st data node is not greater than 0 (i.e., is equal to 0), that is, the judgment result is no, the flow advances to step S316. Here, the method 300 determines the data size s of the 1st data node. Depending on the data structure, there may be different ways of determining the data amount and of comparing it with the threshold. For example, for tensor data (e.g., a three-dimensional tensor), the total number of elements across the three dimensions may be calculated and multiplied by the storage space occupied by a single element to obtain the data size of the tensor. Next, at step S318, the method 300 compares the data size with a predetermined threshold, which may be a value in bytes or bits. When the comparison result is "no", i.e., the data size does not exceed the threshold, the method 300 determines at step S310 to perform the data layout operation on the device side. In other words, among initial or final data nodes, the present disclosure arranges the layout of nodes with small data amounts to be performed on the device side. When it is determined at step S318 that the data size s is greater than the threshold, then at step S320 the method 300 determines that the data layout for the 1st data node is performed at the host side, and at step S312 determines whether this data node is the final one in the computational graph; if not, the flow proceeds to step S314 to increment the variable n by 1 and returns to step S304 to process the next data node.
The steps shown in fig. 3 are executed repeatedly for each data node in the computational graph until n equals N, i.e., all data nodes have been traversed, and the method 300 ends at step S322. It is to be understood that the steps shown in fig. 3 and their order of execution are merely exemplary and not limiting, and that those skilled in the art may substitute or alter the relevant steps and their order in light of the present disclosure. For example, although steps S304 and S306 are shown in sequence, step S306 may be performed before step S304, or both may be performed simultaneously. In addition, although fig. 3 lays out the data nodes one at a time, the data nodes may also be laid out in groups or concurrently, as in the sketch below.
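The following compact sketch of such a graph-wide pass, in the spirit of method 300, classifies every data node and groups the results by target side, illustrating the grouped handling just mentioned. The Node container and the threshold value are assumptions for illustration, not the disclosed structures.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    in_degree: int
    out_degree: int
    data_bytes: int

def plan_layouts(nodes, threshold=1 << 20):
    """One pass over all N data nodes, grouping layout work by target side."""
    plan = {"host": [], "device": []}
    for n in nodes:
        if n.in_degree > 0 and n.out_degree > 0:  # step S308: intermediate node
            side = "device"                       # step S310
        elif n.data_bytes >= threshold:           # steps S316/S318: large edge node
            side = "host"                         # step S320
        else:                                     # small edge node
            side = "device"                       # step S310
        plan[side].append(n.name)
    return plan

print(plan_layouts([Node("d202", 0, 1, 8 << 20),  # large input: host
                    Node("d208", 1, 1, 64),       # intermediate: device
                    Node("d210", 1, 0, 64)]))     # small output: device
# {'host': ['d202'], 'device': ['d208', 'd210']}
```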
Fig. 4 is a block diagram illustrating the structure of a compiler 400 according to an embodiment of the present disclosure. In the heterogeneous system of the present disclosure, the compiler may be a software program running on the host side, and the executable program it produces (binary instructions including data layout instructions) may be transferred to the device side via the driver interface for execution. It is understood that the compiler 400 may perform the operations described in conjunction with figs. 1-3, so the technical description above with respect to figs. 1-3 also applies to the compiler 400 and is not repeated below.
As shown in fig. 4, the compiler 400 may be a neural network compiler; it may be applied to framework compilers such as TVM and MindSpore, and also to neural network frameworks such as TensorFlow and PyTorch, without limitation here. The compiler 400 includes a traversal unit 402 configured to traverse a computational graph (such as the one shown in fig. 2) comprising a plurality of data nodes, each having associated data, to determine in-out degree information for each data node. The compiler 400 also includes a data layout optimization unit 404, which may be configured to generate data layout instructions based on the in-out degree information, wherein the data layout instructions are used by one of the host and the device to perform layout operations on the data.
In one embodiment, the traversal unit may include an in-out degree counter 406, which may be configured to count the in-degree and out-degree of each of the data nodes. Further, the data layout optimization unit may include a threshold determiner 408, which may be configured to compare the out-degree and the in-degree with zero to obtain a first comparison result (i.e., the result of executing step S308 in fig. 3). In one embodiment, the compiler 400 of the present disclosure may further include an instruction generator 410, which may be configured to generate data layout instructions according to the first comparison result.
In one embodiment, when the first comparison result is that both the out-degree and the in-degree are greater than zero, the instruction generator may be configured to generate a data layout instruction for use by the device to perform the layout operation. Correspondingly, when the first comparison result is that one of the out-degree and the in-degree is equal to zero, the instruction generator may be configured to generate a data layout instruction for use by the host to perform the layout operation.
In one embodiment, the traversal unit of the present disclosure may further include a data volume calculator 412, which may be configured to calculate the data size of each data node. When one of the out-degree and the in-degree is equal to zero, the threshold determiner may be configured to compare this data size with a threshold to obtain a second comparison result. In this case, when the second comparison result is that the data size is greater than or equal to the threshold, the instruction generator may be configured to generate a data layout instruction used by the host to perform the layout operation. Accordingly, when the second comparison result is that the data size is less than the threshold, the instruction generator may be configured to generate a data layout instruction used by the device to perform the layout operation. It can be seen that the compiler 400 implements the operations at steps S320 and S310 of fig. 3 by generating the corresponding data layout instructions.
It is understood that the data layout instructions generated by the instruction generator differ for device-side and host-side layout operations. In one embodiment, the data layout instructions of the present disclosure may include host layout instructions and device layout instructions, where a host layout instruction causes the host to perform a layout operation on data, and a device layout instruction is inserted into the instruction group to be executed by the device and transmitted to the device to perform the layout operation there. In one application scenario, on the host side, host layout instructions generated by the compiler may be transmitted to a conversion unit within the host for performing data layout, so that when raw data is input to the host, the conversion unit lays it out according to the host layout instructions. Correspondingly, the generated device layout instructions may be inserted into the instruction group to be executed by the device and issued to the device side via, for example, a driver interface. The device side then lays out the data according to the received device layout instructions.
Alternatively or additionally, the compiler 400 of the present disclosure may also include a tag transmitter 414. In this embodiment, when one of the out-degree and the in-degree of the data node in the computational graph is equal to zero (i.e., the aforementioned first comparison result), the tag transmitter may be configured to transmit the tag information of the data node to the first layout unit of the host (such as the first layout unit 506 shown in fig. 5), so that the first layout unit performs the layout according to that tag information. In this embodiment, the instruction generator therefore does not need to generate data layout instructions for the host as previously described; it only needs to generate data layout instructions for the device, which it may do according to the tag information of the data node.
Of course, based on the disclosure herein, a person skilled in the art can also conceive of having the instruction generator generate the corresponding data layout instructions according to the tag information of the data nodes, including but not limited to data layout instructions for the host and data layout instructions for the device; the specific instruction generation process can refer to the description above.
Regarding the above-mentioned tag information, the compiler of the present disclosure may first traverse the computational graph of the neural network and determine the tag information of each data node in the network. In one embodiment, the tag information of a data node may include dynamic tag information and static tag information, and the compiler binds the tag information of the data to the corresponding data node.
In one embodiment, the static tag information may be used to characterize associated information of data participating in neural network operations, which may include at least one of: data category, static data type, static data dimension order, and dimension value corresponding to each static data dimension. By way of example, the static tag information may be given in the following format:
Static:classification,type1,DIM_A1…An,{x1…xn}
where Static is an identifier indicating that the tag information is static tag information, classification indicates the data category, and type1 indicates the static data type. DIM_A1…An indicates that there are n static data dimensions in the order A1…An, and {x1…xn} gives the corresponding dimension values: x1 for A1, …, xn for An. For example, if the static tag information of a data node is Static:IW,Float32,DIM_HW,{10,4}, then the data associated with that node is an input weight (data category) of 32-bit floating point numbers (static data type) in a two-dimensional, row-first layout (static data dimensions and their order), with 10 rows and 4 columns (dimension values).
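As an illustration, the sketch below parses a static tag written in the format above into its fields. The textual encoding (comma separation, braces) is taken from the example; the parser itself is a hypothetical aid, not part of the disclosure.

```python
import re

def parse_static_tag(tag: str) -> dict:
    """Parse e.g. 'Static:IW,Float32,DIM_HW,{10,4}' into its fields."""
    m = re.fullmatch(r"Static:(\w+),(\w+),DIM_(\w+),\{([\d,]+)\}", tag)
    if m is None:
        raise ValueError(f"not a static tag: {tag}")
    classification, dtype, dims, values = m.groups()
    return {
        "classification": classification,                  # e.g. IW = input weight
        "static_type": dtype,                              # e.g. Float32
        "dim_order": list(dims),                           # e.g. ['H', 'W'] (row-first)
        "dim_values": [int(v) for v in values.split(",")], # e.g. [10, 4]
    }

print(parse_static_tag("Static:IW,Float32,DIM_HW,{10,4}"))
```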
In one embodiment, the aforementioned dynamic tag information is used to characterize information associating the data with the processor, which may be determined specifically according to the architecture of the multi-core processor. The dynamic tag information may include at least one of: dynamic data type, dynamic data dimension order, tiling parameter, padding parameter, data size, split index, identifier of the target storage space, target exchange level, and other information. As an example, the dynamic tag information may be given in the following format:
dynamic:type2,DIM_B1…Bn,tiling,padding,size
where dynamic is an identifier indicating that the tag information is dynamic tag information, type2 represents the dynamic data type, and DIM_B1…Bn indicates that the dynamic data dimension order is B1…Bn. tiling is the tiling parameter, padding is the padding parameter, and size is the data size. For example, assume the static tag information is Static:IW,Float32,DIM_HW,{10,4}, and that the processor executing the computational graph of the neural network stores data in a column-first manner, processes 16-bit floating point numbers, and can perform at most 8 computations at a time. Then tiling = 8, padding = 16 - 10 = 6 (the 10 rows are padded up to the next multiple of 8), and size = (10 + 6) × 4 × 2 bytes (2 bytes per 16-bit floating point number) = 128 bytes. Thus, the dynamic tag information of the data to be processed can be expressed as dynamic:Float16,DIM_WH,8,6,128Byte.
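The arithmetic of this example can be reproduced in a few lines. The helper below is a hedged illustration under the stated assumptions (tiles of 8 elements, 16-bit floats); its name and output fields are not the patent's API.

```python
import math

def derive_dynamic_tag(rows: int = 10, cols: int = 4,
                       tile: int = 8, elem_bytes: int = 2) -> dict:
    """Derive tiling/padding/size for a rows x cols tensor on the assumed processor."""
    padding = math.ceil(rows / tile) * tile - rows  # 2 * 8 - 10 = 6
    size = (rows + padding) * cols * elem_bytes     # 16 * 4 * 2 = 128 bytes
    return {"dynamic_type": "Float16",
            "dim_order": ["W", "H"],                # column-first storage
            "tiling": tile, "padding": padding, "size": size}

print(derive_dynamic_tag())
# {'dynamic_type': 'Float16', 'dim_order': ['W', 'H'], 'tiling': 8, 'padding': 6, 'size': 128}
```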
Fig. 5 is an architecture block diagram illustrating a heterogeneous system 500 for machine learning in accordance with an embodiment of the present disclosure. As shown in fig. 5, the heterogeneous system 500 includes an interconnected host 502 and device 508, where the host includes the compiler 400 described in conjunction with fig. 4. According to various embodiments, the compiler may be configured to generate host data layout instructions and device data layout instructions, or to send tag information to the host and generate device data layout instructions. Accordingly, the host performs data layout according to the host data layout instructions or the tag information, and the device performs data layout according to the device data layout instructions received from the host. In one embodiment, the host includes a first layout unit 506 configured to perform data layout according to the host data layout instructions, or according to the tag information. When tag information is used and the dynamic data dimension order in the dynamic tag is inconsistent with the static data dimension order in the static tag, the first layout unit on the host side converts the dimension order of the data node from the static order to the dynamic order, thereby realizing the data layout operation.
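A minimal sketch of that conversion is shown below, assuming the layout change amounts to a physical transpose of the stored bytes (e.g., from DIM_HW, row-first, to DIM_WH, column-first); the NumPy formulation is an illustration, not the disclosed mechanism.

```python
import numpy as np

def relayout(data: np.ndarray, static_order: str, dynamic_order: str) -> np.ndarray:
    """Physically reorder `data` from the static to the dynamic dimension order."""
    if static_order == dynamic_order:
        return data
    perm = [static_order.index(axis) for axis in dynamic_order]
    # Materialize the permuted view so the bytes are actually moved.
    return np.ascontiguousarray(data.transpose(perm))

x = np.arange(40, dtype=np.float32).reshape(10, 4)  # DIM_HW, {10, 4}
y = relayout(x, "HW", "WH")                         # now W-major in memory
print(y.shape)                                      # (4, 10)
```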
In one application scenario, the first layout unit may be a data conversion unit on the host side that implements layout operations such as spatial conversion of data (e.g., row-first to column-first or vice versa). Accordingly, in one embodiment, the device may include a second layout unit 510 that may be configured to perform data layout according to the device data layout instructions received from the host. Similar to the first layout unit, the second layout unit may in some application scenarios also be implemented as a data conversion unit on the device side.
In one embodiment, the host and device of the present disclosure may exchange information through a driver (e.g., "driver_api"). On this basis, the compiler of the present disclosure may insert the device data layout instructions into the instruction group to be executed on the device side and, after compilation produces the executable program, transfer it to the device via the driver. The device then performs the data layout operations necessary for the computation based on the received data layout instructions.
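The toy sketch below illustrates this insertion with purely hypothetical instruction strings and a placeholder driver call; neither the instruction encoding nor the "driver_api" entry point is specified by the disclosure.

```python
# Hypothetical instruction strings for the device-side instruction group.
device_program = ["load d208", "conv d208 -> d210", "store d210"]
layout_instrs = ["relayout d208 DIM_HW -> DIM_WH"]

# Insert the layout work before the compute ops that consume the data.
executable = layout_instrs + device_program
# driver_api.launch(executable)  # hand the compiled group to the device
print(executable)
```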
Fig. 6 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure. As shown in fig. 6, the combined processing device 600 includes a computing processing device 602, an interface device 604, other processing devices 606, and a storage device 608. Depending on the application scenario, the computing processing device may include one or more computing devices 610, which may be configured to perform various computing operations, such as the operations involved in machine learning in the field of artificial intelligence.
In various embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or as part of a hardware structure of an artificial intelligence processor core, computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to collectively perform user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and artificial intelligence processors. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined based on actual needs. As previously mentioned, the computing processing device of the present disclosure alone may be considered to have a single-core structure or a homogeneous multi-core structure. However, considered together, the computing processing device and the other processing devices may be viewed as forming a heterogeneous multi-core structure. In this case, the computing processing device here may be considered equivalent to the device described in conjunction with figs. 1-5 of the present disclosure, and the other processing devices may be considered equivalent to the host described in conjunction with figs. 1-5.
In one or more embodiments, the other processing devices can serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial intelligence computing device, e.g., one related to neural network operations) and external data and control, performing basic control including, but not limited to, data handling and starting and/or stopping the computing device. In other embodiments, the other processing devices may also cooperate with the computing processing device to collectively perform computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write it into an on-chip storage device (or memory) of the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or additionally, the interface device may also read data from the storage device of the computing processing device and transmit it to the other processing devices. In some scenarios, the interface device may also be implemented as an application programming interface between the computing processing device and the other processing devices, including, for example, a driver interface for passing between the two the various instructions and programs to be executed by the computing processing device.
Additionally or alternatively, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing devices, respectively. In one or more embodiments, the storage device may be used to hold data of the computing processing device and/or the other processing devices, for example data that cannot be fully retained in the internal or on-chip storage of the computing processing device or the other processing devices.
In some embodiments, the present disclosure also discloses a chip (e.g., the chip 702 shown in fig. 7). In one implementation, the chip is a system-on-chip (SoC). The chip may be connected to other associated components through an external interface device, such as the external interface device 706 shown in fig. 7. The associated component may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., DRAM interfaces) may be integrated on the chip. In some embodiments, the present disclosure also discloses a chip packaging structure that includes the chip, as well as a board card that includes the chip packaging structure. The board card is described in detail below with reference to fig. 7.
Fig. 7 is a schematic diagram illustrating a board card 700 according to an embodiment of the disclosure, which may include the heterogeneous system described in conjunction with figs. 1-5. As shown in fig. 7, the board card includes a memory device 704 for storing data, which includes one or more memory cells 710. The memory device may connect to and transfer data with the control device 708 and the chip 702 described above, for example via a bus. Further, the board card includes an external interface device 706 configured for data relay or transfer between the chip (or the chip in the chip packaging structure) and an external device 712 (e.g., a server or a computer). For example, the data to be processed may be transferred from the external device to the chip through the external interface device, and the computation results of the chip may be transmitted back to the external device via the external interface device. Depending on the application scenario, the external interface device may take different forms, for example a standard PCIe interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. To that end, in one application scenario, the control device may include a microcontroller unit (MCU) for controlling the operating state of the chip.
From the above description in conjunction with fig. 6 and 7, it will be understood by those skilled in the art that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above boards, one or more of the above chips and/or one or more of the above combination processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud end, an edge end, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are divided based on the logic functions, and there may be other dividing manners in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in embodiments of the present disclosure. The memory may include, but is not limited to, a USB disk, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical realization of the hardware structure of the circuit may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., the computing device or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
clause a1, a method for data layout between a host and a device, comprising:
traversing a plurality of data nodes in a computational graph to determine in-out degree information for each data node in the computational graph, wherein each of the data nodes has associated data; and
determining, according to the in-out degree information, that a layout operation on the data is performed by one of a host and a device.
Clause a2, the method of clause a1, wherein the in-out degree information comprises the out-degree and in-degree of the data node in the computational graph.
Clause A3, the method of clause a2, wherein determining from the in-out degree information that the layout operation is performed by one of a host and a device comprises:
determining that the layout operation is performed by the device when the out-degree and the in-degree are both greater than zero.
Clause a4, the method of clause A3, wherein the placement operation is determined to be performed by the host when one of the out-degree and in-degree is equal to zero.
Clause a5, the method of clause A3, wherein when one of the out-degree and in-degree is equal to zero, the method further comprises:
comparing a data volume of the data to a threshold;
determining that the layout operation is performed by the host when the amount of data is greater than or equal to the threshold; and
determining to perform the layout operation by the device when the amount of data is less than the threshold.
Clause a6, an apparatus for data layout between a host and a device, comprising:
at least one processor;
at least one memory for storing computer program instructions that, when executed by the at least one processor, cause the apparatus to perform the method of any one of clauses a1-a5.
Clause a7, a computer-readable storage medium storing computer program instructions for data layout between a host and a device, the computer program instructions, when executed by at least one processor, implementing the method of any one of clauses a1-a5.
Clause A8, a compiler for data layout between a host and a device, comprising:
a traversal unit configured to traverse a computational graph comprising a plurality of data nodes, each of the data nodes having associated data, to determine in-out degree information for each data node; and
a data layout optimization unit configured to determine, according to the in-out degree information, that a layout operation on the data is performed by one of a host and a device.
Clause a9, the compiler of clause A8, wherein the in-out degree information comprises the out-degree and in-degree of the data node in the computational graph.
Clause a10, the compiler of clause a9, wherein the traversal unit comprises an in-out degree counter configured to count the out-degree and in-degree of each of the data nodes; and
the data layout optimization unit comprises a threshold determiner configured to compare the out-degree and the in-degree with zero to obtain a first comparison result, and to determine, according to the first comparison result, that a layout operation on the data is performed by one of the host and the device.
Clause a11, the compiler of clause a10, wherein the compiler further comprises an instruction generator configured to generate the data layout instruction according to the first comparison result.
Clause a12, the compiler of clause a11, wherein:
the first comparison result is that the out-degree and the in-degree are both greater than zero, the instruction generator configured to generate a data layout instruction for use by the apparatus to perform the layout operation based on the first comparison result; or
The first comparison result is that one of the out-degree and in-degree is equal to zero, the instruction generator configured to generate a data layout instruction for use by the host in performing the layout operation based on the first comparison result.
Clause a13, the compiler of clause a12, wherein the traversal unit further comprises a data volume calculator configured to calculate the data size of the data of each data node; and wherein:
the threshold determiner is further configured to compare the data size to a threshold to obtain a second comparison result when the first comparison result is that one of the out-degree and the in-degree is equal to zero; and
the instruction generator is configured to generate the data layout instruction according to the second comparison result.
Clause a14, the compiler of clause a13, wherein:
when the second comparison result is that the data size is greater than or equal to a threshold, the instruction generator is configured to generate a data layout instruction used by the host to perform the layout operation according to the second comparison result; and
when the second comparison result is that the data size is less than the threshold, the instruction generator is configured to generate a data layout instruction for use by the device in performing the layout operation according to the second comparison result.
Clause a15, the compiler of any one of clauses A8-A14, wherein the data layout instructions comprise host layout instructions and device layout instructions, wherein the host layout instructions cause the host to perform layout operations on data, and the device layout instructions are inserted into an instruction group to be executed by the device and transmitted to the device for performing layout operations.
Clause A16, the compiler of clause A10, wherein the host comprises a first layout unit for performing layout operations, and the compiler further comprises an instruction generator and a tag transmitter, wherein:
when the first comparison result is that the out-degree and the in-degree are both greater than zero, the instruction generator is configured to generate, according to the first comparison result, a data layout instruction for the device to perform the layout operation; and
when the first comparison result is that one of the out-degree and the in-degree is equal to zero, the tag transmitter is configured to transmit the tag information of the data node to the first layout unit of the host.
Clause A17, the compiler of clause A16, wherein the traversal unit further comprises a data volume calculator configured to calculate the data volume of the data of each data node; and wherein:
the threshold determiner is further configured to compare the data volume to a threshold to obtain a second comparison result when the first comparison result is that one of the out-degree and the in-degree is equal to zero;
when the second comparison result is that the data volume is less than the threshold, the instruction generator is configured to generate, according to the second comparison result, a data layout instruction for the device to perform the layout operation; and
when the second comparison result is that the data volume is greater than or equal to the threshold, the tag transmitter is configured to transmit the tag information of the data node to the first layout unit of the host.
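For the variant of clauses A16-A17, a node that falls to the host is not compiled into a host instruction; its tag information is instead forwarded to the host's first layout unit. A sketch under the same assumptions, with send_tag standing in for whatever transport the tag transmitter uses:

```python
# Sketch of the tag-transmitter variant (clauses A16-A17); send_tag is a
# hypothetical stand-in for handing a node's tag to the host's first layout
# unit, and device_group is the instruction group transmitted to the device.
device_group = []

def send_tag(tag):
    print(f"tag {tag!r} -> first layout unit of the host")

def dispatch(tag, out_deg, in_deg, data_volume, threshold=4096):
    if (out_deg > 0 and in_deg > 0) or data_volume < threshold:
        device_group.append(("device_layout", tag))  # device performs layout
    else:
        send_tag(tag)  # a degree is zero and the data is large: host lays out

dispatch("input_0", out_deg=1, in_deg=0, data_volume=1 << 20)  # host path
```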
Clause A18, a heterogeneous system for machine learning, comprising an interconnected host and device, wherein:
the host includes:
the compiler of any one of clauses A8-A17; and
a first layout unit configured to perform data layout according to the host layout instruction or the tag information; and
the device comprises:
a second layout unit configured to perform data layout according to the device layout instructions received from the host.
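A toy end-to-end flow for the heterogeneous system of clause A18 is sketched below; every class, method, and value is an illustrative assumption, with printing standing in for the actual layout operations on either side.

```python
# Toy sketch of the clause A18 heterogeneous system; all names are invented.
class ToyHost:
    def __init__(self, threshold=1024):
        self.threshold = threshold
        self.device_group = []   # device layout instructions to transmit

    def compile_and_lay_out(self, graph, volumes):
        for node, succs in graph.items():
            out_deg = len(succs)
            in_deg = sum(node in s for s in graph.values())
            if out_deg > 0 and in_deg > 0:
                self.device_group.append(node)        # device lays out
            elif volumes[node] >= self.threshold:
                self.first_layout_unit(node)          # host lays out now
            else:
                self.device_group.append(node)        # small data: device

    def first_layout_unit(self, node):
        print(f"host lays out {node}")

class ToyDevice:
    def second_layout_unit(self, group):
        for node in group:
            print(f"device lays out {node}")

host = ToyHost()
host.compile_and_lay_out({"a": ["b"], "b": ["c"], "c": []},
                         volumes={"a": 4096, "b": 8, "c": 8})
ToyDevice().second_layout_unit(host.device_group)   # b and c on the device
```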
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (18)

1. A method for data layout between a host and a device, comprising:
traversing a plurality of data nodes in a computational graph to determine in-out degree information for each data node in the computational graph, wherein each of the data nodes has associated data; and
determining, according to the in-out degree information, that a layout operation for the data is performed by one of the host and the device.
2. The method of claim 1, wherein the in-out degree information comprises the out-degree and in-degree of the data nodes in the computational graph.
3. The method of claim 2, wherein determining from the in-out degree information that the layout operation is performed by one of the host and the device comprises:
determining that the layout operation is performed by the device when the out-degree and the in-degree are both greater than zero.
4. The method of claim 3, wherein the placement operation is determined to be performed by the host when one of the out-degree and in-degree equals zero.
5. The method of claim 3, wherein when one of the out-degree and in-degree is equal to zero, the method comprises:
comparing a data volume of the data to a threshold;
determining that the layout operation is performed by the host when the amount of data is greater than or equal to the threshold; and
determining that the layout operation is performed by the device when the amount of data is less than the threshold.
6. An apparatus for data layout between a host and a device, comprising:
at least one processor;
at least one memory for storing computer program instructions that, when executed by the at least one processor, cause the apparatus to perform the method of any of claims 1-5.
7. A computer-readable storage medium storing computer program instructions for data layout between a host and a device, which, when executed by at least one processor, implement the method of any one of claims 1-5.
8. A compiler for data layout between a host and a device, comprising:
a traversal unit configured to traverse a computational graph comprising a plurality of data nodes, each of the data nodes having associated data, to determine in-out degree information for each data node; and
a data layout optimization unit configured to determine, according to the in-out degree information, that a layout operation for the data is performed by one of the host and the device.
9. The compiler of claim 8, wherein the in-out degree information comprises the out-degree and in-degree of each data node in the computational graph.
10. The compiler of claim 9, wherein the traversal unit includes an in-out counter configured to count an out-degree and an in-degree for each of the data nodes;
the data layout optimization unit includes a threshold determiner configured to compare the out-degree and the in-degree with zero to obtain a first comparison result, and determine, according to the first comparison result, that a layout operation for the data is performed by one of the host and the device.
11. The compiler of claim 10, wherein the compiler further comprises an instruction generator configured to generate a data layout instruction according to the first comparison result.
12. The compiler of claim 11, wherein:
when the first comparison result is that the out-degree and the in-degree are both greater than zero, the instruction generator is configured to generate, based on the first comparison result, a data layout instruction for the device to perform the layout operation; or
when the first comparison result is that one of the out-degree and the in-degree is equal to zero, the instruction generator is configured to generate, based on the first comparison result, a data layout instruction for the host to perform the layout operation.
13. The compiler of claim 12, wherein the traversal unit further comprises a data volume calculator configured to calculate the data volume of the data of each data node; and wherein:
the threshold determiner is further configured to compare the data volume to a threshold to obtain a second comparison result when the first comparison result is that one of the out-degree and the in-degree is equal to zero; and
the instruction generator is configured to generate the data layout instruction according to the second comparison result.
14. The compiler of claim 13, wherein:
when the second comparison result is that the data volume is greater than or equal to the threshold, the instruction generator is configured to generate, according to the second comparison result, a data layout instruction for the host to perform the layout operation; and
when the second comparison result is that the data volume is less than the threshold, the instruction generator is configured to generate, according to the second comparison result, a data layout instruction for the device to perform the layout operation.
15. The compiler of any one of claims 8-14, wherein the data layout instructions comprise host layout instructions and device layout instructions, wherein the host layout instructions cause the host to perform layout operations on the data, and the device layout instructions are joined into an instruction group to be transmitted to the device and executed by the device to perform layout operations.
16. The compiler of claim 10, wherein the host comprises a first layout unit for performing layout operations, and the compiler further comprises an instruction generator and a tag transmitter, wherein:
when the first comparison result is that the out-degree and the in-degree are both greater than zero, the instruction generator is configured to generate, according to the first comparison result, a data layout instruction for the device to perform the layout operation; and
when the first comparison result is that one of the out-degree and the in-degree is equal to zero, the tag transmitter is configured to transmit the tag information of the data node to the first layout unit of the host.
17. The compiler of claim 16, wherein the traversal unit further comprises a data volume calculator configured to calculate the data volume of the data of each data node; and wherein:
the threshold determiner is further configured to compare the data volume to a threshold to obtain a second comparison result when the first comparison result is that one of the out-degree and the in-degree is equal to zero;
when the second comparison result is that the data volume is less than the threshold, the instruction generator is configured to generate, according to the second comparison result, a data layout instruction for the device to perform the layout operation; and
when the second comparison result is that the data volume is greater than or equal to the threshold, the tag transmitter is configured to transmit the tag information of the data node to the first layout unit of the host.
18. A heterogeneous system for machine learning, comprising an interconnected host and device, wherein:
the host includes:
a compiler according to any one of claims 8-17; and
a first layout unit configured to perform data layout according to the host layout instruction or the tag information; and
the device comprises:
a second layout unit configured to perform data layout according to the device layout instructions received from the host.
CN202011463516.1A 2020-12-11 2020-12-11 Method, device and heterogeneous system for data layout between host and device Pending CN114625370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011463516.1A CN114625370A (en) 2020-12-11 2020-12-11 Method, device and heterogeneous system for data layout between host and device

Publications (1)

Publication Number Publication Date
CN114625370A 2022-06-14

Family

ID=81895190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011463516.1A Pending CN114625370A (en) 2020-12-11 2020-12-11 Method, device and heterogeneous system for data layout between host and device

Country Status (1)

Country Link
CN (1) CN114625370A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination