CN116796289A - Operator processing method and device, electronic equipment and storage medium

Operator processing method and device, electronic equipment and storage medium

Info

Publication number: CN116796289A
Application number: CN202310860789.7A
Authority: CN (China)
Prior art keywords: operator, operators, data, target, memory
Legal status: Pending
Other languages: Chinese (zh)
Inventor: Name withheld at the applicant's request
Current Assignee: Shanghai Biren Intelligent Technology Co Ltd
Original Assignee: Shanghai Biren Intelligent Technology Co Ltd
Application filed by Shanghai Biren Intelligent Technology Co Ltd
Priority to CN202310860789.7A
Publication of CN116796289A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an operator processing method and device, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The operator processing method includes: performing memory rearrangement processing on M first operators in a neural network model based on a first memory arrangement operator to generate M second operators, where the memory arrangement of each second operator is a linear memory arrangement; fusing the M second operators to generate a first fusion operator; and performing operator fusion based on the first memory arrangement operator, the first fusion operator, and a second memory arrangement operator to generate a target operator. In this way, operators under different memory arrangements are fused, so that a separate implementation of each first operator no longer needs to be supported for every hardware-related scenario, CPU logic can be multiplexed, the workload of operator development is greatly reduced, and the performance of operator execution is improved.

Description

Operator processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to an operator processing method, an operator processing device, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence (Artificial Intelligence, AI) technology, neural networks are widely used in various fields, and a large number of operators are required to support their operation.
In the related art, neural networks are basically run in single-operator mode (eager mode): real memory rearrangement is required, each single operator needs a different implementation for each hardware-related scenario, and operators with different data arrangements (layouts) are difficult to fuse and eliminate, so the operator development workload is huge.
Therefore, how to reduce the workload of operator development and improve the performance of operator execution is an urgent problem to be solved.
Disclosure of Invention
Aiming at the problems existing in the prior art, the embodiment of the invention provides an operator processing method, an operator processing device, electronic equipment and a storage medium.
The invention provides an operator processing method, which comprises the following steps:
performing memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operators to generate M second operators; the memory arrangement of each second operator is linear memory arrangement, and M is a positive integer greater than or equal to 1;
fusing the M second operators to generate a first fusion operator;
performing operator fusion based on the first memory arrangement operator, the first fusion operator and the second memory arrangement operator to generate a target operator, wherein the target operator is used for executing target operation; the target operation is associated with an instruction map.
Optionally, before the memory rearrangement processing is performed on the M first operators in the neural network model based on the first memory arrangement operator, the method further includes:
and obtaining a calculation graph corresponding to the neural network model, wherein the calculation graph comprises the M first operators.
Optionally, the performing memory rearrangement processing on the M first operators in the neural network model based on the first memory arrangement operator to generate M second operators includes:
inserting the first memory arrangement operators into the head of the computation graph, and performing memory rearrangement processing on the M first operators by using the first memory arrangement operators to generate the M second operators.
Optionally, the performing operator fusion based on the first memory arrangement operator, the first fusion operator and the second memory arrangement operator includes:
inserting the second memory arrangement operator at the tail part of the calculation graph, and performing memory rearrangement processing on the first fusion operator based on the second memory arrangement operator to generate a second fusion operator; the memory arrangement of the second fusion operator is nonlinear memory arrangement;
and performing operator fusion on the first memory arrangement operator, the second fusion operator and the second memory arrangement operator to generate the target operator.
Optionally, the method further comprises:
determining physical coordinates of the target operator in hardware equipment based on the target operator;
performing instruction mapping based on the schedule template and the physical coordinates to generate a read-write instruction;
and based on the read-write instruction, performing read-write operation on the data associated with the target operator in the hardware equipment.
Optionally, based on the read-write instruction, performing, in the hardware device, read-write operation on the data associated with the target operator, including:
based on the memory arrangement of the hardware equipment, the data associated with the target operator are segmented to generate a plurality of data blocks;
and based on the read-write instruction, performing read-write operation on at least one data block in the hardware equipment.
Optionally, based on the read-write instruction, performing read-write operation on at least one data block in the hardware device includes:
under the condition that read-write operation is carried out on a plurality of data blocks at the same time, carrying out read-write operation on each data block based on a vector instruction; the vector instruction is used for calling a plurality of threads and simultaneously performing read-write operation on a plurality of data.
Optionally, based on the read-write instruction, performing read-write operation on at least one data block in the hardware device, including any one of the following:
under the condition of performing read-write operation on internal data of any data block, writing the internal data into a shared memory area to perform data rearrangement to generate target data; based on a vector instruction, performing read-write operation on the target data in the shared memory area;
or alternatively,
based on a target read-write instruction, performing read-write operation on the internal data; the target read-write instruction is used for calling a single thread to perform read-write operation on single data.
Optionally, the vector instruction includes at least one of:
a data load ldm instruction;
the data store stm instruction.
Optionally, the target read-write instruction includes at least one of the following:
a data load ld instruction;
the data stores st instructions.
Optionally, in a case where the data associated with the M first operators exceeds a hardware device limit, the method further includes:
inserting a third operator into the head of the computation graph, wherein the third operator comprises the memory arrangement operator and a fourth operator, and the fourth operator is used for executing an operation related to data transformation reshape;
and processing the data associated with the M first operators by using the third operator.
Optionally, in case the data type of the first operator-associated data does not conform to the target data type, the method further comprises:
inserting a fifth operator into the head part and the tail part of the computation graph, wherein the fifth operator is used for converting the data type of the data associated with the first operator;
and converting the data type of the data associated with the first operator into the target data type by utilizing the fifth operator.
Optionally, the fusing the M second operators to generate a first fusion operator includes:
fusing the M second operators by multiplexing the operation logic of the central processing unit (CPU) to generate the first fusion operator.
The invention also provides an operator processing device, which comprises:
a first generation module, configured to perform memory rearrangement processing on M first operators in a neural network model based on a first memory arrangement operator to generate M second operators, where the memory arrangement of each second operator is a linear memory arrangement and M is a positive integer greater than or equal to 1;
a first fusion module, configured to fuse the M second operators to generate a first fusion operator;
a second fusion module, configured to perform operator fusion based on the first memory arrangement operator, the first fusion operator, and a second memory arrangement operator to generate a target operator, where the target operator is used to execute a target operation; the target operation is associated with an instruction mapping.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing any one of the operator processing methods described above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an operator processing method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements an operator processing method as described in any one of the above.
According to the operator processing method and device, the electronic device, and the storage medium, memory rearrangement processing is performed on M first operators in a neural network model through a first memory arrangement operator to generate M second operators with a linear memory arrangement; the M second operators are fused to generate a first fusion operator; and operator fusion is then performed based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator. In this way, fusion of operators under different memory arrangements is realized, and the target operator is used to execute the target operation associated with instruction mapping, so a separate implementation of each first operator no longer needs to be supported for every hardware-related scenario and CPU logic can be multiplexed, which greatly reduces the workload of operator development and improves the performance of operator execution.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of an operator processing method provided by the present invention;
FIG. 2 is a second flow chart of the operator processing method according to the present invention;
FIG. 3 is a schematic diagram of the processing logic of the operator processing method provided by the present invention;
FIG. 4 is a second schematic diagram of the processing logic of the operator processing method according to the present invention;
FIG. 5 is a schematic diagram of an operator processing apparatus provided by the present invention;
Fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to facilitate a clearer understanding of various embodiments of the present application, some relevant knowledge will be presented first.
With the rapid development of artificial intelligence technology, neural networks are widely applied in various fields, and a large number of operators are required to support the operation of the neural networks.
However, the operation of the neural network in the related art has the following drawbacks:
1. Neural networks are basically run in eager mode, real memory rearrangement is required, and the operator development workload in eager mode is huge.
2. Operators with different layouts, such as Memory operators, Unary operators, and Binary operators, are typically difficult to fuse and eliminate.
A layout describes the memory arrangement of the data on the hardware: for example, whether data is read by rows or by columns; which block is read after a given block; different alignment requirements in different dimensions; and so on.
Unary: unary operators, such as sin, cos, exp, log, and erf;
Binary: binary operators, such as add, mul, and mod, covering both broadcast and elementwise scenarios;
Memory: mainly operators involving memory mapping, such as transpose and reshape.
3. For oversized shapes exceeding the hardware limit, an explicit reshape is required, and every hand-written kernel must take large tensors (large tensor) into account.
4. Each operator needs to consider the following permutation and combination of different scenes:
(1) Different layout.
(2) Different data types (data types), such as fp32, bf16, and the like.
(3) Different shapes.
(4) Different burst modes, such as Burst 1/2/4;
a burst mode corresponds to the single instruction, multiple data (Single Instruction Multiple Data, SIMD) concept, generally called vectorization: one instruction can process multiple data items at the same time.
(5) Large tensors.
(6) Different operator attributes, such as the different axis exchanges of transpose.
If the different layouts, data types, and burst modes are permuted and combined together with large-tensor scenarios, the hand-written kernel workload required for even a single operator is predictably huge.
5. Some operators involve rearrangement only within a data block (block), using point-to-point element accesses via data load (ld) or data store (st) instructions; after fusion, however, multiple-data loads (ldm) or multiple-data stores (stm) may be used.
In summary, a scheme is needed so that a single operator does not have to be implemented separately for every hardware-related scenario (layout, data type, burst mode, operator attribute, shape, large tensor, etc.), thereby reducing the workload of operator development and improving the performance of operator execution.
The operator processing method provided by the invention is specifically described below with reference to fig. 1 to 4. Fig. 1 is a schematic flow chart of an operator processing method provided in the present invention, referring to fig. 1, the method includes steps 101 to 103, where:
step 101, performing memory rearrangement processing on M first operators in a neural network model based on a first memory arrangement operator to generate M second operators; the memory arrangement of each second operator is linear memory arrangement, and M is a positive integer greater than or equal to 1.
It should be noted that the execution body of the present invention may be any electronic device capable of implementing operator processing, for example, any one of a smart phone, a smart watch, a desktop computer, a laptop computer, and the like.
In this embodiment, the neural network model may be applied to fields such as image recognition, speech processing, and natural language processing; the neural network model may be, for example, a convolutional neural network (Convolutional Neural Networks, CNN) or a recurrent neural network (Recurrent Neural Networks, RNN), and running the neural network requires the support of the M first operators.
Each first operator can be, for example, a Memory operator, a Unary operator, or a Binary operator, and each corresponds to a different nonlinear memory arrangement. Therefore, in this embodiment, the basic facilities of the AI compiler are used to perform memory rearrangement processing on the M nonlinear first operators based on the first memory arrangement operator, so as to generate M second operators with a linear memory arrangement, that is, nonlinear arrangement (non-ByteObject) → linear arrangement (ByteObject).
The first memory arrangement operator is used for performing memory rearrangement processing on the first operators; it can be expressed as a layout_convert operator.
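For intuition only (the patent publishes no code for this operator), the effect of such a layout_convert can be sketched in Python with numpy; the column-major source layout and the small matrix are assumptions chosen for the example:

import numpy as np

def layout_convert(buf: np.ndarray, rows: int, cols: int) -> np.ndarray:
    # Sketch of layout_convert: rearrange a rows x cols matrix stored
    # column-major (nonlinear from the row-major viewpoint) into a flat
    # row-major buffer (the linear, ByteObject-style arrangement).
    out = np.empty(rows * cols, dtype=buf.dtype)
    for r in range(rows):
        for c in range(cols):
            out[r * cols + c] = buf[c * rows + r]  # col-major physical offset
    return out

# Usage: a 3x4 matrix laid out column-major in memory.
m = np.arange(12, dtype=np.float32).reshape(3, 4)
col_major_buf = m.ravel(order="F")  # simulate column-major storage
assert np.array_equal(layout_convert(col_major_buf, 3, 4), m.ravel())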
Step 102, fusing the M second operators to generate a first fusion operator.
In this embodiment, since the memory arrangement of each second operator is a linear memory arrangement, each second operator can be directly fused into one large first fusion operator.
Optionally, the fusing of the M second operators to generate a first fusion operator may be implemented as follows:
fusing the M second operators by multiplexing the operation logic of the central processing unit (CPU) to generate the first fusion operator.
For example, the CPU operation logic of the Tensor Virtual Machine (TVM) community can be multiplexed to fuse a Memory operator, a Unary operator, and a Binary operator into one large operator; at this time, the memory arrangements of the Memory operator, the Unary operator, and the Binary operator are all linear memory arrangements.
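As a minimal sketch of this multiplexed fusion using the open-source TVM TE API (the shape and the particular Unary/Binary ops are assumptions for illustration, not the patent's code):

import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A", dtype="float32")     # linear ByteObject-style input
U = te.compute((n,), lambda i: te.exp(A[i]), name="U")  # Unary stage
B = te.compute((n,), lambda i: U[i] + A[i], name="B")   # Binary stage

s = te.create_schedule(B.op)
s[U].compute_inline()  # fuse: the unary stage is inlined into the binary stage
print(tvm.lower(s, [A, B], simple_mode=True))  # one fused loop nest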
In the above embodiment, since the second operators are fused into one large first fusion operator, there is no need to implement a separate operator for every permutation and combination of different layouts, data types, burst modes, and large-tensor scenarios; moreover, automatic fusion among operators of different layouts, such as Memory operators, Unary operators, and Binary operators, is realized, which greatly improves performance.
Step 103, performing operator fusion based on the first memory arrangement operator, the first fusion operator and the second memory arrangement operator to generate a target operator, wherein the target operator is used for executing target operation; the target operation is associated with an instruction map.
In this embodiment, the second memory arrangement operator is used to perform memory rearrangement processing on the first fusion operator. Generating the target operator means that a single conversion of coordinates, from the logical coordinates of each first operator to physical coordinates in the hardware device, is achieved.
Wherein the target operator is used to perform a target operation associated with the instruction map.
Specifically, each hardware device has its own instruction set, and instructions that the hardware device can read, such as read-write instructions and calculation instructions, need to be generated here.
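For intuition, such a logical-to-physical conversion amounts to index arithmetic; the tiled column-major layout and tile size below are assumptions for illustration, not the patent's actual hardware layout:

def physical_offset(row: int, col: int, rows: int, block_rows: int = 4) -> int:
    # Map a logical (row, col) coordinate to a physical memory offset for a
    # layout that stores the matrix column-major in tiles of block_rows rows.
    blocks_per_col = rows // block_rows                # tiles stacked per column
    tile = row // block_rows + col * blocks_per_col    # tile index, column-major
    return tile * block_rows + row % block_rows        # offset inside the tile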
According to the operator processing method provided by the invention, memory rearrangement processing is performed on M first operators in a neural network model through a first memory arrangement operator to generate M second operators with a linear memory arrangement; the M second operators are fused to generate a first fusion operator; and operator fusion is then performed based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator. In this way, operators under different memory arrangements are fused and the target operator is used to execute the target operation associated with instruction mapping, so a separate implementation of each first operator no longer needs to be supported for every hardware-related scenario and CPU logic can be multiplexed, which greatly reduces the workload of operator development and improves the performance of operator execution.
Optionally, before the memory rearrangement processing is performed on the M first operators in the neural network model based on the first memory arrangement operator, the M first operators in the neural network model need to be acquired, which is specifically implemented by the following manner:
and obtaining a calculation graph corresponding to the neural network model, wherein the calculation graph comprises the M first operators.
In this embodiment, the neural network model is first abstracted into a corresponding computational graph, and then each first operator is obtained from the computational graph.
The computational graph corresponding to the neural network model is a directed acyclic graph describing the computation, and has two main elements: nodes and edges. Each node may correspond to data such as a vector, matrix, or tensor; each edge represents an operation, such as addition, subtraction, multiplication, division, or convolution.
Optionally, the memory rearrangement processing is performed on the M first operators in the neural network model based on the first memory arrangement operator to generate M second operators, which is specifically implemented by the following manner:
inserting the first memory arrangement operators into the head of the computation graph, and performing memory rearrangement processing on the M first operators by using the first memory arrangement operators to generate the M second operators.
For example, a layout_convert operator is inserted at the head of the computation graph to rearrange the Memory operator, the Unary operator, and the Binary operator from their nonlinear memory arrangements into the linear memory arrangement ByteObject; the Memory operator, the Unary operator, and the Binary operator can then be fused into one large fusion operator according to the CPU logic.
In this way, operators of different layouts are fused automatically, which greatly reduces the workload of operator development and further improves the performance of operator execution.
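A pure-Python sketch of this head-insertion pass follows; the Node structure and operator names are assumptions made for illustration:

from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                   # e.g. "input", "transpose", "sin", "add"
    inputs: list = field(default_factory=list)

def insert_layout_convert_at_head(graph_inputs):
    # Insert a layout_convert node after every graph input, so that all
    # downstream first operators see a linear (ByteObject) arrangement.
    return [Node(op="layout_convert", inputs=[x]) for x in graph_inputs]

# Usage: rebuild the first operators on top of the converted inputs.
x = Node(op="input")
(head,) = insert_layout_convert_at_head([x])
graph = Node(op="add", inputs=[Node(op="sin", inputs=[head]), head])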
Optionally, the operator fusion based on the first memory arrangement operator, the first fusion operator and the second memory arrangement operator may be specifically implemented by the following steps:
step 1), inserting the second memory arrangement operator at the tail part of the computation graph, and performing memory rearrangement processing on the first fusion operator based on the second memory arrangement operator to generate a second fusion operator; the memory arrangement of the second fusion operator is nonlinear memory arrangement;
and 2) performing operator fusion on the first memory arrangement operator, the second fusion operator and the second memory arrangement operator to generate the target operator.
In this embodiment, after the second operators are fused to generate the first fusion operator, the first fusion operator with a linear memory arrangement needs to undergo memory rearrangement processing again based on the second memory arrangement operator to generate the second fusion operator with a nonlinear arrangement, thereby realizing the overall non-ByteObject → ByteObject → non-ByteObject conversion.
The target operator is then utilized to perform a target operation associated with the instruction map.
Optionally, the performing the target operation based on the target operator may specifically be achieved by:
step 1), determining physical coordinates of the target operator in hardware equipment based on the target operator;
step 2), performing instruction mapping based on the schedule template and the physical coordinates to generate a read-write instruction.
And step 3) based on the read-write instruction, performing read-write operation on the data associated with the target operator in the hardware equipment.
In this embodiment, for the process "non-ByteObject → ByteObject → non-ByteObject", fusion can be performed at the tensor IR level: the computation of the three parts is fused into one computation to obtain the target operator, thereby achieving a single coordinate conversion. That is, the conversion from the original coordinates of the target operator (its logical coordinates) to coordinates in the hardware device (physical coordinates) is achieved. Meanwhile, compiler techniques can be used to apply corresponding formula simplification to the fused computation.
Here, tensor IR is an intermediate representation (IR) that describes tensor computation in terms of for loops, if-else branches, reads and writes, and the corresponding calculations.
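As a small illustration of the formula simplification mentioned above (using TVM's arithmetic analyzer; the composed index expression is an assumption):

import tvm
from tvm import tir

i = tir.Var("i", "int32")
# Fusing the three stages composes their coordinate mappings; the compiler
# then simplifies the composed index math, e.g. (4i+2)//4*4 + (4i+2)%4 -> 4i+2.
expr = (i * 4 + 2) // 4 * 4 + (i * 4 + 2) % 4
print(tvm.arith.Analyzer().simplify(expr))  # prints: i*4 + 2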
Optionally, based on the read-write instruction, performing read-write operation on the data associated with the target operator in the hardware device, which may be specifically implemented by the following steps:
step 1), based on the memory arrangement of the hardware equipment, segmenting the data associated with the target operator to generate a plurality of data blocks;
and 2) based on the read-write instruction, performing read-write operation on at least one data block in the hardware equipment.
In this embodiment, the data associated with the target operator needs to be segmented according to the memory arrangement of the hardware device through a schedule, so as to obtain a plurality of data blocks (blocks).
And then, based on the read-write instruction, performing data read-write operation on at least one data block in the hardware equipment.
Here, a schedule is a set of computation transformations; different performance is achieved by transforming the computation loops in the program.
In the above embodiment, the data associated with the target operator is segmented according to the memory arrangement of the hardware device, so that the data can be aligned with the memory arrangement of the hardware device.
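A numpy sketch of this alignment-driven slicing (the 32x32 block shape and zero padding are assumptions chosen for the example):

import numpy as np

def split_into_blocks(data: np.ndarray, block_shape=(32, 32)):
    # Slice a 2-D tensor into blocks aligned to an assumed hardware memory
    # arrangement; ragged edges are zero-padded to keep alignment.
    br, bc = block_shape
    rows = -(-data.shape[0] // br) * br  # round up to a block multiple
    cols = -(-data.shape[1] // bc) * bc
    padded = np.zeros((rows, cols), dtype=data.dtype)
    padded[:data.shape[0], :data.shape[1]] = data
    return [padded[r:r + br, c:c + bc]
            for r in range(0, rows, br)
            for c in range(0, cols, bc)]

blocks = split_into_blocks(np.ones((50, 70), dtype=np.float32))
assert len(blocks) == 2 * 3  # 50x70 pads to 64x96, i.e. 2x3 blocks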
Optionally, based on the read-write instruction, performing a read-write operation on at least one data block in the hardware device may be implemented in at least one of the following manners:
Case 1: read-write operations are performed on multiple data blocks at the same time, i.e., inter-block data rearrangement is involved.
When read-write operations are performed on multiple data blocks at the same time, the read-write operation on each data block is performed based on a vector instruction; the vector instruction is used to invoke multiple threads and perform read-write operations on multiple data items simultaneously.
In this embodiment, a vector instruction follows the single instruction, multiple threads (Single Instruction Multiple Thread, SIMT) concept: one instruction drives multiple threads, so multiple threads can read and write correspondingly at the same time. Certain alignment restrictions exist; for example, all 32 threads must operate together and cannot operate partially.
Optionally, the vector instruction includes at least one of:
a) A data load ldm instruction;
b) The data store stm instruction.
Note that the ldm instruction is a type of vector instruction: in accordance with the SIMT concept, multiple threads read data from external memory simultaneously, and different layouts correspond to different instruction configurations.
The stm instruction is likewise a type of vector instruction: multiple threads write data to external memory simultaneously, and different layouts correspond to different instruction configurations.
In this embodiment, based on vector instructions, multiple threads can be invoked simultaneously to perform data read-write operations on each data block, which improves data read-write efficiency and therefore performance.
Case 2: read-write operations are performed on the internal data of a single data block, i.e., intra-block data rearrangement is involved.
Mode 1: when performing read-write operations on the internal data of any data block, the internal data is written into a shared memory area for data rearrangement to generate target data; read-write operations are then performed on the target data in the shared memory area based on a vector instruction.
In this embodiment, when intra-block data rearrangement is involved, the internal data of the data block is written into a shared memory area (shared memory), generating contiguous target data. In this way, the target data can be written out from shared memory with the stm instruction; compared with precise per-position reads and writes, using the stm instruction is faster, and data can be shared among multiple threads.
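A numpy sketch of this staging pattern (the permutation is an arbitrary illustrative rearrangement, and the staging buffer stands in for shared memory):

import numpy as np

def rearrange_via_shared(block: np.ndarray, perm) -> np.ndarray:
    # Simulate intra-block rearrangement: scatter elements into a shared
    # staging buffer in target order, then write the whole buffer out
    # contiguously (one stm-style bulk store instead of per-element st's).
    flat = block.ravel()
    shared = np.empty_like(flat)      # stands in for the shared memory area
    for src, dst in enumerate(perm):  # data rearrangement in "shared memory"
        shared[dst] = flat[src]
    return shared.copy()              # single contiguous write-out

blk = np.arange(8, dtype=np.float32)
out = rearrange_via_shared(blk, perm=[0, 4, 1, 5, 2, 6, 3, 7])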
Mode 2, based on a target read-write instruction, performing read-write operation on the internal data; the target read-write instruction is used for calling a single thread to perform read-write operation on single data.
Optionally, the target read-write instruction includes at least one of the following:
a) A data load ld instruction;
b) The data stores st instructions.
In this embodiment, when the rearrangement of the data in the block is involved, the internal data may be precisely read and written by the ld instruction and the st instruction.
It should be noted that this part is mainly embodied in the layout and the schedule: a linear layout representation is added for the non-ByteObject arrangement, which is flattened into linear form according to its memory arrangement, so that the corresponding point-to-point operations can be performed and the ld instruction or st instruction applied.
Optionally, in the case that the data associated with the M first operators exceeds the hardware device limit, the following steps are further required to be performed:
step 1), inserting a third operator into the head of the computation graph, wherein the third operator comprises the memory arrangement operator and a fourth operator, and the fourth operator is used for executing an operation related to data transformation reshape;
and 2) processing the data associated with the M first operators by using the third operator.
In this embodiment, for a shape that exceeds the hardware limit, a third operator is automatically inserted in the header of the computation graph so that the limit of the hardware device is not exceeded.
It should be noted that there is no need to add a new reshape implementation for each different layout; the reshape operator is simply split into layout_convert + reshape (ByteObject), so that the original CPU implementation of reshape is multiplexed. The reshape operator here is a generalized reshape operator and includes any operator that can perform reshape-related operations.
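Reusing the Node dataclass from the head-insertion sketch earlier, this split can be illustrated as a small rewrite (the operator names are assumptions):

def split_reshape(node: Node) -> Node:
    # Rewrite reshape(non-linear input) into layout_convert -> reshape(ByteObject),
    # so the existing CPU implementation of reshape on linear buffers is reused.
    assert node.op == "reshape"
    lc = Node(op="layout_convert", inputs=node.inputs)  # non-linear -> linear
    return Node(op="reshape", inputs=[lc])              # plain linear reshape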
Optionally, in case the data type of the first operator related data does not conform to the target data type, the following steps are also performed:
step 1), inserting a fifth operator into the head part and the tail part of the computation graph, wherein the fifth operator is used for converting the data type of the data associated with the first operator;
the fifth operator may be, for example, a cast operator.
Step 2) converting the data type of the data associated with the first operator into the target data type by using the fifth operator.
Fig. 2 is a second flow chart of the operator processing method provided in the present invention, referring to fig. 2, the method includes steps 201 to 210, in which:
step 201, obtaining a computation graph corresponding to the neural network model, wherein the computation graph comprises M first operators.
Step 202, inserting a first memory arrangement operator into the header of the computation graph, and performing memory rearrangement processing on the M first operators based on the first memory arrangement operator to generate M second operators.
And 203, multiplexing the operation logic of the CPU to fuse the M second operators to generate a first fusion operator.
Step 204, inserting a second memory arrangement operator at the tail part of the calculation graph, and performing memory rearrangement processing on the first fusion operator based on the second memory arrangement operator to generate a second fusion operator; the memory arrangement of the second fusion operator is nonlinear memory arrangement.
Step 205, performing operator fusion on the first memory arrangement operator, the second fusion operator and the second memory arrangement operator to generate a target operator.
Step 206, determining physical coordinates of the target operator in the hardware device based on the target operator.
Step 207, performing instruction mapping based on the schedule and the physical coordinates to generate a read-write instruction.
And step 208, based on the memory arrangement of the hardware equipment, segmenting the data associated with the target operator to generate a plurality of data blocks.
Step 209, under the condition of performing read-write operation on a plurality of data blocks at the same time, performing read-write operation on each data block based on the vector instruction; the vector instruction is used for calling a plurality of threads and simultaneously performing read-write operation on a plurality of data.
Step 210, under the condition of performing read-write operation on internal data of any data block, writing the internal data into a shared memory area to perform data rearrangement to generate target data; performing read-write operation on target data in the shared memory area based on the vector instruction; or performing read-write operation on the internal data based on a target read-write instruction, wherein the target read-write instruction is used for calling a single thread to perform read-write operation on the single data.
The operator processing method provided by the invention is further described in detail below with reference to specific embodiments.
Embodiment one: data type Fp32, inter-block rearrangement
FIG. 3 is a schematic diagram of processing logic of an operator processing method according to the present invention. Referring to FIG. 3, input layers, memory operators (ops), unary ops, binary ops, and output layers of the neural network model are shown in (a).
A layout_convert operator is then inserted at the head and the tail of (a) to obtain the computation graph shown in (b).
The layout_convert operator below input is used to convert Col-Major into ByteObject; the intermediate memory ops, unary ops, and binary ops operate on ByteObject, and operator fusion is realized by multiplexing the TVM community CPU logic.
The layout_convert operator above output is used to convert ByteObject back to Col-Major.
Operator fusion is performed on all operators to obtain the computation graph shown in (c), where fused op denotes the final fusion operator. Fusion operator code is then generated automatically to realize a single coordinate conversion, namely Col-Major → Col-Major. Here, the single coordinate conversion is a layout-aware coordinate conversion and corresponds to the generation of the vector instructions.
The single coordinate conversion proceeds as follows: first, compute fusion is realized using compute inline; then, one coordinate conversion is performed for the fused compute using the generic schedule template.
A brief summary can be described as:
the layer IR- > tensor represents IR (for loop level) - > instruction layer IR (e.g. MLIR/LLNM IR). The MLIR/LLVM IR is mainly embodied to a corresponding instruction map, such as Loadmatrix instruction (coordinates, parameters, …), store matrix instruction (coordinates, parameters, …), and the like.
The generic Schedule template can be expressed as:
j0,j1=sch.split(j,factors=[None,j_align_factor])
k0,k1=sch.split(k,factors=[None,k_align_factor])
sch.reorder(j0,k0,j1,k1)
sch.bind(j1,"threadxxx")
Here, the split alignment factors, the ordering of the different loop axes, and the thread binding are all related to the hardware layout design.
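A runnable version of this template using TVM's TensorIR schedule may look as follows; the 128x128 shape, the alignment factors 32 and 4, and the concrete thread axes are assumptions, with "threadxxx" above standing for a concrete axis such as "threadIdx.x":

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class Mod:
    @T.prim_func
    def main(A: T.Buffer((128, 128), "float32"), B: T.Buffer((128, 128), "float32")):
        for j, k in T.grid(128, 128):
            with T.block("B"):
                vj, vk = T.axis.remap("SS", [j, k])
                B[vj, vk] = A[vj, vk] * 2.0

sch = tvm.tir.Schedule(Mod)
j, k = sch.get_loops(sch.get_block("B"))
j0, j1 = sch.split(j, factors=[None, 32])  # j_align_factor = 32 (assumed)
k0, k1 = sch.split(k, factors=[None, 4])   # k_align_factor = 4 (assumed)
sch.reorder(j0, k0, j1, k1)
sch.bind(j0, "blockIdx.x")
sch.bind(j1, "threadIdx.x")                # the "threadxxx" binding
sch.mod.show()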
For data type FP32 with intra-block rearrangement, the schedule template is as follows:
within a block: rearrange data via cache_read into shared memory;
between blocks: follow the preceding logic, and write out with a final stm.
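A TE-level sketch of this intra-block path (the transpose computation, the 16x16 block, and the thread bindings are assumptions): the block is staged into shared memory via cache_read, rearranged there, and written out once.

import tvm
from tvm import te

A = te.placeholder((64, 64), name="A", dtype="float32")
B = te.compute((64, 64), lambda i, j: A[j, i], name="B")  # intra-block transpose

s = te.create_schedule(B.op)
AA = s.cache_read(A, "shared", [B])  # stage the data in shared memory
i, j = s[B].op.axis
io, ii = s[B].split(i, factor=16)
jo, ji = s[B].split(j, factor=16)
s[B].reorder(io, jo, ii, ji)
s[AA].compute_at(s[B], jo)           # rearrange within each (io, jo) block
s[B].bind(io, te.thread_axis("blockIdx.x"))
s[B].bind(ii, te.thread_axis("threadIdx.x"))
print(tvm.lower(s, [A, B], simple_mode=True))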
Embodiment two:
FIG. 4 is a second schematic diagram of the processing logic of the operator processing method according to the present invention. Referring to FIG. 4, input, memory, unary, binary, and output layers of the neural network model are shown in (a).
When the data type applicable to the TVM community implementation is Fp32 while the data types of the memory op, unary op, and binary op are BF16, a cast operator and a layout_convert operator are inserted at the head and the tail of (a), yielding the computation graph shown in (b).
The cast operator below input is used to convert the data types of the memory ops, unary ops, and binary ops from BF16 to Fp32; the layout_convert operator is used to convert Col-Major into ByteObject; the intermediate memory ops, unary ops, and binary ops operate on ByteObject, and operator fusion is realized by multiplexing the TVM community CPU logic.
The layout_convert operator above output is used to convert ByteObject back to Col-Major; the cast operator is used to convert the data types of the memory ops, unary ops, and binary ops from Fp32 back to BF16.
Operator fusion is performed on all operators to obtain the computation graph shown in (c), where fused op denotes the final fusion operator. Fusion operator code is then generated automatically to realize a single coordinate conversion, namely BF16 Col-Major → BF16 Col-Major. Here, the single coordinate conversion is a layout-aware coordinate conversion and corresponds to the generation of the vector instructions.
The operator processing apparatus provided by the present invention is described below; the operator processing apparatus described below and the operator processing method described above may be referred to in correspondence with each other. Fig. 5 is a schematic structural diagram of an operator processing apparatus according to the present invention; as shown in fig. 5, the operator processing apparatus 500 includes: a first generation module 501, a first fusion module 502, and a second fusion module 503, wherein:
The first generation module 501 is configured to perform memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operator to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, and M is a positive integer greater than or equal to 1;
the first fusion module 502 is configured to fuse the M second operators to generate a first fusion operator;
a second fusion module 503, configured to perform operator fusion based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator, where the target operator is used to execute a target operation; the target operation is associated with an instruction mapping.
The operator processing device provided by the invention performs memory rearrangement processing on M first operators in a neural network model through a first memory arrangement operator to generate M second operators with a linear memory arrangement, fuses the M second operators to generate a first fusion operator, and then performs operator fusion based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator. In this way, operators under different memory arrangements are fused and the target operator is used to execute the target operation associated with instruction mapping, so a separate implementation of each first operator no longer needs to be supported for every hardware-related scenario and CPU logic can be multiplexed, which greatly reduces the workload of operator development and improves the performance of operator execution.
Optionally, the apparatus further comprises:
the obtaining module is used for obtaining a calculation graph corresponding to the neural network model, wherein the calculation graph comprises the M first operators.
Optionally, the first generation module 501 is further configured to:
inserting the first memory arrangement operators into the head of the computation graph, and performing memory rearrangement processing on the M first operators by using the first memory arrangement operators to generate the M second operators.
Optionally, the second fusion module 503 is further configured to:
inserting the second memory arrangement operator at the tail part of the calculation graph, and performing memory rearrangement processing on the first fusion operator based on the second memory arrangement operator to generate a second fusion operator; the memory arrangement of the second fusion operator is nonlinear memory arrangement;
and performing operator fusion on the first memory arrangement operator, the second fusion operator and the second memory arrangement operator to generate the target operator.
Optionally, the apparatus further comprises:
the determining module is used for determining physical coordinates of the target operator in hardware equipment based on the target operator;
the second generation module is used for performing instruction mapping based on the schedule template and the physical coordinates to generate a read-write instruction;
And the read-write module is used for performing read-write operation on the data associated with the target operator in the hardware equipment based on the read-write instruction.
Optionally, the read-write module is further configured to:
based on the memory arrangement of the hardware equipment, the data associated with the target operator are segmented to generate a plurality of data blocks;
and based on the read-write instruction, performing read-write operation on at least one data block in the hardware equipment.
Optionally, the read-write module is further configured to:
under the condition that read-write operation is carried out on a plurality of data blocks at the same time, carrying out read-write operation on each data block based on vector instructions; the vector instruction is used for calling a plurality of threads and simultaneously performing read-write operation on a plurality of data.
Optionally, the read-write module is further configured to:
under the condition of performing read-write operation on internal data of any data block, writing the internal data into a shared memory area to perform data rearrangement to generate target data; based on a vector instruction, performing read-write operation on the target data in the shared memory area;
or alternatively,
based on a target read-write instruction, performing read-write operation on the internal data; the target read-write instruction is used for calling a single thread to perform read-write operation on single data.
Optionally, the vector instruction includes at least one of:
a data load ldm instruction;
the data store stm instruction.
Optionally, the target read-write instruction includes at least one of the following:
a data load ld instruction;
the data stores st instructions.
Optionally, the apparatus further comprises:
the first inserting module is used for inserting a third operator into the head of the computation graph, the third operator comprises the memory arrangement operator and a fourth operator, and the fourth operator is used for executing an operation related to data transformation reshape;
and the processing module is used for processing the data associated with the M first operators by utilizing the third operator.
Optionally, the apparatus further comprises:
the second inserting module is used for inserting a fifth operator into the head part and the tail part of the calculation graph, and the fifth operator is used for converting the data type of the data related to the first operator;
and the conversion module is used for converting the data type of the data associated with the first operator into the target data type by utilizing the fifth operator.
Optionally, the first fusion module 502 is further configured to:
fuse the M second operators by multiplexing the operation logic of the central processing unit (CPU) to generate the first fusion operator.
Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, and memory 630 communicate with each other via communication bus 640. Processor 610 may call logic instructions in memory 630 to perform an operator processing method comprising: performing memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operators to generate M second operators; the memory arrangement of each second operator is linear memory arrangement, and M is a positive integer greater than or equal to 1; fusing the M second operators to generate a first fusion operator; performing operator fusion based on the first memory arrangement operator, the first fusion operator and the second memory arrangement operator to generate a target operator, wherein the target operator is used for executing target operation; the target operation is associated with an instruction map.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art or as a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the operator processing method provided by the methods described above, the method comprising: performing memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operators to generate M second operators; the memory arrangement of each second operator is linear memory arrangement, and M is a positive integer greater than or equal to 1; fusing the M second operators to generate a first fusion operator; performing operator fusion based on the first memory arrangement operator, the first fusion operator and the second memory arrangement operator to generate a target operator, wherein the target operator is used for executing target operation; the target operation is associated with an instruction map.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the operator processing method provided by the above methods, the method comprising: performing memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operators to generate M second operators; the memory arrangement of each second operator is linear memory arrangement, and M is a positive integer greater than or equal to 1; fusing the M second operators to generate a first fusion operator; performing operator fusion based on the first memory arrangement operator, the first fusion operator and the second memory arrangement operator to generate a target operator, wherein the target operator is used for executing target operation; the target operation is associated with an instruction map.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the above technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (17)

1. An operator processing method, comprising:
performing memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operators to generate M second operators; the memory arrangement of each second operator is linear memory arrangement, and M is a positive integer greater than or equal to 1;
fusing the M second operators to generate a first fusion operator;
performing operator fusion based on the first memory arrangement operator, the first fusion operator and the second memory arrangement operator to generate a target operator, wherein the target operator is used for executing target operation; the target operation is associated with an instruction map.
2. The operator processing method according to claim 1, wherein before the memory rearrangement processing is performed on the M first operators in the neural network model based on the first memory arrangement operator, the method further comprises:
and obtaining a calculation graph corresponding to the neural network model, wherein the calculation graph comprises the M first operators.
3. The operator processing method according to claim 2, wherein the performing memory rearrangement processing on the M first operators in the neural network model based on the first memory arrangement operator to generate M second operators includes:
inserting the first memory arrangement operators into the head of the computation graph, and performing memory rearrangement processing on the M first operators by using the first memory arrangement operators to generate the M second operators.
4. The operator processing method according to claim 2, wherein the performing operator fusion based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator includes:
inserting the second memory arrangement operator at the tail part of the calculation graph, and performing memory rearrangement processing on the first fusion operator based on the second memory arrangement operator to generate a second fusion operator; the memory arrangement of the second fusion operator is nonlinear memory arrangement;
and performing operator fusion on the first memory arrangement operator, the second fusion operator and the second memory arrangement operator to generate the target operator.
5. The operator processing method according to claim 1, characterized in that the method further comprises:
determining physical coordinates of the target operator in hardware equipment based on the target operator;
performing instruction mapping based on the schedule template and the physical coordinates to generate a read-write instruction;
and based on the read-write instruction, performing read-write operation on the data associated with the target operator in the hardware equipment.
6. The operator processing method according to claim 5, wherein the performing, in the hardware device, read-write operation on the data associated with the target operator based on the read-write instruction includes:
based on the memory arrangement of the hardware equipment, the data associated with the target operator are segmented to generate a plurality of data blocks;
and based on the read-write instruction, performing read-write operation on at least one data block in the hardware equipment.
7. The operator processing method according to claim 6, wherein said performing, based on the read-write instruction, a read-write operation on at least one data block in the hardware device includes:
Under the condition that read-write operation is carried out on a plurality of data blocks at the same time, carrying out read-write operation on each data block based on a vector instruction; the vector instruction is used for calling a plurality of threads and simultaneously performing read-write operation on a plurality of data.
8. The operator processing method according to claim 6, wherein the performing, based on the read-write instruction, a read-write operation on at least one data block in the hardware device includes any one of:
under the condition that a read-write operation is performed on internal data of any data block, writing the internal data into a shared memory area for data rearrangement to generate target data; and performing, based on a vector instruction, a read-write operation on the target data in the shared memory area;
or,
performing, based on a target read-write instruction, a read-write operation on the internal data; the target read-write instruction is used for calling a single thread to perform a read-write operation on a single data item.
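The two alternatives of claim 8 can be contrasted in one function: either stage the strided internal data through a shared memory area so a single vector access covers it, or let one thread issue a scalar access per element. The sketch below is illustrative; the stride parameter and the NumPy stand-ins for shared memory are assumptions:

```python
import numpy as np

def access_block(block: np.ndarray, stride: int, vectorize: bool) -> np.ndarray:
    if vectorize:
        # Path (a): write the strided internal data into "shared memory",
        # rearranged to be contiguous (the target data), then one
        # vector-style read covers it.
        shared = np.ascontiguousarray(block[::stride])  # rearrangement on write
        return shared.copy()                            # single vector read
    # Path (b): a single thread issues one scalar ld/st per element.
    out = np.empty((len(block) + stride - 1) // stride, dtype=block.dtype)
    for i, j in enumerate(range(0, len(block), stride)):
        out[i] = block[j]                               # scalar load + store
    return out

block = np.arange(64, dtype=np.float32)
assert np.array_equal(access_block(block, 4, True), access_block(block, 4, False))
```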
9. The operator processing method according to claim 7 or 8, wherein the vector instruction comprises at least one of:
a data load ldm instruction;
a data store stm instruction.
10. The operator processing method according to claim 8, wherein the target read-write instruction comprises at least one of:
a data load ld instruction;
a data store st instruction.
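Claims 9 and 10 only name the four instructions; as a reading aid, their granularity can be tabulated. The glosses are editorial, not definitions quoted from the patent:

```python
# Editorial gloss of the instruction names in claims 9 and 10.
INSTRUCTION_GLOSS = {
    "ldm": ("load",  "vector: many threads, many data per issue"),
    "stm": ("store", "vector: many threads, many data per issue"),
    "ld":  ("load",  "scalar: one thread, one data item per issue"),
    "st":  ("store", "scalar: one thread, one data item per issue"),
}
```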
11. The operator processing method according to any one of claims 2 to 4, wherein, in a case that the data associated with the M first operators exceeds a hardware device limit, the method further comprises:
inserting a third operator at the head of the computation graph, wherein the third operator comprises the memory arrangement operator and a fourth operator, and the fourth operator is used for executing an operation related to data transformation reshape;
and processing the data associated with the M first operators by using the third operator.
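Claim 11's third operator can be read as a guard for oversized data: reshape into hardware-sized pieces (the fourth operator), then fix up the memory arrangement. A hedged sketch, with the limit value assumed for illustration:

```python
import numpy as np

HW_MAX_ELEMS = 1 << 20   # assumed per-launch hardware limit, for illustration

def third_operator(data: np.ndarray) -> np.ndarray:
    if data.size <= HW_MAX_ELEMS:
        return data
    # Fourth operator: reshape the oversized tensor into hardware-sized rows.
    pad = (-data.size) % HW_MAX_ELEMS
    flat = np.pad(data.reshape(-1), (0, pad))
    chunked = flat.reshape(-1, HW_MAX_ELEMS)
    # Memory arrangement step: hand the device a contiguous buffer per chunk.
    return np.ascontiguousarray(chunked)

out = third_operator(np.zeros(3_000_000, dtype=np.float32))
print(out.shape)   # (3, 1048576): each row now fits the assumed limit
```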
12. The operator processing method according to any one of claims 2 to 4, wherein, in a case that the data type of the data associated with the first operator does not conform to a target data type, the method further comprises:
inserting a fifth operator at the head and the tail of the computation graph, wherein the fifth operator is used for converting the data type of the data associated with the first operator;
and converting the data type of the data associated with the first operator into the target data type by using the fifth operator.
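Claim 12's fifth operator is a pair of casts bracketing the graph. A minimal sketch, assuming float16 as the target data type purely for illustration:

```python
import numpy as np

TARGET_DTYPE = np.float16   # assumed target data type, for illustration

def cast_in(x: np.ndarray) -> np.ndarray:
    # Fifth operator at the graph head: convert inputs to the target type.
    return x if x.dtype == TARGET_DTYPE else x.astype(TARGET_DTYPE)

def cast_out(x: np.ndarray, original: np.dtype) -> np.ndarray:
    # Fifth operator at the graph tail: convert results back.
    return x.astype(original)

x = np.ones(8, dtype=np.float32)
y = cast_out(cast_in(x) * 2, x.dtype)   # fused body runs in TARGET_DTYPE
print(y.dtype)                           # float32
```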
13. The operator processing method according to any one of claims 1 to 8, wherein the fusing the M second operators to generate a first fusion operator includes:
fusing the M second operators by multiplexing the arithmetic logic of a central processing unit (CPU) to generate the first fusion operator.
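Claim 13's point is that, once every second operator sees a linear memory arrangement, the existing CPU arithmetic logic can be reused unchanged and merely composed. A hedged sketch with NumPy ufuncs standing in for the reused CPU kernels:

```python
import numpy as np

# Existing host-side (CPU) kernels, reused as-is because every second
# operator sees a plain linear memory arrangement.
cpu_kernels = [np.negative, np.abs, np.sqrt]

def first_fusion_operator(x: np.ndarray) -> np.ndarray:
    for kernel in cpu_kernels:
        x = kernel(x)          # multiplex the same arithmetic logic
    return x

print(first_fusion_operator(np.array([-4.0, 9.0])))   # [2. 3.]
```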
14. An operator processing apparatus, comprising:
a first generation module, used for performing memory rearrangement processing on M first operators in a neural network model based on a first memory arrangement operator to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, and M is a positive integer greater than or equal to 1;
a first fusion module, used for fusing the M second operators to generate a first fusion operator;
a second fusion module, used for performing operator fusion based on the first memory arrangement operator, the first fusion operator and a second memory arrangement operator to generate a target operator, wherein the target operator is used for executing a target operation; the target operation is associated with instruction mapping.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the operator processing method according to any one of claims 1 to 13.
16. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the operator processing method according to any of claims 1 to 13.
17. A computer program product comprising a computer program which, when executed by a processor, implements the operator processing method of any one of claims 1 to 13.
Priority Applications (1)

Application Number: CN202310860789.7A · Priority Date: 2023-07-13 · Filing Date: 2023-07-13 · Title: Operator processing method and device, electronic equipment and storage medium · Status: Pending
Publications (1)

Publication Number: CN116796289A · Publication Date: 2023-09-22

Family ID: 88044910


Country Status (1): CN, publication CN116796289A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information
Country or region after: China
Address after: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai
Applicant after: Shanghai Bi Ren Technology Co.,Ltd.
Address before: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai
Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.
Country or region before: China