CN114492781A - Hardware accelerator, data processing method, system, equipment and medium - Google Patents

Hardware accelerator, data processing method, system, equipment and medium

Info

Publication number
CN114492781A
CN114492781A (application number CN202210340279.2A)
Authority
CN
China
Prior art keywords
neural network
instruction
data
network operation
operation instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210340279.2A
Other languages
Chinese (zh)
Inventor
曹其春
董刚
胡克坤
杨宏斌
尹文枫
王斌强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210340279.2A priority Critical patent/CN114492781A/en
Publication of CN114492781A publication Critical patent/CN114492781A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)

Abstract

The application discloses a hardware accelerator and a data processing method, system, equipment and medium. The method comprises: obtaining a neural network operation instruction; splitting the neural network operation instruction into a convolution instruction and other instructions; acquiring feature data and filter data corresponding to the neural network operation instruction, and partitioning the feature data and the filter data to obtain block data; and performing parallel operation on the block data based on the convolution instruction and the other instructions to obtain a target operation result. In the application, after the hardware accelerator acquires the neural network operation instruction, it splits the instruction into a convolution instruction and other instructions, partitions the feature data and filter data corresponding to the instruction to obtain block data, and finally operates on the block data in parallel based on the convolution instruction and the other instructions, so that the target operation result can be obtained quickly and with high efficiency.

Description

Hardware accelerator, data processing method, system, equipment and medium
Technical Field
The present application relates to the field of neural network technologies, and in particular, to a hardware accelerator, and a data processing method, system, device, and medium.
Background
With the development of artificial intelligence in fields such as agriculture, finance, security, health care and manufacturing, there is an urgent demand for algorithms that compute faster, with higher precision and lower power consumption. The CNN (convolutional neural network), one of the most important representatives in the field of artificial intelligence algorithms, has made many breakthroughs in image analysis and processing and has been widely applied to various image-related applications.
However, due to the special computation pattern of the CNN, general-purpose processors are not efficient at implementing CNNs and cannot meet the performance requirements. Therefore, various hardware accelerators designed based on FPGA (Field-Programmable Gate Array), GPU (graphics processing unit) and even ASIC (Application Specific Integrated Circuit) have recently been proposed to improve the performance of CNN designs. If the hardware accelerator architecture is not carefully designed, its computational throughput will not match the memory bandwidth provided by the FPGA platform, which means that performance will degrade due to under-utilization of logic resources or memory bandwidth.
In summary, how to improve the operation efficiency of the hardware accelerator on the neural network is a problem to be solved urgently by those skilled in the art.
Disclosure of Invention
The application aims to provide a data processing method that can, to a certain extent, solve the technical problem of improving the operation efficiency of a hardware accelerator on a neural network. The application also provides a hardware accelerator, a data processing system, a device and a computer readable storage medium.
In order to achieve the above purpose, the present application provides the following technical solutions:
a data processing method is applied to a hardware accelerator and comprises the following steps:
acquiring a neural network operation instruction;
splitting the neural network operation instruction into a convolution instruction and other instructions;
acquiring feature data and filter data corresponding to the neural network operation instruction, and blocking the feature data and the filter data to obtain block data;
and performing parallel operation on the block data based on the convolution instruction and the other instructions to obtain a target operation result.
Preferably, the splitting the neural network operation instruction into a convolution instruction and other instructions includes:
and splitting the neural network operation instruction into the convolution instruction and the other instructions according to the channel correlation.
Preferably, the other instructions include a pooling instruction, an activation instruction, a splicing instruction, and a splitting instruction.
Preferably, the obtaining of the neural network operation instruction includes:
acquiring the neural network operation instruction;
the neural network operation instruction comprises a current node number, a parent node type, a child node number, a child node type, a batch size, a weight kernel size, a padding number in the height direction, a padding number in the width direction, a stride, an input width, an input height, an input channel number, an output channel number, an input feature map address, a weight address, a quantization parameter address, an output address and the size of the calculation block.
Preferably, the obtaining of the neural network operation instruction includes:
acquiring a neural network computational graph in a json file format;
and reading and parsing the neural network computation graph based on python to obtain the neural network operation instruction in dict format.
A data processing system for use with a hardware accelerator, comprising:
the first acquisition module is used for acquiring a neural network operation instruction;
the first splitting module is used for splitting the neural network operation instruction into a convolution instruction and other instructions;
the second acquisition module is used for acquiring feature data and filter data corresponding to the neural network operation instruction, and blocking the feature data and the filter data to obtain block data;
and the first operation module is used for operating the block data in parallel based on the convolution instruction and the other instructions to obtain a target operation result.
A data processing apparatus comprising:
a memory for storing a computer program;
a processor for implementing the steps of the data processing method as described above when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the data processing method according to any one of the preceding claims.
A hardware accelerator, comprising:
the memory is used for acquiring and storing the neural network operation instruction, feature data and filter data;
the splitter is used for splitting the neural network operation instruction into a convolution instruction and other instructions; partitioning the feature data and the filter data to obtain block data;
the convolution arithmetic unit is used for carrying out parallel arithmetic on the block data based on the convolution instruction to obtain a target arithmetic result;
and the other arithmetic units are used for parallelly operating the block data based on the other instructions to obtain a target operation result.
Preferably, the convolution arithmetic unit is formed based on a DSP array core; the other operators are constructed based on tensor ALUs.
Preferably, the method further comprises the following steps:
and the buffer is used for buffering data.
The data processing method provided by the application is applied to a hardware accelerator and comprises: obtaining a neural network operation instruction; splitting the neural network operation instruction into a convolution instruction and other instructions; acquiring feature data and filter data corresponding to the neural network operation instruction, and partitioning the feature data and the filter data to obtain block data; and performing parallel operation on the block data based on the convolution instruction and the other instructions to obtain a target operation result. In the application, after the hardware accelerator acquires the neural network operation instruction, it splits the instruction into a convolution instruction and other instructions, partitions the feature data and filter data corresponding to the instruction to obtain block data, and finally operates on the block data in parallel based on the convolution instruction and the other instructions, so that the target operation result can be obtained quickly and with high efficiency. The hardware accelerator, the data processing system, the data processing device and the computer readable storage medium provided by the application solve the corresponding technical problems.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network computational graph;
FIG. 3 is a schematic diagram of type structure;
fig. 4 is a schematic structural diagram of a data processing system according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a hardware accelerator according to an embodiment of the present application;
FIG. 6 is a diagram illustrating data transmission of a hardware accelerator according to an embodiment of the present application;
FIG. 7 is a schematic diagram of data processing of a convolution operator;
fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 9 is another schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a data processing method according to an embodiment of the present disclosure.
The data processing method provided by the embodiment of the application is applied to a hardware accelerator and can comprise the following steps:
step S101: and acquiring a neural network operation instruction.
In practical applications, the hardware accelerator may obtain the neural network operation instruction first, and the type and content of the neural network operation instruction may be determined according to actual needs, which is not specifically limited herein.
In a specific application scenario, in order to facilitate processing of the neural network operation instruction, the acquired neural network operation instruction may comprise a current node number, a parent node type, a child node number, a child node type, a batch size, a weight kernel size, a padding number in the height direction, a padding number in the width direction, a stride, an input width, an input height, an input channel number, an output channel number, an input feature map address, a weight address, a quantization parameter address, an output address and the size of the calculation block.
In a specific application scenario, in the process of obtaining the neural network operation instruction, the neural network operation instruction may be obtained based on a neural network computation graph. Specifically, the neural network computation graph in json (JavaScript Object Notation) file format may be acquired, and the neural network computation graph may be read and parsed based on python to obtain the neural network operation instruction in dict (dictionary) format. Python was designed by Guido van Rossum at the Dutch national research institute for mathematics and computer science (CWI) in the early 1990s as a successor to a language named ABC.
It should be noted that the generation manner of the neural network computation graph may be determined according to actual needs. For example, the neural network computation graph may be generated based on ONNX (Open Neural Network Exchange): for a standard IR (intermediate representation) such as ONNX, the parameter information of each operation is parsed, and some operations are transformed and fused. For example, the shape information of the inputs and outputs of an operation is converted into corresponding parameters such as input length, width, channel and kernel size in the HW Graph IR; operations such as Batch Normalization (BN), Scale and add_bias (bias addition) are fused into the convolution operation; input and output addresses, parent and child node numbers, block sizes and the like are calculated; and the model file is thereby converted into a unified neural network computation graph supported by the hardware. Furthermore, ONNX is an open format for representing deep neural network models, introduced by Microsoft and Facebook in 2017; it is currently supported by major frameworks such as Caffe2, PyTorch and Apache MXNet, and other frameworks such as TensorFlow also have open-source scripts providing conversion.
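The fusion of Batch Normalization and Scale into the preceding convolution mentioned above follows the standard per-channel folding rule. The following is a minimal numpy sketch of that folding; the function and variable names are illustrative and not taken from the patent:

```python
import numpy as np

def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a per-channel BatchNorm (and Scale) into conv weights/bias.

    weight: (out_ch, in_ch, kh, kw) convolution kernel
    bias:   (out_ch,) convolution bias (zeros if the conv has none)
    gamma, beta, mean, var: (out_ch,) BN affine parameters and statistics
    """
    scale = gamma / np.sqrt(var + eps)                  # per output channel
    folded_weight = weight * scale[:, None, None, None]
    folded_bias = (bias - mean) * scale + beta
    return folded_weight, folded_bias

# Example: a 64-channel 7x7 conv followed by BN collapses into a single conv.
w = np.random.randn(64, 3, 7, 7).astype(np.float32)
b = np.zeros(64, dtype=np.float32)
gamma, beta = np.ones(64, np.float32), np.zeros(64, np.float32)
mean, var = np.zeros(64, np.float32), np.ones(64, np.float32)
folded_w, folded_b = fold_bn_into_conv(w, b, gamma, beta, mean, var)
```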
For the sake of understanding, assuming that the neural network computational graph is shown in fig. 2, the resulting neural network operation instruction may be: {"runid": 3, "parents": [1,2], "parents_type": [0x000,0x000], "children": [4,5], "children_type": [0x000,0x000], "batch_size": 1, "type": 0x030, "kernel_size": 7, "h_pad": 3, "v_pad": 3, "stride": 2, "input_width": 224, "input_height": 224, "input_channel": 3, "output_channel": 64, "input_addr": [0x20000,0x30000], "filter_addr": 0x130000, "quant_addr": 0x230000, "output_addr": [0x40000], "block_size": 64}. Here node 3 in the instruction represents an eltwise, node 1 represents the first conv2d input to the eltwise, node 2 represents the second conv2d input to the eltwise, node 4 represents the first conv2d that receives the eltwise output, and node 5 represents the second conv2d that receives the eltwise output.
Where runid denotes the current node number; parents denotes the parent node numbers; parents_type denotes the parent node types (type codes); children denotes the child node numbers; children_type denotes the child node types (type codes); batch_size denotes the batch size; kernel_size denotes the weight kernel size; h_pad denotes the padding number in the height direction; v_pad denotes the padding number in the width direction; stride denotes the stride; input_width denotes the input width; input_height denotes the input height; input_channel denotes the number of input channels; output_channel denotes the number of output channels; input_addr denotes the input feature map address, which is a list such as [0x20000,0x30000] when a residual is input; filter_addr denotes the weight address; quant_addr denotes the quantization parameter address; output_addr denotes the output address; and block_size denotes the size of the computation block, i.e. the number of parallel channels.
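As an illustrative sketch (not the patent's actual implementation), the json computation graph could be read with python and each node converted into a dict-format instruction carrying the fields listed above; the per-node key layout assumed here (a top-level "nodes" list with per-node keys such as "id") is a simplifying assumption:

```python
import json

def load_graph(path):
    """Read a neural network computation graph stored as a json file."""
    with open(path, "r") as f:
        return json.load(f)

def node_to_instruction(node):
    """Convert one graph node into a dict-format instruction.

    Only the resulting instruction fields follow the example instruction
    given in the text; the source-node key names are assumptions."""
    return {
        "runid": node["id"],
        "parents": node.get("parents", []),
        "parents_type": node.get("parents_type", []),
        "children": node.get("children", []),
        "children_type": node.get("children_type", []),
        "batch_size": node.get("batch_size", 1),
        "type": node["type"],
        "kernel_size": node.get("kernel_size", 0),
        "h_pad": node.get("h_pad", 0),
        "v_pad": node.get("v_pad", 0),
        "stride": node.get("stride", 1),
        "input_width": node["input_width"],
        "input_height": node["input_height"],
        "input_channel": node["input_channel"],
        "output_channel": node["output_channel"],
        "input_addr": node["input_addr"],
        "filter_addr": node.get("filter_addr"),
        "quant_addr": node.get("quant_addr"),
        "output_addr": node["output_addr"],
        "block_size": node.get("block_size", 64),
    }

# Hypothetical usage:
# graph = load_graph("network_graph.json")
# instructions = [node_to_instruction(n) for n in graph["nodes"]]
```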
It should be noted that the structure of the type field in the present application can be flexibly determined according to actual needs. For example, the structure of type can be as shown in fig. 3: type can be composed of a five-bit binary number, where the first four bits represent the general type of the neural network instruction and the last bit represents the specific type within that general type; the details can be as shown in Table 1.
TABLE 1 type instruction set Specification
It should be noted that the form of the neural network operation instruction may be determined according to actual needs; for example, it may be a binary instruction, which may be described as shown in Table 2.
TABLE 2 binary instruction type Specification
Wherein InvalidWait represents an invalid wait; InputFeatureAddress represents the input feature address; InputChannel represents the input channels; and OutputChannel represents the output channels.
Step S102: and splitting the neural network operation instruction into a convolution instruction and other instructions.
In practical application, after acquiring the neural network operation instruction, the hardware accelerator can split the neural network operation instruction into a convolution instruction and other instructions, and then process the convolution instruction and other instructions in parallel.
In a specific application scenario, because the convolution instruction is related to the channel and other instructions are not related to the channel, in the process of splitting the neural network operation instruction into the convolution instruction and other instructions, the neural network operation instruction can be split into the convolution instruction and other instructions according to the channel correlation.
In a specific application scenario, the other instructions may include instructions other than convolution in the neural network operation process, such as a pooling instruction, an activation instruction, a splicing instruction, a splitting instruction, and the like, and the application is not specifically limited herein.
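A minimal sketch of the splitting step is shown below; the dict-format instructions follow the example given earlier, while the set of type codes that denote convolution is a placeholder, since the actual encoding is defined in Table 1 (reproduced only as an image in the source):

```python
# Placeholder type codes: the real 5-bit encoding is given in Table 1 of the
# patent (available only as an image), so this set is purely illustrative.
CONV_TYPE_CODES = {0x010, 0x011}

def split_instructions(instructions):
    """Route dict-format instructions by channel correlation.

    Convolution instructions (channel-correlated) go to the convolution
    operator; the other instructions (pooling, activation, splicing,
    splitting, ...) go to the other operators."""
    conv_queue, other_queue = [], []
    for inst in instructions:
        if inst["type"] in CONV_TYPE_CODES:
            conv_queue.append(inst)
        else:
            other_queue.append(inst)
    return conv_queue, other_queue
```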
Step S103: acquiring feature data and filter data corresponding to the neural network operation instruction, and partitioning the feature data and the filter data to obtain block data.
In practical application, the neural network operation instruction cannot be processed without corresponding data, so that after the neural network operation instruction is split into a convolution instruction and other instructions, the hardware accelerator needs to acquire feature data and filter data corresponding to the neural network operation instruction, and block the feature data and the filter data to obtain block data. The specific blocking manner may be determined according to actual needs, and the present application is not specifically limited herein.
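One plausible blocking scheme, consistent with block_size being the number of parallel channels, is to split the feature and filter data along the input-channel dimension. The following numpy sketch illustrates this under that assumption (the patent does not fix a specific blocking manner):

```python
import numpy as np

def block_by_channel(feature, filters, block_size):
    """Split feature data (C, H, W) and filter data (K, C, kh, kw) into
    blocks of at most block_size input channels.

    block_size mirrors the "number of parallel channels" field of the
    instruction; channel-wise blocking is just one plausible choice."""
    channels = feature.shape[0]
    blocks = []
    for start in range(0, channels, block_size):
        stop = min(start + block_size, channels)
        blocks.append((feature[start:stop], filters[:, start:stop]))
    return blocks

# Example: 64 filters of size 7x7 over a 3-channel 224x224 input.
feature = np.random.randn(3, 224, 224).astype(np.float32)
filters = np.random.randn(64, 3, 7, 7).astype(np.float32)
block_data = block_by_channel(feature, filters, block_size=2)  # 2 blocks
```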
Step S104: and performing parallel operation on the block data based on the convolution instruction and other instructions to obtain a target operation result.
In practical application, after the hardware accelerator acquires feature data and filter data corresponding to a neural network operation instruction, and blocks the feature data and the filter data to obtain block data, the hardware accelerator can operate the block data in parallel based on a convolution instruction and other instructions to obtain a target operation result. It should be noted that, the block data may be operated in parallel based on a plurality of convolution instructions, may also be operated in parallel based on a convolution instruction and other instructions, may also be operated in parallel based on a plurality of other instructions, and the like, which is not limited in this application.
The data processing method provided by the application is applied to a hardware accelerator and comprises: obtaining a neural network operation instruction; splitting the neural network operation instruction into a convolution instruction and other instructions; acquiring feature data and filter data corresponding to the neural network operation instruction, and partitioning the feature data and the filter data to obtain block data; and performing parallel operation on the block data based on the convolution instruction and the other instructions to obtain a target operation result. In the application, after the hardware accelerator acquires the neural network operation instruction, it splits the instruction into a convolution instruction and other instructions, partitions the feature data and filter data corresponding to the instruction to obtain block data, and finally operates on the block data in parallel based on the convolution instruction and the other instructions, so that the target operation result can be obtained quickly and with high efficiency.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a data processing system according to an embodiment of the present disclosure.
The data processing system provided by the embodiment of the application is applied to a hardware accelerator, and may include:
a first obtaining module 101, configured to obtain a neural network operation instruction;
the first splitting module 102 is configured to split the neural network operation instruction into a convolution instruction and another instruction;
the second obtaining module 103 is configured to obtain feature data and filter data corresponding to the neural network operation instruction, and block the feature data and the filter data to obtain block data;
and the first operation module 104 is configured to perform parallel operation on the block data based on the convolution instruction and other instructions to obtain a target operation result.
The data processing system provided in the embodiment of the present application is applied to a hardware accelerator, and the first splitting module may include:
and the first splitting unit is used for splitting the neural network operation instruction into a convolution instruction and other instructions according to the channel correlation.
The data processing system provided by the embodiment of the application is applied to a hardware accelerator, and other instructions comprise a pooling instruction, an activating instruction, a splicing instruction and a splitting instruction.
The data processing system provided in the embodiment of the present application is applied to a hardware accelerator, and the first obtaining module may include:
the first acquisition unit is used for acquiring a neural network operation instruction;
the neural network operation instruction comprises a current node number, a parent node type, a child node number, a child node type, a batch size, a weight kernel size, a padding number in the height direction, a padding number in the width direction, a stride, an input width, an input height, an input channel number, an output channel number, an input feature map address, a weight address, a quantization parameter address, an output address and the size of the calculation block.
The data processing system provided in the embodiment of the present application is applied to a hardware accelerator, and the first obtaining module may include:
the second acquisition unit is used for acquiring the neural network calculation graph in the json file format;
and the first analysis unit is used for reading and parsing the neural network computation graph based on python to obtain the neural network operation instruction in dict format.
Referring to fig. 5 and fig. 6, fig. 5 is a schematic structural diagram of a hardware accelerator according to an embodiment of the present disclosure, and fig. 6 is a schematic data transmission diagram of the hardware accelerator according to the embodiment of the present disclosure.
The hardware accelerator provided in the embodiment of the present application may include:
the memory 11 is used for acquiring and storing a neural network operation instruction, feature data and filter data;
the splitter 12 is used for splitting the neural network operation instruction into a convolution instruction and other instructions; partitioning the feature data and the filter data to obtain block data;
the convolution arithmetic unit 13 is used for carrying out parallel operation on the block data based on a convolution instruction to obtain a target operation result;
and the other arithmetic unit 14 is used for carrying out parallel operation on the block data based on other instructions to obtain a target operation result.
In the hardware accelerator provided by the embodiment of the application, the convolution arithmetic unit can be formed based on a DSP array core; other operators may be constructed based on tensor ALU (arithmetic and logic unit).
In a specific application scenario, the splitter, the convolution operator and the other operators may communicate with each other via FIFO (First In First Out) queues and single-writer/single-reader SRAM (Static Random-Access Memory) memory blocks, so as to implement task-level pipeline parallelism. In addition, as shown in fig. 6, the convolution calculation can be divided into a plurality of blocks that participate in the calculation: after the calculation of Block1 is completed, the other modules can process the Block1 data block. As shown in fig. 7, once CONV1 has finished the Block1 data block, the other modules can process the Block1 data; as soon as the CONV1 operation is finished, CONV2 can process the Block1 data block, so the processing time of the other modules is hidden.
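The task-level pipeline can be illustrated with a small Python simulation using FIFO queues: the splitter feeds blocks to the convolution operator, and the other operators start on each block as soon as its convolution finishes, so their processing time is hidden. This is only a behavioural sketch of the scheduling idea, not the hardware implementation:

```python
import queue
import threading

def splitter(blocks, conv_q):
    """Producer: pushes block data into the FIFO feeding the conv operator."""
    for blk in blocks:
        conv_q.put(blk)
    conv_q.put(None)                      # end-of-stream marker

def conv_operator(conv_q, other_q):
    """Consumes blocks, performs the convolution part, forwards the result."""
    while (blk := conv_q.get()) is not None:
        result = f"conv({blk})"           # stand-in for the DSP-array compute
        other_q.put(result)
    other_q.put(None)

def other_operator(other_q, results):
    """Processes each block as soon as its convolution is done, so its
    latency is hidden behind the convolution of the next block."""
    while (res := other_q.get()) is not None:
        results.append(f"other({res})")   # stand-in for pooling/activation etc.

results = []
conv_q, other_q = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
threads = [
    threading.Thread(target=splitter, args=(["Block1", "Block2", "Block3"], conv_q)),
    threading.Thread(target=conv_operator, args=(conv_q, other_q)),
    threading.Thread(target=other_operator, args=(other_q, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # ['other(conv(Block1))', 'other(conv(Block2))', 'other(conv(Block3))']
```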
The hardware accelerator provided in the embodiment of the present application may further include: and the buffer is used for buffering data.
It should be noted that the type of the hardware accelerator provided in this application may be determined according to actual needs; for example, the hardware accelerator may be an FPGA (Field Programmable Gate Array). In this case, the hardware accelerator may perform data interaction with an external CPU (central processing unit) through a Runtime. For example, the Runtime may use C++ to read the device file of the FPGA, and pybind11 may be used to expose the C++ interaction functions to Python, so that Python can call the interface functions for CPU-FPGA interaction, implement different data preprocessing operations for different networks, write data into the FPGA, wait for return information, read the final result, and calculate the performance indexes of the network. In addition, the pressure of designing a hardware accelerator can be relieved by means of a hardware design template. For example, the hardware design template provides modularization for the user: the hardware data type, the memory architecture, the core dimension of the DSP (Digital Signal Processor) array, the hardware operators and the pipeline stages can be selectively modified; exposing multiple hardware design variants to a compiler stack facilitates compiler development; the core dimension of the DSP array can be modified to influence the utilization of hardware resources, and modifying the shapes of the input, weight and accumulator tensors of a DSP array core unit directly influences the number of multipliers to be instantiated and the width required by the SRAM ports. In addition, each data type can be customized to a different integer precision: the weight and input types may be 8 bits or less, and the accumulation type may be 32 bits or less; integer precision control allows a user to extend the arithmetic density on a chip when resources are limited.
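For illustration only, the Python side of such a Runtime could look like the sketch below. The module name fpga_runtime and its functions (write_instructions, write_data, wait_for_done, read_result) are assumptions introduced here for the example; the patent only states that the Runtime reads the FPGA device file in C++ and exposes the interaction functions to Python via pybind11:

```python
# Hypothetical Python-side use of a pybind11-wrapped C++ Runtime.  The module
# `fpga_runtime` and all of its functions are assumptions for illustration.
import numpy as np
import fpga_runtime  # assumed pybind11 extension module

def run_network(image, instructions):
    # Network-specific preprocessing done on the CPU side in Python.
    data = (image / 255.0).astype(np.float32)

    fpga_runtime.write_instructions(instructions)   # assumed API
    fpga_runtime.write_data(data.tobytes())         # assumed API
    fpga_runtime.wait_for_done()                    # wait for return information
    raw = fpga_runtime.read_result()                # read the final result

    return np.frombuffer(raw, dtype=np.float32)
```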
The application also provides a data processing device and a computer readable storage medium, which have the corresponding effects of the data processing method provided by the embodiment of the application. Referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure.
The data processing device provided by the embodiment of the application comprises a memory 201 and a processor 202, wherein a computer program is stored in the memory 201, and the processor 202 realizes the following steps when executing the computer program:
acquiring a neural network operation instruction;
splitting a neural network operation instruction into a convolution instruction and other instructions;
acquiring feature data and filter data corresponding to the neural network operation instruction, and partitioning the feature data and the filter data to obtain block data;
and performing parallel operation on the block data based on the convolution instruction and other instructions to obtain a target operation result.
The data processing device provided by the embodiment of the application comprises a memory 201 and a processor 202, wherein a computer program is stored in the memory 201, and the processor 202 realizes the following steps when executing the computer program: and splitting the neural network operation instruction into a convolution instruction and other instructions according to the channel correlation.
The data processing device provided by the embodiment of the application comprises a memory 201 and a processor 202, wherein a computer program is stored in the memory 201, and the processor 202 realizes the following steps when executing the computer program: other instructions include pooling instructions, activation instructions, splicing instructions, splitting instructions.
The data processing device provided by the embodiment of the application comprises a memory 201 and a processor 202, wherein a computer program is stored in the memory 201, and the processor 202 realizes the following steps when executing the computer program: acquiring a neural network operation instruction; the neural network operation instruction comprises a current node number, a parent node type, a child node number, a child node type, a batch size, a weight kernel size, a padding number in the height direction, a padding number in the width direction, a stride, an input width, an input height, an input channel number, an output channel number, an input feature map address, a weight address, a quantization parameter address, an output address and the size of the calculation block.
The data processing device provided by the embodiment of the application comprises a memory 201 and a processor 202, wherein a computer program is stored in the memory 201, and the processor 202 realizes the following steps when executing the computer program: acquiring a neural network computational graph in a json file format; and reading and parsing the neural network computation graph based on python to obtain the neural network operation instruction in dict format.
Referring to fig. 9, another data processing apparatus provided in the embodiment of the present application may further include: an input port 203 connected to the processor 202 and used for transmitting externally input commands to the processor 202; a display unit 204 connected to the processor 202 and used for displaying the processing results of the processor 202 to the outside; and a communication module 205 connected to the processor 202 and used for realizing communication between the data processing device and the outside. The display unit 204 may be a display panel, a laser scanning display, or the like; the communication methods adopted by the communication module 205 include, but are not limited to, Mobile High-Definition Link (MHL), Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), and wireless connections: wireless fidelity (WiFi), Bluetooth, Bluetooth Low Energy, and IEEE 802.11s-based communication technology.
A computer-readable storage medium is provided in an embodiment of the present application, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps:
acquiring a neural network operation instruction;
splitting a neural network operation instruction into a convolution instruction and other instructions;
acquiring feature data and filter data corresponding to the neural network operation instruction, and partitioning the feature data and the filter data to obtain block data;
and performing parallel operation on the block data based on the convolution instruction and other instructions to obtain a target operation result.
A computer-readable storage medium is provided in an embodiment of the present application, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps: and splitting the neural network operation instruction into a convolution instruction and other instructions according to the channel correlation.
A computer-readable storage medium provided in an embodiment of the present application stores a computer program, and when executed by a processor, the computer program implements the following steps: other instructions include pooling instructions, activation instructions, splicing instructions, splitting instructions.
A computer-readable storage medium is provided in an embodiment of the present application, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps: acquiring a neural network operation instruction; the neural network operation instruction comprises a current node number, a parent node type, a child node number, a child node type, a batch size, a weight kernel size, a padding number in the height direction, a padding number in the width direction, a stride, an input width, an input height, an input channel number, an output channel number, an input feature map address, a weight address, a quantization parameter address, an output address and the size of the calculation block.
A computer-readable storage medium is provided in an embodiment of the present application, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps: acquiring a neural network computational graph in a json file format; and reading and parsing the neural network computation graph based on python to obtain the neural network operation instruction in dict format.
The computer-readable storage media to which this application relates include Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage media known in the art.
For a description of a hardware accelerator, a data processing system, a device, and a related part in a computer readable storage medium provided in the embodiments of the present application, refer to a detailed description of a corresponding part in a data processing method provided in the embodiments of the present application, and are not described herein again. In addition, parts of the above technical solutions provided in the embodiments of the present application, which are consistent with the implementation principles of corresponding technical solutions in the prior art, are not described in detail so as to avoid redundant description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A data processing method is applied to a hardware accelerator and comprises the following steps:
acquiring a neural network operation instruction;
splitting the neural network operation instruction into a convolution instruction and other instructions;
acquiring feature data and filter data corresponding to the neural network operation instruction, and blocking the feature data and the filter data to obtain block data;
and performing parallel operation on the block data based on the convolution instruction and the other instructions to obtain a target operation result.
2. The method of claim 1, wherein splitting the neural network operation instruction into a convolution instruction and other instructions comprises:
and splitting the neural network operation instruction into the convolution instruction and the other instructions according to the channel correlation.
3. The method of claim 2, wherein the other instructions comprise a pooling instruction, an activation instruction, a stitching instruction, and a splitting instruction.
4. The method of claim 1, wherein said obtaining a neural network operation instruction comprises:
acquiring the neural network operation instruction;
the neural network operation instruction comprises a current node number, a parent node type, a child node number, a child node type, a batch size, a weight kernel size, a padding number in the height direction, a padding number in the width direction, a stride, an input width, an input height, an input channel number, an output channel number, an input feature map address, a weight address, a quantization parameter address, an output address and the size of the calculation block.
5. The method of claim 1, wherein the obtaining the neural network operation instruction comprises:
acquiring a neural network computational graph in a json file format;
and reading and parsing the neural network computation graph based on python to obtain the neural network operation instruction in dict format.
6. A data processing system, for application to a hardware accelerator, comprising:
the first acquisition module is used for acquiring a neural network operation instruction;
the first splitting module is used for splitting the neural network operation instruction into a convolution instruction and other instructions;
the second acquisition module is used for acquiring feature data and filter data corresponding to the neural network operation instruction, and blocking the feature data and the filter data to obtain block data;
and the first operation module is used for operating the block data in parallel based on the convolution instruction and the other instructions to obtain a target operation result.
7. A data processing apparatus, characterized by comprising:
a memory for storing a computer program;
a processor for implementing the steps of the data processing method according to any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the data processing method according to any one of claims 1 to 5.
9. A hardware accelerator, comprising:
the memory is used for acquiring and storing the neural network operation instruction, feature data and filter data;
the splitter is used for splitting the neural network operation instruction into a convolution instruction and other instructions; partitioning the feature data and the filter data to obtain block data;
the convolution arithmetic unit is used for carrying out parallel arithmetic on the block data based on the convolution instruction to obtain a target arithmetic result;
and the other arithmetic units are used for parallelly operating the block data based on the other instructions to obtain a target operation result.
10. The hardware accelerator of claim 9 wherein the convolution operator is constructed based on a DSP array kernel; the other operators are constructed based on tensor ALUs.
11. The hardware accelerator of claim 9, further comprising:
and the buffer is used for buffering data.
CN202210340279.2A 2022-04-02 2022-04-02 Hardware accelerator, data processing method, system, equipment and medium Pending CN114492781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210340279.2A CN114492781A (en) 2022-04-02 2022-04-02 Hardware accelerator, data processing method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210340279.2A CN114492781A (en) 2022-04-02 2022-04-02 Hardware accelerator, data processing method, system, equipment and medium

Publications (1)

Publication Number Publication Date
CN114492781A true CN114492781A (en) 2022-05-13

Family

ID=81488985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210340279.2A Pending CN114492781A (en) 2022-04-02 2022-04-02 Hardware accelerator, data processing method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN114492781A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982530A (en) * 2023-03-13 2023-04-18 苏州浪潮智能科技有限公司 Accelerator operation control method, system, storage medium, device and equipment
CN116167425A (en) * 2023-04-26 2023-05-26 浪潮电子信息产业股份有限公司 Neural network acceleration method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN107329734A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing convolutional neural networks forward operation
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
CN111488983A (en) * 2020-03-24 2020-08-04 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN112905954A (en) * 2020-12-28 2021-06-04 北京计算机技术及应用研究所 CNN model convolution operation accelerated calculation method using FPGA BRAM

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107329734A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing convolutional neural networks forward operation
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
CN111488983A (en) * 2020-03-24 2020-08-04 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN112905954A (en) * 2020-12-28 2021-06-04 北京计算机技术及应用研究所 CNN model convolution operation accelerated calculation method using FPGA BRAM

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DENG L等: "Model compression and hardware acceleration for neural networks: A comprehensive survey", 《PROCEEDINGS OF THE IEEE》 *
MITTAL S: "A survey of FPGA-based accelerators for convolutional neural networks", 《NEURAL COMPUTING AND APPLICATIONS》 *
YIN J Y等: "A CNN accelerator on embedded FPGA using dynamic reconfigurable coprocessor等", 《PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, INFORMATION PROCESSING AND CLOUD COMPUTING》 *
尹文枫等: "卷积神经网络压缩与加速技术研究进展", 《计算机系统应用》 *
徐欣等: "一种高度并行的卷积神经网络加速器设计方法", 《哈尔滨工业大学学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982530A (en) * 2023-03-13 2023-04-18 苏州浪潮智能科技有限公司 Accelerator operation control method, system, storage medium, device and equipment
CN116167425A (en) * 2023-04-26 2023-05-26 浪潮电子信息产业股份有限公司 Neural network acceleration method, device, equipment and medium
CN116167425B (en) * 2023-04-26 2023-08-04 浪潮电子信息产业股份有限公司 Neural network acceleration method, device, equipment and medium

Similar Documents

Publication Publication Date Title
US10338925B2 (en) Tensor register files
US20220012575A1 (en) Methods and apparatus for localized processing within multicore neural networks
US10372456B2 (en) Tensor processor instruction set architecture
CN114492781A (en) Hardware accelerator, data processing method, system, equipment and medium
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
US20210241095A1 (en) Deep learning processing apparatus and method, device and storage medium
CN107944545B (en) Computing method and computing device applied to neural network
US11816574B2 (en) Structured pruning for machine learning model
Daghero et al. Energy-efficient deep learning inference on edge devices
US20200226458A1 (en) Optimizing artificial neural network computations based on automatic determination of a batch size
Zhou et al. Addressing sparsity in deep neural networks
Lin et al. Accelerating large sparse neural network inference using GPU task graph parallelism
CN117032807A (en) AI acceleration processor architecture based on RISC-V instruction set
Trujillo et al. GSGP-CUDA—a CUDA framework for geometric semantic genetic programming
US20240037179A1 (en) Data processing method and apparatus
CN109902821B (en) Data processing method and device and related components
US20200192797A1 (en) Caching data in artificial neural network computations
JP2020021208A (en) Neural network processor, neural network processing method, and program
Mohaidat et al. A Survey on Neural Network Hardware Accelerators
CN114972775A (en) Feature processing method, feature processing device, feature processing product, feature processing medium, and feature processing apparatus
US20220108156A1 (en) Hardware architecture for processing data in sparse neural network
CN114365151A (en) Neural network model transformation method, device, server and storage medium
US20230325464A1 (en) Hpc framework for accelerating sparse cholesky factorization on fpgas
US20220051095A1 (en) Machine Learning Computer
US20230259703A1 (en) Electronic device and method for controlling the electronic device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220513

RJ01 Rejection of invention patent application after publication