CN115271047A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN115271047A
Authority
CN
China
Prior art keywords
matrix
cache
data
processor
neural network
Prior art date
Legal status
Pending
Application number
CN202110474504.7A
Other languages
Chinese (zh)
Inventor
姚棋中
项方品
吴辉阳
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110474504.7A priority Critical patent/CN115271047A/en
Priority to PCT/CN2022/082796 priority patent/WO2022227962A1/en
Publication of CN115271047A publication Critical patent/CN115271047A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with dedicated cache, e.g. instruction or stack
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons


Abstract

The application discloses a data processing method and device, and relates to the field of data processing of neural networks. The data processing method includes the following steps: in the operation process of the neural network, a first processor writes a matrix of at least one OP of a plurality of OPs of the neural network stored in a memory into a cache of the first processor according to the storage capacity of the cache, and then the first processor generates first data according to the matrix of the at least one OP stored in the cache. The storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the cache. According to the data processing method provided by the embodiment of the application, the first processor can pre-read the matrices of a plurality of OPs from the memory according to the storage capacity of the cache, so that the number of times the first processor reads OP matrices from the memory is reduced, the data reading time in the operation process of the neural network and the total operation time required by the neural network are reduced, and the operation efficiency of the neural network is improved.

Description

Data processing method and device
Technical Field
The present application relates to the field of data processing of neural networks, and in particular, to a data processing method and apparatus.
Background
Neural Networks (NN) are mathematical models for data processing formed by connecting a plurality of processing units (or neurons) by simulating the connection of human brain nerve cells. In NN, the computational steps performed by neurons are called Operations (OPs). At present, in the process of running a neural network by a processor, before the processor executes an OP, a data matrix and a weight matrix of a single OP are pre-read from a memory, the data matrix and the weight matrix of the single OP are stored in a cache, and a data processing result is obtained after all OPs of the neural network are traversed. However, the neural network generally includes a plurality of OPs, and the processor can only pre-read the matrix required by a single OP at a time, resulting in a large number of times of pre-reading data from the memory and a long data reading time in the process of executing the OPs by the processor. Therefore, how to reduce the time for the processor to read the matrix required by the NN and improve the operation efficiency of the NN is an urgent problem to be solved at present.
Disclosure of Invention
The application provides a data processing method and device, which solve the problem that, when the storage capacity of a cache can accommodate the matrices of a plurality of OPs, a first processor pre-reads only a single OP's matrix from a memory each time, so that the first processor reads OP matrices from the memory a large number of times and the data reading time is long.
In order to achieve the purpose, the following technical scheme is adopted in the application.
In a first aspect, an embodiment of the present application provides a data processing method, where the method is applicable to a first processor, or to a computing device that can support the first processor in implementing the method, for example, a computing device including a chip system. The data processing method includes: in the operation process of the neural network, the first processor writes a matrix of at least one OP of the plurality of OPs of the neural network stored in the memory into the cache according to the storage capacity of the cache of the first processor, and then the first processor generates first data according to the matrix of the at least one OP stored in the cache. The storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the cache. In the data processing method provided in the embodiment of the present application, the first processor may pre-read the matrices of at least one OP from the memory according to the storage capacity of the cache, which solves the problem that, when the storage capacity of the cache can accommodate the matrices of a plurality of OPs, the first processor pre-reads only a single OP's matrix from the memory each time, resulting in a large number of reads of OP matrices from the memory and a long data reading time.
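For illustration only, the following Python sketch shows the kind of pre-read decision described in the first aspect: the first processor batches the matrices of consecutive OPs into the cache as long as their total storage requirement does not exceed the storage capacity of the cache. The sketch is not the claimed implementation, and the names (prefetch_ops, op.matrices, nbytes, cache_capacity_bytes) are hypothetical.

```python
# Hypothetical sketch of the pre-read strategy: batch the matrices of as many
# consecutive OPs into the cache as its storage capacity allows.
def prefetch_ops(ops, cache_capacity_bytes):
    """Return the longest prefix of `ops` whose matrices fit in the cache together."""
    batch, used = [], 0
    for op in ops:
        required = sum(m.nbytes for m in op.matrices)  # data matrix and/or weight matrix
        if used + required > cache_capacity_bytes:
            break                       # the remaining OPs are pre-read in a later batch
        batch.append(op)
        used += required
    return batch                        # the matrices of these OPs are written to the cache
```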
In another optional implementation manner, the cache includes a first cache unit and a second cache unit, the storage capacity of the first cache unit is smaller than that of the second cache unit, and the data reading speed of the first cache unit is greater than that of the second cache unit. That the first processor writes a matrix of at least one OP of the plurality of OPs of the neural network stored in the memory into the cache includes: if the storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the second cache unit, the first processor judges whether the storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the first cache unit; if the storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the first cache unit, the first processor writes the matrix of the at least one OP stored in the memory into the first cache unit; if the storage space required by the matrix of the at least one OP is larger than the storage capacity of the first cache unit, the first processor writes the matrix of a part of the OPs in the at least one OP stored in the memory into the first cache unit, and writes the matrix of another part of the OPs in the at least one OP into the second cache unit, where the part of OPs have an association relationship of being executed continuously. The storage capacity of the first cache unit is smaller than that of the second cache unit, and the first processor writes the matrix of the at least one OP into the first cache unit when the storage capacity of the first cache unit is sufficient.
In another optional implementation manner, the matrix of the at least one OP includes a data matrix and/or a weight matrix, the data matrix is used for indicating the input data of the first OP in the at least one OP, and the weight matrix is used for indicating the weight of the data matrix.
In another optional implementation manner, the matrix of the at least one OP includes a matrix of a first OP and a matrix of a second OP, and the first OP and the second OP have an association relationship of being executed continuously; that the first processor generates the first data according to the matrix of the at least one OP stored in the cache includes: the first processor reads the matrix of the first OP stored in the cache; the first processor generates second data according to the matrix of the first OP, then writes the second data into the cache and deletes the matrix of the first OP stored in the cache; the first processor further reads the matrix of the second OP and the second data stored in the cache, and generates the first data according to the matrix of the second OP and the second data. The first processor may delete the matrix of the second OP stored in the cache and write the first data into the cache, so that the first processor can read the first data from the cache in the process of executing a subsequent OP of the neural network, thereby avoiding the first processor reading the first data from the memory, reducing the number of data transfers of the first data, and reducing the operation time of the neural network.
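As a sketch only (not the claimed implementation), the chained execution described above can be pictured as follows; a Python dict stands in for the cache, and all key names are hypothetical.

```python
# Hypothetical sketch of executing two continuously associated OPs while reusing
# the cache: each matrix is deleted from the cache once it is no longer needed.
def run_two_ops(cache, first_op, second_op):
    first_matrix = cache["first_op_matrix"]            # read the matrix of the first OP
    second_data = first_op(first_matrix)               # generate the second data
    cache["second_data"] = second_data                 # write the second data into the cache
    del cache["first_op_matrix"]                       # delete the matrix of the first OP
    second_matrix = cache["second_op_matrix"]          # read the matrix of the second OP
    first_data = second_op(second_matrix, second_data) # generate the first data
    del cache["second_op_matrix"]                      # delete the matrix of the second OP
    cache["first_data"] = first_data                   # keep the first data for subsequent OPs
    return first_data
```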
In another alternative implementation, the matrix of the first OP includes a data matrix of the first OP, and the data matrix of the first OP is used to indicate input data of the neural network. For example, in a case where the first OP is a first OP of a plurality of OPs of the neural network, the data matrix of the first OP may be input data of the neural network.
In another optional implementation manner, after the first processor generates the first data, the data processing method further includes: if the execution of the plurality of OPs is completed, the first processor outputs the first data as output data. For example, the first processor outputs the first data as output data, specifically including: the first processor sends first data to the second processor. For example, the sending, by the first processor, the first data to the second processor may specifically include: the first processor writes the first data into the memory and sends a task response to the second processor, wherein the task response indicates that the first processor has completed the operation process of the neural network, and the second processor reads the first data from the memory when receiving the task response. For another example, the task response may include first data, and the sending, by the first processor, the first data to the second processor may specifically include: the first processor sends a task response including the first data to the second processor.
In another optional implementation manner, before the first processor writes the matrix of at least one OP of the plurality of OPs of the neural network stored in the memory into the cache, the data processing method further includes: the first processor receives a task request sent by the second processor, wherein the task request is used for instructing the first processor to start a calculation task of the neural network. For example, if the first processor is an NN processor and the second processor is a Central Processing Unit (CPU), after the CPU reads all the matrices required by the neural network from the hard disk into the memory, the CPU issues a task request to the NN processor, so that the NN processor starts an operation task of the neural network according to the task request.
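For illustration only, the request/response handshake described in the two paragraphs above might look like the following sketch, where Python queues stand in for the interconnect between the two processors; all names are hypothetical.

```python
# Illustrative handshake between the second processor (e.g., a CPU) and the
# first processor (e.g., an NN processor); queues stand in for the bus.
import queue

task_requests, task_responses = queue.Queue(), queue.Queue()

def second_processor(memory, matrices_on_disk):
    memory["nn_matrices"] = matrices_on_disk           # read all required matrices into memory
    task_requests.put({"task": "run_neural_network"})  # issue the task request
    task_responses.get()                               # wait for the task response
    return memory["first_data"]                        # read the output data from memory

def first_processor(memory, run_network):
    task_requests.get()                                # the task request starts the operation task
    memory["first_data"] = run_network(memory["nn_matrices"])
    task_responses.put({"status": "done"})             # task response: computation completed
```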
In another optional implementation manner, the data processing method further includes: if the storage space required by the matrix of a third OP in the plurality of OPs is larger than the storage capacity of the cache, the first processor splits the matrix of the third OP into a plurality of sub-matrices, and then the first processor writes at least one sub-matrix of the plurality of sub-matrices stored in the memory into the cache, where the storage space required by the at least one sub-matrix is less than or equal to the storage capacity of the cache. It should be noted that, when the storage capacity of the cache is fixed, the smaller the granularity (data amount) of the sub-matrices, the larger the number of sub-matrices the cache can accommodate, and the higher the hit rate of the first processor in the cache. Therefore, when determining which sub-matrices of the third OP are written into the cache, the first processor may split the matrix of the third OP according to the storage capacity of the cache and the minimum specification of the first processor, and write the sub-matrices of the third OP into the cache according to the storage capacity of the cache, so as to improve the hit rate of the first processor reading data in the cache, thereby reducing the data reading time for the first processor to acquire the data required by the neural network and the time required by the first processor to operate the neural network, and improving the efficiency of the first processor in operating the neural network.
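A minimal sketch of the sub-matrix splitting idea, assuming NumPy arrays and a simple row-wise tiling; the helper name and the tiling policy are hypothetical and are chosen only so that each tile fits the given cache capacity.

```python
# Hypothetical sketch: split an oversized OP matrix row-wise into sub-matrices
# whose individual footprint does not exceed the cache capacity (assuming a
# single row already fits in the cache).
import numpy as np

def split_into_submatrices(matrix: np.ndarray, cache_capacity_bytes: int):
    rows_per_tile = max(1, cache_capacity_bytes // (matrix.shape[1] * matrix.itemsize))
    return [matrix[r:r + rows_per_tile]
            for r in range(0, matrix.shape[0], rows_per_tile)]
```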
In the embodiment of the present application, in order to improve the hit rate of the first processor reading data from the Cache, the first processor writes the sub-matrices stored in the memory into the Cache. In some possible examples, the same idea also applies between Cache levels: for example, if the first processor includes a multi-level Cache (e.g., an L1 Cache and an L2 Cache), the L2 Cache stores the matrices of 2 OPs of the neural network, and the storage spaces required by the matrices of the 2 OPs are all greater than the storage capacity of the L1 Cache, then, in the operation process of the neural network, the first processor may also split the matrices of the 2 OPs stored in the L2 Cache and write part of the sub-matrices of the 2 OPs into the L1 Cache, so as to improve the hit rate of the first processor reading data from the L1 Cache, reduce the data handling times and the data reading time of the first processor, further improve the efficiency of the first processor in operating the neural network, and reduce the processing delay of the neural network.
In a second aspect, an embodiment of the present application provides a data processing apparatus; for beneficial effects, reference may be made to the descriptions of any implementation of the first aspect, which are not repeated herein. The data processing apparatus has the functionality to implement the actions in the method examples of any of the above first aspects. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions. In one possible design, the data processing apparatus is applied to the first processor, and the data processing apparatus includes: a pre-reading module, configured to write a matrix of at least one OP of a plurality of OPs of the neural network stored in the memory into the cache according to the storage capacity of the cache of the first processor in the operation process of the neural network, where the storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the cache; and a processing module, configured to generate first data according to the matrix of the at least one OP stored in the cache.
In an optional implementation manner, the cache includes a first cache unit and a second cache unit, a storage capacity of the first cache unit is smaller than a storage capacity of the second cache unit, and a data reading speed of the first cache unit is greater than a data reading speed of the second cache unit. If the storage space required by the at least one OP matrix is smaller than or equal to the storage capacity of the second cache unit, the pre-reading module is specifically configured to determine whether the storage space required by the at least one OP matrix is smaller than or equal to the storage capacity of the first cache unit; the pre-reading module is specifically used for writing the at least one OP matrix stored in the memory into the first cache unit if the storage space required by the at least one OP matrix is less than or equal to the storage capacity of the first cache unit; the pre-reading module is specifically configured to, if a storage space required by a matrix of at least one OP is larger than a storage capacity of the first cache unit, write a matrix of a part of OPs in the at least one OP stored in the memory into the first cache unit, write a matrix of another part of OPs in the at least one OP into the second cache unit, where the part of OPs have continuously executed association relationships.
In another optional implementation manner, the matrix of the at least one OP includes a data matrix and/or a weight matrix, the data matrix is used for indicating the input data of the first OP in the at least one OP, and the weight matrix is used for indicating the weight of the data matrix.
In another optional implementation manner, the matrix of the at least one OP includes a matrix of a first OP and a matrix of a second OP, and the first OP and the second OP have an association relationship of being executed continuously. The processing module is specifically configured to read the matrix of the first OP stored in the cache; the processing module is specifically configured to generate second data according to the matrix of the first OP; the processing module is specifically configured to write the second data into the cache and delete the matrix of the first OP stored in the cache; the processing module is specifically configured to read the matrix of the second OP and the second data stored in the cache; and the processing module is specifically configured to generate the first data according to the matrix of the second OP and the second data.
In another alternative implementation, the matrix of the first OP includes a data matrix of the first OP, and the data matrix of the first OP is used to indicate input data of the neural network.
In another optional implementation manner, the processing module is further configured to write the first data into the cache after the first data is generated, and delete the matrix of the second OP stored in the cache.
In another optional implementation manner, the data processing apparatus further includes: a communication module; the communication module is used for outputting the first data as output data if the execution of the plurality of OPs is finished.
In another optional implementation manner, the communication module is specifically configured to send the first data to the second processor.
In another optional implementation manner, the communication module is further configured to receive a task request sent by the second processor before the pre-reading module writes the matrix of at least one OP of the plurality of OPs of the neural network stored in the memory into the cache, where the task request is used to instruct the first processor to start an operation task of the neural network.
In another optional implementation manner, the pre-reading module is further configured to split the matrix of a third OP into a plurality of sub-matrices if a storage space required by the matrix of the third OP in the plurality of OPs is larger than a storage capacity of a cache; the pre-reading module is further used for writing at least one sub-matrix in the plurality of sub-matrices stored in the memory into the cache, and the storage space required by the at least one sub-matrix is smaller than or equal to the storage capacity of the cache.
In a third aspect, an embodiment of the present application provides a chip, including a memory and a processor, where the processor is configured to, during an operation of a neural network, write a matrix of at least one OP of a plurality of OPs of the neural network stored in the memory into the memory according to a storage capacity of the memory, where a storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the memory; the processor is further configured to generate first data from the matrix of at least one OP stored in the memory. In one possible example, the memory is used for storing computer instructions, and the processor is used for calling and executing the computer instructions from the memory to perform the operation steps of the method in the first aspect and any possible implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computing device, which includes a processor and an interface circuit, where the interface circuit is configured to receive and transmit a signal from or to another computing device other than the computing device to the processor, and the processor is configured to implement, through a logic circuit or executing a code instruction, the operation steps of the data processing method according to any one of the first aspect and the possible implementation manners of the first aspect.
In a fifth aspect, the present application provides a computer-readable storage medium, in which a computer program or instructions are stored, and when the computer program or instructions are executed by a computing device or a processor or a chip, the operating steps of the data processing method of any one of the first aspect and the first possible implementation manner are implemented.
In a sixth aspect, the present application provides a computer program product for causing a computing device to perform the operational steps of the data processing method of any one of the possible implementations of the first aspect and the first aspect when the computer program product is run on a computer.
The present application may further combine to provide more implementation manners on the basis of the implementation manners provided by the above aspects.
Drawings
FIG. 1 is a schematic diagram of a neural network provided in the prior art;
fig. 2 is a first schematic diagram illustrating a data processing process according to the present application;
FIG. 3 is a block diagram of a data processing system according to the present application;
fig. 4 is a first flowchart illustrating a data processing method provided in the present application;
FIG. 5 is a first schematic diagram of a neural network provided herein;
fig. 6 is a second schematic diagram of a neural network provided in the present application;
fig. 7 is a schematic flowchart illustrating a data processing method according to the present application;
FIG. 8 is a second schematic diagram of a data processing process according to the present application;
FIG. 9 is a schematic structural diagram of a data processing apparatus provided in the present application;
fig. 10 is a schematic structural diagram of a computing device provided in the present application.
Detailed Description
For clarity and conciseness of the description of the embodiments described below, a brief introduction to the related art is first given.
The network layers in a neural network include an input layer, a hidden layer and an output layer: the input layer is used for acquiring the input data of the neural network, the hidden layer is used for extracting features according to the input data and the weight of each input data, and the output layer is used for outputting the processing result of the neural network according to the features extracted by the hidden layer and the weight of each feature. As shown in fig. 1, fig. 1 is a schematic diagram of a neural network provided in the prior art. The neural network includes an input layer 110, a hidden layer 120 and an output layer 130, where the hidden layer 120 may also be referred to as an intermediate layer, the circles shown in fig. 1 are the neurons of the neural network, and the connecting lines between the neurons of adjacent layers represent the weight values (also referred to as weights, parameters, and the like).
In the operation process of the neural network, input data X and weight (weight, W) are calculated and pass through a nonlinear function to obtain output data Z. The input data may also be referred to as an activation value (activation), and the output data Z may be a final output of the neural network or may be input data of a subsequent network layer of the neural network to continue performing calculations. Commonly used activation functions are linear rectification functions (ReLU), sigmoid functions, tanh, and the like. In one possible example, the operation process of the neural network can be expressed in vector and matrix form, such as: z = g (Wx).
where z = [z1, z2]^T, x = [x1, x2, x3]^T, and
W = [[w11, w12, w13], [w21, w22, w23]].
z is the output vector of the neuron, x is the input vector of the neuron, W is the weight matrix, the superscript T denotes the transpose, and g denotes the same computation performed on each element of the input vector/matrix.
Regarding the relationship among the input data, the weights and the output data of the neural network, the description is given by taking any two adjacent layers of neurons in the neural network shown in fig. 1 as an example. As shown in fig. 2, fig. 2 is a schematic diagram of a data processing process provided by the present application, where the input data is xi, i = 1, 2, 3, and the output data is zj, j = 1, 2. As shown in (a) of fig. 2, z1 = g(w11*x1 + w12*x2 + w13*x3); as shown in (b) of fig. 2, z2 = g(w21*x1 + w22*x2 + w23*x3).
If the neural network has multiple sets (multiple samples) of different input data, the vectors z and x can be expanded into matrix form: Z = g(WX),
where X = [x(1), x(2), …, x(N)] and Z = [z(1), z(2), …, z(N)], i.e., each column of X is the input vector of one sample and each column of Z is the corresponding output vector.
Z is the output data of the neurons, X is the data matrix of the input neurons, W is the weight matrix, "g(WX)" means that the same calculation g is performed on each element of the product of the weight matrix W and the input matrix X, and N is the number of input groups (number of samples).
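The following short NumPy example (illustrative only, with arbitrary values) reproduces the matrix form Z = g(WX) for the two-neuron, three-input case above, with g chosen as ReLU.

```python
# Worked example of Z = g(WX): a 2 x 3 weight matrix W, N = 4 input samples
# stored as the columns of X, and g = ReLU applied element-wise to WX.
import numpy as np

W = np.array([[0.1, -0.2, 0.3],
              [0.4,  0.5, -0.6]])   # weight matrix, one row per output neuron
X = np.random.rand(3, 4)            # data matrix, one column per sample
Z = np.maximum(W @ X, 0.0)          # Z = g(WX), shape 2 x 4
```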
The above computation process between two adjacent layers of neurons can be abstracted into performing two computation steps on the input data of the neurons: matrix multiplication and nonlinear transformation. In a neural network comprising a greater number of layers of neurons, the neural network performs a greater number of calculation steps on the initial input data of the neural network, where the data required by each calculation step includes the initial input data or the output data of a preceding calculation step, and may further include one or more weight matrices (e.g., for matrix multiplication) or no weight matrix at all (e.g., for nonlinear functions). Such a "calculation step" is referred to herein as an Operation (OP) of the neural network.
In the process of executing a computing task, a processor reads the data required by the computing task from a memory; because the data reading speed of the memory is low, the processor is generally provided with multiple levels of Caches. The data reading speed of a Cache is faster than that of the memory, but the Cache cost per unit of storage capacity is high, so the storage capacity of the Cache in the processor is generally small. In the architectural design of a processor, if the processor includes multiple levels of Caches, a Cache closer to the computing unit of the processor generally has a smaller storage capacity and a faster data reading speed.
Under the condition that a multi-level Cache is arranged in the processor, a computing unit in the processor first checks whether the data matrix and the weight matrix required by a computing task are stored in the Cache; if the data matrix and the weight matrix are in the Cache (which may also be described as the Cache hitting the data required by the computing task), the computing unit reads the data matrix and the weight matrix from the Cache and executes the computing task to obtain a computing result.
At present, before a processor executes an OP of a neural network, a data matrix and a weight matrix of a single OP are pre-read from a memory, and the data matrix and the weight matrix of the single OP are stored in a Cache. The pre-reading means that the processor pre-judges data required by the computing unit and reads the data into the Cache from the memory in advance so as to reduce the time required by the computing unit to acquire the data. However, the neural network generally includes a plurality of OPs, and the processor can only pre-read the matrix required by a single OP at a time, resulting in a large number of times of pre-reading data from the memory and a long data reading time in the process of executing the OP of the neural network by the processor.
In order to solve the above problem, an embodiment of the present application provides a data processing method, including: in the operation process of the neural network, the first processor writes a matrix of at least one OP of the plurality of OPs of the neural network stored in the memory into the cache according to the storage capacity of the cache of the first processor, and then the first processor generates first data according to the matrix of the at least one OP stored in the cache. The storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the cache. In the data processing method provided in the embodiment of the application, the first processor may pre-read at least one OP matrix from the memory according to the storage capacity of the cache, and the problem that the first processor reads the OP matrix from the memory more times and the data reading time is longer due to the fact that the first processor only pre-reads a single OP matrix from the memory each time under the condition that the storage capacity of the cache can accommodate the multiple OP matrices is solved.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 3 is a schematic structural diagram of a data processing system provided in the present application, where the data processing system 300 includes a first processor 310, a second processor 320, a memory 330, and a hard disk 340, and the first processor 310 and the second processor 320 can read data from the memory 330.
If the first processor 310 and the second processor 320 are in the same computing device, the first processor 310 and the second processor 320 may be connected by a bus, which may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, the communication relationship between the first processor 310 and the second processor 320 is illustrated in fig. 3 as bidirectional connections, but does not represent only one bus or one type of bus between the first processor 310 and the second processor 320.
If the first processor 310 and the second processor 320 are in different computing devices, the first processor 310 and the second processor 320 may communicate through a network, which may include network devices such as switches, routers, or gateways.
The first processor 310 includes a multi-level cache and a computing unit 314; the multi-level cache may be the level one cache 311, the level two cache 312, and the level three cache 313 shown in fig. 3. It is noted that fig. 3 is only an example provided by the embodiment of the present application, and in some possible examples, the first processor 310 may further include more levels of cache, or fewer levels of cache. Herein, where no misunderstanding arises, the N-level Cache may be represented by an LN Cache, where N is a positive integer; for example, the first-level Cache may be represented by an L1 Cache, the second-level Cache by an L2 Cache, and the third-level Cache by an L3 Cache, and if the processor herein is further provided with more levels of Cache, such as a fourth-level Cache, the fourth-level Cache may be represented by an L4 Cache. In other examples, if the processor is further configured with a "heterogeneous Cache", e.g., a vendor proposes a "2.5-level Cache", the "2.5-level Cache" may also be represented by an L2.5 Cache.
The computing unit 314 may be, but is not limited to, a processor having neural network processing capability, such as a CPU, a Neural Processing Unit (NPU), or a Graphics Processing Unit (GPU).
The second processor 320 may be a CPU, a Network Processor (NP), or the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.
The memory 330 may be, but is not limited to, a Random Access Memory (RAM), a flash memory, a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), a register, a double-data-rate synchronous dynamic random access memory (DDR SDRAM), and the like. In this context, DDR SDRAM may also be denoted DDR without causing misunderstanding.
The hard Disk 340 (Disk) may be, but is not limited to, a Hard Disk Drive (HDD), a Solid State Drive (SSD), a Disk array (RAID), or the like.
In a possible example, the first processor 310 may be integrated in the second processor 320, if the second processor 320 is a CPU, the first processor 310 is an NPU, and the memory is a DDR, during the Chip architecture design process, the NPU and the CPU may be disposed on the same System on Chip (SoC), and the NPU and the CPU, and the SoC and the memory (e.g., disk and DDR) may be interconnected through a bus protocol (e.g., advanced extensible interface, AXI). For example, the data processing system 300 provided in the embodiment of the present application may be a Neural Network (NN) dedicated processor, the NN dedicated processor may be integrated into an SoC chip, and the data processing system 300 may also be made into an ASIC chip and applied to a processor module of a computing device (e.g., a smartphone processor), a video monitoring processor in a safe city, or an autopilot processor. If data processing system 300 is integrated in a chip, data processing system 300 may be a component of a memory management module in the chip.
The data processing method provided by the present application is described below on the basis of the data processing system 300 shown in fig. 3, and fig. 4 is a first flowchart of a data processing method provided by the present application, where the data processing method may include the following steps.
S410, the second processor 320 sends a task request to the first processor 310.
The task request is used to instruct the first processor 310 to initiate a computational task of the neural network. For example, if the first processor 310 is an NN processor (such as an NPU or a GPU) and the second processor 320 is a CPU, after the CPU reads all matrices required by the neural network from the Disk into a memory (such as a DDR), the CPU issues a task request to the NN processor, so that the NN processor starts an operation task of the neural network according to the task request. All the matrices required by the neural network include a data matrix and a weight matrix, the data matrix is used for indicating input data of the neural network, the weight matrix is used for indicating the weight of the data matrix, and more contents of the weight matrix can refer to related explanations of the prior art, which are not repeated herein.
The neural network may be at least one of a convolutional neural network, a deep neural network, or a recurrent neural network, e.g., the neural network may include a plurality of OPs. In practical applications, to facilitate analysis of the structure, computational characteristics and data flow direction of a neural network, all OPs of the neural network are usually put together to form a computational graph (computational graph), and the computational graph is presented in an open neural network exchange (ONNX) format. In ONNX, OPs in a computational graph are defined as nodes in a text form, and directional connecting lines between the nodes represent output and input relationships between the OPs. To facilitate distinguishing between the OPs, the user may customize the name of each OP.
Fig. 5 is a schematic diagram of a neural network provided in the present application, where (a) in fig. 5 shows a schematic diagram of the network layers of the neural network, and the neural network includes 2 convolutional layers, 2 nonlinear activation layers and 1 other network layer. The nonlinear activation layer may be a linear rectification unit (ReLU) function, such as ReLU1 and ReLU2 shown in (a) of fig. 5. Both the convolutional layer M1 and the convolutional layer M2 may be an OP of matrix multiplication (MatMul), and the other network layer M3 is an OP of element-wise addition (element-wise add); (b) in fig. 5 shows the computation graph of the network structure shown in (a) of fig. 5.
As shown in fig. 5 (a), the inputs to the computation graph include X, weight_1 (W1), and weight_2 (W2), where X is a 4 × 3 matrix in 32-bit floating point (FP32) format, W1 is a 3 × 5 FP32 matrix, and W2 is a 3 × 5 FP32 matrix.
In the computation graph shown in fig. 5 (b), "%d" (d denotes a natural number) denotes an intermediate variable in the computation graph, and "Float(a, b)" denotes that the intermediate variable is an FP32 matrix of size a × b; "onnx::MatMul" represents a matrix multiplication OP; "onnx::Relu" represents a ReLU OP, which is called twice with the intermediate variables %3 and %5 as inputs, respectively; "%X, %weight_1" represents that the inputs of the OP are X and weight_1; and "return (%Y)" indicates that the Y matrix is output.
Any neural network structure can be divided into ONNX-format computation graphs formed by taking ONNX built-in OPs as basic units, and the network structures and the computation graphs map to each other one to one. In this context, "network structure" and "computation graph" will not be distinguished where no misunderstanding arises. It is noted that OPs may be nested, i.e., one large OP may contain other smaller OPs. For example, the matrix multiplication OP (Wx) and the nonlinear activation OP g(·) can be wrapped together to obtain a larger OP, and the computation graph of the neural network can be presented in units of such OPs. It should be noted that, in a computation graph of the ONNX format, the OP with the smallest granularity must be an OP built into the ONNX format (such as "onnx::MatMul" shown in fig. 5), and in a possible example, the computation graph of the ONNX format may also be presented by taking the OPs built into the ONNX format as basic units.
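As an illustrative sketch only (not taken from the patent), the computation graph of fig. 5 can be expressed with the ONNX helper API roughly as follows, assuming the two-branch structure described above (a MatMul and a ReLU on each branch, followed by an element-wise Add); the intermediate tensor names are made up.

```python
# Sketch of a fig. 5-style ONNX computation graph built from ONNX built-in OPs.
from onnx import helper, TensorProto

X  = helper.make_tensor_value_info("X", TensorProto.FLOAT, [4, 3])
W1 = helper.make_tensor_value_info("weight_1", TensorProto.FLOAT, [3, 5])
W2 = helper.make_tensor_value_info("weight_2", TensorProto.FLOAT, [3, 5])
Y  = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [4, 5])

nodes = [
    helper.make_node("MatMul", ["X", "weight_1"], ["m1_out"]),      # convolutional layer M1
    helper.make_node("Relu",   ["m1_out"], ["relu1_out"]),          # ReLU1
    helper.make_node("MatMul", ["X", "weight_2"], ["m2_out"]),      # convolutional layer M2
    helper.make_node("Relu",   ["m2_out"], ["relu2_out"]),          # ReLU2
    helper.make_node("Add",    ["relu1_out", "relu2_out"], ["Y"]),  # element-wise add M3
]
graph = helper.make_graph(nodes, "fig5_example", [X, W1, W2], [Y])
model = helper.make_model(graph)                                    # an ONNX ModelProto
```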
After the first processor 310 starts the operation process of the neural network according to the task request, please continue to refer to fig. 4, and the data processing method provided by the present application further includes the following steps.
S420, the first processor 310 writes the matrix of at least one OP of the plurality of OPs of the neural network stored in the memory into the cache according to the storage capacity of the cache of the first processor 310.
The neural network includes a plurality of OPs, and the at least one OP has an association relationship of being executed continuously.
For example, the computation graph of the neural network is denoted as G. During the operation of the neural network, G may be split by the first processor 310 into a plurality of sub-computation graphs, where any one of the plurality of sub-computation graphs includes one or more OPs. The set of sub-computation graphs of the neural network is denoted as {G1, G2, …, GK}, where
G = G1 ∪ G2 ∪ … ∪ GK and Gi ∩ Gj = ∅,
K is the number of sub-computation graphs included in the neural network, Gi is the i-th sub-computation graph in {G1, G2, …, GK}, Gj is the j-th sub-computation graph in {G1, G2, …, GK}, i and j are positive integers, and i ≠ j.
The "association performed continuously" means that, of the at least one OP shown at S420, each OP is "connected". E.g., { G1,G2,…,GKSub-calculation chart G inkIncluding the at least one OP, k being a sub-computation graphThe sequence number K is less than or equal to K, and in the operation process of the neural network, the sub-calculation graph G is usedkIncluding any two OP1 and OP2, from OP1 through the sub-computation graph GkWherein several other OPs can always reach OP2, where "pass through" means that there is an input-output dependency between OP1 and OP 2.
According to the data processing method provided by the embodiment of the application, the first processor can segment a plurality of OPs included in the neural network according to the computation graph of the neural network and the storage capacity of the cache, and pre-reads the matrixes of the sub-computation graphs obtained by the segmentation from the memory into the cache, so that the times of reading the matrixes of the OPs from the memory by the first processor are reduced, the data reading time of the operation process of the neural network and the total operation time required by the neural network are reduced, and the operation efficiency of the neural network is improved.
The matrix of the at least one OP includes at least one of a data matrix and a weight matrix, where the data matrix is used for indicating the input data of the first OP of the at least one OP, and the weight matrix is used for indicating the weight of the data matrix. By way of example, the matrix of the at least one OP may also be referred to as the matrix required by the sub-computation graph Gk.
For example, fig. 6 is a schematic diagram of a neural network provided in the present application. The neural network includes 3 convolutional layers (convolutional layers R1 to R3) and 2 nonlinear activation layers (ReLU1 to ReLU2). The 3 convolutional layers are MatMul OPs, and the weight matrices required by the 3 MatMul OPs are W1 (convolutional layer R1, MatMul1), W2 (convolutional layer R2, MatMul2) and W3 (convolutional layer R3, MatMul3), respectively; the 2 ReLU OPs (ReLU1 and ReLU2) do not require a weight matrix.
The data matrix input into the neural network is denoted as X, and the matrices output by the 3 MatMul OPs are X1 (convolutional layer R1), X2 (convolutional layer R2) and X3 (convolutional layer R3), respectively. As shown in FIG. 7, the at least one OP may include 4 OPs such as ReLU1-MatMul2-ReLU2-MatMul3; the data matrix included in the matrix of the at least one OP is X1, and the weight matrices included in the matrix of the at least one OP are W2 and W3.
It is to be understood that the required storage space of the matrix of the at least one OP is less than or equal to the storage capacity of the cache.
In one possible example, the cache of the first processor 310 comprises a single cache unit, and the required storage space of the matrix of the at least one OP is less than or equal to the storage capacity of the single cache unit.
In another possible example, the cache of the first processor 310 may include a plurality of cache units, and the storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of any one of the plurality of cache units, or the storage space required by the matrix of the at least one OP is less than or equal to the total storage capacity of the plurality of cache units. For example, the buffer includes a first buffer unit and a second buffer unit, wherein the storage capacity of the first buffer unit is smaller than that of the second buffer unit, and the data reading speed of the first buffer unit is greater than that of the second buffer unit.
As shown in fig. 3, if the first Cache unit is the first-level Cache 311 (L1 Cache), the second Cache unit may be the second-level Cache 312 (L2 Cache); for another example, if the first Cache unit is the level two Cache 312 (L2 Cache), the second Cache unit may be the level three Cache 313 (L3 Cache). In one possible scenario, if the first processor 310 further includes a level four Cache (L4 Cache), the first Cache unit may be the level three Cache 313 (L3 Cache), and the second Cache unit may be the level four Cache (L4 Cache). In some examples, the L1 Cache may be referred to as a "high-speed Cache", and the L2 Cache and the L3 Cache may be referred to as "low-speed Caches". It is worth noting that "high speed" and "low speed" are relative: a Cache closer to the computing unit is "high speed" relative to the Caches farther away. For example, the L1 Cache is "high speed" with respect to the L2 Cache to the L4 Cache, and the L2 Cache to the L4 Cache are "low speed" with respect to the L1 Cache; for another example, the L2 Cache is a "high-speed Cache" with respect to the L3 Cache and the L4 Cache, and the L3 Cache and the L4 Cache are "low-speed Caches" with respect to the L2 Cache.
In order to improve the efficiency of the first processor 310 in computing the neural network, the embodiment of the present application provides a possible implementation manner for pre-reading the matrix stored in the memory to the first processor 310, please refer to fig. 7, where fig. 7 is a second flowchart of a data processing method provided by the present application, and if the storage space required by the matrix of at least one OP is less than or equal to the storage capacity of the second cache unit, the above step S420 may include the following steps S4201 to S4203.
S4201, the first processor 310 determines whether the storage space required by the matrix of at least one OP is less than or equal to the storage capacity of the first cache unit.
If the storage space required by the matrix of at least one OP is less than or equal to the storage capacity of the first cache unit, performing S4202; if the storage space required by the matrix of at least one OP is larger than the storage capacity of the first cache unit, S4203 is executed.
S4202, the first processor 310 writes the matrix of at least one OP stored in the memory into the first cache unit.
Under the condition that the storage space required by the at least one OP matrix is less than or equal to the storage capacity of the first cache unit, the first processor 310 writes the at least one OP matrix into the first cache unit, which is beneficial to improving the hit rate of the computing unit 314 in the first processor 310 in querying the at least one OP matrix in the first cache unit, reducing the number of data handling times of the at least one OP matrix and the data reading time of the first processor 310, further reducing the time of the first processor 310 in operating the neural network, and improving the processing efficiency of the first processor 310 in operating the neural network.
As shown in fig. 3, if the first Cache unit is the L1 Cache (first-level Cache 311), the computing unit 314 may read the matrix of the at least one OP from the first-level Cache 311 in the process of computing the at least one OP. Compared to the computing unit 314 reading the matrix of the at least one OP from the memory 330 or from a low-speed Cache (L2 Cache or L3 Cache), since the data reading speeds of the Caches included in the first processor 310 and of the memory 330 satisfy L1 Cache > L2 Cache > L3 Cache > memory 330, the time required for the computing unit 314 to read the matrix of the at least one OP is less, the computing efficiency of the first processor 310 in executing the at least one OP is improved, and the efficiency of the first processor 310 in operating the neural network is improved.
In a possible scenario, if the first Cache unit is the L2 Cache (second-level Cache 312), the computing unit 314 may read the matrix of the at least one OP from the second-level Cache 312 in the calculation process of the at least one OP. Compared to the computing unit 314 reading the matrix of the at least one OP from the memory 330 or from the low-speed Cache (L3 Cache), since the data reading speeds of the Caches included in the first processor 310 and of the memory 330 satisfy L2 Cache > L3 Cache > memory 330, the time required for the computing unit 314 to read the matrix of the at least one OP is less, and the efficiency of the first processor 310 in computing the neural network is improved.
S4203, the first processor 310 writes the matrix of a part of the at least one OP stored in the memory into the first cache unit, and writes the matrix of another part of the at least one OP into the second cache unit.
The part of OPs has an association relationship of being executed continuously. Illustratively, the part of OPs may include one or more OPs.
the storage capacity of the first buffer unit is smaller than that of the second buffer unit, and in the case that the storage capacity of the first buffer unit is sufficient, the first processor 310 writes the matrix of at least one OP into the first buffer unit, which is beneficial to increase the speed of reading the matrix of at least one OP from the buffer by the computing unit 314 in the first processor 310 because the data reading speed of the first buffer unit is greater than that of the second buffer unit.
S4201 to S4203 may be implemented by the first processor 310 dividing the computation graph G of the neural network into a plurality of sub-computation graphs. For example, assume that the storage capacity of the first Cache unit (e.g., L2 Cache) in the first processor 310 is: 4 Megabytes (MB), and the storage capacity of the second Cache unit (e.g., L3 Cache) is: 16MB; assuming that the size and data amount of each data matrix and weight matrix shown in fig. 6 are as shown in table 1 below, for example, the size of the input data matrix X of the neural network is 16 × 1024 × 512, which means that 16 1024 × 512 data matrices will perform the same operation in the neural network; the matrix is an FP32 matrix, and the storage space required by each data in the FP32 matrix is 4 bytes (B).
TABLE 1
Data matrix | Size            | Data volume | Weight matrix | Size      | Data volume
X           | 16 × 1024 × 512 | 32 MB       | W1            | 512 × 128 | 0.25 MB
X1          | 16 × 1024 × 128 | 8 MB        | -             | -         | -
X'1         | 16 × 1024 × 128 | 8 MB        | W2            | 128 × 64  | 32 kilobytes (KB)
X2          | 16 × 1024 × 64  | 4 MB        | -             | -         | -
X'2         | 16 × 1024 × 64  | 4 MB        | W3            | 64 × 32   | 8 KB
X3          | 16 × 1024 × 32  | 2 MB        | -             | -         | -
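The data volumes in Table 1 follow directly from the matrix sizes at 4 bytes per FP32 element, as the short check below illustrates (the variable names are made up).

```python
# Quick check of the Table 1 footprints: every matrix is FP32, i.e. 4 bytes/element.
sizes = {"X": (16, 1024, 512), "X1": (16, 1024, 128), "X2": (16, 1024, 64),
         "X3": (16, 1024, 32), "W1": (512, 128), "W2": (128, 64), "W3": (64, 32)}
for name, dims in sizes.items():
    elements = 1
    for d in dims:
        elements *= d
    print(f"{name}: {elements * 4 / 2**20:.3f} MB")  # X -> 32.000 MB, W2 -> 0.031 MB (32 KB)
```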
As shown in fig. 6, the computation graph G = { MatMul1, reLU1, matMul2, reLU2, matMul3} of the neural network, and the execution sequence of each OP in the neural network is: matMul1-ReLU1-MatMul2-ReLU2-MatMul3, to increase the hit rate of the first processor 310 in the cache, before the operation of MatMul1 (first state shown in FIG. 6), due to X and W1The required storage space is 32MB +0.25MB =32.25MB > 16MB, and the first processor 310 may divide MatMul1 in the computation graph G into sub-metersAbacus G1= MatMul 1. Since the first processor 310 cannot apply the weight matrix W and the data matrix X required by MatMul11Writing from memory (e.g., DDR) to the second cache unit, the first processor 310 may batch read X and W from DDR1The specific implementation process of the sub-matrix can refer to the following description related to fig. 8.
In the second state, as shown in fig. 6, after the operation of MatMul1 is finished, the first processor 310 analyzes the subsequent ReLU1. Similarly, the matrix required by ReLU1 should be read from DDR into the low-speed Cache (e.g. the second Cache unit) as much as possible to improve the hit rate. ReLU1 has no weight matrix, and low-speed Cache (such as a second Cache unit) only needs to store an input data matrix X1(8 MB is occupied). Input data matrix X due to ReLU11And an output data matrix X1The size is the same, so the first processor 310 may be in the same place as X1Upon entry of ReLU1, X is deleted from the low speed Cache1And apply X' to1And writing into the low-speed Cache. Thus, the calculation graph G of the neural network is the sub-calculation graph G2Now containing ReLU1, i.e. G2= { ReLU1}. After the ReLU1 is calculated, the output matrix X' is1Can be stored in the low-speed Cache. Sub-calculation graph G2By the present calculation process, the low-speed Cache is occupied by 8MB at most (store X)1Or X1)。
In the third state, as shown in fig. 6, after the first processor 310 has calculated ReLU1, it analyzes the subsequent MatMul2. Similarly, the input matrix required by MatMul2 needs to be read from the DDR into the low-speed Cache as much as possible to improve the hit rate. Input data matrix X' of MatMul21Already in the low-speed Cache, so only the weight matrix W of the Cache needs to be read again2(occupying 32 KB). Similar to the idea of multiplexing the space occupied by the input and output data matrices in ReLU1, the first processor 310 may assign X ″1After being input into MatMul2, the data matrix X is deleted from the low-speed Cache and output2And writing into the low-speed Cache. No low-speed Cache overflow occurs in the whole process, so that the sub-computation graph G2May also include MatMul2, i.e., G2= ReLU1, matMul2. After MatMul2 calculation is finished, the weight matrix W2Will not be used any more and can be loweredDeleting in the fast Cache, and outputting matrix X2Stored in the low speed Cache. Sub-calculation graph G2In the calculation process at present, the maximum occupied low-speed Cache is 8MB +32KB (X is saved)1And W2) And no overflow occurs.
In the fourth state, as shown in fig. 6, after the first processor 310 has calculated MatMul2, it analyzes the subsequent ReLU2. Similar to the case of the ReLU1, the low-speed Cache overflow can not occur in the calculation process of the ReLU2, and the input data matrix and the output data matrix have the same size and can be reused in space. Therefore, the sub-calculation graph G2ReLU2, i.e. G, may also be included2= { ReLU1, matMul2, reLU2}. After the ReLU2 is calculated, the output data matrix X' is2Stored in the low speed Cache. Sub-calculation graph G2In the calculation process at present, the low-speed Cache is occupied at most by 8MB +32KB +8KB (storing X1、W2And W3) And no overflow occurs.
In the fifth state shown in fig. 6, after the first processor 310 has calculated ReLU2, it analyzes the final MatMul3. Similar to the case of MatMul2, W3 (occupying 8 KB) must be read from the DDR. After MatMul3 is calculated, the weight matrix W3 and the input data matrix X'2 will not be used any more and can be deleted from the low-speed Cache. As shown in the sixth state of fig. 6, since the neural network has no subsequent OP after MatMul3, the first processor 310 may write the output data matrix X3 of MatMul3 into the memory (DDR). No low-speed Cache overflow occurs in the whole process, so the sub-computation graph G2 may also include MatMul3, i.e., G2 = {ReLU1, MatMul2, ReLU2, MatMul3}.
In summary, the computation graph G of the neural network can be split into two sub-computation graphs: G1 = {MatMul1} and G2 = {ReLU1, MatMul2, ReLU2, MatMul3}. In the first to sixth states shown in fig. 6, the memory (e.g., DDR) and cache boxes below each OP show the matrices stored in the corresponding memory before the OP is calculated, and the bold variables show the matrices that change after the OP is calculated.
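The splitting logic walked through above can be summarized in a few lines of code. The sketch below is only an illustration of the idea, under a simplified working-set model in which each OP's peak occupancy is its weight matrix plus the larger of its input and output data matrices (their space being multiplexed as in fig. 6); the Op class, the function name and the assumed 8 MB size of X3 are illustrative, not taken from the patent.

from dataclasses import dataclass

MB = 1024 * 1024

@dataclass
class Op:
    name: str
    input_bytes: int     # data matrix the OP reads
    weight_bytes: int    # weight matrix the OP needs (0 for ReLU-type OPs)
    output_bytes: int    # data matrix the OP produces

def split_into_subgraphs(ops, cache_bytes):
    """Greedily group consecutive OPs whose peak working set fits the low-speed cache."""
    subgraphs, current = [], []
    for op in ops:
        # peak occupancy: weights plus the larger of input/output (their space is multiplexed)
        need = op.weight_bytes + max(op.input_bytes, op.output_bytes)
        if need > cache_bytes:
            # the OP does not fit even on its own: close the running sub-graph and give
            # the oversized OP its own sub-graph (its matrices are tiled, see fig. 8)
            if current:
                subgraphs.append(current)
                current = []
            subgraphs.append([op])
        else:
            current.append(op)
    if current:
        subgraphs.append(current)
    return subgraphs

ops = [
    Op("MatMul1", 32 * MB, 256 * 1024, 8 * MB),  # X: 32 MB, W1: 0.25 MB, X1: 8 MB
    Op("ReLU1",    8 * MB, 0,          8 * MB),  # X1 -> X'1
    Op("MatMul2",  8 * MB, 32 * 1024,  8 * MB),  # W2: 32 KB
    Op("ReLU2",    8 * MB, 0,          8 * MB),  # X2 -> X'2
    Op("MatMul3",  8 * MB, 8 * 1024,   8 * MB),  # W3: 8 KB; X3 size assumed to be 8 MB
]
print([[op.name for op in g] for g in split_into_subgraphs(ops, 16 * MB)])
# [['MatMul1'], ['ReLU1', 'MatMul2', 'ReLU2', 'MatMul3']]

With the fig. 6 sizes and a 16 MB low-speed Cache, the sketch reproduces the split G1 = {MatMul1}, G2 = {ReLU1, MatMul2, ReLU2, MatMul3}.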
In the data processing method provided by the embodiment of the application, the first processor uses the sub-computation graph instead of a single OP as the basic unit, and pre-reads the matrices required by the sub-computation graph into the low-speed Cache (such as the L2 Cache or the L3 Cache), so that the high-speed Cache (such as the L1 Cache) and the computing unit only obtain the required data from the low-speed Cache. This improves the hit rate of the computing unit when searching for data in the low-speed Cache, reduces the number of times the computing unit and the high-speed Cache read data from a lower-speed memory (such as a DDR), and reduces the data reading time in the operation process of the neural network, thereby improving the speed at which the neural network reads the input matrix and the weight matrix, and reducing the total time for the first processor to complete all computations of the neural network.
Optionally, if the first processor includes more cache levels (e.g., L1 Cache to L4 Cache), after the first processor writes the matrices of the sub-computation graph (including multiple OPs) from the memory into the L4 Cache, the first processor may further segment the sub-computation graph to obtain multiple fine-grained computation graphs (each including at least one OP), and write the matrices of a fine-grained computation graph from the L4 Cache into the L3 Cache according to the storage capacity of the L3 Cache, so as to improve the speed at which the first processor reads the matrices from the cache. When the storage capacity of the L2 Cache is sufficient and a fine-grained computation graph can be further segmented, the first processor may segment it again; the process of segmenting a fine-grained computation graph can refer to the process in which the first processor 310 divides the computation graph G of the neural network to obtain multiple sub-computation graphs, which is not described herein again.
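As a rough illustration of this multi-level refinement, the continuation below reuses Op, split_into_subgraphs, ops and MB from the previous sketch: it splits first by the largest cache level and then re-splits each resulting sub-graph for the next, smaller level. The capacity values are illustrative, not real hardware parameters.

def refine(ops, capacities):
    """capacities: cache sizes in bytes, ordered from the largest level (e.g. L4)
    down to the smallest level that still pre-reads whole sub-graph matrices (e.g. L2)."""
    if not capacities:
        return [ops]
    refined = []
    for sub in split_into_subgraphs(ops, capacities[0]):
        refined.extend(refine(sub, capacities[1:]))
    return refined

fine_grained = refine(ops, [64 * MB, 16 * MB, 4 * MB])   # illustrative L4/L3/L2 capacities
print([[op.name for op in g] for g in fine_grained])
# [['MatMul1'], ['ReLU1'], ['MatMul2'], ['ReLU2'], ['MatMul3']] with these sizes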
After the first processor 310 writes the matrix of at least one OP stored in the memory into the cache of the first processor 310, please continue to refer to fig. 4, and the data processing method according to the embodiment of the present disclosure further includes the following step S430.
S430, the first processor 310 generates the first data according to the matrix of the at least one OP stored in the cache.
For example, if the matrix of the at least one OP includes a matrix of a first OP and a matrix of a second OP, and the first OP and the second OP have a correlation relationship that is continuously executed, please refer to fig. 7, where the above-mentioned S430 includes the following steps S4301 to S4305.
S4301, the first processor 310 reads the matrix of the first OP stored in the cache.
The matrix of the first OP may include at least one of a data matrix and a weight matrix of the first OP.
In a first possible example, the matrix of the first OP comprises a data matrix of the first OP. As shown in fig. 6, if the first OP is ReLU1, the matrix of the first OP may include X1.
In a second possible example, the matrix of the first OP includes a weight matrix of the first OP. As shown in fig. 6, if the first OP is ReLU1, the matrix of the first OP may include W2.
In a third possible example, the matrix of the first OP includes a data matrix and a weight matrix of the first OP. As shown in fig. 6, if the first OP is MatMul1, the matrix of the first OP may include X and W1. In the case that MatMul1 is the first OP among the plurality of OPs of the neural network, the data matrix of the first OP is used to indicate the input data of the neural network.
The first processor 310 reading the matrix of first OPs stored in the cache may include: the calculation unit 314 sequentially searches the cache for the matrix required by the first OP.
"query in turn" means: the computing unit 314 checks whether the first level Cache 311 (L1 Cache) has a matrix required by the first OP, and if the matrix is in the L1 Cache (hit), the computing unit 314 reads the matrix from the L1 Cache to complete the computation of the first OP.
If the matrix is not in the L1 Cache (miss), the computing unit 314 checks whether the second level Cache 312 (L2 Cache) has the matrix, and if the matrix is in the L2Cache (hit), the computing unit 314 writes the matrix stored in the L2Cache into the L1 Cache, and further, the computing unit 314 reads the matrix from the L1 Cache to complete the computation of the first OP.
If the matrix is not in the L2Cache (miss), the computing unit 314 checks whether the third-level Cache 313 (L3 Cache) has the matrix, and if the matrix is in the L3Cache (hit), the computing unit 314 writes the matrix stored in the L3Cache into the L2Cache, then writes the matrix stored in the L2Cache into the L1 Cache, and further, the computing unit 314 reads the matrix from the L1 Cache to complete the computation of the first OP.
That is, the closer to the computing unit 314 the buffer (e.g., the first-level buffer 311) in which the matrix of the first OP is stored, the fewer times the matrix of the first OP needs to be transferred while the computing unit 314 operates the first OP, the faster the neural network reads the matrix, and the lower the total time for the first processor 310 to complete all computations of the neural network.
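The "query in turn" behaviour described above can be modelled with a few dictionaries standing in for the cache levels. This is a sketch of the lookup-and-promote idea only; the function name, the dictionary representation, and the absence of capacity or eviction handling are simplifying assumptions.

def read_matrix(levels, key, load_from_memory):
    """levels: [L1, L2, L3] dicts ordered from fastest to slowest, each mapping a matrix
    name to its data; capacity and eviction are not modelled in this sketch."""
    hit = next((i for i, level in enumerate(levels) if key in level), None)
    if hit is None:
        levels[-1][key] = load_from_memory(key)   # miss in every level: fetch from DDR into L3
        hit = len(levels) - 1
    for i in range(hit, 0, -1):                   # promote step by step: L3 -> L2 -> L1
        levels[i - 1][key] = levels[i][key]
    return levels[0][key]                         # the computing unit always reads from L1

# usage: W2 was pre-read into the L2 dict, so the first access promotes it to L1
L1, L2, L3 = {}, {"W2": "W2-matrix"}, {}
print(read_matrix([L1, L2, L3], "W2", load_from_memory=lambda k: f"{k}-from-DDR"))
print("W2" in L1)   # True: later reads hit L1 directly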
S4302, the first processor 310 generates second data according to the matrix of the first OP.
In one possible example, as shown in fig. 6, if the first OP is MatMul2, the first processor 310 performs matrix multiplication on the data matrix X'1 and the weight matrix W2 to obtain X2 (the second data).
In another possible example, as shown in fig. 6, if the first OP is ReLU1, the first processor 310 performs nonlinear activation on the data matrix X1 to obtain X'1 (the second data).
S4303, the first processor 310 writes the second data into the buffer, and deletes the matrix of the first OP stored in the buffer.
As shown in fig. 6, if the first OP is ReLU1, the first processor 310 performs nonlinear activation on the data matrix X1 (occupying 8 MB) to obtain X'1 (the second data, occupying 8 MB). If the storage capacity of the cache is 18 MB, the first processor 310 may delete the matrix of the first OP stored in the cache after writing the second data into the cache; if the storage capacity of the cache is 10 MB, the first processor 310 may delete the matrix of the first OP stored in the cache after reading the matrix of the first OP and before generating the second data, and then the first processor 310 writes the generated second data into the cache.
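A minimal sketch of this capacity-dependent ordering is shown below, assuming the cache is modelled as a dictionary mapping a matrix name to a (data, size) pair; the function name, the keys, and the identity placeholder for the activation are illustrative, not the patent's implementation.

MB = 1024 * 1024

def apply_relu_in_cache(cache, capacity_bytes, in_key, out_key, matrix_bytes, relu):
    """cache maps a matrix name to a (data, size_bytes) pair."""
    used = sum(size for _, size in cache.values())
    if used + matrix_bytes <= capacity_bytes:
        # e.g. an 18 MB cache: write X'1 first, then delete X1
        data, _ = cache[in_key]
        cache[out_key] = (relu(data), matrix_bytes)
        del cache[in_key]
    else:
        # e.g. a 10 MB cache: delete X1 as soon as it has been read, then write X'1
        data, _ = cache.pop(in_key)
        cache[out_key] = (relu(data), matrix_bytes)

# usage with the fig. 6 sizes; the identity function stands in for the actual activation
cache = {"X1": ("X1-data", 8 * MB)}
apply_relu_in_cache(cache, 18 * MB, "X1", "X1_act", 8 * MB, relu=lambda d: d)
print(list(cache))   # ['X1_act']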
S4304, the first processor 310 reads the matrix of the second OP and the second data stored in the buffer.
As shown in fig. 6, if the first OP is ReLU1, the second OP may be MatMul2, the second data is X'1, and the matrix of the second OP is W2. The process in which the first processor 310 reads the matrix of the second OP and the second data stored in the buffer may refer to S4301 described above, and is not described herein again.
S4305, the first processor 310 generates first data according to the matrix of the second OP and the second data.
In one possible example, the first data may be intermediate data of the neural network. As shown in fig. 6, if the second OP is MatMul2, the first data is X2. The first processor 310 may delete the matrix of the second OP stored in the cache and write the first data into the cache, so that the first processor 310 can read the first data from the cache when executing a subsequent OP of the neural network. This prevents the first processor 310 from reading the first data from the memory, reduces the number of times the first data is transferred, and thereby reduces the operation time of the neural network.
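The chain S4301 to S4305 for the fig. 6 pair (first OP = ReLU1, second OP = MatMul2) can be illustrated end to end as follows. The sketch uses tiny 64 × 64 arrays instead of the 8 MB matrices of the example, and the dictionary standing in for the low-speed Cache, the key names and the deletion order are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(1)
cache = {                                                    # stands in for the low-speed cache
    "X1": rng.standard_normal((64, 64)).astype(np.float32),  # output of MatMul1
    "W2": rng.standard_normal((64, 64)).astype(np.float32),  # pre-read weight of MatMul2
}

# S4301/S4302: read the matrix of the first OP (ReLU1) and generate the second data X'1
x1_act = np.maximum(cache["X1"], 0.0)

# S4303: write the second data into the cache and delete the matrix of the first OP
cache["X1_act"] = x1_act
del cache["X1"]

# S4304/S4305: read the matrix of the second OP (MatMul2) and the second data, generate the
# first data X2, and keep it in the cache for the next OP instead of writing it to memory
x2 = cache["X1_act"] @ cache["W2"]
del cache["W2"], cache["X1_act"]
cache["X2"] = x2
print(sorted(cache))   # ['X2']: only the intermediate result stays resident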
In another possible example, the first data may be output data of the neural network. For example, if execution of the plurality of OPs has finished, the first processor may output the first data as the output data. As shown in fig. 7, the data processing method may further include the following step S440.
S440, the first processor 310 sends the first data to the second processor 320.
In an optional implementation manner, S440 may specifically include: the first processor 310 writes the first data into the memory and sends a task response to the second processor 320, where the task response indicates that the first processor 310 has completed the operation of the neural network, and the second processor 320 reads the first data from the memory when it receives the task response. As shown in fig. 6, if the second OP is MatMul3, the first data is X3; the first processor 310 may write X3 into the memory, and the second processor 320 reads the X3 stored in the memory.
As a possible example, the task response may include the first data, and S440 specifically includes: the first processor 310 sends a task response including the first data to the second processor 320.
In the operation process of the neural network, the first processor can multiplex the storage space of the cache, so that the intermediate matrices required by an OP are read from the cache instead of from the memory. This reduces the number of times the first processor reads data from the memory, improves the speed at which the neural network reads the data matrix and the weight matrix, and reduces the total time for the first processor to complete all computations of the neural network.
Generally, the storage capacity of the cache used for storing operation data in a processor is small, while the matrices required for OP calculation in a neural network are large, so the cache in the processor cannot read in all the matrices required by the neural network. For example, in a typical ImageNet2012 classification task, the first-layer input data of the ResNet50 network is a four-dimensional tensor that occupies at least 3 × 224 × 224 × 4 B = 588 KB; the first-layer weight matrix of the ResNet50 network is also a four-dimensional tensor, occupying a space of 3 × 16 × 3 × 4 B = 1.73 KB; the weight matrix of the last layer of the ResNet50 network is a two-dimensional tensor, occupying a space of 2048 × 1000 × 4 B = 8192 KB. However, the storage capacity of the L1 Cache in a typical CPU is only 8 KB to 64 KB, so the L1 Cache cannot read in a complete data matrix and/or weight matrix.
As an optional implementation manner, if a storage space required by a matrix of a third OP in the multiple OPs of the neural network is greater than a storage capacity of the cache, the first processor divides the matrix of the third OP into multiple sub-matrices, and then the first processor writes at least one sub-matrix in the multiple sub-matrices stored in the memory into the cache, where the storage space required by the at least one sub-matrix is less than or equal to the storage capacity of the cache.
For example, fig. 8 is a second process diagram of data processing provided by the present application, in which a first processor 810 communicates with a memory 830, and the first processor 810 includes a computing unit 811 and a cache 820. The first processor 810 may implement the functions of the first processor 310, and the memory 830 may implement the functions of the memory 330.
In fig. 8 (a), the data matrix required by the third OP is X and the weight matrix is W, where X is a 128 × 200 matrix and the minimum matrix specification that the first processor 810 can calculate is 64 × 64. The first processor 810 divides X into 8 data sub-matrices (e.g., X11, X12, X13, X14, X21, X22, X23, X24). It is noted that X11 through X24 are each 64 × 64 sub-matrices, and X14 and X24 need to be padded with zero values so that the 8 data sub-matrices meet the computational requirements of the first processor 810. Similarly, the first processor 810 may divide the weight matrix W into 8 weight sub-matrices (e.g., W11, W12, W13, W14, W21, W22, W23, W24), where W11, W12, W13 and W14 are 64 × 64 weight sub-matrices, and W21, W22, W23 and W24 need to be padded with zero values so that these 8 weight sub-matrices meet the computational requirements of the first processor 810. In addition, after the data sub-matrices and the weight sub-matrices are padded with zero values, the result of the matrix multiplication (XW) calculated by the first processor 810 using the 8 data sub-matrices and the 8 weight sub-matrices is unchanged.
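The claim that zero padding does not change the product XW can be checked with a short NumPy snippet. The 64 × 64 block size follows the example; the 200 × 128 shape assumed for W is only so that the product XW is defined, and the padding helper is illustrative.

import numpy as np

def pad_to_multiple(a, block=64):
    """Pad a 2-D matrix with zeros so that both dimensions become multiples of the block size."""
    return np.pad(a, ((0, -a.shape[0] % block), (0, -a.shape[1] % block)))

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 200)).astype(np.float32)
W = rng.standard_normal((200, 128)).astype(np.float32)   # 200 x 128 assumed so that XW is defined

Xp, Wp = pad_to_multiple(X), pad_to_multiple(W)
ref = X @ W
padded = (Xp @ Wp)[: ref.shape[0], : ref.shape[1]]
print(Xp.shape, Wp.shape)                                 # (128, 256) (256, 128)
print(np.allclose(ref, padded, rtol=1e-4, atol=1e-4))     # True: the zero padding does not change XW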
If the matrices are in FP32 format (i.e., each element occupies 4 B), the size of each matrix shown in fig. 8 is as shown in table 2 below. For example, the size of the input data matrix X of the third OP is 16 × 128 × 200, which means that 16 data matrices of size 128 × 200 undergo the same operation in the third OP; the data volume of the data matrix X is 1600 KB, the data volume of each data sub-matrix is 256 KB, the data volume of the weight matrix W is 100 KB, and the data volume of each weight sub-matrix is 16 KB.
TABLE 2

Matrix | Size | Data volume
Data matrix X | 16 × 128 × 200 | 1600 KB
Data sub-matrices X11, X12, X13, X14, X21, X22, X23, X24 | 16 × 64 × 64 | 256 KB (each)
Weight matrix W | 128 × 200 | 100 KB
Weight sub-matrices W11, W12, W13, W14, W21, W22, W23, W24 | 64 × 64 | 16 KB (each)
For example, assuming that the storage capacity of the cache 820 of the first processor 810 is 1 MB, since the total storage space required by the data matrix X and the weight matrix W is 1600 KB + 100 KB = 1700 KB ≈ 1.66 MB, the cache 820 cannot read in all the data of the data matrix X and the weight matrix W from the memory 830. As shown in fig. 8 (b), the first processor 810 may slice the data matrix X and the weight matrix W and write part of the sub-matrices into the cache 820. Since the storage space required by 3 data sub-matrices and the 8 weight sub-matrices (768 KB + 128 KB = 896 KB = 0.875 MB) is less than the storage capacity of the cache 820 (1 MB), the first processor 810 may write the 3 data sub-matrices (e.g., X11, X12 and X13) and the 8 weight sub-matrices into the cache 820.
After the first processor 810 traverses all the combinations between the data sub-matrices and the weight sub-matrices of the third OP, the calculation result of the third OP is obtained. "Combination between a data sub-matrix and a weight sub-matrix" means, for example, that X11 needs to be calculated with W11 and W12, and the corresponding combinations in the calculation process are (X11, W11) and (X11, W12). "Traversal" means that the first processor 810 calculates from the sub-matrix combination (X11, W11) through the sub-matrix combination (X24, W42), after which the calculation result of the third OP is obtained.
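A blocked matrix multiplication of this kind can be sketched as three nested loops over 64 × 64 sub-matrix combinations, so that only a handful of sub-matrices need to be resident in the cache at any time. The loop structure, the padding inside the function and the assumed 200 × 128 shape of W are illustrative and not taken from the patent.

import numpy as np

def blocked_matmul(X, W, block=64):
    """Accumulate XW block by block from (Xij, Wjk) sub-matrix combinations, as in fig. 8(b)."""
    Xp = np.pad(X, ((0, -X.shape[0] % block), (0, -X.shape[1] % block)))
    Wp = np.pad(W, ((0, -W.shape[0] % block), (0, -W.shape[1] % block)))
    M, K = Xp.shape
    N = Wp.shape[1]
    Y = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, block):             # block rows of X
        for k in range(0, N, block):         # block columns of W
            for j in range(0, K, block):     # one (Xij, Wjk) combination per step
                Y[i:i + block, k:k + block] += Xp[i:i + block, j:j + block] @ Wp[j:j + block, k:k + block]
    return Y[: X.shape[0], : W.shape[1]]

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 200)).astype(np.float32)
W = rng.standard_normal((200, 128)).astype(np.float32)    # shape assumed so that XW is defined
print(np.allclose(blocked_matmul(X, W), X @ W, rtol=1e-4, atol=1e-4))   # True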
It is noted that, when the storage capacity of the cache 820 is fixed, the smaller the granularity (data volume) of the sub-matrices, the larger the number of sub-matrices the cache 820 can accommodate and the higher the hit rate of the first processor 810 in the cache 820; however, the smaller the granularity of the sub-matrices, the larger the number of sub-matrix combinations and the more complicated the pre-reading mechanism of the sub-matrices. Therefore, when determining which sub-matrices of the third OP are written into the cache 820, the above process of analyzing OPs by using the computation graph may be adopted, so that the first processor 810 divides the matrix of the third OP according to the storage capacity of the cache 820 and the minimum matrix specification of the first processor 810, and writes the sub-matrices of the third OP into the cache 820 according to the storage capacity of the cache 820. This improves the hit rate of reading data in the cache 820, reduces the time the first processor 810 spends obtaining the data required by the neural network, reduces the time required by the first processor 810 to operate the neural network, and improves the efficiency with which the first processor 810 operates the neural network.
It can be understood that, in order to improve the hit rate of the first processor reading data from the cache, in the embodiment of the present application the first processor writes the sub-matrices stored in the memory into the cache. In some possible examples, however, for instance where the first processor includes a multi-level cache (e.g., an L1 Cache and an L2 Cache), the L2 Cache stores the matrices of 2 OPs of the neural network, and the storage space required by each of these matrices is greater than the storage capacity of the L1 Cache, the first processor may also, during the operation of the neural network, split the matrices of the 2 OPs stored in the L2 Cache and write part of the sub-matrices of the 2 OPs into the L1 Cache. This improves the hit rate of the first processor reading data from the L1 Cache, reduces the number of data transfers and the data reading time of the first processor, further improves the efficiency of the first processor operating the neural network, and reduces the processing delay of the neural network.
It is understood that, in order to implement the functions of the above embodiments, the computing devices and chips include corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software driven hardware depends on the particular application scenario and design constraints imposed on the solution.
The data processing method provided in the present application is described in detail above with reference to fig. 1 to 8, and the data processing apparatus and the computing device provided in the present embodiment are described below with reference to fig. 9 and 10.
Fig. 9 is a schematic structural diagram of a data processing apparatus provided in the present application, where the data processing apparatus 900 includes a pre-fetch module 910, a processing module 920, and a communication module 930, and the data processing apparatus 900 may implement the functions of the first processor 310 or the second processor 320 in fig. 4 and 7, and the data processing apparatus 900 may also implement the functions of the first processor 810 in fig. 8.
When the data processing apparatus 900 is used to implement the functions of the first processor 310 in the method embodiment shown in fig. 4, the pre-reading module 910 is used to implement S420, the processing module 920 is used to implement S430, and the communication module 930 is used to implement S440.
When the data processing device 900 is used to implement the functions of the second processor 320 in the method embodiment shown in fig. 4, the communication module 930 is used to implement S410.
When the data processing apparatus 900 is used to implement the functions of the first processor 310 in the method embodiment shown in fig. 7, the pre-reading module 910 is configured to implement S4201 to S4203, the processing module 920 is configured to implement S4301 to S4305, and the communication module 930 is configured to implement S440.
When the data processing device 900 is used to implement the functions of the second processor 320 in the method embodiment shown in fig. 7, the communication module 930 is used to implement S410.
Optionally, the data processing apparatus 900 may further include a storage module 940, and the storage module 940 may be configured to implement the function of caching in the first processor. It should be understood that the present embodiment merely provides an exemplary division for the structure and functional modules of the data processing apparatus 900, and the present application does not set any limit to the specific division.
More detailed descriptions about the data processing apparatus 900 can be directly obtained by referring to the related descriptions in the method embodiments shown in fig. 3 to fig. 8, which are not repeated herein.
Fig. 10 is a schematic structural diagram of a computing device 1000 provided in the present application, where the computing device 1000 includes a processor 1010 and a communication interface 1020. Processor 1010 and communication interface 1020 are coupled to one another. It will be appreciated that the communication interface 1020 may be a transceiver or an input-output interface. Optionally, computing device 1000 may also include a memory 1030 to store instructions for execution by processor 1010, or to store input data required by processor 1010 to execute the instructions, or to store data generated by processor 1010 after executing the instructions.
The processor 1010 is configured to, during an operation of the neural network, write a matrix of at least one OP of the plurality of OPs of the neural network stored in the memory into the memory 1030 according to a storage capacity of the memory 1030, where a storage space required by the matrix of the at least one OP is smaller than or equal to the storage capacity of the memory 1030; the processor 1010 is further configured to generate first data from the matrix of at least one OP stored in the memory 1030. The processor can read a plurality of OP matrixes from the memory in advance according to the storage capacity of the memory, so that the number of times of reading the OP matrixes from the memory by the processor is reduced, the data reading time in the operation process of the neural network and the total operation time required by the neural network are reduced, and the operation efficiency of the neural network is improved.
When the computing device 1000 is used to implement the method shown in fig. 4 or fig. 7, the processor 1010, the communication interface 1020 and the memory 1030 may also cooperatively implement various operational steps in a data processing method performed by the first processor and the second processor. The computing device 1000 may also perform the functions of the data processing apparatus 900 shown in fig. 9, which are not described herein in detail.
The specific connection medium among the communication interface 1020, the processor 1010 and the memory 1030 is not limited in the embodiments of the present application. In the embodiment of the present application, the communication interface 1020, the processor 1010 and the memory 1030 are connected by a bus 1040 in fig. 10, the bus is represented by a thick line in fig. 10, and the connection manner between other components is merely illustrative and not limited thereto. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
The memory 1030 may be used for storing software programs and modules, such as program instructions/modules corresponding to the data processing method provided in the embodiments of the present application, and the processor 1010 executes the software programs and modules stored in the memory 1030, thereby executing various functional applications and data processing. The communication interface 1020 may be used for communicating signaling or data with other devices. The computing device 1000 may have multiple communication interfaces 1020 in this application.
The memory may be, but is not limited to, RAM, ROM, PROM, EPROM, EEPROM, and the like.
The processor may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor including a CPU, NP, etc.; but also DSPs, ASICs, FPGAs or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.; the system can also be a GPU, an NPU and other processors with neural network computing power.
The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM, flash memory, ROM, PROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a computing device. Of course, the processor and the storage medium may reside as discrete components in a computing device.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a network appliance, a user device, or other programmable apparatus. The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that integrates one or more available media. The usable medium may be a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape; or an optical medium, such as a Digital Video Disc (DVD); but may also be a semiconductor medium, such as an SSD.
In the embodiments of the present application, unless otherwise specified or conflicting with respect to logic, the terms and/or descriptions in different embodiments have consistency and may be mutually cited, and technical features in different embodiments may be combined to form a new embodiment according to their inherent logic relationship.
The terms "first," "second," and "third," etc. in the description and claims of this application and the above-described drawings are used for distinguishing between different objects and not for limiting a particular order. In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion.
In the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. Furthermore, for elements (elements) that appear in the singular form "a," an, "and" the, "they are not intended to mean" one or only one "unless the context clearly dictates otherwise, but rather" one or more than one. For example, "a device" means for one or more such devices. Still further, at least one (at least one of a). In the description of the text of the present application, the character "/" generally indicates that the former and latter associated objects are in an "or" relationship; in the formula of the present application, the character "/" indicates that the preceding and following related objects are in a relationship of "division".
It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for convenience of description and distinction and are not intended to limit the scope of the embodiments of the present application. The sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of the processes should be determined by their functions and inherent logic.

Claims (23)

1. A method of data processing, the method being performed by a first processor, the method comprising:
writing a matrix of at least one OP in a plurality of operations OP of the neural network stored in a memory into the cache according to the storage capacity of the cache of the first processor in the operation process of the neural network, wherein the storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the cache;
and generating first data according to the at least one OP matrix stored in the cache.
2. The method according to claim 1, wherein the buffer includes a first buffer unit and a second buffer unit, the storage capacity of the first buffer unit is smaller than that of the second buffer unit, and the data reading speed of the first buffer unit is greater than that of the second buffer unit;
writing a matrix of at least one OP of a plurality of OPs of the neural network stored in memory to the cache, comprising:
if the storage space required by the at least one OP matrix is smaller than or equal to the storage capacity of the second cache unit, judging whether the storage space required by the at least one OP matrix is smaller than or equal to the storage capacity of the first cache unit;
if the storage space required by the at least one OP matrix is smaller than or equal to the storage capacity of the first cache unit, writing the at least one OP matrix stored in the memory into the first cache unit;
if the storage space required by the matrix of the at least one OP is larger than the storage capacity of the first cache unit, writing the matrix of a part of OPs in the at least one OP stored in the memory into the first cache unit, and writing the matrix of another part of OPs in the at least one OP into the second cache unit, wherein the part of OPs have continuously executed incidence relation.
3. The method according to claim 1 or 2, wherein the matrix of the at least one OP comprises a data matrix for indicating input data of a first OP of the at least one OP and/or a weight matrix for indicating a weight of the data matrix.
4. The method according to any one of claims 1-3, wherein the matrix of at least one OP comprises a matrix of a first OP and a matrix of a second OP, the first OP and the second OP having a continuously performed correlation;
the generating first data according to the matrix of the at least one OP stored in the cache includes:
reading the matrix of the first OP stored in the cache;
generating second data according to the matrix of the first OP;
writing the second data into the cache, and deleting the matrix of the first OP stored in the cache;
reading the matrix of the second OP and the second data stored in the cache;
generating the first data according to the matrix of the second OP and the second data.
5. The method of claim 4, wherein the matrix of the first OP comprises a data matrix of the first OP, the data matrix of the first OP being indicative of input data of the neural network.
6. The method according to claim 4 or 5, wherein after generating the first data from the matrix of the second OP and the second data, the method further comprises:
and writing the first data into the cache, and deleting the matrix of the second OP stored in the cache.
7. The method of claim 4 or 5, wherein after generating the first data, the method further comprises:
and if the execution of the plurality of OPs is finished, outputting the first data as output data.
8. The method of claim 7, wherein outputting the first data as output data comprises:
and sending the first data to a second processor.
9. The method according to any of claims 1-8, wherein prior to said writing a matrix of at least one of a plurality of OPs of the neural network stored in memory to the cache, the method further comprises:
and receiving a task request sent by a second processor, wherein the task request is used for instructing the first processor to start an operation task of a neural network.
10. The method according to any one of claims 1-9, further comprising:
if the storage space required by the matrix of the third OP in the multiple OPs is larger than the storage capacity of the cache, the matrix of the third OP is segmented into multiple sub-matrices;
and writing at least one sub-matrix in the plurality of sub-matrices stored in the memory into the cache, wherein the storage space required by the at least one sub-matrix is less than or equal to the storage capacity of the cache.
11. A data processing apparatus, for use with a first processor, the apparatus comprising:
the pre-reading module is used for writing a matrix of at least one operation (OP) in a plurality of OPs of the neural network stored in a memory into the cache according to the storage capacity of the cache of the first processor in the operation process of the neural network, wherein the storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the cache;
and the processing module is used for generating first data according to the at least one OP matrix stored in the cache.
12. The apparatus according to claim 11, wherein the buffer includes a first buffer unit and a second buffer unit, a storage capacity of the first buffer unit is smaller than a storage capacity of the second buffer unit, and a data reading speed of the first buffer unit is greater than a data reading speed of the second buffer unit;
if the storage space required by the at least one OP matrix is less than or equal to the storage capacity of the second cache unit, the pre-reading module is specifically configured to determine whether the storage space required by the at least one OP matrix is less than or equal to the storage capacity of the first cache unit;
the pre-reading module is specifically configured to write the matrix of the at least one OP stored in the memory into the first cache unit if a storage space required by the matrix of the at least one OP is smaller than or equal to a storage capacity of the first cache unit;
the pre-reading module is specifically configured to, if a storage space required by the matrix of the at least one OP is larger than a storage capacity of the first cache unit, write the matrix of a part of OPs in the at least one OP stored in the memory into the first cache unit, write the matrix of another part of OPs in the at least one OP into the second cache unit, where the part of OPs have continuously executed association relationships.
13. The apparatus according to claim 11 or 12, wherein the matrix of the at least one OP comprises a data matrix for indicating input data of a first OP of the at least one OP and/or a weight matrix for indicating a weight of the data matrix.
14. The apparatus according to any of claims 11-13, wherein the matrix of the at least one OP comprises a matrix of a first OP and a matrix of a second OP, the first OP and the second OP having a continuously performed correlation;
the processing module is specifically configured to read the matrix of the first OP stored in the cache;
the processing module is specifically configured to generate second data according to the matrix of the first OP;
the processing module is specifically configured to write the second data into the cache, and delete the matrix of the first OP stored in the cache;
the processing module is specifically configured to read the matrix of the second OP and the second data stored in the cache;
the processing module is specifically configured to generate the first data according to the matrix of the second OP and the second data.
15. The apparatus of claim 14, wherein the matrix of the first OP comprises a data matrix of the first OP, the data matrix of the first OP indicating input data of the neural network.
16. The apparatus according to claim 14 or 15, wherein the processing module is further configured to write the first data into the buffer and delete the matrix of the second OP stored in the buffer after the first data is generated.
17. The apparatus of claim 14 or 15, further comprising: a communication module;
the communication module is configured to output the first data as output data if the plurality of OPs are executed.
18. The apparatus of claim 17, wherein the communication module is specifically configured to send the first data to a second processor.
19. The apparatus according to any one of claims 11-18, further comprising: a communication module;
the communication module is configured to receive a task request sent by a second processor before the pre-reading module writes the matrix of at least one OP of the plurality of OPs of the neural network stored in the memory into the cache, where the task request is used to instruct the first processor to start an operation task of the neural network.
20. The apparatus according to any of claims 11-19, wherein the pre-fetch module is further configured to split the matrix of a third OP of the plurality of OPs into a plurality of sub-matrices if the storage space required by the matrix is larger than the storage capacity of the buffer;
the pre-reading module is further configured to write at least one sub-matrix of the plurality of sub-matrices stored in the memory into the cache, where a storage space required by the at least one sub-matrix is smaller than or equal to a storage capacity of the cache.
21. A chip comprising a memory and a processor;
the processor is used for writing a matrix of at least one OP in a plurality of operations OP of the neural network stored in the memory into the memory according to the storage capacity of the memory in the operation process of the neural network, wherein the storage space required by the matrix of the at least one OP is less than or equal to the storage capacity of the memory;
the processor is further configured to generate first data according to the matrix of the at least one OP stored in the memory.
22. A computing device comprising a processor and interface circuitry to receive signals from or transmit signals to other computing devices than the computing device, the processor being arranged to implement the method of any of claims 1 to 10 by logic circuitry or executing code instructions.
23. A computer storage medium, in which a computer program or instructions are stored which, when executed by a computing device or processor, implement the method of any one of claims 1 to 10.
CN202110474504.7A 2021-04-29 2021-04-29 Data processing method and device Pending CN115271047A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110474504.7A CN115271047A (en) 2021-04-29 2021-04-29 Data processing method and device
PCT/CN2022/082796 WO2022227962A1 (en) 2021-04-29 2022-03-24 Data processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110474504.7A CN115271047A (en) 2021-04-29 2021-04-29 Data processing method and device

Publications (1)

Publication Number Publication Date
CN115271047A true CN115271047A (en) 2022-11-01

Family

ID=83744648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110474504.7A Pending CN115271047A (en) 2021-04-29 2021-04-29 Data processing method and device

Country Status (2)

Country Link
CN (1) CN115271047A (en)
WO (1) WO2022227962A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150563A (en) * 2023-02-24 2023-05-23 之江实验室 Service execution method and device, storage medium and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842307B (en) * 2023-08-28 2023-11-28 腾讯科技(深圳)有限公司 Data processing method, device, equipment, chip and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874219B (en) * 2016-12-23 2018-11-02 深圳云天励飞技术有限公司 A kind of data dispatching method of convolutional neural networks, system and computer equipment
CN107608715B (en) * 2017-07-20 2020-07-03 上海寒武纪信息科技有限公司 Apparatus and method for performing artificial neural network forward operations
CN108921288A (en) * 2018-05-04 2018-11-30 中国科学院计算技术研究所 Neural network activates processing unit and the neural network processor based on the device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150563A (en) * 2023-02-24 2023-05-23 之江实验室 Service execution method and device, storage medium and electronic equipment
CN116150563B (en) * 2023-02-24 2024-01-05 之江实验室 Service execution method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2022227962A1 (en) 2022-11-03

Similar Documents

Publication Publication Date Title
EP3612989B1 (en) Flexible hardware for high throughput vector dequantization with dynamic vector length and codebook size
CN108229670B (en) Deep neural network acceleration platform based on FPGA
WO2018205708A1 (en) Processing system and method for binary weight convolutional network
WO2022227962A1 (en) Data processing method and apparatus
US10643126B2 (en) Systems, methods and devices for data quantization
KR20210092078A (en) Memory Device performing parallel calculation process, Operating Method thereof and Operation Method of Memory controller controlling memory device
US11775807B2 (en) Artificial neural network and method of controlling fixed point in the same
CN110991633A (en) Residual error neural network model based on memristor network and application method thereof
CN113597621A (en) Computing resource allocation technique and neural network system
US11487342B2 (en) Reducing power consumption in a neural network environment using data management
CN110531923A (en) Storage equipment including reconfigurable logic and the method for operating the storage equipment
US20220197530A1 (en) Memory system and operating method thereof
US11809953B1 (en) Dynamic code loading for multiple executions on a sequential processor
CN116601585A (en) Data type aware clock gating
CN113065643A (en) Apparatus and method for performing multi-task convolutional neural network prediction
CN113261015A (en) Neural network system and data processing technology
US11693569B2 (en) Host performing an embedding operation and computing system including the same
KR20200131741A (en) Technologies for performing macro operations in memory
CN115836346A (en) In-memory computing device and data processing method thereof
CN111783984A (en) Neural network operation method, device, equipment and storage medium
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
CN115905146A (en) Data processing method and device, computing equipment and storage system
JP2024516514A (en) Memory mapping of activations for implementing convolutional neural networks
CN114492778A (en) Operation method of neural network model, readable medium and electronic device
US11176043B2 (en) Distributed memory-augmented neural network architecture

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination