CN111428879B - Data processing method, device, chip and computer readable storage medium


Info

Publication number
CN111428879B
CN111428879B (application CN202010142403.5A)
Authority
CN
China
Prior art keywords
matrix
data
memory
register set
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010142403.5A
Other languages
Chinese (zh)
Other versions
CN111428879A
Inventor
闯小明
杨龚轶凡
郑瀚寻
高雷
侯觉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhonghao Xinying Hangzhou Technology Co ltd
Original Assignee
Zhonghao Xinying Hangzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhonghao Xinying Hangzhou Technology Co ltd filed Critical Zhonghao Xinying Hangzhou Technology Co ltd
Priority to CN202010142403.5A priority Critical patent/CN111428879B/en
Publication of CN111428879A publication Critical patent/CN111428879A/en
Application granted granted Critical
Publication of CN111428879B publication Critical patent/CN111428879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N 20/00 Machine learning
        • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
                • G06N 3/04 Architecture, e.g. interconnection topology
                    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
                • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                    • G06N 3/063 Physical realisation using electronic means
                • G06N 3/08 Learning methods
                    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Embodiments of the invention disclose a data processing method, a data processing device, a chip and a computer-readable storage medium for accelerating the batch normalization layer in deep-learning model training. Multidimensional tensor data are stored in a first memory according to a preset rule and then fetched and operated on in two-dimensional form. A third matrix is constructed through the cooperation of several register sets and a second memory, and a single matrix multiplication of the first matrix with the third matrix yields, at the same time, the element sum and the element sum of squares of every row of the first matrix. This parallel computation of the two quantities accelerates the calculation of the mean and variance in the batch normalization layer and solves the problem of long computation time caused by the very large data volume the layer must process. The batch normalization operation is thereby sped up, and the time required to train the whole deep-learning model is greatly shortened.

Description

Data processing method, device, chip and computer readable storage medium
Technical Field
The present invention relates to the field of deep learning model training, and in particular, to a data processing method, device, chip and computer readable storage medium.
Background
Deep learning is a young field of machine-learning research that aims to create or simulate neural networks modeled on the analytical learning of the human brain, imitating its mechanisms to interpret data such as images, sound and text. A deep-learning model becomes practically useful only after being trained on large amounts of data; common deep-learning models include the convolutional neural network (Convolutional Neural Network, CNN).
During the training of a deep-learning model, the layers of the model are processed with batch normalization (Batch Normalization, BN), which reduces the variability that samples experience as they pass through each layer. Batch normalization, proposed in 2015, is a technique for improving the speed, performance and stability of deep neural networks (Deep Neural Network). The change in the distribution of a middle layer's outputs as training progresses is called internal covariate shift (Internal Covariate Shift), and eliminating this phenomenon accelerates training. The batch normalization layer (Batch Normalization Layer) fixes the mean and variance of a layer's inputs by normalizing them, reducing internal covariate shift, so the network can train with a larger learning rate, ultimately speeding up training. Batch normalization also makes the network less dependent on weight initialization.
The batch normalization layer performs both forward propagation (Forward Propagation) and backward propagation (Backward Propagation) during neural-network training. Notably, forward propagation requires summing the data elements and summing their squares. In deep-learning training, the data the batch normalization layer must process are often multidimensional tensors, and the enormous data volume makes the sum and sum-of-squares operations very time-consuming, slowing the training of the deep-learning model.
Disclosure of Invention
In view of the above, the present invention provides a data processing method, apparatus, chip and computer-readable storage medium to address the long computation time of the batch normalization layer and the resulting slow training of deep-learning models.
In a first aspect, an embodiment of the present invention provides a data processing method. The method is used for accelerating the batch normalization layer in deep-learning model training. The input of the batch normalization layer comprises multidimensional tensor data, whose dimensions include a channel dimension. The method provides a first register set, a second register set, a first memory and a second memory, the first register set and the second register set each comprising M rows and N columns of registers. The second memory can store at least N rows and K columns of data, where K is not less than 2M. The method comprises the following steps:
storing the multidimensional tensor data into a first memory according to a preset rule;
fetching the data from the first memory by channel and placing them into the first register set, the Q data of one channel occupying exactly one row of the first register set, where Q is not greater than N; when Q is less than N, the row holding the Q data in the first register set is padded with at least N − Q zeros; all data in the first register set form a first matrix;
placing initial data of 1 row and Q columns, whose entries are all 1, into any row of the second register set; when the data in the second register set include the initial data, all data in the second register set form a second matrix, in which every entry other than the initial data is 0;
transposing both the first matrix and the second matrix and placing them into the second memory by columns, all data in the second memory forming a third matrix;
and performing matrix multiplication of the first matrix and the third matrix to obtain a multiplication result, the multiplication result comprising, for each row of the first matrix, the sum of its elements and the sum of their squares.
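As an illustration only, the construction and readout described by these steps can be sketched in plain Python with hypothetical tiny sizes (M=2 channels, N=4 registers per row, Q=3 valid elements; the ones row is placed in row 0 of the second matrix for simplicity, though the method allows any row):

```python
def transpose(A):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Naive matrix multiplication for small matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

M, N, Q = 2, 4, 3
# First matrix: one channel per row, padded with N - Q zeros.
A = [[1.0, 2.0, 3.0, 0.0],
     [4.0, 5.0, 6.0, 0.0]]
# Second matrix: Q ones in one row (row 0 here), zeros elsewhere.
B = [[1.0] * Q + [0.0] * (N - Q)] + [[0.0] * N for _ in range(M - 1)]

# Third matrix: the transposed second matrix followed, by columns, by the
# transposed first matrix, giving an N x 2M matrix.
T = [bt + at for bt, at in zip(transpose(B), transpose(A))]

R = matmul(A, T)                               # M x 2M multiplication result
row_sums    = [R[i][0] for i in range(M)]      # column produced by the ones row
row_sq_sums = [R[i][M + i] for i in range(M)]  # diagonal of the A * A^T block
print(row_sums)     # [6.0, 15.0]
print(row_sq_sums)  # [14.0, 77.0]
```

A single multiplication thus delivers both quantities for every channel row at once, which is the parallelism the method relies on.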
Further, the multi-dimensional tensor data includes four-dimensional tensor data including a batch size B, a height H, a width W, and a channel C.
Further, the preset rule includes storing in cross-channel element order.
Further, the preset rule includes storing in channel-contiguous element order.
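To make the two preset rules concrete, here is a small hypothetical sketch in plain Python: a (B, H, W, C) tensor whose element values encode their positions is flattened under each rule:

```python
B, H, W, C = 1, 2, 2, 3
# Element value = its linear position in (b, h, w, c) nesting order.
tensor = [[[[((b * H + h) * W + w) * C + c for c in range(C)]
            for w in range(W)] for h in range(H)] for b in range(B)]

# Cross-channel order: every channel's element at one position, then the next.
cross = [tensor[b][h][w][c]
         for b in range(B) for h in range(H) for w in range(W) for c in range(C)]

# Channel-contiguous order: all of channel 0, then all of channel 1, and so on.
contig = [tensor[b][h][w][c]
          for c in range(C) for b in range(B) for h in range(H) for w in range(W)]

print(cross[:6])   # [0, 1, 2, 3, 4, 5]
print(contig[:4])  # [0, 3, 6, 9]  (every third value: channel 0 only)
```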
According to the method, the multidimensional tensor data are stored in the first memory according to the preset rule and then fetched and operated on in two-dimensional form. A third matrix is constructed through the cooperation of the register sets and the second memory, and a single matrix multiplication of the first matrix with the third matrix yields, at the same time, the element sum and the element sum of squares of every row of the first matrix. This parallel computation of the two quantities accelerates the calculation inside the batch normalization layer and solves the problem of long computation time caused by the very large data volume the layer must process. The batch normalization operation is thereby sped up, and the time required to train the whole deep-learning model is greatly shortened.
In a second aspect, an embodiment of the present invention provides a data processing apparatus for accelerating the batch normalization layer in deep-learning model training. The apparatus comprises an arithmetic unit, a first memory, a first register set, a second memory and a second register set; the first register set is connected to the first memory, the second memory and the arithmetic unit, and the second memory is connected to the second register set and the arithmetic unit. The first register set and the second register set each comprise M rows and N columns of registers. Multidimensional tensor data are provided as the input of the batch normalization layer, wherein:
the first memory is used for storing multi-dimensional tensor data, and the dimension of the multi-dimensional tensor data comprises a channel;
the first register group is used for storing a first matrix, and the first matrix at least comprises partial data fetched from the first memory;
the second register set is used for storing a second matrix, one row of which holds initial data of 1 row and Q columns whose entries are all 1;
the second memory is used for storing a third matrix, and the third matrix at least comprises a transposed matrix of the first matrix and a transposed matrix of the second matrix;
the arithmetic unit is used for multiplying the first matrix and the third matrix to obtain a multiplication result, the multiplication result comprising, for each row of the first matrix, the sum of its elements and the sum of their squares.
According to the apparatus, the multidimensional tensor data are stored in the first memory according to the preset rule and then fetched and operated on in two-dimensional form. A third matrix is constructed through the cooperation of the register sets and the second memory, and a single matrix multiplication of the first matrix with the third matrix yields, at the same time, the element sum and the element sum of squares of every row of the first matrix. This parallel computation of the two quantities accelerates the calculation inside the batch normalization layer and solves the problem of long computation time caused by the very large data volume the layer must process. The batch normalization operation is thereby sped up, and the time required to train the whole deep-learning model is greatly shortened.
In a third aspect, an embodiment of the present invention provides a chip. The chip comprises at least the aforementioned data processing apparatus.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of the data processing method described above.
The implementations provided in the above aspects may be further combined to provide additional implementations of the present invention.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of four-dimensional tensor data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific four-dimensional tensor data according to an embodiment of the present invention;
FIG. 4A is a schematic diagram of four-dimensional data stored in SRAM in cross-channel element order according to an embodiment of the present invention;
FIG. 4B is a schematic diagram of four-dimensional data stored in SRAM in channel-contiguous element order according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data processing apparatus 500 according to an embodiment of the present invention;
FIG. 6A is a schematic diagram of a preferred data processing apparatus 600 according to an embodiment of the present invention;
FIG. 6B is a schematic diagram illustrating an arrangement of registers in the first register file 602 according to an embodiment of the present invention;
FIG. 6C is a schematic diagram of a first matrix placed into the first register set 602 according to an embodiment of the invention;
FIG. 6D is a schematic diagram of a second matrix placed into the second register set 603 according to an embodiment of the present invention;
FIG. 6E is a schematic diagram of a third matrix placed in the second memory 604 according to an embodiment of the present invention;
FIG. 6F is a schematic diagram of a multiplication result of a first matrix and a third matrix according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a chip according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
It will be understood that when an element is referred to as being "connected to" another element, it can be directly connected to the other element or indirectly connected to it.
Furthermore, the terms "first", "second" and the like are used for description only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. A feature qualified by "first" or "second" may thus explicitly or implicitly include one or more such features. In the description of the present invention, "a plurality" means two or more, unless explicitly defined otherwise.
Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The following describes embodiments of the present invention in detail.
The embodiment of the invention provides a data processing method which can be used for accelerating the operation of a batch standardization layer in the training of a deep learning model. The data processed by the method includes multi-dimensional tensor data, wherein dimensions in the multi-dimensional tensor data include at least channels. The method provides a first register set, a second register set, a first memory and a second memory, wherein the first register set and the second register set comprise M rows and N columns of registers, and the second memory can store at least N rows and K columns of data, and K is not less than 2M. Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the invention. As shown in fig. 1, the method comprises the steps of:
step 101: storing the multidimensional tensor data into the first memory according to the preset rule;
step 102: acquiring a first matrix: fetching data from the first memory by channel and placing them into the first register set, the Q data of one channel occupying exactly one row, where Q is not greater than N; when Q is less than N, the row holding the Q data in the first register set is padded with at least N − Q zeros; all data in the first register set form the first matrix;
step 103: acquiring a second matrix: placing initial data of 1 row and Q columns, whose entries are all 1, into any row of the second register set; when the data in the second register set include the initial data, all data in the second register set form the second matrix;
step 104: obtaining a third matrix: after transposing both the first matrix and the second matrix, placing them into the second memory by columns; all data in the second memory form the third matrix;
step 105: and carrying out matrix multiplication on the first matrix and the third matrix to obtain a multiplication result, wherein the multiplication result comprises the sum of elements and the square sum of the elements of each row in the first matrix.
In the implementation process, the step of acquiring the second matrix and the step of acquiring the first matrix may be performed sequentially or may be performed in parallel.
In a preferred embodiment of the invention, an SRAM is used to store the multidimensional tensor data; each datum in the SRAM has a corresponding address through which it can be accessed. The multidimensional tensor data are a four-dimensional tensor (4-dimension tensor) composed of batches of two-dimensional data. Fig. 2 is a schematic diagram of four-dimensional tensor data according to an embodiment of the present invention. As shown in fig. 2, the first dimension of the four-dimensional tensor data is the batch size B (batch size), the second dimension is the height H (height), the third dimension is the width W (width), and the fourth dimension is the number of channels C (channel).
Fig. 3 is a schematic diagram of specific four-dimensional tensor data according to an embodiment of the present invention. As shown in fig. 3, the dimensions of the four-dimensional tensor data are B=1, H=5, W=4 and C=64. The four-dimensional tensor data shown in fig. 3 may be stored in the SRAM in cross-channel element order, i.e. the element at one position of every channel is stored before the element at the next position. Fig. 4A is a schematic diagram of four-dimensional data stored in SRAM in cross-channel element order according to an embodiment of the present invention. As shown in fig. 4A, the tensor of fig. 3 is stored in cross-channel order: first, the first elements (0, 20, …, 1260) of channel C0 (i.e. C=0) through channel C63 (i.e. C=63) are stored in the SRAM, in the order (0,0,0,0), (0,0,0,1), …, (0,0,0,63); then the second elements (1, 21, …, 1261) of channels C0 through C63, in the order (0,0,1,0), (0,0,1,1), …, (0,0,1,63); and so on, until the last elements (19, 39, …, 1279) of channels C0 through C63 are stored, in the order (0,3,4,0), (0,3,4,1), …, (0,3,4,63).
The four-dimensional tensor data shown in fig. 3 may also be stored in the SRAM in channel-contiguous element order, i.e. all elements of one channel are stored first, then all elements of the next channel, and so on. Fig. 4B is a schematic diagram of four-dimensional data stored in SRAM in channel-contiguous element order according to an embodiment of the present invention. As shown in fig. 4B, all data (0, 1, …, 19) of channel C0 in fig. 3 are placed into the SRAM first, in the order (0,0,0,0), (0,0,1,0), …, (0,3,4,0); then all data (20, 21, …, 39) of channel C1, in the order (0,0,0,1), (0,0,1,1), …, (0,3,4,1); and so on, until all data (1260, 1261, …, 1279) of the last channel (channel C63) are placed into the SRAM, in the order (0,0,0,63), (0,0,1,63), …, (0,3,4,63).
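The two layouts imply different address arithmetic. A sketch in plain Python (hypothetical helper functions, word-addressed from base 0; the (b, h, w, c) index convention here need not match the tuple notation used in the figures):

```python
# Hypothetical address computation for element (b, h, w, c) of a
# B x H x W x C tensor under the two storage orders.
def addr_cross_channel(b, h, w, c, H, W, C):
    """Cross-channel order: all channels of one position, then the next position."""
    return ((b * H + h) * W + w) * C + c

def addr_channel_contiguous(b, h, w, c, B, H, W):
    """Channel-contiguous order: all positions of one channel, then the next channel."""
    return ((c * B + b) * H + h) * W + w

# With the fig. 3 shape B=1, H=5, W=4, C=64, both orders place the 1280
# elements at addresses 0..1279, but in different sequences: channel C1's
# first element sits at address 1 in cross-channel order and at address 20
# in channel-contiguous order.
print(addr_cross_channel(0, 0, 0, 1, 5, 4, 64))       # 1
print(addr_channel_contiguous(0, 0, 0, 1, 1, 5, 4))   # 20
```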
According to the method, the multidimensional tensor data are stored in the first memory according to the preset rule and then fetched and operated on in two-dimensional form. A third matrix is constructed through the cooperation of the register sets and the second memory, and a single matrix multiplication of the first matrix with the third matrix yields, at the same time, the element sum and the element sum of squares of every row of the first matrix. This parallel computation of the two quantities accelerates the calculation inside the batch normalization layer and solves the problem of long computation time caused by the very large data volume the layer must process. The batch normalization operation is thereby sped up, and the time required to train the whole deep-learning model is greatly shortened.
An embodiment of the present invention provides a data processing apparatus for accelerating the batch normalization layer in a deep-learning model; fig. 5 is a schematic structural diagram of a data processing apparatus 500 according to an embodiment of the present invention. As shown in fig. 5, the data processing apparatus 500 includes a first memory 501, a first register set 502, a second register set 503, a second memory 504 and an operator 505. The first register set 502 is connected to the first memory 501, the second memory 504 and the operator 505; the second memory 504 is connected to the second register set 503 and the operator 505. The first register set 502 and the second register set 503 have the same structure, each comprising M rows and N columns of registers. The second memory 504 can store at least N rows and K columns of data, K being not less than 2M. The multidimensional tensor data are provided as the input of a batch normalization layer, wherein:
the first memory 501 is configured to store multi-dimensional tensor data, where one dimension of the multi-dimensional tensor data is a channel;
the first register set 502 is used for storing a first matrix, and the first matrix at least comprises part of data fetched from the first memory 501;
the second register set 503 is used for storing a second matrix, one row of which holds initial data of 1 row and Q columns whose entries are all 1;
the second memory 504 is configured to store a third matrix, where the third matrix includes at least a transpose of the first matrix and a transpose of the second matrix;
the operator 505 is configured to multiply the first matrix and the third matrix to obtain a multiplication result, where the multiplication result includes a sum of elements and a sum of squares of elements of each row in the first matrix.
For a better understanding of the present invention, the solution disclosed in the present invention will be described by way of example with reference to specific application scenarios.
In the batch normalization layer of deep-learning model training, forward propagation must be performed, and during forward propagation the mean and variance (mean & variance) of every channel (channel) of the input data must be calculated, so that a mean and a variance are finally obtained for each channel. The mean (μ) and variance (σ²) are calculated as follows:

μ = (1/m) · Σ_{i=1..m} x_i,    σ² = (1/m) · Σ_{i=1..m} (x_i − μ)²

where m = B × H × W (the product of the batch size, height and width, i.e. the number of elements in each channel). The data processing apparatus and method provided by the invention are now used to calculate the mean and variance of each channel. Calculating the mean of a channel first requires the sum of all elements in that channel (Σ x_i); calculating its variance first requires the sum of squares of all elements in that channel (Σ x_i²). The input data are four-dimensional tensor data, assumed here to be the specific tensor shown in fig. 3.
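A sketch of the final step in plain Python (hypothetical helper): since the device produces Σ x_i and Σ x_i² per channel, the variance can be formed without a second pass over the data via the identity σ² = Σ x_i²/m − μ², which is algebraically equal to the definition above:

```python
def mean_and_variance(elem_sum, sq_sum, m):
    """Per-channel mean and variance from the element sum and sum of squares."""
    mu = elem_sum / m
    var = sq_sum / m - mu * mu      # identity: E[x^2] - (E[x])^2
    return mu, var

channel = [1.0, 2.0, 3.0, 6.0]
s  = sum(channel)                   # 12.0
sq = sum(x * x for x in channel)    # 50.0
mu, var = mean_and_variance(s, sq, len(channel))
print(mu, var)   # 3.0 3.5

# Cross-check against the direct definition of the variance.
direct = sum((x - mu) ** 2 for x in channel) / len(channel)
print(direct)    # 3.5
```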
Fig. 6A is a schematic structural diagram of a preferred data processing apparatus 600 according to an embodiment of the present invention. As shown in fig. 6A, the data processing apparatus 600 includes a first memory 601, a first register group 602, a second register group 603, a second memory 604, and an operator 605. The first register set 602 is connected to the first memory 601, the second memory 604, and the operator 605, and the second memory 604 is connected to the second register set 603 and the operator 605. The first register set 602 and the second register set 603 have the same structure, and each includes 4 rows and 32 columns of registers as shown in fig. 6B. The second memory 604 can store at least 32 rows and 16 columns of data. The specific procedure for calculating the mean and variance using the data processing device 600 is as follows:
Firstly, the four-dimensional tensor data are put into the first memory 601 in channel-contiguous element order; in another preferred embodiment, the data may also be put into the first memory 601 in cross-channel element order.
Then, part of the four-dimensional tensor data is fetched by channel and placed into the first register set 602: the data of channels C0 through C3 are placed into rows one to four of the first register set 602 respectively, and since each channel has at most 20 elements, fewer than 32, each row is padded with 12 trailing zeros. All data in the first register set 602 now constitute the first matrix, which, as shown in fig. 6C, has size 4×32.
At the same time, initial data of 1 row and 20 columns, with all entries 1, are put into row 2 of the second register set 603, and the remaining 12 registers of row 2 and the registers of the other three rows are all filled with 0. All data in the second register set 603 now constitute the second matrix, which, as shown in fig. 6D, has size 4×32.
After the first matrix and the second matrix are obtained, both are transposed, and the two transposed matrices are placed into the second memory 604 by columns, the transpose of the first matrix being placed after the transpose of the second matrix, as shown in fig. 6E. All data in the second memory 604 constitute the third matrix, whose size is 32×16.
Finally, the first matrix in the first register set 602 and the third matrix in the second memory 604 are sent to the operator 605, and matrix multiplication of the first matrix and the third matrix is performed by the operator 605 to obtain the multiplication result shown in fig. 6F: a 4×16 matrix. The four elements of column 2, from top to bottom, are the element sums of the four rows of the first matrix, i.e. the sums of all elements in channels C0, C1, C2 and C3 respectively. The first element of column 5 is the sum of squares of the elements in row 1 of the first matrix (the dot product of row 1 with its own transposed copy in the third matrix); likewise, the second element of column 6, the third element of column 7 and the fourth element of column 8 are the sums of squares of rows 2, 3 and 4 respectively. The element sum of each row is the sum of all elements in the corresponding channel, and the element sum of squares is the sum of squares of all elements in that channel. Once the element sum and sum of squares of each channel are obtained, the mean and variance of each channel can be calculated by the relevant computing device according to the formulas for the mean and variance. The means and variances of the remaining 60 channels are calculated in the same way as above and are not repeated here.
In some other embodiments, the number of data elements in each channel of the input data is greater than the number of columns N in each row of the first register set. The mean and variance of each channel can then be obtained by multi-round calculation: each round computes a partial element sum and a partial element square sum for each channel, and the partial results of all rounds are finally accumulated to obtain the sum and the square sum of all elements in the channel, from which the mean and variance of the channel are obtained.
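The multi-round accumulation can be illustrated with a short sketch (a hypothetical software stand-in: `N` plays the role of the register-row width, and each loop iteration stands for one load of the first register set):

```python
import numpy as np

N = 32                                  # elements processed per round (register-row width)
rng = np.random.default_rng(1)
channel = rng.standard_normal(100)      # a channel with more than N elements

total_sum = 0.0
total_sq = 0.0
# Each round handles at most N elements; a short final chunk corresponds to
# zero-padding the last partial row, which does not change either running total.
for start in range(0, channel.size, N):
    chunk = channel[start:start + N]
    total_sum += chunk.sum()            # partial element sum of this round
    total_sq += np.square(chunk).sum()  # partial element square sum of this round

n = channel.size
mean = total_sum / n
var = total_sq / n - mean ** 2
```

The zero padding mentioned in claim 1 is what makes the partial rounds safe: padded zeros contribute nothing to either the sum or the square sum.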
In the embodiment of the invention, the first matrix is received in the first register set 602 and the second matrix in the second register set 603, so that the second matrix can be prepared in parallel with the first matrix during the operation.
Fig. 7 is a schematic structural diagram of a chip according to an embodiment of the present invention. As shown in fig. 7, the chip 700 includes one or more processors 701, a communication interface 702, and a computer-readable storage medium 703. The processors 701, the communication interface 702, and the computer-readable storage medium 703 may be connected by a bus, or may communicate by other means such as wireless transmission; the embodiment of the present invention takes connection via the bus 704 as an example. The computer-readable storage medium 703 is configured to store instructions, and the processor 701, which includes the data processing apparatus disclosed in the above embodiments, executes the instructions stored in the computer-readable storage medium 703. The processor 701 may invoke the program code stored in the computer-readable storage medium 703 to implement the relevant functions of the foregoing data processing apparatus; for details, refer to the relevant descriptions in the foregoing embodiments, which are not repeated here.
It should be appreciated that in embodiments of the present invention, the processor 701 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The communication interface 702 may be a wired interface (e.g., an Ethernet interface) or a wireless interface (e.g., a cellular network interface or a wireless local area network interface) for communicating with other modules or devices. For example, the communication interface 702 in the embodiments of the present application may be specifically configured to receive input data provided by a user, or to receive data from an external device, and so on.
The computer-readable storage medium 703 may include volatile memory, such as Random Access Memory (RAM); it may also include non-volatile memory, such as Read-Only Memory (ROM), flash memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD); it may also include a combination of the above types of memory. The computer-readable storage medium may be used to store a set of program code, so that the processor can invoke the program code stored in the computer-readable storage medium to perform the functions of the data processing apparatus described above.
It should be noted that fig. 7 is only one possible implementation of the embodiment of the present invention, and the chip may further include more or fewer components in practical applications, which is not limited herein. For details not shown or described in the embodiments of the present invention, reference may be made to the related descriptions in the foregoing method embodiments, which are not repeated here.
The embodiment of the invention also provides a computer-readable storage medium storing instructions which, when run on a processor, implement the flow of the data processing method described above. The storage medium includes ROM/RAM, magnetic disks, optical disks, and the like.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided herein, it should be understood that the disclosed apparatus and methods may be implemented in other forms. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical function division, and there may be other divisions in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and alternative arrangements included within the spirit and scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (7)

1. A data processing method for accelerating operations of a batch normalization layer in a deep learning model, wherein an input of the batch normalization layer comprises multi-dimensional tensor data, and a dimension in the multi-dimensional tensor data comprises a channel; providing a first register set, a second register set, a first memory and a second memory, wherein the first register set is connected with the first memory, the second memory and an arithmetic unit, and the second memory is connected with the second register set and the arithmetic unit; the first register set and the second register set have the same structure and comprise M rows and N columns of registers; the second memory can store at least N rows and K columns of data, and K is not less than 2M; m is 4, N is 32, and K is 16; the method comprises the following steps:
storing the multidimensional tensor data into the first memory according to a preset rule, wherein the first memory is an SRAM memory;
taking out the data in the first memory according to the channel and putting the data into the first register set; Q data in the same channel can only be put into one row in the first register set; Q is not greater than N, and when Q is less than N, at least N-Q zeros are appended to the row in which the Q data are located in the first register set; all data in the first register set form a first matrix;
placing initial data into any row in the second register set, wherein the initial data has a size of 1 row and Q columns and its content is all 1; when the data in the second register set includes the initial data, all data in the second register set form a second matrix, and the data in the second matrix other than the initial data are all 0;
both the first matrix and the second matrix are transposed, the two transposed matrices are placed into the second memory according to columns, and all data in the second memory form a third matrix;
and carrying out matrix multiplication on the first matrix and the third matrix to obtain a multiplication result, wherein the multiplication result comprises element sums and element square sums of each row in the first matrix.
2. The data processing method of claim 1, wherein the multi-dimensional tensor data comprises four-dimensional tensor data, the four dimensions comprising a batch size B, a height H, a width W, and the channel C.
3. The data processing method of claim 2, wherein the predetermined rule comprises storing in a cross-channel element order.
4. The data processing method of claim 2, wherein the predetermined rule comprises storing in a sequential order of channel elements.
5. A data processing device for accelerating the operation of a batch normalization layer in a deep learning model, comprising an arithmetic unit, characterized by further comprising a first memory, a first register set, a second memory and a second register set; the first register set is connected with the first memory, the second memory and the arithmetic unit, and the second memory is connected with the second register set and the arithmetic unit; the first register set and the second register set each comprise M rows and N columns of registers, the second memory can store at least N rows and K columns of data, and K is not less than 2M; M is 4, N is 32, and K is 16; multi-dimensional tensor data is provided as the input to the batch normalization layer, wherein:
the first memory is used for storing the multidimensional tensor data, the first memory is an SRAM memory, and the dimensions in the multidimensional tensor data comprise channels;
the first register set is used for storing a first matrix, and the first matrix at least comprises partial data fetched from the first memory;
the second register set is used for storing a second matrix, any row in the second matrix comprises initial data of 1 row and 1 column, and the content of the initial data is 1;
the second memory is configured to store a third matrix, where the third matrix includes at least a transpose of the first matrix and a transpose of the second matrix;
the arithmetic unit is used for multiplying the first matrix and the third matrix to obtain a multiplication result, and the multiplication result comprises element sums and element square sums of each row in the first matrix.
6. A chip comprising at least the data processing device of claim 5.
7. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the data processing method of any of claims 1 to 4.
CN202010142403.5A 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium Active CN111428879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010142403.5A CN111428879B (en) 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN111428879A CN111428879A (en) 2020-07-17
CN111428879B true CN111428879B (en) 2024-02-02

Family

ID=71547483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010142403.5A Active CN111428879B (en) 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111428879B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507173B (en) * 2020-12-15 2024-05-31 无锡灵汐类脑科技有限公司 Tensor segmentation method, tensor segmentation device, chip and medium
CN112561047B (en) * 2020-12-22 2023-04-28 上海壁仞智能科技有限公司 Apparatus, method and computer readable storage medium for processing data
CN112926646B (en) * 2021-02-22 2023-07-04 上海壁仞智能科技有限公司 Data batch normalization method, computing device, and computer-readable storage medium
CN113986544B (en) * 2021-10-28 2022-06-07 深圳大学 Operation distribution method and device and electronic equipment
CN114356235A (en) * 2021-12-31 2022-04-15 Oppo广东移动通信有限公司 Data standardization processing method and device, electronic equipment and storage medium
CN115237243B (en) * 2022-07-19 2023-04-07 中昊芯英(杭州)科技有限公司 Chip protection method, device, medium and computing equipment

Citations (3)

Publication number Priority date Publication date Assignee Title
CN107957977A (en) * 2017-12-15 2018-04-24 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108090028A (en) * 2017-12-15 2018-05-29 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108108190A (en) * 2017-12-15 2018-06-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10949496B2 (en) * 2016-12-30 2021-03-16 Intel Corporation Dimension shuffling using matrix processors




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210208

Address after: 311201 No. 602-11, complex building, 1099 Qingxi 2nd Road, Hezhuang street, Qiantang New District, Hangzhou City, Zhejiang Province

Applicant after: Zhonghao Xinying (Hangzhou) Technology Co.,Ltd.

Address before: 518057 5-15, block B, building 10, science and technology ecological park, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen Xinying Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant