CN111428879A - Data processing method, device, chip and computer readable storage medium - Google Patents

Data processing method, device, chip and computer readable storage medium

Info

Publication number
CN111428879A
Authority
CN
China
Prior art keywords
matrix
data
memory
row
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010142403.5A
Other languages
Chinese (zh)
Other versions
CN111428879B (en)
Inventor
闯小明
杨龚轶凡
郑瀚寻
高雷
侯觉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhonghao Xinying Hangzhou Technology Co ltd
Original Assignee
Shenzhen Xinying Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xinying Technology Co ltd filed Critical Shenzhen Xinying Technology Co ltd
Priority to CN202010142403.5A priority Critical patent/CN111428879B/en
Publication of CN111428879A publication Critical patent/CN111428879A/en
Application granted granted Critical
Publication of CN111428879B publication Critical patent/CN111428879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

Embodiments of the invention disclose a data processing method, apparatus, chip, and computer-readable storage medium for accelerating the batch normalization layer in deep learning model training. Multidimensional tensor data are stored in a first memory according to a preset rule and then fetched in two-dimensional form for computation. A third matrix is constructed through the cooperation of several register sets and a second memory; performing a matrix multiplication of the first matrix with the third matrix yields the element sum and the element square sum of every row of the first matrix at once. This parallelizes the two computations, accelerates the mean and variance calculations in the batch normalization layer, and addresses the long computation times caused by the large volumes of data the batch normalization layer must process. The result is faster batch normalization and a greatly shortened overall training time for the deep learning model.

Description

Data processing method, device, chip and computer readable storage medium
Technical Field
The present invention relates to the field of deep learning model training, and in particular, to a data processing method, apparatus, chip, and computer-readable storage medium.
Background
Deep learning is a newer branch of machine learning research. It aims to build or simulate neural networks that mimic the mechanisms of the human brain for analytical learning, interpreting data such as images, sounds, and text. A deep learning model only becomes practically usable after training on large amounts of data; common deep learning models include the Convolutional Neural Network (CNN).
During the training of a deep learning model, most layers are processed with Batch Normalization (BN) to reduce sample-to-sample differences as data passes through each layer of the network. Batch Normalization, proposed in 2015, is a technique for improving the speed, performance, and stability of Deep Neural Networks. The change in the distribution of a middle layer's outputs over the course of training is called Internal Covariate Shift; eliminating this phenomenon accelerates training. A batch normalization layer (Batch Normalization Layer) fixes the mean and variance of its input by normalizing it, which allows the network to be trained with a larger learning rate and ultimately speeds up training.
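For reference, the standard batch normalization transform from the 2015 proposal (background material, not part of this disclosure's method) normalizes each input x_i with the mini-batch mean \mu and variance \sigma^2 and then applies a learned scale and shift:

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta,

where \epsilon is a small constant for numerical stability and \gamma, \beta are learned per-channel parameters.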
The batch normalization layer participates in both Forward Propagation and Backward Propagation during neural network training. Notably, the forward pass requires a summation and a sum-of-squares over the data elements. In deep learning model training, the data handled by the batch normalization layer are usually multidimensional tensors, and the sheer volume of data makes these summation and sum-of-squares computations time-consuming, which slows down the training of the deep learning model.
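As a minimal sketch of why these two reductions dominate the forward pass (our own illustration; the shapes and names are assumptions, not the patent's notation), the per-channel statistics can be written as:

import numpy as np

def bn_channel_stats(x: np.ndarray):
    """Per-channel mean/variance over a (B, H, W, C) tensor, as a BN forward pass needs."""
    m = x.shape[0] * x.shape[1] * x.shape[2]   # m = B * H * W elements per channel
    s = x.sum(axis=(0, 1, 2))                  # element sum per channel      -> shape (C,)
    sq = np.square(x).sum(axis=(0, 1, 2))      # sum of squares per channel   -> shape (C,)
    mean = s / m
    var = sq / m - mean ** 2                   # var = E[x^2] - (E[x])^2
    return mean, var

Because the variance can be recovered from the element sum and the element square sum alone, computing both reductions in one pass suffices, which is exactly what the matrix construction described below exploits.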
Disclosure of Invention
In view of the above, the present invention provides a data processing method, apparatus, chip, and computer-readable storage medium to address the long computation time of the batch normalization layer in deep learning model training and the resulting slow training of deep learning models.
In a first aspect, an embodiment of the present invention provides a data processing method for accelerating the operation of the batch normalization layer in a deep learning model. The input of the batch normalization layer comprises multidimensional tensor data whose dimensions include channels. The method provides a first register set, a second register set, a first memory, and a second memory, wherein the first register set and the second register set each comprise M rows and N columns of registers, and the second memory can store at least N rows and K columns of data, K being not less than 2M. The method comprises the following steps:
storing the multidimensional tensor data into a first memory according to a preset rule;
taking out the data in the first memory by channel and placing it in the first register set; Q data from the same channel may only be placed into one row of the first register set, where Q is not greater than N; when Q is smaller than N, the row holding those Q data in the first register set is padded with at least N-Q zeros; all data in the first register set form a first matrix;
placing initial data into any row of the second register set, the initial data being 1 row by Q columns with all elements equal to 1; once the data in the second register set include the initial data, all data in the second register set form a second matrix, and all data in the second matrix other than the initial data are 0;
transposing both the first matrix and the second matrix and placing the two transposed matrices into the second memory by columns, all data in the second memory forming a third matrix;
and performing matrix multiplication on the first matrix and the third matrix to obtain a multiplication result, the multiplication result comprising the element sum and the element square sum of each row in the first matrix.
Further, the multidimensional tensor data include four-dimensional tensor data whose dimensions comprise a batch size B, a height H, a width W, and a channel C.
Further, the predetermined rule includes storing in channel-interleaved order.
Further, the predetermined rule includes storing in channel-contiguous order.
The method provided by this application stores the multidimensional tensor data in the first memory according to a preset rule, fetches it in two-dimensional form for computation, and constructs the third matrix through the cooperation of the register sets and the second memory. Multiplying the first matrix by the third matrix yields the element sum and the element square sum of every row of the first matrix at once, so the two reductions are computed in parallel. This accelerates the batch normalization layer's computation and addresses the long computation times caused by its large data volumes, ultimately speeding up batch normalization and greatly shortening the time required to train the whole deep learning model.
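In matrix notation (ours, not the patent's), the trick can be summarized by a simple identity: if A is the M×N first matrix (zero-padded rows included) and \mathbf{1} is the length-N column of ones supplied by the transposed initial data, then

A\,\begin{bmatrix}\mathbf{1} & A^{\mathsf T}\end{bmatrix} = \begin{bmatrix}A\mathbf{1} & AA^{\mathsf T}\end{bmatrix},

where the column A\mathbf{1} holds the element sum of each row of A and the diagonal of AA^{\mathsf T} holds each row's element square sum; the padding zeros affect neither. This is a simplified view: in the embodiments, the ones column sits inside the transposed second matrix, surrounded by zero columns that leave the result unchanged.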
In a second aspect, an embodiment of the present invention provides a data processing apparatus for accelerating the operation of the batch normalization layer in deep learning model training. The apparatus comprises an arithmetic unit, a first memory, a first register group, a second memory, and a second register group; the first register group is connected to the first memory, the second memory, and the arithmetic unit, and the second memory is connected to the second register group and the arithmetic unit. The first register group and the second register group each comprise M rows and N columns of registers, and multidimensional tensor data are provided as input to the batch normalization layer, wherein:
the first memory is used for storing multidimensional tensor data, and the dimensionality of the multidimensional tensor data comprises a channel;
the first register group is used for storing a first matrix, and the first matrix at least comprises part of data taken out from the first memory;
the second register group is used for storing a second matrix; any one row of the second matrix comprises initial data of 1 row by Q columns, the content of which is all 1s;
the second memory is used for storing a third matrix, and the third matrix at least comprises a transposed matrix of the first matrix and a transposed matrix of the second matrix;
the arithmetic unit is used for multiplying the first matrix and the third matrix to obtain a multiplication result, and the multiplication result comprises the element sum and the element square sum of each row in the first matrix.
The apparatus provided by this application stores the multidimensional tensor data in the first memory according to a preset rule, fetches it in two-dimensional form for computation, and constructs the third matrix through the cooperation of the register groups and the second memory. Multiplying the first matrix by the third matrix yields the element sum and the element square sum of every row of the first matrix at once, computing the two reductions in parallel, accelerating the batch normalization layer's computation, and addressing the long computation times caused by its large data volumes. The result is faster batch normalization and a greatly shortened training time for the whole deep learning model.
In a third aspect, an embodiment of the present invention provides a chip. The chip comprises at least the aforementioned data processing means.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the data processing method.
On the basis of the implementations provided by the above aspects, the invention can be further combined to provide additional implementations.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of four-dimensional tensor data provided by an embodiment of the present invention;
FIG. 3 is a diagram of exemplary four-dimensional tensor data provided by embodiments of the present invention;
FIG. 4A is a diagram illustrating four-dimensional data stored into an SRAM memory in channel-interleaved order according to an embodiment of the present invention;
FIG. 4B is a diagram illustrating four-dimensional data stored into an SRAM memory in channel-contiguous order according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a data processing apparatus 500 according to an embodiment of the present invention;
FIG. 6A is a block diagram of a preferred data processing apparatus 600 according to an embodiment of the present invention;
FIG. 6B is a diagram illustrating an arrangement of registers in the first register set 602 according to an embodiment of the present invention;
FIG. 6C is a diagram of a first matrix placed into the first register set 602 according to an embodiment of the present invention;
FIG. 6D is a diagram illustrating a second matrix being placed in a second register set 603 according to an embodiment of the present invention;
FIG. 6E is a diagram of a third matrix being placed in the second memory 604 according to an embodiment of the present invention;
FIG. 6F is a diagram illustrating a multiplication result of the first matrix and the third matrix according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a chip according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that when an element is referred to as being "connected to" another element, or "coupled" to one or more other elements, it can be directly connected to the other element or be indirectly connected to the other element.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following specifically describes embodiments of the present invention.
The embodiment of the invention provides a data processing method which can be used for accelerating the operation of a batch standardization layer in deep learning model training. The data processed by the method includes multidimensional tensor data in which the dimensions include at least channels. The method provides a first register group, a second register group, a first memory and a second memory, wherein the first register group and the second register group respectively comprise M rows and N columns of registers, the second memory can at least store N rows and K columns of data, and K is not less than 2M. Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:
step 101: storing the multidimensional data into a first memory according to a preset rule;
step 102: acquiring a first matrix: taking out the data in the first memory by channel and placing it in the first register group, where Q data from the same channel exist only in one row of the first matrix and Q is not greater than N; when Q is smaller than N, the row holding those Q data in the first register group is padded with at least N-Q zeros; all data in the first register group form the first matrix;
step 103: acquiring a second matrix: placing initial data of 1 row by Q columns into any row of the second register group, the content of the initial data being all 1s; once the data in the second register group include the initial data, all data in the second register group form the second matrix;
step 104: acquiring a third matrix: after both the first matrix and the second matrix are transposed, placing them into the second memory by columns; all data in the second memory form the third matrix;
step 105: and performing matrix multiplication on the first matrix and the third matrix to obtain a multiplication result, wherein the multiplication result comprises the element sum and the element square sum of each row in the first matrix.
In a specific implementation process, the step of acquiring the second matrix and the step of acquiring the first matrix may be executed sequentially or in parallel.
In a preferred embodiment of the present invention, the multidimensional tensor data are stored in an SRAM memory; each piece of data has a corresponding address in the SRAM, through which it can be accessed. The multidimensional tensor data is a four-dimensional tensor (4-dimension tensor) composed of batches of two-dimensional data. Fig. 2 is a schematic diagram of four-dimensional tensor data according to an embodiment of the present invention. As shown in fig. 2, the first dimension of the four-dimensional tensor data is the batch size B (Batch Size), the second dimension is the height H (Height), the third dimension is the width W (Width), and the fourth dimension is the channel C (Channel).
Fig. 3 is a schematic diagram of specific four-dimensional tensor data according to an embodiment of the present invention. As shown in fig. 3, the four-dimensional tensor data has dimensions B = 1, H = 5, W = 4, and C = 64. The four-dimensional tensor data of fig. 3 can be stored in the SRAM memory in channel-interleaved order, i.e., the element at a given position of every channel is stored first, and then the element at the next position, and so on. Fig. 4A is a schematic diagram of four-dimensional data stored in an SRAM memory in channel-interleaved order according to an embodiment of the present invention. As shown in fig. 4A, the first elements (0, 20, …, 1260) of channels C0 (i.e., C = 0) through C63 (i.e., C = 63) in fig. 3 are stored first, in the order (0,0,0,0), (0,0,0,1), …, (0,0,0,63); then the second elements (1, 21, …, 1261) of channels C0 through C63 are stored, in the order (0,0,1,0), (0,0,1,1), …, (0,0,1,63); and so on, until the last elements (19, 39, …, 1279) of channels C0 through C63 are stored, in the order (0,3,4,0), (0,3,4,1), …, (0,3,4,63).
The four-dimensional tensor data of fig. 3 can also be stored in the SRAM memory in channel-contiguous order, i.e., all elements of one channel are stored before the elements of the next channel, and so on. Fig. 4B is a schematic diagram of four-dimensional data stored in an SRAM memory in channel-contiguous order according to an embodiment of the present invention. As shown in fig. 4B, all data (0, 1, …, 19) of channel C0 in fig. 3 are stored first, in the order (0,0,0,0), (0,0,1,0), …, (0,3,4,0); then all data (20, 21, …, 39) of channel C1, in the order (0,0,0,1), (0,0,1,1), …, (0,3,4,1); and so on, until all data (1260, 1261, …, 1279) of the last channel (C63) are stored, in the order (0,0,0,63), (0,0,1,63), …, (0,3,4,63).
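A small sketch of the two storage rules using the fig. 3 tensor (our own NumPy illustration; variable names are assumptions):

import numpy as np

B, H, W, C = 1, 5, 4, 64
# Fig. 3's example tensor: channel c holds the 20 consecutive values 20*c .. 20*c + 19.
x = np.arange(B * C * H * W, dtype=np.int32).reshape(B, C, H, W)

# Channel-interleaved order (fig. 4A): the same position of every channel first.
interleaved = x.transpose(0, 2, 3, 1).reshape(-1)
# Channel-contiguous order (fig. 4B): all of channel C0, then all of C1, and so on.
contiguous = x.reshape(-1)

print(interleaved[:4])  # [ 0 20 40 60] -> first element of C0, C1, C2, C3
print(contiguous[:4])   # [0 1 2 3]     -> first four elements of C0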
With this method, the multidimensional tensor data are stored in the first memory according to a preset rule and fetched in two-dimensional form for computation, and the third matrix is constructed through the cooperation of the register sets and the second memory. Multiplying the first matrix by the third matrix yields the element sum and the element square sum of every row of the first matrix at once, computing the two reductions in parallel, accelerating the batch normalization layer's computation, and addressing the long computation times caused by its large data volumes. The result is faster batch normalization and a greatly shortened training time for the whole deep learning model.
An embodiment of the present invention provides a data processing apparatus for accelerating the operation of the batch normalization layer in a deep learning model; please refer to fig. 5, a schematic structural diagram of a data processing apparatus 500 according to an embodiment of the present invention. As shown in fig. 5, the data processing apparatus 500 includes a first memory 501, a first register set 502, a second register set 503, a second memory 504, and an operator 505. The first register set 502 is connected to the first memory 501, the second memory 504, and the operator 505, and the second memory 504 is connected to the second register set 503 and the operator 505. The first register set 502 and the second register set 503 have the same structure; each includes M rows and N columns of registers. The second memory 504 can store at least N rows and K columns of data, K being not less than 2M. The multidimensional tensor data is provided as an input to the batch normalization layer, wherein,
the first memory 501 is configured to store multidimensional tensor data, one dimension of which is a channel;
the first register set 502 is used for storing a first matrix, and the first matrix at least comprises part of data taken out from the first memory 501;
the second register set 503 is used for storing a second matrix; any one row of the second matrix includes initial data of 1 row by Q columns, the content of which is all 1s;
the second memory 504 is used for storing a third matrix, wherein the third matrix at least comprises a transposed matrix of the first matrix and a transposed matrix of the second matrix;
the operator 505 is configured to multiply the first matrix and the third matrix to obtain a multiplication result, where the multiplication result includes a sum of elements and a sum of squares of elements of each row in the first matrix.
For a better understanding of the present invention, the disclosed solution is now exemplified with reference to specific application scenarios.
In the batch normalization layer of deep learning model training, forward propagation must be performed, and in the forward pass a mean and a variance (mean & variance) are computed for each channel of the input data. The mean (\mu) and variance (\sigma^2) of a channel are calculated as

\mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu\right)^2,

where m = B \times H \times W (the product of batch size, height, and width, i.e., the number of elements in each channel). The mean and variance of each channel are now calculated using the data processing apparatus and method provided by the present invention. Calculating a channel's mean first requires the cumulative sum of all elements in the channel (i.e., \sum_{i=1}^{m} x_i); calculating its variance first requires the sum of the squares of all elements in the channel (i.e., \sum_{i=1}^{m} x_i^2), since \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} x_i^2 - \mu^2. The input data is four-dimensional tensor data; assume for illustration that the specific input is the four-dimensional tensor shown in fig. 3.
Fig. 6A is a schematic structural diagram of a preferred data processing apparatus 600 according to an embodiment of the present invention. As shown in fig. 6A, the data processing apparatus 600 includes a first memory 601, a first register set 602, a second register set 603, a second memory 604 and an operator 605. The first register set 602 is connected to the first memory 601, the second memory 604, and the operator 605, and the second memory 604 is connected to the second register set 603 and the operator 605. The first register set 602 and the second register set 603 are identical in structure, and each register set includes 4 rows and 32 columns of registers as shown in fig. 6B. The second memory 604 can store at least 32 rows and 16 columns of data. The specific process for calculating the mean and variance using the data processing apparatus 600 is as follows:
First, the four-dimensional tensor data are placed into the first memory 601 in channel-contiguous order; in another preferred embodiment, they may also be placed into the memory 601 in channel-interleaved order.
Then part of the four-dimensional tensor data is taken out by channel and placed into the first register set 602: the data of channels C0 through C3 go into rows one through four of the first register set 602 respectively, and since each channel has only 20 elements, fewer than 32, twelve 0s are appended to the end of each row. All data in the first register set 602 now constitute a first matrix of size 4×32, as shown in fig. 6C.
Meanwhile, initial data of 1 row by 20 columns, all 1s, is placed into the 2nd row of the second register set 603; the remaining 12 registers of the 2nd row and the registers of the other three rows are all filled with 0s. All data in the second register set 603 now constitute a second matrix of size 4×32, as shown in fig. 6D.
After the first matrix and the second matrix are obtained, both are transposed and the transposed matrices are placed into the second memory 604 by columns, with the transpose of the first matrix placed after the transpose of the second matrix, as shown in fig. 6E. All data in the second memory 604 constitute a third matrix of size 32×16.
Finally, the first matrix in the first register set 602 and the third matrix in the second memory 604 are sent to the operator 605, which multiplies them to produce the result shown in fig. 6F: a 4×16 matrix. The four elements of its 2nd column, from top to bottom, are the element sums of the four rows of the first matrix, i.e., the sums of all elements in channels C0, C1, C2, and C3 respectively. The first element of the fifth column is the dot product of the first row of the first matrix with the fifth column of the third matrix (that same row transposed), i.e., the element square sum of the first row; likewise, the second element of the sixth column, the third element of the seventh column, and the fourth element of the eighth column are the element square sums of the second, third, and fourth rows of the first matrix. The element sum of a row is thus the sum of all elements in the corresponding channel, and the element square sum is the sum of their squares. Once these are available, a downstream calculation unit can compute each channel's mean and variance from the formulas above. The remaining 60 channels (C4 through C63) are handled in the same way and are not described again here.
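The whole fig. 6 walk-through can be checked with a few lines of NumPy (a sketch under our assumptions about the zero-padded layout of the 32×16 memory; the variable names are ours):

import numpy as np

M, N, K, Q = 4, 32, 16, 20                 # register rows/columns, memory columns, elements per channel
rng = np.random.default_rng(0)
channels = rng.standard_normal((M, Q))     # channels C0..C3, 20 elements each

first = np.zeros((M, N)); first[:, :Q] = channels   # first matrix: rows padded with twelve 0s
second = np.zeros((M, N)); second[1, :Q] = 1.0      # second matrix: all-ones initial data in row 2

third = np.zeros((N, K))
third[:, :M] = second.T                    # transpose of the second matrix, placed first
third[:, M:2 * M] = first.T                # transpose of the first matrix, placed after it

result = first @ third                     # one 4x32 by 32x16 matrix multiplication

sums = result[:, 1]                        # 2nd column: element sums of channels C0..C3
squares = np.diag(result[:, M:2 * M])      # diagonal of columns 5..8: element square sums

assert np.allclose(sums, channels.sum(axis=1))
assert np.allclose(squares, np.square(channels).sum(axis=1))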
In some other embodiments, the number of data elements in each channel of the input data exceeds the number of registers in a row of the first register set (i.e., exceeds N). The mean and variance of each channel can then be obtained by computing in multiple rounds: each round computes a partial element sum and a partial element square sum for each channel, and the per-round partial results are finally accumulated to give the channel's total element sum and total sum of squares, from which the channel's mean and variance follow.
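A minimal sketch of this multi-round accumulation (our own illustration, assuming each round loads at most N elements of a channel into one register row):

import numpy as np

def channel_stats_multiround(channel: np.ndarray, n_cols: int = 32):
    """Mean/variance of one channel, accumulated over chunks of at most n_cols elements."""
    total_sum = total_sq = 0.0
    for start in range(0, channel.size, n_cols):
        chunk = channel[start:start + n_cols]  # one round's worth of data; a short final chunk acts as zero-padded
        total_sum += chunk.sum()               # partial element sum for this round
        total_sq += np.square(chunk).sum()     # partial element square sum for this round
    m = channel.size
    mean = total_sum / m
    return mean, total_sq / m - mean ** 2      # var = E[x^2] - (E[x])^2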
In the embodiment of the present invention, the first matrix is received and held by the first register set 602 and the second matrix by the second register set 603, so the two matrices can be prepared and processed in parallel during operation.
Fig. 7 is a schematic structural diagram of a chip according to an embodiment of the invention. As shown in fig. 7, the chip 700 includes one or more processors 701, a communication interface 702, and a computer-readable storage medium 703, and the processors 701, the communication interface 702, and the computer-readable storage medium 703 may be connected by a bus, and may also implement communication by other means such as wireless transmission. The embodiment of the present invention is exemplified by connection via a bus 704. The computer-readable storage medium 703 is configured to store instructions, and the processor 701 includes the data processing apparatus disclosed in the above embodiments, and is configured to execute the instructions stored in the computer-readable storage medium 703. The computer-readable storage medium 703 stores program codes, and the processor 701 may call the program codes stored in the computer-readable storage medium 703 to implement the related functions of the foregoing data processing apparatus, which may specifically refer to the related descriptions in the foregoing embodiments, and will not be described herein again.
It should be understood that, in the embodiment of the present invention, the processor 701 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The communication interface 702 may be a wired interface (e.g., an Ethernet interface) or a wireless interface (e.g., a cellular network interface or a wireless local area network interface) for communicating with other modules or devices. For example, in the embodiment of the present application, the communication interface 702 may be specifically configured to receive input data entered by a user, or to receive data from an external device, etc.
The computer-readable storage medium 703 may include volatile memory (Volatile Memory), such as Random Access Memory (RAM); it may also include non-volatile memory (Non-Volatile Memory), such as Read-Only Memory (ROM), flash memory (Flash Memory), a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); and it may also include combinations of the above kinds of memory. The computer-readable storage medium may store a set of program code, and the processor may call the program code stored in the computer-readable storage medium to implement the related functions of the aforementioned data processing apparatus.
It should be noted that fig. 7 is only one possible implementation manner of the embodiment of the present invention, and in practical applications, the chip may further include more or less components, which is not limited herein. For the content that is not shown or described in the embodiment of the present invention, reference may be made to the relevant explanation in the foregoing method embodiment, which is not described herein again.
An embodiment of the present invention further provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a processor, the foregoing data processing method flow is implemented. The storage medium includes ROM/RAM, magnetic disks, optical disks, and the like.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may also be an electrical, mechanical, or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A data processing method is used for accelerating the operation of a batch normalization layer in a deep learning model, and is characterized in that the input of the batch normalization layer comprises multidimensional tensor data, and the dimension in the multidimensional tensor data comprises a channel; providing a first register set, a second register set, a first memory and a second memory, wherein the first register set and the second register set respectively comprise M rows and N columns of registers, the second memory can at least store N rows and K columns of data, and K is not less than 2M, and the method comprises the following steps:
storing the multi-dimensional tensor data into the first memory according to a preset rule;
taking out the data in the first memory according to the channel and putting the data in the first register set; Q data in the same channel can only be put into one row in the first register set; when Q is smaller than N, at least N-Q zeros are added to the row where the Q data are located in the first register set; all data in the first register set form a first matrix;
putting initial data into any row in the second register set, wherein the size of the initial data is 1 row by Q columns and the content of the initial data is all 1s; when the data in the second register set comprise the initial data, all the data in the second register set form a second matrix, and the data except the initial data in the second matrix are all 0;
transposing both the first matrix and the second matrix, putting the two transposed matrices into the second memory in columns, and enabling all data in the second memory to form a third matrix;
and performing matrix multiplication on the first matrix and the third matrix to obtain a multiplication result, wherein the multiplication result comprises the element sum and the element square sum of each row in the first matrix.
2. The data processing method of claim 1, wherein the multi-dimensional tensor data comprises four-dimensional tensor data, the four dimensions comprising a batch size B, a height H, a width W, and the channel C.
3. The data processing method of claim 2, wherein the predetermined rule comprises storing in channel-interleaved order.
4. The data processing method of claim 2, wherein the predetermined rule comprises storing in channel-contiguous order.
5. A data processing device for accelerating the operation of a batch normalization layer in a deep learning model, comprising an operator, and characterized by further comprising a first memory, a first register group, a second memory, and a second register group; the first register group is connected with the first memory, the second memory, and the operator, and the second memory is connected with the second register group and the operator; the first register group and the second register group each include M rows and N columns of registers, and multidimensional tensor data are provided as input to the batch normalization layer, wherein:
the first memory is used for storing the multidimensional tensor data, and dimensions in the multidimensional tensor data comprise channels;
the first register set is used for storing a first matrix, and the first matrix at least comprises part of data taken out of the first memory;
the second register group is used for storing a second matrix; any row in the second matrix comprises initial data of 1 row by Q columns, and the content of the initial data is all 1s;
the second memory is used for storing a third matrix, and the third matrix at least comprises a transposed matrix of the first matrix and a transposed matrix of the second matrix;
the operator is configured to multiply the first matrix and the third matrix to obtain a multiplication result, where the multiplication result includes a sum of elements and a sum of squares of elements of each row in the first matrix.
6. A chip, characterized in that it comprises at least the data processing device of claim 5.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the data processing method of any one of claims 1 to 4.
CN202010142403.5A 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium Active CN111428879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010142403.5A CN111428879B (en) 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010142403.5A CN111428879B (en) 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111428879A true CN111428879A (en) 2020-07-17
CN111428879B CN111428879B (en) 2024-02-02

Family

ID=71547483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010142403.5A Active CN111428879B (en) 2020-03-04 2020-03-04 Data processing method, device, chip and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111428879B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107957977A (en) * 2017-12-15 2018-04-24 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108090028A (en) * 2017-12-15 2018-05-29 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108108190A (en) * 2017-12-15 2018-06-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
US20180189227A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Dimension shuffling using matrix processors


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561047A (en) * 2020-12-22 2021-03-26 上海壁仞智能科技有限公司 Apparatus, method and computer-readable storage medium for processing data
CN112926646A (en) * 2021-02-22 2021-06-08 上海壁仞智能科技有限公司 Data batch standardization method, computing equipment and computer readable storage medium
CN112926646B (en) * 2021-02-22 2023-07-04 上海壁仞智能科技有限公司 Data batch normalization method, computing device, and computer-readable storage medium
CN113986544A (en) * 2021-10-28 2022-01-28 深圳大学 Operation distribution method and device and electronic equipment
CN115237243A (en) * 2022-07-19 2022-10-25 中昊芯英(杭州)科技有限公司 Chip protection method, device, medium and computing equipment

Also Published As

Publication number Publication date
CN111428879B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN111428879A (en) Data processing method, device, chip and computer readable storage medium
CN109871532B (en) Text theme extraction method and device and storage medium
CN107844828B (en) Convolution calculation method in neural network and electronic device
CN111247527B (en) Method and device for determining characteristic images in convolutional neural network model
Uzun et al. FPGA implementations of fast Fourier transforms for real-time signal and image processing
CN107729989A (en) A kind of device and method for being used to perform artificial neural network forward operation
CN112818138B (en) Knowledge graph ontology construction method and device, terminal device and readable storage medium
CN108629406B (en) Arithmetic device for convolutional neural network
CN103955446B (en) DSP-chip-based FFT computing method with variable length
DE102018124919A1 (en) Scalable memory-optimized hardware for matrix solve
CN109416755B (en) Artificial intelligence parallel processing method and device, readable storage medium and terminal
CN111814983B (en) Data processing method, device, chip and computer readable storage medium
CN111210004B (en) Convolution calculation method, convolution calculation device and terminal equipment
Soleymani On finding robust approximate inverses for large sparse matrices
Chum et al. Numerical and statistical analysis of aliquot sequences
CN102567283B (en) Method for small matrix inversion by using GPU (graphic processing unit)
CN111178513B (en) Convolution implementation method and device of neural network and terminal equipment
CN109902821B (en) Data processing method and device and related components
CN111368941B (en) Image processing method, device and computer storage medium
CN114764615A (en) Convolution operation implementation method, data processing method and device
CN111817723A (en) Determination method, determination device and determination equipment for compressed sensing measurement matrix
CN111047025B (en) Convolution calculation method and device
Kössler Some c-sample rank tests of homogeneity against umbrella alternatives with unknown peak
Lee et al. Large-scale structured sparsity via parallel fused lasso on multiple GPUs
Loots et al. On the real representation of quaternion random variables

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210208

Address after: 311201 No. 602-11, complex building, 1099 Qingxi 2nd Road, Hezhuang street, Qiantang New District, Hangzhou City, Zhejiang Province

Applicant after: Zhonghao Xinying (Hangzhou) Technology Co.,Ltd.

Address before: 518057 5-15, block B, building 10, science and technology ecological park, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen Xinying Technology Co.,Ltd.

GR01 Patent grant