CN112561047B - Apparatus, method and computer readable storage medium for processing data - Google Patents

Apparatus, method and computer readable storage medium for processing data Download PDF

Info

Publication number
CN112561047B
CN112561047B CN202011523956.1A
Authority
CN
China
Prior art keywords
processing unit
parameter
channel
eigenvalues
general purpose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011523956.1A
Other languages
Chinese (zh)
Other versions
CN112561047A (en
Inventor
Name withholding requested
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Biren Intelligent Technology Co Ltd filed Critical Shanghai Biren Intelligent Technology Co Ltd
Priority to CN202011523956.1A priority Critical patent/CN112561047B/en
Publication of CN112561047A publication Critical patent/CN112561047A/en
Application granted granted Critical
Publication of CN112561047B publication Critical patent/CN112561047B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Neurology (AREA)
  • Power Sources (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to an apparatus, method, and computer-readable storage medium for processing data. The device comprises: a co-processing unit configured to generate a first set of eigenvalues of a first channel based on input data; a predetermined processing unit coupled to the co-processing unit and configured to determine, for the first channel, at least one first parameter related to normalization of the first set of feature values; a storage unit coupled to the predetermined processing unit and configured to store at least one first parameter; and a first general purpose processing unit coupled to the co-processing unit and the storage unit and configured to normalize the first set of eigenvalues with at least one first parameter for the first channel. According to embodiments of the present disclosure, the power consumption for batch normalization can be reduced, thereby improving overall performance.

Description

Apparatus, method and computer readable storage medium for processing data
Technical Field
Embodiments of the present disclosure relate generally to the field of computers and, more particularly, relate to an apparatus, method, and computer-readable storage medium for processing data.
Background
Machine learning techniques are increasingly being used in a variety of fields. In training of machine learning models, the processing of large amounts of training data is typically involved. Such training data may be processed in batches. In particular, batch normalization (Batch Normalization) is introduced in deep learning models such as deep neural networks to make training of the deep learning model easier and more stable (e.g., speed up training, prevent overfitting, etc.). Therefore, batch normalization is an important operation in deep learning models.
In batch normalization, the mean and variance of the training data for each batch need to be calculated, and the calculated mean and variance are then used to normalize the training data for that batch. In conventional schemes, a general purpose processing unit is employed to perform batch normalization operations. The amount of training data in each batch is typically large, which results in the general purpose processing unit consuming considerable power and time to perform batch normalization.
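A minimal sketch of this conventional flow is given below purely for illustration; the function name, data shapes, and the epsilon term are assumptions and do not appear in the original text.

```python
# Illustrative sketch (not from this disclosure): conventional per-channel batch
# normalization carried out entirely on a general-purpose processing unit.
import numpy as np

def batch_norm_reference(x, eps=1e-5):
    """x holds all feature values of one channel for one batch."""
    mean = x.mean()                  # first pass over the batch
    var = ((x - mean) ** 2).mean()   # second pass over the batch
    return (x - mean) / np.sqrt(var + eps)

batch = np.random.randn(8, 64).astype(np.float32)  # 8 channels, 64 values each
normalized = np.stack([batch_norm_reference(c) for c in batch])
```

Every feature value is visited at least twice in this flow, which is the instruction and power overhead that the scheme described below seeks to reduce.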
Disclosure of Invention
The present disclosure provides a scheme for processing data. The scheme can realize batch normalization under controllable power and time cost.
According to a first aspect of the present disclosure, there is provided an apparatus for processing data. The device comprises: a co-processing unit configured to generate a first set of eigenvalues of a first channel based on input data; a predetermined processing unit coupled to the co-processing unit and configured to determine, for the first channel, at least one first parameter related to normalization of the first set of feature values; a storage unit coupled to the predetermined processing unit and configured to store at least one first parameter; and a first general purpose processing unit coupled to the co-processing unit and the storage unit and configured to normalize the first set of eigenvalues with at least one first parameter for the first channel.
According to a second aspect of the present disclosure, there is also provided a method for processing data. The method comprises the following steps: the co-processing unit generates a first set of feature values of the first channel based on the input data; a predetermined processing unit coupled to the co-processing unit determines, for the first channel, at least one first parameter related to normalization of the first set of eigenvalues; a storage unit coupled to the predetermined processing unit stores at least one first parameter; and a first general purpose processing unit coupled to the co-processing unit and the storage unit normalizes the first set of eigenvalues with at least one first parameter for the first channel.
According to a third aspect of the present disclosure, there is also provided a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a machine, performs the method of the second aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure. In the drawings:
FIG. 1 schematically illustrates a block diagram of an apparatus for processing data according to some embodiments of the present disclosure;
FIG. 2 schematically illustrates a schematic diagram of a storage unit according to some embodiments of the present disclosure;
FIG. 3 schematically illustrates a block diagram of an apparatus for processing data, according to further embodiments of the present disclosure;
FIG. 4 illustrates a flowchart of an example method for processing data, according to an embodiment of the present disclosure; and
fig. 5 illustrates a flowchart of an example method for determining normalization parameters according to embodiments of the present disclosure.
Detailed Description
The principles of the present disclosure will be described below with reference to several example embodiments shown in the drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that these embodiments are merely provided to enable those skilled in the art to better understand and practice the present disclosure and are not intended to limit the scope of the present disclosure in any way.
In describing embodiments of the present disclosure, the term "comprising" and its variants should be understood as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As used herein, a "neural network" is capable of processing an input and providing a corresponding output, and generally includes an input layer and an output layer as well as one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications typically include many hidden layers, thereby extending the depth of the network. The layers of the neural network are connected in sequence such that the output of a previous layer is provided as an input to a subsequent layer, wherein the input layer receives the input of the neural network and the output of the output layer serves as the final output of the neural network. Each layer of the neural network includes one or more nodes (also referred to as processing nodes or neurons), and each node processes input from the previous layer. The terms "neural network", "network" and "neural network model" are used interchangeably herein.
As mentioned previously, batch normalization is an important operation in deep learning models. In training, the effectiveness of training is typically measured with the variance of the samples as a loss function. For example, in supervised learning, for a sample with feature x and label z, the value produced by the model under training is denoted z'=h(x), and the variance of the sample can be expressed as:
e=1/2*(z-z')^2 (1)
thus, the variance of the multiple samples can be expressed as:
E=e^(1)+e^(2)+e^(3)+…+e^(n) (2)
wherein e^(i) denotes the variance of the i-th sample as given by formula (1), i ranges from 1 to n, and n is a positive integer.
Formula (2) can be generalized as follows. Assume that a certain batch includes (n+1) samples with indices 0 to n, and that the training values of these samples are z0, z1, ..., zn. Then, the sum s0 and the average m0 of the training values of these samples can be expressed as:
s0=z0+z1+...+zn (3)
m0=s0/(n+1) (4)
When the labels are replaced with the average value m0 of the training values, the variance of these samples can be expressed as:
E=1/2((z0-m0)*(z0-m0)+(z1-m0)*(z1-m0)+...+(zn-m0)*(zn-m0)) (5)
in the conventional scheme, the batch normalization operation is performed by a general-purpose processing unit. That is, the general processing unit needs to calculate the variance of the samples of each lot according to the formulas (3) to (5). The number of samples per batch in model training is enormous, for which a large number of instruction operations and computing resources need to be scheduled. This results in the general purpose processing unit consuming a significant amount of power and time to batch normalize. In addition, the impact of this batch approach on overall performance and power consumption is also not negligible.
Embodiments of the present disclosure propose a solution for processing data to address one or more of the above problems and other potential problems. In this scenario, a predetermined processing unit is provided to determine at least one parameter related to the batch normalization. The determined at least one parameter is obtained by the general purpose processing unit for subsequent processing. Since the predetermined processing unit is specifically designed for determining parameters related to batch normalization, it is capable of performing calculations at a controllable cost and power consumption. Thus, the power consumption for batch normalization can be reduced, thereby improving the overall performance.
Example embodiments of the present disclosure will be described in detail below in conjunction with fig. 1 to 5.
Fig. 1 schematically illustrates a block diagram of an apparatus 100 for processing data according to some embodiments of the disclosure. As shown in fig. 1, the apparatus 100 generally includes a co-processing unit 110, a predetermined processing unit 120, a storage unit 130, and general purpose processing units 140-1 to 140-N, where N is a positive integer. Hereinafter, for convenience of discussion, the general processing units 140-1 to 140-N may also be collectively or individually referred to as "general processing unit 140". It should be understood that the apparatus 100 may also include other units not shown.
The co-processing unit 110 is configured to generate a set of characteristic values for each of one or more channels based on the input data, which are also referred to below simply as the "characteristic values of each channel". As used herein, the terms "set of eigenvalues," "first set of eigenvalues," "second set of eigenvalues," and the like refer to the eigenvalues generated for a certain channel when processing a batch of data. Such a characteristic value may be an output value of any layer of a neural network. For example, each feature value may correspond to a pixel in a feature map or an element in a feature matrix determined by the co-processing unit 110.
The input data may be initial data or processed intermediate data of the input model in a training phase or an inference phase of the machine learning model. In some embodiments, the input data may be sample data for training a machine learning model. Such sample data may be initial data of the training samples or processed data of the training samples, such as the output of hidden layers of a neural network.
The co-processing unit 110 may be implemented as an execution engine dedicated to tensor computation, such as an execution core of a domain-specific accelerator (DSA). For a batch of data, the co-processing unit 110 may generate a feature map of each channel through matrix operation or convolution operation. The pixels in the feature map for each channel correspond to the feature values for that channel. For example, in the processing of a batch of data, the co-processing unit 110 may generate a feature map for each of 8 channels.
Depending on, for example, the amount of data to be processed and the processing power of the co-processing unit 110, the processing of a batch of data may involve only a single cycle or be divided into multiple cycles. The term "cycle" as used herein refers to a processing cycle of data.
In some embodiments, a set of eigenvalues for each channel may be generated in a single cycle, which is also referred to as a "single-cycle embodiment". In a single-cycle embodiment, co-processing unit 110 may generate all of the eigenvalues for each channel in one cycle. For example, the co-processing unit 110 may generate the full feature map of each channel in one cycle.
In some embodiments, the processing of a batch of data may be divided into a plurality of cycles, and a set of eigenvalues for each channel may be generated over the plurality of cycles, which is also referred to as a "multi-cycle embodiment". In a multi-cycle embodiment, co-processing unit 110 may generate, for each channel, a respective subset of the set of eigenvalues in each of the multiple cycles. The plurality of subsets, which are generated in the plurality of cycles, respectively, together constitute the set of eigenvalues. For example, the feature map of each channel may have 8×8 pixels, and the processing of a batch of data may be divided into 8 cycles. In this case, the co-processing unit 110 may generate 8 pixels of the feature map for each channel in each cycle.
The co-processing unit 110 may determine the number of channels to process and the number of divided cycles according to the instruction. Such instructions may come from the general purpose processing unit 140, for example. However, it should be understood that such instructions may also come from another unit in the apparatus 100 or from other units not shown. The scope of the present disclosure is not limited in this respect.
As shown in fig. 1, a predetermined processing unit 120 is coupled to the co-processing unit 110. Thus, the predetermined processing unit 120 may receive the characteristic value of each channel from the co-processing unit 110. The predetermined processing unit 120 is configured to determine for each channel at least one parameter related to the normalization of the characteristic value of that channel, also referred to as "normalization parameter".
It is assumed that in the processing of a batch of data, each channel includes (n+1) eigenvalues with indices of 0 to n, respectively, and these eigenvalues are denoted as y0, y1, ..., yn. Then, the sum S0, the sum of squares SQ0, and the average value M0 of these eigenvalues can be expressed as:
S0=y0+y1+...+yn (6)
SQ0=y0*y0+y1*y1+...+yn*yn (7)
M0=S0/(n+1) (8)
wherein n is a positive integer.
Accordingly, the variance of these eigenvalues can be expressed as:
E=1/2((y0-M0)*(y0-M0)+(y1-M0)*(y1-M0)+...+(yn-M0)*(yn-M0)) (9)
by analysis of equation (9), the variance of these eigenvalues can be further expressed as:
E=1/2(y0*y0+M0*M0–2*y0*M0+y1*y1+M0*M0–2*y1*M0+…)
=1/2(SQ0+(n+1)*M0*M0–2*S0*M0)
=1/2(SQ0+S0*S0/(n+1)–2*S0*S0/(n+1))
=1/2(SQ0–S0*S0/(n+1)) (10)
It can be seen from equation (10) that once the sum S0 and the sum of squares SQ0 of the eigenvalues are obtained, only a small amount of computation is required to obtain the variance E of the eigenvalues and thereby normalize the eigenvalues. Thus, the normalization parameters may comprise at least one of the sum S0 or the sum of squares SQ0 of the eigenvalues.
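As a quick numeric check of equation (10) — purely illustrative, with made-up values and helper names that do not appear in the original text — the single-pass quantities S0 and SQ0 yield the same E as the direct form of equation (9):

```python
# Minimal check that equation (10) matches the direct form of equation (9).
# The data are random made-up values; S0, SQ0, M0 and E follow the text above.
import random

y = [random.uniform(-1.0, 1.0) for _ in range(64)]  # (n+1) eigenvalues of one channel
n_plus_1 = len(y)

S0 = sum(y)                        # equation (6)
SQ0 = sum(v * v for v in y)        # equation (7)
M0 = S0 / n_plus_1                 # equation (8)

E_direct = 0.5 * sum((v - M0) ** 2 for v in y)    # equation (9)
E_from_params = 0.5 * (SQ0 - S0 * S0 / n_plus_1)  # equation (10)

assert abs(E_direct - E_from_params) < 1e-9
```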
In some embodiments, the predetermined processing unit 120 may calculate the sum S0 of the eigenvalues for each channel. In some embodiments, the predetermined processing unit 120 may calculate the sum of squares SQ0 of the eigenvalues for each channel. In such embodiments, the computational load at the general purpose processing unit 140 may be relaxed, thereby reducing the power consumption of the apparatus 100.
In some embodiments, the predetermined processing unit 120 may calculate both the sum S0 and the square sum SQ0 of the eigenvalues for each channel. In such embodiments, the computational load at the general purpose processing unit 140 may be further relaxed, thereby further reducing the power consumption of the apparatus 100.
The predetermined processing unit 120 may determine which normalization parameter or parameters to calculate according to the instructions. Such instructions may come from the general purpose processing unit 140, for example. For example, the general purpose processing unit 140 may send instructions to the predetermined processing unit 120 to instruct the predetermined processing unit 120 to calculate sum S0, or sum of squares SQ0, or both. However, it should be understood that such instructions may also come from another unit in the apparatus 100 or from other units not shown. The scope of the present disclosure is not limited in this respect.
In an embodiment of the present disclosure, the predetermined processing unit 120 has a fixed function, i.e. for calculating normalization parameters. In other words, the predetermined processing unit 120 may be a dedicated processing unit, which is specifically designed for calculating the normalization parameters. The predetermined processing unit 120 may be implemented in any suitable hardware. For example, the predetermined processing unit 120 may be implemented with an Arithmetic Logic Unit (ALU). As another example, the predetermined processing unit 120 may be implemented with an Application Specific Integrated Circuit (ASIC).
The predetermined processing unit 120 may support parallel operations of multiple channels. That is, the predetermined processing unit 120 may calculate the respective normalization parameters for a plurality of channels at the same time. The throughput of the predetermined processing unit 120 may be adapted to the throughput of the co-processing unit 110 such that the feature data generated by the co-processing unit 110 can be processed at the predetermined processing unit 120 in time.
The memory unit 130 is coupled to the predetermined processing unit 120. The predetermined processing unit 120 may write the calculated normalized parameters for each channel to the storage unit 130. The storage unit 130 is configured to store the normalization parameter for each channel.
Reference is made to fig. 2. Fig. 2 schematically illustrates a storage unit 130 of some embodiments of the present disclosure. In the example of fig. 2, the storage unit 130 may store the sum S0 and the sum of squares SQ0 of the eigenvalues of each channel. Specifically, the sum S0 201 of channel 0, the sum of squares SQ0 202 of channel 0, the sum S0 203 of channel 1, the sum of squares SQ0 204 of channel 1, the sum S0 205 of channel 2, the sum of squares SQ0 206 of channel 2, and the like are stored in the storage unit 130. The storage unit 130 may store the normalization parameters in any suitable way, such as a table.
The storage unit 130 may be implemented using any suitable hardware. As an example, the storage unit 130 may be implemented as a Static Random Access Memory (SRAM). The storage unit 130 may also be implemented as other suitable memory devices, including, but not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), flash memory, and memory sticks. Furthermore, the numbers of channels and normalization parameters shown in fig. 2 are merely exemplary and are not intended to limit the scope of the present disclosure.
With continued reference to fig. 1. The predetermined processing unit 120 and the storage unit 130 may cooperatively implement the cumulative calculation of the normalization parameters. The predetermined processing unit 120 may receive, from the co-processing unit 110, the subset of eigenvalues generated in the current period, and read the stored normalization parameters from the storage unit 130. The stored values of the normalization parameters are derived based on the subsets of eigenvalues generated in the periods preceding the current period. The predetermined processing unit 120 may then update the normalization parameters based on the subset of eigenvalues of the current period, i.e., calculate updated values of the normalization parameters, and write the updated normalization parameters to the storage unit 130.
As an example, assume that the normalization parameters are the sum S0 and the sum of squares SQ0 of the eigenvalues, and that 8 eigenvalues, e.g., 8 pixels of the feature map, are generated for each channel per cycle. Then, the predetermined processing unit 120 may perform the addition and multiply-add operations shown in equations (11) and (12):
S0_new=S0_old+p0+p1+p2+…+p7 (11)
SQ0_new=SQ0_old+p0*p0+p1*p1+p2*p2+…+p7*p7 (12)
wherein S0_old and SQ0_old are the previously accumulated values, which were derived based on the eigenvalues generated in preceding periods and are read out from the storage unit 130 by the predetermined processing unit 120; p0, p1, p2, …, p7 are the eigenvalues generated in the current cycle; and S0_new and SQ0_new are the updated accumulated values, which will be written to the storage unit 130 to replace S0_old and SQ0_old.
In this way, the normalization parameters in the storage unit 130 may be updated stepwise until the last cycle of the batch. It will be appreciated that if the current period is the first period, the normalization parameters read from the storage unit 130 may be 0.
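A simplified software model of this accumulation loop is sketched below; the channel count, the per-cycle subset size, and the dict used as a stand-in for the storage unit 130 are all assumptions made for illustration only.

```python
# Simplified model of equations (11) and (12): each cycle, the old accumulated
# values are read from the "storage unit", the contribution of the newly
# generated feature values is added, and the updated values are written back.
import random

NUM_CHANNELS = 8
VALUES_PER_CYCLE = 8
NUM_CYCLES = 8

# Stand-in for the storage unit: one (S0, SQ0) entry per channel, zero before the first cycle.
storage = {ch: {"S0": 0.0, "SQ0": 0.0} for ch in range(NUM_CHANNELS)}

for cycle in range(NUM_CYCLES):
    for ch in range(NUM_CHANNELS):
        # subset of feature values produced by the co-processing unit in this cycle
        p = [random.uniform(-1.0, 1.0) for _ in range(VALUES_PER_CYCLE)]
        entry = storage[ch]                    # read S0_old, SQ0_old
        entry["S0"] += sum(p)                  # equation (11)
        entry["SQ0"] += sum(v * v for v in p)  # equation (12)
```

After the last cycle, each entry holds the S0 and SQ0 of the full set of feature values for its channel, which are exactly the quantities needed in equation (10).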
The general purpose processing unit 140 is coupled to the co-processing unit 110 to receive a respective set of characteristic values for one or more channels from the co-processing unit 110. The general processing unit 140 is further coupled to the storage unit 130 to read the normalization parameters from the storage unit 130, for example, one or more of the sum S0 201 of channel 0, the sum of squares SQ0 202 of channel 0, the sum S0 203 of channel 1, the sum of squares SQ0 204 of channel 1, the sum S0 205 of channel 2, and the sum of squares SQ0 206 of channel 2 shown in fig. 2.
The general processing unit 140 is configured to normalize the received characteristic values with the read normalization parameters. For example, the general purpose processing unit 140 may calculate the variance of the eigenvalues according to equation (10) and perform a small number of subsequent calculations required for batch normalization. The general processing unit 140 may also perform other subsequent processing on the received characteristic values. The scope of the present disclosure is not limited in this respect.
After receiving all the feature values of the processed channels from the co-processing unit 110, the general processing unit 140 may read the normalized parameters of the corresponding channels from the storage unit 130. For example, assume that general processing unit 140-1 is configured to process channel 0 and channel 1. After receiving all feature values of channel 0 and channel 1 (e.g., all pixels in the feature map) from the co-processing unit 110, the general processing unit 140-1 may read the sum S0 201 of channel 0, the sum of squares SQ0 202 of channel 0, the sum S0 203 of channel 1, and the sum of squares SQ0 204 of channel 1 from the storage unit 130.
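Continuing the simplified model above, the final normalization step at a general purpose processing unit might look roughly as follows; the function name, the list-based interface, and the epsilon term are illustrative assumptions rather than part of this disclosure.

```python
import math

def normalize_channel(feature_values, S0, SQ0, eps=1e-5):
    """Normalize one channel using the parameters read from the storage unit."""
    n_plus_1 = len(feature_values)
    mean = S0 / n_plus_1                       # equation (8)
    # Equation (10) gives 2*E = SQ0 - S0*S0/(n+1), i.e. the sum of squared
    # deviations; dividing by (n+1) gives the per-value variance.
    var = (SQ0 - S0 * S0 / n_plus_1) / n_plus_1
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in feature_values]
```

Only this small, parameter-driven computation remains on the general purpose processing unit; the data-proportional accumulation has already been performed by the predetermined processing unit.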
The general purpose processing unit 140 may be implemented using any suitable hardware. The general purpose processing unit 140 may include, but is not limited to, a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), a microprocessor, a controller, a microcontroller, and the like. Furthermore, although a plurality of general purpose processing units are shown in fig. 1, this is merely exemplary. An apparatus 100 for processing data according to the present disclosure may include a greater or lesser number of general purpose processing units. For example, the apparatus 100 may comprise only one general processing unit for processing all channels.
In the apparatus for processing data according to the present disclosure, a predetermined processing unit having a fixed function is introduced in place of the general processing unit to calculate the parameters related to batch normalization. Introducing such dedicated computing resources enables the computation to be accomplished at a controlled cost and power consumption. Thus, the power consumption for performing batch normalization can be reduced. Further, as can be seen from the above description, the calculation of the normalization parameters by the predetermined processing unit can be regarded as being done on-the-fly, without spending additional clock cycles. In this way, the latency for calculating the normalization parameters can be hidden. Thus, the apparatus for data processing according to the present disclosure has optimized performance.
Fig. 3 schematically illustrates a block diagram of an apparatus 100 for processing data according to further embodiments of the present disclosure. Fig. 3 shows another example of the apparatus 100. In some cases, the general purpose processing unit 140 may also generate intermediate data that needs to be normalized in the subsequent processing of the channel. Thus, there may be a normalization requirement for data processing at the general processing unit 140.
As shown in fig. 3, in some embodiments, the predetermined processing unit 120 may also be coupled to a general purpose processing unit 140. The predetermined processing unit 120 may also be configured to receive a set of intermediate values for processing one or more channels from the general purpose processing unit 140. The "intermediate value" described herein may refer to any type of data that the general processing unit 140 generates during processing of the corresponding channel. The predetermined processing unit 120 may be further configured to determine at least one parameter related to the normalization of the set of intermediate values and store the determined parameter to the storage unit 130. For example, the predetermined processing unit 120 may calculate the sum S0, and the square sum SQ0 of the set of intermediate values, and write them into the storage unit 130. The manner in which the predetermined processing unit 120 determines the normalized parameters for these intermediate values is similar to that discussed above with reference to fig. 1 and will not be repeated here.
Accordingly, the general purpose processing unit 140 may be further configured to normalize the set of intermediate values for the one or more channels with the normalization parameters from the storage unit.
In such an embodiment, the normalization operation at the general purpose processing unit may also utilize dedicated computing resources. In this way the power consumption of the general purpose processing unit can be further reduced, thereby further optimizing the overall performance of the device for processing data.
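An entirely illustrative round trip for such intermediate values is sketched below; the made-up data and the inlined arithmetic are assumptions, not the disclosed implementation.

```python
import math

# Round trip for intermediate values: the general purpose processing unit
# produces them, the predetermined processing unit accumulates S0/SQ0 into the
# storage unit, and the general purpose processing unit normalizes with those
# parameters afterwards.
intermediate = [0.1 * i for i in range(64)]   # made-up intermediate values of one channel

S0 = sum(intermediate)                         # accumulated as in equation (11)
SQ0 = sum(v * v for v in intermediate)         # accumulated as in equation (12)

n = len(intermediate)
mean = S0 / n
var = (SQ0 - S0 * S0 / n) / n                  # from equation (10)
normalized = [(v - mean) / math.sqrt(var + 1e-5) for v in intermediate]
```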
A method 400 for processing data is described below in connection with fig. 4. Fig. 4 shows a flowchart of an example method 400 for processing data according to an embodiment of the present disclosure. It should be understood that method 400 may also include additional blocks not shown and/or that the blocks shown may be omitted, the scope of the disclosure being not limited in this respect.
At block 410, the co-processing unit 110 generates a first set of eigenvalues of a first channel based on the input data. In the case of a single channel, the "first channel" described herein may be a single channel being processed. In the case of multiple channels, the "first channel" described herein may be any of a plurality of channels being processed. For example, the co-processing unit 110 may generate a feature map for each channel.
In some embodiments, the processing of the input data may be divided into a plurality of cycles. The co-processing unit 110 may generate a respective subset of the first set of eigenvalues at each of a plurality of cycles. For example, the processing of the input data may be divided into 8 periods, and the feature map of the first channel has 64 pixels. The co-processing unit 110 may generate 8 pixels in the feature map every cycle.
At block 420, the predetermined processing unit 120 determines at least one first parameter related to normalization of the first set of feature values for the first channel. In the case of multiple channels, the predetermined processing unit 120 may determine a respective normalization parameter for each channel, as described above with reference to fig. 1.
In some embodiments, the at least one first parameter may comprise a sum of the first set of characteristic values, i.e., the sum S0 described above. Alternatively or additionally, in some embodiments, the at least one first parameter may comprise a sum of squares of the first set of eigenvalues, i.e., the sum of squares SQ0 described above. For example, the predetermined processing unit 120 may determine the sum S0 of channel 0, the sum of squares SQ0 of channel 0, the sum S0 of channel 1, the sum of squares SQ0 of channel 1, the sum S0 of channel 2, the sum of squares SQ0 of channel 2, and the like.
In some embodiments, the processing of the input data may be divided into a plurality of cycles. The predetermined processing unit 120 may determine at least one first parameter of the first set of characteristic values in cooperation with the storage unit 130.
Fig. 5 shows a flowchart of an example method 500 for determining normalization parameters according to embodiments of the present disclosure. It should be understood that method 500 may also include additional blocks not shown and/or that the blocks shown may be omitted, the scope of the disclosure being not limited in this respect. Method 500 may be viewed as a specific implementation of block 420.
At block 510, for a first cycle of the plurality of cycles, the predetermined processing unit 120 may receive a first subset of the first set of feature values from the co-processing unit 110. The "first period" described herein may be any one of the plurality of periods of data processing, and the "first subset" may be the feature values of the first channel generated by the co-processing unit 110 in the first period of data processing.
At block 520, the predetermined processing unit 120 may read the at least one first parameter from the storage unit 130. The read at least one first parameter may be determined based at least on a second subset of the first set of feature values, and the second subset may be generated by the co-processing unit 110 in a second period preceding the first period. The "second period" described herein may refer to any or all of the periods preceding the first period. For example, the predetermined processing unit 120 may read the current value of the sum S0 and the current value of the sum of squares SQ0 for each channel from the storage unit 130.
At block 530, the predetermined processing unit 120 may update the at least one parameter based on the first subset. For example, the first subset may be p0, p1, p2, …, p7 shown in formulas (11) and (12). The predetermined processing unit 120 may calculate updated values of sum S0 and square sum SQ0 according to equations (11) and (12), respectively.
At block 540, the predetermined processing unit 120 may store the updated at least one parameter to the storage unit 130. For example, the predetermined processing unit 120 may write the updated values of the sum S0 and the sum of squares SQ0 to the storage unit 130, replacing the corresponding old values.
With continued reference to fig. 4. At block 430, the storage unit 130 stores the at least one first parameter determined by the predetermined processing unit 120. In the case of multiple channels, the storage unit 130 may store the normalization parameters for each channel. For example, the storage unit 130 may store the sum S0 201 of channel 0, the sum of squares SQ0 202 of channel 0, the sum S0 203 of channel 1, the sum of squares SQ0 204 of channel 1, the sum S0 205 of channel 2, the sum of squares SQ0 206 of channel 2, and the like, as shown in fig. 2.
At block 440, the first general purpose processing unit normalizes the first set of feature values with the at least one first parameter for the first channel. The "first general purpose processing unit" described herein refers to the general purpose processing unit for processing the first channel among the general purpose processing units 140-1 to 140-N. For example, after receiving all the feature values of channel 0 from the co-processing unit 110, the first general purpose processing unit may read the sum S0 201 of channel 0 and the sum of squares SQ0 202 of channel 0 from the storage unit 130. Further, the first general purpose processing unit may normalize the eigenvalues of channel 0 with the sum S0 and the sum of squares SQ0.
In the case of multiple channels, the co-processing unit 110 may also generate a second set of eigenvalues for the second channel based on the input data. The "second channel" described herein may be any channel other than the first channel among the plurality of channels being processed. Accordingly, the predetermined processing unit 120 may further determine, for the second channel, at least one second parameter related to normalization of the second set of characteristic values, and the storage unit 130 may further store the at least one second parameter. The at least one second parameter is, for example, the sum S0 and the sum of squares SQ0 of the second channel.
Accordingly, the second general purpose processing unit may normalize the second set of feature values with the at least one second parameter for the second channel. The "second general purpose processing unit" described herein refers to the general purpose processing unit for processing the second channel among the general purpose processing units 140-1 to 140-N. For example, after receiving all the feature values of channel 1 from the co-processing unit 110, the second general purpose processing unit may read the sum S0 203 of channel 1 and the sum of squares SQ0 204 of channel 1 from the storage unit 130. Further, the second general purpose processing unit may normalize the eigenvalues of channel 1 with the sum S0 and the sum of squares SQ0. The first general purpose processing unit and the second general purpose processing unit may be the same processing unit or different general purpose processing units, depending on the specific implementation.
In some embodiments, the method 400 may further include the predetermined processing unit 120 receiving, from the first general purpose processing unit, a set of intermediate values for processing the first channel, and determining at least one third parameter related to normalization of the set of intermediate values. The at least one third parameter may be, for example, the sum of the set of intermediate values or the sum of squares of the set of intermediate values. The predetermined processing unit 120 may in turn store the at least one third parameter to the storage unit 130. Accordingly, the method 400 may further include the first general purpose processing unit normalizing the set of intermediate values with the at least one third parameter for the first channel.
In the method for processing data according to the present disclosure, a predetermined processing unit having a fixed function is introduced in place of the general processing unit to calculate the parameters related to batch normalization. Introducing such dedicated computing resources enables the computation to be accomplished at a controlled cost and power consumption. Thus, the power consumption for performing batch normalization can be reduced. Further, as can be seen from the above description, the calculation of the normalization parameters at the predetermined processing unit can be regarded as being done on-the-fly, without spending additional clock cycles. In this way, the latency for calculating the normalization parameters can be hidden. Thus, the method for data processing according to the present disclosure can achieve optimized data processing performance.
It should be appreciated that the methods 400 and 500 may be implemented as a computer software program that may be tangibly embodied on a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via the ROM and/or the communication unit. One or more of the acts of the methods 400 and 500 described above may be performed when a computer program is loaded into RAM and executed by a processor.
The present disclosure may be a method, a computing device, a computer storage medium, and/or a computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure. The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
The computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages. In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, so that the electronic circuitry can execute the computer readable program instructions.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (11)

1. An apparatus for processing data, comprising:
a co-processing unit configured to generate a first set of eigenvalues of a first channel based on input data;
a predetermined processing unit coupled to the co-processing unit and configured to determine, for the first channel, at least one first parameter related to normalization of the first set of eigenvalues;
a storage unit coupled to the predetermined processing unit and configured to store the at least one first parameter; and
a first general purpose processing unit coupled to the co-processing unit and the storage unit and configured to normalize the first set of feature values with the at least one first parameter for the first channel;
and wherein the co-processing unit is configured to generate a respective subset of the first set of feature values in each of a plurality of cycles, and the predetermined processing unit is configured to, for a first cycle of the plurality of cycles:
receiving a first subset of the first set of feature values from the co-processing unit;
reading the at least one first parameter from the storage unit, wherein the at least one first parameter read is determined based at least on a second subset of the first set of eigenvalues, and the second subset is generated by the co-processing unit in a second period preceding the first period;
updating the at least one parameter based on the first subset; and
the updated at least one parameter is stored to the storage unit.
2. The apparatus of claim 1, wherein the at least one first parameter comprises at least one of:
the sum of the first set of characteristic values, or
The sum of squares of the first set of eigenvalues.
3. The apparatus of claim 1, wherein
The co-processing unit is further configured to generate a second set of eigenvalues for a second channel based on the input data, the second channel being different from the first channel;
the predetermined processing unit is further configured to determine, for the second channel, at least one second parameter related to normalization of the second set of feature values;
the storage unit is further configured to store the at least one second parameter; and is also provided with
The apparatus further comprises:
a second general purpose processing unit coupled to the co-processing unit and the storage unit and configured to normalize the second set of feature values with the at least one second parameter for the second channel.
4. The apparatus of claim 1, wherein the predetermined processing unit is further coupled to the first general purpose processing unit and is further configured to:
receiving a set of intermediate values from the first general purpose processing unit for processing the first channel;
determining at least one third parameter related to normalization of the set of intermediate values; and
storing the at least one third parameter to the storage unit, and
the first general purpose processing unit is further configured to normalize the set of intermediate values with the at least one third parameter for the first channel.
5. The apparatus of claim 1, wherein the input data comprises training samples for a machine learning model.
6. A method for processing data, comprising:
the co-processing unit generates a first set of feature values of the first channel based on the input data;
a predetermined processing unit coupled to the co-processing unit determines, for the first channel, at least one first parameter related to normalization of the first set of eigenvalues;
a storage unit coupled to the predetermined processing unit stores the at least one first parameter; and
a first general purpose processing unit coupled to the co-processing unit and the storage unit normalizes the first set of eigenvalues with the at least one first parameter for the first channel;
wherein the co-processing unit generating the first set of eigenvalues of the first channel based on the input data comprises: the co-processing unit generates a respective subset of the first set of eigenvalues at each of a plurality of cycles, and the predetermined processing unit determining the at least one first parameter for the first channel comprises the predetermined processing unit performing the following for a first cycle of the plurality of cycles:
receiving a first subset of the first set of feature values from the co-processing unit;
reading the at least one first parameter from the storage unit, wherein the at least one first parameter read is determined based at least on a second subset of the first set of eigenvalues, and the second subset is generated by the co-processing unit in a second period preceding the first period;
updating the at least one parameter based on the first subset; and
the updated at least one parameter is stored to the storage unit.
7. The method of claim 6, wherein the at least one first parameter comprises at least one of:
the sum of the first set of characteristic values, or
The sum of squares of the first set of eigenvalues.
8. The method of claim 6, further comprising:
the co-processing unit generating a second set of eigenvalues for a second channel based on the input data, the second channel being different from the first channel;
the predetermined processing unit determining, for the second channel, at least one second parameter related to normalization of the second set of eigenvalues;
the storage unit stores the at least one second parameter; and
a second general purpose processing unit coupled to the co-processing unit and the storage unit normalizes the second set of eigenvalues with the at least one second parameter for the second channel.
9. The method of claim 6, wherein the predetermined processing unit is further coupled to the first general purpose processing unit, and the method further comprises:
the predetermined processing unit receives a set of intermediate values from the first general purpose processing unit for processing the first channel;
the predetermined processing unit determining at least one third parameter related to normalization of the set of intermediate values;
the predetermined processing unit stores the at least one third parameter to the storage unit; and
the first general purpose processing unit normalizes the set of intermediate values with the at least one third parameter for the first channel.
10. The method of claim 6, wherein the input data comprises training samples for a machine learning model.
11. A computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of claims 6-10.
CN202011523956.1A 2020-12-22 2020-12-22 Apparatus, method and computer readable storage medium for processing data Active CN112561047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011523956.1A CN112561047B (en) 2020-12-22 2020-12-22 Apparatus, method and computer readable storage medium for processing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011523956.1A CN112561047B (en) 2020-12-22 2020-12-22 Apparatus, method and computer readable storage medium for processing data

Publications (2)

Publication Number Publication Date
CN112561047A CN112561047A (en) 2021-03-26
CN112561047B true CN112561047B (en) 2023-04-28

Family

ID=75032118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011523956.1A Active CN112561047B (en) 2020-12-22 2020-12-22 Apparatus, method and computer readable storage medium for processing data

Country Status (1)

Country Link
CN (1) CN112561047B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886392A (en) * 2019-02-25 2019-06-14 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN111316199A (en) * 2018-10-16 2020-06-19 华为技术有限公司 Information processing method and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008393A (en) * 2014-05-17 2014-08-27 北京工业大学 Feature grouping normalization method for cognitive state recognition
CN110110257B (en) * 2018-01-30 2022-03-04 北京京东尚科信息技术有限公司 Data processing method and system, computer system and computer readable medium
CN110390394B (en) * 2019-07-19 2021-11-05 深圳市商汤科技有限公司 Batch normalization data processing method and device, electronic equipment and storage medium
CN111310931A (en) * 2020-02-05 2020-06-19 北京三快在线科技有限公司 Parameter generation method and device, computer equipment and storage medium
CN111428879B (en) * 2020-03-04 2024-02-02 中昊芯英(杭州)科技有限公司 Data processing method, device, chip and computer readable storage medium
CN111738325B (en) * 2020-06-16 2024-05-17 北京百度网讯科技有限公司 Image recognition method, device, equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111316199A (en) * 2018-10-16 2020-06-19 华为技术有限公司 Information processing method and electronic equipment
CN109886392A (en) * 2019-02-25 2019-06-14 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112561047A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
US11308398B2 (en) Computation method
US10460230B2 (en) Reducing computations in a neural network
US11531860B2 (en) Apparatus and method for executing recurrent neural network and LSTM computations
WO2018192200A1 (en) System and method for training neural network
US10817293B2 (en) Processing core with metadata actuated conditional graph execution
US11494639B2 (en) Bayesian-optimization-based query-efficient black-box adversarial attacks
US11681922B2 (en) Performing inference and training using sparse neural network
US20180293486A1 (en) Conditional graph execution based on prior simplified graph execution
Teng et al. Bayesian distributed stochastic gradient descent
US20190050224A1 (en) Processing core with metadata actuated conditional graph execution
CN112949746B (en) Big data processing method applied to user behavior analysis and artificial intelligence server
CN112561047B (en) Apparatus, method and computer readable storage medium for processing data
CN113168324A (en) Lossy sparsely loaded SIMD instruction families
Nikulin et al. Unsupervised dimensionality reduction via gradient-based matrix factorization with two adaptive learning rates
US20230072535A1 (en) Error mitigation for sampling on quantum devices
CN110969259B (en) Processing core with data-dependent adaptive rounding
CN114610899A (en) Representation learning method and system of knowledge graph
Tian et al. Lookup table allocation for approximate computing with memory under quality constraints
Lu et al. Monte carlo matrix inversion policy evaluation
Peng et al. Towards better generalization of deep neural networks via non-typicality sampling scheme
Zhai et al. Learning sampling policy for faster derivative free optimization
US20220222538A1 (en) Method and apparatus with neural network processing
Nguyen et al. Improving model-based rl with adaptive rollout using uncertainty estimation
US20220405599A1 (en) Automated design of architectures of artificial neural networks
EP4058885B1 (en) Execution of a conditional statement by an arithmetic and/or bitwise unit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai

Patentee after: Shanghai Bi Ren Technology Co.,Ltd.

Country or region after: China

Address before: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai

Patentee before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China

CP03 Change of name, title or address