CN113570034A - Processing device, neural network processing method and device - Google Patents


Info

Publication number
CN113570034A
Authority
CN
China
Prior art keywords
input data, DSP, NPU, unit, result
Prior art date
Legal status
Granted
Application number
CN202110679305.XA
Other languages
Chinese (zh)
Other versions
CN113570034B (en)
Inventor
田超
贾磊
严小平
闻军会
邓广来
李强
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110679305.XA
Publication of CN113570034A
Application granted
Publication of CN113570034B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/34 - Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Abstract

The application discloses a processing device, a neural network processing method, and a related apparatus, and relates to fields such as deep learning and speech technology. The specific implementation scheme is as follows: the processing device comprises a neural network processing unit NPU, a pseudo static random access memory PSRAM and a digital signal processor DSP which are connected through a bus. The DSP stores, in its internal memory, the input data to be processed and the operation result of the NPU on the input data; the PSRAM stores the network parameters of the neural network; the NPU accesses the memory inside the DSP through the bus to read the input data to be processed, accesses the PSRAM through the bus to obtain at least part of the network parameters, performs at least one of a matrix vector operation and a convolution operation on the input data according to the read network parameters, and synchronously continues to read the remaining network parameters in the PSRAM. In this way, data reading/loading and calculation proceed in parallel, and calculation efficiency can be improved.

Description

Processing device, neural network processing method and device
Technical Field
The present invention relates to the field of AI (Artificial Intelligence) such as deep learning and speech technology, and in particular, to a processing apparatus, a processing method for a neural network, and an apparatus therefor.
Background
At present, when a voice chip in electronic equipment such as a smart speaker processes voice data, all the data required for the calculation is loaded first, and the voice data is then processed using the loaded data.
Disclosure of Invention
The application provides a processing device, a processing method of a neural network and a device thereof.
According to an aspect of the present application, there is provided a processing apparatus including: the system comprises a neural network processing unit NPU, a pseudo static random access memory PSRAM and a digital signal processor DSP which are connected through a bus;
the DSP is used for storing input data to be processed in an internal memory; and storing the operation result of the NPU on the input data;
the PSRAM is used for storing network parameters of the neural network;
the NPU is used for accessing a memory inside the DSP through the bus to read and obtain the input data to be processed and accessing the PSRAM through the bus to obtain at least part of network parameters; and performing at least one of matrix vector operation and convolution operation on the input data according to the read at least part of network parameters, and synchronously continuously reading the rest of network parameters in the PSRAM.
According to another aspect of the present application, a processing method of a neural network is provided, which is applied to a processing device, wherein the processing device comprises a neural network processing unit NPU, a pseudo static random access memory PSRAM and a digital signal processor DSP which are connected by a bus; the processing method comprises the following steps:
the NPU accesses a memory inside the DSP through the bus to read and obtain input data to be processed;
the NPU accesses the PSRAM through the bus to obtain at least part of network parameters;
the NPU executes at least one of matrix vector operation and convolution operation on the input data according to the read at least part of network parameters, and synchronously continuously reads the rest of network parameters in the PSRAM;
and the DSP stores the operation result of the NPU on the input data.
According to yet another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the neural network processing methods set forth herein above.
According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium of computer instructions for causing a computer to perform the processing method of the neural network proposed above in the present application.
According to yet another aspect of the present application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the processing method of the neural network proposed above in the present application.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic structural diagram of a processing apparatus according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a processing apparatus according to a second embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a processing apparatus according to a third embodiment of the present application;
FIG. 4 is a diagram illustrating a convolution calculation process according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a processing apparatus according to a fourth embodiment of the present disclosure;
fig. 6 is a schematic flowchart of a processing method of a neural network according to a fifth embodiment of the present application;
FIG. 7 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
To reduce the cost of the voice chip while still meeting the algorithm requirements, the on-chip memory of the voice chip can be reduced, and the memory can instead be extended with a PSRAM (Pseudo Static Random Access Memory) packaged together with the chip by SIP (System In Package). This lowers the cost of the original scheme in which the voice chip is connected to an external PSRAM through the ESP32. That is, in the existing solution the PSRAM is placed on the ESP32 main-control side, externally at board level, which incurs extra cost; by packaging the PSRAM into the voice chip and reducing the on-chip memory accordingly, the cost of the externally attached PSRAM is saved.
However, reducing the on-chip memory reduces the amount of high-bandwidth internal memory and slows data loading, which puts the parallelism of AI calculation and model-data loading at risk. Improving the bandwidth utilization of the PSRAM is therefore of great importance.
In addition, in order to save voice-chip area, the functions of the main control MCU (Microcontroller Unit), such as voice service logic and control logic, can be moved from the ESP32 into the voice chip, with only one core of the voice chip's dual-core architecture reserved for voice processing.
However, after the computation of both cores is placed on a single core, the computing power for 8x8 and 16x8 multiply-add operations is insufficient, and the single core is under heavy pressure to handle all of the voice processing.
Moreover, the loading of data from the PSRAM and the calculation on voice data are executed separately, so when the PSRAM data-loading speed is low, the calculation efficiency for subsequent voice data is seriously affected.
In view of the above problems, the present application provides a processing apparatus, a processing method of a neural network, and an apparatus thereof.
A processing apparatus, a processing method of a neural network, and an apparatus thereof according to an embodiment of the present application are described below with reference to the drawings.
Fig. 1 is a schematic structural diagram of a processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 1, the processing apparatus may include an NPU (Neural Network Processing Unit) 110, a PSRAM 120, and a DSP (Digital Signal Processor) 130 connected by a bus.
The DSP130 is used for storing input data to be processed in an internal memory; and stores the result of the NPU110 operation on the input data.
The PSRAM120 is used for storing network parameters of the neural network.
The NPU110 is used for accessing a memory inside the DSP130 through a bus so as to read and obtain input data to be processed, and accessing the PSRAM120 through the bus to obtain at least part of network parameters; and performing at least one of a matrix vector operation and a convolution operation on the input data according to at least part of the read network parameters, and synchronously continuing to read the rest of the network parameters in the PSRAM 120.
In the embodiment of the present application, when the neural network is applied in a speech recognition scenario, such as when the NPU is applied in a speech chip, the input data to be processed may be determined according to a feature vector of speech data input by a user. Correspondingly, the operation result of the input data is used for determining the voice recognition result corresponding to the voice data.
It should be understood that the neural network may be applied to other scenarios, and in this case, the input data to be processed may also be determined from other data.
As an application scenario, for example, a neural network is applied to an image recognition scenario or a video recognition scenario, input data to be processed may be determined according to a feature vector of an image or a video frame, and accordingly, an operation result of the input data is used to determine a classification result of the image or the video frame.
As an example, taking a neural network used for identity recognition, the input data to be processed can be determined according to the feature vector of an image or video frame, and correspondingly, the operation result of the input data is used to determine the identity information of the target object in the image or video frame.
For another example, as exemplified by the use of a neural network for detecting a living body, input data to be processed may be determined according to a feature vector of an image or a video frame, and accordingly, an operation result of the input data is used for determining whether a living body exists in the image or the video frame. For example, when the probability value output by the neural network is greater than or equal to a preset threshold (for example, the preset threshold may be 0.5), the classification result indicates that a living body exists, and when the probability value output by the neural network is less than the preset threshold, the classification result indicates that a living body does not exist.
As another example, taking a neural network used for detecting forbidden pictures (such as violent or pornographic pictures), the input data to be processed may be determined according to the feature vector of an image or a video frame, and correspondingly, the operation result of the input data is used to determine whether the image or video frame is a forbidden picture. For example, when the probability value output by the neural network is greater than or equal to the preset threshold, the classification result is that the image or video frame is a forbidden picture; when the probability value is smaller than the preset threshold, the classification result is that the image or video frame is a normal picture.
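The thresholding in these classification examples can be sketched in a few lines of C; the 0.5 threshold is the example value quoted above, and the function name is a hypothetical illustration rather than anything defined in this application.

```c
#include <stdbool.h>

#define CLASS_THRESHOLD 0.5f  /* example preset threshold from the text above */

/* Map a neural-network probability output to a binary classification result,
 * e.g. "living body present" / "forbidden picture" vs. the negative class. */
static bool is_positive_class(float probability)
{
    return probability >= CLASS_THRESHOLD;
}
```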
As another application scenario, for example when the neural network is applied to a speech translation scenario, the input data to be processed may be determined according to the feature vector of speech data input by a user. Correspondingly, the operation result of the input data is used to determine the speech translation result.
For example, the neural network is applied to a chinese-english translation scenario for exemplary illustration, the input data to be processed may be determined according to a feature vector of the chinese speech data, and correspondingly, an operation result of the input data is used to determine an english translation result corresponding to the speech data, where the english translation result may be in a speech form or may also be in a text form, which is not limited thereto.
In this embodiment, the NPU 110 may access the memory inside the DSP 130 through the bus to read the input data to be processed, and access the PSRAM 120 through the bus to obtain at least part of the network parameters. It then performs at least one of a matrix vector operation and a convolution operation on the input data according to the network parameters already read, while synchronously continuing to read the remaining network parameters from the PSRAM 120, so that the operations can continue with the newly read parameters until the operation result of the input data is obtained. In other words, the data in the PSRAM can be read/loaded while the calculation is performed with the data already read/loaded; data reading/loading and calculation proceed in parallel, and calculation efficiency can be improved.
It should be noted that, in the related art, the network parameters in the PSRAM need to be loaded through the Cache; the DSP stays idle while the Cache is being loaded, and the computation can only be performed with the loaded network parameters after loading completes, so computation efficiency is low.
In the present application, the loading of the network parameters in the PSRAM 120 and the calculation of the NPU 110 are executed in parallel, which both improves the utilization of data loading and substantially improves calculation efficiency. Taking a speech recognition scenario as an example, with this large improvement in calculation efficiency the processing device becomes better suited to neural-network-based voice wake-up and recognition tasks.
In the processing device of the embodiment of the application, the NPU accesses the memory inside the DSP through the bus to read and obtain the input data to be processed, and accesses the PSRAM through the bus to obtain at least part of the network parameters, to perform at least one of a matrix vector operation and a convolution operation on the input data according to the read at least part of the network parameters, and to synchronously continue to read the rest of the network parameters in the PSRAM. Therefore, the data in the PSRAM can be read/loaded, and the calculation process can be executed by using the read/loaded data, namely, the data reading/loading and the calculation can be realized in parallel, so that the calculation efficiency can be improved.
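To make the load/compute overlap concrete, the following C sketch models the behaviour with double buffering. It is a software illustration only, not the NPU hardware, and the block size, buffer scheme and function names are assumptions introduced here for illustration.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 128  /* assumed size of one network-parameter block per burst */

/* Stand-in for an asynchronous PSRAM burst read; the real hardware would issue
 * the read and let it complete while the NPU keeps computing. */
static void psram_read_block(const uint8_t *psram, size_t block, uint8_t *dst)
{
    memcpy(dst, psram + block * BLOCK_BYTES, BLOCK_BYTES);
}

/* Stand-in for one matrix-vector step over a single parameter block; it reuses
 * the first 4 input elements, mirroring the 4-element input group of the text. */
static void npu_matvec_step(const int8_t *x, const uint8_t *w_block, int32_t *acc)
{
    for (size_t i = 0; i < BLOCK_BYTES; ++i)
        acc[i % 32] += (int32_t)(int8_t)w_block[i] * (int32_t)x[i % 4];
}

/* Double-buffered loop: block i+1 is fetched from the PSRAM while block i is
 * being consumed, so parameter loading and calculation proceed in parallel. */
void run_layer(const int8_t *x, const uint8_t *psram, size_t n_blocks, int32_t *acc)
{
    uint8_t buf[2][BLOCK_BYTES];
    psram_read_block(psram, 0, buf[0]);
    for (size_t i = 0; i < n_blocks; ++i) {
        if (i + 1 < n_blocks)
            psram_read_block(psram, i + 1, buf[(i + 1) & 1]); /* prefetch next block */
        npu_matvec_step(x, buf[i & 1], acc);                  /* compute on current  */
    }
}
```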
In order to clearly illustrate how the above embodiments of the present application operate on input data, the present application also provides a processing device.
Fig. 2 is a schematic structural diagram of a processing apparatus according to a second embodiment of the present application.
As shown in fig. 2, the processing means may include: NPU210, PSRAM220, and DSP230, wherein NPU210 includes quantization unit 211 and arithmetic unit 212.
The DSP230 is configured to store floating-point input data in an internal memory; and stores the result of the operation of the NPU210 on the input data.
The PSRAM220 is used for storing network parameters of the neural network.
A quantization unit 211, configured to obtain the floating-point input data, quantize it to obtain quantized input data, and provide the quantized input data to the operation unit 212; the quantization unit 211 is further configured to perform inverse quantization on the operation result output by the operation unit 212 to obtain an inverse quantization result.
And an operation unit 212, configured to perform a matrix vector operation and/or a convolution operation on the quantized input data to obtain an operation result of the input data.
In the embodiment of the present application, when the neural network is applied to a speech recognition scenario, the input data of the floating point type may be determined according to the feature vector of the speech data input by the user. Correspondingly, the dequantization result is used for determining the voice recognition result corresponding to the voice data.
It should be understood that the neural network may be applied to other scenarios, and in this case, the input data to be processed may also be determined from other data. For example, when the neural network is applied to an image recognition scene or a video recognition scene, the floating-point type input data can be determined according to the feature vectors of the image or the video frame, and correspondingly, the dequantization result is used for determining the classification result of the image or the video frame; for another example, when the neural network is applied to a speech translation scenario, the floating-point type input data may be determined according to the feature vector of the speech data input by the user, and accordingly, the dequantization result is used to determine the speech translation result.
In this embodiment, the quantization unit 211 in the NPU210 may access the memory inside the DSP230 through the bus to obtain floating-point input data by reading, quantize the floating-point input data to obtain quantized input data, and provide the quantized input data to the operation unit 212, and accordingly, after receiving the quantized input data, the operation unit 212 may perform matrix vector operation and/or convolution operation on the quantized input data to obtain an operation result of the input data, and output the operation result to the quantization unit 211, and accordingly, after receiving the operation result, the quantization unit 211 may perform inverse quantization on the operation result to obtain an inverse quantization result. Therefore, by adopting the special hardware NPU210 to realize matrix calculation and/or convolution calculation, when the NPU is applied to a voice chip, the processing burden of the core in the voice chip can be reduced, and the processing efficiency of the core in the voice chip can be improved.
In a possible implementation manner of the embodiment of the present application, the operation unit 212 may read at least a portion of the network parameters stored in the PSRAM220, perform matrix vector operation on the quantized input data according to the read at least a portion of the network parameters, and synchronously continue to read the remaining network parameters in the PSRAM 220. Therefore, the network parameters can be read, matrix vector operation can be executed by using the read network parameters, and the parallelism of data reading/loading and calculation in the PSRAM can be realized, so that the calculation efficiency is improved.
According to the processing device, the quantization unit is used for acquiring the floating-point input data, the quantization unit is used for quantizing the floating-point input data to obtain the quantized input data, the quantized input data are provided to the operation unit, so that the operation unit is used for performing matrix vector operation and/or convolution operation on the quantized input data to obtain the operation result of the input data, and then the quantization unit is used for performing inverse quantization on the operation result output by the operation unit to obtain the inverse quantization result. Therefore, matrix calculation and/or convolution calculation are/is realized by adopting the special NPU, and when the NPU is applied to a voice chip, the processing load of a core in the voice chip can be reduced, and the processing efficiency of the core in the voice chip is improved.
For clarity, the following illustrates, taking the operation unit 212 performing a matrix vector operation as an example, how the input data is quantized and how the operation result output by the operation unit 212 is dequantized in the embodiment of fig. 2 above.
When the operation unit 212 performs a matrix vector operation, the quantization unit 211 may be configured to: obtain a first parameter for quantization and a second parameter for inverse quantization according to the floating-point input data stored in the memory inside the DSP 230; multiply each floating-point value to be quantized in the floating-point input data by the first parameter, round the result, and convert it to the numerical type char, to obtain numerical (char) input data; send the numerical input data to the operation unit 212; convert the operation result obtained by the operation unit 212 back to floating point; and multiply the floating-point operation result by the second parameter before sending it to the memory of the DSP 230 for storage.
In the embodiment of the present application, the first parameter for quantization and the second parameter for inverse quantization are determined from input data of a floating point type.
As an example, the maximum vector value corresponding to the floating-point input data may be determined and denoted fmax. With the first parameter denoted B and the second parameter denoted A, B may be 127.0f/fmax and A may be fmax/127.0f. The value range of a char is -128 to 127, and during quantization fmax is mapped to the quantization value 127 to obtain the maximum precision; the suffix f denotes a float (floating-point) literal.
In this embodiment, the quantization unit 211 in the NPU 210 may obtain the first parameter for quantization and the second parameter for inverse quantization according to the floating-point input data stored in the memory inside the DSP 230, multiply the floating-point values to be quantized (for example, all floating-point values in the input data) by the first parameter, round them, and convert them into numerical input data, which is sent to the operation unit 212. The operation unit 212 performs the matrix vector operation on the numerical input data to obtain the operation result of the input data and sends it back to the quantization unit 211. The quantization unit 211 then converts the operation result to floating point, multiplies it by the second parameter to obtain the inverse quantization result, and sends the inverse quantization result to the memory of the DSP 230 for storage, so that subsequent operations can be performed by the software of the DSP 230.
Therefore, on one hand, the quantization process can be realized by a special quantization unit, and the NPU210 can be ensured to effectively execute the matrix calculation process. On the other hand, by storing the floating-point input data in the memory of the DSP230 and storing the operation result of the matrix vector operation in the memory of the DSP230, the DSP230 does not need to design Cache (Cache memory) consistency with the NPU210, so that the hardware design can be greatly simplified and the problem of data consistency between the DSP230 and the NPU210 can be solved.
Data consistency refers to the following problem: when the DSP accesses the NPU's Random Access Memory (NPURAM for short), the accessed data is mapped into the Cache; if the NPU then modifies data in the NPURAM, the DSP does not see the modified data in the NPURAM but only the stale data in the Cache, causing a data consistency problem. When the NPU accesses the memory inside the DSP, that memory is visible to both the DSP and the NPU at the same time, so no data consistency problem arises.
As an example, the quantization unit 211 in the NPU 210 may determine the maximum vector value fmax corresponding to the floating-point input data and determine the first parameter B for quantization and the second parameter A for inverse quantization from fmax. When performing a matrix vector operation, all floating-point values in the input data are multiplied by B, rounded, and converted to char, and the char-type input data is sent to the operation unit 212. The operation unit 212 performs an 8x8 matrix vector operation on the char-type input data and the char-type neural network parameters (weight): the input vector of the matrix vector operation is quantized to 8 bits, and the matrix vector operation is an 8-bit by 8-bit matrix operation. The result of the matrix vector operation is output to the accumulator ACC; the result output by the ACC is the operation result, which may be converted to floating point, multiplied by A, and sent to a memory of the DSP 230 (for example, a DRAM (Dynamic Random Access Memory)) for storage.
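A minimal C model of the quantize, 8x8 matrix-vector multiply and dequantize steps just described, using B = 127.0f/fmax and A = fmax/127.0f from the example. The function names are illustrative, and the weight-side scale/bias handling done in DSP software is omitted for brevity.

```c
#include <math.h>
#include <stdint.h>

/* Quantize: multiply each float by B = 127/fmax, round, convert to char. */
static void quantize_to_char(const float *x, int8_t *q, int n, float b)
{
    for (int i = 0; i < n; ++i)
        q[i] = (int8_t)lrintf(x[i] * b);
}

/* 8-bit by 8-bit matrix-vector operation with 32-bit accumulation (the ACC). */
static void matvec_8x8(const int8_t *w, const int8_t *x, int32_t *acc,
                       int rows, int cols)
{
    for (int r = 0; r < rows; ++r) {
        int32_t s = 0;
        for (int c = 0; c < cols; ++c)
            s += (int32_t)w[r * cols + c] * (int32_t)x[c];
        acc[r] = s;
    }
}

/* Dequantize: convert the accumulator result to float and multiply by A = fmax/127. */
static void dequantize_result(const int32_t *acc, float *y, int n, float a)
{
    for (int i = 0; i < n; ++i)
        y[i] = (float)acc[i] * a;
}
```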
In a possible implementation manner of the embodiment of the present application, the NPU210 may access the memory inside the DSP230 through a bus, and specifically, the NPU210 may further include a master interface of the bus, where the master interface is configured to send a memory copy function memcpy to the DSP230 through the bus to access the memory inside the DSP230, so as to obtain the floating-point type input data stored in the memory inside the DSP 230. Therefore, the input data stored in the memory inside the DSP can be effectively read, and the NPU can be ensured to effectively execute the calculation process. Moreover, the memory in the DSP is visible to the DSP and the NPU at the same time, and the memory in the DSP is accessed through the bus, so that the problem of data consistency can be avoided.
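As a hedged illustration of the master-interface access just described (buffer names and sizes are hypothetical), the read amounts to a memcpy-style copy from the DSP-internal memory, which is visible to both sides and therefore needs no cache-coherency handling:

```c
#include <stddef.h>
#include <string.h>

/* Copy floating-point input data from the DSP-internal memory into a local
 * NPU buffer over the bus; both sides see the same memory, so no cache
 * coherency protocol is required. */
void fetch_input_from_dsp(float *npu_buf, const float *dsp_internal_mem, size_t n)
{
    memcpy(npu_buf, dsp_internal_mem, n * sizeof(float));
}
```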
In a possible implementation of the embodiment of the present application, when the operation unit 212 performs a convolution operation, the quantization unit 211 may be configured to perform a float-to-short conversion on the floating-point input data, so that the operation unit 212 can perform the convolution operation on the converted short-type input data. In this way, the quantization process is simplified to a float-to-short fixed-point conversion, which preserves the precision of the convolution while reducing the computational cost of quantization.
The floating-point input data may be stored in a memory inside the DSP 230.
In a possible implementation of the embodiment of the present application, the NPU 210 may be connected to a RAM through a high-speed access interface, and the short-type input data output by the quantization unit 211 can be transferred into this RAM, so that in the subsequent calculation the operation unit 212 can efficiently fetch the short-type input data from the RAM and perform the convolution operation on it. That is, in the present application the short-type input data output by the quantization unit 211 is stored in the RAM.
Here, the RAM is the NPU's own RAM, abbreviated NPURAM.
In order to clearly illustrate how the convolution operation is performed on short types of input data in the above embodiments of the present application, the present application provides another processing apparatus.
Fig. 3 is a schematic structural diagram of a processing apparatus according to a third embodiment of the present application.
As shown in fig. 3, the processing means may include: the NPU310, the PSRAM320, and the DSP330, wherein the NPU310 includes a quantization unit 311 and an operation unit 312, and the operation unit 312 includes a first register 3121, a second register 3122, and an accumulator 3123.
The DSP330 is used for storing floating-point input data in an internal memory; and stores the result of the operation of the NPU310 on the input data.
The PSRAM320 is used for storing network parameters of the neural network.
The quantization unit 311 is configured to obtain the floating-point input data and perform a float-to-short conversion on it, so that the convolution operation can be performed on the converted short-type input data.
The NPU 310 is connected to the RAM through a high-speed access interface, and the short-type input data is transferred into the RAM.
A first register 3121 for reading the short type of input data from the RAM in the first cycle.
The second register 3122 is configured to, at a plurality of subsequent cycles after the first cycle, read at least a portion of the network parameters stored in the PSRAM320, and perform a dot-product operation on at least a portion of the network parameters read in each cycle and corresponding input data in the first register 3121.
And the accumulator 3123 is configured to obtain a result of the dot product operation, and accumulate the result according to the result of the dot product operation to obtain an operation result of the convolution operation.
For example, denote the network parameters as weight. The network parameters can be divided into weight blocks (8 in this example), each of which is read over the bus and used in the convolution operation together with the short-type input data. While the operation unit performs the convolution with the weight block obtained in a given cycle, it can already read the next weight block, so the reading/loading of network parameters and the convolution calculation proceed in parallel and the efficiency of the convolution calculation is improved.
For example, denote the input data as I and the network parameters of the neural network as W, and let the input data be 128 bytes. The first 4 bytes of the input data, I[0..3], can be read in the first cycle; from the second cycle to the thirty-third cycle, 32 cycles' worth of network parameters, that is, 128 bytes of network parameters, are read. As shown in fig. 4, the first 4 bytes of the input data are dot-multiplied with the 128 bytes of network parameters, and the accumulators ACC accumulate the dot-product results of the 32 cycles.
For example, the output of ACC1 in fig. 4 is: W[3]×I[3] + W[2]×I[2] + W[1]×I[1] + W[0]×I[0]; similarly, the output of ACC2 is: W[7]×I[3] + W[6]×I[2] + W[5]×I[1] + W[4]×I[0]; and so on, so that the output of ACC32 is: W[127]×I[3] + W[126]×I[2] + W[125]×I[1] + W[124]×I[0].
Then 4 more bytes of the input data, I[4..7], and another 32 cycles of network parameters are read, the dot products are computed and sent to the accumulators for accumulation, and so on until all bytes of the input data have been consumed, that is, until all bytes of the input data have participated in the operation, at which point the matrix operation ends.
Therefore, the convolution operation can be executed by using the read network parameters in the loading or reading process of the network parameters, the parallelism of the network parameter reading/loading and the convolution calculation can be realized, and the convolution calculation efficiency is improved.
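A functional C model of the register/accumulator flow in fig. 4 follows. The hardware computes the 32 dot products of a cycle in parallel, and the array layout used here is an assumption made purely for illustration.

```c
#include <stdint.h>

#define IN_BYTES   128  /* length of the input data I in the example     */
#define GROUPS      32  /* accumulators ACC1..ACC32                      */
#define GROUP_SIZE   4  /* 4 weights dotted with the 4 buffered inputs   */

/* For each round, 4 input elements (held in the first register) are
 * dot-multiplied with 32 groups of 4 weights (streamed into the second
 * register over 32 cycles) and the products are added into the accumulators. */
void conv_accumulate(const int16_t in[IN_BYTES],
                     const int8_t w[IN_BYTES / GROUP_SIZE][GROUPS * GROUP_SIZE],
                     int32_t acc[GROUPS])
{
    for (int g = 0; g < GROUPS; ++g)
        acc[g] = 0;

    for (int base = 0; base < IN_BYTES; base += GROUP_SIZE) {  /* I[0..3], I[4..7], ... */
        int round = base / GROUP_SIZE;
        for (int g = 0; g < GROUPS; ++g) {
            int32_t dot = 0;
            for (int k = 0; k < GROUP_SIZE; ++k)   /* e.g. ACC1 += W[3]*I[3]+...+W[0]*I[0] */
                dot += (int32_t)w[round][g * GROUP_SIZE + k] * (int32_t)in[base + k];
            acc[g] += dot;
        }
    }
}
```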
In a possible implementation manner of the embodiment of the present application, when the NPU is applied to a voice chip, in order to further reduce a processing load of a core in the voice chip, the NPU may further include a high-performance activation unit, and the activation unit activates an operation result of the convolution operation. Specifically, the operation result of the convolution operation may be sent to a memory of the DSP for storage, the activation unit may access the memory inside the DSP through the bus, obtain the operation result of the convolution operation stored by the DSP, activate by using an activation function according to the operation result of the convolution operation, and provide the activation result to the DSP for storage, so that the subsequent operation may be executed by software of the DSP.
As an example, the DSP is a HiFi (High Fidelity) DSP. The processing device may have the structure shown in fig. 5: the NPU includes a bus master interface through which it can access the memory inside the HiFi DSP, and the NPU further includes a high-speed access interface (128 bytes/cycle) through which the NPURAM is connected.
By storing the floating-point input data, the result of the matrix vector operation, and the result of the convolution operation (in floating-point format) in the internal memory of the HiFi DSP, the HiFi DSP does not need Cache consistency with the NPU; that is, there is no need to modify the Cache structure or to add a coherency bus, so the hardware design can be simplified.
In terms of computing power, the NPU internally provides 128 8x8 multiply-add operations and supports three matrix operation modes: 4x32, 8x16 and 16x8. It is also compatible with 64 16x8 multiply-add operations and supports three convolution operation modes: 2x32, 4x16 and 8x8. Here, 4x32 means that 128 elements are divided into 32 groups; the 4 elements of each group are dot-multiplied with 4 elements of the input data, and the dot-product results are sent to 32 accumulators. If the vector dimension of the input data is N, N/4 cycles are required to complete a 1xN by Nx32 matrix operation; the 8x16 and 16x8 modes are similar.
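The cycle counts implied by these modes can be stated directly; the helper below is a worked example that simply restates the N/4, N/8, N/16 relationship.

```c
/* Cycles for a 1xN by NxM matrix operation: the 4x32 mode consumes 4 input
 * elements per cycle, 8x16 consumes 8, and 16x8 consumes 16.
 * Example: N = 64 in 4x32 mode -> 64 / 4 = 16 cycles. */
static inline int matvec_cycles(int n, int elems_per_cycle)
{
    return n / elems_per_cycle;
}
```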
The matrix operation, that is, the matrix vector operation, quantizes the input data (input vector) to 8 bits and performs an 8-bit by 8-bit vector-matrix operation; the matrix operation result is then multiplied by the quantization scale value of the input data (the second parameter). The network parameters (weight) of the neural network also need to be quantized; this quantization can be completed by the software of the HiFi DSP, that is, the scaling coefficient and offset coefficient (Scale and Bias values) of weight are handled by the HiFi DSP software, because the computation involved is relatively small. In the above operation, for an 8x8 matrix calculation of 64x64 elements, quantization takes about 30% of the computing power, the 8x8 matrix takes about 67%, and the multiplication by the scale value takes about 3%. The quantization share is high mainly because converting a floating-point value to a short fixed-point value requires judging the sign bit of the floating-point value and then adding or subtracting 0.5 before converting to an int8 integer, and the HiFi DSP has no dedicated acceleration instruction for this, so it can only be executed one element at a time. With the hardware acceleration described above, a dedicated circuit can be used, that is, the matrix operation is performed by the NPU, and the share of this part can be reduced from 30% to 5%. Combined with the matrix operation itself, 8 multiply-add operations per cycle are raised to 128 multiply-add operations, which greatly improves calculation efficiency.
For the convolution operation, 16-bit inputs are used, so the quantization process is simplified to a floating-point (×1024) to short fixed-point conversion. The original quantization process finds the maximum absolute value absmax of the input data (input vector), divides all values by it, and multiplies by 127; this calculation needs three steps, whereas the float-to-short fixed-point conversion corresponds only to the third step. The precision of the convolution is thus preserved while the computational overhead of quantization is reduced (the original quantization process cannot be computed in parallel).
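The contrast between the original three-step quantization and the simplified conversion can be sketched as follows; the fixed 1024 scale is one reading of the "×1024" in the description above and is marked as an assumption in the code.

```c
#include <math.h>
#include <stdint.h>

/* Original 3-step quantization: (1) find absmax, (2) divide by absmax,
 * (3) scale to the 8-bit range and round. */
static void quantize_3step(const float *x, int8_t *q, int n)
{
    float absmax = 0.0f;
    for (int i = 0; i < n; ++i)
        absmax = fmaxf(absmax, fabsf(x[i]));
    if (absmax == 0.0f)
        absmax = 1.0f;                 /* avoid division by zero for all-zero input */
    float scale = 127.0f / absmax;
    for (int i = 0; i < n; ++i)
        q[i] = (int8_t)lrintf(x[i] * scale);
}

/* Simplified path for the 16-bit convolution input: a single float-to-short
 * fixed-point conversion (1024 shown here as an assumed fixed scale). */
static void float_to_short(const float *x, int16_t *q, int n)
{
    for (int i = 0; i < n; ++i)
        q[i] = (int16_t)lrintf(x[i] * 1024.0f);
}
```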
The NPU also has a high-performance activation unit that implements sigmoid/tanh/log/exp and similar operations with precision close to a single-precision floating-point math library, completing the calculation for one element in one cycle. This greatly reduces computation time compared with computing these functions on the HiFi DSP, which needs roughly 400-1000 cycles per element.
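For reference, these are the functions the activation unit evaluates (standard definitions only; the hardware implementation, which produces one result per cycle, is not described further here).

```c
#include <math.h>

/* The activation unit evaluates these functions in one cycle per element,
 * versus roughly 400-1000 HiFi DSP cycles per element in software. */
static float act_sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }
static float act_tanh(float x)    { return tanhf(x); }
static float act_log(float x)     { return logf(x); }
static float act_exp(float x)     { return expf(x); }
```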
The dedicated quantization unit reduces the time overhead of quantization, so that calculation efficiency can be improved up to the limit imposed by the memory.
On the premise of not losing performance, the size of the on-chip SRAM (Static Random Access Memory) can be reduced as much as possible. Compared with related-art voice chips, more than 1 MB of storage is placed in the PSRAM, whose bandwidth is only 166 MB/s. If the model in the PSRAM is invoked once every 10 ms, merely reading that 1 MB already occupies about 60% of the theoretical bandwidth, and when the calculation efficiency is 80%, the occupancy rises to about 75%. Therefore, models that are invoked less frequently should be placed in the PSRAM first, for example a model that is invoked once every 30 ms. In addition, calculation is performed while data is being loaded, and model layers are buffered on-chip (Layer-level buffering) to reduce repeated loading. When NPU hardware acceleration is used, the network parameters can be loaded into the on-chip RAM while the calculation proceeds fully in parallel, removing the restriction of computing only after loading, so that bandwidth utilization is maximized, which a HiFi DSP system alone cannot achieve. Therefore, in the present application, loading and calculation are parallelized in hardware: the NPU loads the network parameters from the PSRAM and performs the matrix operation at the same time.
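The bandwidth figures quoted above can be reproduced with simple arithmetic; this is only a worked check of the quoted numbers, and the 75% figure is read here as the 60% load share divided by an 80% duty factor.

```c
#include <stdio.h>

int main(void)
{
    const double psram_bw = 166.0;          /* MB/s, quoted PSRAM bandwidth   */
    const double load_mb  = 1.0;            /* model size read per invocation */
    const double period_s = 0.010;          /* one invocation every 10 ms     */
    double load_rate = load_mb / period_s;  /* = 100 MB/s average load rate   */

    printf("bandwidth occupancy: %.0f%%\n",
           100.0 * load_rate / psram_bw);                  /* ~60% */
    printf("at 80%% calculation efficiency: %.0f%%\n",
           100.0 * load_rate / (psram_bw * 0.8));          /* ~75% */
    return 0;
}
```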
The hardware reads 128 bytes of the on-chip RAM in each cycle, a bandwidth 16 times the 64-bit access of the HiFi DSP. The input path described above involves a quantization step or a float-to-short conversion step; because of the area of the NPU hardware acceleration unit, these two steps cannot be given 128 parallel units, so a 128-byte read rate is not needed for them. In the end, a 64-bit bus read bandwidth was settled on, with 2 execution units. Therefore, the floating-point input data (input vectors) need to be stored in the core of the HiFi DSP (that is, its internal memory), and the results of the matrix operation and the convolution operation (in floating-point format) also need to be stored back into the core of the HiFi DSP. As a result, the HiFi DSP does not need to be designed for Cache consistency with the NPU, which greatly simplifies the design. With this structure of the processing device, the NPU is used for the computation-intensive part, while the HiFi DSP performs general computation and voice-signal processing, achieving optimal calculation efficiency for various voice tasks together with parallel calculation and loading.
The processing device of the embodiment of the application realizes matrix calculation and/or convolution calculation by adopting the special NPU, and when the NPU is applied to a voice chip, the processing burden of a core in the voice chip can be reduced, and the processing efficiency of the core in the voice chip is improved.
In order to implement the above embodiments, the present application also provides a processing method of a neural network.
Fig. 6 is a schematic flowchart of a processing method of a neural network according to a fifth embodiment of the present application.
The embodiment of the application relates to a processing method of a neural network, which is applied to a processing device, wherein the processing device comprises an NPU, a PSRAM and a DSP which are connected through a bus.
As shown in fig. 6, the processing method of the neural network may include the steps of:
in step 601, the NPU accesses the internal memory of the DSP through the bus to read and obtain the input data to be processed.
In step 602, the NPU accesses the PSRAM via the bus to obtain at least a portion of the network parameters.
And 603, the NPU performs at least one of matrix vector operation and convolution operation on the input data according to at least part of the read network parameters, and synchronously continuously reads the rest network parameters in the PSRAM.
In step 604, the DSP stores the result of the NPU operation on the input data.
In a possible implementation manner of the embodiment of the application, the input data stored by the DSP is a floating point type, the NPU includes a quantization unit and an arithmetic unit, the floating point type input data is obtained by the quantization unit, the quantized input data is obtained by quantizing the floating point type input data, and the quantized input data is provided to the arithmetic unit; performing matrix vector operation and/or convolution operation on the quantized input data through an operation unit to obtain an operation result of the input data; and carrying out inverse quantization on the operation result output by the operation unit through the quantization unit to obtain an inverse quantization result.
In a possible implementation of the embodiment of the application, when the operation unit performs a matrix vector operation, the quantization unit obtains a first parameter for quantization and a second parameter for inverse quantization according to the floating-point input data stored in the memory inside the DSP, multiplies each floating-point value to be quantized by the first parameter, rounds the result and converts it to the numerical type to obtain numerical input data, and sends the numerical input data to the operation unit; the operation unit performs the matrix vector operation on the numerical input data to obtain an operation result; and the quantization unit converts the operation result to floating point, multiplies the floating-point operation result by the second parameter, and sends it to the memory of the DSP for storage.
As a possible implementation, the NPU further includes a master interface of the bus; and the main interface is used for sending the memory copy function to the DSP through the bus so as to access the memory in the DSP and obtain the floating-point input data stored in the memory in the DSP.
In another possible implementation manner of the embodiment of the present application, when the operation unit performs a convolution operation, the quantization unit performs a conversion operation from a floating point type to a short type on the input data of the floating point type; and the operation unit executes convolution operation on the converted short input data to obtain an operation result.
As a possible implementation, the NPU is connected to the RAM through a high-speed access interface, and the short-type input data is transferred into the RAM.
As a possible implementation manner, the arithmetic unit includes a first register, a second register, and an accumulator; the first register reads the input data of the short type from the RAM in a first period; the second register reads at least part of network parameters stored in the PSRAM in a plurality of subsequent cycles after the first cycle, and performs dot product operation on at least part of network parameters read in each cycle and corresponding input data in the first register; the accumulator obtains the result of dot product operation, and accumulates according to the result of dot product operation to obtain the operation result of convolution operation.
As a possible implementation, the NPU further includes an activation unit; the activation unit applies an activation function to the operation result of the convolution operation stored in the DSP, and provides the activation result to the DSP for storage.
It should be noted that the explanation of the processing apparatus in any of the foregoing embodiments is also applicable to the embodiment, and the implementation principle thereof is similar, and is not described herein again.
In the processing method of the neural network according to the embodiment of the application, the NPU accesses the memory inside the DSP through the bus to read input data to be processed, and accesses the PSRAM through the bus to obtain at least part of network parameters, to perform at least one of matrix vector operation and convolution operation on the input data according to the read at least part of network parameters, and to synchronously continue to read the rest of network parameters in the PSRAM. Therefore, the data in the PSRAM can be read/loaded, and the calculation process can be executed by using the read/loaded data, namely, the data reading/loading and the calculation can be realized in parallel, so that the calculation efficiency can be improved.
To implement the above embodiments, the present application also provides an electronic device, which may include at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the processing method of the neural network proposed in any one of the embodiments described above.
In order to achieve the above embodiments, the present application also provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the processing method of a neural network proposed in any one of the above embodiments of the present application.
In order to implement the above embodiments, the present application further provides a computer program product, which includes a computer program that, when being executed by a processor, implements the processing method of the neural network proposed in any of the above embodiments of the present application.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 7 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 7, the device 700 includes a computing unit 701, which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 702 or a computer program loaded from a storage unit 708 into a RAM (Random Access Memory) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An I/O (Input/Output) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing Unit 701 include, but are not limited to, a CPU (Central Processing Unit), a GPU (graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing Units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable Processor, controller, microcontroller, and the like. The calculation unit 701 executes the respective methods and processes described above, such as the processing method of the neural network described above. For example, in some embodiments, the processing methods of the neural network described above may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the processing method of the neural network described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the processing method of the neural network described above.
Various implementations of the systems and techniques described here above may be realized in digital electronic circuitry, Integrated circuitry, FPGAs (Field Programmable Gate arrays), ASICs (Application-Specific Integrated circuits), ASSPs (Application Specific Standard products), SOCs (System On Chip, System On a Chip), CPLDs (Complex Programmable Logic devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the drawbacks of difficult management and poor service scalability found in conventional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that artificial intelligence is the discipline of enabling computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it covers both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
According to the technical solution of the embodiments of the present application, the NPU accesses the internal memory of the DSP through the bus to read the input data to be processed, accesses the PSRAM through the bus to obtain at least part of the network parameters, performs at least one of a matrix-vector operation and a convolution operation on the input data according to the network parameters already read, and synchronously continues to read the remaining network parameters from the PSRAM. In this way, data in the PSRAM can be read/loaded while computation proceeds on the data already read/loaded; that is, data reading/loading and computation run in parallel, which improves computational efficiency.
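To make the parallelism concrete, the following C sketch (not part of the patent, and offered only as an illustrative software model) computes a matrix-vector product tile by tile while the next tile of weights is fetched into a spare buffer. The names psram_weights, load_tile, npu_matvec_tile, and TILE_ROWS, as well as the double-buffering scheme itself, are assumptions; the memcpy stands in for a bus/DMA burst read from the PSRAM which, on real hardware, would run concurrently with the NPU's multiply-accumulate logic rather than sequentially as in this single-threaded simulation.

#include <string.h>
#include <stdio.h>

#define IN_DIM    64   /* input vector length */
#define OUT_DIM   32   /* output vector length */
#define TILE_ROWS 8    /* rows of the weight matrix fetched per burst */

/* Stand-in for the network parameters resident in the PSRAM. */
static float psram_weights[OUT_DIM][IN_DIM];

/* Two on-chip tile buffers: the next burst is fetched into one
 * while the other is being multiplied (double buffering). */
static float tile_buf[2][TILE_ROWS][IN_DIM];

/* Stand-in for a PSRAM burst read: copy TILE_ROWS rows starting at 'row'. */
static void load_tile(int row, float dst[TILE_ROWS][IN_DIM]) {
    memcpy(dst, psram_weights[row], sizeof(float) * TILE_ROWS * IN_DIM);
}

/* Multiply one tile of the weight matrix by the input vector. */
static void npu_matvec_tile(float w[TILE_ROWS][IN_DIM], const float *x, float *y) {
    for (int r = 0; r < TILE_ROWS; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < IN_DIM; ++c)
            acc += w[r][c] * x[c];
        y[r] = acc;
    }
}

int main(void) {
    float x[IN_DIM], y[OUT_DIM];
    for (int i = 0; i < IN_DIM; ++i) x[i] = 0.01f * i;            /* dummy input   */
    for (int r = 0; r < OUT_DIM; ++r)
        for (int c = 0; c < IN_DIM; ++c)
            psram_weights[r][c] = 0.001f * (r + c);               /* dummy weights */

    int cur = 0;
    load_tile(0, tile_buf[cur]);                       /* fetch the first tile */
    for (int row = 0; row < OUT_DIM; row += TILE_ROWS) {
        int nxt = cur ^ 1;
        if (row + TILE_ROWS < OUT_DIM)
            load_tile(row + TILE_ROWS, tile_buf[nxt]); /* keep reading the remaining parameters */
        npu_matvec_tile(tile_buf[cur], x, &y[row]);    /* compute on the tile already read */
        cur = nxt;
    }
    printf("y[0]=%f y[%d]=%f\n", y[0], OUT_DIM - 1, y[OUT_DIM - 1]);
    return 0;
}

In this model the fetch and the multiply are interleaved in one thread; the point is only the buffering pattern that lets the read of the remaining parameters overlap the computation on the parameters already loaded.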
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, without limitation herein, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (19)

1. A processing apparatus, comprising: a neural network processing unit NPU, a pseudo static random access memory PSRAM, and a digital signal processor DSP which are connected through a bus;
the DSP is used for storing input data to be processed in an internal memory; and storing the operation result of the NPU on the input data;
the PSRAM is used for storing network parameters of the neural network;
the NPU is used for accessing a memory inside the DSP through the bus to read and obtain the input data to be processed, and accessing the PSRAM through the bus to obtain at least part of the network parameters; and performing at least one of a matrix-vector operation and a convolution operation on the input data according to the at least part of the network parameters that has been read, while synchronously continuing to read the remaining network parameters in the PSRAM.
2. The processing apparatus according to claim 1, wherein the DSP stores input data of a floating point type, and the NPU comprises:
a quantization unit, which is used for acquiring the floating-point input data, quantizing the floating-point input data to obtain quantized input data, providing the quantized input data to an operation unit, and performing inverse quantization on an operation result output by the operation unit to obtain an inverse quantization result;
and the operation unit, which is used for performing a matrix-vector operation and/or a convolution operation on the quantized input data to obtain the operation result of the input data.
3. The processing apparatus according to claim 2, wherein the operation unit is configured to perform a matrix-vector operation, and the quantization unit is configured to:
obtaining a first parameter for quantization and a second parameter for inverse quantization according to floating point type input data stored in a memory inside the DSP;
multiplying a floating-point value to be quantized in the floating-point input data by the first parameter, rounding the product to an integer, and converting it to a numerical type so as to obtain numerical-type input data;
sending the numerical-type input data to the operation unit;
converting the operation result obtained by the operation unit into a floating point type;
and multiplying the floating-point type operation result by the second parameter and then sending the result to a memory of the DSP for storage.
4. The processing apparatus of claim 3, wherein the NPU further comprises a master interface of the bus;
and the main interface is used for sending a memory copy function to the DSP through the bus so as to access the memory inside the DSP and obtain the floating-point input data stored in the memory inside the DSP.
5. The processing apparatus according to claim 2, wherein the operation unit is configured to perform a convolution operation, and the quantization unit is configured to:
performing a conversion operation of converting the floating-point input data into a short type, so that the convolution operation is performed on the converted short-type input data.
6. The processing apparatus of claim 5, wherein the processing apparatus further comprises a random access memory RAM connected to the NPU through a high-speed access interface;
and the RAM is used for storing the short-type input data transferred into the RAM.
7. The processing apparatus of claim 6, wherein the arithmetic unit comprises a first register, a second register, and an accumulator;
the first register is used for reading the short-type input data from the RAM in a first cycle;
the second register is used for reading at least part of network parameters in the PSRAM in a plurality of subsequent cycles after the first cycle, and performing dot product operation on the at least part of network parameters read in each cycle and corresponding input vectors in the first register;
and the accumulator is used for acquiring the result of the dot product operation and accumulating according to the result of the dot product operation to obtain the operation result of the convolution operation.
8. The processing apparatus of any of claims 1-7, wherein the NPU comprises:
an activation unit, which is used for performing activation by using an activation function according to the operation result of the convolution operation stored in the DSP, and providing the activation result to the DSP for storage.
9. A processing method of a neural network is applied to a processing device, wherein the processing device comprises a neural network processing unit NPU, a pseudo static random access memory PSRAM and a digital signal processor DSP which are connected through a bus; the processing method comprises the following steps:
the NPU accesses a memory inside the DSP through the bus to read and obtain input data to be processed;
the NPU accesses the PSRAM through the bus to obtain at least part of network parameters;
the NPU executes at least one of matrix vector operation and convolution operation on the input data according to the read at least part of network parameters, and synchronously continuously reads the rest of network parameters in the PSRAM;
and the DSP stores the operation result of the NPU on the input data.
10. The method of claim 9, wherein the DSP stores input data of a floating point type, and the NPU includes a quantization unit and an operation unit;
the quantization unit acquires the floating-point input data, quantizes the floating-point input data to obtain quantized input data, and provides the quantized input data to the operation unit;
the operation unit performs a matrix-vector operation and/or a convolution operation on the quantized input data to obtain an operation result of the input data;
and the quantization unit performs inverse quantization on the operation result output by the operation unit to obtain an inverse quantization result.
11. The method of claim 10, wherein,
the quantization unit obtains a first parameter for quantization and a second parameter for inverse quantization according to the floating-point input data stored in the memory inside the DSP, multiplies a floating-point value to be quantized in the floating-point input data by the first parameter, rounds the product to an integer and converts it to a numerical type so as to obtain numerical-type input data, and sends the numerical-type input data to the operation unit;
the operation unit performs a matrix-vector operation on the numerical-type input data to obtain an operation result;
and the quantization unit converts the operation result into a floating point type, multiplies the floating-point operation result by the second parameter, and then sends the result to the memory of the DSP for storage.
12. The method of claim 11, wherein the NPU further comprises a master interface of the bus;
and the main interface is used for sending a memory copy function to the DSP through the bus so as to access the memory inside the DSP and obtain the floating-point input data stored in the memory inside the DSP.
13. The method of claim 10, wherein,
the quantization unit performs a conversion operation of converting the floating-point input data into a short type;
and the operation unit performs a convolution operation on the converted short-type input data to obtain the operation result.
14. The method of claim 13, wherein the NPU is connected to a random access memory RAM through a high speed access interface;
and the RAM is used for storing the short-type input data transferred into the RAM.
15. The method of claim 14, wherein the arithmetic unit comprises a first register, a second register, and an accumulator;
the first register reads the short-type input data from the RAM in a first cycle;
the second register reads at least part of network parameters in the PSRAM in a plurality of subsequent cycles after the first cycle, and performs dot product operation on the at least part of network parameters read in each cycle and corresponding input vectors in the first register;
and the accumulator acquires the dot product operation result and accumulates according to the dot product operation result to obtain the operation result of the convolution operation.
16. The method of any of claims 9-15, wherein the NPU includes an activation unit, the method further comprising:
and the activation unit performs activation by using an activation function according to the operation result of the convolution operation stored in the DSP, and provides the activation result to the DSP for storage.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the neural network processing method of any one of claims 9-16.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the processing method of the neural network according to any one of claims 9 to 16.
19. A computer program product comprising a computer program which, when executed by a processor, implements a method of processing a neural network as claimed in any one of claims 9 to 16.
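For readers who want a concrete, software-level picture of the quantized data path recited in claims 3, 5 and 7 (scaling floating-point input by a first parameter and rounding it to an integer/short type, accumulating integer dot products in an accumulator, and multiplying the converted result by a second parameter to return to floating point), the following C sketch walks through those steps. It is an illustrative assumption rather than the claimed hardware: the helper choose_scales, the int16_t/int64_t widths, the assumed file name, and the treatment of the weights as already-quantized integers with an implicit scale of 1 are all choices made here for brevity.

/* Compile with: cc quant_dot.c -lm (quant_dot.c is an assumed file name) */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define IN_DIM 16

/* Derive the quantization scale (the "first parameter") from the input range,
 * and its reciprocal as the dequantization scale (the "second parameter"). */
static void choose_scales(const float *x, int n, float *q, float *dq) {
    float max_abs = 1e-6f;
    for (int i = 0; i < n; ++i)
        if (fabsf(x[i]) > max_abs) max_abs = fabsf(x[i]);
    *q  = 32767.0f / max_abs;  /* float -> short scale */
    *dq = 1.0f / *q;           /* short -> float scale */
}

int main(void) {
    float x_f[IN_DIM];
    int16_t x_q[IN_DIM];   /* "short" input, as held in the first register  */
    int16_t w_q[IN_DIM];   /* quantized weights, as streamed from the PSRAM */
    for (int i = 0; i < IN_DIM; ++i) {
        x_f[i] = 0.1f * (i - 8);       /* dummy floating-point input  */
        w_q[i] = (int16_t)(i + 1);     /* dummy pre-quantized weights */
    }

    float q_scale, dq_scale;
    choose_scales(x_f, IN_DIM, &q_scale, &dq_scale);

    /* Quantize: multiply by the first parameter and round to an integer type. */
    for (int i = 0; i < IN_DIM; ++i)
        x_q[i] = (int16_t)lrintf(x_f[i] * q_scale);

    /* Integer dot product; partial products are summed in a wide accumulator. */
    int64_t acc = 0;
    for (int i = 0; i < IN_DIM; ++i)
        acc += (int32_t)x_q[i] * (int32_t)w_q[i];

    /* Dequantize: convert to floating point and multiply by the second parameter
     * (the weight scale is taken as 1 here for brevity). */
    float y = (float)acc * dq_scale;
    printf("dot = %f\n", y);
    return 0;
}

Splitting the work this way is the usual motivation for quantizing at the NPU boundary: the bulk of the multiply-accumulate traffic stays in narrow integer arithmetic, and only the final result is converted back to the floating-point representation kept in the DSP's memory.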
CN202110679305.XA 2021-06-18 2021-06-18 Processing device, neural network processing method and device Active CN113570034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110679305.XA CN113570034B (en) 2021-06-18 2021-06-18 Processing device, neural network processing method and device

Publications (2)

Publication Number Publication Date
CN113570034A true CN113570034A (en) 2021-10-29
CN113570034B CN113570034B (en) 2022-09-27

Family

ID=78162274

Country Status (1)

Country Link
CN (1) CN113570034B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022468A (en) * 2016-05-17 2016-10-12 成都启英泰伦科技有限公司 Artificial neural network processor integrated circuit and design method therefor
KR20180123846A (en) * 2017-05-10 2018-11-20 울산과학기술원 Logical-3d array reconfigurable accelerator for convolutional neural networks
US20190340492A1 (en) * 2018-05-04 2019-11-07 Microsoft Technology Licensing, Llc Design flow for quantized neural networks
CN111788567A (en) * 2018-08-27 2020-10-16 华为技术有限公司 Data processing equipment and data processing method
CN111523652A (en) * 2019-02-01 2020-08-11 阿里巴巴集团控股有限公司 Processor, data processing method thereof and camera device
CN111553472A (en) * 2019-02-08 2020-08-18 三星电子株式会社 Memory device
CN110135565A (en) * 2019-05-20 2019-08-16 上海大学 Realize the assessment system of performance on the integrated for neural network algorithm
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 Convolutional neural networks accelerator and its control method based on RISC-V framework
CN112528219A (en) * 2019-09-19 2021-03-19 三星电子株式会社 Memory device, operation method thereof and computing equipment
CN112836815A (en) * 2020-05-04 2021-05-25 神亚科技股份有限公司 Processing device and processing method for executing convolution neural network operation
CN212460600U (en) * 2020-07-03 2021-02-02 中用科技有限公司 Data processing system
CN111931917A (en) * 2020-08-20 2020-11-13 浙江大华技术股份有限公司 Forward computing implementation method and device, storage medium and electronic device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANDREY IGNATOV et al.: "AI Benchmark: Running Deep Neural Networks on Android Smartphones", ECCV 2018 *
JEREMY FOWERS et al.: "A Configurable Cloud-Scale DNN Processor for Real-Time AI", 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture *
MEI Zhiwei et al.: "Design of a convolutional neural network acceleration module based on FPGA", Journal of Nanjing University (Natural Science) *

Similar Documents

Publication Publication Date Title
CN113570033B (en) Neural network processing unit, neural network processing method and device
US11551068B2 (en) Processing system and method for binary weight convolutional neural network
US10949736B2 (en) Flexible neural network accelerator and methods therefor
US20230055313A1 (en) Hardware environment-based data quantization method and apparatus, and readable storage medium
CN112561079A (en) Distributed model training apparatus, method and computer program product
CN113205818B (en) Method, apparatus and storage medium for optimizing a speech recognition procedure
CN115880132A (en) Graphics processor, matrix multiplication task processing method, device and storage medium
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
Liu et al. Parallel program design for JPEG compression encoding
CN112771546A (en) Operation accelerator and compression method
CN113554149B (en) Neural network processing unit NPU, neural network processing method and device
CN113570034B (en) Processing device, neural network processing method and device
CN115688917A (en) Neural network model training method and device, electronic equipment and storage medium
CN111813721A (en) Neural network data processing method, device, equipment and storage medium
US20200242467A1 (en) Calculation method and calculation device for sparse neural network, electronic device, computer readable storage medium, and computer program product
CN114647610B (en) Voice chip implementation method, voice chip and related equipment
CN115081607A (en) Reverse calculation method, device and equipment based on embedded operator and storage medium
CN114781618A (en) Neural network quantization processing method, device, equipment and readable storage medium
Hosny et al. Sparse bitmap compression for memory-efficient training on the edge
CN114692824A (en) Quantitative training method, device and equipment of neural network model
CN113392984B (en) Method, apparatus, device, medium and product for training a model
CN115034198B (en) Method for optimizing computation of embedded module in language model
CN116681110B (en) Extremum algorithm configuration method, electronic device, program product and medium
US20240126610A1 (en) Apparatus and method of processing data, electronic device, and storage medium
CN114020476B (en) Job processing method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant