CN116109481A - Scaling method, chip, storage medium and electronic device - Google Patents

Scaling method, chip, storage medium and electronic device

Info

Publication number
CN116109481A
Authority
CN
China
Prior art keywords
data
input data
convolution
processed
offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211541654.6A
Other languages
Chinese (zh)
Inventor
曾飞
林双杰
郑先木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rockchip Electronics Co Ltd
Original Assignee
Rockchip Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rockchip Electronics Co Ltd filed Critical Rockchip Electronics Co Ltd
Priority to CN202211541654.6A priority Critical patent/CN116109481A/en
Publication of CN116109481A publication Critical patent/CN116109481A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure provides a scaling method, a chip, a storage medium, and an electronic device. The method is applied to a neural network processing unit and comprises: reading input data and a weight tensor; configuring at least one convolution kernel according to the weight tensor; performing convolution operations on the input data with the at least one convolution kernel through the multiply-add array of the neural network processing unit; and outputting the result of the convolution operations as scaled data of the input data. With this method, the neural network processing unit itself performs the scaling of the input data, and no additional hardware scaling module needs to be added.

Description

Scaling method, chip, storage medium and electronic device
Technical Field
The disclosure belongs to the field of data processing, relates to techniques for implementing scaling operators, and in particular to a scaling method, a chip, a storage medium and an electronic device.
Background
When processing an image, it is often necessary to scale (resize) it to enhance the visual effect. In addition, many artificial-intelligence applications, such as object segmentation models, also scale intermediate data during inference. Scaling operators are typically implemented with dedicated integrated circuits (ICs), such as application-specific integrated circuits (ASICs).
Disclosure of Invention
The present disclosure provides a scaling method, a chip, a storage medium, and an electronic device that implement a scaling operator on an NPU (neural-network processing unit) to perform scaling processing on data.
In a first aspect, an embodiment of the present disclosure provides a scaling method applied to a neural network processing unit, the scaling method comprising: reading input data and a weight tensor; configuring at least one convolution kernel according to the weight tensor; performing convolution operations on the input data with the at least one convolution kernel through a multiply-add array of the neural network processing unit; and outputting the result of the convolution operations as scaled data of the input data.
In an implementation of the first aspect, performing convolution operations on the input data with the at least one convolution kernel includes: performing padding processing on the input data to generate padding data; combining each part of the padding data corresponding to the at least one convolution kernel with the input data to form data to be processed; and performing convolution operations on the data to be processed with the at least one convolution kernel, writing the results of these convolution operations into a buffer in a cross-offset skip-write manner.
In an implementation of the first aspect, outputting the result of the convolution operations as scaled data of the input data includes: outputting the results stored in the buffer, written via cross-offset skip writes, as the scaled data.
In an implementation of the first aspect, configuring at least one convolution kernel according to the weight tensor includes: configuring convolution weights according to the weight tensor; and configuring Tdx×Tdy convolution kernels, where the size of each convolution kernel is configured as K1×K2 (K1 and K2 being positive integers), the input step sizes of each convolution kernel in the horizontal and vertical directions are configured as Tsx and Tsy respectively, and the output step sizes in the horizontal and vertical directions are configured as Tdx and Tdy respectively, with Tsx and Tdx coprime and Tsx:Tdx = w_sx:w_dx, where w_sx and w_dx are the width of the input data and the width of the output data, respectively.
In an implementation of the first aspect, reading the input data and the weight tensor includes: reading the input data and the weight tensor and storing them in a buffer. The scaling method further comprises: performing padding processing on the input data to generate padding data and storing the padding data in the buffer. Performing convolution operations on the input data with the at least one convolution kernel then includes: reading, from the buffer in a cross-offset skip-read manner, the parts of the input data and padding data corresponding to the at least one convolution kernel as data to be processed; performing convolution operations on the data to be processed with the at least one convolution kernel; and writing the results of the convolution operations into the buffer, or into an additional buffer different from it, in a cross-offset skip-write manner.
In one implementation of the first aspect, the cross-offset skip read is performed by a direct memory read access of the neural network processing unit and the cross-offset skip write is performed by a direct memory write access of the neural network processing unit.
In an implementation of the first aspect, the offset skip amplitude of the cross-offset skip read in the horizontal direction is Tsx and that of the cross-offset skip write is Tdx, where Tsx and Tdx are coprime and Tsx:Tdx = w_sx:w_dx, with w_sx and w_dx being the width of the input data and the width of the output data, respectively; and the offset skip amplitude of the cross-offset skip read in the vertical direction is Tsy and that of the cross-offset skip write is Tdy, where Tsy and Tdy are coprime and Tsy:Tdy = h_sy:h_dy, with h_sy and h_dy being the height of the input data and the height of the output data, respectively.
In an implementation of the first aspect, performing convolution operations on the data to be processed with the at least one convolution kernel includes: performing convolution operations on two or more pieces of data to be processed in parallel using the multiply-add array.
In an implementation of the first aspect, the multiply-add array of the neural network processing unit performs neural-network acceleration tasks in parallel while performing the convolution operations on the input data.
In a second aspect, an embodiment of the present disclosure provides a chip including a neural network processing unit configured to scale input data using the scaling method of any implementation of the first aspect of the present disclosure, to generate scaled data.
In a third aspect, the disclosed embodiments provide a computer-readable storage medium having stored thereon a computer program that is executed to implement a scaling method according to any implementation of the first aspect of the disclosure.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory configured to store a computer program; and a processor configured to invoke the computer program to perform the scaling method according to any implementation of the first aspect of the present disclosure.
As described above, the scaling method provided by the embodiments of the present disclosure implements the scaling operator on the neural network processing unit, so no additional scaling hardware module is required, which helps further reduce chip size. Moreover, because the scaling is carried out by the multiply-add array inside the neural network processing unit, the method is efficient, supports a wide range of scaling parameters, adapts to different scaling algorithms, and has good generality.
Drawings
Fig. 1A is a diagram illustrating an application scenario of a scaling method according to an embodiment of the disclosure.
FIG. 1B shows a flow chart of a scaling method in an embodiment of the present disclosure.
Fig. 1C is a schematic diagram of a bilinear interpolation method in an embodiment of the disclosure.
Fig. 2 is a flow chart illustrating convolution processing of input data in an embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating convolution processing of input data in an embodiment of the present disclosure.
Fig. 4A is a schematic diagram of input data and output data in an embodiment of the disclosure.
Fig. 4B shows a flow chart of a scaling method in an embodiment of the present disclosure.
Fig. 4C to 4K are schematic diagrams illustrating input data and output data in the scaling method according to the embodiment of the disclosure.
Fig. 5 is a schematic structural diagram of a chip according to an embodiment of the disclosure.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
The following describes embodiments of the present disclosure by way of specific examples; other advantages and effects of the disclosure will be readily apparent to those skilled in the art from this description. The disclosure may also be embodied or applied in other specific implementations, and the details herein may be modified or changed from different viewpoints and applications without departing from its spirit. The embodiments below, and the features within them, may be combined with one another where no conflict arises.
It should also be noted that the illustrations provided with the following embodiments merely sketch the basic concepts of the disclosure: the drawings show only the components relevant to the disclosure rather than the actual number, shape, and size of components in an implementation, and the form, quantity, proportion, and layout of components in practice may vary freely and be considerably more complex.
Conventional techniques require additional scaling hardware modules to be added to the chip in order to implement the scaling algorithm, which is detrimental to further chip size reduction. At least in view of this problem, embodiments of the present disclosure provide a scaling method that is capable of implementing a scaling operator through an NPU, thus eliminating the need to add additional scaling hardware modules, which is beneficial to further reducing the chip size. In addition, the scaling operator implementation algorithm realizes scaling by utilizing the MAC multiply-add array in the NPU, has higher efficiency, can support various scaling parameter operations, can adapt to different scaling algorithms, and has good universality.
The technical solutions in the embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings in the embodiments of the present disclosure.
Fig. 1A illustrates a hardware architecture according to an embodiment of the present disclosure; the scaling operator provided by embodiments of the disclosure may be implemented on the architecture shown in fig. 1A. In fig. 1A, the NPU adopts a data-parallel computing architecture for accelerating neural-network operations and improving the chip's efficiency on such workloads. The NPU comprises a MAC multiply-add array, a controller, a buffer, an input circuit, and an output circuit. The MAC multiply-add array, an array of multiply-adders, is the core component that delivers the NPU's compute power; a multiply-adder is an arithmetic unit that can complete multiple multiplications and multiple additions within the same instruction cycle. The input circuit reads data to be processed from outside the NPU. The output circuit outputs the processing results. The controller controls the input circuit, the output circuit, and the MAC multiply-add array so that the NPU carries out the scaling method according to embodiments of the disclosure.
Fig. 1B is a flowchart illustrating a scaling method provided according to an embodiment of the present disclosure. As shown in fig. 1B, the scaling method provided by the embodiment of the present disclosure includes the following steps S11 to S14.
In step S11, the input data and the weight tensor are read. The input data may come from images captured by a camera, or from NPU, CPU, GPU, or RGA computation results, among other sources. The weight tensor represents the interpolation weights and the convolution weights used in step S12. In some embodiments, the input data and the weight tensor are read by the input circuit of the NPU. In some embodiments, the input circuit reads them from outside the NPU by RDMA (read direct memory access).
In step S12, at least one convolution kernel is configured according to the weight tensor. In some embodiments, the convolution weights are configured according to a weight tensor, and convolution kernels having a corresponding size and number are configured. In some embodiments, the controller through the NPU configures the corresponding convolution kernel according to the weight tensor.
In step S13, convolution operations are performed on the input data with the at least one convolution kernel through the MAC multiply-add array of the NPU. This convolution operation is described in detail in later embodiments.
In step S14, the result of the convolution operations is output as scaled data of the input data. In some embodiments, the scaled data may be output to an external memory. In some embodiments, the convolution result is output as scaled data by the output circuit of the NPU. In some embodiments, the output circuit outputs the scaled data of the input data to the outside of the NPU through WDMA (write direct memory access).
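To fix the overall control flow before the details, here is a minimal sketch of steps S11 to S14 in Python. All four callables are illustrative stand-ins, not names from the patent: read_rdma models the input circuit's RDMA read, configure_kernels step S12, mac_convolve one pass of the MAC multiply-add array, and write_wdma the output circuit's WDMA write.

```python
def npu_scale(read_rdma, configure_kernels, mac_convolve, write_wdma):
    """Schematic control flow of steps S11-S14 (hypothetical helper names)."""
    input_data, weight_tensor = read_rdma()                   # step S11
    kernels = configure_kernels(weight_tensor)                # step S12
    results = [mac_convolve(input_data, k) for k in kernels]  # step S13
    write_wdma(results)                                       # step S14: scaled data
```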
The scaling method provided in embodiments of the present disclosure supports multiple scaling modes, e.g., nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, etc. Taking bilinear interpolation as an example, fig. 1C shows a schematic diagram of the method. Points Q11(x1, y1), Q12(x1, y2), Q21(x2, y1), and Q22(x2, y2) are points in the input data. In embodiments of the present disclosure, these four points can be linearly interpolated in the x-direction using equations 1 and 2 below:
f(x, y1) = ((x2 - x) / (x2 - x1)) · f(Q11) + ((x - x1) / (x2 - x1)) · f(Q21)  (Equation 1)
f(x, y2) = ((x2 - x) / (x2 - x1)) · f(Q12) + ((x - x1) / (x2 - x1)) · f(Q22)  (Equation 2)
where f(x, y1) is the interpolation result (i.e., R1) at position (x, y1), f(x, y2) is the interpolation result (i.e., R2) at position (x, y2), and f(Q) is the value at point Q.
Based on the interpolation result in the x direction, the following equation 3 may be used to interpolate in the y direction to obtain a desired estimated value:
f(x, y) = ((y2 - y) / (y2 - y1)) · f(x, y1) + ((y - y1) / (y2 - y1)) · f(x, y2)  (Equation 3)
where f (x, y) is the final interpolation result.
As the above process shows, the interpolation performed by a scaling operator is a sequence of multiply-add operations, and multiply-add operations can be realized by convolution; the interpolation of a scaling operator is therefore equivalent to a number of convolution operations. Based on this principle, embodiments of the present disclosure implement the scaling of the input data through the convolution kernels configured in step S12 and the convolution operations performed in step S13.
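To make this equivalence concrete, a minimal sketch in Python with NumPy (function and variable names are ours, not the patent's) evaluates equations 1-3 and checks that the result equals a single 2×2 multiply-add with one weight per corner point, i.e. one step of a 2×2 convolution:

```python
import numpy as np

def bilinear(f, x1, x2, y1, y2, x, y):
    """Evaluate equations 1-3 above; f maps grid points (xi, yj) to values."""
    r1 = (x2 - x) / (x2 - x1) * f[(x1, y1)] + (x - x1) / (x2 - x1) * f[(x2, y1)]  # eq. 1
    r2 = (x2 - x) / (x2 - x1) * f[(x1, y2)] + (x - x1) / (x2 - x1) * f[(x2, y2)]  # eq. 2
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2                  # eq. 3

# Expanding equations 1-3 yields one weight per corner: a 2x2 kernel.
samples = {(0, 0): 10.0, (1, 0): 20.0, (0, 1): 30.0, (1, 1): 40.0}
x, y = 0.75, 0.25
kernel = np.array([[(1 - x) * (1 - y), x * (1 - y)],
                   [(1 - x) * y,       x * y]])
patch = np.array([[samples[(0, 0)], samples[(1, 0)]],
                  [samples[(0, 1)], samples[(1, 1)]]])
assert np.isclose(bilinear(samples, 0, 1, 0, 1, x, y), np.sum(kernel * patch))
```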
Fig. 2 is a flowchart of performing convolution operations on input data with at least one convolution kernel in an embodiment of the present disclosure. As shown in fig. 2, the convolution processing of the input data in the embodiment of the present disclosure includes the following steps S21 to S23.
In step S21, padding processing is performed on the input data to generate padding data. Padding fills values along the boundary of the input data to enlarge it; the filled values may be, for example, 0 or other values, and the disclosure places no particular limit on them. In some implementations, the padding may be applied to the input data in the first buffer in advance. In other implementations, the padding can be generated on the fly during the convolution operation; in that case the padding data never needs to be stored, there is no read/write traffic for it, and efficiency is higher.
In step S22, each part of the padding data corresponding to the at least one convolution kernel is combined with the input data to form data to be processed.
In step S23, convolution operations are performed on the data to be processed with the at least one convolution kernel, and the results are written into the buffer in a cross-offset skip-write manner. The convolution operations may be carried out by the MAC multiply-add array.
In some embodiments, the cross-offset skip write is performed by the WDMA of the NPU.
In some embodiments, outputting the result of the convolution operations as scaled data of the input data comprises: outputting the results stored in the buffer, written via cross-offset skip writes, as the scaled data.
According to one embodiment of the present disclosure employing a center-point alignment mode, configuring at least one convolution kernel from the weight tensor includes: configuring convolution weights according to the weight tensor; and configuring Tdx×Tdy convolution kernels, each of size K1×K2, where K1 and K2 are positive integers (both 2, for example). The input step sizes of each convolution kernel in the horizontal and vertical directions are configured as Tsx and Tsy respectively, and the output step sizes as Tdx and Tdy respectively. Tsx and Tdx are coprime, Tsy and Tdy are coprime, and Tsx:Tdx = w_sx:w_dx and Tsy:Tdy = h_sy:h_dy, where w_sx and w_dx are the width of the input data and the width of the output data, and h_sy and h_dy are the height of the input data and the height of the output data, respectively.
In one embodiment, if the size of the input data is 4×4 and the size of the output data is 8×8, then w_sx = 4, h_sy = 4, w_dx = 8, and h_dy = 8. From this, Tsx = 1, Tsy = 1, Tdx = 2, and Tdy = 2. Accordingly, Tdx×Tdy = 2×2 = 4 convolution kernels C1, C2, C3, and C4 are configured in this embodiment, each of size 2×2; the input step sizes of each convolution kernel in the horizontal and vertical directions are both 1, the output step sizes in both directions are both 2, and the two directions are handled independently of each other.
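The step sizes above follow mechanically from the size ratios: reducing each ratio by its greatest common divisor yields coprime input/output strides. A small sketch (helper name is ours, not the patent's):

```python
from math import gcd

def stride_params(w_sx, w_dx, h_sy, h_dy):
    """Derive coprime input/output step sizes from the input and output
    sizes, using Tsx:Tdx = w_sx:w_dx and Tsy:Tdy = h_sy:h_dy."""
    gx, gy = gcd(w_sx, w_dx), gcd(h_sy, h_dy)
    tsx, tdx = w_sx // gx, w_dx // gx   # coprime after dividing by the gcd
    tsy, tdy = h_sy // gy, h_dy // gy
    return tsx, tsy, tdx, tdy

# The 4x4 -> 8x8 case above: Tsx = Tsy = 1 and Tdx = Tdy = 2,
# so Tdx x Tdy = 4 convolution kernels are configured.
print(stride_params(4, 8, 4, 8))  # (1, 1, 2, 2)
```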
According to an embodiment of the present disclosure, reading the input data and the weight tensor includes reading them and storing them in a buffer. The scaling method provided by the embodiment further comprises: performing padding processing on the input data to generate padding data, and storing the padding data in the buffer.
Fig. 3 is a flowchart of performing convolution operations on input data with at least one convolution kernel in an embodiment of the present disclosure. As shown in fig. 3, this processing includes the following steps S31 to S33.
In step S31, the parts of the input data and padding data corresponding to the at least one convolution kernel are read from the buffer as data to be processed, in a cross-offset skip-read manner.
In step S32, convolution operations are performed on the data to be processed with the at least one convolution kernel.
In step S33, the results of the convolution operations on the data to be processed are written, in a cross-offset skip-write manner, into the buffer or into an additional buffer different from it.
In some embodiments, the pieces of data to be processed may be handled serially: read the first piece, perform the convolution operation on it, and write the result into the buffer; then read the second piece, convolve it, and write its result into the buffer; and so on until all data to be processed have been computed.
In some embodiments, the pieces may instead be handled in a serial pipeline: read the first piece; convolve the first piece while reading the second; write the first result into the buffer while convolving the second piece and reading the third; write the second result while convolving the third piece and reading the fourth; and so on until all data to be processed have been computed.
In some embodiments, the pieces may also be handled in parallel: read m pieces in parallel, convolve them in parallel, and write the results into the buffer; then read the next n pieces in parallel, convolve them in parallel, and write those results; and so on until all data to be processed have been computed. Here m and n are positive integers and may be equal or different.
In some embodiments, the pieces may be handled in a parallel pipeline (see the sketch below): read m pieces; convolve the m pieces in parallel while reading the next n pieces; write the m results into the buffer while convolving the n pieces in parallel and reading k further pieces; write the n results while convolving the k pieces in parallel and reading j more pieces; and so on until all data to be processed have been computed. Here m, n, k, and j are positive integers and may be equal or different.
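The following sketch models the read/compute/write overlap of the pipelined variants; the function and parameter names are ours, not the patent's. In Python the three stages execute one after another, whereas on the NPU the RDMA read, the MAC computation, and the WDMA write of different pieces overlap in time; for the parallel-pipeline variant, each element of tiles would be a batch of m, n, k, ... pieces.

```python
def pipelined_scale(tiles, read, convolve, write):
    """Three-stage software pipeline over the pieces of data to be
    processed: while piece i is convolved, piece i+1 is already being
    read and the result of piece i-1 is being written back."""
    prev_result = None
    current = read(tiles[0])
    for i in range(len(tiles)):
        nxt = read(tiles[i + 1]) if i + 1 < len(tiles) else None  # stage 1: read ahead
        result = convolve(current)                                # stage 2: compute
        if prev_result is not None:
            write(prev_result)                                    # stage 3: write back
        prev_result, current = result, nxt
    write(prev_result)  # drain: write the final result
```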
In some embodiments, the cross-offset skip read is performed by the RDMA function of the NPU; and/or the cross-offset skip write in the embodiments of the present disclosure is performed by the WDMA function of the NPU.
In some embodiments, the offset skip amplitude of the cross-offset skip read in the horizontal direction is Tsx and that of the cross-offset skip write is Tdx, where Tsx and Tdx are coprime and Tsx:Tdx = w_sx:w_dx, with w_sx and w_dx being the width of the input data and the width of the output data, respectively. The offset skip amplitude of the cross-offset skip read in the vertical direction is Tsy and that of the cross-offset skip write is Tdy, where Tsy and Tdy are coprime and Tsy:Tdy = h_sy:h_dy, with h_sy and h_dy being the height of the input data and the height of the output data, respectively.
Specifically, for input data stored as an N1×M1 matrix, when a piece of data to be processed of size n1×m1 is read in a cross-offset skip-read manner, suppose the coordinates of the point currently being read are (x_1, y_1). If x_1 + Tsx ≤ M1, no line feed is needed and the next point read has coordinates (x_1 + Tsx, y_1); if x_1 + Tsx > M1, a line feed is required and the next point read has coordinates (x_1 + Tsx - M1, y_1 + Tsy). Starting from (x_0, y_0), one piece of data to be processed is obtained by reading n1×m1 points in this manner; thereafter, starting from (x_1, y_1), another piece of data to be processed can be obtained by reading n1×m1 points in the same way.
If the output data is stored as an N2×M2 matrix, then when the convolution result of one piece of data to be processed is written in a cross-offset skip-write manner, suppose the coordinates of the current write point are (x_2, y_2). If x_2 + Tdx ≤ M2, no line feed is needed and the next write point has coordinates (x_2 + Tdx, y_2); if x_2 + Tdx > M2, a line feed is required and the next write point has coordinates (x_2 + Tdx - M2, y_2 + Tdy).
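Both rules are the same pointer-advance step with different strides. A small sketch (function name is ours, not the patent's; coordinates are 1-indexed as in the text above):

```python
def next_point(x, y, step_x, step_y, width):
    """Advance a cross-offset skip pointer by one step, wrapping to the
    next row when the horizontal index would leave the width-column
    matrix. With (step_x, step_y) = (Tsx, Tsy) this is the read rule;
    with (Tdx, Tdy) it is the write rule."""
    if x + step_x <= width:
        return x + step_x, y               # stay on the current row
    return x + step_x - width, y + step_y  # line feed to the next row

# Reading with Tsx = Tsy = 1 from a 4-column matrix:
print(next_point(3, 1, 1, 1, 4))  # (4, 1): still within the row
print(next_point(4, 1, 1, 1, 4))  # (1, 2): wraps to the next row
```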
In some embodiments, performing convolution operations on the data to be processed with the at least one convolution kernel includes performing convolution operations on two or more pieces of data to be processed in parallel using the MAC multiply-add array.
In some embodiments, the MAC multiply-add array of the NPU performs neural-network acceleration tasks in parallel while convolving the input data.
Specifically, the MAC multiply-add array generally has compute power to spare, so convolution operations can be performed on two or more pieces of data to be processed in parallel, and the convolution operations for scaling can likewise run in parallel with neural-network acceleration tasks.
The above process is next illustrated by a specific example. The scaling operator in this example scales 2×2×C input data to 4×4×C output data, where C is a positive integer. The input and output data are shown in fig. 4A. In this example, w_sx = 2, h_sy = 2, w_dx = 4, and h_dy = 4, and therefore Tsx = 1, Tdx = 2, Tsy = 1, and Tdy = 2. Fig. 4B is a flowchart of the scaling processing of the input data in this example; as shown in fig. 4B, it includes the following steps S41 to S48.
In step S41, padding processing is performed on the input data, as shown in fig. 4C. In some implementations, the padding may be applied to the input data in the first buffer in advance. In other implementations, the padding can be generated on the fly during the convolution operation; in that case the padding data never needs to be stored, there is no read/write traffic for it, and efficiency is higher.
In step S42, convolution kernels of size 2×2, with input step size 1 and output step size 2, are configured according to the weight tensor. The convolution weights are determined by the weight tensor; in embodiments of the present disclosure, four different convolution kernels C1, C2, C3, and C4 may be configured from four sets of interpolation weights.
In some embodiments, one possible assignment of the four kernels is: C1 = {0.25×0.25, 0.75×0.25, 0.25×0.75, 0.75×0.75}, C2 = {0.25×0.75, 0.75×0.75, 0.25×0.25, 0.75×0.25}, C3 = {0.75×0.25, 0.25×0.25, 0.75×0.75, 0.25×0.75}, C4 = {0.75×0.75, 0.25×0.75, 0.75×0.25, 0.25×0.25}; the four weights of each kernel sum to 1.
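Transcribing these kernels into NumPy makes the structure visible. The row-major layout {top-left, top-right, bottom-left, bottom-right} used below is our assumption about how the weight lists map onto 2×2 kernels; the sanity check confirms each kernel is a partition of unity, so constant regions survive scaling unchanged:

```python
import numpy as np

# Each entry is (weight along x) * (weight along y) for one sub-pixel phase;
# the 2x2 layout {TL, TR, BL, BR} is assumed, not stated in the patent.
C1 = np.array([[0.25 * 0.25, 0.75 * 0.25],
               [0.25 * 0.75, 0.75 * 0.75]])
C2 = np.array([[0.25 * 0.75, 0.75 * 0.75],
               [0.25 * 0.25, 0.75 * 0.25]])
C3 = np.array([[0.75 * 0.25, 0.25 * 0.25],
               [0.75 * 0.75, 0.25 * 0.75]])
C4 = np.array([[0.75 * 0.75, 0.25 * 0.75],
               [0.75 * 0.25, 0.25 * 0.25]])

for k in (C1, C2, C3, C4):
    assert np.isclose(k.sum(), 1.0)  # each kernel's weights sum to 1
```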
In step S43, the 3×3 block at the upper left of the input data is read as data to be processed in a cross-offset skip-read manner, and is processed with convolution kernel C1 to obtain 2×2 interpolation points, as shown in fig. 4D. The offset skip amplitude of this cross-offset skip read is 1. The 3×3 data read in this step and the interpolation points produced by the convolution may be stored in the same buffer or in different buffers.
In step S44, the interpolation points obtained in step S43 are written into the second buffer in a cross-offset skip-write manner, as shown in fig. 4E. The offset skip amplitude of this cross-offset skip write is 2.
In step S45, the next 3×3 block after the current data to be processed is read as new data to be processed in a cross-offset skip-read manner, and is processed with convolution kernel C2 to obtain new 2×2 interpolation points. Since the offset skip amplitude of the cross-offset skip read in this example is 1, the new data to be processed and the resulting interpolation points are as shown in fig. 4F.
In step S46, the new interpolation points obtained in step S45 are written into the second buffer in a cross-offset skip-write manner. Since the offset skip amplitude of the cross-offset skip write in this example is 2, the new interpolation points of step S45 are written into the second buffer as shown in fig. 4G.
In step S47, the next 3×3 block after the current data to be processed is read as new data to be processed in a cross-offset skip-read manner and processed with convolution kernel C3 to obtain new 2×2 interpolation points, as shown in fig. 4H; the new interpolation points are written into the second buffer, as shown in fig. 4I.
In step S48, the next 3×3 block after the current data to be processed is read as new data to be processed in a cross-offset skip-read manner and processed with convolution kernel C4 to obtain new 2×2 interpolation points, as shown in fig. 4J; the new interpolation points are written into the second buffer, as shown in fig. 4K.
In addition, in this specific example the MAC multiply-add array may convolve two or more pieces of data to be processed simultaneously, i.e., the pieces can be processed in parallel. For example, steps S43 and S47 may be performed in parallel, or steps S43, S45, and S48, and so on. An end-to-end sketch of the whole example follows.
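The following compact sketch reproduces the 2×2 to 4×4 example under stated assumptions: edge-replication padding (the patent leaves the pad values open) and the center-point alignment described earlier; helper names are ours. It computes each output pixel as one 2×2 multiply-add against the padded input, and the fractional offsets take only the values 0.25 and 0.75, so exactly four distinct kernels arise, matching C1-C4; on the NPU, outputs sharing a phase are produced by one kernel with input stride 1 and written with output stride 2 (cross-offset skip write).

```python
import numpy as np

def upscale_2x2_to_4x4(src):
    """Reference computation of the example's 4x4 output (assumed
    edge-replication padding and center-point alignment)."""
    padded = np.pad(src.astype(float), 1, mode='edge')
    out = np.zeros((4, 4))
    for j in range(4):                       # output row
        for i in range(4):                   # output column
            # center-aligned source coordinate, shifted into padded indexing
            sy, sx = j / 2 - 0.25 + 1, i / 2 - 0.25 + 1
            y0, x0 = int(np.floor(sy)), int(np.floor(sx))
            fy, fx = sy - y0, sx - x0        # always 0.25 or 0.75, so only
            kernel = np.array([              # four distinct kernels occur
                [(1 - fx) * (1 - fy), fx * (1 - fy)],
                [(1 - fx) * fy,       fx * fy]])
            out[j, i] = np.sum(kernel * padded[y0:y0 + 2, x0:x0 + 2])
    return out

print(upscale_2x2_to_4x4(np.array([[1.0, 2.0], [3.0, 4.0]])))
```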
The protection scope of the scaling method according to the embodiments of the present disclosure is not limited to the order of the steps listed here; schemes obtained by adding, removing, or replacing steps according to the prior art under the principles of the present disclosure all fall within the protection scope of the present disclosure.
The presently disclosed embodiments also provide a computer-readable storage medium having stored thereon a computer program that is executed to implement the scaling method provided in any of the embodiments according to the present disclosure.
Any combination of one or more storage media may be employed in the present disclosure. The storage medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The embodiment of the disclosure further provides a chip, and fig. 5 shows a schematic structural diagram of the chip in the embodiment of the disclosure. The chip contains an NPU configured to scale input data to generate scaled data using a scaling method provided in any of the embodiments of the present disclosure.
The embodiment of the disclosure further provides an electronic device, and fig. 6 is a schematic structural diagram of an electronic device 600 in the embodiment of the disclosure. As shown in fig. 6, the electronic device 600 in this embodiment includes a memory 610 and a processor 620.
The memory 610 is used to store a computer program. In some embodiments, the memory 610 includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, a USB flash drive, a memory card, or an optical disc.
In particular, memory 610 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory. The electronic device 600 may further include other removable/non-removable, volatile/nonvolatile computer-system storage media. Memory 610 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the disclosure.
The processor 620 is connected to the memory 610 and executes the computer program stored in the memory 610 to cause the electronic device 600 to perform the scaling method.
In some embodiments, the processor 620 may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In some embodiments, the electronic device 600 may also include a display 630. The display 630 is communicatively coupled to the memory 610 and the processor 620 and displays the GUI interface associated with the scaling method.
In summary, the scaling method provided by the embodiments of the present disclosure implements the scaling operator on the NPU, so no additional scaling hardware module is required, which helps further reduce chip size. Moreover, because the scaling is carried out by the MAC multiply-add array inside the NPU, the method is efficient, supports a wide range of scaling parameters, adapts to different scaling algorithms, and has good generality. The present disclosure therefore has high industrial value and effectively overcomes various drawbacks of the prior art.
Those of ordinary skill will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to illustrate this interchangeability clearly, the foregoing description has set out the composition and steps of each example in terms of function. Whether these functions are implemented in hardware or software depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality differently for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present disclosure.
The above embodiments merely illustrate the principles and effects of the present disclosure and are not intended to limit it. Anyone skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the disclosure. Accordingly, all equivalent modifications and variations accomplished by persons of ordinary skill in the art without departing from the spirit and technical ideas of the disclosure shall be covered by the claims of the present disclosure.

Claims (12)

1. A scaling method applied to a neural network processing unit, the scaling method comprising:
reading input data and a weight tensor;
configuring at least one convolution kernel according to the weight tensor;
performing convolution operations on the input data with the at least one convolution kernel through a multiply-add array of the neural network processing unit; and
outputting the result of the convolution operations as scaled data of the input data.
2. The scaling method of claim 1, wherein performing convolution operations on the input data with the at least one convolution kernel comprises:
performing padding processing on the input data to generate padding data;
combining each part of the padding data corresponding to the at least one convolution kernel with the input data to form data to be processed; and
performing convolution operations on the data to be processed with the at least one convolution kernel, and writing the results of the convolution operations into a buffer in a cross-offset skip-write manner.
3. The scaling method of claim 2, wherein outputting the result of the convolution operations as scaled data of the input data comprises:
outputting the results stored in the buffer, written via cross-offset skip writes, as the scaled data.
4. The scaling method of claim 1 wherein configuring at least one convolution kernel in accordance with the weight tensor comprises:
configuring convolution weights according to the weight tensor; and
configuring Tdx×Tdy convolution kernels, the size of each convolution kernel being configured as K1×K2, K1 and K2 being positive integers, the input step sizes of each convolution kernel in the horizontal and vertical directions being configured as Tsx and Tsy respectively, and the output step sizes in the horizontal and vertical directions being configured as Tdx and Tdy respectively, wherein Tsx and Tdx are coprime and Tsx:Tdx = w_sx:w_dx, w_sx and w_dx being the width of the input data and the width of the output data, respectively.
5. The scaling method of claim 1, wherein reading the input data and the weight tensor comprises: reading the input data and the weight tensor and storing them in a buffer, wherein the scaling method further comprises: performing padding processing on the input data to generate padding data and storing the padding data in the buffer, and wherein performing convolution operations on the input data with the at least one convolution kernel comprises:
reading, from the buffer in a cross-offset skip-read manner, the parts of the input data and the padding data corresponding to the at least one convolution kernel;
performing convolution operations on the data to be processed with the at least one convolution kernel; and
writing the results of the convolution operations on the data to be processed into the buffer, or into an additional buffer different from the buffer, in a cross-offset skip-write manner.
6. The scaling method of claim 5 wherein the cross-offset skip read is performed by a direct memory read access of the neural network processing unit and the cross-offset skip write is performed by a direct memory write access of the neural network processing unit.
7. The scaling method of claim 5, wherein
the offset skip amplitude of the cross-offset skip read in the horizontal direction is Tsx and that of the cross-offset skip write is Tdx, wherein Tsx and Tdx are coprime and Tsx:Tdx = w_sx:w_dx, w_sx and w_dx being the width of the input data and the width of the output data, respectively; and
the offset skip amplitude of the cross-offset skip read in the vertical direction is Tsy and that of the cross-offset skip write is Tdy, wherein Tsy and Tdy are coprime and Tsy:Tdy = h_sy:h_dy, h_sy and h_dy being the height of the input data and the height of the output data, respectively.
8. The scaling method of claim 2 or 5, wherein performing convolution operations on the data to be processed with the at least one convolution kernel comprises: performing convolution operations on two or more pieces of data to be processed in parallel using the multiply-add array.
9. The scaling method of claim 1, wherein the multiply-add array of the neural network processing unit performs neural-network acceleration tasks in parallel while performing the convolution operations on the input data.
10. A chip comprising a neural network processing unit configured to scale input data using the scaling method of any one of claims 1 to 9 to generate scaled data.
11. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program is executed to implement the scaling method according to any one of claims 1 to 9.
12. An electronic device, the electronic device comprising:
a memory configured to store a computer program; and
a processor configured to invoke the computer program to perform the scaling method of any of claims 1 to 9.
CN202211541654.6A 2022-12-02 2022-12-02 Scaling method, chip, storage medium and electronic device Pending CN116109481A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211541654.6A CN116109481A (en) 2022-12-02 2022-12-02 Scaling method, chip, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211541654.6A CN116109481A (en) 2022-12-02 2022-12-02 Scaling method, chip, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN116109481A true CN116109481A (en) 2023-05-12

Family

ID=86253469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211541654.6A Pending CN116109481A (en) 2022-12-02 2022-12-02 Scaling method, chip, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN116109481A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703727A (en) * 2023-08-03 2023-09-05 芯动微电子科技(珠海)有限公司 Image scaling optimization method and device in neural network
CN116703727B (en) * 2023-08-03 2024-01-02 芯动微电子科技(珠海)有限公司 Image scaling optimization method and device in neural network

Similar Documents

Publication Publication Date Title
CN108133270B (en) Convolutional neural network acceleration method and device
US20200142949A1 (en) Operation accelerator
JP7007488B2 (en) Hardware-based pooling system and method
US10558386B2 (en) Operation device and operation system
JP6713036B2 (en) Method and apparatus for performing a convolution operation on folded feature data
CN111951167B (en) Super-resolution image reconstruction method, super-resolution image reconstruction device, computer equipment and storage medium
US20200327078A1 (en) Data processing method and device, dma controller, and computer readable storage medium
CN108885596A (en) Data processing method, equipment, dma controller and computer readable storage medium
US10402196B2 (en) Multi-dimensional sliding window operation for a vector processor, including dividing a filter into a plurality of patterns for selecting data elements from a plurality of input registers and performing calculations in parallel using groups of the data elements and coefficients
CN110059815B (en) Artificial intelligence reasoning computing equipment
CN107016643B (en) Apparatus and method for scaling down an image in a computer vision system
CN116109481A (en) Scaling method, chip, storage medium and electronic device
JP2020204894A (en) Filtering process device and control method thereof
CN116071279A (en) Image processing method, device, computer equipment and storage medium
CN110689061A (en) Image processing method, device and system based on alignment feature pyramid network
CN109427035B (en) Semiconductor device and image recognition system
CN109447239B (en) Embedded convolutional neural network acceleration method based on ARM
US8902474B2 (en) Image processing apparatus, control method of the same, and program
CN108415881A (en) The arithmetic unit and method of convolutional neural networks
US20190114734A1 (en) Filter processing apparatus and control method thereof
CN115393682A (en) Target detection method, target detection device, electronic device, and medium
TWI417810B (en) Image enhancement method, image enhancement apparaus and image processing circuit
US11403731B2 (en) Image upscaling apparatus using artificial neural network having multiple deconvolution layers and deconvolution layer pluralization method thereof
CN114037054A (en) Data processing method, device, chip, equipment and medium
CN113592723A (en) Video enhancement method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination