CN116185942A - Data processing method, device, storage medium and electronic equipment

Info

Publication number
CN116185942A
Authority
CN
China
Prior art keywords
input data
calculation
chip
data
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111439229.1A
Other languages
Chinese (zh)
Inventor
Name not published at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambrian Kunshan Information Technology Co ltd
Original Assignee
Cambrian Kunshan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambrian Kunshan Information Technology Co ltd filed Critical Cambrian Kunshan Information Technology Co ltd
Priority to CN202111439229.1A
Publication of CN116185942A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The embodiment of the application provides a data processing method, a device, a storage medium and electronic equipment. Resident data can be obtained directly from the on-chip cache, so the input data does not need to be read from off-chip storage into the on-chip cache again for the second calculation. Because the bandwidth for reading data from the on-chip cache is far greater than the bandwidth for reading data from off-chip storage, and the data does not need to be loaded from off-chip storage multiple times, the calculation performance is improved.

Description

Data processing method, device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, a data processing device, a storage medium, and an electronic device.
Background
In the deep neural network training process, because the calculation process is complex or requires multiple calculation passes, the same data often needs to be read multiple times. For example, BatchNorm is an important normalization technique that makes the intermediate results between network layers more stable and accelerates the convergence of the training process; however, when the BatchNorm technique is used for forward calculation or backward calculation, the input data needs to be read multiple times.
However, when the data is read multiple times, it must be carried onto the chip multiple times before the whole calculation can be completed, which reduces the overall calculation performance.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, a storage medium and electronic equipment, which improve the overall computing performance by residing data in an on-chip cache so that it does not need to be loaded from off-chip storage multiple times.
In a first aspect, an embodiment of the present application provides a data processing method, applied to a computing device, where the computing device includes an on-chip cache, and the computing device is connected to an off-chip storage, where a plurality of input data are stored off-chip, and the method includes:
reading the plurality of input data into the on-chip cache and performing first calculation to obtain a first calculation result;
residing the plurality of input data in the on-chip cache;
and reading the plurality of input data in the on-chip cache and the first calculation result to perform second calculation so as to obtain calculation results of the plurality of input data.
In a second aspect, embodiments of the present application provide a computing device, the device comprising a processor and an on-chip cache; the device is connected with an off-chip storage, and a plurality of input data are stored in the off-chip storage;
The processor is used for reading the plurality of input data into the on-chip cache and performing first calculation to obtain a first calculation result;
the on-chip cache is used for residing the plurality of input data; the processor is further configured to perform a second calculation on the plurality of input data in the on-chip cache and the first calculation result, so as to obtain a calculation result of the plurality of input data.
In a third aspect, embodiments of the present application provide a combined processing apparatus comprising a computing apparatus according to the second aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a memory, a processor, a communication bus, and a communication interface, where the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used for storing a computer program; the processor is configured to implement some or all of the steps described in the first aspect above when executing the program stored in the memory.
In a fifth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program for data exchange, which, when executed by a processor, implements some or all of the steps described in the first aspect of the embodiments of the present application.
In a sixth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
The data processing method, the device, the storage medium and the electronic equipment provided by the embodiment of the application can read a plurality of input data into the on-chip cache and perform a first calculation to obtain a first calculation result; reside the plurality of input data in the on-chip cache; and read the plurality of input data in the on-chip cache together with the first calculation result to perform a second calculation and obtain the calculation results of the plurality of input data. In this way, when the second calculation is performed, the resident data can be obtained directly from the on-chip cache, and the input data does not need to be read from off-chip storage into the on-chip cache again for the second calculation. Because the bandwidth for reading data from the on-chip cache is far greater than the bandwidth for reading data from off-chip storage, and the on-chip cache does not need to be loaded from off-chip storage multiple times, the calculation performance is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a board provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a combined processing apparatus according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the internal architecture of a computing device provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of the internal structure of a processor core according to an embodiment of the present application;
FIG. 5a is a flowchart of a data processing method according to an embodiment of the present disclosure;
FIG. 5b is a schematic diagram of an on-chip cache according to an embodiment of the present disclosure;
FIG. 5c is a schematic diagram of a structure of multi-dimensional tensor data according to an embodiment of the present application;
FIG. 6 is a functional block diagram of a computing device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following describes the technical solutions in the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the present disclosure without inventive effort fall within the scope of protection of the present disclosure.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The present application will be described in detail with reference to specific examples.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a board 10 provided in an embodiment of the present application. As shown in fig. 1, the board 10 includes a chip 101, which is a System on Chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; one notable characteristic of cloud intelligence applications is the large volume of input data, which places high requirements on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing device 20 comprises a computing device 201, an interface device 202, a processing device 203, and an off-chip storage 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor to perform the related computations in a deep learning or machine learning training process; it may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the processing device 203 may obtain the input data from an off-chip storage, and further, the computing device 201 may obtain the input data from the processing device 203 via the interface device 202, or the computing device 201 may directly obtain the input data from the off-chip storage 204, perform a first calculation on the input data by using a processor in the computing device 201, obtain a first calculation result, and may write the first calculation result into a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, write the control instruction into the control buffer on the chip of the computing device 201, and according to the control instruction, the computing device 201 resides the input data in the control buffer on the chip of the computing device 201 (i.e. the on-chip buffer described in the embodiment of the present application), and finally, perform the second calculation on the input data and the first calculation result by using the processor in the computing device 201, to obtain the second calculation result corresponding to the input data. Alternatively or in addition, the interface device 202 may also read the first calculation result, the second calculation result, the input data, and other data in the storage device of the calculation device 201 and transmit the data to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors, including but not limited to a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or other general-purpose and/or special-purpose processors, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The off-chip storage 204 is used to store input data and may be DDR memory, typically 16G or more in size, for storing data of the computing device 201 and/or the processing device 203; the stored data may be input data, or data such as the first calculation result that cannot be held in the internal or on-chip memory of the computing device 201 or other processing devices.
Fig. 3 shows a schematic diagram of the internal structure of the computing device 201. The computing device 201 is configured to process input data such as computer vision, speech, natural language, and data mining data. The computing device 201 is configured as a multi-core hierarchical structure: it is a system on chip (SoC) including a plurality of clusters, each of which includes a plurality of processor cores; in other words, the computing device 201 is organized in a hierarchy of system on chip (SoC), cluster, and processor core.
At the system-on-chip level, as shown in FIG. 3, computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be a plurality of external memory controllers 301, of which 2 are illustratively shown, for accessing external memory devices, such as the off-chip memory 204 of fig. 2, in response to an access request from a processor core, and the computing device 201 is coupled to the off-chip memory 204 to read or write a plurality of input data from or to the off-chip memory. The peripheral communication module 302 is configured to receive a control signal from the processing device 203 through the interface device 202, and activate the computing device 201 to perform a task. The on-chip interconnect module 303 connects the external memory controller 301, the peripheral communication module 302, and the plurality of clusters 305 for transmitting different input data and control signals between the respective modules. The synchronization module 304 is a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each cluster to ensure synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201, 4 being illustratively shown, and as hardware progresses, the computing device 201 of the present disclosure may also include 8, 16, 64, or even more clusters 305. The cluster 305 is used to efficiently execute the deep learning algorithm.
At the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU cores) 306 and a memory core (MEM core) 307.
The processor cores 306 are illustratively shown as 4 in the figures, and the present disclosure does not limit the number of processor cores 306. The internal architecture is shown in fig. 4. Each processor core 306 includes three major modules: the device comprises a control module 41, an operation module 42 and a storage module 43, wherein the storage module is an on-chip cache. The processor core 306 may be configured to read a plurality of input data from the off-chip storage 204 to the storage module 43 to perform a first calculation and obtain a first calculation result, may reside the plurality of input data in the storage module and store the first calculation result in the storage module 43, or may be configured to perform a second calculation on the plurality of input data and the first calculation result residing in the storage module 43 and obtain a calculation result of the plurality of input data.
The control module 41 is used for coordinating and controlling the operation of the operation module 42 and the storage module 43 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 411 and an instruction decode unit (instruction decode unit, IDU) 412. The instruction fetching unit 411 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 412 decodes the fetched instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation, etc.; the matrix operation unit 422 is responsible for the core computation of the deep learning algorithm, i.e. matrix multiplication and convolution. The operation module is used for executing the first calculation and/or the second calculation.
The storage module 43 is configured to store or transfer relevant data (e.g., a plurality of input data read from the off-chip storage 204, a first calculation result obtained by the processor core 306 performing the first calculation, or a second calculation result obtained by the processor core 306 performing the second calculation, etc.), and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access module (input/output direct memory access, IODMA) 433, and a move direct memory access module (move direct memory access, MVDMA) 434. NRAM 431 is used to store the plurality of input data for the processor core 306 to use in the first calculation, and/or the first calculation result and the plurality of input data for the second calculation; WRAM 432 is configured to store the weights of the deep learning network; IODMA 433 controls access between NRAM 431/WRAM 432 and the off-chip storage 204 over the broadcast bus 309; MVDMA 434 is used to control access between NRAM 431/WRAM 432 and SRAM 308.
It should be noted that, in the embodiment of the present application, the storage module 43 may include one or more on-chip caches (for example, NRAM, WRAM, etc.) described in the embodiment of the present application.
Returning to FIG. 3, the storage cores 307 are primarily used to store and communicate, i.e., to store shared data (e.g., multiple input data) or intermediate results (e.g., first computation results, second computation results, etc.) between the processor cores 306, as well as to perform communications between the clusters 305 and the off-chip storage 204, communications between the clusters 305, communications between the processor cores 306, etc. In other embodiments, the memory core 307 has scalar operation capabilities to perform scalar operations.
The memory core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a clustered direct memory access module (cluster direct memory access, CDMA) 310, and a global direct memory access module (global direct memory access, GDMA) 311. The SRAM 308 plays a role of a high-performance data transfer station, and data multiplexed between different processor cores 306 in the same cluster 305 is not required to be obtained from the processor cores 306 to the off-chip storage 204, but transferred between the processor cores 306 through the SRAM 308, and the memory cores 307 only need to rapidly distribute the multiplexed data from the SRAM 308 to the plurality of processor cores 306, so that the inter-core communication efficiency is improved, and the on-chip off-chip input/output access is also greatly reduced.
In addition, in the embodiment of the present application, the SRAM 308 may also be used as an on-chip cache for residing the plurality of input data, the first calculation result, and the like.
The broadcast bus 309, CDMA 310, and GDMA 311 are used to perform communication between processor cores 306, communication between clusters 305, and data transfer between a cluster 305 and the off-chip storage 204, respectively. These are described below in turn.
The broadcast bus 309 is used to perform high-speed communication between the processor cores 306 in the cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to the transmission of data from point to point (i.e., single processor core to single processor core), multicast is a communication scheme that transfers a piece of data from SRAM 308 to a specific number of processor cores 306, and broadcast is a communication scheme that transfers a piece of data from SRAM 308 to all processor cores 306, a special case of multicast.
CDMA 310 is used to control access to SRAM 308 between different clusters 305 within the same computing device 201. Fig. 3 shows a schematic diagram of one processor core writing data to a processor core of another cluster, to illustrate the operation of CDMA 310. In this application scenario, the same computing device includes a plurality of clusters; for convenience of illustration, only cluster 0 and cluster 1 are shown in the figure, and cluster 0 and cluster 1 each include a plurality of processor cores, of which the figure shows only processor core 0 in cluster 0 and processor core 1 in cluster 1. Processor core 0 is to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master end and CDMA 1 acts as the slave end; the master pushes the write request to the slave, that is, the master sends the write address AW and the write data W, and the data is transmitted to SRAM 1 of cluster 1. The slave then sends a write response B as an acknowledgement, and finally processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
In one embodiment, returning to FIG. 3, GDMA 311 cooperates with the external memory controller 301 to control access of the on-chip cache SRAM 308 of the cluster 305 to the off-chip storage 204, or to read input data from the off-chip storage 204 into SRAM 308. From the foregoing, communication between the off-chip storage 204 and NRAM 431 or WRAM 432 may be achieved via two channels. The first channel directly connects the off-chip storage 204 with the on-chip cache NRAM 431 or WRAM 432 through IODMA 433; the second channel transfers data between the off-chip storage 204 and the on-chip cache SRAM 308 via GDMA 311, and then transfers the input data between the on-chip cache SRAM 308 and the on-chip cache NRAM 431 or WRAM 432 via MVDMA 434, so that residency of the input data can be realized. While the second channel seemingly requires more elements to participate and the data path is longer, in practice in some embodiments the bandwidth of the second channel is much greater than that of the first channel, so communication between the off-chip storage 204 and the on-chip cache NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present disclosure may select the data transmission channel based on the hardware conditions.
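The channel choice described above can be sketched as a simple bandwidth comparison. The following is a minimal illustrative sketch in Python; the function name, bandwidth figures and channel labels are assumptions introduced here for illustration and are not part of the hardware's actual driver interface.

```python
# Illustrative sketch only: pick the off-chip -> NRAM/WRAM transfer path by
# comparing the effective bandwidth of the two channels described above.
def choose_channel(bw_iodma: float, bw_gdma_mvdma: float) -> str:
    """bw_iodma      -- bandwidth of the direct IODMA channel (channel 1)
    bw_gdma_mvdma -- effective bandwidth of GDMA + MVDMA via SRAM (channel 2)"""
    if bw_gdma_mvdma > bw_iodma:
        return "channel 2: off-chip -> SRAM (GDMA) -> NRAM/WRAM (MVDMA)"
    return "channel 1: off-chip -> NRAM/WRAM directly (IODMA)"


if __name__ == "__main__":
    # Example values (assumed): channel 2 has the higher bandwidth here.
    print(choose_channel(bw_iodma=50.0, bw_gdma_mvdma=200.0))
```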
In other embodiments, the functionality of GDMA 311 and the functionality of IODMA 433 may be integrated in the same component. GDMA 311 and IODMA 433 are treated as different components for convenience of description; implementations by a person skilled in the art remain within the scope of protection of the present disclosure as long as the functions and technical effects achieved are similar to those of the present disclosure. Further, the functions of GDMA 311, IODMA 433, CDMA 310, and MVDMA 434 may be implemented by the same component, which is also within the scope of the present disclosure as long as the implemented functions and the achieved technical effects are similar to those of the present disclosure.
Referring to fig. 5a, fig. 5a is a flowchart of a data processing method according to an embodiment of the present application, which is applied to the computing device 201, the computing device includes an on-chip cache (corresponding to the storage module 43 in fig. 4), the computing device 201 is connected to an off-chip storage 204, and the off-chip storage 204 has a plurality of input data, as shown in fig. 5a, and the method includes the following steps:
s510, reading the plurality of input data into the on-chip cache and performing first calculation to obtain a first calculation result.
When data processing is performed, firstly, input data is read from off-chip storage into an on-chip cache, and then the input data in the on-chip cache is calculated.
The plurality of input data may be image data, voice data, video data, etc., depending on the application scenario, which is not limited in this application. Alternatively, the data may be one-dimensional or multi-dimensional tensor data; when the data is image data, the input data may be four-dimensional tensor data x[N, H, W, C], where N represents the number of samples, C the image channels, H the height, and W the width.
The first calculation may refer to the first calculation pass performed by the computing device on the plurality of input data. Some calculations, such as normalize or BatchNorm, require the input data to be read multiple times; for these, the input data must be processed in two passes, and the first calculation refers to the first pass performed on the input data.
S520, residing the plurality of input data in the on-chip cache.
Since the bandwidth with which the computing device reads data from off-chip storage is much smaller than the bandwidth with which it reads data from the on-chip cache, the plurality of input data can reside in the on-chip cache, which reduces the amount of data read from off-chip storage in the subsequent calculation and thus reduces the load volume of the input data. In this embodiment of the present application, residing refers to reading the input data from off-chip storage and storing it in the on-chip cache, where the input data remains unchanged and is not modified until it becomes invalid.
S530, reading the plurality of input data in the on-chip cache and the first calculation result to perform second calculation so as to obtain calculation results of the plurality of input data.
Some calculations require the input data to be read multiple times; for example, normalize and BatchNorm require two readings of the input data, and the second calculation takes the input data and the first calculation result as its inputs.
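The read-once/reside/reuse pattern of S510-S530 can be illustrated with a short host-side sketch. This is a minimal simulation in Python/NumPy under stated assumptions: the dictionary stands in for the on-chip cache and the two callables stand in for the first and second calculations; it is not the device-side implementation.

```python
import numpy as np

def two_pass_with_residency(off_chip_data, first_calc, second_calc):
    """Sketch of S510-S530: load the input once, let it reside, reuse it."""
    on_chip_cache = {}

    # S510: read the input data into the on-chip cache and do the first calculation.
    on_chip_cache["input"] = np.array(off_chip_data)      # single off-chip load
    first_result = first_calc(on_chip_cache["input"])

    # S520: the input data stays resident in the on-chip cache (not evicted).
    # S530: the second calculation reads the resident copy, not off-chip storage.
    return second_calc(on_chip_cache["input"], first_result)


# Example: the first pass computes a mean, the second pass subtracts it.
out = two_pass_with_residency([1.0, 2.0, 3.0],
                              first_calc=lambda x: x.mean(),
                              second_calc=lambda x, m: x - m)
```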
In order to better understand the solution of the embodiment of the present application, the following uses the BatchNorm calculation as an example to describe the embodiment of the present application.
In the deep neural network training process, batchNorm is an important regulation technology, so that the intermediate result between the network layer and the interlayer is more stable, and the convergence rate of the training process is accelerated. In the deep learning web training process, for the 4-dimensional tensor X of (N, H, W, C), the batch norm includes forward operation and reverse operation, wherein the forward operation on the batch norm is calculated as follows:
$$\mu_c = \frac{1}{M}\sum_{n,h,w} x[n,h,w,c] \qquad (1)$$

$$\sigma_c^2 = \frac{1}{M}\sum_{n,h,w} \left(x[n,h,w,c] - \mu_c\right)^2 \qquad (2)$$

$$y[n,h,w,c] = \gamma_c \cdot \frac{x[n,h,w,c] - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c \qquad (3)$$
the reverse calculations for BatchNorm are as follows:
$$\frac{\partial L}{\partial \gamma_c} = \sum_{n,h,w} \frac{\partial L}{\partial y[n,h,w,c]} \cdot \hat{x}[n,h,w,c] \qquad (4)$$

$$\frac{\partial L}{\partial \beta_c} = \sum_{n,h,w} \frac{\partial L}{\partial y[n,h,w,c]} \qquad (5)$$

$$\frac{\partial L}{\partial x[n,h,w,c]} = \frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}} \left( \frac{\partial L}{\partial y[n,h,w,c]} - \frac{1}{M}\frac{\partial L}{\partial \beta_c} - \frac{\hat{x}[n,h,w,c]}{M}\frac{\partial L}{\partial \gamma_c} \right) \qquad (6)$$

where $\hat{x}[n,h,w,c] = (x[n,h,w,c]-\mu_c)/\sqrt{\sigma_c^2+\epsilon}$ denotes the normalized input.
where N represents the number of samples, C the image channels, H the height, and W the width; M = N×H×W represents the total number of elements per image channel in a batch of data (a mini-batch); γ_c refers to a scaling variable and β_c to a shift variable, and γ_c and β_c can scale or translate the normalized result by a certain amplitude; μ_c refers to the mean of the input data x[n,h,w,c]; σ_c² refers to the variance of the input data x[n,h,w,c]; ε refers to a coefficient close to 0 that prevents the denominator √(σ_c²+ε) from being 0; and ∂L/∂x[n,h,w,c] refers to the gradient at x[n,h,w,c].
For the forward computation, as can be seen from (1) to (3) above, y[n,h,w,c] = BatchNorm_forward(x[n,h,w,c], γ[c], β[c]). During the computation, x[n,h,w,c] needs to be loaded onto the chip twice: the first time to compute the mean and variance of each channel, and the second time to compute the value of the element at the corresponding position of the output tensor.
If the application is used for the BatchNorm computation, the BatchNorm computation includes a forward computation, and the first computation result includes: the mean value corresponding to the plurality of input data and the variance corresponding to the plurality of input data.
Specifically, during the BatchNorm forward computation, the processor first reads the input data from off-chip storage into the on-chip cache and then performs the first computation, i.e., the mean and variance computation, on the read input data; the plurality of input data reside in the on-chip cache. Next, the processor reads the plurality of input data residing in the on-chip cache together with the mean and variance obtained by the first computation, calculates the difference between the input data and the mean, divides this difference by the square root of the sum of the variance and the coefficient ε to normalize the input data, and then uses γ_c and β_c to scale or translate the normalized result by a certain amplitude, thereby completing the whole second computation process and obtaining the computation results corresponding to all the input data in off-chip storage.
In the forward computing process of the BatchNorm computation, the input data is resident in the on-chip cache after the mean value and the variance (namely the first computation) are calculated, and when the normalization computation (namely the second computation) is carried out on the input data, the input data can be directly obtained in the on-chip cache for computation, so that the data does not need to be read into the on-chip cache from the off-chip storage again, and the computing performance is improved.
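The forward flow described above can be sketched in NumPy on the host. This is only an illustrative sketch of the two-pass, resident-data computation of formulas (1) to (3); the single `np.array` copy stands in for the one-time load into the on-chip cache, and the function and variable names are assumptions introduced here.

```python
import numpy as np

def batchnorm_forward_resident(x_offchip, gamma, beta, eps=1e-5):
    """Two-pass BatchNorm forward per formulas (1)-(3), with x resident on chip.

    x_offchip: (N, H, W, C) tensor playing the role of off-chip data.
    """
    x = np.array(x_offchip)                    # single load into the "on-chip cache"
    # First calculation: per-channel mean and variance over N, H, W.
    mu = x.mean(axis=(0, 1, 2))
    var = x.var(axis=(0, 1, 2))
    # Second calculation: reuse the resident x instead of reloading it off-chip.
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta


y = batchnorm_forward_resident(np.random.randn(2, 4, 4, 3),
                               gamma=np.ones(3), beta=np.zeros(3))
```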
For the backward calculation, as can be seen from the backward formulas above, Diff_X[n,h,w,c] = BatchNorm_backward(x[n,h,w,c], Diff_Y[n,h,w,c], γ[c], β[c]). During the calculation, x[n,h,w,c] also needs to be loaded onto the chip twice: the first time to calculate the gradients at γ[c] and β[c], and the second time to calculate the gradient at x[n,h,w,c].
If the application is used for the BatchNorm computation, the BatchNorm computation further includes a reverse computation, and the first computation result includes: a first differential result and a second differential result.
Specifically, in the BatchNorm backward computation process, the processor first reads the input data from off-chip storage into the on-chip cache, where the input data includes the input tensor data and the output gradient data Diff_Y, as well as the mean and variance from the forward computation process. It then performs the first computation on the read input data, i.e., the first differential computation and the second differential computation corresponding to equations (4) and (5) above, respectively, to obtain a first differential result and a second differential result corresponding to the plurality of input data. The input data resides in the on-chip cache. The processor may then read the plurality of input data residing in the on-chip cache, together with the first differential result for the parameter γ_c and the second differential result for the parameter β_c obtained by the first computation; a differentiating operation according to equation (6) is then performed on the input tensor data to obtain the gradient ∂L/∂x[n,h,w,c] at the input tensor data, thereby completing the whole second computation process and obtaining the computation results corresponding to all the input data in off-chip storage. Here the second computation refers to the series of calculations that produces the second computation result from the input data, the first differential result for its related parameter γ_c, and the second differential result for the related parameter β_c.
In the reverse calculation process of the BatchNorm calculation, the input data is resident in the on-chip cache after differentiation (namely, first calculation) is carried out on the input data, and when second calculation is carried out on the input data, the input data can be directly obtained in the on-chip cache for calculation, so that the data does not need to be read into the on-chip cache from off-chip storage again, and the calculation performance is improved.
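As with the forward pass, the backward pass can be sketched on the host. The following NumPy sketch follows formulas (4) to (6) under assumed names; the saved `mu` and `var` stand in for the statistics kept from the forward pass, and the resident `x` and `diff_y` are reused for the second pass instead of being reloaded.

```python
import numpy as np

def batchnorm_backward_resident(x, diff_y, gamma, mu, var, eps=1e-5):
    """Two-pass BatchNorm backward per formulas (4)-(6), x and diff_y resident.

    mu and var are the per-channel statistics saved from the forward pass.
    """
    M = x.shape[0] * x.shape[1] * x.shape[2]            # M = N * H * W
    x_hat = (x - mu) / np.sqrt(var + eps)
    # First calculation: gradients at gamma[c] and beta[c] (formulas (4), (5)).
    d_gamma = np.sum(diff_y * x_hat, axis=(0, 1, 2))
    d_beta = np.sum(diff_y, axis=(0, 1, 2))
    # Second calculation: gradient at x[n,h,w,c] (formula (6)), reading the
    # resident x and diff_y instead of reloading them from off-chip storage.
    diff_x = gamma / np.sqrt(var + eps) * (diff_y - d_beta / M - x_hat * d_gamma / M)
    return diff_x, d_gamma, d_beta
```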
As described above, the input tensor x[n,h,w,c] needs to be used twice in the computation, both for BatchNorm forward computation and for BatchNorm backward computation, and therefore needs to be loaded from off-chip storage into the on-chip cache twice. In the prior art, when the computing resources reach their best performance, the time for the BatchNorm forward computation is the sum of the time required to read the input tensor x[n,h,w,c] into the on-chip cache twice and the time required to write out the output tensor Y once; the time for the BatchNorm backward computation is the sum of the time required to read in the two input tensors x[n,h,w,c] and Diff_Y and the time required to write out the output tensor Diff_X once. Thus, the need to load from off-chip storage into the on-chip cache multiple times makes the computation performance of BatchNorm poor. According to the present application, after the first calculation is performed on the input data, the input data resides in the on-chip cache; when the input data needs to be read again, it does not need to be read from off-chip storage again, and the resident input data can be read directly from the on-chip cache, which improves the calculation performance.
It should be noted that, only two calculation steps (the first calculation and the second calculation) are taken as an example at present, and the method described in the embodiment of the present application is equally applicable when three or more calculation steps are involved. For example, if 3 computation steps are involved (which may include a first computation, a second computation, and a third computation, and the three computations have the same input data), multiple input data may be resident in on-chip storage after the first computation is completed, and at the time of the second computation and the third computation, the resident multiple input data is read directly from on-chip cache, and the final desired computation result is obtained.
Therefore, in the application, when the subsequent calculation is performed on the plurality of input data, the plurality of input data can be read from the on-chip cache, and compared with the case that the data is read from the off-chip storage, the bandwidth of the data read from the on-chip cache is far greater than the bandwidth of the data read from the off-chip storage, which is beneficial to reducing the data amount which needs to be read from the off-chip storage when the calculation result is obtained by calculation.
In one possible example, the size of the cache space corresponding to the on-chip cache is a first cache space, and the size of the cache space required by the plurality of input data is a second cache space. Before the plurality of input data reside in the on-chip cache, the method further comprises the following steps: judging whether the first cache space is smaller than the second cache space; and if the first cache space is smaller than the second cache space, residing a portion of the input data equal in size to the first cache space in the on-chip cache.
Specifically, since the storage space of the on-chip cache is limited, or the on-chip cache may store other data in addition to the plurality of input data calculated this time, or the data volume of the plurality of input data is too large, it is necessary to determine whether the plurality of input data can reside in the on-chip cache in whole or only in part. When the data volume of the input data is large and the cache space of the on-chip cache is insufficient to store all of the plurality of input data, part of the input data can be stored in the on-chip cache and the rest kept in off-chip storage; when the on-chip cache space is sufficient to store the plurality of input data, all of the input data may reside in the on-chip cache.
That is, whether to reside all of the plurality of input data in the on-chip cache may be determined by comparing the sizes of the first cache space and the second cache space. If the first cache space is smaller than the second cache space, indicating that the on-chip cache does not have enough storage space to hold the plurality of input data, a portion of the input data equal in size to the first cache space resides in the on-chip cache and the remaining input data stays in off-chip storage; otherwise, if the first cache space is greater than or equal to the second cache space, indicating that the on-chip cache has enough storage space to hold the plurality of input data temporarily, all of the plurality of input data reside in the on-chip cache.
Therefore, in the application, the plurality of input data can be distributed according to their data volume and the storage condition of the on-chip cache, residing wholly or partially in the on-chip cache. This improves the efficiency and success rate of data residency, avoids the situation where all input data must be read from off-chip storage because nothing resides in the on-chip cache, and reduces the amount of input data that must be loaded in that case.
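The comparison of the first and second cache spaces can be expressed as a small helper. This is a minimal sketch under assumed names and units (bytes); it only shows the full-versus-partial residency decision described above.

```python
def plan_residency(first_cache_space: int, second_cache_space: int) -> int:
    """Return how many bytes of the input data should reside in the on-chip cache.

    first_cache_space  -- bytes available in the on-chip cache
    second_cache_space -- bytes needed to hold all of the input data
    The remainder (if any) stays in off-chip storage.
    """
    if first_cache_space < second_cache_space:
        return first_cache_space          # partial residency
    return second_cache_space             # full residency


# Example (assumed sizes): only 2 MiB of a 5 MiB input can reside on chip.
resident_bytes = plan_residency(first_cache_space=2 * 1024 * 1024,
                                second_cache_space=5 * 1024 * 1024)
```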
Further, when the plurality of input data residing in the on-chip cache is all of the input data, that is, the storage space of the on-chip cache is sufficient to store all of the input data in off-chip storage, the processor directly reads all of the resident input data from the on-chip cache and performs the second calculation with the first calculation result to obtain the final calculation result of the plurality of input data.
In one possible example, if not all of the input data is required for performing the second calculation, when selecting that the input data resides in the on-chip cache, a portion of the input data that needs to be subjected to the second calculation may be selected to reside in the on-chip cache, and the specific selection manner is determined according to the calculation manner applicable to the embodiment of the present application, which is not limited herein.
When the plurality of input data residing in the on-chip cache is part of the input data, that is, the remaining input data may reside in the off-chip storage, in this case, the plurality of input data may be read from the on-chip cache and the off-chip storage according to the order in which the plurality of input data reside in the plurality of cache spaces and the cache space in the off-chip storage. And further, carrying out subsequent second calculation by combining the first calculation result so as to obtain calculation results corresponding to the plurality of input data.
The method may further comprise the following steps: performing the second calculation on the portion of the input data residing in the on-chip cache and the first calculation result to obtain a second calculation result; reading the remaining input data, other than the portion residing in the on-chip cache, from off-chip storage and performing the second calculation with the first calculation result to obtain a third calculation result; and obtaining the calculation results of the plurality of input data according to the second calculation result and the third calculation result.
It should be noted that, in the embodiment of the present application, the storage space of the on-chip cache is insufficient to store all of the plurality of input data, and thus, a portion of the input data may be stored in the on-chip cache.
For example, if the method described in the embodiments of the present application is applied to the BatchNorm forward calculation, the second calculation may refer to the normalization of the input data with its mean and variance as described in formula (3) above. In this case, the processor may first read the partial input data residing in the on-chip cache, calculate the difference between this partial input data and the mean, divide the difference by the square root of the sum of the variance and the coefficient ε to normalize the partial input data, and then scale or translate the normalized result by γ_c and β_c to obtain the second calculation result corresponding to the partial input data. The remaining input data kept in off-chip storage can then be read into the on-chip cache and subjected to the same series of calculations with the mean and variance from the first calculation to normalize the remaining input data, after which the normalized result is scaled or translated by γ_c and β_c to obtain the third calculation result corresponding to the remaining input data. Finally, the second calculation result and the third calculation result are spliced to obtain the calculation results corresponding to all the input data (the plurality of input data).
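The partial-residency case can be sketched as follows. This NumPy sketch is illustrative only: the split point, names, and the choice to partition along the first dimension are assumptions; it computes the second result from the resident portion, the third result from the portion re-read from "off-chip storage", and splices them back in the original order.

```python
import numpy as np

def forward_with_partial_residency(x_offchip, resident_n, gamma, beta, eps=1e-5):
    """Second calculation of BatchNorm forward with only part of x resident.

    resident_n: how many samples along N fit in the on-chip cache.
    """
    mu = x_offchip.mean(axis=(0, 1, 2))                 # from the first calculation
    var = x_offchip.var(axis=(0, 1, 2))

    resident = np.array(x_offchip[:resident_n])         # resident portion (on chip)
    second = gamma * (resident - mu) / np.sqrt(var + eps) + beta

    remaining = np.array(x_offchip[resident_n:])        # re-read from off-chip storage
    third = gamma * (remaining - mu) / np.sqrt(var + eps) + beta

    # Splice the second and third results back in the original storage order.
    return np.concatenate([second, third], axis=0)


y = forward_with_partial_residency(np.random.randn(8, 4, 4, 3), resident_n=5,
                                   gamma=np.ones(3), beta=np.zeros(3))
```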
For example, if the method described in the embodiments of the present application is applied to the BatchNorm reverse calculation, the processing method corresponding to the above BatchNorm forward calculation is the same, and will not be described herein again.
In one possible example, when the second calculation result and the third calculation result are spliced, the method may include the following steps: determining a first storage sequence corresponding to partial input data residing in one or more cache spaces in the on-chip cache; determining a second storage order between the partial input data and the remaining input data; and according to the second storage sequence, splicing the second calculation result corresponding to part of the input data and the third calculation result corresponding to the rest of the input data to obtain calculation results of a plurality of input data.
The second storage order can be used to determine whether the partial input data precedes or follows the remaining input data. If the partial input data is the preceding data, the second calculation result is spliced before the third calculation result; otherwise, the second calculation result is spliced after the third calculation result.
In one possible example, the on-chip cache may include one or more on-chip cache spaces, which may be one or more of the on-chip caches such as NRAM, SRAM, WRAM shown in fig. 3, and may be other on-chip cache spaces, as the application is not limited in this regard.
For example, as shown in fig. 5b, an on-chip cache structure is shown, and the on-chip cache may include one SRAM, four NRAMs, and four WRAMs.
When the on-chip cache includes a plurality of cache spaces, reading a plurality of input data and a first calculation result in the on-chip cache to perform a second calculation, so as to obtain a calculation result of the plurality of input data, including:
determining an order in which the plurality of input data reside in the plurality of cache spaces; and reading a plurality of input data in the on-chip cache and the first calculation result according to the sequence to perform second calculation so as to obtain calculation results of the plurality of input data.
Specifically, the order in which the input data resides in the on-chip cache space may be arbitrary, or the input data may be stored in the plurality of cache spaces of the on-chip cache in a predetermined order. The predetermined order may be determined according to the size of the on-chip cache space, and may be set by the user or by system default, which is not limited herein; for example, the input data may be stored in front-to-back order, or stored in a chained fashion in which each piece of input data references the next.
In this case, since the plurality of input data are associated with each other or have an arrangement order, when the plurality of input data are read from the on-chip cache, they can be read according to the order in which they reside in the on-chip cache. In a specific implementation, if all of the input data participating in the second calculation reside in the on-chip cache, the order in which the plurality of input data reside in the plurality of cache spaces of the on-chip cache is determined, the plurality of input data are read from the on-chip cache in that order, and the second calculation is performed in combination with the first calculation result to obtain the calculation results of the plurality of input data. In this way, the arrangement order of the plurality of input data is not disturbed or destroyed, which improves the accuracy of the calculation result.
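Reading resident data back in its residence order can be sketched as below. The cache-space names and the order record are assumptions for illustration; the point is simply that reads follow the recorded residence order so the arrangement of the input data is preserved.

```python
def read_resident_in_order(cache_spaces, residency_order):
    """Read resident input data back in the order in which it was resided.

    cache_spaces    -- dict mapping a cache-space name to the chunks stored in it
    residency_order -- list of (cache_space_name, chunk_index) recording the order
    """
    return [cache_spaces[name][idx] for name, idx in residency_order]


# Example (assumed layout): part of the data resides in NRAM, the rest in SRAM.
order = [("NRAM", 0), ("SRAM", 0), ("SRAM", 1)]
chunks = read_resident_in_order({"NRAM": ["chunk_a"],
                                 "SRAM": ["chunk_b", "chunk_c"]}, order)
```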
In one possible example, when the on-chip cache space includes an NRAM, the input data is preferentially stored in the NRAM. Specifically, because of how calculation instructions are invoked (for example, the instructions that invoke the BatchNorm calculation), when the input data needs to be read from the on-chip cache for calculation, input data stored in other on-chip caches (for example, SRAM and/or WRAM) often has to be carried or allocated into the NRAM so that the calculation instruction corresponding to the actual calculation can be invoked through the NRAM. Therefore, when input data resides in the on-chip cache, priority is given to residing it in the cache space of the NRAM, and the input data is stored in other on-chip cache spaces only when the cache space of the NRAM is insufficient to store all of it.
In a specific implementation, if the on-chip cache includes NRAM, SRAM, WRAM, the input data read from the off-chip storage is first resident in the NRAM, and the input data is stored in the SRAM and/or WRAM when the cache space of the NRAM is insufficient to store all the input data.
Therefore, in the application, when the buffer space of the NRAM is insufficient to store all input data, the input data can be stored in other on-chip buffer spaces, so that when the input data needs to be read later, the input data residing in the input data can be read into the NRAM from the SRAM and/or the WRAM, and the bandwidth from one on-chip buffer to the other on-chip buffer is far greater than the bandwidth from the off-chip storage, which is beneficial to improving the data reading efficiency.
In one possible example, if the on-chip cache includes NRAM, SRAM and WRAM, residing the plurality of input data in the on-chip cache may include the following steps: since the resident input data must be fetched from the NRAM when the input data is later calculated, and the actually available cache space of the NRAM may be unclear, the maximum NRAM cache space available when no data is stored in the NRAM can be determined first; if the NRAM cache space is greater than or equal to the second cache space, all of the plurality of input data reside in the NRAM; if the NRAM cache space is smaller than the second cache space, after part of the input data resides in the NRAM, the remaining input data resides in the SRAM and/or WRAM.
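An NRAM-first residency plan of this kind can be sketched as a simple overflow allocation. The sizes and dictionary layout below are assumptions for illustration; whatever does not fit in NRAM spills into SRAM/WRAM, and anything left over stays in off-chip storage.

```python
def plan_nram_first(total_input_bytes, nram_free, sram_free, wram_free):
    """Place the input data in NRAM first, then SRAM/WRAM for the overflow.

    Returns a dict of bytes placed in each cache space; the remainder that
    fits nowhere on chip stays in off-chip storage.
    """
    plan = {"NRAM": 0, "SRAM": 0, "WRAM": 0, "off_chip": 0}
    remaining = total_input_bytes
    for space, free in (("NRAM", nram_free), ("SRAM", sram_free), ("WRAM", wram_free)):
        placed = min(remaining, free)
        plan[space] = placed
        remaining -= placed
    plan["off_chip"] = remaining
    return plan


# Example (assumed sizes, in bytes).
print(plan_nram_first(total_input_bytes=6_000_000,
                      nram_free=1_000_000, sram_free=4_000_000, wram_free=500_000))
```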
Therefore, in the application, the residence mode of a plurality of input data can be planned according to the calling mode of the calculation instruction in the specific calculation process and the subsequent specific calculation mode, so that the failure condition of the residence data can be reduced; in the subsequent calculation, the method is beneficial to reducing the occurrence of the condition of input data reading errors caused by a calculation instruction calling mode.
In a possible embodiment, taking a BatchNorm calculation as an example, if the on-chip cache includes at least two of NRAM, SRAM, WRAM, that is, includes a plurality of cache spaces, and if an instruction for invoking the BatchNorm calculation needs to be invoked by an NRAM, then according to the first storage order, the input data residing in the SRAM and/or the WRAM may be read and spliced with the input data residing in the NRAM to obtain a part of the input data, and the part of the input data resides in the NRAM; further, the subsequent second calculation may be performed as described in the above method, and will not be described herein.
In a possible embodiment, suppose the on-chip cache includes NRAM and SRAM and the instruction for invoking the BatchNorm calculation must be invoked through the NRAM. If the current cache space in the NRAM is insufficient to hold all the input data residing in the SRAM, second input data of the same size as the current free cache space of the NRAM may be read from the SRAM, and a third storage order between this second input data and the third input data (the input data remaining in the SRAM other than the second input data) is recorded. The second input data is spliced with the first input data in the NRAM according to the first storage order and resides in the NRAM, and the spliced first and second input data are then subjected to the second calculation with the first calculation result to obtain a fourth calculation result. Next, the remaining third input data residing in the SRAM may be transferred or moved to the NRAM and subjected to the second calculation with the first calculation result to obtain a fifth calculation result. Finally, the fourth calculation result and the fifth calculation result are spliced according to the third storage order to obtain the second result data corresponding to the partial input data.
It should be noted that the number of the on-chip caches is merely an example, and if there are a plurality of on-chip caches, the method may also be adopted, which is not described herein again.
Therefore, in the application, the second calculation can be preferentially performed no matter how many on-chip memories are included, and after the calculation is completed, the splicing of the calculation results corresponding to the second calculation can be completed according to the storage sequence among the data, so that the final calculation result is obtained.
In one possible example, when the remaining input data is read from the off-chip storage, an offset of the remaining input data within all data stored off-chip may be determined; a starting position of the remaining input data in the off-chip storage, i.e., the position of the first piece of the remaining input data, is determined according to the offset; starting from this starting position, the remaining input data is read from the off-chip storage up to the position corresponding to the last piece of the remaining input data.
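A minimal sketch of this offset-based read, assuming the off-chip storage is a flat bytes-like buffer, the resident input data occupy a contiguous prefix, and all elements share a fixed size; the function and parameter names are illustrative assumptions.

```python
def read_remaining(off_chip, resident_count, elem_size, total_count):
    """Compute the byte offset of the first non-resident element and read
    from there up to the end of the last element."""
    offset = resident_count * elem_size   # starting position of the remaining data
    end = total_count * elem_size         # position just after the last element
    return off_chip[offset:end]
```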
In one possible example, the input data is multidimensional tensor data including four dimensions N, H, W and C; when the data amount of the multidimensional tensor data is greater than or equal to a preset threshold, the method may further include the following steps: splitting the multidimensional tensor data into a plurality of segments in the C dimension, and calculating, by the computing device, each segment of the split multidimensional tensor data to obtain the calculation result corresponding to the multidimensional tensor data.
Here the C dimension is treated as the low dimension, and the other dimensions N, H and W are treated as high dimensions without distinction; because the calculation is independent along the C dimension, the data can be split in the C dimension.
The preset threshold may be set by the user or be a system default, which is not limited herein. For example, the C dimension of the input data may be split according to the size of the input data: when the input data is large, loading it in full occupies more bandwidth and affects calculation efficiency. The preset threshold may be used to distinguish data of a small order of magnitude from data of a large order of magnitude, where data exceeding the preset threshold is regarded as data of a large order of magnitude. For example, if the preset threshold is set to 10M, input data whose data amount exceeds 10M is regarded as data of a large order of magnitude, and data whose data amount is less than 10M is regarded as data of a small order of magnitude. Optionally, the splitting may also be performed according to the computing unit, i.e., according to the size of the data that the computing unit can process at a time, so as to avoid wasting the computing unit, and so on.
It should be noted that the embodiment of the present application is mainly directed at input data of a large order of magnitude; the above multidimensional data may be multidimensional tensor data with four dimensions, namely the NHWC dimensions, such as the tensor data X in the above formulas (1)-(6).
In a specific implementation, since the tensor data is multidimensional and, in most cases, cannot all reside on chip using the method described in the above embodiment, the multidimensional tensor data can be split into multiple segments.
Further, fig. 5c shows a schematic structural diagram of the multidimensional tensor data, where the input data is multidimensional tensor data. Still taking the BatchNorm calculation as an example, the splitting process of the input data is described. Since the entire BatchNorm calculation is independent along the C dimension, a cyclic BatchNorm calculation can be performed on each segment of the multidimensional tensor data obtained by splitting in the C dimension, and as shown in the figure, each Task can correspond to one BatchNorm calculation. The overall data is split into multiple segments in the C dimension (corresponding to Job1, Job2, ..., Jobn in the figure); before each second calculation, the corresponding segment of multidimensional tensor data may reside in the on-chip cache; the segments are then calculated cyclically along the C dimension (Task1, Task2, ... in the figure), where the specific calculation process for each segment is consistent with that described in steps S510-S530 above; finally, after each segment of the multidimensional tensor data has been calculated, the results are spliced along the C dimension to obtain the calculation result corresponding to the overall BatchNorm.
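A minimal Python sketch of this C-dimension splitting loop, assuming NumPy arrays in NHWC layout and using the callables first_calc and second_calc as stand-ins for the two BatchNorm phases; the residence of each segment in the on-chip cache is only indicated by a comment, since the actual placement is hardware-specific.

```python
import numpy as np

def batchnorm_over_c_segments(x_nhwc, num_segments, first_calc, second_calc):
    """Split along C (the last NHWC axis), process each segment as one Task,
    and splice the per-segment results back together along C."""
    segments = np.array_split(x_nhwc, num_segments, axis=3)
    results = []
    for seg in segments:
        first_result = first_calc(seg)          # e.g. per-channel mean and variance
        # In the embodiment the segment would reside in the on-chip cache here,
        # so the second calculation reuses it without reloading from off-chip storage.
        results.append(second_calc(seg, first_result))
    return np.concatenate(results, axis=3)      # splice the results along C
```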
Therefore, in the present application, a multidimensional tensor of large magnitude can be split in the C dimension, which reduces the data processing overhead, that is, the process of reading the multidimensional tensor data from off-chip storage, and improves the overall data residence rate.
In one possible example, the method may further include the following steps: determining the jump step length and the loading segment length of the split according to the data bandwidth; determining an adjustment coefficient according to the jump step length and the loading segment length; and determining, according to the adjustment coefficient, the number of segments into which the multidimensional tensor data is split in the C dimension.
Considering that splitting the multidimensional tensor data in the C dimension reduces the bandwidth of subsequently loading data, a balance point at which the overall computing performance is best needs to be found.
The jump step length S may refer to the step needed to jump from one segment to the next, and the loading segment length L may refer to the segment length corresponding to loading one segment of multidimensional tensor data for calculation.
In a specific implementation, the adjustment coefficient m may be the quotient of the jump step length S and the loading segment length L, i.e., the adjustment coefficient m = jump step length S / loading segment length L. The larger the adjustment coefficient m, the smaller the bandwidth of loading data; therefore, the loading bandwidth can be tested under different adjustment coefficients to determine the balance point, i.e., the number of segments into which the multidimensional tensor data is split in the C dimension, so that the overall performance is optimal.
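The following sketch shows one way such a balance point could be searched for, assuming the jump step is the distance skipped between consecutive loads belonging to the same segment and measure_bandwidth is a hypothetical profiling hook; the formula for the jump step and the set of candidate counts are illustrative assumptions, not the embodiment's procedure.

```python
def choose_segment_count(total_c, candidate_counts, measure_bandwidth):
    """For each candidate segment count, derive the jump step S and loading
    segment length L, compute m = S / L, and keep the count whose measured
    loading bandwidth is best."""
    best_count, best_bw = None, 0.0
    for n in candidate_counts:
        seg_len = total_c // n          # loading segment length L (in C elements)
        jump = total_c - seg_len        # assumed jump step S between loads
        m = jump / seg_len              # adjustment coefficient: larger m, lower bandwidth
        bw = measure_bandwidth(n, m)    # hypothetical profiling hook
        if bw > best_bw:
            best_count, best_bw = n, bw
    return best_count
```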
It can be seen that, in the data processing method described in the embodiments of the present application, a plurality of input data may be read into the on-chip cache and a first calculation performed to obtain a first calculation result; the plurality of input data reside in the on-chip cache; and the plurality of input data in the on-chip cache and the first calculation result are read to perform a second calculation, so as to obtain the calculation results of the plurality of input data. In this way, when the second calculation is performed, the resident data can be obtained directly from the on-chip cache, and the input data does not need to be read from the off-chip storage into the on-chip cache again; since the bandwidth available for reading data from the on-chip cache is far greater than the bandwidth available for reading data from the off-chip storage, and the data does not need to be loaded from the off-chip storage multiple times, the calculation performance is improved.
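To make the overall two-pass flow concrete, the following sketch uses a BatchNorm forward pass (mean and variance as the first calculation, normalization as the second) with a local NumPy array standing in for the on-chip cache; it is a behavioural illustration under these assumptions, not the on-chip implementation.

```python
import numpy as np

def two_pass_batchnorm_forward(x_off_chip, eps=1e-5):
    """Load the NHWC inputs once, run the first calculation, keep the inputs
    'resident' (a local variable stands in for the on-chip cache), then run the
    second calculation without re-reading off-chip data."""
    on_chip = np.array(x_off_chip)           # read inputs into the on-chip cache once
    mean = on_chip.mean(axis=(0, 1, 2))      # first calculation: per-channel mean ...
    var = on_chip.var(axis=(0, 1, 2))        # ... and variance (the first result)
    # Second calculation reuses the resident data instead of reloading it off chip.
    return (on_chip - mean) / np.sqrt(var + eps)
```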
The foregoing description of the embodiments of the present application has been presented primarily in terms of a method-side implementation. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the functional units of the electronic device according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
Referring to fig. 6, fig. 6 is a block diagram illustrating functional units of a computing device 600 according to an embodiment of the present application, and as shown in fig. 6, the computing device 600 includes: an on-chip cache 610 and a processor 620, the device being coupled to off-chip storage, the off-chip storage having a plurality of input data;
the processor 620 is configured to read the plurality of input data into the on-chip cache and perform a first calculation to obtain a first calculation result;
the on-chip cache 610 is configured to reside the plurality of input data; the processor 620 is further configured to perform a second calculation on the plurality of input data in the on-chip cache and the first calculation result to obtain a calculation result of the plurality of input data.
It can be seen that the embodiment of the present application provides a computing device, which can read a plurality of input data into the on-chip cache and perform a first calculation to obtain a first calculation result; reside the plurality of input data in the on-chip cache; and read the plurality of input data in the on-chip cache and the first calculation result to perform a second calculation, so as to obtain the calculation results of the plurality of input data. In this way, when the second calculation is performed, the resident data can be obtained directly from the on-chip cache, and the input data does not need to be read from the off-chip storage into the on-chip cache again; since the bandwidth available for reading data from the on-chip cache is far greater than the bandwidth available for reading data from the off-chip storage, and the data does not need to be loaded from the off-chip storage multiple times, the calculation performance is improved.
Optionally, the on-chip cache includes a plurality of cache spaces; in terms of reading the plurality of input data in the on-chip cache and the first calculation result to perform the second calculation to obtain the calculation result of the plurality of input data, the processor 620 is specifically configured to:
determining an order in which the plurality of input data reside in the plurality of cache spaces;
and reading a plurality of input data in the on-chip cache and the first calculation result according to the sequence to perform second calculation so as to obtain calculation results of the plurality of input data.
Optionally, the size of the cache space corresponding to the on-chip cache is a first cache space; the size of the needed cache space corresponding to the plurality of input data is a second cache space; the processor 620 is further specifically configured to, prior to said residing the plurality of input data in the on-chip cache:
judging whether the first cache space is smaller than the second cache space or not;
and if the first cache space is smaller than the second cache space, part of input data with the same size as the first cache space is resided in the on-chip cache.
Optionally, the processor 620 is specifically further configured to:
performing second calculation on the part of input data and the first calculation result which reside in the on-chip cache to obtain a second calculation result;
reading the residual input data except the part of input data residing in the on-chip cache from the off-chip memory, and performing second calculation on the first calculation result to obtain a third calculation result;
and obtaining the calculation results of the plurality of input data according to the second calculation result and the third calculation result.
Optionally, the input data is multidimensional tensor data including four dimensions N, H, W and C;
when the data amount of the multidimensional tensor data is greater than or equal to a preset threshold, the processor 620 is specifically further configured to:
splitting the multidimensional tensor data into a plurality of sections in the dimension C, and calculating each section of split multidimensional tensor data by the calculating device to obtain a calculation result corresponding to the multidimensional tensor data.
Optionally, the processor 620 is specifically further configured to:
determining the split jumping step length and loading section length according to the data bandwidth;
determining an adjusting coefficient according to the jumping step length and the loading section length;
and determining the number of segments of the multi-dimensional tensor data split in the C dimension according to the adjustment coefficient.
It may be appreciated that the functions of each program module of the computing apparatus in the embodiments of the present application may be specifically implemented according to the method in the embodiments of the method, and the specific implementation process may refer to the relevant description of the embodiments of the method, which is not repeated herein.
The present application also provides a computer storage medium storing a computer program for data processing, the computer program causing a computer to execute part or all of the steps of any one of the methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the methods described in the method embodiments above. The computer program product may be a software installation package.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, where, as shown in fig. 7, the electronic device includes a memory, an input device, an output device, and a processor, where the electronic device may further include a communication bus, and the processor, the input device, the output device, and the memory may be connected to each other through the bus. Optionally, the electronic device may further include an instruction storage unit, where the instruction storage unit is disposed adjacent to the processor. Further alternatively, the instruction storage unit is integrated with the processor, i.e. the instruction storage unit is an on-chip storage unit of the processor. Thus, when the processor needs to execute the program in the memory, the electronic device first loads the program in the memory to the instruction storage unit, and then the processor can access the instruction storage unit to execute the program in the instruction storage unit.
The processor is configured to implement the following steps when executing the program stored in the memory:
reading the plurality of input data into the on-chip cache and performing first calculation to obtain a first calculation result;
residing the plurality of input data in the on-chip cache;
and reading the plurality of input data in the on-chip cache and the first calculation result to perform second calculation so as to obtain calculation results of the plurality of input data.
Further, the processor may be a central processing unit (Central Processing Unit, CPU), an intelligence processing unit (Intelligence Processing Unit, NPU), a graphics processing unit (Graphics Processing Unit, GPU) or an image processing unit (Image Processing Unit), which is not limited in this application. Depending on the processor used, the data processing method provided by the embodiments of the present application can be applied to artificial intelligence application fields such as image recognition, deep learning, computer vision, intelligent robotics and natural language processing, and can execute complex functional programs in the artificial intelligence field.
Further, depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a computing device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vision terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of actions described. Thus, one of ordinary skill in the art will appreciate in light of the present disclosure or teachings that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or some aspects of this disclosure. In addition, the description of some embodiments of the present disclosure is also focused on, depending on the scenario. In view of this, those skilled in the art will appreciate that portions of one embodiment of the disclosure that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are divided herein by taking into account the logic function, and there may be other manners of dividing the units when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected, as needed, to achieve the objectives of the embodiments of the disclosure. Furthermore, in some scenarios, multiple units in the embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. The integrated unit may be stored in a computer readable memory if implemented in the form of software program modules and sold or used as a stand alone product. In this regard, when the aspects of the present disclosure are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory, which may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described by the embodiments of the present disclosure. The aforementioned Memory may include, but is not limited to, a usb disk, a flash disk, a Read Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, etc. various media capable of storing program codes.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e. as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPU, GPU, FPGA, DSP and ASICs, etc. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, variable resistance memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
The foregoing may be better understood in light of the following clauses:
clause a1. A data processing method applied to a computing device, the computing device comprising an on-chip cache, the computing device being coupled to an off-chip storage, the off-chip storage having a plurality of input data, comprising:
reading the plurality of input data into the on-chip cache and performing first calculation to obtain a first calculation result;
residing the plurality of input data in the on-chip cache;
and reading the plurality of input data in the on-chip cache and the first calculation result to perform second calculation so as to obtain calculation results of the plurality of input data.
A2. The method of A1, the on-chip cache includes a plurality of cache spaces, and the reading the plurality of input data in the on-chip cache and the first calculation result to perform a second calculation to obtain a calculation result of the plurality of input data includes:
determining an order in which the plurality of input data reside in the plurality of cache spaces;
and reading a plurality of input data in the on-chip cache and the first calculation result according to the sequence to perform second calculation so as to obtain calculation results of the plurality of input data.
A3. The method according to A1 or A2, wherein the size of the cache space corresponding to the on-chip cache is a first cache space; the size of the needed cache space corresponding to the plurality of input data is a second cache space; before said residing the plurality of input data in the on-chip cache, the method further comprises:
judging whether the first cache space is smaller than the second cache space or not;
and if the first cache space is smaller than the second cache space, part of input data with the same size as the first cache space is resided in the on-chip cache.
A4. The method according to A3, further comprising:
performing second calculation on the part of input data and the first calculation result which reside in the on-chip cache to obtain a second calculation result;
reading the residual input data except the part of input data residing in the on-chip cache from the off-chip memory, and performing second calculation on the first calculation result to obtain a third calculation result;
and obtaining the calculation results of the plurality of input data according to the second calculation result and the third calculation result.
A5. The method of A1 or A2, the input data being multi-dimensional tensor data comprising N, H, W, C four dimensions;
When the data amount of the multidimensional tensor data is greater than or equal to a preset threshold value, the method further comprises the following steps:
splitting the multidimensional tensor data into a plurality of sections in the dimension C, and calculating each section of split multidimensional tensor data by the calculating device to obtain a calculation result corresponding to the multidimensional tensor data.
A6. The method according to A5, further comprising:
determining the split jumping step length and loading section length according to the data bandwidth;
determining an adjusting coefficient according to the jumping step length and the loading section length;
and determining the number of segments of the multi-dimensional tensor data split in the C dimension according to the adjustment coefficient.
A7. The method of any of A1-A6, the on-chip cache comprising one or more of NRAM, SRAM, WRAM.
A8. The method according to A1, wherein the method is used for a BatchNorm calculation, the BatchNorm calculation including a forward calculation; the first calculation result includes: the mean value corresponding to the plurality of input data and the variance corresponding to the plurality of input data.
A9. The method according to any one of A1-A8, wherein the method is used for a BatchNorm calculation, the BatchNorm calculation including a reverse calculation; the first calculation result includes: a first differential result and a second differential result.
A10. A computing device, the device comprising: an on-chip cache and a processor; the device is connected with an off-chip storage, and the off-chip storage stores a plurality of input data;
the processor is used for reading the plurality of input data into the on-chip cache and performing first calculation to obtain a first calculation result;
the on-chip cache is used for residing the plurality of input data; the processor is further configured to perform a second calculation on the plurality of input data in the on-chip cache and the first calculation result, so as to obtain a calculation result of the plurality of input data.
A11. The device according to A10, wherein, in terms of reading the plurality of input data in the on-chip cache and the first calculation result to perform the second calculation to obtain the calculation result of the plurality of input data, the processor is specifically configured to:
determining an order in which the plurality of input data reside in the on-chip cache;
and reading a plurality of input data in the on-chip cache and the first calculation result according to the sequence to perform second calculation so as to obtain calculation results of the plurality of input data.
A12. The device according to A10 or A11, wherein the size of the cache space corresponding to the on-chip cache is a first cache space; the size of the needed cache space corresponding to the plurality of input data is a second cache space; the processor is further specifically configured to, prior to said residing the plurality of input data in the on-chip cache:
judging whether the first cache space is smaller than the second cache space or not;
and if the first cache space is smaller than the second cache space, part of input data with the same size as the first cache space is resided in the on-chip cache.
A13. The device according to A12, wherein the processor is specifically further configured to:
performing second calculation on the part of input data and the first calculation result which reside in the on-chip cache to obtain a second calculation result;
reading the residual input data except the part of input data residing in the on-chip cache from the off-chip memory, and performing second calculation on the first calculation result to obtain a third calculation result;
and obtaining the calculation results of the plurality of input data according to the second calculation result and the third calculation result.
A14. The device according to A10 or A11, wherein the input data is multi-dimensional tensor data comprising N, H, W, C four dimensions;
when the data amount of the multidimensional tensor data is greater than or equal to a preset threshold, the processor is specifically further configured to:
splitting the multidimensional tensor data into a plurality of sections in the dimension C, and calculating each section of split multidimensional tensor data by the calculating device to obtain a calculation result corresponding to the multidimensional tensor data.
A15. The device according to A14, wherein the processor is specifically further configured to:
determining the split jumping step length and loading section length according to the data bandwidth;
determining an adjusting coefficient according to the jumping step length and the loading section length;
and determining the number of segments of the multi-dimensional tensor data split in the C dimension according to the adjustment coefficient.
B1. A neural network chip comprising means for performing any one of clauses A1-A9.
C1. A computer readable storage medium storing a computer program for data exchange, wherein the computer program, when executed by a processor, implements the method of any one of clauses A1-A9.
D1. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of clauses A1-A9.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that the same are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and are therefore to cover all equivalents or alternatives falling within the scope of these claims.

Claims (13)

1. A data processing method, applied to a computing device, the computing device including an on-chip cache, the computing device being coupled to an off-chip storage, the off-chip storage having a plurality of input data, the method comprising:
reading the plurality of input data into the on-chip cache and performing first calculation to obtain a first calculation result;
residing the plurality of input data in the on-chip cache;
and reading the plurality of input data in the on-chip cache and the first calculation result to perform second calculation so as to obtain calculation results of the plurality of input data.
2. The method of claim 1, wherein the on-chip cache includes a plurality of cache spaces, and wherein the reading the plurality of input data in the on-chip cache and the first calculation result to perform a second calculation to obtain a calculation result of the plurality of input data includes:
determining an order in which the plurality of input data reside in the plurality of cache spaces;
and reading a plurality of input data in the on-chip cache and the first calculation result according to the sequence to perform second calculation so as to obtain calculation results of the plurality of input data.
3. The method according to claim 1 or 2, wherein the size of the cache space corresponding to the on-chip cache is a first cache space; the size of the needed cache space corresponding to the plurality of input data is a second cache space; before said residing said plurality of input data in said on-chip cache, said method further comprises:
judging whether the first cache space is smaller than the second cache space or not;
and if the first cache space is smaller than the second cache space, part of input data with the same size as the first cache space is resided in the on-chip cache.
4. A method according to claim 3, characterized in that the method further comprises:
performing second calculation on the part of input data and the first calculation result which reside in the on-chip cache to obtain a second calculation result;
reading the residual input data except the part of input data residing in the on-chip cache from the off-chip memory, and performing second calculation on the first calculation result to obtain a third calculation result;
and obtaining the calculation results of the plurality of input data according to the second calculation result and the third calculation result.
5. The method of claim 1 or 2, wherein the input data is multi-dimensional tensor data comprising N, H, W, C dimensions;
when the data amount of the multidimensional tensor data is greater than or equal to a preset threshold, the method further comprises:
splitting the multidimensional tensor data into a plurality of sections in the dimension C, and calculating each section of split multidimensional tensor data by the calculating device to obtain a calculation result corresponding to the multidimensional tensor data.
6. The method of claim 5, wherein the method further comprises:
determining the split jumping step length and loading section length according to the data bandwidth;
determining an adjusting coefficient according to the jumping step length and the loading section length;
and determining the number of segments of the multi-dimensional tensor data split in the C dimension according to the adjustment coefficient.
7. The method of any of claims 1-6, wherein the on-chip cache comprises one or more of NRAM, SRAM, WRAM.
8. The method of any one of claims 1-7, wherein the method is for a batch norm calculation, the batch norm calculation comprising a forward calculation; the first calculation result includes: the mean value corresponding to the plurality of input data and the variance corresponding to the plurality of input data.
9. The method of any one of claims 1-8, wherein the method is used for a batch norm calculation, the batch norm calculation comprising a reverse calculation; the first calculation result includes: a first differential result and a second differential result.
10. A computing device, the device comprising: an on-chip cache and a processor; the device is connected with an off-chip storage, and the off-chip storage stores a plurality of input data;
the processor is used for reading the plurality of input data into the on-chip cache and performing first calculation to obtain a first calculation result;
the on-chip cache is used for residing the plurality of input data; the processor is further configured to perform a second calculation on the plurality of input data in the on-chip cache and the first calculation result, so as to obtain a calculation result of the plurality of input data.
11. A combination processing device, characterized in that it comprises the computing device of claim 10.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for data exchange, which computer program, when executed by a processor, implements the method according to any one of claims 1-9.
13. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1-9.
CN202111439229.1A 2021-11-29 2021-11-29 Data processing method, device, storage medium and electronic equipment Pending CN116185942A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111439229.1A CN116185942A (en) 2021-11-29 2021-11-29 Data processing method, device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111439229.1A CN116185942A (en) 2021-11-29 2021-11-29 Data processing method, device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116185942A true CN116185942A (en) 2023-05-30

Family

ID=86442898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111439229.1A Pending CN116185942A (en) 2021-11-29 2021-11-29 Data processing method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116185942A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118070865A (en) * 2024-04-25 2024-05-24 北京壁仞科技开发有限公司 Optimization method and device of artificial intelligent model, electronic equipment and storage medium
CN118070865B (en) * 2024-04-25 2024-07-23 北京壁仞科技开发有限公司 Optimization method and device of artificial intelligent model, electronic equipment and storage medium
CN118170714A (en) * 2024-05-13 2024-06-11 北京壁仞科技开发有限公司 Method, computing device, medium and program product for accelerating computation
CN118277331A (en) * 2024-06-03 2024-07-02 北京壁仞科技开发有限公司 Computing device, method of performing normalized class operations in a computing device, computer-readable storage medium, and computer program product

Similar Documents

Publication Publication Date Title
CN116185942A (en) Data processing method, device, storage medium and electronic equipment
CN112799726B (en) Data processing device, method and related product
WO2023045446A1 (en) Computing apparatus, data processing method, and related product
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN112799599A (en) Data storage method, computing core, chip and electronic equipment
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN115952848A (en) Convolution operation circuit, compiling method and related product
CN113238976B (en) Cache controller, integrated circuit device and board card
CN113238975A (en) Memory, integrated circuit and board card for optimizing parameters of deep neural network
CN112801276A (en) Data processing method, processor and electronic equipment
CN112948001A (en) Method for setting tensor hardware configuration, readable storage medium and device
CN113742266B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113791996B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113792867B (en) Arithmetic circuit, chip and board card
CN118210552A (en) Instruction generation method, device and storage medium
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN117742566A (en) Memory access processing device, processor, chip, board card and instruction execution method
CN114647442A (en) Apparatus operating according to instruction set
CN114625370A (en) Method, device and heterogeneous system for data layout between host and device
CN114648438A (en) Apparatus, method, and readable storage medium for processing image data
CN114429194A (en) Device, board card, method and readable storage medium for processing neural network calculation
CN112486775A (en) Method for counting module throughput and readable storage medium
CN114444677A (en) Device, board card and method for sparse training and readable storage medium
CN113469328A (en) Device, board card, method and readable storage medium for executing revolution crossing
CN113469327A (en) Integrated circuit device for executing advance of revolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination