CN114648438A - Apparatus, method, and readable storage medium for processing image data


Info

Publication number: CN114648438A
Application number: CN202011496888.4A
Authority: CN (China)
Legal status: Pending
Prior art keywords: unit, image data, computing device, matrix, data
Other languages: Chinese (zh)
Inventor: not disclosed
Current assignee: Anhui Cambricon Information Technology Co Ltd
Original assignee: Anhui Cambricon Information Technology Co Ltd
Filing date: 2020-12-17
Priority date: 2020-12-17

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention relates to a device, a board card, a method, and a readable storage medium for processing image data. The computing device of the invention is included in an integrated circuit device that also comprises a universal interconnection interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by the user. The integrated circuit device may further comprise a storage device connected to the computing device and the other processing devices for storing data of the computing device and the other processing devices.

Description

Apparatus, method, and readable storage medium for processing image data
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to an apparatus, method, and readable storage medium for processing image data.
Background
In conventional CPU, GPU, and DSP instruction sets, the goal of maximizing programmability leads to designs in which one instruction completes one action, such as reduced instruction set computer (RISC) designs or very long instruction word (VLIW) designs. Such instructions are collectively referred to as single operation instructions, or single instruction stream, single data stream (SISD): during execution, the instruction unit decodes only one instruction at a time and provides only one piece of data to the operation unit.
Because a single operation instruction cannot process batch data at one time, SIMD (single instruction, multiple data) instruction sets were developed for CPUs, in which one instruction operates on multiple pieces of data; they mainly support parallel operations on small, fragmented data. Taking image processing as an example, common image data types are formats such as RGB565, RGBA8888, and YUV422, characterized in that one component of one pixel uses at most 8 bits, while a CPU register unit is generally 32 or 64 bits. If single operation instructions were used for control, processing one 8-bit pixel would occupy a 32-bit or 64-bit register, wasting resources; a SIMD instruction can process 4 or 8 pixels at a time and complete 4 or 8 operations synchronously, fully utilizing the register space and improving computational efficiency severalfold.
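To make the register-utilization point above concrete, the following is a minimal C++ sketch (not part of the patent) that packs four 8-bit pixel components into one 32-bit word and adds two such words lane by lane, the way a SIMD byte-add instruction would; the SWAR masking keeps carries from spilling across the 8-bit lanes.

```cpp
#include <cstdint>
#include <cstdio>

// Add four packed 8-bit lanes with one 32-bit operation (wrap-around per lane).
// H marks the high bit of each byte: masking it off before the add prevents
// carries from crossing lane boundaries, and the final XOR restores the high
// bits. This is the classic SWAR byte-add identity.
uint32_t add_packed_u8x4(uint32_t a, uint32_t b) {
    const uint32_t H = 0x80808080u;
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}

int main() {
    uint32_t px0 = 0x10203040u;  // four 8-bit components packed into 32 bits
    uint32_t px1 = 0x01020304u;
    std::printf("%08x\n", add_packed_u8x4(px0, px1));  // prints 11223344
    return 0;
}
```

Where a single operation instruction would spend one 32-bit register on each 8-bit component, all four lanes here travel and add together in one word.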
With the rapid development of artificial intelligence, more and more dedicated artificial intelligence processors have emerged, and the various application scenarios of neural networks (such as image processing) require large numbers of repeated tasks, such as data transfer, matrix multiplication, and matrix addition. If only single operation instructions are used for control, hardware resources cannot be used concurrently, so an artificial intelligence processing scheme capable of executing multi-operation instructions is urgently needed.
Disclosure of Invention
To at least partially solve the technical problems noted in the background, aspects of the present invention provide an apparatus, method, and readable storage medium for processing image data.
In one aspect, a computing device for processing image data using a neural network is connected to an off-chip memory and includes a shared storage unit and a processor core. The shared storage unit is used to load image data from the off-chip memory. The processor core comprises an operation module and a control module: the operation module comprises a pre-conversion unit, a matrix operation unit, and a post-conversion unit, and the control module is used to input the image data to the pre-conversion unit, the matrix operation unit, and the post-conversion unit for operation according to a multi-operation instruction so as to generate an operation result. Finally, the shared storage unit stores the operation result.
In another aspect, the present invention discloses an integrated circuit device including the aforementioned computing device, and a board card including such an integrated circuit device.
In another aspect, the present invention discloses a method of processing image data using a computing device that includes a pre-conversion unit, a matrix operation unit, and a post-conversion unit. The method comprises the following steps: loading image data from an off-chip memory; inputting the image data to the pre-conversion unit, the matrix operation unit, and the post-conversion unit for operation according to a multi-operation instruction so as to generate an operation result; and storing the operation result to the off-chip memory.
In another aspect, the present invention discloses a computer readable storage medium having stored thereon computer program code for processing image data, which when executed by a processing apparatus, performs the method as described above.
The invention uses multi-operation instructions to process operation sequences with data dependences and to process image data in batch; no additional processor structure design or compiler optimization technique is needed to resolve the data dependences, and the execution speed of the operation sequences is greatly improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a structural diagram showing a board card according to an embodiment of the present invention;
FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 3 is a schematic diagram showing the internal structure of a computing device of an embodiment of the invention;
FIG. 4 is a schematic diagram showing the internal structure of a processor core of an embodiment of the invention;
FIG. 5 is a schematic diagram showing one processor core writing data to a processor core of another cluster;
FIG. 6 is a schematic diagram showing image data being loaded from DRAM to SRAM and then to NRAM;
FIG. 7 is a schematic diagram showing the internal structure of a processor core of another embodiment of the invention;
FIG. 8 is a schematic diagram showing the internal structure of a processor core of another embodiment of the present invention;
FIG. 9 is a flow chart illustrating processing image data according to another embodiment of the present invention;
FIG. 10 is a flow chart illustrating processing image data according to another embodiment of the present invention; and
FIG. 11 is a flow chart illustrating processing image data according to another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this specification refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in fig. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial intelligence arithmetic unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of fields such as computer vision, speech, natural language processing, and data mining under complex scenarios. Deep learning technology in particular is widely applied in the field of cloud intelligence, and one remarkable characteristic of cloud intelligence applications is the large input data volume, which places high requirements on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a storage device 104 for storing data, which includes one or more memory units 105. The storage device 104 is connected to the control device 106 and the chip 101 through a bus and transfers data with them. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to a storage device on the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into a control cache on the computing device 201. Alternatively, the interface device 202 may also read data from a storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present invention considered alone may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The DRAM204 is used for storing data to be processed, and is an off-chip memory, generally 16G or larger in size, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3 shows the internal structure of the computing device 201. The computing device 201 is used for processing input data such as computer vision, speech, natural language, and data mining tasks, and adopts a multi-core hierarchical design: as a system on chip, it comprises a plurality of clusters, and each cluster in turn comprises a plurality of processor cores. In other words, the computing device 201 is organized as a system-on-chip / cluster / processor-core hierarchy.
Looking at the system-on-chip hierarchy, as shown in FIG. 3, the computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be multiple external memory controllers 301 (2 are shown by way of example in the figure) for accessing an external memory device, such as the DRAM 204 in fig. 2, to read data from or write data to off-chip memory in response to access requests issued by the processor cores. The peripheral communication module 302 is used to receive control signals from the processing device 203 through the interface device 202 and to start the computing device 201 to execute a task. The on-chip interconnect module 303 connects the external memory controllers 301, the peripheral communication module 302, and the plurality of clusters 305, and transmits data and control signals between the respective modules. The synchronization module 304 is a global synchronization barrier controller (GBC) for coordinating the working progress of the clusters and ensuring synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201; 4 are exemplarily shown in the figure, and as hardware evolves, the computing device 201 of the present invention may further include 8, 16, 64, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
Looking at the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU core)306 and a memory core (MEM core) 307.
The number of the processor cores 306 is exemplarily shown as 4 in the figure, and the present invention does not limit the number of the processor cores 306. The internal architecture is shown in fig. 4. Each processor core 306 includes three major modules: a control module 41, an arithmetic module 42 and a storage module 43.
The control module 41 is used for coordinating and controlling the operations of the operation module 42 and the storage module 43 to complete the deep learning task, and includes an Instruction Fetch Unit (IFU) 411 and an Instruction Decode Unit (IDU) 412. The instruction fetch unit 411 is used to obtain an instruction from the processing device 203, and the instruction decode unit 412 decodes the obtained instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421, a pre-conversion unit 422, a matrix operation unit 423, and a post-conversion unit 424. The vector operation unit 421 performs vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation. The pre-conversion unit 422 converts floating-point numbers into fixed-point numbers; the matrix operation unit 423 is responsible for the core computations of the deep learning algorithm, namely matrix multiplication, matrix addition, and convolution on the fixed-point numbers; and the post-conversion unit 424 converts the matrix operation result from fixed-point numbers back to floating-point numbers.
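As a rough illustration of this data path, the following C++ sketch models the pre-conversion, matrix operation, and post-conversion stages with a simple symmetric fixed-point scaling; the scale handling and the int16/int32 widths are illustrative assumptions, not the hardware's actual behavior.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Pre-conversion unit (422): quantize floating-point data to fixed point.
std::vector<int16_t> pre_convert(const std::vector<float>& x, float scale) {
    std::vector<int16_t> q(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        q[i] = static_cast<int16_t>(std::lround(x[i] * scale));
    return q;
}

// Matrix operation unit (423): fixed-point matrix multiply with int32
// accumulation. a is m x k, b is k x n, and the result is m x n.
std::vector<int32_t> matmul(const std::vector<int16_t>& a,
                            const std::vector<int16_t>& b,
                            int m, int k, int n) {
    std::vector<int32_t> c(static_cast<size_t>(m) * n, 0);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)
            for (int j = 0; j < n; ++j)
                c[size_t(i) * n + j] +=
                    int32_t(a[size_t(i) * k + p]) * b[size_t(p) * n + j];
    return c;
}

// Post-conversion unit (424): convert the fixed-point result back to floats.
std::vector<float> post_convert(const std::vector<int32_t>& c,
                                float scale_a, float scale_b) {
    std::vector<float> y(c.size());
    for (size_t i = 0; i < c.size(); ++i)
        y[i] = float(c[i]) / (scale_a * scale_b);
    return y;
}
```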
The storage module 43 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access unit (IODMA) 433, and a transfer direct memory access unit (MVDMA) 434. The NRAM 431 stores the feature maps to be computed by the processor core 306 and the intermediate results after computation; the WRAM 432 stores the weights of the deep learning network; the IODMA 433 controls access between the NRAM 431/WRAM 432 and the DRAM 204 through the broadcast bus 309; and the MVDMA 434 controls access between the NRAM 431/WRAM 432 and the SRAM 308.
Returning to FIG. 3, the memory core 307 is primarily used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 306, as well as performing communication between a cluster 305 and the DRAM 204, communication among the clusters 305, communication among the processor cores 306, and the like. In other embodiments, the memory core 307 also has scalar computation capability for performing scalar operations.
The memory core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access unit (CDMA) 310, and a global direct memory access unit (GDMA) 311. The SRAM 308 acts as a high-performance data transfer station: data reused among different processor cores 306 in the same cluster 305 need not be fetched from the DRAM 204 by each processor core 306 separately, but is relayed among the processor cores 306 through the SRAM 308. The memory core 307 only needs to distribute the reused data from the SRAM 308 to the processor cores 306 quickly, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, the CDMA 310, and the GDMA 311 are used to perform communication among the processor cores 306, communication among the clusters 305, and data transfer between a cluster 305 and the DRAM 204, respectively. These will be described separately below.
The broadcast bus 309 is used to accomplish high-speed communication among the processor cores 306 in a cluster 305; the broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast refers to point-to-point (i.e., from a single processor core to a single processor core) data transfer; multicast is a communication mode that transfers one copy of data from the SRAM 308 to a specific number of processor cores 306; and broadcast, which transfers one copy of data from the SRAM 308 to all processor cores 306, is a special case of multicast.
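The three modes can be pictured with a small C++ sketch that models each core's local memory as a byte buffer; the function names are illustrative, not the bus's actual interface.

```cpp
#include <cstdint>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Unicast: one copy of SRAM data to a single core's local memory.
void unicast(const Buffer& sram, Buffer& nram) { nram = sram; }

// Multicast: one copy of SRAM data to a specific subset of cores.
void multicast(const Buffer& sram, const std::vector<Buffer*>& targets) {
    for (Buffer* nram : targets) *nram = sram;
}

// Broadcast: a special case of multicast that targets every core.
void broadcast(const Buffer& sram, std::vector<Buffer>& all_nrams) {
    for (Buffer& nram : all_nrams) nram = sram;
}
```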
CDMA 310 is used to control access to the SRAM 308 between different clusters 305 within the same computing device 201. Fig. 5 shows a schematic diagram of one processor core writing data to a processor core of another cluster, to illustrate the operating principle of CDMA 310. In this application scenario, the same computing device includes multiple clusters; for convenience of description, only cluster 0 and cluster 1 are shown in the figure, each including multiple processor cores, of which the figure shows only processor core 0 in cluster 0 and processor core 1 in cluster 1. Processor core 0 wants to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into the local SRAM 0. CDMA 0 acts as the master end and CDMA 1 acts as the slave end; the master pushes the write request to the slave, that is, the master sends the write address AW and the write data W, and the data is transferred into the SRAM 1 of cluster 1. The slave then returns a write response B as an acknowledgment. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data from SRAM 1.
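The sequence resembles an address/data/response write handshake. The C++ sketch below is purely illustrative (the struct names and the SRAM model are assumptions, not the patent's interface) and walks through the same master/slave steps: the master pushes AW and W, and the slave commits the data and returns B.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// Slave end (CDMA 1): receives AW and W, commits data into cluster 1's SRAM.
struct SlaveCdma {
    std::map<uint64_t, std::vector<uint8_t>>& sram1;
    bool handle_write(uint64_t aw, const std::vector<uint8_t>& w) {
        sram1[aw] = w;  // data lands in SRAM 1
        return true;    // write response B
    }
};

// Master end (CDMA 0): pushes the write request to the slave and waits for B.
struct MasterCdma {
    SlaveCdma& slave;
    bool push_write(uint64_t aw, const std::vector<uint8_t>& w) {
        return slave.handle_write(aw, w);
    }
};

int main() {
    std::map<uint64_t, std::vector<uint8_t>> sram1;
    SlaveCdma cdma1{sram1};
    MasterCdma cdma0{cdma1};
    // Data that processor core 0 previously wrote into its local SRAM 0.
    bool b = cdma0.push_write(0x1000, {1, 2, 3, 4});
    std::printf("write response B: %s\n", b ? "ok" : "error");
    return 0;
}
```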
Returning to fig. 3, the GDMA 311 cooperates with the external memory controller 301 to control access from the SRAM 308 of a cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be achieved via two channels. The first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433. The second channel transfers data between the DRAM 204 and the SRAM 308 via the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 via the MVDMA 434. Although the second channel seemingly requires more components and a longer data path, in some embodiments its bandwidth is substantially greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient over the second channel. An embodiment of the invention may select the data transmission channel according to hardware conditions.
In other embodiments, the functionality of GDMA311 and the functionality of IODMA 433 may be integrated in the same component. For convenience of description, the GDMA311 and the IODMA 433 are considered as different components, and it is within the scope of the present invention for those skilled in the art to achieve the same functions and achieve the same technical effects as the present invention. Further, the functions of GDMA311, IODMA 433, CDMA 310 and MVDMA 434 may be implemented by the same component.
Because the data flow and the operation flow of a neural-network image processing algorithm are relatively fixed and involve a large number of repeated operations, the instruction set supports instructions each of which completes multiple data transfers and/or multiple computation operations; that is, multi-operation instructions are utilized to process parallel data in batch.
One embodiment of the present invention is a system for processing image data using a neural network, the system having the devices shown in figs. 1 to 4. When processing image data, a large amount of data needs to be transferred between the DRAM 204 and the SRAM 308 and between the SRAM 308 and the NRAM 431/WRAM 432, and highly homogeneous operations such as matrix multiplication and matrix addition are carried out. If single operation instructions were used, one piece of data would be processed at a time, consuming a great deal of input/output, transfer, and computation time. This embodiment therefore provides multi-operation instructions in the instruction set and processes tasks of the same kind in batches, making image processing more efficient.
Taking the computing device 201 processing matrix addition or multiplication as an example, the computing device 201 first needs to load image data from the DRAM 204 into the SRAM 308. Image data typically occupies a whole block of memory space when stored in the DRAM 204. FIG. 6 is a schematic diagram of image data loaded from the DRAM 204 into the SRAM 308 and then into the NRAM 431: the image data occupies a P_X×P_Y storage space in the DRAM 204, and this space is continuous and complete. If the user loads the image data with single operation instructions, P_X×P_Y such instructions must be written, each controlling the movement of data at one address.
This embodiment uses a multi-operation instruction to load the entire image data into the SRAM 308 at once, i.e., one instruction controls the loading of the P_X×P_Y-sized image data into the SRAM 308. Since the multi-operation instruction does not involve data movement at individual addresses, the image data is loaded as one block into a P_X×P_Y storage space of the SRAM 308. Assume the SRAM 308 does not have a complete free P_X×P_Y space but two non-contiguous blocks, e.g., one block of P_X1×P_Y storage space and another of P_X2×P_Y storage space, where P_X1+P_X2=P_X. Then the multi-operation instruction cannot complete the task; that is, it cannot split the P_X×P_Y image data into P_X1×P_Y image data and P_X2×P_Y image data to be stored separately.
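To make the contiguity constraint concrete, here is a C++ sketch of a descriptor-style block load, modeling the multi-operation "load block" instruction as a single call; the descriptor fields and the capacity check are illustrative assumptions, not the patent's instruction encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

// One multi-operation "load block" descriptor: a single instruction moves the
// whole P_X x P_Y image instead of issuing P_X * P_Y single-address transfers.
struct BlockLoadDescriptor {
    const uint8_t* src;  // base address of the image in off-chip DRAM
    uint8_t* dst;        // base address of one contiguous block in SRAM
    size_t px, py;       // image dimensions
};

void load_block(const BlockLoadDescriptor& d, size_t dst_contiguous_capacity) {
    // The destination must be one contiguous P_X x P_Y region; the instruction
    // cannot split the image across two separate free blocks of SRAM.
    if (dst_contiguous_capacity < d.px * d.py)
        throw std::runtime_error("no contiguous P_X x P_Y block free in SRAM");
    std::memcpy(d.dst, d.src, d.px * d.py);  // one instruction, whole image
}
```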
When the image data is to be computed, the image data stored in the SRAM 308 needs to be split into a plurality of sub-graphs, which are transmitted to the processor cores 306 respectively for operation. Taking the example where each cluster 305 includes 4 processor cores 306, the image data is split into 4 sub-graphs, as shown in fig. 6, which are sent to the NRAM 431 of each processor core 306. To accomplish this, the processing device 203 may use single operation instructions to transfer the data address by address, or use a multi-operation instruction to make the broadcast bus 309 transfer one sub-graph at a time to the NRAM 431 of the storage module 43, with the corresponding weights transferred to the WRAM 432 in the same way.
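One plausible split is an even 2 x 2 tiling, sketched below in C++; the quadrant geometry is an assumption for illustration, since the text does not specify how the sub-graphs are cut.

```cpp
#include <array>
#include <cstddef>

struct SubGraph {
    size_t row0, col0;  // top-left corner inside the P_X x P_Y image
    size_t rows, cols;  // extent of this sub-graph
};

// Split a P_X x P_Y image into 4 quadrants, one per processor core's NRAM.
std::array<SubGraph, 4> split_into_subgraphs(size_t px, size_t py) {
    size_t hx = px / 2, hy = py / 2;
    return {{
        {0,  0,  hx,      hy},       // -> core 0
        {0,  hy, hx,      py - hy},  // -> core 1
        {hx, 0,  px - hx, hy},       // -> core 2
        {hx, hy, px - hx, py - hy},  // -> core 3
    }};
}
```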
For both matrix multiplication and matrix addition, the sub-graph needs to be converted into fixed-point numbers by the pre-conversion unit 422; the matrix operation unit 423 then performs matrix multiplication or matrix addition on the sub-graph data and the weights, and finally the fixed-point operation result is input into the post-conversion unit 424 to be converted into a floating-point operation result. For such homogeneous operations, when the sub-graphs are stored in continuous, complete storage space of the NRAM 431 and the weights are likewise stored in continuous, complete storage space of the WRAM 432, the control module 41 may, according to another multi-operation instruction, cause the pre-conversion unit 422 to read the sub-graph data from the NRAM 431 and convert it into fixed-point numbers, have the matrix operation unit 423 perform matrix multiplication or matrix addition on the sub-graph data and the weights, and input the fixed-point result into the post-conversion unit 424 to convert it back into a floating-point result.
Then the processing device 203 may use single operation instructions to transfer the operation results from the NRAM 431 to the SRAM 308 address by address, or use a multi-operation instruction to make the MVDMA 434 transfer the operation results from the NRAM 431 to the SRAM 308 at one time; these results are stored continuously and completely in the SRAM 308. Finally, the processing device 203 uses a multi-operation instruction to store the operation results from the SRAM 308 back to the DRAM 204 at one time, completing the task of matrix multiplication or matrix addition.
Another embodiment of the present invention is a system for processing image data using a neural network, the system having the devices shown in figs. 1 to 3. The structure of the processor core 306 in this embodiment is shown in fig. 7; it differs from the structure in fig. 4 in that the operation module 42 further includes an activation unit 701 for applying an activation function to the intermediate result output by the matrix operation unit 423, after which the activated intermediate result is converted back to floating-point numbers by the post-conversion unit 424.
Also taking matrix addition or multiplication as an example, the computing device 201 uses a multi-operation instruction to load the entire image data from the DRAM 204 into the SRAM 308 at once, i.e., one instruction controls the loading of the P_X×P_Y-sized image data into the SRAM 308. When computing the image data, the image data stored in the SRAM 308 is split into a plurality of sub-graphs, which are transmitted to the processor cores 306 for computation; the processing device 203 may transfer the sub-graphs address by address using single operation instructions, or transfer one sub-graph at a time to the NRAM 431 of the storage module 43 using a multi-operation instruction, with the corresponding weights transferred to the WRAM 432 in the same way.
When the sub-graph is stored in continuous, complete storage space of the NRAM 431 and the weights are likewise stored in continuous, complete storage space of the WRAM 432, the control module 41 may, according to another multi-operation instruction, cause the pre-conversion unit 422 to read the sub-graph data from the NRAM 431 and convert it into fixed-point numbers, have the matrix operation unit 423 perform matrix multiplication or matrix addition on the sub-graph data and the weights, have the activation unit 701 activate the intermediate result output by the matrix operation unit 423, and input the activated intermediate result into the post-conversion unit 424 to be converted back into a floating-point operation result.
Then the processing device 203 may use single operation instructions to transfer the operation results from the NRAM 431 to the SRAM 308 address by address, or use a multi-operation instruction to make the MVDMA 434 transfer the operation results from the NRAM 431 to the SRAM 308 at one time; these results are stored continuously and completely in the SRAM 308. Finally, the processing device 203 uses a multi-operation instruction to store the operation results from the SRAM 308 back to the DRAM 204 at one time, completing the task of matrix multiplication or matrix addition.
Another embodiment of the present invention is a system for processing image data using a neural network, the system having the devices shown in figs. 1 to 3. The structure of the processor core 306 in this embodiment is shown in fig. 8; it differs from the structure shown in fig. 7 in that the operation module 42 further includes a batch normalization unit 801 for normalizing the activation result generated by the activation unit 701, after which the batch-normalized intermediate result is converted into floating-point numbers by the post-conversion unit 424.
Also taking matrix addition or multiplication as an example, the computing device 201 loads the entire image data from the DRAM 204 into the SRAM 308 at once. When computing the image data, the image data stored in the SRAM 308 is split into a plurality of sub-graphs, which are transmitted to the processor cores 306 respectively for computation. The processing device 203 may transfer them address by address using single operation instructions, or use a multi-operation instruction to make the broadcast bus 309 transfer one sub-graph at a time to the NRAM 431 of the storage module 43, with the corresponding weights transferred to the WRAM 432 in the same way.
When the sub-graph is stored in continuous, complete storage space of the NRAM 431 and the weights are likewise stored in continuous, complete storage space of the WRAM 432, the control module 41 may, according to another multi-operation instruction, cause the pre-conversion unit 422 to read the sub-graph data from the NRAM 431 and convert it into fixed-point numbers, have the matrix operation unit 423 perform matrix multiplication or matrix addition on the sub-graph data and the weights, have the activation unit 701 activate the intermediate result output by the matrix operation unit 423, have the batch normalization unit 801 normalize the activation result generated by the activation unit 701, and input the batch-normalized intermediate result into the post-conversion unit 424 to convert it back into a floating-point operation result.
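Extending the earlier data-path sketch, the fused sequence of this embodiment might look as follows in C++; the elementwise matrix addition, the ReLU activation, the quantization scheme, and the split of the normalization arithmetic between the fixed-point and floating-point domains are all simplifying assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One fused pass over a sub-graph, mirroring the unit order in the text:
// pre-conversion (422) -> matrix addition (423) -> activation (701)
// -> batch normalization (801) -> post-conversion (424).
std::vector<float> fused_matrix_add(const std::vector<float>& subgraph,
                                    const std::vector<float>& weights,
                                    float scale, float bn_mean, float bn_inv_std) {
    const int32_t q_mean = static_cast<int32_t>(std::lround(bn_mean * scale));
    std::vector<float> out(subgraph.size());
    for (size_t i = 0; i < subgraph.size(); ++i) {
        int32_t a = static_cast<int32_t>(std::lround(subgraph[i] * scale));  // 422
        int32_t b = static_cast<int32_t>(std::lround(weights[i] * scale));   // 422
        int32_t acc = a + b;                       // 423: elementwise addition
        acc = std::max(acc, 0);                    // 701: ReLU (assumed)
        acc -= q_mean;                             // 801: center on batch mean
        out[i] = float(acc) / scale * bn_inv_std;  // 801 cont. + 424: to float
    }
    return out;
}
```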
Then the processing device 203 may use single operation instructions to transfer the operation results from the NRAM 431 to the SRAM 308 address by address, or use a multi-operation instruction to make the MVDMA 434 transfer the operation results from the NRAM 431 to the SRAM 308 at one time; these results are stored continuously and completely in the SRAM 308. Finally, the processing device 203 uses a multi-operation instruction to store the operation results from the SRAM 308 back to the DRAM 204 at one time, completing the task of matrix multiplication or matrix addition.
Another embodiment of the present invention is a method for processing image data using a neural network, which is implemented based on various apparatuses as shown in fig. 1 to 4. Fig. 9 shows a flowchart when this embodiment handles matrix addition or multiplication.
In step 901, image data is loaded from the DRAM 204. This embodiment uses a multi-operation instruction to load the entire image data into the SRAM 308 at once, i.e., one instruction controls the loading of the P_X×P_Y-sized image data into the SRAM 308, and the image data is loaded as one block into a P_X×P_Y storage space of the SRAM 308.
In step 902, the image data is split into a plurality of sub-graphs. When the image data is to be computed, the image data stored in the SRAM 308 needs to be split into a plurality of sub-graphs, which are transmitted to the processor cores 306 respectively for operation. To accomplish this, the processing device 203 may use single operation instructions to transfer the data address by address, or use a multi-operation instruction to make the broadcast bus 309 transfer one sub-graph at a time to the NRAM 431 of the storage module 43, with the corresponding weights transferred to the WRAM 432 in the same way.
In step 903, the image data is input to the pre-conversion unit, the matrix operation unit, and the post-conversion unit for operation according to the multi-operation instruction, so as to generate an operation result. When the sub-graph is stored in continuous, complete storage space of the NRAM 431 and the weights are likewise stored in continuous, complete storage space of the WRAM 432, the control module 41, according to another multi-operation instruction, causes the pre-conversion unit 422 to read the sub-graph data from the NRAM 431 and convert it into fixed-point numbers, has the matrix operation unit 423 perform matrix multiplication or matrix addition on the image data and the weights, and inputs the fixed-point operation result into the post-conversion unit 424, which converts it into a floating-point operation result.
In step 904, the floating-point operation result is transferred to the SRAM 308. The processing device 203 may use single operation instructions to transfer the operation results from the NRAM 431 to the SRAM 308 address by address, or use a multi-operation instruction to make the MVDMA 434 transfer the operation results from the NRAM 431 to the SRAM 308 at one time; these results are stored continuously and completely in the SRAM 308.
In step 905, the operation result is stored to the DRAM 204. The processing device 203 uses a multi-operation instruction to store the operation results from the SRAM 308 back to the DRAM 204 at one time, completing the task of matrix multiplication or matrix addition.
Another embodiment of the present invention is a method for processing image data using a neural network, which is implemented based on various devices as shown in fig. 1 to 3 and 7. Fig. 10 shows a flowchart when this embodiment handles matrix addition or multiplication.
In step 1001, image data is loaded from the DRAM 204. This embodiment uses a multi-operation instruction to load the entire image data into the SRAM 308 at once, i.e., one instruction controls the loading of the P_X×P_Y-sized image data into the SRAM 308, and the image data is loaded as one block into a P_X×P_Y storage space of the SRAM 308.
In step 1002, the image data is split into a plurality of sub-graphs. When the image data is to be computed, the image data stored in the SRAM 308 needs to be split into a plurality of sub-graphs, which are transmitted to the processor cores 306 respectively for operation. To accomplish this, the processing device 203 may use single operation instructions to transfer the data address by address, or use a multi-operation instruction to make the broadcast bus 309 transfer one sub-graph at a time to the NRAM 431 of the storage module 43, with the corresponding weights transferred to the WRAM 432 in the same way.
In step 1003, according to the multi-operation instruction, the image data is input to the pre-conversion unit, the matrix operation unit, the activation unit, and the post-conversion unit for operation, so as to generate an operation result. When the sub-graph is stored in continuous, complete storage space of the NRAM 431 and the weights are likewise stored in continuous, complete storage space of the WRAM 432, the control module 41 may, according to another multi-operation instruction, cause the pre-conversion unit 422 to read the sub-graph data from the NRAM 431 and convert it into fixed-point numbers, have the matrix operation unit 423 perform matrix multiplication or matrix addition on the sub-graph data and the weights, have the activation unit 701 activate the intermediate result output by the matrix operation unit 423, and input the activated intermediate result into the post-conversion unit 424 to convert it back into a floating-point operation result.
In step 1004, the floating-point operation result is transferred to the SRAM 308. The processing device 203 may use single operation instructions to transfer the operation results from the NRAM 431 to the SRAM 308 address by address, or use a multi-operation instruction to make the MVDMA 434 transfer the operation results from the NRAM 431 to the SRAM 308 at one time; these results are stored continuously and completely in the SRAM 308.
In step 1005, the operation result is stored to the DRAM 204. The processing device 203 uses a multi-operation instruction to store the operation results from the SRAM 308 back to the DRAM 204 at one time, completing the task of matrix multiplication or matrix addition.
Another embodiment of the present invention is a method for processing image data using a neural network, which is implemented based on various devices as shown in fig. 1 to 3 and 8. Fig. 11 shows a flowchart when this embodiment handles matrix addition or multiplication.
In step 1101, image data is loaded from the DRAM 204. This embodiment uses a multi-operation instruction to load the entire image data into the SRAM 308 at once, i.e., one instruction controls the loading of the P_X×P_Y-sized image data into the SRAM 308, and the image data is loaded as one block into a P_X×P_Y storage space of the SRAM 308.
In step 1102, the image data is split into a plurality of sub-graphs. When the image data is to be computed, the image data stored in the SRAM 308 needs to be split into a plurality of sub-graphs, which are transmitted to the processor cores 306 respectively for operation. To accomplish this, the processing device 203 may use single operation instructions to transfer the data address by address, or use a multi-operation instruction to make the broadcast bus 309 transfer one sub-graph at a time to the NRAM 431 of the storage module 43, with the corresponding weights transferred to the WRAM 432 in the same way.
In step 1103, the image data is input to the pre-conversion unit, the matrix operation unit, the activation unit, the batch normalization unit, and the post-conversion unit for operation according to the multi-operation instruction, so as to generate an operation result. When the sub-graph is stored in continuous, complete storage space of the NRAM 431 and the weights are likewise stored in continuous, complete storage space of the WRAM 432, the control module 41 may, according to another multi-operation instruction, cause the pre-conversion unit 422 to read the sub-graph data from the NRAM 431 and convert it into fixed-point numbers, have the matrix operation unit 423 perform matrix multiplication or matrix addition on the sub-graph data and the weights, have the activation unit 701 activate the intermediate result output by the matrix operation unit 423, have the batch normalization unit 801 normalize the activation result generated by the activation unit 701, and input the batch-normalized intermediate result into the post-conversion unit 424 to convert it back into a floating-point operation result.
In step 1104, the floating-point operation result is transferred to the SRAM 308. The processing device 203 may use single operation instructions to transfer the operation results from the NRAM 431 to the SRAM 308 address by address, or use a multi-operation instruction to make the MVDMA 434 transfer the operation results to the SRAM 308 at one time; these results are stored continuously and completely in the SRAM 308.
In step 1105, the operation result is stored to the DRAM 204. The processing device 203 uses a multi-operation instruction to store the operation results from the SRAM 308 back to the DRAM 204 at one time, completing the task of matrix multiplication or matrix addition.
Another embodiment of the invention is a computer readable storage medium having stored thereon computer program code for processing image data, which when executed by a processor performs the method of the embodiments as described above. In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer readable memory. In this regard, when the aspects of the present invention are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory, which may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present invention. The Memory may include, but is not limited to, a usb disk, a flash disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The invention utilizes multi-operation instructions to process operation sequences with data dependences, such as the fixed sequences in image processing of number conversion + matrix multiplication/addition + number conversion, number conversion + matrix multiplication/addition + activation + number conversion, or number conversion + matrix multiplication/addition + activation + batch normalization + number conversion. No additional processor structure design or compiler optimization technique is needed to resolve the data dependences; the scheme gains in processor performance, power consumption, and area, and greatly improves the execution speed of the operation sequences (algorithms).
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein may be practiced in other ways than as specifically disclosed herein. For example, as for each unit in the foregoing embodiments of the electronic device or apparatus, the unit is split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors and like devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1, a computing device for processing image data using a neural network, coupled to off-chip memory, the computing device comprising: a shared memory unit for loading the image data from the off-chip memory; and a processor core comprising: the operation module comprises a front revolution unit, a matrix operation unit and a rear revolution unit; the control module is used for inputting the image data to the front revolution unit, the matrix operation unit and the rear revolution unit for operation according to a multi-operation instruction so as to generate an operation result; wherein the shared storage unit stores the operation result.
Clause A2, the computing device of clause A1, wherein the pre-conversion unit converts the image data from floating-point numbers to fixed-point numbers.
Clause A3, the computing device of clause A1, wherein the processor core further comprises a storage module to load the image data from the shared memory unit, the pre-conversion unit reading the image data from the storage module.
Clause A4, the computing device of clause A3, wherein the operation module further comprises an activation unit for activating the intermediate result output by the matrix operation unit, the post-conversion unit converting the activation result from fixed-point numbers to floating-point numbers.
Clause A5, the computing device of clause A3, wherein the operation module further comprises: an activation unit for activating the intermediate result output by the matrix operation unit; and a batch normalization unit for normalizing the activation result; wherein the post-conversion unit converts the normalization result from fixed-point numbers to floating-point numbers.
Clause a6, the computing device of clause a1, wherein the matrix operation unit performs one of a matrix multiply task and a matrix add task.
Clause a7, an integrated circuit device, comprising the computing device of any of clauses a 1-6.
Clause A8, a board comprising the integrated circuit device of clause a 7.
Clause A9, a method of processing image data with a computing device including a pre-conversion unit, a matrix operation unit, and a post-conversion unit, the method comprising: loading the image data from an off-chip memory; inputting the image data to the pre-conversion unit, the matrix operation unit, and the post-conversion unit for operation according to a multi-operation instruction so as to generate an operation result; and storing the operation result to the off-chip memory.
Clause A10. A computer-readable storage medium having stored thereon computer program code for processing image data, the computer program code, when executed by a processing apparatus, performing the method of Clause A9.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain its principles and implementations. The above description of the embodiments is intended only to facilitate understanding of the methods of the present invention and its core ideas. Meanwhile, persons skilled in the art may, based on the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A computing device for processing image data using a neural network, coupled to an off-chip memory, the computing device comprising:
a shared storage unit for loading the image data from the off-chip memory; and
a processor core, comprising:
an operation module, which comprises a pre-conversion unit, a matrix operation unit, and a post-conversion unit; and
a control module for inputting the image data to the pre-conversion unit, the matrix operation unit, and the post-conversion unit for operation according to a multi-operation instruction, so as to generate an operation result;
wherein the shared storage unit stores the operation result.
2. The computing device of claim 1, wherein the pre-conversion unit converts the image data from floating-point numbers to fixed-point numbers.
3. The computing device of claim 1, wherein the processor core further comprises a storage module for loading the image data from the shared storage unit, the pre-conversion unit reading the image data from the storage module.
4. The computing device of claim 3, wherein the operation module further comprises an activation unit for activating the intermediate result output by the matrix operation unit, the post-conversion unit converting the activation result from fixed-point numbers to floating-point numbers.
5. The computing device of claim 3, wherein the operation module further comprises:
an activation unit for activating the intermediate result output by the matrix operation unit; and
a batch normalization unit for normalizing the activation result;
wherein the post-conversion unit converts the normalization result from fixed-point numbers to floating-point numbers.
6. The computing device of claim 1, wherein the matrix operation unit performs one of a matrix multiplication task and a matrix addition task.
7. An integrated circuit device comprising the computing device of any of claims 1 to 6.
8. A board card comprising the integrated circuit device of claim 7.
9. A method of processing image data using a computing device, the computing device including a pre-conversion unit, a matrix operation unit, and a post-conversion unit, the method comprising:
loading the image data from an off-chip memory;
inputting the image data to the pre-conversion unit, the matrix operation unit, and the post-conversion unit for operation according to a multi-operation instruction, so as to generate an operation result; and
storing the operation result to the off-chip memory.
10. A computer-readable storage medium having stored thereon computer program code for processing image data, the computer program code, when executed by a processing apparatus, performing the method of claim 9.