CN112035056B - Parallel RAM access equipment and access method based on multiple computing units - Google Patents


Info

Publication number: CN112035056B (application CN202010654566.1A)
Authority: CN (China)
Prior art keywords: ram, read, write, address, aipu
Legal status: Active (granted; the status listed is an assumption, not a legal conclusion)
Other versions: CN112035056A (original language: Chinese)
Inventor: 贾兆荣
Original and current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd; priority to CN202010654566.1A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0602: specifically adapted to achieve a particular effect; G06F3/061: Improving I/O performance
    • G06F3/0628: making use of a particular technique; G06F3/0638: Organizing or formatting or addressing of data; G06F3/064: Management of blocks
    • G06F3/0668: adopting a particular infrastructure; G06F3/0671: In-line storage system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to the technical field of server data processing and provides a parallel RAM access device and access method based on multiple computing units. One end of the device is connected to a plurality of AIPU units and the other end to a plurality of RAMs. The device comprises a register, an address arbitration module, an address mapping module, a memory read-write module, and a plurality of AIPU interface modules. Each AIPU unit can access the entire RAM space through the mapping calculation of the address mapping module, and all AIPU units work simultaneously. This greatly improves memory bandwidth, supports data interaction among the AIPUs, simplifies the complexity of that interaction, reduces the difficulty of placing AIPUs on a chip, and greatly improves the computing efficiency of AI applications.

Description

Parallel RAM access equipment and access method based on multiple computing units
Technical Field
The invention belongs to the technical field of server data processing, and particularly relates to parallel RAM access equipment and an access method based on multiple computing units.
Background
Artificial intelligence gives machines intelligence, allowing robots to take over some work from humans. The basic method of implementing artificial intelligence is machine learning, which uses algorithms to parse data, learn from it, and then make decisions and predictions about events in the real world. Unlike traditional hard-coded software programs that solve specific tasks, machine learning is "trained" with large amounts of data: through various algorithms it learns from the data how to complete a task, and can thereby solve or process a certain class of tasks. Machine learning derives directly from the early field of artificial intelligence. Conventional algorithms include decision tree learning, inductive logic programming, clustering, reinforcement learning, and Bayesian networks, among others. Deep learning is one technique for realizing machine learning: by building a deep artificial neural network and training it on a large amount of data, the network can accurately extract the characteristics of the input data, enabling the machine to make accurate judgments. Deep learning therefore places high demands on the performance and bandwidth of computer systems.
Throughout the 1980s and 1990s, computer systems were plagued by bottlenecks from relatively slow CPU performance, which limited the operations that applications could perform. Driven by Moore's law, the number of transistors has increased dramatically over the years, improving system performance and opening exciting new computational possibilities. The period from 1990 to 2000 was characterized by computing centered on desktops and workstations. From 2000 to 2010, improvements in connectivity and workflows shifted people toward mobile computing, smartphones, and cloud computing. Since 2010, the number of interconnected Internet-of-Things devices and sensors has increased sharply, and computing has advanced further toward cloud and edge computing. The latter moves processing closer to the data, effectively improving latency, bandwidth, and energy usage. Improvements in computing performance and the development of FPGAs, GPUs, TPUs, and DPUs have driven the rapid development of AI: the neural networks used for deep learning are increasingly complex, data volumes grow exponentially, and inference accuracy keeps improving.
As hardware evolves, the performance of AI applications can be analyzed with the well-known Roofline model, which shows how fully an application exploits the memory bandwidth and processing power of the underlying hardware. The roofline varies from one system architecture to another. In the Roofline model, the Y-axis represents operations performed per second, while the X-axis represents operational intensity, i.e. the number of operations performed per byte of memory traffic. The first bound is a diagonal line, which shows the limit imposed by memory bandwidth. The second is a horizontal line, which shows the limit imposed by the computational performance of the hardware. Together these lines form a roof-line shape, giving the model its name. Of the two applications represented by the two pink lines in the Roofline plot, the first is bottlenecked by memory bandwidth, and the second is bottlenecked by the compute performance of the processor.
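The two roofline bounds described above reduce to a one-line model. The sketch below (Python, with hypothetical hardware numbers that are not from the patent) returns attainable performance as the minimum of the compute ceiling and the bandwidth diagonal:

```python
def roofline(peak_ops, mem_bw, intensity):
    """Attainable performance (ops/s) under the Roofline model.

    peak_ops:  compute ceiling of the hardware, in ops/s (the horizontal line)
    mem_bw:    memory bandwidth, in bytes/s (the slope of the diagonal)
    intensity: operational intensity of the application, in ops/byte
    """
    return min(peak_ops, mem_bw * intensity)

# Hypothetical hardware: 100 Tops/s compute, 900 GB/s memory bandwidth.
bandwidth_bound = roofline(100e12, 900e9, 10)    # low intensity: on the diagonal
compute_bound = roofline(100e12, 900e9, 1000)    # high intensity: on the flat roof
```

An application sitting on the diagonal (like `bandwidth_bound` here) is exactly the case the patent targets: adding compute units without adding RAM bandwidth would not help it.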
AI applications at the edge typically appear as accelerators running on FPGAs, SoCs, ASICs, and other dedicated chips, using neural networks to assist resource-limited x86-based devices in processing large numbers of pictures and speech, layer by layer. The trained AI model parameters are stored in HBM/DDR off-chip storage; an AI processing unit (AIPU) is an AI computing unit that fetches the model parameters and picture or voice data through DMA (direct memory access), performs convolution or matrix operations, and caches intermediate results in RAM (random access memory). The neural networks currently used by AI are deep and data-heavy, so a single AIPU cannot fully utilize the hardware resources for maximum efficiency. Multiple AIPUs are therefore usually organized as a computing-unit array, and a CPU or GPU schedules the computing units through OpenCL or other register-configuration methods to complete huge convolution or matrix operations. Because FPGAs, SoCs, and ASICs offer low power consumption, high parallelism, low latency, and high speed, the AIPU computing units are usually placed on these chips, with the accelerator assisting the CPU or GPU in completing large-scale computation.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a parallel RAM access device based on multiple computing units, and aims to solve the problems in the prior art.
The technical scheme provided by the invention is as follows: a parallel RAM access device based on multiple computing units is disclosed. One end of the device is connected to a plurality of AIPU units and the other end to a plurality of RAMs. The device comprises a register, an address arbitration module, an address mapping module, a memory read-write module, and a plurality of AIPU interface modules, and each AIPU unit corresponds to one RAM;
the AIPU interface modules are respectively connected with corresponding AIPU units and used for receiving data reading and writing information of the AIPU units and caching the received data reading and writing information into corresponding first-in first-out queues (FIFO), wherein the data reading and writing information comprises reading and writing commands, reading and writing data, reading and writing addresses and reading and writing lengths;
the register is used for storing data information including a read-write mode, a storage initial address, a storage space size and a write data size;
the address arbitration module is respectively connected with the register and the AIPU interface modules and is used for judging the data state of the FIFO, reading corresponding data read-write information according to the read-write mode of the register and sending read-write commands and read-write addresses in the data read-write information to the address mapping module;
the address mapping module is respectively connected with the address arbitration module and the n extended RAMs, and is used for mapping discontinuous spaces of the n RAMs by using continuous virtual storage spaces and calculating memory read-write addresses;
the memory read-write module is connected with the address mapping module and used for reading and writing the RAM according to the memory read-write address obtained by calculation and replying the corresponding AIPU unit;
and each AIPU unit accesses all the RAM spaces through the mapping calculation of the address mapping module, and all the AIPU units work simultaneously.
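As a behavioral illustration of the AIPU interface modules above, here is a minimal Python sketch (the class name `AipuInterface` is hypothetical, not from the patent) of buffering read-write information, i.e. command, data, address, and length, in a FIFO until the arbitration module drains it:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AipuInterface:
    """One AIPU interface module: a FIFO of (command, data, address, length)."""
    fifo: deque = field(default_factory=deque)

    def push(self, cmd, data, addr, length):
        # Called on the AIPU side: cache the received read-write information.
        self.fifo.append((cmd, data, addr, length))

    def pop(self):
        # Called by the address arbitration module when the FIFO holds data.
        return self.fifo.popleft() if self.fifo else None

iface = AipuInterface()
iface.push("write", 0xAB, 0x10, 1)
iface.push("read", None, 0x20, 4)
first = iface.pop()   # FIFO order: the write command comes out first
```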
As an improved scheme, the virtual storage space onto which the extended RAMs are mapped by the address mapping module comprises an input data space, an output data space, and a convolution kernel space.
As an improved scheme, the extended RAM comprises RAM0 and RAM1, and the AIPU units comprise a first AIPU unit and a second AIPU unit;
the command address of the first AIPU unit is addr; the starting address addr_st of the segment containing addr, the segment size data_size, and the offset address addr_delta within the corresponding RAM are determined from the starting addresses A0, B0, and C0 of the three segments of the virtual RAM;
the RAM number ram_sel corresponding to the command address and the address offset delta_addr_ram within that RAM are determined from the offset delta_addr of the command address relative to the starting address addr_st;
and the memory read-write address addr_ram is synthesized from the calculated base address addr_base, the address offset delta_addr_ram, and the RAM number ram_sel.
As an improved scheme, the read-write mode of the RAM comprises an independent read-write mode and a concurrent read-write mode.
As an improved scheme, when the read-write mode of the RAM is the independent read-write mode, the RAM is read and written according to the address mapped by the address mapping module.
As an improved scheme, when the read-write mode of the RAM is a concurrent read-write mode, whether the memory read-write addresses of concurrent reading and writing are consistent is judged;
when the memory read-write addresses of concurrent reading and writing are judged to be consistent, the RAM is read and written according to the memory read-write addresses mapped by the address mapping module, and the read-write data are broadcasted to all the AIPU units;
and when the memory read-write addresses of concurrent reading and writing are judged to be inconsistent, enabling all the RAMs, and simultaneously reading and writing the corresponding RAMs according to the memory read-write addresses mapped by the address mapping module.
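The concurrent-mode rule above, namely read once and broadcast when the mapped addresses agree, otherwise enable all RAMs in the same cycle, could be modeled as follows (a Python sketch; `rams` as a list of word arrays and requests as `(ram_sel, offset)` pairs are illustrative assumptions):

```python
def concurrent_read(requests, rams):
    """requests: one (ram_sel, offset) mapped address per AIPU unit."""
    if len(set(requests)) == 1:
        # Consistent addresses: one physical read, broadcast to all AIPUs.
        ram_sel, offset = requests[0]
        return [rams[ram_sel][offset]] * len(requests)
    # Inconsistent addresses: all RAMs enabled, each request served in parallel.
    return [rams[sel][off] for sel, off in requests]

rams = [[10, 11], [20, 21]]
broadcast = concurrent_read([(0, 1), (0, 1)], rams)   # same address: one read, fanned out
parallel = concurrent_read([(0, 0), (1, 1)], rams)    # different addresses: per-RAM reads
```

In hardware both branches complete in a single cycle; the point of the sketch is only the decision between the broadcast path and the parallel path.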
Another object of the present invention is to provide a parallel RAM access method based on multiple compute units, the method comprising the steps of:
a plurality of AIPU interface modules receive data read-write information of the AIPU units and cache the received data read-write information into corresponding FIFO (first-in first-out) queues;
judging the data state of the FIFO, reading corresponding data read-write information according to the read-write mode of a register, and sending a read-write command and a read-write address in the data read-write information to an address mapping module;
the address mapping module maps discontinuous spaces of the n RAMs by using the continuous virtual storage space and calculates the memory read-write address;
and the memory read-write module reads and writes the RAM according to the memory read-write address obtained by calculation and replies the corresponding AIPU unit.
As an improved scheme, the step of the address mapping module mapping the discontinuous spaces of the n RAMs by using the continuous virtual storage space and calculating the memory read-write address specifically includes the following steps:
the first AIPU unit command address is addr, and the initial address addr _ st and the space size data _ size of the section where the first AIPU unit command address is located are judged according to the initial addresses A0, B0 and C0 of the three parts of space of the virtual RAM, and the offset address addr _ delta of the corresponding RAM;
judging a RAM number RAM _ sel corresponding to the command address and an address offset delta _ addr _ RAM in the RAM according to the address offset delta _ addr of the command address relative to the initial address add _ st;
synthesizing a memory read-write address addr _ RAM according to the calculated base address addr _ base, the address offset delta _ addr _ RAM and the RAM number RAM _ sel;
wherein the expansion RAM includes RAM0 and RAM1, the AIPU unit includes first AIPU unit and second AIPU unit.
As an improved scheme, the virtual storage space onto which the extended RAMs are mapped by the address mapping module comprises an input data space, an output data space, and a convolution kernel space.
As an improved scheme, the step of the memory read-write module reading and writing the RAM according to the memory read-write address obtained by calculation and replying the corresponding AIPU unit specifically includes the following steps:
judging the read-write mode of the RAM, wherein the read-write mode of the RAM comprises an independent read-write mode and a concurrent read-write mode;
when the read-write mode of the RAM is the independent read-write mode, reading and writing the RAM according to the address mapped by the address mapping module;
when the read-write mode of the RAM is a concurrent read-write mode, judging whether the memory read-write addresses of concurrent reading and writing are consistent;
when the memory read-write addresses of concurrent reading and writing are judged to be consistent, the RAM is read and written according to the memory read-write addresses mapped by the address mapping module, and the read-write data are broadcasted to all the AIPU units;
and when the memory read-write addresses of concurrent reading and writing are judged to be inconsistent, enabling all the RAMs, and simultaneously reading and writing the corresponding RAMs according to the memory read-write addresses mapped by the address mapping module.
In the embodiment of the invention, one end of the parallel RAM access device based on multiple computing units is connected to a plurality of AIPU units and the other end to a plurality of RAMs. The device comprises a register, an address arbitration module, an address mapping module, a memory read-write module, and a plurality of AIPU interface modules. Each AIPU unit accesses the space of all RAMs through the mapping calculation of the address mapping module, and all AIPU units work simultaneously. This not only greatly improves memory bandwidth but also supports data interaction among the AIPUs, simplifies the complexity of that interaction, reduces the difficulty of placing AIPUs on a chip, and greatly improves the computing efficiency of AI applications.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a schematic structural diagram of a parallel RAM access device based on multiple computing units according to the present invention;
FIG. 2 is a schematic diagram of address mapping provided by the present invention;
FIG. 3 is a flow chart of an implementation of a parallel RAM access method based on multiple computing units according to the present invention;
FIG. 4 is a flowchart illustrating an implementation of an address mapping module according to the present invention mapping discontinuous spaces of n RAMs using continuous virtual storage spaces and calculating memory read/write addresses;
fig. 5 is a flow chart of the implementation of the memory read/write module provided by the present invention reading/writing the RAM according to the memory read/write address obtained by calculation and replying to the corresponding AIPU unit.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are merely for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
Fig. 1 is a schematic structural diagram of a parallel RAM access device based on multiple computing units according to the present invention, and for convenience of explanation, only the parts related to the embodiment of the present invention are shown in the diagram.
One end of the parallel RAM access device based on multiple computing units is connected to a plurality of AIPU units and the other end to a plurality of RAMs. The device comprises a register, an address arbitration module, an address mapping module, a memory read-write module, and a plurality of AIPU interface modules, and each AIPU unit corresponds to one RAM;
the AIPU interface modules are respectively connected with corresponding AIPU units and used for receiving data reading and writing information of the AIPU units and caching the received data reading and writing information into corresponding first-in first-out queues (FIFO), wherein the data reading and writing information comprises reading and writing commands, reading and writing data, reading and writing addresses and reading and writing lengths;
the register is used for storing data information including a read-write mode, a storage initial address, a storage space size and a write data size;
the address arbitration module is respectively connected with the register and the AIPU interface modules and is used for judging the data state of the FIFO, reading corresponding data read-write information according to the read-write mode of the register and sending read-write commands and read-write addresses in the data read-write information to the address mapping module;
the address mapping module is respectively connected with the address arbitration module and the n extended RAMs, and is used for mapping discontinuous spaces of the n RAMs by using continuous virtual storage spaces and calculating memory read-write addresses;
the memory read-write module is connected with the address mapping module and used for reading and writing the RAM according to the memory read-write address obtained by calculation and replying the corresponding AIPU unit;
and each AIPU unit accesses all RAM spaces through the mapping calculation of the address mapping module, and all the AIPU units work simultaneously.
In this embodiment, the parallel RAM access device based on multiple computing units comprises data-read logic and data-write logic; reads and writes do not affect each other, satisfying the requirement that an AIPU unit reads and writes data simultaneously. Each AIPU unit is provided with one RAM, but through address mapping every AIPU unit can access the entire RAM space; all AIPU units can work simultaneously in parallel and access all RAM spaces at the same time. If n AIPU units are deployed, the maximum computing efficiency of the whole AI application is about n times that of a single AIPU unit, and the RAM bandwidth scales linearly with the number of AIPU units. This solves the problem of RAM bandwidth constraining AI application efficiency as the number of AIPU units grows; that is, it avoids the application's efficiency falling into the diagonal (bandwidth-bound) region of the Roofline model when AIPU units are added.
In the embodiment of the present invention, the virtual storage space onto which the extended RAMs are mapped by the address mapping module includes an input data space, an output data space, and a convolution kernel space. An AIPU unit generally reads and writes data in burst mode and requires a contiguous data storage space, whereas the addresses of the n extended RAMs are discontinuous. The data an AIPU unit needs may be stored across several RAMs: for example, when each AIPU unit computes the convolution of one picture, all units use the same convolution kernels, which are distributed evenly across the RAMs corresponding to the AIPU units. In that case an AIPU unit must read the convolution kernel parameters sequentially from all RAMs while still issuing contiguous read addresses, so the n discontinuous RAM spaces are made contiguous through the virtual storage space.
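For n = 2 RAMs, the virtual layout implied above can be sketched numerically: each virtual segment concatenates the corresponding region of RAM0 and then RAM1, so a segment's virtual size is twice its per-RAM size. The helper name and sizes below are illustrative assumptions, not values from the patent:

```python
def segment_bases(feature0_size, feature1_size, A0=0):
    """Start addresses of the three virtual segments for 2 RAMs.

    Each segment spans both RAMs, hence the factor of 2 per segment.
    """
    B0 = A0 + 2 * feature0_size    # output-data segment follows input data
    C0 = B0 + 2 * feature1_size    # convolution-kernel segment follows output data
    return A0, B0, C0

A0, B0, C0 = segment_bases(0x80, 0x100)   # illustrative per-RAM segment sizes
```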
To illustrate the operation of the address mapping module, the following description is made with reference to specific examples:
as shown in fig. 2, the extended RAM includes RAM0 and RAM1, and the AIPU unit includes a first AIPU unit and a second AIPU unit, which are specifically implemented as:
(1) Map the three spaces of the 2 RAMs to the input data space, the output data space, and the convolution kernel space of the virtual RAM;
(2) Assume the command address of the first AIPU unit is addr. The starting address addr_st of the segment containing addr, the base address addr_base, and the segment size data_size are determined from the starting addresses A0, B0, and C0 of the three segments of the virtual RAM:
{addr_st, addr_base, data_size} =
    addr < B0 ? {A0, 0, feature0_size} :
    addr < C0 ? {B0, feature0_size, feature1_size} :
                {C0, feature0_size + feature1_size, weight_size};
(3) The RAM number ram_sel corresponding to the command address and the address offset delta_addr_ram within that RAM are then determined from the offset delta_addr of the command address relative to the starting address addr_st:
delta_addr = addr - addr_st;
{delta_addr_ram, ram_sel} = delta_addr < data_size ? {delta_addr, h01} : {delta_addr - data_size, h02};
(4) The memory read-write address addr_ram is synthesized from the calculated base address addr_base, the address offset delta_addr_ram, and the RAM number ram_sel:
addr_ram = {ram_sel, addr_base + delta_addr_ram}.
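Steps (2) through (4) can be restated as an executable sketch (Python rather than RTL; the segment parameters in the example call are illustrative, and `ram_sel` is returned as a plain 0/1 index instead of the one-hot select used in the expressions above):

```python
def map_addr(addr, A0, B0, C0, feature0_size, feature1_size, weight_size):
    # Step (2): locate the virtual segment containing addr.
    if addr < B0:
        addr_st, addr_base, data_size = A0, 0, feature0_size
    elif addr < C0:
        addr_st, addr_base, data_size = B0, feature0_size, feature1_size
    else:
        addr_st, addr_base, data_size = C0, feature0_size + feature1_size, weight_size
    # Step (3): pick the physical RAM and the offset inside it.
    delta_addr = addr - addr_st
    if delta_addr < data_size:
        delta_addr_ram, ram_sel = delta_addr, 0               # RAM0
    else:
        delta_addr_ram, ram_sel = delta_addr - data_size, 1   # RAM1
    # Step (4): synthesize the physical read-write address.
    return ram_sel, addr_base + delta_addr_ram

# Illustrative layout: A0=0x0, B0=0x100, C0=0x300,
# feature0_size=0x80, feature1_size=0x100, weight_size=0x40.
sel, phys = map_addr(0x90, 0x0, 0x100, 0x300, 0x80, 0x100, 0x40)
```

With this layout, virtual address 0x90 falls in the input-data segment but past the first RAM's 0x80-word share, so it lands in RAM1 at offset 0x10.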
In this embodiment, the case of more than two AIPU units is similar and is not described here again; the invention is not limited thereto.
In the embodiment of the invention, the read-write mode of the RAM comprises an independent read-write mode and a concurrent read-write mode;
when the read-write mode of the RAM is the independent read-write mode, reading and writing the RAM according to the address mapped by the address mapping module;
when the read-write mode of the RAM is a concurrent read-write mode, judging whether the memory read-write addresses of concurrent reading and writing are consistent;
when the memory read-write addresses of concurrent reading and writing are judged to be consistent, the RAM is read and written according to the memory read-write addresses mapped by the address mapping module, and the read-write data are broadcasted to all the AIPU units;
and when the memory read-write addresses of concurrent reading and writing are judged to be inconsistent, enabling all the RAMs, and simultaneously reading and writing the corresponding RAMs according to the memory read-write addresses mapped by the address mapping module.
In this embodiment, when data interaction is required between the AIPUs, the read-write mode is set to the independent read-write mode; only one AIPU is enabled to work at a time, but that AIPU can access the entire RAM space. When all AIPUs read and write concurrently, each AIPU reads or writes the same address of its corresponding RAM, so one read or write by all n AIPUs takes 1 cycle instead of the n cycles needed to access the RAMs in turn, increasing the bandwidth n-fold. When all AIPUs read data at the same address, the memory read-write module reads only once and then broadcasts the data to all AIPUs, again increasing the effective RAM bandwidth n-fold.
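The n-fold bandwidth claim is simple arithmetic. With hypothetical numbers (n = 4 AIPUs, 64-byte accesses, a 500 MHz clock, none of which come from the patent):

```python
n, width, clk = 4, 64, 500e6     # AIPU count, bytes per access, clock in Hz

serial_bw = width * clk          # round-robin to one shared RAM: one access per cycle total
concurrent_bw = n * width * clk  # per-AIPU RAMs hit in the same cycle: n accesses per cycle
speedup = concurrent_bw / serial_bw
```

The broadcast case reaches the same aggregate figure with a single physical read, since one fetched word is delivered to all n units in one cycle.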
Fig. 3 shows a flowchart of an implementation of the parallel RAM access method based on multiple computing units provided in the present invention, which specifically includes the following steps:
in step S101, a plurality of AIPU interface modules receive data read-write information of an AIPU unit, and buffer the received data read-write information into corresponding FIFO;
in step S102, the data state of the FIFO is determined, corresponding data read-write information is read according to the read-write mode of the register, and a read-write command and a read-write address in the data read-write information are sent to the address mapping module;
in step S103, the address mapping module maps the discontinuous spaces of the n RAMs using the continuous virtual storage space, and calculates the memory read-write address;
in step S104, the memory read/write module reads and writes the RAM according to the memory read/write address obtained by calculation, and replies to the corresponding AIPU unit.
In the embodiment of the present invention, as shown in fig. 4, the step of the address mapping module mapping the discontinuous spaces of the n RAMs by using the continuous virtual storage space and calculating the memory read/write address specifically includes the following steps:
in step S201, the command address of the first AIPU unit is addr; the starting address addr_st of the segment containing addr, the segment size data_size, and the offset address addr_delta within the corresponding RAM are determined from the starting addresses A0, B0, and C0 of the three segments of the virtual RAM;
in step S202, the RAM number ram_sel corresponding to the command address and the address offset delta_addr_ram within that RAM are determined from the offset delta_addr of the command address relative to the starting address addr_st;
in step S203, the memory read-write address addr_ram is synthesized from the calculated base address addr_base, the address offset delta_addr_ram, and the RAM number ram_sel;
wherein the expansion RAM includes RAM0 and RAM1, the AIPU unit includes first AIPU unit and second AIPU unit.
In the embodiment of the present invention, the virtual storage space onto which the extended RAMs are mapped by the address mapping module includes an input data space, an output data space, and a convolution kernel space.
As shown in fig. 5, the step of the memory read-write module reading and writing the RAM according to the memory read-write address obtained by calculation and replying to the corresponding AIPU unit specifically includes the following steps:
in step S301, the read-write mode of the RAM is determined, where the read-write mode includes an independent read-write mode and a concurrent read-write mode;
in step S302, when the read-write mode of the RAM is the independent read-write mode, the RAM is read and written according to the address mapped by the address mapping module;
in step S303, when the read-write mode of the RAM is the concurrent read-write mode, determining whether the memory read-write addresses of concurrent read-write are consistent, if so, executing step S304, otherwise, executing step S305;
in step S304, when it is determined that the memory read-write addresses of concurrent reading and writing are consistent, the RAM is read and written according to the memory read-write addresses mapped by the address mapping module, and the read-write data is broadcast to all AIPU units;
in step S305, when it is determined that the memory read/write addresses of concurrent reading and writing are not consistent, all RAMs are enabled, and the corresponding RAMs are simultaneously read and written according to the memory read/write addresses mapped by the address mapping module.
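The mode dispatch of steps S301 to S305 can be sketched as below. The function name, the request representation as (ram_sel, addr_ram) pairs, and the RAM model as plain lists are assumptions made for illustration; the patent describes the behavior, not an implementation.

```python
# Hypothetical sketch of the S301-S305 arbitration; names and data model
# are illustrative assumptions.
def serve_reads(rams, mode, requests):
    """Return one read result per requesting AIPU unit.

    rams:     list of RAM contents, indexed by ram_sel
    mode:     "single" or "concurrent" (S301)
    requests: list of (ram_sel, addr_ram) tuples, one per AIPU unit
    """
    if mode == "single":
        # S302: only one request is outstanding; serve it directly.
        ram_sel, addr_ram = requests[0]
        return [rams[ram_sel][addr_ram]]
    # S303: concurrent mode - check whether all addresses coincide.
    if len(set(requests)) == 1:
        # S304: identical addresses - one RAM access, broadcast to all units.
        ram_sel, addr_ram = requests[0]
        return [rams[ram_sel][addr_ram]] * len(requests)
    # S305: distinct addresses - all RAMs enabled, accesses served in parallel.
    return [rams[ram_sel][addr_ram] for ram_sel, addr_ram in requests]
```

The broadcast path in S304 is what lets several AIPU units share one physical access (for example, a common convolution kernel) instead of serializing identical reads.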
In the embodiment of the invention, the parallel RAM access device based on multiple computing units is suitable for AI applications in edge computing. An AI application in this setting generally exists in the form of an accelerator that assists a CPU or GPU in accelerating convolution calculations or matrix operations, thereby reducing latency. Two factors constrain the computing efficiency of such an AI application: the RAM bandwidth and the resources of the hardware on which the application runs. Given a fixed hardware budget, as many AIPU units as possible need to be arranged, so that the hardware resources are fully utilized and the computing efficiency is improved. In addition, it must be considered how to extend the RAM when a plurality of AIPU units are arranged, so that all AIPUs access the RAM in parallel and the bandwidth reduction caused by serial access is avoided; it must also be considered how data interaction is carried out among the AIPUs, so that they can cooperatively complete a huge amount of calculation.
One end of the parallel RAM access device based on multiple computing units is connected with a plurality of AIPU units, and the other end is connected with a plurality of RAMs. The parallel RAM access device includes a register, an address arbitration module, an address mapping module, a memory read-write module, and a plurality of AIPU interface modules. Each AIPU unit accesses the space of all RAMs through the mapping calculation of the address mapping module, and all AIPU units work simultaneously. This not only greatly improves the memory bandwidth, but also satisfies the mutual data interaction between the AIPUs, simplifies the complexity of data interaction between the AIPUs, reduces the difficulty of arranging the AIPUs on the chip, and greatly improves the computing efficiency of the AI application.
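The module composition described above can be modeled structurally as follows. The class and method names, the round-robin draining of the per-interface FIFOs, and the simple divmod address mapping are all illustrative assumptions; the patent names the modules (register, AIPU interface FIFOs, arbitration, mapping, memory read-write) but leaves their internals to the implementation.

```python
# Hypothetical structural sketch of the device; all names and policies are
# assumptions for illustration, not the patented design.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class RWRequest:
    write: bool
    addr: int          # virtual address in the mapped space
    data: int = 0      # payload for writes

@dataclass
class ParallelRAMAccessDevice:
    n_aipu: int
    n_ram: int
    ram_size: int
    mode: str = "single"              # register: stored read-write mode
    fifos: list = field(init=False)   # one FIFO per AIPU interface module
    rams: list = field(init=False)

    def __post_init__(self):
        self.fifos = [deque() for _ in range(self.n_aipu)]
        self.rams = [[0] * self.ram_size for _ in range(self.n_ram)]

    def submit(self, aipu_id, req):
        # AIPU interface module: buffer the read/write information in a FIFO.
        self.fifos[aipu_id].append(req)

    def step(self):
        # Address arbitration: pop one pending request per AIPU, map it to a
        # (RAM, offset) pair, perform the access, and collect per-AIPU replies.
        replies = {}
        for aipu_id, fifo in enumerate(self.fifos):
            if not fifo:
                continue
            req = fifo.popleft()
            ram_sel, addr_ram = divmod(req.addr, self.ram_size)  # assumed mapping
            if req.write:
                self.rams[ram_sel][addr_ram] = req.data
                replies[aipu_id] = None
            else:
                replies[aipu_id] = self.rams[ram_sel][addr_ram]
        return replies
```

Because every AIPU's requests pass through the shared mapping, a value written by one AIPU unit is immediately visible to the others, which is how the device supports data interaction between AIPUs without a separate interconnect.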
The above embodiments are only used to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and should be construed as falling within the scope of the claims and description of the present invention.

Claims (10)

1. A parallel RAM access device based on multiple computing units is characterized in that one end of the parallel RAM access device is connected with a plurality of AIPU units, the other end of the parallel RAM access device is connected with a plurality of RAMs, the parallel RAM access device comprises a register, an address arbitration module, an address mapping module, a memory read-write module and a plurality of AIPU interface modules, and each AIPU unit corresponds to one RAM;
the AIPU interface modules are respectively connected with corresponding AIPU units and used for receiving data reading and writing information of the AIPU units and caching the received data reading and writing information into corresponding first-in first-out queues (FIFO), wherein the data reading and writing information comprises reading and writing commands, reading and writing data, reading and writing addresses and reading and writing lengths;
the register is used for storing data information including a read-write mode, a storage initial address, a storage space size and a write data size;
the address arbitration module is respectively connected with the register and the AIPU interface modules and is used for judging the data state of the FIFO, reading corresponding data read-write information according to the read-write mode of the register and sending read-write commands and read-write addresses in the data read-write information to the address mapping module;
the address mapping module is respectively connected with the address arbitration module and the n extended RAMs, and is used for mapping discontinuous spaces of the n RAMs by using continuous virtual storage spaces and calculating memory read-write addresses;
the memory read-write module is connected with the address mapping module, and is used for reading and writing the RAM according to the calculated memory read-write address and replying to the corresponding AIPU unit;
and each AIPU unit accesses all RAM spaces through the mapping calculation of the address mapping module, and all the AIPU units work simultaneously.
2. The multiple-compute-unit-based parallel RAM access device of claim 1, wherein the virtual memory space mapped by each extended RAM and the address mapping module comprises an input data space, an output data space, and a convolution kernel space.
3. The multiple-compute-unit-based parallel RAM access device of claim 2, wherein the extended RAMs comprise RAM0 and RAM1, and the AIPU units comprise a first AIPU unit and a second AIPU unit;
the command address of the first AIPU unit is addr, and the starting address addr_st, the space size data_size, and the offset address addr_delta of the corresponding RAM of the segment in which the command address lies are determined according to the starting addresses A0, B0 and C0 of the three parts of the virtual RAM space;
the RAM number ram_sel corresponding to the command address and the address offset delta_addr_ram within the RAM are determined according to the address offset delta_addr of the command address relative to the starting address addr_st;
and the memory read-write address addr_ram is synthesized according to the calculated base address addr_base, the address offset delta_addr_ram, and the RAM number ram_sel.
4. The multiple compute unit based parallel RAM access device of claim 2 in which the read and write modes of the RAM include a single read and write mode and a concurrent read and write mode.
5. The multiple-compute-unit-based parallel RAM access device of claim 4, wherein when the read-write mode of the RAM is the single read-write mode, the RAM is read and written according to the address mapped by the address mapping module.
6. The parallel RAM access device based on multiple computing units of claim 4, wherein when the read-write mode of the RAM is a concurrent read-write mode, it is determined whether the memory read-write addresses of concurrent read-write are consistent;
when the memory read-write addresses of concurrent reading and writing are judged to be consistent, the RAM is read and written according to the memory read-write addresses mapped by the address mapping module, and the read-write data are broadcasted to all the AIPU units;
and when the memory read-write addresses of concurrent reading and writing are judged to be inconsistent, enabling all the RAMs, and simultaneously reading and writing the corresponding RAMs according to the memory read-write addresses mapped by the address mapping module.
7. A parallel RAM access method based on multiple computing units, applied to the parallel RAM access device based on multiple computing units of claim 1, the method comprising the following steps:
the AIPU interface modules receive data read-write information of the AIPU units and buffer the received data read-write information into corresponding first-in first-out queues (FIFO);
judging the data state of the FIFO, reading corresponding data read-write information according to the read-write mode of a register, and sending a read-write command and a read-write address in the data read-write information to an address mapping module;
the address mapping module maps discontinuous spaces of the n RAMs by using the continuous virtual storage space and calculates the memory read-write address;
and the memory read-write module reads and writes the RAM according to the calculated memory read-write address and replies to the corresponding AIPU unit.
8. The multi-compute unit-based parallel RAM access method of claim 7, wherein the step of the address mapping module mapping the non-contiguous space of the n RAMs using the contiguous virtual memory space and computing the memory read and write addresses comprises the steps of:
the command address of the first AIPU unit is addr, and the starting address addr_st, the space size data_size, and the offset address addr_delta of the corresponding RAM of the segment in which the command address lies are determined according to the starting addresses A0, B0 and C0 of the three parts of the virtual RAM space;
the RAM number ram_sel corresponding to the command address and the address offset delta_addr_ram within the RAM are determined according to the address offset delta_addr of the command address relative to the starting address addr_st;
the memory read-write address addr_ram is synthesized according to the calculated base address addr_base, the address offset delta_addr_ram, and the RAM number ram_sel;
wherein the extended RAMs comprise RAM0 and RAM1, and the AIPU units comprise a first AIPU unit and a second AIPU unit.
9. The multiple-compute-unit-based parallel RAM access method of claim 8, wherein the virtual memory space mapped by each extended RAM and the address mapping module comprises an input data space, an output data space, and a convolution kernel space.
10. The multi-compute unit-based parallel RAM access method of claim 7, wherein the step of the memory read-write module reading and writing the RAM according to the computed memory read-write address and replying to the corresponding AIPU unit comprises the steps of:
judging the read-write mode of the RAM, wherein the read-write mode of the RAM comprises a single read-write mode and a concurrent read-write mode;
when the read-write mode of the RAM is the single read-write mode, reading and writing the RAM according to the address mapped by the address mapping module;
when the read-write mode of the RAM is a concurrent read-write mode, judging whether the memory read-write addresses of concurrent reading and writing are consistent;
when the memory read-write addresses of concurrent reading and writing are judged to be consistent, the RAM is read and written according to the memory read-write addresses mapped by the address mapping module, and the read-write data are broadcasted to all the AIPU units;
and when the memory read-write addresses of concurrent reading and writing are judged to be inconsistent, enabling all the RAMs, and simultaneously reading and writing the corresponding RAMs according to the memory read-write addresses mapped by the address mapping module.
CN202010654566.1A 2020-07-09 2020-07-09 Parallel RAM access equipment and access method based on multiple computing units Active CN112035056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010654566.1A CN112035056B (en) 2020-07-09 2020-07-09 Parallel RAM access equipment and access method based on multiple computing units


Publications (2)

Publication Number Publication Date
CN112035056A CN112035056A (en) 2020-12-04
CN112035056B true CN112035056B (en) 2022-11-29

Family

ID=73579119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010654566.1A Active CN112035056B (en) 2020-07-09 2020-07-09 Parallel RAM access equipment and access method based on multiple computing units

Country Status (1)

Country Link
CN (1) CN112035056B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613053B (en) * 2020-12-25 2024-04-23 北京天融信网络安全技术有限公司 Data encryption and decryption method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169419A (en) * 2011-04-02 2011-08-31 无锡众志和达存储技术有限公司 RAID (redundant array of independent disks) data block splitting and assembling method based on SATA (serial advanced technology attachment) controller
CN102541769A (en) * 2010-12-13 2012-07-04 中兴通讯股份有限公司 Memory interface access control method and device
CN105278880A (en) * 2015-10-19 2016-01-27 浪潮电子信息产业股份有限公司 Cloud computing virtualization-based memory optimization device and method


Also Published As

Publication number Publication date
CN112035056A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN108416422B (en) FPGA-based convolutional neural network implementation method and device
JP5422614B2 (en) Simulate multiport memory using low port count memory
US11775430B1 (en) Memory access for multiple circuit components
JP2019036298A (en) Intelligent high bandwidth memory system and logic dies therefor
KR20170027125A (en) Computing system and method for processing operations thereof
KR20120123127A (en) Method and apparatus to facilitate shared pointers in a heterogeneous platform
US11403104B2 (en) Neural network processor, chip and electronic device
US11880684B2 (en) RISC-V-based artificial intelligence inference method and system
US11455781B2 (en) Data reading/writing method and system in 3D image processing, storage medium and terminal
US20220043770A1 (en) Neural network processor, chip and electronic device
CN113051199A (en) Data transmission method and device
CN113792621B (en) FPGA-based target detection accelerator design method
CN111105023A (en) Data stream reconstruction method and reconfigurable data stream processor
CN112035056B (en) Parallel RAM access equipment and access method based on multiple computing units
WO2019223383A1 (en) Direct memory access method and device, dedicated computing chip and heterogeneous computing system
CN110059024A (en) A kind of memory headroom data cache method and device
WO2021115149A1 (en) Neural network processor, chip and electronic device
Chen et al. GCIM: Towards Efficient Processing of Graph Convolutional Networks in 3D-Stacked Memory
CN112559403B (en) Processor and interrupt controller therein
Shang et al. LACS: A high-computational-efficiency accelerator for CNNs
CN115577747A (en) High-parallelism heterogeneous convolutional neural network accelerator and acceleration method
CN111242832B (en) System C-based GPU texture mapping period accurate joint simulation device and method
CN114398308A (en) Near memory computing system based on data-driven coarse-grained reconfigurable array
CN113869494A (en) Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis
CN111488970A (en) Execution optimization method and device of neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant