CN114936171A - Memory access controller architecture - Google Patents


Info

Publication number
CN114936171A
Authority
CN
China
Prior art keywords
access
queue
controller
memory access
application program
Prior art date
Legal status
Granted
Application number
CN202210665476.1A
Other languages
Chinese (zh)
Other versions
CN114936171B (en)
Inventor
胡景铭
Current Assignee
Shencun Technology Wuxi Co ltd
Original Assignee
Shencun Technology Wuxi Co ltd
Priority date
Filing date
Publication date
Application filed by Shencun Technology Wuxi Co ltd
Priority to CN202210665476.1A
Publication of CN114936171A
Application granted
Publication of CN114936171B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 - Handling requests for interconnection or transfer
    • G06F 13/16 - Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1605 - Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F 13/1642 - Handling requests for interconnection or transfer for access to memory bus based on arbitration with request queuing
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The application discloses a memory access controller architecture for an FPGA chip, comprising M distributed access controllers, N access queue controllers, and N access interface controllers, where the access queue controllers correspond one-to-one with the access interface controllers. Each distributed access controller provides an application program access port and N storage communication ports; target application programs correspond one-to-one with the application program access ports, and the N storage communication ports correspond one-to-one with the N access queue controllers. Each access queue controller is cascaded with an access interface controller and a storage medium. By introducing the access queue controller layer and combining multiple distributed access controllers, access queue controllers, and access interface controllers, the design achieves parallel processing of multiple application programs, a configurable number of ports, and multi-task queued memory access, with stronger adaptability, lower latency between storage access commands, and higher utilization of storage access bandwidth.

Description

Memory access controller architecture
Technical Field
The embodiments of the present application relate to the field of chip technology, and in particular to a fully connected distributed storage access controller architecture based on an FPGA chip.
Background
In modern data center construction and edge computing, FPGAs are increasingly used for algorithmic computation acceleration. The computation speed of some algorithms depends heavily on memory access speed, especially parameter-read speed, and is sometimes determined entirely by it, as with the memory-hard Ethereum (Ethash) algorithm. FPGA (field programmable gate array) designers raise the hardware's memory access speed by adding memory controllers: the traditional DDR interface has grown from the initial two interfaces supporting ping-pong operation to four to six, and the HBM-based chip designs of the last two years even provide 30 to 100 interfaces (here, "interface" means a physical port that can access an independent memory space, not a shared interface in a logical or virtual sense).
To realize the hardware performance of these interfaces in practical applications, current mainstream memory access controller hardware designs mainly adopt one of three architectures: single-port distributed access, point-to-point distributed access, and bus-arbitrated access. A single-port distributed access architecture cannot match the aggregate speed of multiple memory access ports combined. In a point-to-point distributed access architecture, the HBM in a new-generation FPGA design exposes dozens of physical ports, and the corresponding application programs cannot be divided into that many access ports for a one-to-one correspondence with the storage access ports. In a bus-arbitrated access architecture, the dynamic load differences between memory access ports are large, so port blocking and bandwidth degradation may occur, slowing down the application programs.
Disclosure of Invention
The present application provides a storage access controller architecture for an FPGA chip, which solves the problem in the related art that the FPGA chip cannot fully exploit the performance of its hardware interfaces. The storage access controller architecture is used for a storage access controller in an FPGA chip and comprises M distributed access controllers, N access queue controllers, and N access interface controllers; the access queue controllers correspond one-to-one with the access interface controllers; M and N are positive integers greater than 1;
each distributed access controller is provided with an independent application program access port and N storage communication ports; the application program access port is used for data interaction with a target application program, and the target application programs correspond one-to-one with the application program access ports; the N storage communication ports correspond one-to-one with the N access queue controllers and are used for data storage and communication;
each access queue controller is cascaded with an access interface controller and a storage medium and handles data storage and communication between itself and the storage medium; the access interface controller manages the task queues in the access queue controller.
Specifically, the distributed access controllers comprise a first distributed access controller, a second distributed access controller, up to an Mth distributed access controller; the correspondingly arranged first application program access port, second application program access port, up to Mth application program access port respectively connect to a first application program, a second application program, up to an Mth application program;
each distributed access controller further comprises a policy register, a task segmentation module, and an instruction remap module; the policy register stores a configuration policy for performing task segmentation on read/write requests of the target application program; the configuration policy includes at least one of the target application program access port, the storage communication ports, the access data volume, the access capacity, and the transmission speed;
the task segmentation module performs task segmentation on the read/write requests of the target application program according to the configuration policy, obtaining a target number of subtasks and new read/write requests, and designates a corresponding target number of the storage communication ports; the target number is a positive integer not exceeding N;
the instruction remap module updates the operation addresses of the split target number of subtasks and sends the target number of new read/write requests to the corresponding storage communication ports.
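As a rough illustrative sketch (not the patent's actual implementation), the cooperation of the task segmentation module and the instruction remap module might look as follows; the function and field names, the round-robin port assignment, and the per-port address interleaving are all hypothetical choices made only for this example:

```python
from dataclasses import dataclass

@dataclass
class SubRequest:
    port: int      # storage communication port index (0-based)
    address: int   # remapped operation address within that port's space
    length: int    # bytes carried by this subtask

def split_request(base_addr: int, total_len: int, n_ports: int, slice_len: int):
    """Split one application read/write request into slice-sized subtasks
    spread round-robin over n_ports storage communication ports, remapping
    each slice's operation address (the instruction-remap step)."""
    subtasks = []
    for k, offset in enumerate(range(0, total_len, slice_len)):
        length = min(slice_len, total_len - offset)
        port = k % n_ports                                   # round-robin port choice
        # hypothetical interleaving: each port sees its own contiguous slice sequence
        local_addr = base_addr // n_ports + (k // n_ports) * slice_len
        subtasks.append(SubRequest(port, local_addr, length))
    return subtasks

tasks = split_request(base_addr=0, total_len=1024, n_ports=8, slice_len=128)
print(len(tasks))                        # 8 subtasks, one per port
print(sorted({t.port for t in tasks}))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

A real controller would additionally consult the policy register's access-capacity and transmission-speed parameters when choosing the slice length and port count.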
Specifically, when the target application program issues a read data request, the distributed access controller further includes a read data reassembly module and a read data recovery module;
the read data recovery module recovers, in real time, the return data that the target number of access queue controllers transmit back to their corresponding storage communication ports;
the read data reassembly module reassembles the data collected by the read data recovery module and sends it to the target application program through the application program access port.
Specifically, the memory access queue controllers comprise a first memory access queue controller, a second memory access queue controller, up to an Nth memory access queue controller, where each memory access queue controller supports at least M application program task queues and return queues; the task queues comprise a first application program task queue, a second application program task queue, up to an Mth application program task queue, and the return queues comprise a first application program return queue, a second application program return queue, up to an Mth application program return queue;
the jth storage communication port of the ith distributed access controller corresponds to an ith application program task queue and an ith application program return queue in the jth access queue controller, wherein i is a positive integer not greater than M, and j is a positive integer not greater than N.
Specifically, the memory access queue controller further comprises a management register and a read data forwarding module;
the management register is stored with configuration information of a task queue and a return queue, and the configuration information at least comprises at least one of the size of a single task space, the number of cache tasks, the number of supported applications, the number of single access bytes and the capacity of each application task queue and the return queue;
the read data forwarding module is used for receiving returned data and forwarding the returned data to each application program return queue.
Specifically, each memory access queue controller is cascaded with the memory access interface controller, and the memory access interface controller performs queue management according to the configuration information in the management register, reads and executes memory access tasks cached in each application program task queue, and returns data to the read data forwarding module.
Specifically, when the management register stores priority information for an application program, the memory access interface controller manages and executes a task queue and a return queue based on the priority information.
Specifically, the memory access interface controller is correspondingly provided with memory access interfaces, and the memory access interfaces are respectively connected to corresponding storage media; and the memory access interface controller reads and writes data into the storage medium through the memory access interface.
Specifically, the storage access controller is further provided with a policy configuration management module; the policy configuration management module is used for receiving configuration policies, configuration information and priority information input from the outside, writing the configuration policies into a policy register in the distributed access controller, and writing the configuration information and the priority information into a management register in the access queue controller.
The above technical solution provides at least the following benefits. Each distributed access controller exposes a single application program access port, determines the storage communication ports according to the target application program, and then interacts through those ports with the access queue controllers of the following stage in one-to-one correspondence. In this scheme, an access queue controller layer is introduced between the distributed access controllers and the memory access interfaces, and it cooperates with the access interface controllers to control the transfers between each memory access interface and each storage medium. Because the N storage communication ports of every distributed access controller correspond one-to-one with the N access queue controllers, a single data request is executed as distributed tasks across the access queue controllers, which improves access efficiency and bandwidth utilization and reduces the waiting latency between tasks at the task-initiating end.
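A hedged sketch of the fully connected topology described above (hypothetical names; Python used purely as an enumeration aid): every one of the M distributed access controllers links a port to every one of the N queue/interface controller pairs.

```python
def build_fabric(m_apps: int, n_ports: int):
    """Enumerate the full-connection wiring: distributed access controller i
    (serving application program i) connects its port j to access queue
    controller j, which is cascaded 1:1 with access interface controller j."""
    return [(i, j) for i in range(1, m_apps + 1)    # distributed controller / APP
                   for j in range(1, n_ports + 1)]  # queue + interface controller

links = build_fabric(m_apps=3, n_ports=4)
print(len(links))   # 3 * 4 = 12 point-to-point links
```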
Drawings
FIG. 1 is a schematic structural diagram of a single-port distributed memory access architecture applied to an FPGA chip in the related art;
FIG. 2 is a schematic structural diagram of a point-to-point distributed access architecture applied to an FPGA chip in the related art;
FIG. 3 is a schematic structural diagram of a bus-arbitrated access architecture applied to an FPGA chip in the related art;
FIG. 4 is a schematic structural diagram of a distributed access controller according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a memory access queue controller according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a memory access controller applied to an FPGA chip according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, embodiments of the present application are described in detail below with reference to the accompanying drawings.
"A plurality" herein means two or more. "And/or" describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects.
Fig. 1 is a schematic structural diagram of a single-port distributed memory access architecture applied to an FPGA chip in the related art. A distributed access controller using this architecture comprises a single application program access port, a distributed access management module, and a plurality of memory access ports. Here, an application program refers to the implementation of a particular algorithm inside the FPGA. This architecture can distribute data evenly across the memory chips and invoke every storage medium's access port simultaneously during reads and writes, achieving maximum access efficiency. However, a single application program access port is limited by its data bit width, operating frequency, and maximum bandwidth, and cannot match the aggregate speed of multiple memory access ports. In addition, accessing memory through a single port may prevent the application program from exploiting the FPGA's parallelism, weakening the algorithm's acceleration.
Fig. 2 is a schematic structural diagram of a point-to-point distributed access architecture applied to an FPGA chip in the related art. The application program access sub-ports and the memory access ports in this architecture correspond one-to-one, which in theory reduces the bandwidth each application program access sub-port must match and maximizes the performance of the multi-port memory. However, in FPGA chip designs the High Bandwidth Memory (HBM) exposes dozens of physical ports, and most application programs cannot be divided into that many access sub-ports for a one-to-one correspondence with the memory access ports. Moreover, this architecture must adapt the number of memory access ports to the application programs for each hardware partition and computation structure, while also balancing bandwidth across different application programs (which is nearly unattainable), so the design workload in practice is large and portability across hardware is poor.
Fig. 3 is a schematic structural diagram of a bus-arbitrated memory access architecture applied to an FPGA chip in the related art. This is a widely adopted architecture in FPGA design: Xilinx and Altera both provide AXI bus exchange modules for their platforms, and Xilinx even integrates a hard-core NoC (network on chip with AXI interfaces) on its new-generation 7 nm chips. The design is simple to use, allows every application program to access any memory region, and supports multi-application parallelism (multiple instances of the same function, or different application programs, can run in parallel). However, this design has the following problems:
1. It cannot solve the access-balance problem: dynamic load differences between the memory access ports are large, and even when total bandwidth is sufficient, blocking at some ports reduces real-time bandwidth and slows the application programs down.
2. The bus exchange capacity is limited and frequently cannot match the ports' transfer capacity.
3. The number of access ports on the application side limits how many memory access ports are in real-time use (even under ideal load balancing, real-time communication is bounded by the side with fewer connected ports); if an application program occupies substantial resources, few instances can be placed, severely hurting bandwidth utilization.
4. The design is tied to a hardware platform and ports poorly.
5. When many small read/write tasks are executed, inter-task latency dominates, hurting access efficiency.
Considering the access-efficiency, portability, transfer-rate, and execution-speed problems of storage access controllers under the above design architectures, this scheme adopts a brand-new fully connected distributed storage access controller architecture based on an FPGA chip. Using multiple distributed access controllers, access queue controllers, and access interface controllers, it achieves parallel processing for multiple application programs, a configurable number of ports, data address rearrangement, and multi-task queued memory access, with stronger adaptability, better parallel multi-application access performance, and higher access efficiency.
Fig. 4 is a schematic structural diagram of a distributed access controller according to an embodiment of the present disclosure.
The number of distributed access controllers matches the number of application programs; that is, the storage access controller is configured according to actual requirements before the FPGA chip is designed, including parameters such as the number of application programs and the access volume. Each distributed access controller provides one application program access port, and the application program access ports correspond one-to-one with individual application programs (one-to-one in number only; a port is not permanently bound to a particular application program). The number of distributed access controllers must be at least the chip's planned number of application programs to satisfy the chip's design requirements. This scheme describes the storage access controller in a working state, where M application programs correspond to M distributed access controllers, i.e., a first distributed access controller through an Mth distributed access controller. The APP access port in fig. 4 is the application program access port, responsible for receiving an application program's read/write requests. After a read/write request enters through the application program access port, the task segmentation module splits it into a target number of new read/write requests: a traditional memory access port is limited in access rate and bandwidth, and an application program's access demand can be too large to serve directly, so this design splits the distributed access controller into N storage communication ports. The specific value of N is determined by the FPGA chip's application scenario and required computing power, and this embodiment does not limit it.
It should be noted that N need not be at least the target number obtained by segmentation. For example, if N in the architecture is 8 and the number of actually split tasks is 6, then 6 of the storage communication ports are used. If the number of actually split tasks is 10, a queue is needed and the surplus tasks are sent through the ports in queued order: tasks 1 to 8 communicate through storage communication ports 1 to 8 respectively, while tasks 9 and 10 likewise use storage communication ports 1 and 2, but only communicate after tasks 1 and 2 complete. The target number is determined by a configuration policy stored in the policy register; the configuration policy includes at least one of the target application program access port, the storage communication ports, the access data volume, the access capacity, and the transmission speed, and the task segmentation module splits requests into the target number of new read/write requests according to these parameters.
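The queued overflow behavior just described (tasks beyond the N ports reusing the low-numbered ports in order) can be sketched as follows; the function name and representation are hypothetical:

```python
from collections import defaultdict

def assign_tasks_to_ports(num_tasks: int, n_ports: int):
    """Assign split subtasks to storage communication ports; when the task
    count exceeds N, surplus tasks queue on the low-numbered ports again
    (task 9 waits behind task 1 on port 1, and so on)."""
    queues = defaultdict(list)
    for task in range(1, num_tasks + 1):
        port = (task - 1) % n_ports + 1   # 1-based round-robin port
        queues[port].append(task)
    return dict(queues)

q = assign_tasks_to_ports(num_tasks=10, n_ports=8)
print(q[1])   # [1, 9]
print(q[2])   # [2, 10]
print(q[3])   # [3]
```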
Illustratively, when an APP requires 4 GB of access capacity and each memory access interface covers 512 MB, the minimum number of interfaces to connect is 8 (correspondingly, 8 storage communication ports are needed, since the storage communication ports correspond one-to-one with the memory access interfaces at the following stage), with 128 bytes accessed at a time. If the APP-side interface speed is twice the access-interface speed, each access slices the contiguous data into 64-byte pieces destined for the spaces of two access interfaces; at 4 times, 32-byte pieces. At 8 times, however, 16 bytes may already be smaller than the interface bit width or burst length, and slices below that length cost more than 50% of the achievable speed, so slicing is kept at a minimum unit of 32 bytes.
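The slice-size arithmetic in this example can be captured in a small helper; the function name is hypothetical, and the 32-byte floor is taken directly from the figures in the example above:

```python
def slice_size(app_bytes_per_access: int, speed_ratio: int, min_slice: int = 32) -> int:
    """Per-interface slice size: divide each access by the APP-to-interface
    speed ratio, but never go below min_slice bytes, because slices under
    the interface bit width / burst length cost over 50% of the speed."""
    return max(app_bytes_per_access // speed_ratio, min_slice)

print(slice_size(128, 2))   # 64-byte slices at a 2x speed ratio
print(slice_size(128, 4))   # 32
print(slice_size(128, 8))   # would be 16, clamped to the 32-byte minimum
```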
The target number of new read/write requests produced by the split form the target number of subtasks, which must be executed separately; therefore the instruction remap module updates the operation addresses of the split subtasks and sends each new read/write request to its corresponding storage communication port.
If the operation is a read, the relevant data must be read from the storage medium and returned, and the data of the multiple subtasks must be reassembled. The scheme therefore also provides a read data recovery module and a read data reassembly module: the read data recovery module recovers the return data of each storage communication port, and the read data reassembly module reassembles the data collected by the recovery module and sends it to the corresponding target application program through the application program access port.
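A minimal sketch of this recovery-then-reassembly flow (hypothetical names; real hardware would track subtask IDs in registers rather than Python tuples):

```python
def reassemble(fragments):
    """Read-data recovery then reassembly: gather per-port return fragments
    (which may arrive out of order) keyed by subtask index, then splice
    them back into one contiguous block for the application program."""
    recovered = sorted(fragments, key=lambda f: f[0])   # recovery by subtask id
    return b"".join(data for _, data in recovered)      # reassembly in order

# fragments arriving out of order from three storage communication ports
print(reassemble([(2, b"o!"), (0, b"he"), (1, b"ll")]))   # b'hello!'
```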
Fig. 5 is a schematic structural diagram of a memory access queue controller according to an embodiment of the present application.
The memory access queue controller is designed to support multiple parallel task-processing queues, reducing the waiting latency between tasks at the task-initiating end and eliminating the multi-port exchange handshake latency present in the design of fig. 3. The storage access controller in this scheme has N access queue controllers, each containing task queues and return queues; the task queues comprise a first application program task queue, a second application program task queue, up to an Mth application program task queue. For requests from different target application programs, the number of application program task queues used is determined by the target number computed at the previous stage. Likewise, the return queues comprise a first application program return queue through an Mth application program return queue, with a one-to-one correspondence between task queues and return queues. As in fig. 5, the APP1 task queue corresponds to the APP1 return queue, and the APP M task queue corresponds to the APP M return queue. Taking the case where each access queue controller holds M pairs of application program task queues and return queues: the APP1 (first application program) task queue and APP1 return queue execute APP1's tasks, and the APP M (Mth application program) task queue and APP M return queue execute APP M's tasks.
It should be noted that different target application programs may use different numbers of storage communication ports on their distributed access controllers, so the number of access queue controllers used by the corresponding APP tasks may also differ. For example, APP1 may need N storage communication ports and thus the first through Nth access queue controllers, whereas APP M may need N-1 storage communication ports and thus the first through (N-1)th access queue controllers.
The access queue controller is also provided with a management register and a read data forwarding module. The management register stores configuration information for the task queues and return queues, including at least one of the size of a single task space, the number of cached tasks, the number of supported application programs, the number of bytes per access, and the capacity of each application program task queue and return queue. The memory access interface controller allocates and generates each task queue and return queue based on this configuration information. The read data forwarding module receives returned data and forwards it to the appropriate application program return queue. Split tasks sent by the preceding stage are stored into the corresponding task queues of the corresponding memory access ports, waiting to be read by the memory access interface controller. When data is read back, the read data forwarding module forwards it to the corresponding application program return queue, where it waits to be taken away.
The access interface controller executes the task queues and return queues according to the configuration information in the management register: it reads and executes the access tasks cached in the application program task queues, generates return tasks to store in the application program return queues, and exchanges data with the corresponding storage media through its memory access interface. In addition, when tasks from different application programs coexist in an access queue controller, priority information for each application program can be set in the management register to guarantee the system's access bandwidth and prevent blocking caused by one thread's oversized access demand; the access interface controller then executes the task queues and return queues in priority order, guaranteeing bandwidth for high-priority accesses.
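A hedged sketch of such priority-ordered draining of the cached tasks (hypothetical names and encoding; here a lower number means higher priority, and arrival order is preserved within an application program):

```python
def drain_by_priority(tasks, priority):
    """Pop cached tasks from the application task queues in priority order
    (lower number = higher priority), keeping arrival order within the
    same application program; unknown apps default to lowest priority."""
    indexed = list(enumerate(tasks))                    # remember arrival order
    indexed.sort(key=lambda kv: (priority.get(kv[1][0], 99), kv[0]))
    return [task for _, task in indexed]

tasks = [("APP2", "rd0"), ("APP1", "rd1"), ("APP2", "wr0")]
print(drain_by_priority(tasks, {"APP1": 0, "APP2": 1}))
# [('APP1', 'rd1'), ('APP2', 'rd0'), ('APP2', 'wr0')]
```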
Fig. 6 is a schematic structural diagram of a memory access controller applied to an FPGA chip according to an embodiment of the present application.
Taking the case where the memory access controller executes read/write requests from APP1 through APP M: the M application programs require M distributed access controllers; the task segmentation module in each distributed access controller performs task segmentation according to the configuration policy in the policy register, while the instruction remap module updates each subtask's operation address, thereby determining the number of storage communication ports each distributed access controller enables. Because the access queue controllers execute with parallel queues, the jth storage communication port of the ith distributed access controller corresponds to the ith application program task queue and the ith application program return queue in the jth access queue controller, which maximizes use of the access queue controllers; here i is a positive integer not greater than M and j is a positive integer not greater than N. Generally, the number of access queue controllers is determined by the application program allocated the most storage communication ports. Taking APP1's N storage communication ports as an example: storage communication port 1 corresponds to the APP1 task queue and APP1 return queue of access queue controller 1; storage communication port N corresponds to the APP1 task queue and APP1 return queue of access queue controller N; and so on.
Each access queue controller configures its task queues according to the management register's parameters, such as the task-space sizes of the task and return queues, the number of bytes per access, and the capacities; the corresponding access interface controller reads data according to the priority information or the default task order and exchanges data with its storage medium through the memory access interface at the following stage. The access interface controllers, memory access interfaces, and storage media correspond one-to-one. It should be noted that a storage medium controller (not shown in fig. 6) is also disposed between each memory access interface and its storage medium, for controlling access to that storage medium.
For a data read request, the access interface controller controls the access interface to acquire the relevant data and then sends it to the read data forwarding module, which places the data into the respective return queues for caching, in order, where it waits to be taken away by the upstream distributed access controller. Since the request task was divided into a target number of subtasks, a given distributed access controller must collect the scattered return data through its ports; this process is executed by the read data recovery module, after which the read data reorganization module reassembles the pieces into a complete data block. Finally, the data is transmitted to the target application program through the APP access port as the read data response.
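The recovery-and-reassembly step above can be sketched as follows: subtask results arrive out of order from the different storage communication ports and are reordered by subtask index into one contiguous block. This is a hypothetical sketch; the function name and data layout are assumptions.

```python
# Sketch of the read data recovery + reorganization step described above:
# scattered subtask responses are gathered and reassembled in subtask order
# to form the complete read-data response. Names are illustrative only.

def reassemble(returned_parts):
    """returned_parts: list of (subtask_index, data_bytes) collected by the
    read data recovery module from the storage communication ports."""
    ordered = sorted(returned_parts, key=lambda p: p[0])  # restore order
    return b"".join(data for _, data in ordered)          # one contiguous block

# Subtask responses arriving out of order from three ports:
parts = [(2, b"CC"), (0, b"AA"), (1, b"BB")]
block = reassemble(parts)  # b"AABBCC"
```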
The policy register and the management register in the storage access controller store the configuration policy, configuration information, priority information and the like, which are configured and managed through an internal policy configuration management module during design and production. An external analysis tool generates a distributed management policy for each APP according to the hardware platform parameters and the APP's memory access parameters. The policy configuration management module then writes the corresponding configuration information and policies into each sub-controller: the configuration policy is written into the policy register in the distributed access controller, and the configuration information and priority information are written into the management register in the access queue controller, thereby realizing the controller architecture provided by this scheme and executing the corresponding control process. In addition, configuration and invocation can be performed during FPGA application design: when used for the digital front-end design of a chip, parameters can be configured according to the chip architecture; when used for FPGA application design, the module can be called in the project and configured according to the application scenario and the hardware design.
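The configuration flow above can be sketched as follows: an external analysis tool derives a per-application port allocation from hardware and application parameters, and the policy configuration management module writes the result into the registers. Everything here is a hypothetical stand-in for illustration; the parameter names and the sizing rule are assumptions, not the patent's actual policy algorithm.

```python
# Hypothetical sketch of the configuration flow described above: an external
# analysis tool produces per-APP policies from hardware/APP parameters, and
# the policy configuration management module writes them into the registers.

hardware_params = {"ports": 8, "port_rate_gbps": 128}          # assumed platform
app_params = {"APP1": {"capacity_gb": 4, "bandwidth_gbps": 204.8}}

def generate_policies(hw, apps):
    """Stand-in for the external analysis tool: give each application a
    number of storage communication ports sized to its bandwidth need."""
    return {
        name: {"ports": max(1, round(p["bandwidth_gbps"] / hw["port_rate_gbps"]))}
        for name, p in apps.items()
    }

def write_registers(policies, policy_regs, mgmt_regs):
    """Stand-in for the policy configuration management module."""
    for app, policy in policies.items():
        policy_regs[app] = policy          # -> distributed access controller
        mgmt_regs[app] = {"priority": 0}   # -> access queue controller

policies = generate_policies(hardware_params, app_params)
policy_regs, mgmt_regs = {}, {}
write_registers(policies, policy_regs, mgmt_regs)
```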
To sum up, in the storage access controller architecture provided by the present application, each distributed access controller employs a single application access port, uses the configuration policy in the policy register to control the task segmentation module to divide a task request into a target number of subtasks, updates the operation address and storage communication port of each subtask through the instruction remap module, and then interacts through the storage communication ports with the one-to-one corresponding downstream access queue controllers. In addition, the read data recovery module is responsible for collecting the data of each subtask, and the read data reorganization module reassembles that data to produce the complete data.
This scheme abandons the traditional point-to-point and bus arbitration schemes and introduces access queue controllers between the distributed access controllers and the access interfaces. Each access queue controller is divided into task queues and return queues and controls the multiple task queues according to policy; the task queues receive the requests/data from the corresponding storage communication port of each distributed access controller, and the related data is then returned through the corresponding application program return queue. The management register allocates the parameters of the task queues and return queues according to the relevant configuration information. In addition, the read data forwarding module is responsible for receiving the data returned by the access interface controller and forwarding it to the return queues for caching. This reduces the inter-task waiting latency at the task-initiating end and eliminates the handshake-waiting latency of multi-port switching.
The access interface control in the scheme is responsible for managing the transmission between the access interface and the corresponding storage medium, and for the request with priority information, the access interface control can execute the sequence of each task queue according to the priority, so that the access efficiency is improved, and the thread is blocked. The policy configuration management module can configure and manage each policy register and management register according to actual requirements and application programs.
Taking the implementation of the Ethereum (Ethash) algorithm on an Intel Stratix 10 MX (S10MX) FPGA as an example, the basic parameters are as follows:
1) storage (HBM): the capacity is 8GB, 16 independent ports (each corresponding to 512MB space), the bit width of each port is 128bit, the rate is 2Gbps at the highest, and 32 pseudo-random ports (actually used interfaces, since the speed of the independent ports is too fast, manufacturers provide the pseudo-random ports in actual application, the pseudo-random ports can be approximately regarded as physical ports, and the access rate of each pseudo-random port is 1Gbps 128 Gbps.
2) Number of APPs: ideally up to 8.
3) APP access bandwidth: in the ideal case, a 400 MHz frequency with a 512-bit width matches the rate of 1.6 HBM pseudo channels. In this ideal state the APP access bandwidth is fairly uniform, memory access and computation inside the APP proceed in parallel, the bandwidth requirement far exceeds the computation, and there is no access latency caused by computation.
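The bandwidth figures quoted above can be checked with a few lines of arithmetic; this is purely illustrative bookkeeping of the numbers already given in the text:

```python
# Checking the bandwidth figures quoted above (illustrative arithmetic only):
# each HBM pseudo channel runs at 1 Gbps per pin over a 128-bit interface,
# and one APP port at 400 MHz x 512 bit matches 1.6 pseudo channels.

pseudo_channel_gbps = 1 * 128            # 1 Gbps/pin * 128 bit -> 128 Gbps
app_port_gbps = 400e6 * 512 / 1e9        # 400 MHz * 512 bit   -> 204.8 Gbps
ratio = app_port_gbps / pseudo_channel_gbps  # 1.6 pseudo channels per APP
```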
4) APP capacity requirement: each APP needs 4 GB of access capacity; the access type is read-only, so the storage can be shared.
The working efficiency of several design architectures is evaluated below by the number of random channels working at full load in real time:
1) The proposed design architecture
The number of full-load working random channels is 1.6 × 8 = 12.8.
2) Single-port distributed memory access architecture
Limited by memory capacity, only two APPs can be used, and the number of full-load working random channels is 1.6 × 2 = 3.2. The proposed architecture delivers 4 times this performance (12.8 / 3.2).
3) Point-to-point distributed memory access architecture
Due to the capacity limitation, the APP would need to be modified to support multiple interfaces; such modification severely affects the operating speed and is not feasible under these application conditions.
4) Bus-arbitration memory access architecture
Limited by the number of APP channels, only 8 access channels can be used; meanwhile, because the data volume of each access is small, the data-return delay accounts for more than 20% of the estimated task time, so the number of full-load working random channels is 1 × 8 × 0.8 = 6.4. The proposed architecture delivers 2 times this performance. (This evaluation does not consider bus switching capacity, which is assumed by default to match the above calculation.)
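The channel-count comparison in items 1) through 4) above reduces to the following arithmetic, reproduced here only to make the speedup claims easy to verify:

```python
# Reproducing the full-load random-channel comparison above (arithmetic only).
ratio_per_app = 1.6                 # pseudo channels matched by one APP port

proposed    = ratio_per_app * 8     # proposed architecture, 8 APPs -> 12.8
single_port = ratio_per_app * 2     # single-port, capacity-limited to 2 APPs -> 3.2
bus_arbiter = 1 * 8 * 0.8           # 8 channels, 20% return-delay loss -> 6.4

speedup_vs_single = proposed / single_port  # 4x
speedup_vs_bus    = proposed / bus_arbiter  # 2x
```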
Therefore, compared with the various design architectures in the related art, a storage access controller adopting this design architecture improves the number of full-load working random channels to varying degrees. The architecture places fewer constraints on APP development, effectively reducing memory-access-oriented optimization work and lowering development difficulty and workload. It exploits hardware performance more effectively, raising memory bandwidth utilization from 10% to more than 80% at best and thereby further improving computing performance. It adapts better: different versions of the memory access controller need not be developed for different APPs. It depends little on the integrated bus inside the FPGA and can easily be ported to FPGAs of different manufacturers and models. It also has good scalability: it supports different APPs (or the same APP) working simultaneously and allows the number of threads to be configured as needed, offering a high degree of freedom.
The above is a description of the preferred embodiment of the invention. It should be understood that the invention is not limited to the particular embodiments described above; devices and structures not described in detail are understood to be implemented in a manner common in the art. Any person skilled in the art may make many possible variations and modifications, or modify equivalent embodiments, without departing from the technical solution of the invention and without affecting its essence. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention remains within the scope of protection of the technical solution of the present invention, provided it does not depart from the content of the technical solution of the present invention.

Claims (9)

1. A memory access controller architecture, characterized in that the memory access controller architecture is used for a memory access controller in an FPGA chip and comprises M distributed access controllers, N access queue controllers and N access interface controllers; the access queue controllers correspond one-to-one to the access interface controllers; M and N are positive integers greater than 1;
the distributed access controller is provided with an independent application program access port and N storage communication ports, the application program access port is used for data interaction with a target application program, and the target application program and the application program access port are in one-to-one correspondence; the N storage communication ports are respectively in one-to-one correspondence with the N access queue controllers and are used for data storage and communication;
the memory access queue controller is cascaded with a memory access interface controller and a storage medium and is used for data storage and communication between the memory access queue controller and the storage medium; the memory access interface controller is used for managing a task queue in the memory access queue controller.
2. The memory access controller architecture as claimed in claim 1, wherein the distributed access controllers comprise a first distributed access controller, a second distributed access controller, through an Mth distributed access controller, with correspondingly configured first, second, through Mth application program access ports, which respectively access a first, second, through Mth application program;
each distributed access controller also comprises a policy register, a task segmentation module and an instruction remap module; the policy register stores a configuration policy for performing task segmentation on a read/write request of the target application program; the configuration policy comprises at least one of a target application program access port, a storage communication port, an access data volume, an access capacity and a transmission speed;
the task segmentation module is used for performing task segmentation on the read/write requests of the target application program according to the configuration strategy, acquiring a target number of subtasks and new read/write requests, and designating a corresponding target number of the storage communication ports;
and the instruction remap module is used for updating the operation addresses of the split target number of subtasks and sending a target number of new read/write requests to the corresponding storage communication ports.
3. The storage access controller architecture of claim 2, wherein when the target application program issues a read data request, the distributed access controller further comprises a read data reorganization module and a read data recovery module;
the read data recovery module is used for recovering the return data of the storage communication ports in real time, and the return data is transmitted back to the corresponding storage communication ports by the target number of access queue controllers;
the read data reorganization module is used for reorganizing the data collected by the read data recovery module and sending the data to the target application program through the application program access port to serve as read data response.
4. The memory access controller architecture of claim 3, wherein the memory access queue controllers comprise a first memory access queue controller, a second memory access queue controller, through an Nth memory access queue controller, each memory access queue controller supporting task queues and return queues for at least M application programs; the task queues comprise a first application program task queue, a second application program task queue, through an Mth application program task queue, and the return queues comprise a first application program return queue, a second application program return queue, through an Mth application program return queue;
the jth storage communication port of the ith distributed access controller corresponds to an ith application program task queue and an ith application program return queue in the jth access queue controller, wherein i is a positive integer not greater than M, and j is a positive integer not greater than N.
5. The memory access controller architecture of claim 3, wherein the memory access queue controller further comprises a management register and a read data forwarding module;
the management register stores configuration information of the task queues and the return queues, and the configuration information comprises at least one of the single-task space size of each application program task queue and return queue, the number of cached tasks, the number of supported application programs, the number of bytes accessed at a time, and the capacity;
the read data forwarding module is used for receiving returned data and forwarding the returned data to each application program return queue.
6. The memory access controller architecture as claimed in claim 5, wherein each of the memory access queue controllers is cascaded with the memory access interface controller, the memory access interface controller performs queue management according to the configuration information in the management register, reads and executes memory access tasks cached in each application task queue, and returns data to the read data forwarding module.
7. The memory access controller architecture of claim 5, wherein when priority information for an application is stored in the management register, the memory access interface controller manages and executes a task queue and a return queue based on the priority information.
8. The memory access controller architecture as claimed in claim 1, wherein the memory access interface controller is correspondingly provided with memory access interfaces, and the memory access interfaces are respectively connected to corresponding storage media; and the memory access interface controller reads and writes data into the storage medium through the memory access interface.
9. The storage access controller architecture of any of claims 1 to 8, wherein the storage access controller is further provided with a policy configuration management module; the policy configuration management module is used for receiving configuration policies, configuration information and priority information input from the outside, writing the configuration policies into a policy register in the distributed access controller, and writing the configuration information and the priority information into a management register in the access queue controller.
CN202210665476.1A 2022-06-14 2022-06-14 Storage access controller architecture Active CN114936171B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210665476.1A CN114936171B (en) 2022-06-14 2022-06-14 Storage access controller architecture

Publications (2)

Publication Number Publication Date
CN114936171A true CN114936171A (en) 2022-08-23
CN114936171B CN114936171B (en) 2023-11-14

Family

ID=82867636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665476.1A Active CN114936171B (en) 2022-06-14 2022-06-14 Storage access controller architecture

Country Status (1)

Country Link
CN (1) CN114936171B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116107923A (en) * 2022-12-27 2023-05-12 深存科技(无锡)有限公司 BRAM-based many-to-many high-speed memory access architecture and memory access system
CN118484143A (en) * 2024-05-22 2024-08-13 深存科技(无锡)有限公司 Data acceleration pipeline synchronous ring

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030214326A1 (en) * 2002-02-11 2003-11-20 Craimer Stephen G. Distributed dynamically optimizable processing communications and storage system
CN104503928A (en) * 2014-12-05 2015-04-08 中国航空工业集团公司第六三一研究所 Random memory circuit based on queue management
CN105183662A (en) * 2015-07-30 2015-12-23 复旦大学 Cache consistency protocol-free distributed sharing on-chip storage framework
US20160232112A1 (en) * 2015-02-06 2016-08-11 Futurewei Technologies, Inc. Unified Memory Bus and Method to Operate the Unified Memory Bus
CN109983449A (en) * 2018-06-30 2019-07-05 华为技术有限公司 The method and storage system of data processing
CN111275179A (en) * 2020-02-03 2020-06-12 苏州浪潮智能科技有限公司 Architecture and method for accelerating neural network calculation based on distributed weight storage
US20210279002A1 (en) * 2020-03-09 2021-09-09 EMC IP Holding Company LLC Method, device, and computer program product for managing memories
CN114443364A (en) * 2021-12-27 2022-05-06 天翼云科技有限公司 Distributed block storage data processing method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sun Derong, "Embedded large-capacity high-speed storage technology based on PowerPC", Electronic Technology & Software Engineering, no. 14 *


Also Published As

Publication number Publication date
CN114936171B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
US10387202B2 (en) Quality of service implementation in a networked storage system with hierarchical schedulers
CN114936171A (en) Memory access controller architecture
US8325603B2 (en) Method and apparatus for dequeuing data
US6418478B1 (en) Pipelined high speed data transfer mechanism
US8307053B1 (en) Partitioned packet processing in a multiprocessor environment
CN101222428B (en) Method, system and hardware used for transmitting data packet in network structure
US20150127691A1 (en) Efficient implementations for mapreduce systems
CN109388338B (en) Hybrid framework for NVMe-based storage systems in cloud computing environments
CN106648896B (en) Method for dual-core sharing of output peripheral by Zynq chip under heterogeneous-name multiprocessing mode
CN101499956B (en) Hierarchical buffer zone management system and method
US20090175275A1 (en) Flexible network processor scheduler and data flow
CN110018781B (en) Disk flow control method and device and electronic equipment
EP2084611B1 (en) Virtualization support in a multiprocessor storage area network
US20020071321A1 (en) System and method of maintaining high bandwidth requirement of a data pipe from low bandwidth memories
CN108768898A (en) A kind of method and its device of network-on-chip transmitting message
CN112685335B (en) Data storage system
US10705985B1 (en) Integrated circuit with rate limiting
US20230367713A1 (en) In-kernel cache request queuing for distributed cache
CN116382861A (en) Self-adaptive scheduling method, system and medium for server network process of NUMA architecture
CN114546287B (en) Method and device for single-channel multi-logic-unit number cross transmission
WO2022170769A1 (en) Communication method, apparatus, and system
CN112236755A (en) Memory access method and device
CN111694635A (en) Service quality control method and device
US20230396561A1 (en) CONTEXT-AWARE NVMe PROCESSING IN VIRTUALIZED ENVIRONMENTS
US9965211B2 (en) Dynamic packet buffers with consolidation of low utilized memory banks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant