CN112256604A - Direct memory access system and method - Google Patents

Direct memory access system and method

Info

Publication number
CN112256604A
Authority
CN
China
Prior art keywords
cache
memory
dma
address
cpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011118140.0A
Other languages
Chinese (zh)
Other versions
CN112256604B (en)
Inventor
姜莹
王海洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202011118140.0A priority Critical patent/CN112256604B/en
Publication of CN112256604A publication Critical patent/CN112256604A/en
Application granted granted Critical
Publication of CN112256604B publication Critical patent/CN112256604B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A Direct Memory Access (DMA) system and method are provided. The system includes: a DMA memory configured to cache data; a memory configured to store data; and a DMA memory indexer configured to: in response to a write command by which a DMA device writes data to a write address, determine whether the write address is an address frequently accessed by the central processing unit (CPU); if the write address is determined to be an address frequently accessed by the CPU, cache the data to be written in the DMA memory; and if the write address is determined not to be an address frequently accessed by the CPU, send the data to be written to the memory for storage.

Description

Direct memory access system and method
Technical Field
The present application relates to the field of integrated circuits, and more particularly, to Direct Memory Access (DMA) systems and DMA methods.
Background
Early CPU memory hierarchies had three layers: CPU registers, DRAM main memory, and disk storage. Because the access latency of registers (one clock cycle) and of main memory differs greatly, designers added an L1 cache (2 to 4 clock cycles) between them. For example, when the CPU needs to fetch the data at address A from main memory, address A is first sent to the cache; if the data at address A is already stored in the cache (a cache hit), the data is fetched directly from the cache and returned to the CPU. Data frequently accessed by the CPU is therefore kept in the cache, and because the CPU reads and writes the cache faster than it reads and writes memory, the cache greatly improves the CPU's data access speed. As CPU speeds have increased, cache speeds have increased accordingly, and multi-level caches have been designed as buffers between the CPU and memory, such as the current second-level cache L2 and third-level cache L3.
Multi-core CPUs have been introduced for higher processing efficiency. In a multi-core CPU, each core has its own cache, so the caches of two cores may both hold copies of the same data. If the two cores each modify their own copy, the copies in the different caches become inconsistent. Fig. 1 illustrates this cache inconsistency problem in the prior art. As shown in Fig. 1, a CPU has multiple cores (CPU core 0, CPU core 1, CPU core 2, CPU core 3) and multi-level caches L1, L2, L3, etc., where the L1 cache is generally dedicated to each core. Assume that some content x in memory, with value 0, is read by both CPU core 0 and CPU core 1 and stored in their caches, so that each cache holds a copy of x equal to 0; CPU core 0 then changes x to 1. If CPU core 1 now accesses x again, its cache still holds the value 0, so CPU core 1 obtains the stale value 0 and a cache inconsistency occurs. There is a need to address this cache inconsistency problem.
DMA is a high-speed data transfer mechanism that allows data to be read and written directly between a DMA external device and memory without passing through the accumulator of a central processing unit (CPU); memory address updates and end-of-transfer reporting are handled by hardware circuits, which greatly increases the data transfer speed.
There is a need to address cache coherency issues in the context of DMA device operation.
Disclosure of Invention
To solve the above and other problems, according to one aspect of the present disclosure, there is provided a direct memory access (DMA) system including: a DMA memory configured to cache data; a memory configured to store data; and a DMA memory indexer configured to: in response to a write command by which a DMA device writes data to a write address, determine whether the write address is an address frequently accessed by the central processing unit (CPU); if the write address is determined to be an address frequently accessed by the CPU, cache the data to be written in the DMA memory; and if the write address is determined not to be an address frequently accessed by the CPU, send the data to be written to the memory for storage.
According to another aspect of the present disclosure, there is provided a direct memory access (DMA) method including: in response to a write command by which a DMA device writes data to a write address, determining, by a DMA memory indexer, whether the write address is an address frequently accessed by the central processing unit (CPU); if the write address is determined to be an address frequently accessed by the CPU, caching, by the DMA memory indexer, the data to be written in a DMA memory; and if it is determined that the write address is not an address frequently accessed by the CPU, sending, by the DMA memory indexer, the data to be written to memory for storage.
According to another aspect of the present disclosure, there is provided a computer-readable medium for storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, perform the method of an embodiment of the present invention.
Embodiments of the present invention add a hardware DMA memory indexer that decides whether data is written into the DMA memory or into memory, replacing the direct write to the CPU cache performed during a write update. The CPU's cache design and coherency protocol therefore need not be modified, and the scheme is fully compatible with existing design architectures.
Drawings
Fig. 1 illustrates the problem of cache inconsistency in the prior art.
FIG. 2 shows a flow diagram of one existing cache coherency scheme for a DMA system.
FIG. 3 shows a block diagram of a DMA system according to an embodiment of the invention.
FIG. 4 shows a flowchart of the operation of a DMA system according to an embodiment of the present invention in the case where the write address is not an address that is frequently accessed by the CPU.
FIG. 5 shows a flow diagram of the operation of a DMA system according to an embodiment of the invention in the case where the write address is an address that is frequently accessed by the CPU but that is not cached in the CPU cache.
FIG. 6 shows a flow diagram of the operation of a DMA system according to an embodiment of the invention in the case where the write address is an address that is frequently accessed by the CPU but that is cached in the CPU cache.
FIG. 7 shows a flow diagram of a direct memory access DMA method according to an embodiment of the invention.
FIG. 8 illustrates a block diagram of an exemplary computer system/server suitable for use in implementing embodiments of the present invention.
Detailed Description
In the existing cache coherence protocol MESI for multiple CPU cores with multiple caches, consider an initial scenario: at the beginning no CPU holds the data, and one CPU performs a read, so an RR occurs (the data is read from memory into the current CPU's cache) and the line's state becomes E (Exclusive: only the current CPU holds the data, consistent with memory). If other CPUs then also read the data from memory, the state changes to S (Shared: multiple CPUs hold the same data, consistent with memory). If one of the CPUs then modifies the data, the state of that CPU's copy becomes M (Modified: it holds the latest data, inconsistent with memory, and its copy is authoritative), and the other CPUs holding the data are notified that their copies are invalid, so the state of the corresponding cache lines in those CPUs becomes I (Invalid: the data may be inconsistent with memory, must not be used, and must be re-fetched). In other words, when a cache controller observes local and remote operations, it must update the state of cache lines with matching addresses so that data flowing among multiple caches remains consistent.
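As a concrete illustration of these transitions, the following is a minimal, self-contained sketch in C that models one cache line's MESI state per core. The four-core count, the per-core state array, and the function names are assumptions made only for this sketch; they are not part of the patent.

```c
#include <stdio.h>

typedef enum { I_INVALID, S_SHARED, E_EXCLUSIVE, M_MODIFIED } mesi_t;

#define NUM_CPUS 4
static mesi_t line_state[NUM_CPUS] = { I_INVALID, I_INVALID, I_INVALID, I_INVALID };

/* CPU `c` reads the line: the first reader gets E, later readers force S everywhere. */
static void cpu_read(int c) {
    int others_have_copy = 0;
    for (int i = 0; i < NUM_CPUS; i++)
        if (i != c && line_state[i] != I_INVALID) others_have_copy = 1;
    if (!others_have_copy) {
        line_state[c] = E_EXCLUSIVE;               /* read miss, no other copy: E */
    } else {
        line_state[c] = S_SHARED;                  /* shared copy: S */
        for (int i = 0; i < NUM_CPUS; i++)
            if (i != c && line_state[i] != I_INVALID) line_state[i] = S_SHARED;
    }
}

/* CPU `c` writes the line: it goes to M, every other copy is invalidated. */
static void cpu_write(int c) {
    line_state[c] = M_MODIFIED;
    for (int i = 0; i < NUM_CPUS; i++)
        if (i != c) line_state[i] = I_INVALID;     /* snooped invalidate */
}

int main(void) {
    cpu_read(0); cpu_read(1);    /* both cores cache x == 0, states become S/S */
    cpu_write(0);                /* core 0 sets x = 1: core 0 -> M, core 1 -> I */
    printf("core0 state=%d core1 state=%d\n", line_state[0], line_state[1]);
    return 0;
}
```

This mirrors the scenario of Fig. 1: after the write by core 0, core 1's copy is marked I and must be re-fetched rather than served stale.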
In a DMA scenario, many high-performance processors accelerate DMA writes from an external device to memory as follows: when the cache directory shows that the write address of a DMA write command hits in a CPU cache, the DMA data is not written to memory with the corresponding CPU cache line invalidated; instead, the DMA write data is updated directly into the CPU cache. Examples include Freescale's I/O stashing and Intel's Data Direct I/O (DDIO) technology. When the DMA device writes to memory and the write can be performed directly on the cache, a write that hits a cache line in the V (Valid) state does not write that line back to memory; the data is written directly into the cache and the line remains in the V (Valid) state. This method effectively improves the efficiency of DMA writes to memory. With directory-based cache coherence, the following operations are performed when a device DMA write hits a cache line of the CPU cache using this direct-write-to-cache-line method, as shown in Fig. 2.
FIG. 2 shows a flow diagram of one existing cache coherency scheme for a DMA system.
In step 201, a write request from the DMA device to memory is sent to the cache directory on the bus. The cache directory on the bus records the state of each cache line and the ID of the CPU to which the cache line belongs (e.g., which of CPU0, CPU1, and CPU2 wrote or modified it).
In step 202, the cache directory is searched and the address of the write request is found to hit a cache line; assume that the CPU ID recorded in the cache directory as having exclusive rights to the hit cache line is CPU0. In Fig. 2, the ID of the cache line points to CPU0 and the state is V (Valid), indicating that the line is exclusively owned by CPU0 and the cache of CPU0 can be modified directly. The cache directory initiates a bus operation to write the data directly into the cache of CPU0. Note that the DMA device need not write the data to memory at this point.
In step 203, the cache directory sends an invalidate command to the other CPUs (e.g., CPU1, CPU2) to invalidate the DMA write address in their caches, thereby ensuring that the up-to-date data resides in the CPU0 cache.
In step 204, assume that CPU1 needs the data written by the above DMA at that address; CPU1 initiates a read operation, and the address is not cached in the cache of CPU1.
In step 205, the cache of CPU1 issues a read command to the cache directory.
In step 206, the cache directory looks up the address, finds that it is cached in the cache of CPU0, and forwards the command to the cache of CPU0.
In step 207, the cache of CPU0 returns the data to CPU1.
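The prior-art flow above can be summarized by the following rough sketch. The structure names, the single-level directory, and the one-line-per-CPU cache are simplifications assumed only for illustration; this is not the patent's design nor any vendor's actual implementation of stashing/DDIO.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CPUS 3
typedef enum { LINE_INVALID, LINE_VALID } line_state_t;
typedef struct { uint64_t addr; uint32_t data; line_state_t state; } cpu_cache_line_t;
typedef struct { line_state_t state; int owner_cpu; } dir_entry_t;

static cpu_cache_line_t cpu_cache[NUM_CPUS];   /* one line per CPU, for illustration only */

/* Prior-art handling of a DMA write whose address hits a valid line owned by one CPU:
 * write the data directly into that CPU's cache, invalidate the copies elsewhere,
 * and do not write the data to memory at all. */
static void prior_art_dma_write_hit(dir_entry_t *dir, uint64_t addr, uint32_t data) {
    int owner = dir->owner_cpu;
    cpu_cache[owner].addr  = addr;
    cpu_cache[owner].data  = data;              /* step 202: direct write into the owner's cache */
    cpu_cache[owner].state = LINE_VALID;
    for (int c = 0; c < NUM_CPUS; c++)          /* step 203: invalidate the other CPUs' copies */
        if (c != owner && cpu_cache[c].addr == addr)
            cpu_cache[c].state = LINE_INVALID;
    dir->state = LINE_VALID;                    /* the line stays valid, still owned by that CPU */
}

int main(void) {
    dir_entry_t dir = { LINE_VALID, 0 };        /* the hit line is owned by CPU0 */
    prior_art_dma_write_hit(&dir, 0x1000, 0xBEEF);
    printf("CPU0 line: data=0x%x state=%d\n", cpu_cache[0].data, cpu_cache[0].state);
    return 0;
}
```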
From a practical implementation point of view, it is still quite difficult for a DMA device to write data directly into the CPU cache. This is especially true when the processor has multiple cache levels: during bus snooping, the write address may hit in L1, L2, or L3, which complicates matters, and the protocol state machine among the cache levels is far more complicated than the bus protocol.
The above solution mainly has the following disadvantages:
1) The processor has multiple cache levels, so the state-machine protocol among the levels of caches must be modified and verified for completeness against the previous protocol, which increases design and verification difficulty.
2) The CPU cache design requires major modifications to support DMA write-update operations and is therefore not compatible with existing designs.
Therefore, there is still a need for better solutions to the cache coherency problem in the case of operation of a DMA device.
Reference will now be made in detail to the present embodiments of the invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the specific embodiments, it will be understood that they are not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. It should be noted that the method steps described herein may be implemented by any functional block or functional arrangement, and that any functional block or functional arrangement may be implemented as a physical entity or a logical entity, or a combination of both.
In order that those skilled in the art will better understand the present invention, the following detailed description is provided in conjunction with the accompanying drawings. Note that the examples described next are only specific examples and are not intended to limit the embodiments of the present invention; the specific shapes, hardware, connections, steps, numerical values, conditions, data, orders, and the like that are shown and described are not required. Those skilled in the art can, upon reading this specification, use the concepts of the present invention to construct more embodiments than those specifically described herein.
FIG. 3 shows a block diagram of a DMA system according to an embodiment of the invention.
As shown in Fig. 3, a DMA system 300 according to an embodiment of the present invention includes: a DMA memory 301 configured to cache data; a memory 302 configured to store data; and a DMA memory indexer 303 configured to: in response to a write command by which the DMA device 306 writes data to a write address, determine whether the write address is an address frequently accessed by the central processing unit (CPU) 307; if it is determined that the write address is an address frequently accessed by the CPU 307, cache the data to be written in the DMA memory 301; and if it is determined that the write address is not an address frequently accessed by the CPU 307, send the data to be written to the memory 302 for storage.
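For illustration, the following is a minimal, self-contained sketch of the write-path decision made by the DMA memory indexer 303. The attribute-bit encoding (bit 63 of the address), the function names, and the print statements standing in for the DMA memory and system memory are assumptions for this sketch, not the patent's actual interfaces.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HOT_BIT (1ULL << 63)   /* assumed attribute bit carried with the write address */

static void dma_memory_cache(uint64_t addr, uint32_t data) {
    printf("DMA memory    <- addr 0x%llx data 0x%x\n", (unsigned long long)addr, data);
}
static void system_memory_store(uint64_t addr, uint32_t data) {
    printf("system memory <- addr 0x%llx data 0x%x\n", (unsigned long long)addr, data);
}

/* Entry point for a DMA device write command. */
static void dma_indexer_handle_write(uint64_t addr_with_attr, uint32_t data) {
    bool cpu_hot = (addr_with_attr & HOT_BIT) != 0;   /* frequently accessed by the CPU? */
    uint64_t addr = addr_with_attr & ~HOT_BIT;
    if (cpu_hot)
        dma_memory_cache(addr, data);     /* allocate space in the DMA memory and cache it */
    else
        system_memory_store(addr, data);  /* not CPU-hot: bypass all caches, store in memory */
}

int main(void) {
    dma_indexer_handle_write(0x1000 | HOT_BIT, 0xABCD); /* hot address: cached in the DMA memory */
    dma_indexer_handle_write(0x2000, 0x1234);           /* cold address: stored in memory */
    return 0;
}
```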
In a multi-CPU-core scenario, the CPU 307 may include CPU0, CPU1, CPU2, and so on. Although three CPUs are shown in Fig. 3, the present invention is not limited thereto, and the number of CPUs may be arbitrary.
The embodiment of the present invention adds the hardware DMA memory indexer 303 to decide whether data is written into the DMA memory 301 or into memory, replacing the direct write to the CPU cache performed during a write update; the CPU's cache design and coherency protocol therefore need not be modified, and the scheme is fully compatible with existing design architectures.
The DMA memory indexer mainly realizes the following functions:
1) determining whether a write address initiated by the DMA device is an address frequently accessed by the CPU; if not, the data is not cached in the DMA memory, and if so, allocating space in the DMA memory for the write operation to cache the data;
2) converting the DMA write command from a non-cacheable command to a cacheable command if the data to be written is cached in the DMA memory;
3) according to the cache consistency protocol, performing the cache consistency protocol with the cache directory;
4) sending the data to a DMA memory or a memory according to the requirement of whether the data needs to be cached;
5) when the DMA memory is full and a new command to write to the DMA memory arrives, writing data cached in the DMA memory back to the memory to free the DMA memory.
In addition, this embodiment of the present invention designs a hardware DMA memory, which mainly implements the following functions:
1) storing data to be written into a memory by DMA equipment;
2) when a memory address accessed by the CPU or the DMA device matches a memory address stored in the DMA memory, returning the data stored in the DMA memory to the initiator of the access.
To achieve the above functionality, a specific embodiment of the DMA system 300 and its operation is described below.
In one embodiment, the DMA memory indexer 303 is configured to determine whether the write address is an address frequently accessed by the CPU 307 from an attribute bit of the address carried in the write command. The attribute bit may already exist in the prior art or may be added in the present embodiment, as long as it can indicate whether the write address is an address frequently accessed by the CPU. Of course, this determination may also be made in other ways, for example by using a lookup table that stores the number of times each address has been accessed by the CPU 307 and comparing the recorded count for the write address against a threshold, although the embodiment of the invention is not limited thereto. Here, the CPU 307 may be a single CPU, such as CPU0, or several or all CPUs; that is, determining whether the write address is an address frequently accessed by the CPU 307 may mean determining whether it is frequently accessed by one particular CPU, or by several or all CPUs.
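The lookup-table alternative mentioned above could look roughly like the following sketch. The table size, indexing scheme, replacement policy, and threshold are assumptions chosen only for illustration; the patent does not specify them.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_ENTRIES 256
#define HOT_THRESHOLD 16

typedef struct { uint64_t addr; uint32_t cpu_access_count; } hot_entry_t;
static hot_entry_t hot_table[TABLE_ENTRIES];

static hot_entry_t *lookup(uint64_t addr) {
    return &hot_table[(addr >> 6) % TABLE_ENTRIES];   /* index by cache-line address */
}

/* Called on every CPU access so the table tracks access frequency. */
static void hot_table_record_cpu_access(uint64_t addr) {
    hot_entry_t *e = lookup(addr);
    if (e->addr != addr) { e->addr = addr; e->cpu_access_count = 0; }  /* simple replacement */
    e->cpu_access_count++;
}

/* Used by the DMA memory indexer to classify a DMA write address. */
static bool addr_is_cpu_hot(uint64_t addr) {
    hot_entry_t *e = lookup(addr);
    return e->addr == addr && e->cpu_access_count >= HOT_THRESHOLD;
}

int main(void) {
    for (int i = 0; i < 20; i++) hot_table_record_cpu_access(0x1000);  /* CPU touches 0x1000 often */
    printf("0x1000 hot? %d   0x2000 hot? %d\n", addr_is_cpu_hot(0x1000), addr_is_cpu_hot(0x2000));
    return 0;
}
```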
In one embodiment, the DMA device 306 and the one or more CPUs 307 are assigned unique IDs that differ from each other, so that the cache directory can uniquely identify the hardware to which a cache line belongs, e.g., whether the cache line was written by the DMA device 306 or by one of the CPUs 307. In other words, the directory records the ID of the device (CPU or DMA) that has the right to modify the cache line.
In one embodiment, the DMA system 300 may further comprise: a CPU cache 304 configured to cache data; the cache directory 305 is configured to record a status bit of a cache line indicating whether data is cached and an identifier ID indicating hardware to which the cache line belongs.
In one embodiment, the cache directory 305 is specifically configured to: if the DMA memory indexer 303 determines that the write address is an address frequently accessed by the CPU 307, determine whether the write address has a cache line in the CPU cache 304. If it is determined that the write address does not have a cache line in the CPU cache 304, the cache directory 305 updates the status bit of the cache line of the write address to valid and updates its identifier ID to the ID of the DMA device 306. If it is determined that the write address does have a cache line in the CPU cache 304, the cache directory 305 broadcasts an invalidate command to the one or more CPUs 307 so that they write back the cached data of the write address in the CPU cache 304, and the cache directory 305 invalidates the cache line of the write address; after the data to be written is cached in the DMA memory 301, the status bit of the cache line of the write address is updated to valid and its identifier ID is updated to the ID of the DMA device 306.
In this way, when the write address is an address frequently accessed by the CPU 307 (i.e., the data to be cached needs to be written to the DMA memory 301), different cache coherency actions are taken depending on whether the write address has a cache line in the CPU cache 304.
Specifically, if the write address does not have a cache line in the CPU cache 304, no cache coherency operation is needed: the data is simply written to the DMA memory 301, and the cache directory 305 updates the status bit of the cache line of the write address to valid and its identifier ID to the ID of the DMA device 306, indicating that the line was written to the DMA memory 301 by the DMA device 306 and should subsequently be read from the DMA memory 301.
If the write address does have a cache line in the CPU cache 304, a cache coherency operation is required. The cache directory 305 broadcasts an invalidate command to the one or more CPUs 307, causing them to write back the cached data of the write address in the CPU cache 304, and the cache directory 305 invalidates that cache line. In this way, the copies in the other CPU caches are removed: the newly written data is cached only in the DMA memory 301, without copies in multiple CPU caches, which resolves the cache coherency problem. After the data to be written is cached in the DMA memory 301, the status bit of the cache line of the write address is updated to valid and its identifier ID is updated to the ID of the DMA device 306.
In this manner, cache coherency may be guaranteed in the case of DMA devices.
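A rough sketch of the cache-directory handling just described follows. The directory-entry layout, the numeric ID values, and the single-entry directory are simplifications assumed for illustration, not the patent's implementation.

```c
#include <stdint.h>
#include <stdio.h>

enum { ID_NULL = -1, ID_CPU0 = 0, ID_CPU1 = 1, ID_CPU2 = 2, ID_DMA = 100 };
typedef enum { LINE_INVALID, LINE_VALID } line_state_t;
typedef struct { uint64_t addr; line_state_t state; int owner_id; } dir_entry_t;

static void broadcast_invalidate(dir_entry_t *e) {
    /* ask the owning CPU(s) to write the line back, then drop their copies */
    printf("invalidate addr 0x%llx owned by CPU%d (write back, then drop)\n",
           (unsigned long long)e->addr, e->owner_id);
    e->state = LINE_INVALID;
    e->owner_id = ID_NULL;
}

/* Directory action for a DMA write to a CPU-hot address. */
static void directory_on_dma_hot_write(dir_entry_t *e, uint64_t write_addr) {
    if (e->state == LINE_VALID && e->owner_id != ID_DMA)
        broadcast_invalidate(e);           /* remove stale CPU copies first */
    /* the data is now cached in the DMA memory, so record that ownership */
    e->addr = write_addr;
    e->state = LINE_VALID;
    e->owner_id = ID_DMA;
}

int main(void) {
    dir_entry_t line = { 0x1000, LINE_VALID, ID_CPU0 };  /* CPU0 currently holds the line */
    directory_on_dma_hot_write(&line, 0x1000);
    printf("state=%s owner=%d\n", line.state == LINE_VALID ? "V" : "I", line.owner_id);
    return 0;
}
```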
When the CPU reads data, in one embodiment, the DMA memory 301 is configured as follows: if, in response to a CPU read command for a memory address, the cache directory determines that the address is in the DMA memory 301 or in the CPU cache 304, the data at that address in the DMA memory 301 or the CPU cache 304 is sent to the CPU 307; if the cache directory determines that the address is in neither the DMA memory 301 nor the CPU cache 304, the CPU read command is sent to the memory 302. Whether the memory address is in the DMA memory 301 or in the CPU cache 304 can be determined from whether the identifier ID of the cache line in the cache directory is the ID of the DMA device 306 or the ID of a particular CPU.
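The read routing described above can be sketched as follows, again with assumed structure names and ID values: depending on the recorded state and owner ID, the directory serves the CPU read from the DMA memory, from another CPU's cache, or from system memory.

```c
#include <stdint.h>
#include <stdio.h>

enum { ID_NULL = -1, ID_DMA = 100 };
typedef enum { LINE_INVALID, LINE_VALID } line_state_t;
typedef struct { line_state_t state; int owner_id; } dir_entry_t;

static uint32_t dma_memory_read(uint64_t a)        { printf("read 0x%llx from DMA memory\n", (unsigned long long)a); return 0; }
static uint32_t cpu_cache_read(int id, uint64_t a) { printf("read 0x%llx from CPU%d cache\n", (unsigned long long)a, id); return 0; }
static uint32_t system_memory_read(uint64_t a)     { printf("read 0x%llx from system memory\n", (unsigned long long)a); return 0; }

static uint32_t directory_serve_cpu_read(const dir_entry_t *e, uint64_t addr) {
    if (e->state == LINE_VALID && e->owner_id == ID_DMA)
        return dma_memory_read(addr);              /* data lives in the DMA memory */
    if (e->state == LINE_VALID)
        return cpu_cache_read(e->owner_id, addr);  /* data lives in another CPU's cache */
    return system_memory_read(addr);               /* state I: fall through to memory */
}

int main(void) {
    dir_entry_t owned_by_dma = { LINE_VALID, ID_DMA };
    dir_entry_t not_cached   = { LINE_INVALID, ID_NULL };
    directory_serve_cpu_read(&owned_by_dma, 0x1000);
    directory_serve_cpu_read(&not_cached, 0x2000);
    return 0;
}
```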
In one embodiment, when the DMA memory 301 is full and a new command to write to the DMA memory arrives, the DMA memory indexer 303 writes data cached in the DMA memory 301 back to the memory 302 to free the DMA memory 301.
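A minimal sketch of this eviction rule follows; the fixed capacity, the slot structure, and the victim-selection policy (slot 0) are assumptions for illustration, since the patent does not prescribe them.

```c
#include <stdint.h>
#include <stdio.h>

#define DMA_MEM_SLOTS 4
typedef struct { uint64_t addr; uint32_t data; int valid; } dma_slot_t;
static dma_slot_t dma_mem[DMA_MEM_SLOTS];

static void system_memory_store(uint64_t a, uint32_t d) {
    printf("flush 0x%llx (data 0x%x) -> system memory\n", (unsigned long long)a, d);
}

/* Find a free slot; if the DMA memory is full, write one entry back to memory first. */
static dma_slot_t *dma_indexer_alloc_slot(void) {
    for (int i = 0; i < DMA_MEM_SLOTS; i++)
        if (!dma_mem[i].valid) return &dma_mem[i];
    /* full: evict slot 0 (any policy could be used) to free the DMA memory */
    system_memory_store(dma_mem[0].addr, dma_mem[0].data);
    dma_mem[0].valid = 0;
    return &dma_mem[0];
}

int main(void) {
    for (uint64_t a = 0; a < 6; a++) {             /* 6 writes into 4 slots force evictions */
        dma_slot_t *s = dma_indexer_alloc_slot();
        s->addr = 0x1000 + a * 64; s->data = (uint32_t)a; s->valid = 1;
    }
    return 0;
}
```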
The respective operations in the above-described embodiment are described in detail below with reference to fig. 4 to 6.
FIG. 4 shows a flowchart of the operation of a DMA system according to an embodiment of the present invention in the case where the write address is not an address that is frequently accessed by the CPU.
When the write address is not an address frequently accessed by the CPU, the data does not need to be written into any cache: since the CPU rarely accesses it, caching it would not improve access efficiency and would merely occupy cache space. The operation in this case is shown in Fig. 4:
in step 401, the DMA device sends a write command to the DMA memory indexer, and the DMA memory indexer may query an attribute bit of a DMA address carried in the write command to find that the address is not an address frequently accessed by the CPU;
in step 402, the DMA memory indexer directly sends the DMA data to be written to memory without allocating a memory area in DMA memory for the write command. Since the data is not written to any CPU cache or DMA memory, the cache directory also does not modify the state of any cache line;
in step 403, when the CPU initiates an access to the DMA address, the cache directory queries the status bit of the address's cache line and finds that the state of the DMA address is I (Invalid), indicating that the address is cached neither in any CPU cache nor in the DMA memory;
in step 404, the cache directory sends an access request of the CPU to the DMA address to the memory;
in step 405, the memory queries and returns the data of the DMA address to the CPU.
Thus, in the case where the address is not an address that is frequently accessed by the CPU, the DMA memory indexer may directly send the data to memory for storage, thereby not taking up space in the cache.
FIG. 5 shows a flow diagram of the operation of a DMA system according to an embodiment of the invention in the case where the write address is an address that is frequently accessed by the CPU but that is not cached in the CPU cache.
The operation in the case where the write address of the DMA device is an address that is frequently accessed by the CPU and that address has no cache in the CPU cache is shown in fig. 5.
In step 501, the DMA device sends a write command to the DMA memory indexer, and the DMA memory indexer queries attribute bits of an address carried in the write command and finds that the address is an address frequently accessed by the CPU;
in step 502, the DMA memory indexer sends a command to the cache directory informing the cache directory that the data to be written is to be cached in the DMA memory; the cache directory looks up the cache line state of the address, finds that the state of the address is I, and the ID is Null, indicating that the address is not cached in either the CPU cache or the DMA memory. Since the data to be written is to be cached in the DMA memory, the cache directory updates the cache line status bit of the address to V, the ID is the ID of the DMA device, and the data indicating the address is cached in the DMA memory.
In step 503, after the cache directory has updated the status of the cache line, a command is sent to the DMA memory indexer to write the DMA data to the DMA memory;
in step 504, the DMA memory indexer allocates a storage address for the data to be written by the DMA device and writes that data to the DMA memory;
in step 505, when the CPU initiates an access to an address of the memory, the cache directory finds that the status bit of the cache line of the address is V through searching, and the ID is the ID of the DMA device, indicating that the data of the address is cached in the DMA memory;
in step 506, the cache directory sends a command to the DMA memory requesting the DMA memory to send the data at the address to the CPU;
in step 507, the DMA memory sends the data to the CPU.
Therefore, when the write address is an address frequently accessed by the CPU but not cached in the CPU cache, the data is written directly into the DMA memory rather than into the CPU cache or the memory, so no cache coherency operation on data in the CPU cache is needed; and because the cache directory records the DMA device's ID for the cache line, a subsequent read locates the data uniquely in the DMA memory, so the cache inconsistency problem is avoided.
FIG. 6 shows a flow diagram of the operation of a DMA system according to an embodiment of the invention in the case where the write address is an address that is frequently accessed by the CPU but that is cached in the CPU cache.
The operation in the case where the write address of the DMA device is an address that is frequently accessed by the CPU and the address is cached in the CPU cache is shown in fig. 6:
in step 601, the DMA device sends a write command to the DMA memory indexer, and the DMA memory indexer queries attribute bits of an address carried in the write command and finds that the address is an address frequently accessed by the CPU;
in step 602, the DMA memory indexer sends a command to the cache directory to inform the cache directory that the data to be written is to be cached in the DMA memory; the cache directory queries the state of the cache line of the address, finds that the state of the address is V, the ID is a certain CPU (for example, CPU 0), and indicates that the data of the address has cache in the CPU cache;
in step 603, regardless of which CPU the ID points to, the cache directory broadcasts an invalidate command, requiring the CPU or CPUs to write back the cached data at that address in their CPU caches, and the corresponding cache line is invalidated in the cache directory. Any CPU copy of the data at that address is thus removed, which solves the cache inconsistency problem among different CPUs. Because the data at that address will be cached in the DMA memory, the status bit of the corresponding cache line in the cache directory is updated to V and its ID is set to the ID of the DMA device, indicating that the data at that address is cached in the DMA memory;
in step 604, the cache directory sends a command to the DMA memory indexer after updating the state of the cache line;
in step 605, the DMA memory indexer allocates a memory address for the data to be written by the DMA device and writes the DMA data to the DMA memory;
in step 606, when the CPU initiates a read operation for accessing an address of the memory, the cache directory finds that the status bit of the cache line of the address is V through searching, and the ID is the ID of the DMA device, indicating that the data of the address is cached in the DMA memory;
in step 607, the cache directory sends a command to the DMA memory requesting the DMA memory to send the data at the address to the CPU;
in step 608, the DMA memory sends the data at that address to the CPU.
In this embodiment, when the write address is an address frequently accessed by the CPU and the address has a copy in the CPU cache, a cache coherency operation is required: by writing back the data in the CPU caches of all CPUs and invalidating the cache line in the cache directory, the cache inconsistency problem among different CPUs is solved.
Therefore, a hardware DMA memory indexer is added to decide whether data is written into the DMA memory or into memory, replacing the direct write to the CPU cache during a write update, and the cache coherency protocol is executed only when the data is written into the DMA memory and the address already has a copy in the CPU cache. The cache design and coherency protocol of the CPU therefore need not be modified, and the scheme is fully compatible with the existing design framework.
Of course, the examples in the above embodiments and drawings are only examples and are not limiting. For instance, the illustration of the ID and status bits of a cache line in the cache directory in the figures is only an example, not a limitation. The cache directory in this disclosure may be regarded as hardware, such as a static random access memory (SRAM) that records the cache directory information, or a lookup table or other hardware that manages the cache directory information; it is not limited here, as long as the functions in the embodiments of the present invention can be implemented.
FIG. 7 shows a flow diagram of a direct memory access DMA method 700 according to an embodiment of the invention.
The direct memory access DMA method 700 shown in Fig. 7 includes: step 701, in response to a write command by which a DMA device writes data to a write address, determining, by a DMA memory indexer, whether the write address is an address frequently accessed by the central processing unit (CPU); step 702, if the write address is determined to be an address frequently accessed by the CPU, caching, by the DMA memory indexer, the data to be written in a DMA memory; step 703, if it is determined that the write address is not an address frequently accessed by the CPU, sending, by the DMA memory indexer, the data to be written to memory for storage.
In one embodiment, the method 700 further comprises: if the DMA memory indexer determines that the write address is an address frequently accessed by a CPU, determining whether the write address has a cache line in a CPU cache by a cache directory, wherein the cache directory is configured to record a status bit of the cache line indicating whether data is cached and an identifier ID indicating hardware to which the cache line belongs; if the cache directory determines that the write address does not have a cache line in the CPU cache, the cache directory updates the state bit of the cache line of the write address to be valid, and updates the ID of the identifier of the cache line to be the ID of the DMA device; if the write address is determined to have a cache line in the CPU cache, an invalidation command is broadcast by the cache directory to one or more CPUs so that the one or more CPUs write back data of the cache of the write address in the CPU cache and the cache line of the write address is invalidated by the cache directory, and after the data to be written is cached in the DMA memory, the status bit of the cache line of the write address is updated to be valid and the ID of the identifier ID thereof is updated to be the ID of the DMA device.
In one embodiment, the method 700 further comprises: if the cache directory responds to a CPU read command for reading a memory address and determines that the memory address is in the DMA memory or the CPU cache, the DMA memory sends data of the memory address in the DMA memory or the CPU cache to the CPU, and if the cache directory responds to a CPU read command for reading a memory address and determines that the memory address is not in the DMA memory or the CPU cache, the DMA memory sends the CPU read command to the memory.
In one embodiment, whether the write address is an address frequently accessed by the CPU is determined by an attribute bit of the address carried in the write command.
In one embodiment, the DMA device and the one or more CPUs are assigned unique IDs that are different from each other so that the cache directory uniquely identifies the hardware to which the cache line belongs.
In one embodiment, the method 700 further comprises: configured by the DMA storage indexing appliance to at least one of: converting the DMA write command from a non-cacheable command to a cacheable command if data to be written is cached in the DMA memory; according to a cache consistency protocol, performing the cache consistency protocol with a cache directory; when the DMA memory is full and in response to a new command to write to the DMA memory, it is responsible for writing the data cached in the DMA memory to the memory to free the DMA memory.
FIG. 8 illustrates a block diagram of an exemplary computer system/server suitable for use in implementing embodiments of the present invention.
The computer system may include a processor (H1); a computer-readable medium (H2) coupled to the processor (H1) and having stored therein computer-executable instructions for performing the functions and/or steps of the methods in the described embodiments of the present technology.
The processor (H1) may include, but is not limited to, for example, one or more processors or microprocessors or the like.
The computer-readable medium (H2) may include, but is not limited to, for example, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, a hard disk, a floppy disk, a solid state disk, a removable disk, a CD-ROM, a DVD-ROM, a Blu-ray disk, and the like.
In addition, the computer system may include a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and an input/output device (H6) (e.g., a keyboard, a mouse, a speaker, etc.), among others.
The processor (H1) may communicate with external devices (H5, H6, etc.) via a wired or wireless network (not shown) over an I/O bus (H4).
Of course, the above-mentioned embodiments are merely examples and not limitations. Those skilled in the art can, according to the concepts of the present invention, combine steps and apparatuses from the separately described embodiments above to achieve the effects of the present invention; such combined embodiments are also included in the present invention and are not described here one by one.
It is noted that advantages, effects, and the like, which are mentioned in the present disclosure, are only examples and not limitations, and they are not to be considered essential to various embodiments of the present invention. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the invention is not limited to the specific details described above.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The flowchart of steps in the present disclosure and the above description of methods are merely illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by those skilled in the art, the order of the steps in the above embodiments may be performed in any order. Words such as "thereafter," "then," "next," etc. are not intended to limit the order of the steps; these words are only used to guide the reader through the description of these methods. Furthermore, any reference to an element in the singular, for example, using the articles "a," "an," or "the" is not to be construed as limiting the element to the singular.
In addition, the steps and devices in the embodiments are not limited to be implemented in a certain embodiment, and in fact, some steps and devices in the embodiments may be combined according to the concept of the present invention to conceive new embodiments, and these new embodiments are also included in the scope of the present invention.
The individual operations of the methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software components and/or modules including, but not limited to, a hardware circuit, an Application Specific Integrated Circuit (ASIC), or a processor.
The various illustrative logical blocks, modules, and circuits described may be implemented or described with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a field programmable gate array signal (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in any form of tangible storage medium. Some examples of storage media that may be used include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM, and the like. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. A software module may be a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media.
The methods disclosed herein comprise one or more acts for implementing the described methods. The methods and/or acts may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of actions is specified, the order and/or use of specific actions may be modified without departing from the scope of the claims.
The above-described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions on a tangible computer-readable medium. A storage media may be any available tangible media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. As used herein, disk (disk) and disc (disc) includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Accordingly, a computer program product may perform the operations presented herein. For example, such a computer program product may be a computer-readable tangible medium having instructions stored (and/or encoded) thereon that are executable by one or more processors to perform the operations described herein. The computer program product may include packaged material.
Software or instructions may also be transmitted over a transmission medium. For example, the software may be transmitted from a website, server, or other remote source using a transmission medium such as coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, or microwave.
Further, modules and/or other suitable means for carrying out the methods and techniques described herein may be downloaded and/or otherwise obtained by a user terminal and/or base station as appropriate. For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, the various methods described herein can be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a CD or floppy disk) so that the user terminal and/or base station can obtain the various methods when coupled to or providing storage means to the device. Moreover, any other suitable technique for providing the methods and techniques described herein to a device may be utilized.
Other examples and implementations are within the scope and spirit of the disclosure and the following claims. For example, due to the nature of software, the functions described above may be implemented using software executed by a processor, hardware, firmware, hard wiring, or any combination of these. Features implementing functions may also be physically located at various locations, including being distributed so that portions of functions are implemented at different physical locations. Also, as used herein, including in the claims, "or" as used in a list of items prefaced by "at least one of" indicates a disjunctive list, such that a list of "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.
Various changes, substitutions and alterations to the techniques described herein may be made without departing from the techniques of the teachings as defined by the appended claims. Moreover, the scope of the claims of the present disclosure is not limited to the particular aspects of the process, machine, manufacture, composition of matter, means, methods and acts described above. Processes, machines, manufacture, compositions of matter, means, methods, or acts, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or acts.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the invention. Thus, the present invention is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the invention to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (13)

1. A direct memory access, DMA, system comprising:
a DMA memory configured to cache data;
a memory configured to store data;
a DMA memory indexer configured to:
in response to a write command of the DMA device writing data to a write address, determining whether the write address is an address frequently accessed by the central processing unit CPU;
if the write address is determined to be an address frequently accessed by the CPU, caching data to be written in the DMA memory;
and if the write address is determined not to be an address frequently accessed by the CPU, sending the data to be written to the memory for storage.
2. The system of claim 1, further comprising:
a CPU cache configured to cache data;
the cache directory is configured to record a state bit of a cache line indicating whether data is cached and an identifier ID indicating hardware to which the cache line belongs;
wherein the cache directory is specifically configured to:
if the DMA memory indexer determines that the write address is an address frequently accessed by a CPU, determining whether the write address has a cache line in the CPU cache;
if the write address is determined not to exist in the CPU cache, the cache directory updates the status bit of the cache line of the write address to be valid, and updates the ID of the identifier of the cache line of the write address to be the ID of the DMA device;
if the write address is determined to have a cache line in the CPU cache, an invalidation command is broadcast by the cache directory to one or more CPUs so that the one or more CPUs write back data of the cache of the write address in the CPU cache and the cache line of the write address is invalidated by the cache directory, and after the data to be written is cached in the DMA memory, the status bit of the cache line of the write address is updated to be valid and the ID of the identifier ID thereof is updated to be the ID of the DMA device.
3. The system of claim 2,
wherein the DMA memory is configured to:
if the cache directory responds to a CPU read command for reading a memory address, determines that the memory address is in the DMA memory or the CPU cache, sends the data of the memory address in the DMA memory or the CPU cache to the CPU,
and if the cache directory responds to a CPU read command for reading a memory address and determines that the memory address is not in the DMA memory or the CPU cache, sending the CPU read command to the memory.
4. The system according to claim 1 or 2,
wherein the DMA memory indexer is configured to:
and determining whether the write address is an address frequently accessed by the CPU or not through the attribute bit of the address carried in the write command.
5. The system of claim 2, wherein the DMA device and the one or more CPUs are assigned unique IDs that are different from each other so that the cache directory uniquely identifies the hardware to which the cache line belongs.
6. The system of claim 2, wherein the DMA memory indexer is configured to perform at least one of the following:
converting the DMA write command from a non-cacheable command to a cacheable command if data to be written is cached in the DMA memory;
according to a cache consistency protocol, performing the cache consistency protocol with a cache directory;
when the DMA memory is full and in response to a new command to write to the DMA memory, writing the data cached in the DMA memory to the memory to free the DMA memory.
7. A direct memory access, DMA, method comprising:
determining, by the DMA memory indexer, whether the write address is an address that is frequently accessed by the central processing unit CPU in response to a write command of the DMA device that writes data to the write address;
if the write address is determined to be an address frequently accessed by the CPU, caching, by the DMA memory indexer, data to be written in the DMA memory;
if it is determined that the write address is not an address that is frequently accessed by the CPU, the data to be written is sent by the DMA memory indexer to memory for storage.
8. The method of claim 7, further comprising:
if the DMA memory indexer determines that the write address is an address frequently accessed by a CPU, determining whether the write address has a cache line in a CPU cache by a cache directory, wherein the cache directory is configured to record a status bit of the cache line indicating whether data is cached and an identifier ID indicating hardware to which the cache line belongs;
if the cache directory determines that the write address does not have a cache line in the CPU cache, the cache directory updates the state bit of the cache line of the write address to be valid, and updates the ID of the identifier of the cache line to be the ID of the DMA device;
if the write address is determined to have a cache line in the CPU cache, an invalidation command is broadcast by the cache directory to one or more CPUs so that the one or more CPUs write back data of the cache of the write address in the CPU cache and the cache line of the write address is invalidated by the cache directory, and after the data to be written is cached in the DMA memory, the status bit of the cache line of the write address is updated to be valid and the ID of the identifier ID thereof is updated to be the ID of the DMA device.
9. The method of claim 8, further comprising:
if the cache directory responds to a CPU read command for reading a memory address, determining that the memory address is in the DMA memory or the CPU cache, sending data of the memory address in the DMA memory or the CPU cache to the CPU by the DMA memory,
and if the cache directory responds to a CPU read command for reading a memory address and determines that the memory address is not in the DMA memory or the CPU cache, the DMA memory sends the CPU read command to the memory.
10. The method according to claim 7 or 8, wherein whether the write address is an address frequently accessed by a CPU is determined by an attribute bit of the address carried in the write command.
11. The method of claim 8, wherein a DMA device and one or more CPUs are assigned unique IDs that are different from each other so that the cache directory uniquely identifies hardware to which the cache line belongs.
12. The method of claim 8, further comprising:
performing, by the DMA memory indexer, at least one of the following:
converting the DMA write command from a non-cacheable command to a cacheable command if data to be written is cached in the DMA memory;
according to a cache consistency protocol, performing the cache consistency protocol with a cache directory;
when the DMA memory is full and in response to a new command to write to the DMA memory, writing the data cached in the DMA memory to the memory to free the DMA memory.
13. A computer-readable medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, perform the method of any of claims 7-12.
CN202011118140.0A 2020-10-19 2020-10-19 Direct memory access system and method Active CN112256604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011118140.0A CN112256604B (en) 2020-10-19 2020-10-19 Direct memory access system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011118140.0A CN112256604B (en) 2020-10-19 2020-10-19 Direct memory access system and method

Publications (2)

Publication Number Publication Date
CN112256604A true CN112256604A (en) 2021-01-22
CN112256604B CN112256604B (en) 2022-07-08

Family

ID=74244829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011118140.0A Active CN112256604B (en) 2020-10-19 2020-10-19 Direct memory access system and method

Country Status (1)

Country Link
CN (1) CN112256604B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704150A (en) * 2021-08-13 2021-11-26 苏州浪潮智能科技有限公司 DMA data cache consistency method, device and system in user mode
CN114116531A (en) * 2022-01-28 2022-03-01 苏州浪潮智能科技有限公司 Cache consistency write-back method, device, equipment and medium
CN115061972A (en) * 2022-07-05 2022-09-16 摩尔线程智能科技(北京)有限责任公司 Processor, data read-write method, device and storage medium
CN115480708A (en) * 2022-10-11 2022-12-16 成都市芯璨科技有限公司 Method for time division multiplexing local memory access

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4835686A (en) * 1985-05-29 1989-05-30 Kabushiki Kaisha Toshiba Cache system adopting an LRU system, and magnetic disk controller incorporating it
US5555398A (en) * 1994-04-15 1996-09-10 Intel Corporation Write back cache coherency module for systems with a write through cache supporting bus
JP2004258935A (en) * 2003-02-26 2004-09-16 Matsushita Electric Ind Co Ltd Semiconductor device
US20120159080A1 (en) * 2010-12-15 2012-06-21 Advanced Micro Devices, Inc. Neighbor cache directory
WO2020199061A1 (en) * 2019-03-30 2020-10-08 华为技术有限公司 Processing method and apparatus, and related device
CN110059024A (en) * 2019-04-19 2019-07-26 中国科学院微电子研究所 A kind of memory headroom data cache method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DAVID C. WYLAND: "Cache tag RAM chips simplify cache memory design", Microprocessors and Microsystems *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704150A (en) * 2021-08-13 2021-11-26 苏州浪潮智能科技有限公司 DMA data cache consistency method, device and system in user mode
CN113704150B (en) * 2021-08-13 2023-08-04 苏州浪潮智能科技有限公司 DMA data cache consistency method, device and system in user mode
CN114116531A (en) * 2022-01-28 2022-03-01 苏州浪潮智能科技有限公司 Cache consistency write-back method, device, equipment and medium
CN114116531B (en) * 2022-01-28 2022-04-22 苏州浪潮智能科技有限公司 Cache consistency write-back method, device, equipment and medium
CN115061972A (en) * 2022-07-05 2022-09-16 摩尔线程智能科技(北京)有限责任公司 Processor, data read-write method, device and storage medium
CN115061972B (en) * 2022-07-05 2023-10-13 摩尔线程智能科技(北京)有限责任公司 Processor, data read-write method, device and storage medium
CN115480708A (en) * 2022-10-11 2022-12-16 成都市芯璨科技有限公司 Method for time division multiplexing local memory access
CN115480708B (en) * 2022-10-11 2023-02-28 成都市芯璨科技有限公司 Method for time division multiplexing local memory access

Also Published As

Publication number Publication date
CN112256604B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN112256604B (en) Direct memory access system and method
KR101814577B1 (en) Method and apparatus for processing instructions using processing-in-memory
US9792210B2 (en) Region probe filter for distributed memory system
JP4928812B2 (en) Data processing system, cache system, and method for sending requests on an interconnect fabric without reference to a lower level cache based on tagged cache state
US9170946B2 (en) Directory cache supporting non-atomic input/output operations
US8812786B2 (en) Dual-granularity state tracking for directory-based cache coherence
US9213649B2 (en) Distributed page-table lookups in a shared-memory system
US10402327B2 (en) Network-aware cache coherence protocol enhancement
US20150058570A1 (en) Method of constructing share-f state in local domain of multi-level cache coherency domain system
US20160224467A1 (en) Hierarchical cache structure and handling thereof
US10019377B2 (en) Managing cache coherence using information in a page table
US8185692B2 (en) Unified cache structure that facilitates accessing translation table entries
JP2003186744A (en) Method of improving computer performance by adjusting time used for preemptive expelling of cache entry
US9645931B2 (en) Filtering snoop traffic in a multiprocessor computing system
JP2019519028A (en) Shadow tag memory to monitor the status of cache lines at different cache levels
US20120159080A1 (en) Neighbor cache directory
CN109684237B (en) Data access method and device based on multi-core processor
US20110314228A1 (en) Maintaining Cache Coherence In A Multi-Node, Symmetric Multiprocessing Computer
TWI768039B (en) Multiprocessor system, data management method and non-transitory computer readable medium
TWI428754B (en) System and method for implementing an enhanced hover state with active prefetches
WO2023108938A1 (en) Method and apparatus for solving address ambiguity problem of cache
JP5536377B2 (en) Coherence tracking method and data processing system (coherence tracking enhanced with implementation of region victim hash for region coherence arrays)
EP1611513B1 (en) Multi-node system in which global address generated by processing subsystem includes global to local translation information
US11526449B2 (en) Limited propagation of unnecessary memory updates
US9842050B2 (en) Add-on memory coherence directory

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant