DIRECT MEMORY ACCESS ENGINE FOR DATA CACHE CONTROL
BACKGROUND OF THE INVENTION
1. Field of the Invention.
The application relates generally to memory management systems, and specifically to
a DMA engine for data cache control.
2. Description of the Background Art.
Multimedia encoders, such as those used for MPEG and MPEG2 encoding, provide
the necessary compression to allow video and audio data to be transferred, stored, and played
in a computer environment. Integrated MPEG encoders use an embedded processor to
perform the encoding operations required to compress the video and audio data. Figure 1
illustrates a block diagram of a conventional embedded processor 100 for use in an integrated
MPEG encoder. The instruction cache 128 feeds an instruction stream to the instruction
decode unit 124 which decodes instructions within the stream for the execution unit 104. The
decoded instructions are executed by the execution unit 104. The data cache controller 112
supervises the operation of the data cache 116 using conventional cache management
techniques.
The data cache is typically divided into a number of sets, where each set contains a
number of cache lines for storing data. Each cache line has a tag that holds a number of
address bits and several control bits (e.g., valid bit, lock indicator, dirty bit). A cache line is
filled at the request of the execution unit 104 when a data location is needed which is not
currently represented in the cache. This is commonly known as a cache miss. When a cache
miss occurs, the data cache controller 112 initiates one or more external memory accesses and
brings the requested cache line into the data cache 116 and updates the tags accordingly. The
issue of which existing cache line to replace is treated using conventional algorithms such as
the "Least-Recently Used" methodology in which the least recently used data line is replaced
with the new line.
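The conventional miss-handling and least-recently-used replacement sequence described above can be sketched, purely for illustration, in the following Python fragment. The class and method names (DataCache, access, fetch_from_memory) are hypothetical and do not appear in the prior art being described; this is a minimal model, not an implementation.

```python
# Illustrative model of conventional miss handling with least-recently-used
# (LRU) line replacement. All names are hypothetical.
from collections import OrderedDict

class DataCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        # Maps line address -> data; insertion order tracks recency of use.
        self.lines = OrderedDict()

    def access(self, address, fetch_from_memory):
        if address in self.lines:             # cache hit
            self.lines.move_to_end(address)   # mark line as most recently used
            return self.lines[address]
        # Cache miss: evict the least recently used line if the cache is full.
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)    # discard the oldest entry
        self.lines[address] = fetch_from_memory(address)
        return self.lines[address]
```

For example, with a two-line cache, accessing a third address evicts whichever of the first two lines was used least recently.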
In an integrated MPEG encoder, the execution unit 104 must operate on blocks and
macroblocks which are typical for processing audio and video data. Conventional cache
management techniques are not optimal for systems in which blocks of data are required to be
transferred to and from the cache 116 and external memory 108. For example, if there is a
minor variation in a block of data, conventional schemes replace the entire block which is
time consuming and resource inefficient. In these systems, it is desirable to be able to load
an entire data set with a standard block of data in advance of when the execution unit 104
requires the data, such that the data will be available to the execution unit 104 when the
processing of the data block begins. Additionally, it is desirable in such systems to pre-load
the data cache 116 while minimizing the involvement of the execution unit 104, allowing the
execution unit 104 to devote its resources to other computationally intensive tasks.
Thus, a system is needed in which blocks of data can be pre-loaded prior to the
processing of the blocks of data, and which minimizes the involvement of an execution unit
to improve the overall processing power of the system.
SUMMARY OF THE INVENTION
In accordance with the present invention, a DMA engine is coupled to a data cache
and an execution unit. The DMA engine operates independently of the execution unit and
can load blocks of data from external memory to the data cache and from the data cache to
external memory without assistance from the execution unit. The execution unit programs
the DMA engine with block transfer information and the DMA engine performs the rest of
the operations independently. In contrast to conventional implementations of DMA engines
which require a dedicated memory buffer to operate, in accordance with the present invention
the memory buffer has been merged with the data cache memory. The block transfer
information allows the system to transfer blocks of data to and from the cache which is
advantageous in multimedia encoding systems in which the data is grouped into blocks and
macroblocks for processing by the execution unit. In a further embodiment, the data cache is
organized into sets which can be used for storage of multimedia blocks of data under the
control of the DMA engine or for traditional data storage under the cache controller using
conventional line replacement policies. The cache controller and DMA engine are
implemented separately; however, both share the same buffers for storage. The execution
unit preferably determines dynamically whether the cache controller or the DMA engine will
control each transfer of data, responsive to instructions in the program code directing the
execution unit to use one or the other to perform the data transfer.
Thus, the data transfers can be optimized for the computational requirements of the system,
providing greater flexibility and improving the overall processing power of the system.
In a preferred embodiment, the audio and video blocks of data are loaded into the data
cache by the DMA engine immediately prior to when they are needed by the execution unit.
This optimizes the processing of the system because there is no delay in waiting for blocks to
be transferred while the execution unit is idle. When a block has been transferred into the
cache, the DMA engine sets a status flag to indicate to the execution unit that the block
transfer is complete and the requested block is available for processing. The blocks can be of
any size, and are preferably chosen to contain only data which has changed. By limiting the
transfer of data which has not changed, the amount of overall data transfers is reduced,
improving the system's processing capabilities.
Finally, the present invention is equally applicable to instruction cache management.
In this embodiment, instruction data need only be retrieved from main memory; using the
DMA engine of the present invention, only the required portion of the code is retrieved,
which maximizes the use of the system resources.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of a prior art embedded processor.
Figure 2 is a block diagram of a preferred embodiment of an embedded processor in
accordance with the present invention.
Figure 3 is a block diagram of a data cache.
Figure 4 is a block diagram of a data cache line tag in accordance with the present
invention.
Figure 5 is a block diagram of an embodiment of a DMA engine in accordance with
the present invention.
Figure 6 is a flowchart illustrating a preferred method of transferring a block of data
from external memory to a data cache.
Figure 7 is a flowchart illustrating a preferred method of transferring a block of data to
external memory from a data cache.
Figure 8 is a flowchart illustrating a preferred method of allocating sets in the data
cache.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Figure 2 is a block diagram illustrating a preferred embodiment of an embedded
processor 200 of an integrated multimedia encoder. The embedded processor 200 comprises
an instruction cache controller 232, instruction cache 228, instruction decoder 224, external
memory 208, and data cache controller 212. In accordance with the present invention, a
direct memory access (DMA) engine 250 is coupled to the execution unit 204, external
memory 208, data cache 216, and instruction cache 228. The DMA engine 250 is designed to
perform the transfer of blocks of data from the data cache 216 to the external memory 208,
and from external memory 208 to the data cache 216 across lines 213, 211. Additionally, the
DMA engine 250, or a separate DMA engine, transfers instruction data to and from the
instruction cache 228 and external memory 208 across lines 201, 211. The execution unit
204 programs block transfer information into the DMA engine 250 across line 205, and the
DMA engine 250 then performs the block transfer without further assistance from the
execution unit 204. The execution unit transmits standard data requests to the cache
controller 212 across line 209.
In conventional data cache memory, information is stored and transferred in cache
lines. However, in multimedia encoder applications, blocks and macroblocks which are
larger than individual cache lines are the groupings of data which are processed by the
execution unit 204. If a multimedia encoder is using a conventional cache management
system, multiple cache line transfer operations must be performed in order to complete a
usable data transfer, which lengthens the processing time of the system. Also, when a block
of data which needs to be replaced is smaller than a cache line, the entire cache line is
replaced by the conventional cache-management system, which over-utilizes system
resources. By using a DMA engine, however, only the data which needs to be replaced is
retrieved from main memory 208 and placed in a cache line. Another major benefit of using a DMA
engine 250 to provide the transfer of data and instructions is that it allows the execution unit
204 to be free to concentrate its resources on computation. This division of labor also greatly
improves the overall processing power of the system. Further, in contrast to existing
implementations of DMA engines, the DMA engine 250 of the present invention does not
require a separate dedicated memory buffer. Rather, the data cache 216 itself is used as the
buffer for the DMA engine 250 which is shared with the cache controller 212. No additional
hardware is required to store data required by the DMA engine 250 to perform its data
transfer operations.
Figure 3 illustrates a preferred embodiment of the data cache 216 in accordance with
the present invention. The data cache 216 is optimized to support both traditional data
storage functions as well as the block data storage required by the processing of multimedia
data. The data cache 216 is organized into sets 304. Each set 304 contains a number of cache
data lines 300. The sets 304 are organized in the data cache 216 responsive to the type of
applications required. For example, for digital video applications, a typical set is 256x32 bits,
while for digital audio applications a typical set is 64x32 bits. The illustrative data cache 216
of Figure 3 is shown for digital audio applications, with each block representing a byte.
Two sets 304 are shown, one of which is illustrated as having eight cache lines 300. Thus, the
cache is 32 bits by 128 bits.
In a preferred embodiment, the data cache 216 has a busy tag 308 and a direction tag
312. The busy tag 308 indicates to the execution unit 204 whether the data cache 216 is
being used for a DMA data transfer. Only one DMA data transfer is permitted to occur at a
time. The direction tag 312 indicates to the DMA engine 250 whether an operation is a read
or a write. An address tag 320 is also provided for the data cache which indicates the starting
address for a data transfer by the DMA engine 250. A lock indicator 324 is also provided for
each data set 304. The lock indicator 324 indicates to the execution unit 204 whether access
is permitted to an individual data set 304. If the program code instructs that a particular set
304 contains data which should not be modified, the lock indicator 324 for that set 304 is
written, and no data transfers will then occur involving the locked set 304. If there is a
particular set 304 from which data should be read, all of the lock indicators 324 for the other
sets 304 are written, and the unlocked set 304 will therefore be the set from which data is read
by the DMA engine 250. A separate portion of the data cache 216 is used as a buffer 316 for
the use of the DMA engine 250. This buffer serves as a memory for the DMA engine 250,
and stores information the DMA engine 250 requires to perform data transfers. The data
cache controller 212 also uses the data cache 216 as a buffer for its operations.
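The tag layout described above can be modeled, for illustration only, as the following Python sketch. The field names mirror the tags of Figure 3 (busy tag 308, direction tag 312, address tag 320, lock indicator 324, buffer 316), but the classes themselves are hypothetical and are not part of the invention.

```python
# Illustrative model of the data cache tag layout of Figure 3.
# All class names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CacheSet:
    lines: list            # the cache data lines 300 belonging to this set 304
    locked: bool = False   # lock indicator 324: True bars DMA transfers

@dataclass
class DataCacheTags:
    busy: bool = False       # busy tag 308: a DMA transfer is in progress
    direction: str = "read"  # direction tag 312: "read" or "write"
    address: int = 0         # address tag 320: starting address of the transfer
    buffer: dict = field(default_factory=dict)  # buffer 316 used by the DMA engine
```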
As shown in Figure 4, each cache line 300 has a data part 412 and a control part 414
which is comprised of an address section 408 and control bits 400. Control bits 400 typically
include a valid bit 400(1), a lock indicator 400(2) for the cache line, and a dirty bit 400(3).
The address section 408 is used to determine cache hits or cache misses, as in normal cache
management systems. However, in contrast to the DMA engine 250 of the present invention,
the cache controller 212 does not perform data transfers in response to program instructions.
In operation, upon requiring a DMA transfer as indicated by instructions contained
within executing code, the execution unit 204 checks the busy tag 308 of the data cache 216
to determine whether a DMA transfer is currently occurring. If the busy tag 308 is set,
indicating that a DMA transfer is occurring, no action is taken. In an embodiment using a
queue for the DMA requests, a request for a DMA transfer is stored in the queue if the busy
tag 308 is set. Once the busy tag 308 is cleared, the data transfer will begin. If the busy tag
308 indicates that the data cache 216 is available for a DMA transfer, the execution unit 204
writes the starting address into the address tag 320 of the data cache 216, and writes to the
direction tag 312 to indicate whether the transfer is to be a read or a write. If the data transfer
is a write, a set 304 is chosen from the unlocked sets 304 using least-recently-used principles
as the set to which data is to be written.
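The busy-tag protocol just described, including the optional request queue, can be sketched as follows. This is a hedged illustration under the assumption of a single shared busy tag; the function name request_dma_transfer and the tuple layout of queued requests are hypothetical.

```python
# Illustrative sketch of the busy-tag check before a DMA transfer.
# A request is queued when a transfer is already in progress.
from collections import deque
from types import SimpleNamespace

def request_dma_transfer(tags, queue, start_address, direction):
    """Start a DMA transfer, or queue the request if one is already occurring."""
    if tags.busy:
        # Busy tag 308 is set: hold the request until the current transfer ends.
        queue.append((start_address, direction))
        return "queued"
    tags.address = start_address   # written to the address tag 320
    tags.direction = direction     # written to the direction tag 312
    tags.busy = True               # busy tag 308 now indicates a transfer in progress
    return "started"
```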
Figure 5 is a block diagram illustrating a preferred embodiment of DMA engine 250.
The execution unit 204 determines whether the cache controller 212 or the DMA engine 250
will transfer the data. In a preferred embodiment, the determination of which to use is made
by the programmer in creating the code. Typically the programmer will specify the use of the
DMA engine 250 when the data does not change very often. When data is constantly being
replaced, the cache controller 212 is more appropriately specified to perform the data transfer
operations. Responsive to the instructions, the execution unit 204 selects either the cache
controller 212 or the DMA engine 250 to perform the data transfer. If a DMA operation is
specified, the execution unit 204 then checks the busy tag 308 of the data cache 216 across
line 503 to determine whether the cache 216 is available for a DMA data transfer operation.
If the busy tag 308 is clear, the execution unit 204 transmits block transfer information to the
DMA engine 250 to allow the DMA engine 250 to transfer data responsive to a cache miss.
Block information preferably includes address information, byte count information, and a
control indication as to whether the operation is a read or a write. The address information is
transmitted over address line 507 to the address tag 320, the byte count information is
transmitted over data line 501 to the data cache buffer 316, and the control signal is
transmitted over control line 503 to the direction tag 312. The use of the byte count
information and starting address to identify a block of data allows blocks of data of precise
size to be specified. This allows the transferred blocks of data to include only the
information which has changed, rather than requiring the transfer of an entire
set of data, most of which has not changed since the last transfer. Other block transfer
information, transmitted over separate lines or over a single line, can also be used in
accordance with the present invention.
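The block transfer information described above (starting address, byte count, and read/write indication) can be collected, for illustration, into the following sketch. The class name BlockTransferInfo and the end_address helper are hypothetical and not part of the invention.

```python
# Illustrative model of the block transfer information programmed into the
# DMA engine 250 by the execution unit 204. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class BlockTransferInfo:
    start_address: int   # written to the address tag 320
    byte_count: int      # stored in the data cache buffer 316
    is_read: bool        # written to the direction tag 312

    def end_address(self):
        # A block of precise size spans [start_address, start_address + byte_count).
        return self.start_address + self.byte_count
```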
Once the execution unit 204 verifies that the busy tag 308 is clear, the execution unit
204 initiates the DMA engine 250 across line 205 to begin the transfer. The data
manipulation module 500 retrieves the block transfer information over lines 509, 511, and
515 and performs the required tasks. For example, if the direction tag 312 indicates a read
operation is to be performed, the data manipulation module 500 accesses the external memory
208 through data line 211 at the address specified by the execution unit 204. The data
manipulation module 500 begins reading bits of data from external memory 208 which are
counted by the counter 524. When the counter value equals the byte count value received
from the execution unit 204, the data manipulation module 500 stops reading. The current
counter value and the byte count value are also stored in the data cache buffer 316. The data
manipulation module 500 then determines which sets 304 are unlocked, and selects an
unlocked set 304 to which to write the data. Selection of the set 304 is based upon least-
recently used principles; however, the selection is only limited to the unlocked sets 304.
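The byte-counted read just described can be sketched as follows. This is an illustrative fragment only: read_block and external_memory are hypothetical names, and the counter variable stands in for counter 524.

```python
# Illustrative sketch of a byte-counted read from external memory: bytes are
# read and counted until the counter equals the byte count received from the
# execution unit. All names are hypothetical.
def read_block(external_memory, start_address, byte_count):
    """Read bytes starting at start_address, stopping at byte_count bytes."""
    block = []
    counter = 0                      # models counter 524
    while counter < byte_count:
        block.append(external_memory[start_address + counter])
        counter += 1                 # stop when the counter equals the byte count
    return block
```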
Once a set 304 is selected, the data manipulation module 500 writes the busy tag 308
to indicate to the execution unit 204 that the DMA engine 250 is working on transferring a
data block and that the execution unit 204 cannot initiate a DMA transfer. The data
manipulation module 500 then writes the data across line 519 through counter 520, across
line 213 into the set 304. After the transfer is complete, the DMA engine 250 disables the
busy tag 308 to indicate to the execution unit 204 that it may initiate a new DMA transfer.
Alternatively, the DMA engine 250 examines a DMA queue in the data cache 216 to see if
there are any pending DMA data transfers to be executed. After storing new data into a data
set 304, the execution unit 204 may lock the set 304 to preserve the newly transferred data
from subsequent DMA operations.
For a transfer from the data cache 216 to external memory 208, the data manipulation
module 500 retrieves the data across line 213 from the designated set 304 through counter
520. The data is counted through counter 520 until the size of the retrieved data matches
the specified byte count. The data is then
transferred to external memory 208. The data manipulation module 500 may be implemented
as specialized hardware, as a microprocessor, or other conventional means.
Figure 6 is a flowchart illustrating a preferred method of writing to data cache. First,
the DMA engine 250 determines 600 whether the request is to read a block of data from
external memory 208. If it is not, the process proceeds to step 700, discussed below. If it is,
the DMA engine 250 receives 604 the block transfer information from the execution unit 204.
Then, the DMA engine 250 enables 608 the busy tag 308 of the data cache 216 to indicate to
the execution unit 204 that the block is being transferred. Next, the DMA engine 250
accesses 612 external memory 208, and locates the specified address. A block of data of the
requested size is retrieved 616 and stored 620 into the data cache 216. As the transfer is
completed for the lines 300, the busy tag 308 is disabled 624.
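The Figure 6 flow can be summarized, purely for illustration, in the following sketch: the busy tag is enabled (step 608), the block is retrieved from external memory (steps 612, 616) and stored into the cache (step 620), and the busy tag is disabled (step 624). The function name dma_write_to_cache is hypothetical.

```python
# Illustrative sketch of the Figure 6 flow for transferring a block from
# external memory into the data cache. All names are hypothetical.
from types import SimpleNamespace

def dma_write_to_cache(tags, cache_set, external_memory, start_address, byte_count):
    """Copy a block of the requested size from external memory into a cache set."""
    tags.busy = True                  # step 608: busy tag 308 enabled
    block = [external_memory[start_address + i]
             for i in range(byte_count)]          # steps 612, 616: retrieve block
    cache_set[:] = block              # step 620: store into the data cache
    tags.busy = False                 # step 624: busy tag 308 disabled
    return block
```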
Figure 7 is a flow chart illustrating a preferred method of transferring data from the
data cache 216 to external memory 208. First, the DMA engine 250 receives 700 block
transfer information such as the starting address and the byte count. Then, the busy tag 308 is
enabled 704 to indicate that the data cache is in use by the DMA engine 250. The data is
read from the data cache line 300 and transferred to available memory lines in external
memory 208. After the data is read from each line 300, the busy tag 308 for that line is
disabled, allowing the execution unit 204 to access that line 300. Thus, the DMA engine 250
in accordance with the present invention provides for the independent transfer of data from
memory 208 to the data cache 216 and back, allowing the execution unit 204 to devote its
resources to more computationally intensive tasks. A busy tag 308 is provided to streamline
communication between the DMA engine 250 and the execution unit 204 and ensure that
only the correct data is read from a memory location.
Figure 8 illustrates a preferred method of allocating sets 304 in the data cache 216. First,
the execution unit 204 determines 800 if a data transfer is to be made by the DMA engine
250. Again, this information is preferably provided in the code to be executed by the
execution unit 204, allowing the dynamic designation of each data transfer to be performed
by either the DMA engine 250 or the cache controller 212. If the data transfer is to be made
by the DMA engine 250, the execution unit 204 selects 804 a cache set 304 to which to
transfer data. Next, the execution unit determines 806 whether the selected set 304 is
unlocked. If it is, the set 304 is added 808 to a list of unlocked sets. If it is locked, the set
304 is not added 812 to the list. The execution unit 204 determines 816 whether there are
more sets. If there are, the execution unit 204 selects 820 a next set. The process is repeated
until a list of all unlocked sets 304 is obtained. Once the list of unlocked sets 304 is
obtained, the execution unit 204 orders 824 the sets responsive to their latest time of access
by the DMA engine 250. The set 304 which has not been accessed for the longest period is
selected 828 as the set 304 to which to transfer the data.
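The Figure 8 allocation method can be sketched as follows: the unlocked sets are collected (steps 806-812), ordered by their latest DMA access time (step 824), and the least recently accessed set is selected (step 828). The function name select_dma_set and the tuple layout are hypothetical, chosen only to illustrate the method.

```python
# Illustrative sketch of the Figure 8 set-allocation method. Each set is
# represented as a (set_id, locked, last_access_time) tuple; hypothetical names.
def select_dma_set(sets):
    """Return the id of the least recently accessed unlocked set, or None."""
    unlocked = [s for s in sets if not s[1]]        # steps 806-812: skip locked sets
    if not unlocked:
        return None
    ordered = sorted(unlocked, key=lambda s: s[2])  # step 824: order by access time
    return ordered[0][0]                            # step 828: least recently used
```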
If the data transfer is to be made by the cache controller 212, the execution unit 204
transfers the data to a cache line 300 using the cache controller 212, as described above. Both
the cache controller and the DMA engine 250 use the same instruction set. In normal
operation, the cache controller compares address tags of cache lines 300 to requests for data
to determine cache hit and cache misses. Upon finding a cache miss, the missing data is
retrieved from memory and stored in a cache line as described above. If the program code
instructs the execution unit 204 to perform a DMA data transfer, the DMA engine 250
performs the operation as described above.
In a further embodiment, referring again to Figure 2, the instruction cache is also
managed by the DMA engine 250. In this embodiment, the DMA engine 250 fetches
instructions from the external memory 208 across line 211 responsive to receiving an address
and a byte count from the execution unit 204. The DMA engine 250 transmits the instruction
data across line 201 into the instruction cache 228 for access by the execution unit 204. The
use of the DMA engine 250 to perform instruction retrieval allows the execution unit 204 to
devote its resources to performing computation tasks, thus maximizing the performance of
the system.
While the present invention has been described with reference to certain preferred
embodiments, those skilled in the art will recognize that various modifications may be
provided. These and other variations upon and modifications to the preferred embodiments
are provided for by the present invention.