CN114780466B - DMA-based optimization method for data copy delay - Google Patents


Info

Publication number
CN114780466B
CN114780466B (application CN202210720713.XA)
Authority
CN
China
Prior art keywords
virtual
address
page
addresses
pages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210720713.XA
Other languages
Chinese (zh)
Other versions
CN114780466A (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Muxi Technology Beijing Co ltd
Original Assignee
Muxi Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Muxi Technology Beijing Co ltd filed Critical Muxi Technology Beijing Co ltd
Priority to CN202210720713.XA priority Critical patent/CN114780466B/en
Publication of CN114780466A publication Critical patent/CN114780466A/en
Application granted granted Critical
Publication of CN114780466B publication Critical patent/CN114780466B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal

Abstract

The invention relates to the field of data processing, and in particular to a DMA-based method for optimizing data copy delay, comprising the following steps: parse and record the source start address and destination start address contained in the copy command, and obtain the length of the data to be copied; compute the virtual addresses of the read requests from the source start address and the data length, and the virtual addresses of the write requests from the destination start address and the data length; divide the virtual address space into a number of virtual pages, and update the naturally ordered virtual address sequences corresponding to the read and write requests into out-of-order sequences; send the out-of-order sequences to the address translation unit to obtain the corresponding physical addresses, and read and write data according to those physical addresses to complete the copy. Because the head addresses of all virtual pages are translated ahead of time in the out-of-order sequence, the address translation cycle is shortened and data copy efficiency is improved.

Description

DMA-based optimization method for data copy delay
Technical Field
The invention relates to the field of data processing, and in particular to a DMA-based method for optimizing data copy delay.
Background
In DMA data COPY transfers, a critical processing step is the translation of the read and write virtual addresses belonging to the COPY operation. The system uses virtual addresses rather than physical addresses: a physical address is a real physical memory address, while a virtual address is a logical address that must be translated into a physical address before any access can be performed. Executing a COPY via DMA therefore involves two virtual-to-physical translations: the read virtual address to the read physical address, and the write virtual address to the write physical address. Translating a virtual address into a physical address requires the cooperation of the MMU (Memory Management Unit) and the page table.
The page table structures include the page table proper and the page directory: a Page Table Entry (PTE) records the real physical address for a virtual address, while a Page Directory Entry (PDE) records a directory of pages. Depending on the bit width of the operating system, the page directory and page table may be split into several levels, and translating a virtual address into a physical address means using the virtual address as an index to walk the multi-level page table level by level until the corresponding physical address is found. The smallest unit of a GPU virtual address page is one physical page; the physical page size can be set according to the operating-system bit width and the physical memory. For example, on a 32-bit operating system with a 4 KB page size, the lower 12 bits (LSB 12) of the virtual address give the offset within the page, and the upper 20 bits (MSB 20) give the page index number, also called the physical address page entry (Entry). A physical page is a fixed-size contiguous block of physical memory, carved out by the system at the configured size, so once the page entry has been located, the translation of every virtual address mapped into that physical page can be completed.
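The entry-plus-offset split described above can be illustrated with a minimal sketch (an editorial example, not part of the patent), assuming the 32-bit, 4 KB-page configuration given in the text:

```python
# Split a 32-bit virtual address into a page entry (upper 20 bits) and an
# in-page offset (lower 12 bits), assuming 4 KB pages as in the example.
PAGE_SHIFT = 12                      # 4 KB page -> 12 offset bits
PAGE_MASK = (1 << PAGE_SHIFT) - 1    # 0xFFF

def split_virtual_address(va):
    """Return (page_entry, offset) for a 32-bit virtual address."""
    return va >> PAGE_SHIFT, va & PAGE_MASK

entry, offset = split_virtual_address(0x12345678)
# 0x12345678 -> entry 0x12345, offset 0x678
```

Once `entry` has been resolved to a physical page, every address sharing that entry needs only the offset appended, which is why one walk per page suffices.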
The functional unit controlling the page-table walk is the MMU (Memory Management Unit), which consists of a TLB (Translation Lookaside Buffer) and a table-walk unit. Accessing page tables in memory is relatively slow, especially with multi-level page tables, which require several memory accesses; to speed up translation, a hardware cache is provided for the page table: the TLB. Lookups go to the TLB first because TLB lookups are fast: it holds a small number of physical-address page entries and is integrated into the CPU, so it runs at nearly CPU speed. A virtual address translation first queries the TLB; on a hit, the translation completes immediately. Otherwise the multi-level page table must be queried through multiple memory accesses to obtain the page entry, and the translation delay is far greater than that of a TLB hit. The delay caused by the virtual address translation of a DMA COPY is therefore a key factor in COPY performance: the TLB can hold only a limited number of page entries, and translating the read and write virtual addresses of a COPY requires frequent access to the page table, so translation delay has a critical impact on COPY efficiency and performance.
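The TLB-then-walk lookup can be sketched as follows. This is a hedged toy model, not the patent's hardware: the latency constants and the `SimpleTLB` name are illustrative assumptions, and the page table is modeled as a flat dictionary rather than a multi-level structure.

```python
# Toy model of TLB lookup followed by a page-table walk on a miss.
TLB_HIT_CYCLES = 2        # a few cycles for a TLB hit (assumed value)
PAGE_WALK_CYCLES = 300    # hundreds of cycles for a multi-level walk (assumed)

class SimpleTLB:
    def __init__(self):
        self.entries = {}  # page index -> physical page number

    def translate(self, va, page_table, page_shift=12):
        """Return (physical_address, cycles_spent) for one virtual address."""
        index = va >> page_shift
        offset = va & ((1 << page_shift) - 1)
        if index in self.entries:                     # TLB hit: fast path
            return (self.entries[index] << page_shift) | offset, TLB_HIT_CYCLES
        ppn = page_table[index]                       # miss: walk the page table
        self.entries[index] = ppn                     # fill the TLB entry
        return (ppn << page_shift) | offset, PAGE_WALK_CYCLES
```

The first address of a page pays `PAGE_WALK_CYCLES`; every later address in the same page pays only `TLB_HIT_CYCLES`, which is the asymmetry the invention exploits.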
The general execution flow of a DMA-based copy is as follows. After the DMA parses a COPY command, it first extracts the source start address, the destination start address, and the length of the data to copy, and computes the read and write virtual address sequences according to the AXI burst settings; both sequences are in natural order. The naturally ordered read requests are issued first and address translation is performed; with the translated physical addresses, data is read from the source memory over AXI and the SoC bus (e.g. PCIe). When the data returns, the virtual address of the corresponding write request (write start address + offset) is computed from the natural-order offset of the read request identified by its association ID; the write request and the read-back data are then sent together to the address translation module, and after translation the data is written to the destination memory over AXI and the SoC bus. The DMA waits for the write response, which completes the COPY. The drawbacks of this general flow are as follows:
Because the read and write virtual addresses are all sent to the address translation module in natural order, all addresses belonging to one physical page start translation together, in sequence. Since all virtual addresses within one physical page belong to the same page and use only the page index number (Entry) plus the offset, once one address in the page has been fully translated its page index number is stored in the TLB, and every other virtual address in that page can complete translation merely by matching the page index number in the TLB, taking only a few clock cycles. Translating the first address of each physical page, however, requires multiple accesses to the multi-level page tables in memory; a single memory access can take hundreds of clock cycles, so this translation delay is far greater than the delay of a TLB access. Therefore, when translation requests are sent in natural order, blank beats appear between pages due to the multi-level page-table accesses, no pipeline can form, and copy performance suffers.
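The cost of these per-page blank beats can be made concrete with a toy latency model (an editorial sketch with assumed cycle counts, not data from the patent). In natural order every page's first address exposes a full walk; in the ideal overlapped case only the very first walk is exposed:

```python
# Toy latency model: natural order stalls once per page, while perfect
# overlap hides all walks except the first. Cycle counts are assumptions.
WALK, HIT = 300, 2   # cycles: page-table walk vs. TLB hit

def natural_order_cycles(num_pages, addrs_per_page):
    # the first address of each page misses; the rest hit in the TLB
    return num_pages * (WALK + (addrs_per_page - 1) * HIT)

def overlapped_cycles(num_pages, addrs_per_page):
    # ideal pipelining: only the first walk is exposed; all later walks
    # overlap with the TLB-hit translations of earlier pages
    return WALK + num_pages * (addrs_per_page - 1) * HIT
```

With 8 pages of 64 requests each, `natural_order_cycles(8, 64)` gives 3408 cycles versus 1308 for the overlapped case, which is the gap the out-of-order scheme below targets.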
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method for optimizing data copy delay based on DMA, and the adopted technical solution is as follows:
a DMA-based optimization method for data copy delay comprises the following steps:
Parse and record the source start address and destination start address contained in the copy command, and obtain the length of the data to be copied. Compute the virtual addresses of the read requests from the source start address and the data length, and the virtual addresses of the write requests from the destination start address and the data length; the virtual addresses corresponding to the read and write requests are in natural order. Divide the virtual address space into a number of virtual pages, and update the naturally ordered virtual address sequences corresponding to the read and write requests into out-of-order sequences. Each out-of-order virtual address sequence consists of a sub-sequence of head addresses read in advance followed by a sub-sequence of the virtual addresses of the updated virtual pages, where a head address is any one virtual address in a virtual page. The updated virtual pages comprise missing virtual pages and combined virtual pages: a missing virtual page contains only in-page virtual addresses, while a combined virtual page contains in-page virtual addresses plus one remaining head address. The in-page virtual addresses are the virtual addresses of a page other than its head address; the remaining head addresses are the head addresses of the remaining virtual pages arranged behind, i.e. the virtual pages whose head addresses were not read in advance. Finally, send the out-of-order virtual address sequences to the address translation unit to obtain the corresponding physical addresses, and read and write the data according to those physical addresses to complete the copy.
The invention has the following beneficial effects:
the embodiment of the invention provides a DMA-based optimization method for data copy delay, which arranges virtual addresses in a natural sequence out of order before translating the virtual addresses into physical addresses, so that an address translation unit translates first addresses in a plurality of virtual pages in advance, and simultaneously adds the first addresses in the remaining virtual pages into virtual pages reading the first addresses in advance, so that the address translation unit translates the first addresses of all the virtual pages in advance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a method for optimizing data copy latency based on DMA according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a virtual address in a virtual page according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means adopted by the present invention to achieve its objects and their effects, the DMA-based method for optimizing data copy delay, its specific implementation, structure, features, and effects are described in detail below in conjunction with the accompanying drawings and preferred embodiments. In the following description, occurrences of "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The method disclosed in the embodiment of the invention is applicable to all processors that use DMA, such as CPU and GPU chips. DMA (Direct Memory Access) transfers copy data from one address space to another, providing high-speed data transfer between a peripheral and memory or between memory and memory. The advantage of DMA is that large volumes of data can be moved directly without occupying processor resources: no direct control by the processor is needed, no saving and restoring of context for interrupt handling is required, and the hardware opens a dedicated channel for the data transfer that operates in parallel with the processor. Because the storage structures of CPUs and GPUs are essentially the same, the embodiments below take a GPU chip as an example; a GPU (Graphics Processing Unit), whether used for image processing or parallel computing, operates on large volumes of transferred data.
The following describes a specific scheme of the DMA-based data copy delay optimization method and system in detail with reference to the accompanying drawings.
Referring to fig. 1, a flow chart of a method for optimizing DMA-based data copy latency according to an embodiment of the present invention is shown, where the method includes the following steps:
step S001, the source start address and the destination start address included in the copy command are analyzed and recorded.
A COPY command issued by the GPU processor is read into the DMA module through DMA active fetching, and the COPY operation begins after command parsing and scheduling.
It should be noted that the data storage unit of the GPU system is the byte (Byte), and the smallest unit of a data COPY is also the byte: each physical address corresponds to one byte of data, and likewise each virtual address corresponds to one byte of data.
For the GPU chip, COPY operation may implement data movement between the following memory locations: from system memory to local memory, from system memory to system memory, from local memory to system memory, or from local memory to local memory.
Specifically, the source start address and the destination start address identify the read address and the write address of the first byte to be copied; all COPY address sequences are computed from the source start address and the destination start address, and the length of the data to be copied is recorded. The source start address corresponds to the start address of the read requests, and the destination start address corresponds to the start address of the write requests.
Step S002: compute the virtual addresses of the read requests from the source start address and the virtual addresses of the write requests from the destination start address; the virtual addresses corresponding to the read and write requests are in natural order. Divide the virtual address space into virtual pages, and update the naturally ordered virtual addresses of the read and write requests into out-of-order sequences. Each out-of-order sequence consists of a sub-sequence of head addresses read in advance and a sub-sequence of the virtual addresses of the updated virtual pages, where a head address is any one virtual address of a virtual page. The updated virtual pages comprise missing virtual pages, which contain only in-page virtual addresses, and combined virtual pages, which contain in-page virtual addresses plus one remaining head address; the in-page virtual addresses are a page's addresses other than its head address, and the remaining head addresses are those of the virtual pages arranged behind, whose head addresses were not read in advance.
The virtual addresses and the number of read/write requests in the copy command are computed according to the AXI bus data width, and the number of virtual pages can be computed from the number of virtual addresses.
The virtual addresses of the read requests are computed from the source start address as follows: take the source start address as the virtual address of the first read request, and compute the read request address sequence according to the increment corresponding to the data width set on the AXI data bus: {source start address, source start address + increment, source start address + 2 × increment, …}.
The virtual addresses of the write requests are computed from the destination start address as follows: take the destination start address as the address of the first write request, and compute the write request address sequence according to the increment corresponding to the data width set on the AXI data bus: {destination start address, destination start address + increment, destination start address + 2 × increment, …}.
The virtual addresses of the read and write requests obtained in this way are arranged in natural order.
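The sequence construction for both request types can be sketched in a few lines (an editorial example; the function name and parameters are illustrative, not from the patent):

```python
# Generate the natural-order virtual-address sequence of read or write
# requests from a start address, the copy length in bytes, and the AXI
# burst increment (bytes per request).
def request_addresses(start, length, increment):
    """Natural order: start, start + inc, start + 2*inc, ..."""
    count = (length + increment - 1) // increment   # number of requests
    return [start + i * increment for i in range(count)]

read_seq = request_addresses(0x1000, 256, 64)   # source start, length, increment
write_seq = request_addresses(0x8000, 256, 64)  # destination start
```

The same function serves both sequences, differing only in the start address, which mirrors the symmetry between the read and write paths described above.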
The virtual address space is divided into virtual pages as follows: the number of virtual pages occupied by the data is computed from the length of the data to be copied.
Specifically, referring to fig. 2, virtual addresses and physical addresses correspond one to one. The virtual addresses are contiguous, and the virtual addresses within a virtual page are also contiguous. For physical addresses, the addresses within one physical page are contiguous, but the physical pages themselves may not be. For example, in the figure, for consecutive virtual page 0 and virtual page 1, the last virtual address of virtual page 0 is seqn-1 and the first virtual address of adjacent virtual page 1 is seqn, so the virtual addresses across the two pages are contiguous. However, physical page i and physical page j, corresponding to virtual page 0 and virtual page 1 respectively, may be contiguous or discontiguous physical pages, and the corresponding physical addresses may likewise be discontiguous: for instance, the last physical address of physical page i is (i+1)×n−1 while the first physical address of physical page j is j×n. The data to be copied comprises m read/write requests in total, with n read/write requests per physical address page.
The virtual addresses of the read requests computed from the source start address are: {seq0, seq1, …, seqn-1, seqn, seqn+1, …, seq2n-1, seq2n, seq2n+1, …, seqm-3, seqm-2, seqm-1}. The virtual space is divided into virtual pages of n virtual addresses each: the addresses {seq0, seq1, …, seqn-1} form the first virtual page, with seq0 its first virtual address and seqn-1 its nth; the addresses {seqn, seqn+1, …, seq2n-1} form the second virtual page, with seqn its first virtual address and seq2n-1 its nth; and so on, until all virtual addresses have been assigned to virtual pages, the addresses within each page remaining contiguous in natural order. Then any one virtual address of each virtual page is taken as its head address, i.e. the address of that page translated first, which serves as the page's entry address. The head address may be the first virtual address of each page, a virtual address at the same offset from the first address in every page, or any virtual address of the page.
To reduce computation, improve efficiency, and optimize resource allocation, the first virtual address of each page may be selected as the head address, or a virtual address at the same offset from the first address in every page — for example, the virtual address at the middle position of each virtual page.
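Page division and head-address selection can be sketched as follows (an editorial example; the fixed-offset choice shown is one of the options the text allows, and the names are illustrative):

```python
# Divide the natural-order address sequence into virtual pages of n
# addresses each, then pick one head address per page at a fixed offset.
def split_into_pages(addresses, n):
    return [addresses[i:i + n] for i in range(0, len(addresses), n)]

def head_addresses(pages, offset=0):
    # same offset within every page, as the preferred embodiment suggests;
    # offset=0 is the first address, n // 2 would be the middle position
    return [page[offset] for page in pages]
```

Note the sketch assumes every page actually contains the chosen offset; a short final page would need the offset clamped.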
The head addresses are sent to the address translation unit in advance. Once a head address has been translated, a corresponding page table entry (Entry) exists in the TLB's mapping table, so subsequently translating the other virtual addresses of that virtual page (Entry + in-page offset) takes only a few cycles. Specifically, head addresses equal in number to the number of pre-read virtual pages are selected in natural order and sent to the MMU for address translation in that order. Because the head address of the first virtual page has already been sent to the MMU in advance, together with a preset number of further head addresses, the first virtual page is set as a missing virtual page: it is not refilled with any remaining head address, and this missing virtual page serves as an updated virtual page.
The second virtual page is set as a combined virtual page: the first of the remaining head addresses is filled into the position of the original head address of the second virtual page, giving the updated second virtual page, whose virtual addresses are then sent to the MMU in order for translation — so a head address not yet translated is sent to the MMU ahead of time along with an earlier virtual page. Similarly, the third virtual page is set as a combined virtual page, the second of the remaining head addresses is filled into the position of its original head address, and the addresses of the updated third virtual page are sent to the MMU in order; and so on. Because the head addresses of t virtual pages are read in advance and no remaining head addresses are moved into the first virtual page, the last t−1 virtual pages are missing virtual pages, which completes the address translation of all virtual addresses. Since the head address corresponding to the second virtual page was translated in advance and its page table entry is already in the TLB's mapping table, translating the original virtual addresses of the second virtual page requires no further access to the multi-level page table and completes within a few cycles.
After all virtual addresses of the first virtual page have been translated, the TLB evicts the page table entry corresponding to the first pre-read head address, establishes the mapping between the next remaining head address and its physical address, and forms that head address's page table entry. By analogy, each time the virtual addresses of a page have been fully translated, the page table entry of the corresponding head address is evicted and the page table entry of the next head address to be translated is established — a dynamic process.
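The entry-recycling behavior described above can be sketched with a small capacity-limited TLB model (an editorial example with assumed names and an explicit eviction step; real TLBs evict by hardware replacement policy):

```python
# Toy model of the dynamic process: once a page's addresses are fully
# translated, its entry is evicted to make room for the next head address.
class RecyclingTLB:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}          # page index -> physical page number

    def fill(self, index, ppn):
        # caller must evict a finished page's entry before filling a new one
        assert len(self.entries) < self.capacity, "TLB full: evict first"
        self.entries[index] = ppn

    def evict(self, index):
        self.entries.pop(index, None)  # drop the finished page's entry
```

Used this way, a small TLB can keep exactly the t in-flight head entries resident while the pipeline advances page by page.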
In the prior art the virtual addresses are translated in natural order: when the first virtual address of a virtual page is translated, no corresponding page table entry exists in the TLB, so the multi-level page table must be accessed and the translation takes a long time. The address translation method provided by this patent uses out-of-order virtual addresses to translate the head addresses ahead of time: by the time the virtual addresses of a page are sent to the MMU, its head address has already been translated, and the page's virtual addresses can be translated within a few cycles.
As a preferred embodiment, the head address is the virtual address at the same increment from the start address in every virtual page. Let the number of pages read in advance be t, with the first virtual page {seq0, seq1, …, seqn-1}, the second virtual page {seqn, seqn+1, …, seq2n-1}, the third virtual page {seq2n, seq2n+1, …, seq3n-1}, …, and the t-th virtual page {seq(t-1)n, seq(t-1)n+1, …, seqt×n-1}. Taking the first virtual address of each virtual page as its head address, the sub-sequence of pre-read head addresses is {seq0, seqn, seq2n, …, seq(t-1)n}, and the sub-sequence of virtual addresses of the updated virtual pages is {seq1, …, seqn-1, seqt×n, seqn+1, …, seq2n-1, seq(t+1)×n, seq2n+1, …, seqm-3, seqm-2, seqm-1}. Splicing the sub-sequence of pre-read head addresses with the sub-sequence of updated-page addresses gives the out-of-order virtual address sequence: {seq0, seqn, seq2n, …, seq(t-1)n, seq1, …, seqn-1, seqt×n, seqn+1, …, seq2n-1, seq(t+1)×n, seq2n+1, …, seqm-3, seqm-2, seqm-1}. If the index of the last virtual address is an integer multiple of n, the last request in the sequence is seqm-2; otherwise it is seqm-1. The specific value of the number t of pre-read virtual pages can be determined from the relationship, for a given system, between the total delay of accessing the multi-level page table and the total delay of TLB lookups for all addresses in a virtual page.
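The construction above can be sketched as code. This is a best-effort editorial reading of the scheme, not the patent's implementation: the first-address-as-head choice and the exact position where each remaining head is injected (the head slot of the combined page, per the description of the second and third virtual pages) are assumptions.

```python
# Build the out-of-order sequence: pre-read the heads of the first t
# pages, then emit each page's in-page addresses, filling each combined
# page's head slot with the head of the page t-1 positions further on.
def out_of_order_sequence(pages, t):
    heads = [p[0] for p in pages]        # head = first address of each page
    seq = heads[:t]                      # sub-sequence of pre-read heads
    for x, page in enumerate(pages):     # x = 0 is the first (missing) page
        if x >= 1 and x + t - 1 < len(pages):
            seq.append(heads[x + t - 1])  # remaining head fills the slot
        seq.extend(page[1:])              # in-page addresses (head removed)
    return seq
```

For three pages of four addresses with t = 2, this yields {seq0, seqn} followed by page 1's body, then {seq2n} with page 2's body, then page 3's body — matching the spliced sequence given in the text.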
The virtual addresses are sent to the address translation unit in the out-of-order arrangement. While the virtual addresses of one page (say the x-th) are being translated, the head address of the (x+t)-th virtual page, embedded in the combined virtual page, is translated; during the translation delay of that head address, all addresses of virtual pages x, x+1, x+2, x+3, …, x+t−1 are matched against entries already in the TLB and complete quickly. By the time the head address of the (x+t)-th page has been translated and its entry obtained, the addresses of the (x+t)-th missing virtual page issue their translation requests and match the newly obtained entry, achieving fast translation.
Specifically, the number t of head addresses read in advance satisfies the following condition:

T = (n − 1) × T_TLB × t

where T is the total delay of the multi-level page-table accesses incurred in translating one head address, T_TLB is the delay with which the TLB translates a virtual address whose page table entry is already present, t is the number of head addresses read in advance, and n is the total number of virtual addresses in a virtual page.
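Solving the balance condition for t gives a simple sizing rule. The sketch below rounds up to the smallest integer t that fully covers the walk delay — the rounding is an editorial interpretation of the equality, and the cycle counts are assumed values:

```python
import math

# Smallest t such that (n - 1) * T_TLB * t >= T, i.e. the TLB-hit work of
# t pages in flight covers one page-table walk.
def pages_to_preread(walk_delay, tlb_delay, n):
    return math.ceil(walk_delay / ((n - 1) * tlb_delay))

t = pages_to_preread(300, 2, 64)   # e.g. 300-cycle walk, 2-cycle hit, 64 addrs/page
```

With these example numbers, pre-reading the heads of 3 pages keeps the translation pipeline busy through each walk.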
The out-of-order arrangement of the virtual addresses of the write requests is the same as that of the read requests, and the process of updating their natural order into the out-of-order order is identical, so it is not repeated here.
Step S003: send the virtual addresses to the address translation unit in the out-of-order order to obtain the corresponding physical addresses, and read and write the data according to those physical addresses to complete the copy.
After the physical addresses of the read requests are obtained, the corresponding data is read; the read data and the write requests are sent together to the address translation module, and after address translation the data is sent over AXI and the SoC bus to the destination memory, where the write operation is performed. The DMA then waits for the write response to complete the copy operation. The address translation process for write requests is the same as for read requests and is not described again.
In summary, the embodiments of the present invention provide a DMA-based method for optimizing data copy delay: before virtual addresses are translated into physical addresses, the naturally ordered virtual addresses are rearranged out of order, so that the head addresses of several virtual pages are translated in advance, while the head addresses of the remaining virtual pages are inserted into the pages whose head addresses were pre-read. The TLB thus holds the head-address translations of all virtual pages ahead of time, shortening the address translation cycle and improving data copy efficiency.
It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims (9)

1. A DMA-based optimization method for data copy delay is characterized by comprising the following steps:
analyzing and recording a source initial address and a destination initial address contained in the replication command, and acquiring the length of the replicated data;
calculating a virtual address of a read request according to the source initial address and the length of the copied data, and calculating a virtual address of a write request according to the destination initial address and the length of the copied data, wherein the virtual addresses corresponding to the read-write requests are arranged in a natural order; dividing the virtual address space into a plurality of virtual pages, and respectively updating the naturally ordered virtual address sequences corresponding to the read-write requests into out-of-order virtual address sequences; the out-of-order virtual address sequence is obtained by splicing a subsequence formed by first addresses read in advance with a subsequence formed by the virtual addresses in the updated virtual pages, wherein the first address is any one virtual address in each virtual page, the updated virtual pages comprise missing-item virtual pages and combined virtual pages, a missing-item virtual page comprises in-page virtual addresses, a combined virtual page comprises in-page virtual addresses and one remaining first address, the in-page virtual addresses are the virtual addresses in a virtual page other than the first address, the remaining first addresses are the first addresses of the later-arranged remaining virtual pages, and the remaining virtual pages are the virtual pages other than those whose first addresses are read in advance;
and sending the virtual address sequence to an address translation unit in the out-of-order order for address translation to obtain corresponding physical addresses, and reading and writing data according to the physical addresses to complete the copy.
2. The method of claim 1, wherein the first address is determined as follows:
relative to the start address of each virtual page, the virtual address having the same offset is the first address.
3. The method of claim 1, wherein the first address is the first virtual address in the virtual page to undergo address translation, namely the entry address of the virtual page.
4. The method for optimizing data copy delay based on DMA according to claim 1, wherein the method for determining the number of first addresses read in advance comprises: determining the number according to the relation between the total delay of accessing the multi-level page table and the total delay of TLB queries for all the in-page virtual addresses in the virtual pages.
5. The method of claim 1, wherein the following condition is satisfied:
T = (n-1) × T_TLB × t
wherein T is the total delay of accessing the multi-level page table to translate one first address, T_TLB is the delay for the TLB to translate a virtual address whose page table entry already exists, t is the number of first addresses read in advance, and n is the total number of virtual addresses in a virtual page.
6. The method according to claim 1, wherein the updated virtual pages include a number of missing-item virtual pages equal to the number t of first addresses read in advance, and the other updated virtual pages are combined virtual pages; the first virtual page is a missing-item virtual page, and the last t-1 virtual pages are the remaining missing-item virtual pages.
7. The method of claim 1, wherein the address translation unit removes the corresponding page table entry after completing the translation of a virtual page, and establishes a page table entry for the next first address to be translated.
8. The method according to claim 1, wherein the out-of-order arrangement of the virtual addresses corresponding to the read request is the same as that of the virtual addresses corresponding to the write request.
9. The method of claim 1, wherein the source start address and the destination start address are respectively the source address and the destination write address of the first byte of the data to be copied.
CN202210720713.XA 2022-06-24 2022-06-24 DMA-based optimization method for data copy delay Active CN114780466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210720713.XA CN114780466B (en) 2022-06-24 2022-06-24 DMA-based optimization method for data copy delay


Publications (2)

Publication Number Publication Date
CN114780466A (en) 2022-07-22
CN114780466B (en) 2022-09-02

Family

ID=82422391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210720713.XA Active CN114780466B (en) 2022-06-24 2022-06-24 DMA-based optimization method for data copy delay

Country Status (1)

Country Link
CN (1) CN114780466B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977037A (en) * 2017-12-28 2019-07-05 龙芯中科技术有限公司 A kind of DMA data transfer method and system
CN113961488A (en) * 2021-10-22 2022-01-21 上海兆芯集成电路有限公司 Method for remapping virtual address to physical address and address remapping unit
CN114556881A (en) * 2019-10-17 2022-05-27 华为技术有限公司 Address translation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105404597B (en) * 2015-10-21 2018-10-12 华为技术有限公司 Method, equipment and the system of data transmission
GB2551756B (en) * 2016-06-29 2019-12-11 Advanced Risc Mach Ltd Apparatus and method for performing segment-based address translation
US10417140B2 (en) * 2017-02-24 2019-09-17 Advanced Micro Devices, Inc. Streaming translation lookaside buffer




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant