CN116860665A - Address translation method executed by processor and related product - Google Patents

Address translation method executed by processor and related product

Info

Publication number
CN116860665A
Authority
CN
China
Prior art keywords
address
request
ptw
page table
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310891442.9A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202310891442.9A
Publication of CN116860665A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • G06F12/1009Address translation using page tables, e.g. page table structures

Abstract

The present disclosure discloses an address translation method performed by a processor, as well as a computing device, a chip, and a board card. The computing device may be included in a combined processing device, which may further include an interface device and a processing device. The computing device interacts with the processing device to jointly complete a user-specified computing operation. The combined processing device may further include a storage device connected to the computing device and the processing device, respectively, for storing data of the computing device and the processing device. Aspects of the present disclosure provide address translation methods that can accelerate address translation by reducing page table walk time.

Description

Address translation method executed by processor and related product
Technical Field
The present disclosure relates generally to the field of computers, and more particularly to the field of storage management. More specifically, the present disclosure relates to address translation methods performed by a processor, and to related processors, computing devices, chips, and board cards.
Background
In the field of computer technology, one of the important functions of a computer operating system is memory management. In a multi-process operating system, each process has its own virtual address space and can use any virtual address (Virtual Address) within the range allowed by the system specification. The address used when the CPU (Central Processing Unit) executes an application program is a virtual address. When the operating system allocates memory to a process, the virtual address used must be mapped to a physical address (Physical Address), which is the address used to access real physical memory. Therefore, address translation (i.e., translating a virtual address into the physical address where the data actually resides) is required before data access. Dividing addresses into virtual addresses and physical addresses simplifies program compilation: the compiler compiles a program against a contiguous and ample virtual address space, and because the virtual addresses of different processes are mapped to different physical addresses, the system can run multiple processes simultaneously, improving the running efficiency of the whole computer system. In addition, since an application can use but cannot alter the address translation, one process cannot access the memory contents of another process, which increases the security of the system.
FIG. 1 is a schematic diagram of an example address translation process, illustrating address translation with a four-level page table. As shown in FIG. 1, a virtual address is divided into several address segments, denoted from the upper bits to the lower bits as EXT, OFFSET_lvl4, OFFSET_lvl3, OFFSET_lvl2, OFFSET_lvl1, and OFFSET_pg. In this example, the highest virtual address segment EXT is not used. During address translation, the virtual address segments OFFSET_lvl4, OFFSET_lvl3, OFFSET_lvl2, and OFFSET_lvl1 serve as indexes into the four levels of page tables, respectively: OFFSET_lvl4 indexes the fourth-level page table (the highest-level page table), OFFSET_lvl3 indexes the third-level page table, OFFSET_lvl2 indexes the second-level page table, and OFFSET_lvl1 indexes the first-level page table. The initial address of the highest-level (fourth-level) page table is stored, for example, in control register CR3, whose contents are set by the operating system and cannot be changed by an application program. To translate an address, the processor uses OFFSET_lvl4 as an index to find the page table entry at that index in the fourth-level page table; the content of that entry is the address of the third-level page table (the next-level page table), which may also be described as OFFSET_lvl4 pointing to the third-level page table address in the fourth-level page table. The processor then uses OFFSET_lvl3 as an index into the third-level page table to obtain the address of the second-level page table, then uses OFFSET_lvl2 as an index into the second-level page table to obtain the address of the first-level page table, and finally uses OFFSET_lvl1 as an index into the first-level page table to find the page table entry (PTE) whose content is the physical address of the memory page corresponding to the virtual address. That memory page address, combined with the lowest address segment OFFSET_pg of the virtual address, yields the physical address corresponding to the virtual address. Since the address translation process involves a walk from the highest-level page table to the lowest-level page table, the process may also be referred to as a page table walk.
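To make the walk concrete, below is a minimal software model of the four-level walk just described. The 9-bit per-level index width, the 12-bit page offset, and the read_pte helper are illustrative assumptions for the sketch, not details taken from this disclosure:

```c
#include <stdint.h>

#define LEVELS     4
#define IDX_BITS   9                      /* assumed index width per level */
#define PAGE_SHIFT 12                     /* assumed 4 KiB pages */
#define IDX_MASK   ((1u << IDX_BITS) - 1)

/* Reads the page table entry at 'index' in the table whose physical base
 * address is 'table_pa'. In hardware this is one memory access; it is left
 * abstract here. Each entry is assumed to hold the physical base address
 * of the next-level table (or of the memory page at the last level). */
extern uint64_t read_pte(uint64_t table_pa, unsigned index);

/* Walks from the highest-level page table (whose address comes from a
 * register such as CR3) down to the physical address of the data. */
uint64_t page_table_walk(uint64_t top_table_pa, uint64_t vaddr)
{
    uint64_t pa = top_table_pa;
    for (int level = LEVELS; level >= 1; level--) {
        unsigned shift = PAGE_SHIFT + (unsigned)(level - 1) * IDX_BITS;
        unsigned index = (unsigned)(vaddr >> shift) & IDX_MASK; /* OFFSET_lvlN */
        pa = read_pte(pa, index);         /* one memory access per level */
    }
    /* pa is now the memory page address; append the in-page offset OFFSET_pg */
    return pa | (vaddr & ((1u << PAGE_SHIFT) - 1));
}
```

Note how each level contributes exactly one dependent memory access, which is why the walk dominates translation latency.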
Those skilled in the art will appreciate that the virtual addresses shown in FIG. 1 are merely examples, and that the highest address segment of other virtual addresses may be used as an index in the highest level page table in a page table walk. For example, in one variation of the above example, there are five total levels of page tables, and during address translation, the highest order virtual address segment EXT is also used as an index in the highest level page table.
Address translation is a very time-consuming process: with a multi-level page table, multiple memory accesses are typically required to obtain the corresponding physical address. Taking the four-level page table shown in FIG. 1 as an example, memory must be accessed 4 times to obtain the corresponding physical address. Therefore, to save address translation time and improve computer system performance, a hardware buffer, the translation lookaside buffer TLB (Translation Lookaside Buffer), may be provided in the processor core to store previously used first-level page table entries (PTEs). When address translation is needed, the processor first queries the TLB for the needed PTE; if it is present, the corresponding physical address is obtained immediately. The TLB may also have a multi-level structure, with the lowest-level TLB being the smallest and fastest, and the next-level TLB being searched when the lowest-level TLB misses.
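As a rough sketch of the lookup order just described (the fully associative organization, entry layout, and names are assumptions for illustration, not the disclosure's design):

```c
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12                    /* assumed 4 KiB pages */

typedef struct {
    uint64_t vpn;                         /* virtual page number (tag)   */
    uint64_t ppn;                         /* cached physical page number */
    int      valid;
} tlb_entry_t;

/* Returns 1 and sets *ppn on a TLB hit; returns 0 on a miss, in which
 * case the next-level TLB (if any) is searched and ultimately a page
 * table walk must be performed. */
int tlb_lookup(const tlb_entry_t tlb[TLB_ENTRIES], uint64_t vaddr, uint64_t *ppn)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {   /* fully associative model */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;                /* hit: physical page known */
            return 1;
        }
    }
    return 0;                                 /* miss */
}
```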
Although a TLB can reduce the latency of many address translations, a page table walk remains unavoidable when an address translation request misses in the TLB. To reduce the time required for page table walk operations, a dedicated hardware page table walker PTW (Page Table Walker) is typically provided for the processor core; it may be internal or external to the processor core. By using a hardware page table walker, the multiple levels of page tables can be traversed quickly to obtain the final physical address of the memory page.
Disclosure of Invention
The inventors have appreciated that although mechanisms such as the PTW exist to mitigate the performance penalty of TLB misses, the latency of page table walks is still high, which in turn increases the latency of memory accesses. While this problem exists for general-purpose processors, it is particularly pronounced for the recently emerging type of processor known as the neural processor NPU (Neural Processing Unit).
NPUs are designed for complex deep learning models and training on extensive datasets. Through research, the inventors found that address translation on NPUs presents two challenges.
(1) There are large numbers of bursty, sparse address translation requests on an NPU. On the one hand, current recommendation systems are typically trained on terabyte-scale datasets, which makes the embedding layer very large and causes embedding lookups to exhibit highly sparse and irregular memory access patterns. On the other hand, NPUs typically partition loops over weights and input data into small blocks (tiles) and overlap access latency with computation time by double-buffering the scratchpad memory (SPM). Normally these tiles can exploit parallelism, but during the embedding lookup phase they may trigger a large burst of address translation requests within a short time.
(2) NPUs are sensitive to long address translation delays. Since page tables are typically implemented as multi-level radix trees (e.g., on x86-64), multiple memory references are required on the critical path of address translation. Long address translation delays reduce the utilization of multiply-accumulate operations and may render some time-sensitive tasks unexecutable.
It is an object of the present disclosure to eliminate or alleviate at least the above technical problem of high page table walk latency. To achieve this object, the inventors of the present disclosure propose and utilize the concept of a translation path. Specifically, the inventors regard the concatenation of the addresses of all the next-level page tables traversed from the start point of a virtual address translation (i.e., the highest-level page table address) to its end point (i.e., the physical address of the memory page) as the translation path of that virtual address. In the example of FIG. 1, the translation path is the address of the third-level page table + the address of the second-level page table + the address of the first-level page table. In other words, the translation path comprises the N page table addresses (N ≥ 1) pointed to, in turn, by N consecutive address segments (taken from the upper bits to the lower bits of the virtual address) in a multi-level page table walk starting from the highest-level page table. Further, the inventors conceived a scheme of saving in the processor the N address segments and the translation path of the virtual address whose translation was most recently requested, so that when processing a current virtual address translation request having the same translation path (or a portion thereof), the page table walk can start directly from the last page table address in the shared translation path (or portion thereof), thereby shortening the page table walk.
In implementations of the disclosed scheme, saving the translation path is implemented as saving the N page table addresses, and comparing translation paths is reduced to comparing the N address segments of the current virtual address with the saved N address segments (because identical address segments correspond to identical page table addresses, i.e., to the same translation path or portion thereof).
According to a first aspect of the present disclosure, the above object is achieved by an address translation method performed by a processor, wherein the processor comprises a page table walker and a translation path cache, entries in the translation path cache being for holding a partial virtual address of a most recently requested translation and its translation path, the partial virtual address comprising consecutive N address segments from high order to low order, N ≥ 1, the translation path comprising N page table addresses respectively pointed to, in turn, in a multi-level page table walk from the highest-level page table, corresponding to the N address segments. The method comprises: in response to an address translation request, comparing a virtual address in the address translation request with entries in the translation path cache; in response to the virtual address matching one or more entries in the translation path cache, selecting the entry with the longest matching prefix as the matching entry; and using the page table walker to start a page table walk from the page table address pointed to by the last matching address segment in the matching entry.
According to a second aspect of the present disclosure, the above object is achieved by a processor. The processor includes a page table walk unit including a page table walker and a translation path cache. Entries in the translation path cache are used to hold a partial virtual address of a most recently requested translation and its translation path, the partial virtual address comprising consecutive N address segments from high order to low order, N ≥ 1, the translation path comprising N page table addresses respectively pointed to, in turn, in a multi-level page table walk from the highest-level page table, corresponding to the N address segments. The page table walk unit is configured to: in response to an address translation request, compare a virtual address in the address translation request with entries in the translation path cache; in response to the virtual address matching one or more entries in the translation path cache, select the entry with the longest matching prefix as the matching entry; and use the page table walker to start a page table walk from the page table address pointed to by the last matching address segment in the matching entry.
According to a third aspect of the present disclosure, the above object is achieved by a computing device. The computing device includes a processor according to the second aspect.
According to a fourth aspect of the present disclosure, the above object is achieved by a chip. The chip comprises a computing device according to the third aspect.
According to a fifth aspect of the present disclosure, the above object is achieved by a board card. The board card comprises the chip according to the fourth aspect.
With the address translation method, processor, computing device, chip and board card provided above, embodiments of the present disclosure can shorten page table walks by starting the walk directly from the last page table address of a saved identical translation path (or portion thereof). Further, in some embodiments, by providing two or more access ports in a TLB of the processor, two or more address translation requests can be received simultaneously and then processed concurrently. Still further, in some embodiments, multiple address translation requests with adjacent virtual addresses can be merged to reduce overall address translation time, by providing in the processor a merge buffer that stores page table walk PTW requests that have been delivered to the page table walker for traversal but have not yet returned a page physical address.
Drawings
The above and other aspects, features and benefits of the present disclosure will become more apparent from the following detailed description with reference to the accompanying drawings. In the drawings, the same reference numerals or letters are used to designate the same or equivalent elements. The accompanying drawings, which are included to provide a better understanding of embodiments of the disclosure, and are not necessarily drawn to scale, wherein:
FIG. 1 is a schematic diagram of an example address translation process;
FIG. 2 is a flow chart of a method performed by a processor according to some embodiments of the present disclosure;
FIG. 3 illustrates one specific example of a page table walk;
FIG. 4 schematically illustrates an example translation path cache that has saved an entry;
FIG. 5 schematically illustrates an example translation path cache that has saved two entries;
FIG. 6 shows an example schematic block diagram of a processor of an embodiment of the present disclosure;
FIG. 7 shows a block diagram of a board card of an embodiment of the present disclosure;
FIG. 8 shows a block diagram of a combined processing device of an embodiment of the present disclosure.
Detailed Description
Embodiments herein will be described more fully hereinafter with reference to the accompanying drawings. The embodiments herein may, however, be embodied in many different forms and should not be construed as limiting the scope of the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, a noun that indicates an object may refer to a singular object as well as to a plurality of objects unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used herein, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.
Moreover, the use of ordinal terms such as "first," "second," and "third," etc., herein to modify an object does not by itself connote any priority, precedence, or order of one object relative to another, nor does it connote the temporal order in which acts of a method are performed, but rather merely serves as a tag for distinguishing one object having a particular name from another object having the same name to distinguish the objects.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood. It will be further understood that the terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and should not be interpreted in a limiting sense unless explicitly so defined herein.
FIG. 2 is a flow diagram of an address translation method 200 performed by a processor according to some embodiments of the present disclosure. The processor includes a page table walker and a translation path cache, entries in the translation path cache being for holding a partial virtual address of a most recently requested translation and its translation path, the partial virtual address comprising consecutive N address segments from high order to low order, N ≥ 1, the translation path comprising N page table addresses respectively pointed to, in turn, in a multi-level page table walk from the highest-level page table, corresponding to the N address segments.
The method 200 includes: a step 201 of comparing, in response to an address translation request, a virtual address in the address translation request with entries in the translation path cache; a step 202 of selecting, in response to the virtual address matching one or more entries in the translation path cache, the entry with the longest matching prefix as the matching entry; and a step 203 of using the page table walker to start a page table walk from the page table address pointed to by the last matching address segment in the matching entry.
Next, the method 200 and its various embodiments will be further described in connection with FIGS. 3-5. It is to be appreciated that while FIGS. 3-5 illustrate examples using four-level page tables, the number of page table levels in the present disclosure is not limited to four and may be any number greater than or equal to two. In addition, while certain embodiments address particular considerations of NPUs, aspects of the present disclosure are generally applicable to a variety of processors, such as the central processing unit CPU (Central Processing Unit), graphics processor GPU (Graphics Processing Unit), tensor processor TPU (Tensor Processing Unit), and intelligent processor IPU (Intelligent Processing Unit), as long as their address translation uses a page table walk mechanism and therefore faces memory access latency.
FIG. 3 illustrates one specific example of a page table walk involving four-level page tables. Assume that in this example the consecutive address segments of the virtual address involved in the page table walk are 0b9, 00c, 0ae, and 0c2 (in hexadecimal), respectively.
In a normal page table walk, as shown in FIG. 3, after the address 613 of the fourth-level page table (the highest-level page table) is acquired from control register CR3, the processor uses address segment 0b9 as an index to find the page table entry at that index in the fourth-level page table; the hexadecimal number 042 stored in that entry is the address of the third-level page table (the next-level page table), i.e., address segment 0b9 points to third-level page table address 042 in the fourth-level page table. Notably, in the example of FIG. 3, a page table address is represented by its physical page number PPN (Physical Page Number), since a system-specific number of zeros can be appended after the PPN to obtain the complete page table address. Those skilled in the art will appreciate that representing a page table address with a PPN is merely an example; page table addresses may be represented in a variety of other ways, and aspects of the present disclosure are not limited in this respect.
After obtaining the address of the third-level page table, the processor uses address segment 00c as an index to find the page table entry at that index in the third-level page table; the hexadecimal number 125 stored in that entry is the address of the second-level page table (also described as address segment 00c pointing to second-level page table address 125 in the third-level page table). The processor then uses address segment 0ae as an index to find the page table entry at that index in the second-level page table; the hexadecimal number 508 stored in that entry is the address of the first-level page table (also described as address segment 0ae pointing to first-level page table address 508 in the second-level page table). Finally, the processor uses address segment 0c2 as an index to find the page table entry PTE at that index in the first-level page table; the hexadecimal number 123 stored in that entry is the physical address of the corresponding memory page (also described as address segment 0c2 pointing to memory page address 123 in the first-level page table), which is combined with the lowest address segment (the in-page offset, not shown) of the virtual address to obtain the physical address corresponding to the virtual address.
In the example of FIG. 3, the three consecutive address segments 0b9, 00c, 0ae from high order to low order in the virtual address (excluding segment 0c2, which points to the memory page) are the N address segments of the virtual address. They correspond to the N page table addresses 042, 125, 508 that these segments point to, in turn, in a multi-level page table walk starting from the highest-level page table; according to the definition of the present disclosure, the concatenation of these page table addresses forms the translation path of the virtual address.
Now assume that the example virtual address of FIG. 3 is the virtual address in an address translation request to be handled by a processor according to a method of an embodiment of the present disclosure, and that the processor comprises a page table walker and a translation path cache, entries in the cache being for holding a partial virtual address of a most recently requested translation and its translation path, the partial virtual address comprising consecutive N address segments from high order to low order (N ≥ 1), the translation path comprising N page table addresses respectively pointed to, in turn, in a multi-level page table walk from the highest-level page table, corresponding to the N address segments. Assume also that no entry is held in the translation path cache at this time.
First, the processor compares the virtual address in the address translation request with the entries in the translation path cache. Since no entry is stored in the translation path cache at this time, the processor directly determines that the virtual address does not match any entry in the translation path cache (a match means that the N address segments of the virtual address and the N address segments of a compared entry are identical in the first one or more address segments, i.e., share an address prefix). The processor therefore uses the page table walker to perform the page table walk described above, starting from highest-level page table address 613, to find the physical address corresponding to the virtual address.
In one embodiment, after the physical address is found, in response to the virtual address having no perfect match with any entry in the translation path cache, the processor saves the N address segments of the virtual address (i.e., 0b9, 00c, 0ae) and their corresponding N page table addresses (i.e., 042, 125, 508) in an entry of the translation path cache, where a perfect match means that the N address segments of the virtual address are identical to the N address segments of an entry in the translation path cache.
FIG. 4 schematically illustrates an example translation path cache holding this entry, whose contents include the N address segments 0b9/00c/0ae (in the "index" field) and their corresponding N page table addresses 042/125/508 (in the "translation path" field).
After completion of the processing of this address translation request, it is assumed that the processor in turn receives another address translation request (not shown) that needs to be processed, and that the N address segment values of the virtual address therein are 0b9, 00c and 0ad.
According to a method of an embodiment of the present disclosure, a processor first compares a virtual address in the another address translation request with an entry in a translation path cache. Since there is only one entry in the translation path cache at this time, the comparison involves only the virtual address and the entry. The processor then finds that the first two address segments 0b9 and 00c of the N address segments in the virtual address and the first two address segments 0b9 and 00c of the N address segments in the entry are identical (see fig. 4), i.e. that the virtual address matches the entry in the translation path cache.
In response to the match, the processor selects the entry as the matching entry and uses the page table walker to start a page table walk (see FIG. 3) from the page table address (125) corresponding to the last matching address segment (00c) in the matching entry, to find the physical address corresponding to the virtual address. In other words, when the translation path of the virtual address is partially identical to a saved translation path, the walk over the identical portion of the translation path need not be repeated; instead, the page table walk starts directly from the last page table address in the identical portion, shortening the page table walk.
In one embodiment, after the physical address is found, in response to the virtual address having no perfect match with any entry in the translation path cache (as described above, the virtual address has only a partial match with the existing entry in the cache), the processor saves the N address segments of the virtual address (i.e., 0b9, 00c, and 0ad) and their corresponding N page table addresses (i.e., 042, 125, 378) in an entry of the translation path cache. FIG. 5 schematically illustrates an example translation path cache holding this entry and the previous entry; the new entry's contents include the N address segments 0b9/00c/0ad (in the "index" field) and their corresponding N page table addresses 042/125/378 (in the "translation path" field).
After completion of the processing of the further address translation request, it is assumed that the processor in turn receives a further address translation request (not shown) to be processed, and that the N address segment values of the virtual address therein are 0b9, 00c and 0ae.
According to a method of an embodiment of the present disclosure, the processor first compares the virtual address in this further address translation request with the entries in the translation path cache. There are now two entries in the translation path cache: the processor finds that the N address segments 0b9, 00c and 0ae of the virtual address are identical to the N address segments of the first entry, and that the first two of its N address segments, 0b9 and 00c, are identical to the first two of the N address segments of the second entry (see FIG. 5). In response to the virtual address matching both entries of the translation path cache, the processor selects the entry with the longest matching prefix (i.e., the first entry) as the matching entry and uses the page table walker to start a page table walk (see FIG. 3) from the page table address (i.e., 508) corresponding to the last matching address segment (i.e., address segment 0ae) in the matching entry, to find the physical address corresponding to the virtual address. In other words, the translation path of the virtual address is identical to a saved translation path, so the walk over that translation path need not be repeated; instead, the page table walk starts directly from the last page table address in the identical translation path, shortening the page table walk.
After the physical address is found, the processor does not save the N address segments of the virtual address in the further address translation request and their corresponding N page table addresses in the entries of the translation path cache, since there is a perfect match of the virtual address with the first entry in the translation path cache.
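The lookup-and-select behavior traced in this example can be sketched in software as follows. The entry layout, N = 3 (matching the four-level example), the 4-entry capacity, and all names are illustrative assumptions, not details prescribed by the disclosure:

```c
#include <stdint.h>

#define TPC_N       3     /* N address segments / N page table addresses */
#define TPC_ENTRIES 4     /* assumed cache capacity */

typedef struct {
    unsigned seg[TPC_N];  /* "index" field: address segments, high to low */
    uint64_t path[TPC_N]; /* "translation path" field: page table addresses */
    int      valid;
} tpc_entry_t;

/* Compares the request's N address segments with every entry and selects
 * the entry with the longest matching prefix. Returns the match length
 * (0 means no match: walk from the highest-level page table) and sets
 * *resume_pa to the page table address pointed to by the last matching
 * address segment, i.e., where the page table walk should resume. */
int tpc_lookup(const tpc_entry_t tpc[TPC_ENTRIES],
               const unsigned seg[TPC_N], uint64_t *resume_pa)
{
    int best_len = 0, best = -1;
    for (int e = 0; e < TPC_ENTRIES; e++) {
        if (!tpc[e].valid)
            continue;
        int len = 0;
        while (len < TPC_N && tpc[e].seg[len] == seg[len])
            len++;                       /* prefix match from high to low */
        if (len > best_len) {
            best_len = len;
            best = e;
        }
    }
    if (best >= 0)
        *resume_pa = tpc[best].path[best_len - 1];
    return best_len;
}
```

With the contents of FIG. 5, for example, segments 0b9/00c/0ae return prefix length 3 and resume address 508 (only the last-level access remains), while a request matching only 0b9/00c would return length 2 and resume from address 125.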
Those skilled in the art will appreciate that, before the number of entries held in the translation path cache reaches the cache capacity limit, saving the N address segments and their corresponding N page table addresses in an entry of the translation path cache means creating a new entry in the cache; after the number of saved entries reaches the capacity limit, if the processor needs to save a new entry, the new entry may overwrite the earliest-saved entry in the cache or the entry most similar to the new entry. Aspects of the present disclosure are not limited in this respect. Those skilled in the art will also appreciate that, while the example translation path caches of FIGS. 4 and 5 are shown as holding only 4 entries, the translation path cache capacity limit may be set to any suitable amount depending on the application and requirements. Aspects of the present disclosure are not limited in this respect.
In one embodiment, the translation path cache is located in the page table walker. However, those skilled in the art will fully understand that in other embodiments the translation path cache may be located outside the page table walker. Aspects of the present disclosure are not limited in this respect.
In one embodiment, the processor further includes a translation look-aside buffer, TLB, and the method is performed in response to the address translation request missing in the TLB. However, those skilled in the art will fully appreciate that the disclosed aspects may be applicable even if there is no TLB in the processor.
In some processors (e.g., NPUs), the bottleneck of address translation lies in insufficient concurrency rather than in underutilized locality of reference. Thus, in further embodiments, the TLB may have two or more access ports, so that two or more address translation requests can be received simultaneously through these access ports and processed concurrently.
If the processor can receive and concurrently process two or more address translation requests, it needs to track the status of outstanding address translation requests that missed in the TLB, so as to overlap hit requests with miss requests, overlap two miss requests, and return address translation results out of order. Thus, in a further embodiment, the processor further comprises a miss status handling register MSHR for buffering any address translation request that misses in the TLB until the address translation of that request is completed.
The inventors of the present disclosure have also found that instructions, regular data, and discrete data generally exhibit different locality of reference. Thus, in one embodiment, the TLB includes an instruction translation lookaside buffer ITLB dedicated to instruction addresses, a data translation lookaside buffer DTLB1 dedicated to regular data addresses, and a data translation lookaside buffer DTLB2 dedicated to discrete data addresses, wherein the DTLB2 has two or more access ports, each access port covering a portion of the address space.
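One possible way to let each access port cover a portion of the address space is to steer each request by a few bits of its virtual page number; this is purely an illustrative assumption, not a port-assignment rule stated in the disclosure:

```c
#include <stdint.h>

#define PAGE_SHIFT  12   /* assumed 4 KiB pages */
#define DTLB2_PORTS 4    /* assumed port count  */

/* Selects the DTLB2 access port responsible for this virtual address,
 * partitioning the address space across ports at page granularity. */
unsigned dtlb2_port_for(uint64_t vaddr)
{
    return (unsigned)((vaddr >> PAGE_SHIFT) % DTLB2_PORTS);
}
```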
The inventors of the present disclosure have also found that for 18%-44% of address translation requests, the virtual address corresponds to the same memory page as the virtual address of some recent address translation request, or the two virtual addresses are adjacent. Such address translation requests with adjacent virtual addresses can cause unnecessary repeated page table walks, because those walks all arrive at the same memory page address; only the addition of each request's own address offset to that page address differs in producing the corresponding physical addresses. Although the above-described methods of the present disclosure can shorten the page table walk for these requests (because the translation path is the same as a recently saved translation path), the address of the same memory page must still be fetched from the first-level page table, so unnecessary duplication remains. To avoid repeatedly processing these address translation requests with adjacent virtual addresses, the inventors conceived of merging their processing. Specifically, in one embodiment, the processor further includes a merge buffer storing page table walk PTW requests that have been delivered to the page table walker for traversal but have not yet returned a page physical address, and the method further includes: in response to receiving a new PTW request, determining whether the new PTW request requests the same page as any stored PTW request in the merge buffer; in response to determining that the new PTW request requests the same page as a stored PTW request, storing the new PTW request in a merge entry in the merge buffer associated with the stored PTW request that requests the same page; and waiting for the page table walk result of that stored PTW request.
In a further embodiment, determining whether the new PTW request requests the same page as any stored PTW request in the merge buffer comprises: comparing all address segments except the lowest address segment of the new PTW request's virtual address with the corresponding address segments of each stored PTW request's virtual address; if those remaining address segments are identical, the new PTW request is determined to request the same page as that stored PTW request.
In a further embodiment, the method further comprises: in response to determining that the new PTW request does not request the same page as any stored PTW request, storing a new merge entry associated with the new PTW request in the merge buffer; and delivering the new PTW request to the page table walker.
In a further embodiment, the method further comprises: in response to the page table walker returning a page physical address for a stored PTW request, generating page table walk results, based on the page physical address, for each of the PTW requests in the merge entry associated with that stored PTW request.
In a further embodiment, generating the page table walk results for the PTW requests in the merge entry associated with the stored PTW request based on the page physical address includes: for each PTW request in the merge entry, generating, from the page physical address and the lowest address segment of that PTW request's virtual address, the physical address corresponding to its virtual address as the page table walk result of that PTW request.
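Pulling the preceding embodiments together, here is a minimal sketch of the merge buffer flow. The 12-entry capacity, the per-entry fan-out, and all names are assumptions for illustration; dropping the lowest address segment (the in-page offset, assumed 12 bits as in the earlier sketches) leaves the page-identifying part of the virtual address:

```c
#include <stdint.h>

#define PAGE_SHIFT   12   /* assumed 4 KiB pages */
#define MB_ENTRIES   12   /* within the 8..12 range found sufficient */
#define MB_MAX_MERGE 8    /* assumed fan-out of one merge entry */

typedef struct {
    uint64_t vaddr;       /* virtual address of a PTW request */
} ptw_req_t;

typedef struct {
    uint64_t  tag;        /* vaddr minus the lowest segment (in-page offset) */
    ptw_req_t reqs[MB_MAX_MERGE];
    int       nreqs;
    int       valid;
} merge_entry_t;

/* Returns 1 if the new request was merged with a pending walk to the same
 * page (it then just waits for that walk's result); returns 0 if a new
 * merge entry was allocated and the request must go to the page table
 * walker. */
int merge_buffer_add(merge_entry_t mb[MB_ENTRIES], ptw_req_t req)
{
    uint64_t tag = req.vaddr >> PAGE_SHIFT;   /* drop the lowest segment */
    for (int i = 0; i < MB_ENTRIES; i++) {
        if (mb[i].valid && mb[i].tag == tag && mb[i].nreqs < MB_MAX_MERGE) {
            mb[i].reqs[mb[i].nreqs++] = req;  /* same page: merge and wait */
            return 1;
        }
    }
    for (int i = 0; i < MB_ENTRIES; i++) {
        if (!mb[i].valid) {                   /* different page: new entry */
            mb[i].tag     = tag;
            mb[i].reqs[0] = req;
            mb[i].nreqs   = 1;
            mb[i].valid   = 1;
            return 0;                         /* deliver to the walker */
        }
    }
    return 0;   /* buffer full: fall back to an unmerged walk */
}

/* Called when the walker returns the page physical address: each merged
 * request's result is the page address combined with that request's own
 * lowest address segment (in-page offset). */
void merge_buffer_complete(merge_entry_t *e, uint64_t page_pa,
                           void (*deliver)(uint64_t vaddr, uint64_t pa))
{
    for (int i = 0; i < e->nreqs; i++) {
        uint64_t off = e->reqs[i].vaddr & ((1u << PAGE_SHIFT) - 1);
        deliver(e->reqs[i].vaddr, page_pa | off);
    }
    e->valid = 0;   /* entry may now be reused or overwritten */
}
```

One page table walk thus serves every merged request, with only the cheap offset combination performed per request.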
Those skilled in the art will appreciate that a PTW request merged for an adjacent virtual address need not be deleted from the merge buffer after it is processed; it may simply be overwritten the next time another address translation request with an adjacent virtual address needs to be stored.
Obviously, the larger the merge buffer, the more address translation requests with adjacent virtual addresses can be pending, and the more duplicate page table walk operations are avoided. However, enlarging the merge buffer to cover only a few extreme cases wastes resources. The inventors of the present disclosure found through experimentation that a merge buffer capable of storing 8 to 12 entries is sufficient to handle address translation requests with adjacent virtual addresses in most cases. Thus, in further embodiments, the capacity limit of the merge buffer is 8, 9, 10, 11 or 12 entries.
In a further embodiment, the merge buffer is located in the page table walker. However, those skilled in the art will fully understand that in other embodiments the merge buffer may be located outside the page table walker. Aspects of the present disclosure are not limited in this respect.
Accordingly, embodiments of the present disclosure also provide a processor comprising a page table walk unit that includes a page table walker and a translation path cache, wherein: entries in the translation path cache are used to hold a partial virtual address of a most recently requested translation and its translation path, the partial virtual address comprising consecutive N address segments from high order to low order, N ≥ 1, the translation path comprising N page table addresses respectively pointed to, in turn, in a multi-level page table walk from the highest-level page table, corresponding to the N address segments; and the page table walk unit is configured to: in response to an address translation request, compare a virtual address in the address translation request with entries in the translation path cache; in response to the virtual address matching one or more entries in the translation path cache, select the entry with the longest matching prefix as the matching entry; and use the page table walker to start a page table walk from the page table address pointed to by the last matching address segment in the matching entry. The various embodiments of the processor correspond to the various embodiments of the method described above and are therefore not repeated here.
FIG. 6 shows an example schematic block diagram of a processor 600 of an embodiment of the disclosure. In this example, the page table walk unit 604 includes a merge buffer 605, a translation path cache 606, and a page table walker 607, and the translation lookaside unit 601 includes a TLB 602 and an MSHR 603. However, as described above, those skilled in the art will fully understand that the merge buffer 605 may be located outside the page table walk unit 604, and the MSHR 603 may likewise be located outside the translation lookaside unit 601. Aspects of the present disclosure are not limited in this respect. In one embodiment, the processor may include the merge buffer 605, the translation path cache 606, and the page table walker 607 without the page table walk unit 604. In one embodiment, the processor may include the TLB 602 and the MSHR 603 without the translation lookaside unit 601. Aspects of the present disclosure are not limited in this respect. Those skilled in the art will also appreciate that one or more of the TLB 602, the MSHR 603, and the merge buffer 605 may be absent in some embodiments of the processor (as described above). In one embodiment, the translation path cache 606 is adjacent or close to the physical location of the page table walker 607. In one embodiment, the merge buffer 605 is adjacent or close to the physical location of the page table walker 607.
In addition, while only one translation lookaside unit, one TLB, one MSHR, one page table walk unit, one merge buffer, and one translation path cache are shown in the processor example of fig. 6, a processor of an embodiment of the present disclosure may include one or more translation lookaside units, one or more TLBs, one or more MSHRs, one or more page table walk units, one or more merge buffers, and one or more translation path caches. The aspects of the present disclosure are not limited in this respect.
It will be understood that blocks of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
It will also be understood that the functions/acts noted in the blocks of the flowcharts may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Although some of the figures may include arrows on communication paths to illustrate a primary direction of communication, it should be understood that communication may occur in a direction opposite to the depicted arrows.
Furthermore, aspects of the present disclosure may take the form of a computer program stored on a memory, having computer-usable or computer-readable program code embodied therein for use by or in connection with an instruction execution system. In the context of this document, a memory may be any medium that can contain, store, or communicate a program for use by or in connection with an instruction execution system, apparatus, or device.
The present disclosure thus also provides a machine-readable storage medium (not shown) having stored thereon instructions which, when executed on a processor, cause the processor to perform the address translation method described in connection with the above embodiments.
FIG. 7 shows a schematic structural view of a board card 70 according to an embodiment of the present disclosure. As shown in FIG. 7, the board card 70 includes a chip 701, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial-intelligence computing unit supporting various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; one notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capacity of the platform. The board card 70 of this embodiment is therefore suitable for cloud intelligence applications, with large off-chip storage, large on-chip storage, and strong computing capacity.
The chip 701 is connected to an external device 703 through an external interface device 702. The external device 703 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed may be transferred by the external device 703 to the chip 701 through the external interface device 702, and the calculation result of the chip 701 may be transmitted back to the external device 703 via the external interface device 702. The external interface device 702 may take different interface forms, such as a PCIe interface, depending on the application scenario.
The board card 70 also includes a memory device 704 for storing data, which includes one or more memory units 705. The memory device 704 is connected to the control device 706 and the chip 701 through a bus for data transfer. The control device 706 in the board card 70 is configured to regulate the state of the chip 701. To this end, in one application scenario, the control device 706 may comprise a microcontroller (Micro Controller Unit, MCU).
FIG. 8 is a block diagram of the combined processing device in the chip 701 of this embodiment. As shown in FIG. 8, the combined processing device 80 includes a computing device 801, an interface device 802, a processing device 803, and a storage device 804.
The computing device 801 is configured to perform user-specified operations, mainly implemented as a single-core or multi-core intelligent processor performing deep learning or machine learning computations; it may interact with the processing device 803 through the interface device 802 to jointly complete the user-specified operations.
The interface device 802 is used to transfer data and control instructions between the computing device 801 and the processing device 803. For example, the computing device 801 may obtain input data from the processing device 803 via the interface device 802 and write it to an on-chip storage device of the computing device 801. Further, the computing device 801 may obtain control instructions from the processing device 803 via the interface device 802 and write them into an on-chip control cache of the computing device 801. Alternatively or additionally, the interface device 802 may also read data from a storage device of the computing device 801 and transmit it to the processing device 803.
The processing device 803, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 801. Depending on the implementation, the processing device 803 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and the number of processors may be determined according to actual needs. As previously mentioned, considered on its own, the computing device 801 of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 801 and the processing device 803 are considered together, they form a heterogeneous multi-core structure.
The storage device 804 is used to store data to be processed. It may be a DRAM, typically DDR memory, and is usually 16 GB or larger in size, storing data for the computing device 801 and/or the processing device 803.
When the computing device 801 is to run a neural network, the processing device 803 generally needs to compile the neural network to obtain an executable file; the executable file includes device information, i.e., which device in the heterogeneous computer system the executable file is to run on. The executable file is assembled and linked to obtain an executable program for the neural network, and the executable program is stored in the storage device 804.
The processing device 803 may read an executable program from a storage location of the executable program and obtain a plurality of tasks of the program according to the executable program. These tasks are distributed via the interface device 802 to the computing device 801 for execution, ultimately obtaining the result of the operation.
From the above description in connection with FIGS. 7 and 8, those skilled in the art will appreciate that the present disclosure also discloses an electronic device or apparatus that may include one or more of the above-described board cards, one or more of the above-described chips, and/or one or more of the above-described combined processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that aspects of the present disclosure are not limited by the order of the actions described. Accordingly, based on the disclosure or teachings herein, those skilled in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, i.e., the actions or modules involved are not necessarily required for realizing one or more aspects of this disclosure. In addition, depending on the scheme, the descriptions of different embodiments have different emphases. In view of this, those skilled in the art will appreciate that for portions not described in detail in one embodiment of the disclosure, reference may be made to the related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are divided herein by taking into account the logic function, and there may be other manners of dividing the units when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected as needed to achieve the objectives of the embodiments of the disclosure. Furthermore, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing program code.
In other implementation scenarios, the integrated units may also be implemented in the form of hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and to cover all equivalents and alternatives falling within the scope of those claims.

Claims (27)

1. An address translation method performed by a processor, wherein the processor includes a page table walker and a translation path cache, each entry in the translation path cache holding a partial virtual address of a most recently requested translation and its translation path, the partial virtual address comprising N consecutive address segments from high order to low order, N >= 1, and the translation path comprising N page table addresses that correspond to the N address segments and are pointed to in turn during a multi-level page table walk starting from the highest-level page table, the method comprising:
in response to an address translation request, comparing a virtual address in the address translation request with the entries in the translation path cache;
in response to the virtual address matching one or more entries in the translation path cache, selecting the entry with the longest matching prefix as a matching entry; and
performing, using the page table walker, a page table walk starting from the page table address pointed to by the last matched address segment in the matching entry.
2. The method of claim 1, wherein the method further comprises:
in response to the virtual address not completely matching any entry in the translation path cache, storing the N address segments of the virtual address and their translation path in the translation path cache, wherein a complete match means that the N address segments of the virtual address are identical, in one-to-one correspondence, to the N address segments of an entry in the translation path cache.
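By way of non-limiting illustration only, and not as part of the claims, the following Python sketch models the lookup and fill behavior recited in claims 1 and 2. The 4-level, 9-bits-per-segment layout with a 12-bit page offset is an assumed example configuration (the claims require only N >= 1), and all names are illustrative.

```python
LEVELS = 4        # N address segments, highest level first (assumed)
SEG_BITS = 9      # bits per address segment (assumed)
PAGE_SHIFT = 12   # bits of page offset below the lowest segment (assumed)

def segments(vaddr):
    """Split a virtual address into its N address segments, high to low."""
    return tuple(
        (vaddr >> (PAGE_SHIFT + SEG_BITS * (LEVELS - 1 - i))) & ((1 << SEG_BITS) - 1)
        for i in range(LEVELS)
    )

class TranslationPathCache:
    def __init__(self):
        # Each entry pairs the N address segments of a translated virtual
        # address with the N page table addresses its walk passed through,
        # from the highest-level table downward.
        self.entries = []

    def lookup(self, vaddr):
        """Claim 1: return (matched prefix length, translation path) for
        the entry with the longest matching prefix, or (0, None)."""
        segs = segments(vaddr)
        best_len, best_path = 0, None
        for entry_segs, path in self.entries:
            k = 0
            while k < LEVELS and entry_segs[k] == segs[k]:
                k += 1
            if k > best_len:
                best_len, best_path = k, path
        return best_len, best_path

    def fill(self, vaddr, path):
        """Claim 2: when no entry matches completely, store the request's
        N address segments together with its translation path."""
        segs = segments(vaddr)
        if all(entry_segs != segs for entry_segs, _ in self.entries):
            self.entries.append((segs, list(path)))
```

With a match of length k, the page table walker resumes from path[k-1], the table pointed to by the last matched segment, so only LEVELS - k table accesses remain instead of a full walk from the highest-level page table.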
3. The method of any one of claims 1-2, wherein the processor further comprises a translation lookaside buffer (TLB), and the method is performed in response to the address translation request missing in the TLB.
4. The method of claim 3, wherein the TLB has two or more access ports, so as to support receiving two or more address translation requests simultaneously through the two or more access ports and processing the two or more address translation requests concurrently.
5. The method of claim 4, wherein the processor further comprises a miss status handling register (MSHR) for buffering any address translation request that misses in the TLB until address translation for that request is completed.
6. The method of any one of claims 3-5, wherein the TLB comprises an instruction translation lookaside buffer (ITLB) dedicated to instruction addresses, a data translation lookaside buffer DTLB1 dedicated to regular data addresses, and a data translation lookaside buffer DTLB2 dedicated to discrete data addresses, and wherein the DTLB2 has two or more access ports, each access port covering a portion of the address space.
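As a hedged illustration of the multi-ported DTLB2 of claim 6, one possible, assumed partitioning rule is to interleave virtual page numbers across the ports; the claims require only that each access port cover a portion of the address space and do not fix any particular partition.

```python
DTLB2_PORTS = 2   # assumed port count; claim 6 requires only two or more
PAGE_SHIFT = 12   # assumed 4 KiB page size

def dtlb2_port(vaddr):
    """Route a discrete-data address to the port owning its page number."""
    return (vaddr >> PAGE_SHIFT) % DTLB2_PORTS
```

Under such a rule, two discrete accesses whose page numbers differ in the low bit fall on different ports and can be looked up in the same cycle.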
7. The method of any one of claims 1-6, wherein the processor further comprises a merge buffer storing page table walk (PTW) requests that have been delivered to the page table walker but for which no page physical address has yet been returned, and the method further comprises:
in response to receiving a new PTW request, determining whether the new PTW request requests the same page as any stored PTW request in the merge buffer;
in response to determining that the new PTW request requests the same page as a stored PTW request, storing the new PTW request in a merge entry in the merge buffer associated with the stored PTW request requesting the same page; and
waiting for the page table walk of the stored PTW request requesting the same page to return a result.
8. The method of claim 7, wherein determining whether the new PTW request requests the same page as any stored PTW request within the merge buffer comprises:
comparing the virtual address of the new PTW request with the virtual address of each stored PTW request over all address segments except the lowest address segment; and
if the remaining address segments are identical, determining that the new PTW request requests the same page as that stored PTW request.
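A minimal sketch of the same-page test of claim 8, under the assumption (consistent with claim 11) that the "lowest address segment" denotes the page offset: two PTW requests target the same page exactly when all address segments above the page offset, i.e. their virtual page numbers, are identical.

```python
PAGE_SHIFT = 12   # assumed page offset width

def same_page(vaddr_a, vaddr_b):
    """Claim 8: equal in every address segment except the lowest one."""
    return (vaddr_a >> PAGE_SHIFT) == (vaddr_b >> PAGE_SHIFT)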
9. The method of any one of claims 7-8, wherein the method further comprises:
storing, in the merge buffer, a new merge entry associated with the new PTW request in response to determining that the new PTW request does not request the same page as any stored PTW request; and
delivering the new PTW request to the page table walker.
10. The method of any one of claims 7-9, wherein the method further comprises:
in response to the page table walker returning a page physical address for a stored PTW request, generating page table walk results, based on the page physical address, respectively for all PTW requests in the merge entry associated with the stored PTW request.
11. The method of claim 10, wherein the generating page table walk results respectively for all PTW requests in the merge entry associated with the stored PTW request based on the page physical address comprises:
for each PTW request in the merge entry, generating, based on the page physical address and the lowest address segment of the virtual address of the PTW request, a physical address corresponding to the virtual address, as the page table walk result of that PTW request.
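The merge-buffer lifecycle of claims 7-11 can be sketched as follows; this is an illustrative model only, and `walker.start_walk` is a hypothetical placeholder for delivering a request to the page table walker, not an interface recited in the claims.

```python
PAGE_SHIFT = 12   # assumed page offset width

class MergeBuffer:
    """Parks PTW requests to a page whose walk is already in flight and
    completes all of them from a single returned page physical address."""

    def __init__(self, walker):
        self.walker = walker
        self.pending = {}   # virtual page number -> merged PTW requests

    def submit(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.pending:
            # Claims 7-8: same page already being walked; merge and wait.
            self.pending[vpn].append(vaddr)
        else:
            # Claim 9: create a new merge entry and issue the walk.
            self.pending[vpn] = [vaddr]
            self.walker.start_walk(vaddr)

    def on_walk_done(self, vaddr, page_phys):
        # Claims 10-11: one page physical address satisfies every merged
        # request; each result is the page base plus that request's own
        # page offset (its lowest address segment).
        vpn = vaddr >> PAGE_SHIFT
        offset = (1 << PAGE_SHIFT) - 1
        return [(v, page_phys | (v & offset)) for v in self.pending.pop(vpn)]
```

A single memory-side walk thus services every request that arrived for the page while the walk was outstanding, which is the latency saving targeted by claims 7-11.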
12. A processor comprising a page table walk unit, the page table walk unit comprising a page table walker and a translation path cache, wherein:
each entry in the translation path cache is used to store a partial virtual address of a most recently requested translation and its translation path, the partial virtual address comprising N consecutive address segments from high order to low order, N >= 1, and the translation path comprising N page table addresses that correspond to the N address segments and are pointed to in turn during a multi-level page table walk starting from the highest-level page table;
the page table walk unit is configured to:
in response to an address translation request, compare a virtual address in the address translation request with the entries in the translation path cache;
in response to the virtual address matching one or more entries in the translation path cache, select the entry with the longest matching prefix as a matching entry; and
perform, using the page table walker, a page table walk starting from the page table address pointed to by the last matched address segment in the matching entry.
13. The processor of claim 12, wherein the page table walk unit is further configured to:
in response to the virtual address not completely matching any entry in the translation path cache, store the N address segments of the virtual address and their translation path in the translation path cache, wherein a complete match means that the N address segments of the virtual address are identical, in one-to-one correspondence, to the N address segments of an entry in the translation path cache.
14. The processor of any one of claims 12-13, wherein the translation path cache is physically adjacent or close to the page table walker.
15. The processor of any one of claims 12-14, further comprising a translation lookaside unit including a translation lookaside buffer (TLB), wherein the address translation request is passed to the page table walk unit in response to the address translation request missing in the TLB.
16. The processor of claim 15, wherein the TLB has two or more access ports, so as to support receiving two or more address translation requests simultaneously through the two or more access ports and processing the two or more address translation requests concurrently.
17. The processor of claim 16, wherein the translation lookaside unit further comprises a miss status handling register (MSHR) for buffering any address translation request that misses in the TLB until address translation for that request is completed.
18. The processor of any one of claims 16-17, wherein the TLB comprises an instruction translation lookaside buffer (ITLB) dedicated to instruction addresses, a data translation lookaside buffer DTLB1 dedicated to regular data addresses, and a data translation lookaside buffer DTLB2 dedicated to discrete data addresses, and wherein the DTLB2 has two or more access ports, each access port covering a portion of the address space.
19. The processor of any one of claims 12-18, wherein the page table walk unit further comprises a merge buffer storing page table walk (PTW) requests that have been delivered to the page table walker but for which no page physical address has yet been returned, and the page table walk unit is further configured to:
in response to receiving a new PTW request, determine whether the new PTW request requests the same page as any stored PTW request in the merge buffer;
in response to determining that the new PTW request requests the same page as a stored PTW request, store the new PTW request in a merge entry in the merge buffer associated with the stored PTW request requesting the same page; and
wait for the page table walk of the stored PTW request requesting the same page to return a result.
20. The processor of claim 19, wherein the page table walk unit is further configured to determine whether the new PTW request requests the same page as any stored PTW request within the merge buffer by:
comparing the virtual address of the new PTW request with the virtual address of each stored PTW request over all address segments except the lowest address segment; and
if the remaining address segments are identical, determining that the new PTW request requests the same page as that stored PTW request.
21. The processor of any one of claims 19-20, wherein the page table walk unit is further configured to:
store, in the merge buffer, a new merge entry associated with the new PTW request in response to determining that the new PTW request does not request the same page as any stored PTW request; and
deliver the new PTW request to the page table walker.
22. The processor of any one of claims 19-21, wherein the page table walk unit is further configured to:
in response to the page table walker returning a page physical address for a stored PTW request, generate page table walk results, based on the page physical address, respectively for all PTW requests in the merge entry associated with the stored PTW request.
23. The processor of claim 22, wherein the page table walk unit is further configured to generate the page table walk results respectively for all PTW requests in the merge entry associated with the stored PTW request by:
for each PTW request in the merge entry, generating, based on the page physical address and the lowest address segment of the virtual address of the PTW request, a physical address corresponding to the virtual address, as the page table walk result of that PTW request.
24. The processor of any one of claims 19-23, wherein the merge buffer is physically adjacent or close to the page table walker.
25. A computing device comprising a processor according to any of claims 12-24.
26. A chip comprising the computing device of claim 25.
27. A board card comprising the chip of claim 26.
CN202310891442.9A 2023-07-19 2023-07-19 Address translation method executed by processor and related product Pending CN116860665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310891442.9A CN116860665A (en) 2023-07-19 2023-07-19 Address translation method executed by processor and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310891442.9A CN116860665A (en) 2023-07-19 2023-07-19 Address translation method executed by processor and related product

Publications (1)

Publication Number Publication Date
CN116860665A true CN116860665A (en) 2023-10-10

Family

ID=88226675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310891442.9A Pending CN116860665A (en) 2023-07-19 2023-07-19 Address translation method executed by processor and related product

Country Status (1)

Country Link
CN (1) CN116860665A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117171065A (en) * 2023-11-02 2023-12-05 摩尔线程智能科技(北京)有限责任公司 Address management method, address management device, electronic equipment and storage medium
CN117171065B (en) * 2023-11-02 2024-03-01 摩尔线程智能科技(北京)有限责任公司 Address management method, address management device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US10176099B2 (en) Using data pattern to mark cache lines as invalid
US7447868B2 (en) Using vector processors to accelerate cache lookups
US8832350B2 (en) Method and apparatus for efficient memory bank utilization in multi-threaded packet processors
US7089398B2 (en) Address translation using a page size tag
US10372618B2 (en) Apparatus and method for maintaining address translation data within an address translation cache
US8370575B2 (en) Optimized software cache lookup for SIMD architectures
US10579522B2 (en) Method and device for accessing a cache memory
JP2005174341A (en) Multi-level cache having overlapping congruence group of associativity set in various cache level
US20140189192A1 (en) Apparatus and method for a multiple page size translation lookaside buffer (tlb)
US20210089451A1 (en) Storage management apparatus, storage management method, processor, and computer system
US20140089595A1 (en) Utility and lifetime based cache replacement policy
WO2023108938A1 (en) Method and apparatus for solving address ambiguity problem of cache
WO2019019719A1 (en) Branch prediction method and apparatus
CN116860665A (en) Address translation method executed by processor and related product
WO2023055486A1 (en) Re-reference interval prediction (rrip) with pseudo-lru supplemental age information
JP2020514859A (en) Configurable skew associativity in translation lookaside buffers
CN105718386A (en) Local Page Translation and Permissions Storage for the Page Window in Program Memory Controller
CN114238167A (en) Information prefetching method, processor and electronic equipment
US9514047B2 (en) Apparatus and method to dynamically expand associativity of a cache memory
CN114925001A (en) Processor, page table prefetching method and electronic equipment
CN103150196A (en) Code Cache management method in dynamic binary translation
US20210117100A1 (en) Hybrid first-fit k-choice insertions for hash tables, hash sets, approximate set membership data structures, and caches
KR20220054389A (en) Optimizing Access to Page Table Entries on Processor-Based Devices
KR102356704B1 (en) Computing apparatus and method for processing operations thereof
CN115080464B (en) Data processing method and data processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination