CN113590508B

CN113590508B - Dynamic reconfigurable memory address mapping method and device

Info

Publication number: CN113590508B
Application number: CN202111155689.1A
Authority: CN
Inventors: Inventor not disclosed
Original assignee: Muxi Technology Beijing Co ltd
Current assignee: Muxi Technology Beijing Co ltd
Priority date: 2021-09-30
Filing date: 2021-09-30
Publication date: 2022-02-11
Anticipated expiration: 2041-09-30
Also published as: CN113590508A

Abstract

The invention provides a dynamically reconfigurable memory address mapping method and device, and relates to chip technology. The method comprises: acquiring configuration parameters of a chip and a memory concurrent access mode of a target application, and generating a memory address mapping relation of the target application based on the configuration parameters and the memory concurrent access mode; receiving an execution request of a user for the target application, and calling the memory address mapping relation corresponding to the target application according to the execution request; and dynamically configuring the memory subsystem of the chip according to the memory address mapping relation. This technical scheme can generate a memory address mapping relation tailored to a specific target application, so as to adapt to the more complex memory access modes found in application scenarios such as artificial intelligence and high-performance computing.

Description

Dynamic reconfigurable memory address mapping method and device

Technical Field

The present invention relates to chip technologies, and in particular, to a dynamically reconfigurable memory address mapping method and apparatus.

Background

High throughput computing chips such as GPU and AI chips need to be supported by a high bandwidth memory subsystem. The existing memory system has extremely high theoretical bandwidth. However, the applications actually running on the chip are subject to resource contention of the memory subsystem and contention among concurrent computing units, so that it is difficult to achieve the nominal theoretical bandwidth of the memory system, and extra power consumption overhead may be caused by improper resource usage.

In the prior art, in order to solve resource contention, a concurrent memory access request is dispersed to a memory resource capable of processing a plurality of memory requests in parallel by remapping a memory address.

However, the memory access modes of applications running on high-throughput computing chips such as GPUs and AI chips are complex and changeable, and the parameters of the memory subsystem also change with different configurations of the chip (such as virtualization), so a single address mapping function can hardly meet the requirements of all scenarios.

Disclosure of Invention

The embodiment of the invention provides a dynamic reconfigurable memory address mapping method and device, which can be combined with chip configuration parameters and target application to generate a corresponding memory address mapping relation so as to adapt to more complex memory access modes in application scenes such as artificial intelligence and high-performance calculation.

In a first aspect of the embodiments of the present invention, a dynamically reconfigurable memory address mapping method is provided, including:

acquiring configuration parameters of a chip and a memory concurrent access mode of a target application, and generating a memory address mapping relation of the target application based on the configuration parameters and the memory concurrent access mode;

receiving an execution request of a user for the target application, and calling the memory address mapping relation corresponding to the target application according to the execution request;

and dynamically configuring the memory subsystem of the chip according to the memory address mapping relation.

Optionally, in a possible implementation manner of the first aspect, the obtaining a memory concurrent access mode of a target application includes:

acquiring code information of the target application, and acquiring the memory concurrent access mode of the target application according to the code information.

Optionally, in a possible implementation manner of the first aspect, the obtaining a memory concurrent access mode of a target application includes:

acquiring the bit flipping rate of the memory access stream of the target application, and acquiring the memory concurrent access mode of the target application based on the bit flipping rate.

Optionally, in a possible implementation manner of the first aspect, the obtaining a bit flipping rate of a memory access stream of the target application includes:

running a device for recording the bit flipping rate on a simulator or on the chip;

and acquiring the bit flipping rate of the memory access stream of the target application based on the device, the bit flipping rate being acquired within a preset time period.

Optionally, in a possible implementation manner of the first aspect, the obtaining configuration parameters of a chip includes:

and acquiring the architecture parameters, the memory system parameters and the scheduling strategy of the chip.

Optionally, in a possible implementation manner of the first aspect, after generating the memory address mapping relationship of the target application based on the configuration parameter and the memory concurrent access mode, the method further includes:

and binding the target application and the memory address mapping relation based on a preset position.

Optionally, in a possible implementation manner of the first aspect, the preset position includes:

and the position of the metadata of the executable file corresponding to the target application or the position of the page table entry corresponding to each data block in the target application.

In a second aspect of the embodiments of the present invention, a dynamically reconfigurable memory address mapping apparatus is provided, including:

the mapping module is used for acquiring configuration parameters of a chip and a memory concurrent access mode of a target application, and generating a memory address mapping relation of the target application based on the configuration parameters and the memory concurrent access mode;

the calling module is used for receiving an execution request of a user for the target application and calling the memory address mapping relation corresponding to the target application according to the execution request;

and the execution module is used for dynamically configuring the memory subsystem of the chip according to the memory address mapping relation.

In a third aspect of the embodiments of the present invention, a dynamically reconfigurable memory address mapping device is provided, including: memory, a processor and a computer program, the computer program being stored in the memory, the processor running the computer program to perform the method of the first aspect of the invention as well as various possible aspects of the first aspect.

A fourth aspect of the embodiments of the present invention provides a readable storage medium, in which a computer program is stored, the computer program being, when executed by a processor, configured to implement the method according to the first aspect of the present invention and various possible aspects of the first aspect.

According to the dynamic reconfigurable memory address mapping method and device provided by the invention, the memory address mapping relation of the target application is generated through the configuration parameters of the chip and the memory concurrent access mode of the target application, and when the target application is executed subsequently, the memory subsystem of the chip is dynamically configured according to the memory address mapping relation, so that the corresponding memory address mapping relation can be generated by combining with the specific target application, and the dynamic reconfigurable memory address mapping method and device are suitable for more complex memory access modes in application scenes such as artificial intelligence, high-performance calculation and the like.

Drawings

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention.

Fig. 2 is a schematic flowchart of a dynamically reconfigurable memory address mapping method according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a column-first concurrent thread scheduling policy and a row-first concurrent thread scheduling policy applied to the same application, according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of how the memory concurrent access mode on a high-throughput computing chip is affected by the scheduling policy of the computing units, according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of an apparatus for recording a bit flip rate according to an embodiment of the present invention.

Fig. 6 is a schematic structural diagram of a dynamically reconfigurable memory address mapping apparatus according to an embodiment of the present invention.

Fig. 7 is a schematic diagram of the hardware structure of a dynamically reconfigurable memory address mapping device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.

It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that, in the present invention, "a plurality" means two or more. "And/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "Comprises A, B and C" and "comprises A, B, C" mean that all three of A, B and C are comprised; "comprises A, B or C" means that one of A, B and C is comprised; "comprises A, B and/or C" means that any one, any two, or all three of A, B and C are comprised.

It should be understood that in the present invention, "B corresponding to A", "A corresponds to B", or "B corresponds to A" means that B is associated with A and that B can be determined from A. Determining B from A does not mean that B is determined from A alone; B may also be determined from A and/or other information. The matching of A and B means that the similarity of A and B is greater than or equal to a preset threshold value.

As used herein, "if" may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context.

The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention. It depicts a prior-art DRAM memory subsystem and shows the 4-level organization of a common DRAM: channel (Channel), cluster (Bank), row (Row), and column (Column). The DRAM system illustrated in the figure has 32 channels, with 16 clusters per channel. Specific bit fields in a memory access address indicate the numbers of the channel, cluster, row, and column to be accessed. In existing memory systems, the number of resources at each level may differ from this example, and further levels may be introduced, which leads to more complicated resource contention; for example, HBM2 adds a bank group level between clusters and channels. To fully utilize the theoretical bandwidth provided by a memory system, the computing units need to generate enough parallel memory requests that can be processed concurrently. In practice, however, taking the DRAM system of fig. 1 as an example, two parallel memory accesses from the computing units may not be processed concurrently by the DRAM memory system because they contend for the I/O bus of the same channel, the line cache (row buffer) of the same cluster, and so on. Concurrent memory requests therefore need to access different channels, or different clusters of the same channel, as far as possible, which means that the bit fields corresponding to the channel and cluster in the addresses accessed by the computing units should vary as much as possible.

For applications running on a traditional computing chip such as a CPU, the concurrent memory access modes are mostly simple. For applications with high memory-level parallelism, the memory access addresses of the computing units typically increase sequentially, for example while traversing an array. Therefore, in order to use the DRAM memory subsystem of fig. 1 efficiently, a typical CPU may employ the memory address mapping scheme shown in fig. 1. Specifically, under a sequentially increasing access mode, the highest bit field changes least frequently and is defined as the row number, which reduces the performance and power consumption overhead of line cache updates; the middle bit field changes more frequently and is defined as the channel and cluster numbers, which exploits the parallelism of the memory subsystem as far as possible; the lowest bit field changes most frequently and is defined as the column number, which makes full use of the locality of the line cache.
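The row / channel-and-cluster / column layout described above can be sketched as a simple bit-field decoder. This is only an illustration: fig. 1 fixes 32 channels and 16 clusters per channel, while the column width and the order of channel and cluster inside the middle field are assumptions made here, and `decode_address` and `DramLocation` are hypothetical names.

```python
from typing import NamedTuple

# Field widths. Only the channel and cluster counts come from fig. 1;
# the column width and the channel/cluster order are assumptions.
COLUMN_BITS = 10   # assumed
CHANNEL_BITS = 5   # 32 channels
CLUSTER_BITS = 4   # 16 clusters (banks) per channel


class DramLocation(NamedTuple):
    row: int
    cluster: int
    channel: int
    column: int


def decode_address(addr: int) -> DramLocation:
    """Split an address into row | cluster | channel | column (high to low bits)."""
    column = addr & ((1 << COLUMN_BITS) - 1)
    addr >>= COLUMN_BITS
    channel = addr & ((1 << CHANNEL_BITS) - 1)
    addr >>= CHANNEL_BITS
    cluster = addr & ((1 << CLUSTER_BITS) - 1)
    addr >>= CLUSTER_BITS
    # Whatever remains above the middle field is the row number.
    return DramLocation(row=addr, cluster=cluster, channel=channel, column=column)
```

With these assumed widths, a sequential traversal in steps of 1024 addresses touches a different channel on each of 32 consecutive accesses while staying in row 0, which is the behaviour the scheme is designed to achieve.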

To solve the above technical problem, fig. 2 shows a flowchart of a dynamically reconfigurable memory address mapping method according to an embodiment of the present invention. The method shown in fig. 2 may be executed by a software and/or hardware device. The executing entity of the present application may include, but is not limited to, at least one of: user equipment, network equipment, etc. The user equipment may include, but is not limited to, a computer, a smart phone, a Personal Digital Assistant (PDA), the above-mentioned electronic equipment, and the like. The network device may include, but is not limited to, a single network server, a server group of multiple network servers, or a cloud of numerous computers or network servers based on cloud computing, where cloud computing is a type of distributed computing in which a cluster of loosely coupled computers forms a super virtual computer. The present embodiment does not limit this. The dynamically reconfigurable memory address mapping method includes steps S101 to S103, as follows:

S101, obtaining configuration parameters of a chip and a memory concurrent access mode of a target application, and generating a memory address mapping relation of the target application based on the configuration parameters and the memory concurrent access mode.

Specifically, according to the scheme, the configuration parameters of the chip are combined with the specific memory concurrent access mode of the target application to generate the memory address mapping relation corresponding to the target application. For example, the target application a corresponds to a memory address mapping relation 1, and the target application B corresponds to a memory address mapping relation 2.

It can be appreciated that the dynamically configurable address mapping schemes in the prior art target only graphics rendering scenarios rather than general high-throughput computing, and therefore cannot provide a method for selecting an address mapping scheme according to application characteristics. The present scheme can generate a memory address mapping relation for a specific target application, so as to adapt to the more complex memory access modes in application scenarios such as artificial intelligence and high-performance computing.

In practical applications, the configuration parameters of the chip may be architecture parameters, memory system parameters, and scheduling policies of the chip.

It will be appreciated that many parameters of the chip affect memory-level parallelism. For example, the tFAW parameter in the DDR and GDDR standards specifies that a memory device can update at most 4 different line caches within each time window of length tFAW; any line cache update requests beyond these 4 must wait in a queue. tFAW thus limits how much inter-cluster memory access parallelism can be exploited within a single channel. Such secondary parameters differ with the memory subsystem architecture and memory system parameters of the chip, which is why the architecture parameters of the chip are acquired.
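The queuing effect of tFAW can be illustrated with a small timing sketch. It is a deliberately simplified model, not a DRAM simulator: requests are served in order, every other timing constraint is ignored, time is in arbitrary units, and `issue_times` is a hypothetical helper name.

```python
def issue_times(request_times, t_faw, max_in_window=4):
    """Return the time at which each line cache update is actually issued.

    Simplified model: requests are served in arrival order, and the i-th
    update may not start earlier than t_faw after the (i - max_in_window)-th
    update, so at most max_in_window updates fall in any window of length t_faw.
    """
    issued = []
    for i, requested in enumerate(request_times):
        start = requested
        if issued:
            start = max(start, issued[-1])  # preserve request order
        if i >= max_in_window:
            start = max(start, issued[i - max_in_window] + t_faw)
        issued.append(start)
    return issued
```

In this model, six updates requested at time 0 with `t_faw=30` are issued at `[0, 0, 0, 0, 30, 30]`: the fifth and sixth wait for the window to slide, however many clusters are idle.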

It will also be appreciated that different scheduling policies of the high-throughput computing chip may change the memory access mode of an application. Fig. 3 shows one such case. Assume that an application has 16 threads, of which 4 can execute concurrently at a time, that each thread in turn accesses one row of data in memory, and that the memory system has 4 channels. The high-throughput computing chip in fig. 3 schedules the concurrent threads either row-first or column-first; in the thread name txy, x is the number of the memory row to be accessed by the thread and y is the number of the channel to be accessed. Fig. 4 shows the channel numbers accessed by each thread in the first round of execution under the two scheduling policies. If scheduling is row-first, the concurrent memory accesses of the 4 concurrent threads are distributed evenly over the 4 channels; if scheduling is column-first, the concurrent memory accesses of the 4 concurrent threads are all concentrated in the same channel, and because of contention for that channel the performance of the column-first scheduling policy is much lower than that of the row-first scheduling policy. A different scheduling policy therefore also changes the memory concurrent access mode of the target application.
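The 16-thread example can be reproduced in a few lines. The sketch below only enumerates which channels each group of 4 concurrent threads would touch under the two orders; the names `batches` and `channels_used` are illustrative and do not appear in the patent.

```python
ROWS, CHANNELS, CONCURRENCY = 4, 4, 4

# Thread t_xy accesses memory row x through channel y.
threads = [(x, y) for x in range(ROWS) for y in range(CHANNELS)]


def batches(order_key):
    """Group the threads into rounds of CONCURRENCY in the given scheduling order."""
    ordered = sorted(threads, key=order_key)
    return [ordered[i:i + CONCURRENCY] for i in range(0, len(ordered), CONCURRENCY)]


def channels_used(batch):
    """Distinct channels touched by one round of concurrent threads."""
    return sorted({channel for _, channel in batch})


row_first = batches(lambda t: (t[0], t[1]))     # t00, t01, t02, t03, t10, ...
column_first = batches(lambda t: (t[1], t[0]))  # t00, t10, t20, t30, t01, ...

print(channels_used(row_first[0]))     # channels hit in the first row-first round
print(channels_used(column_first[0]))  # channels hit in the first column-first round
```

The first row-first round spreads over all four channels, while the first column-first round lands entirely on channel 0, matching the contention described for fig. 4.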

Therefore, the present scheme generates the memory address mapping relation by combining both the configuration parameters of the chip and the memory concurrent access mode of the target application.

It should be noted that after the memory address mapping relationship of the target application is generated based on the configuration parameters and the concurrent memory access mode, the target application and the memory address mapping relationship also need to be bound based on a preset position. It can be understood that, in order to call the memory address mapping relationship when the target application is executed subsequently, the two need to be bound together.

In an actual application, the preset position may be a position corresponding to metadata of an executable file of the target application, or a position corresponding to a page table entry of each data block in the target application. It will be appreciated that each application may specify the memory address mapping it needs in a particular location, such as the metadata of its executable file. Meanwhile, each data block in an application may specify a memory address mapping relationship required by the data block in a specific location, for example, a page table entry corresponding to the data block.
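One possible way to picture the two binding positions is sketched below. All names (`AddressMapping`, `Executable`, `PageTableEntry`, `MAPPING_KEY`, `mapping_for`) are hypothetical, and the rule that a per-block binding takes precedence over the application-wide one is an assumption of this sketch, not something the text specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class AddressMapping:
    mapping_id: int
    description: str


@dataclass
class Executable:
    name: str
    metadata: Dict[str, int] = field(default_factory=dict)


@dataclass
class PageTableEntry:
    physical_page: int
    mapping_id: Optional[int] = None  # per-data-block binding, if any


MAPPING_KEY = "address_mapping_id"  # hypothetical metadata key


def bind_application(exe: Executable, mapping: AddressMapping) -> None:
    """Bind a mapping to the whole application via its executable metadata."""
    exe.metadata[MAPPING_KEY] = mapping.mapping_id


def bind_data_block(entry: PageTableEntry, mapping: AddressMapping) -> None:
    """Bind a mapping to a single data block via its page table entry."""
    entry.mapping_id = mapping.mapping_id


def mapping_for(exe: Executable, entry: PageTableEntry,
                table: Dict[int, AddressMapping]) -> AddressMapping:
    """Look up the mapping to use (assumed: block binding wins over application binding)."""
    if entry.mapping_id is not None:
        return table[entry.mapping_id]
    return table[exe.metadata[MAPPING_KEY]]
```

At execution time, a lookup of this kind is what lets the mapping generated in step S101 be retrieved again for the target application.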

S102, receiving an execution request of a user to the target application, and calling the memory address mapping relation corresponding to the target application according to the execution request.

Specifically, in this step, when the target application is executed, the main controller detects an execution request for the target application, and since the memory address mapping relationship corresponding to the target application is already generated in step S101, the target application may call the memory address mapping relationship to execute the operation in step S103.

S103, dynamically configuring the memory subsystem of the chip according to the memory address mapping relation.

It can be understood that, when the high throughput computing chip executes the target application, the memory subsystem of the chip is dynamically configured according to the selected memory address mapping relationship, and the data stored in the memory subsystem is directly migrated and configured.

It should be noted that, although memory mapping schemes also exist in prior-art memory subsystems, switching between them is static. In a conventional computing chip such as a CPU, data needs to be retained in the memory subsystem for a long time; if the memory mapping scheme were switched dynamically, the storage address of the same data would differ before and after the switch, rendering the data invalid. The existing schemes can therefore switch the memory mapping scheme only when the chip is restarted, that is, they support only static switching.

It should be further noted that, compared with the prior art, in the high throughput chip of this scheme, data in the memory subsystem may be frequently migrated between the memory subsystem and the external storage under the control of one host controller, and the migration process is completely controlled by the host controller, so that the memory mapping scheme of the data block may be switched each time a data block is migrated into the memory system, thereby improving the performance of the high throughput chip and reducing the power consumption thereof.
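The reason a mapping can be switched safely at migration time can be shown with a toy model. Everything here is assumed for illustration: a 4-channel memory modelled as a dictionary, two example mappings, a hypothetical `HostController` class, and only one resident block at a time so that placements never collide.

```python
from typing import Callable, Dict, List, Tuple

CHANNELS = 4
WORDS_PER_CHANNEL = 8  # assumed tiny memory, for illustration only

Location = Tuple[int, int]           # (channel, offset within channel)
Mapping = Callable[[int], Location]  # logical word address -> location


def channel_interleaved(addr: int) -> Location:
    """Low address bits select the channel."""
    return addr % CHANNELS, addr // CHANNELS


def channel_blocked(addr: int) -> Location:
    """High address bits select the channel."""
    return addr // WORDS_PER_CHANNEL, addr % WORDS_PER_CHANNEL


class HostController:
    """Hypothetical host controller that performs every migration."""

    def __init__(self) -> None:
        self.memory: Dict[Location, int] = {}
        self.resident: Dict[str, Tuple[int, Mapping]] = {}  # block -> (length, mapping)

    def migrate_in(self, block: str, data: List[int], mapping: Mapping) -> None:
        # Data is placed with the mapping chosen for this migration.
        for addr, value in enumerate(data):
            self.memory[mapping(addr)] = value
        self.resident[block] = (len(data), mapping)

    def read(self, block: str, addr: int) -> int:
        _, mapping = self.resident[block]
        return self.memory[mapping(addr)]

    def migrate_out(self, block: str) -> List[int]:
        length, mapping = self.resident.pop(block)
        return [self.memory.pop(mapping(addr)) for addr in range(length)]
```

Because the block is written afresh on every `migrate_in`, the same logical address reads back the same value even though its physical location changed between the two mappings; no resident data is invalidated by the switch.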

Based on the foregoing embodiment, obtaining the memory concurrent access mode of the target application in step S101 may be specifically implemented as follows:

In some embodiments, obtaining the memory concurrent access mode of the target application includes static analysis and dynamic analysis, where static analysis is suited to regular applications and dynamic analysis is suited to irregular applications, so as to adapt to the more complex memory access modes found in application scenarios such as artificial intelligence and high-performance computing.

Static analysis:

It is understood that static analysis, by analyzing the source code of the target application, can identify fixed-stride accesses and known specific memory access patterns such as space-filling curves (e.g., Z-Morton, Hilbert).
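As an example of a known pattern that static analysis could recognize in source code, a Z-Morton index interleaves the bits of two coordinates. The sketch below is the standard textbook construction, not code from the patent, and `morton_encode_2d` is an illustrative name.

```python
def morton_encode_2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a Z-order (Morton) index.

    Bit i of x goes to bit 2*i of the result, bit i of y to bit 2*i + 1.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code
```

An application that indexes its data this way changes address bits in a fixed, predictable interleaved order, so the bit fields that vary most under concurrent access can be read off without running the program.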

Dynamic analysis:

It should be noted that, for an irregular application and an application whose memory access mode is sensitive to a specific scheduling policy, the bit flipping rate of the memory access stream of the target application may be obtained, and then the memory concurrent access mode of the target application is obtained according to the bit flipping rate.

In practical applications, the device for recording the bit flipping rate shown in fig. 5 may be used on a simulator or an actual chip to count the bit flipping rate of the target application's memory access stream over a certain time period (a preset time period), so as to uncover concurrent memory access modes in the target application that are difficult to find by conventional static analysis.

It should be noted that, in order to reduce the data amount of dynamic analysis, the present solution may perform sampling analysis on a segment of a target application, that is, analyze data within the preset time period, and need not analyze all data, thereby reducing power consumption.
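A software analogue of the bit flipping rate measurement is sketched below. The recording device of fig. 5 is not described in detail, so this is an assumption-laden illustration: the rate of a bit is taken to be the fraction of consecutive accesses in which that bit changes, and picking the most active bits as channel/cluster candidates is a heuristic suggested by the earlier discussion rather than a rule stated in the text. Both function names are hypothetical.

```python
def bit_flip_rates(addresses, address_bits=32):
    """Per-bit flip rate: fraction of consecutive address pairs in which the bit toggles."""
    flips = [0] * address_bits
    for previous, current in zip(addresses, addresses[1:]):
        changed = previous ^ current
        for bit in range(address_bits):
            flips[bit] += (changed >> bit) & 1
    transitions = max(len(addresses) - 1, 1)  # avoid division by zero
    return [count / transitions for count in flips]


def most_active_bits(rates, count):
    """Assumed heuristic: the `count` bits with the highest flip rate, lowest index first on ties."""
    ranked = sorted(range(len(rates)), key=lambda bit: (-rates[bit], bit))
    return sorted(ranked[:count])
```

Sampling a preset time period corresponds to passing only a slice of the recorded stream, for example `bit_flip_rates(stream[start:end])`, instead of the full trace.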

Referring to fig. 6, which is a schematic structural diagram of a dynamically reconfigurable memory address mapping apparatus according to an embodiment of the present invention, the dynamically reconfigurable memory address mapping apparatus 60 includes:

the mapping module 61 is configured to obtain configuration parameters of a chip and a memory concurrent access mode of a target application, and generate a memory address mapping relationship of the target application based on the configuration parameters and the memory concurrent access mode;

a calling module 62, configured to receive an execution request of the target application from a user, and call the memory address mapping relationship corresponding to the target application according to the execution request;

and the execution module 63 is configured to dynamically configure the memory subsystem of the chip according to the memory address mapping relationship.

The apparatus in the embodiment shown in fig. 6 can be correspondingly used to perform the steps in the method embodiment shown in fig. 2, and the implementation principle and technical effect are similar, which are not described herein again.

Referring to fig. 7, which is a schematic diagram of a hardware structure of a dynamically reconfigurable memory address mapping device according to an embodiment of the present invention, the dynamically reconfigurable memory address mapping device 70 includes: a processor 71, a memory 72 and computer programs; wherein

A memory 72 for storing the computer program, which may also be a flash memory (flash). The computer program is, for example, an application program, a functional module, or the like that implements the above method.

A processor 71 for executing the computer program stored in the memory to implement the steps performed by the apparatus in the above method. Reference may be made in particular to the description relating to the preceding method embodiment.

Alternatively, the memory 72 may be separate or integrated with the processor 71.

When the memory 72 is a device separate from the processor 71, the apparatus may further include:

a bus 73 for connecting the memory 72 and the processor 71.

The present invention also provides a readable storage medium, in which a computer program is stored, which, when being executed by a processor, is adapted to implement the methods provided by the various embodiments described above.

The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates transfer of a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application-Specific Integrated Circuit (ASIC). Additionally, the ASIC may reside in user equipment. Of course, the processor and the readable storage medium may also reside as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

The present invention also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the device may read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the device to implement the methods provided by the various embodiments described above.

In the above embodiments of the apparatus, it should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A dynamically reconfigurable memory address mapping method, comprising:

obtaining configuration parameters of a chip and a memory concurrent access mode of a target application, and generating a memory address mapping relation of the target application based on the configuration parameters and the memory concurrent access mode, wherein the configuration parameters are parameters affecting memory-level parallelism, and obtaining the memory concurrent access mode comprises static analysis and dynamic analysis, the static analysis being directed to regular applications and the dynamic analysis being directed to irregular applications;

2. The method of claim 1, wherein obtaining the memory concurrent access pattern of the target application comprises:

3. The method of claim 1 or 2, wherein obtaining the memory concurrent access pattern of the target application comprises:

4. The method of claim 3, wherein obtaining the bit flipping rate of the memory access stream of the target application comprises:

acquiring the bit flipping rate of the memory access stream of the target application based on the device; and acquiring the bit flipping rate of the memory access stream of the target application in a preset time period.

5. The method of claim 1, wherein the obtaining configuration parameters of the chip comprises:

6. The method of claim 1, after generating the memory address mapping relationship for the target application based on the configuration parameters and the concurrent memory access mode, further comprising:

7. The method of claim 6, wherein the preset position comprises:

8. A dynamically reconfigurable memory address mapping device, comprising:

a mapping module, configured to obtain configuration parameters of a chip and a memory concurrent access mode of a target application, and to generate a memory address mapping relation of the target application based on the configuration parameters and the memory concurrent access mode, wherein the configuration parameters are parameters affecting memory-level parallelism, and obtaining the memory concurrent access mode comprises static analysis and dynamic analysis, the static analysis being directed to regular applications and the dynamic analysis being directed to irregular applications;

9. A dynamically reconfigurable memory address mapping device, comprising: memory, a processor and a computer program, the computer program being stored in the memory, the processor running the computer program to perform the method of any of claims 1 to 7.

10. A readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.