CN115033184A - Memory access processing device and method, processor, chip, board card and electronic equipment - Google Patents


Info

Publication number
CN115033184A
Authority
CN
China
Prior art keywords
memory
memory access
access request
unit
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210772819.4A
Other languages
Chinese (zh)
Inventor
李越
许巍瀚
王文强
徐宁仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Power Tensors Intelligent Technology Co Ltd
Original Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Power Tensors Intelligent Technology Co Ltd filed Critical Shanghai Power Tensors Intelligent Technology Co Ltd
Priority to CN202210772819.4A
Publication of CN115033184A
Legal status: Pending

Classifications

    • G06F3/061 Improving I/O performance
    • G06F15/7803 System on board, i.e. computer system on one or more PCBs, e.g. motherboards, daughterboards or blades
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F3/0658 Controller construction arrangements
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F3/0683 Plurality of storage devices
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Multi Processors (AREA)

Abstract

The embodiments of the present disclosure provide a memory access processing apparatus and method, a processor, a chip, a board card, and an electronic device. The memory access processing apparatus processes multiple memory access requests to a memory in parallel, where the memory comprises multiple storage areas and each storage area comprises multiple storage units. The apparatus comprises: an arbitration unit, configured to determine multiple first memory access requests that access the same target storage area, where different first memory access requests access different target storage units within the target storage area; a control unit, configured to request from the memory the data of every storage unit in the target storage area; and a data return unit, configured to return, for each first memory access request, the data of its corresponding target storage unit after the memory has returned the data of every storage unit in the target storage area. The embodiments of the present disclosure improve the efficiency with which multiple parallel memory access requests access the memory.

Description

Memory access processing device and method, processor, chip, board card and electronic equipment
Technical Field
The present disclosure relates to the field of chip technologies, and in particular, to a memory access processing apparatus and method, a processor, a chip, a board card, and an electronic device.
Background
With the rapid development of technologies such as artificial intelligence, the data processing tasks that computing systems must undertake have grown heavier, placing higher demands on high-performance computing. To improve processing efficiency, many processors introduce hardware multithreading. For example, a Graphics Processing Unit (GPU) may schedule multiple threads into a thread group, and the threads in the group complete an overall computing task in parallel. Many computing tasks must access memory to read and write data; that is, they are completed through the threads' memory access requests to the memory, so the design of the memory access mechanism is particularly important. For example, for an external memory with large capacity but low bandwidth, when the access requests of multiple parallel threads reach the memory and the addresses they access span a wide range, the memory must respond to the threads' requests across long address strides, which greatly increases data transfer latency and lowers response efficiency.
Disclosure of Invention
In a first aspect, an embodiment of the present disclosure provides a memory access processing apparatus, configured to process multiple memory access requests to a memory in parallel, where the memory includes multiple storage areas and each storage area includes multiple storage units; the memory access processing apparatus comprises:
an arbitration unit, configured to determine multiple first memory access requests that access the same target storage area, where different first memory access requests access different target storage units within the target storage area;
a control unit, configured to request from the memory the data of every storage unit in the target storage area;
and a data return unit, configured to return, for each first memory access request, the data of its corresponding target storage unit after the memory has returned the data of every storage unit in the target storage area.
In the embodiments of the present disclosure, for multiple parallel memory access requests, the arbitration unit gathers multiple first memory access requests that access the same target storage area and requests data from the memory at storage-area granularity. The memory therefore only needs to operate on a storage area with consecutive addresses and return the data of every storage unit in that area; the data return unit then distributes the returned data, which improves the execution efficiency of the multiple first memory access requests.
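The read path described above can be sketched as a small Python model, assuming, for illustration only, a fixed number of units per storage area and simple request objects (`Request`, `UNITS_PER_AREA`, and all function names are assumptions, not terms from the patent):

```python
# Hypothetical model of the read path: requests targeting the same storage
# area are gathered, the whole area is fetched from memory once, and the
# data-return step hands each request the data of its own target unit.
from collections import defaultdict

UNITS_PER_AREA = 4  # assumed number of storage units per storage area

class Request:
    def __init__(self, thread_id, area_id, unit_id):
        self.thread_id = thread_id
        self.area_id = area_id   # first identification information
        self.unit_id = unit_id   # second identification information

def arbitrate(requests):
    """Arbitration unit: group parallel requests by the storage area they access."""
    groups = defaultdict(list)
    for r in requests:
        groups[r.area_id].append(r)
    return groups

def read_area(memory, area_id):
    """Control unit: request the data of *all* units of the target area at once."""
    base = area_id * UNITS_PER_AREA
    return memory[base:base + UNITS_PER_AREA]

def return_data(area_data, first_requests):
    """Data return unit: give each first request the unit it asked for."""
    return {r.thread_id: area_data[r.unit_id] for r in first_requests}

memory = list(range(16))  # 4 areas x 4 units; unit at index i holds value i
reqs = [Request(0, 2, 1), Request(1, 2, 3), Request(2, 2, 0)]
groups = arbitrate(reqs)
out = return_data(read_area(memory, 2), groups[2])
# thread 0 gets unit 1 of area 2, thread 1 gets unit 3, thread 2 gets unit 0
```

The point of the sketch is that `read_area` touches only one run of consecutive addresses, however many threads are served.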
Optionally, each memory access request carries first identification information of the storage area it accesses and second identification information of the storage unit it accesses within that storage area;
the arbitration unit is configured to determine the multiple first memory access requests based on the first identification information;
and the data return unit is configured to determine the target storage unit accessed by each first memory access request based on the second identification information.
Optionally, at least one of the first memory access requests is obtained by merging second memory access requests that access the same storage unit in the same storage area;
the arbitration unit is configured to send each first memory access request to the control unit through multiple instruction channels, where each instruction channel corresponds to one storage unit in the storage area and is used to send the first memory access requests that access the storage unit corresponding to that channel.
Optionally, at least one of the first memory access requests and/or at least one of the second memory access requests is obtained by splitting a memory access request that accesses multiple consecutive storage units.
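The splitting and merging steps above can be illustrated with a minimal sketch (the tuple layout and function names are assumptions made for this example):

```python
# Illustrative preprocessing: a request spanning several consecutive units is
# split into single-unit requests, and requests from different threads that
# hit the same (area, unit) are merged into one first request that remembers
# every requesting thread.
from collections import defaultdict

def split(thread_id, area_id, first_unit, count):
    """Split a request for `count` consecutive units into per-unit requests."""
    return [(thread_id, area_id, first_unit + i) for i in range(count)]

def merge(per_unit_requests):
    """Merge requests to the same (area, unit); the value lists the threads."""
    merged = defaultdict(list)
    for thread_id, area_id, unit_id in per_unit_requests:
        merged[(area_id, unit_id)].append(thread_id)
    return dict(merged)

flat = split(0, 1, 2, 2) + split(1, 1, 3, 1)  # threads 0 and 1 overlap on unit 3
merged = merge(flat)
# {(1, 2): [0], (1, 3): [0, 1]} -> one memory transaction per distinct unit
```

After merging, each distinct unit in the area is requested once, regardless of how many threads asked for it.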
Optionally, the apparatus further comprises an allocation unit, configured to:
add indication information to each first memory access request when the multiple first memory access requests satisfy preset conditions, and send each first memory access request carrying the indication information to the control unit; the indication information indicates that the first memory access request satisfies the preset conditions, and the preset conditions include: each of the multiple first memory access requests accesses one storage unit, and the thread sending each first memory access request and the storage unit accessed by that request satisfy a preset correspondence;
and the data return unit is configured to, after the memory returns the data of every storage unit in the target storage area, return to each first memory access request the data of its corresponding target storage unit based on the preset correspondence.
Optionally, each memory access request includes bypass information, where the bypass information includes a first correspondence between the thread that sends the memory access request and the storage unit accessed by the request; the memory access processing apparatus further comprises an allocation unit and a bypass storage unit:
the allocation unit is configured to extract the bypass information from each first memory access request and send it to the bypass storage unit for storage;
and the data return unit is configured to, after the memory returns the data of every storage unit in the target storage area, fetch the bypass information from the bypass storage unit and, based on the fetched bypass information, return to each first memory access request the data of its corresponding target storage unit.
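A minimal model of this bypass path might look as follows, assuming a FIFO bypass store and dictionary-shaped requests (both are illustrative choices, not details from the patent):

```python
# The allocation unit strips the thread-to-unit mapping (bypass information)
# out of each first request before it goes toward memory, parks it in a
# bypass store, and the data-return step reads it back when the area's data
# arrives, so the requests sent to memory can stay slim.
from collections import deque

bypass_store = deque()  # FIFO bypass storage unit

def dispatch(first_requests):
    """Extract bypass info (thread -> unit) and forward slimmed requests."""
    bypass_store.append({r["thread"]: r["unit"] for r in first_requests})
    return [{"area": r["area"], "unit": r["unit"]} for r in first_requests]

def on_data_return(area_data):
    """Use the stored bypass info to route each unit's data to its thread."""
    mapping = bypass_store.popleft()
    return {thread: area_data[unit] for thread, unit in mapping.items()}

reqs = [{"thread": 0, "area": 5, "unit": 2}, {"thread": 1, "area": 5, "unit": 0}]
dispatch(reqs)
result = on_data_return(["a", "b", "c", "d"])  # data of the 4 units in area 5
```

The FIFO ordering assumes area reads complete in issue order; a real design could key the store by a transaction ID instead.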
Optionally, in the case where a first memory access request is obtained by merging multiple second memory access requests that access the same storage unit in the same storage area, the first correspondence of that first memory access request includes: the correspondence between the thread of each second memory access request and the target storage unit accessed by the first memory access request;
in the case where a first memory access request is obtained by splitting a third memory access request that accesses multiple consecutive storage units, the first correspondence of that first memory access request includes: the correspondence between the thread that sent the third memory access request and the storage unit accessed by the first memory access request.
Optionally, the data return unit is further configured to:
acquire a code corresponding to each first memory access request, where the code corresponding to a first memory access request is used to determine the target storage unit in the target storage area accessed by that request, and the number of bits in the code is less than the total number of storage units in the target storage area;
and return, to each first memory access request, the data of the target storage unit it accesses, based on the code corresponding to that request.
Optionally, the length of the code is determined based on the logarithm (base 2) of the total number of storage units in the target storage area.
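Concretely, a binary index only needs ceil(log2(N)) bits to name one of N units, which is what keeps the code shorter than one bit per unit (the helper name below is an assumption for illustration):

```python
# Bits needed to index one unit out of units_per_area: a one-hot scheme
# would need units_per_area bits, while a binary code needs only
# ceil(log2(units_per_area)) bits.
import math

def code_bits(units_per_area):
    """Length of the per-request code, in bits."""
    return max(1, math.ceil(math.log2(units_per_area)))

assert code_bits(16) == 4   # 16 units: 4-bit code instead of 16-bit one-hot
assert code_bits(4) == 2
```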
Optionally, the allocation unit is further configured to:
send the code corresponding to each first memory access request to the bypass storage unit for storage.
Optionally, the control unit is configured to:
acquire multiple fourth memory access requests, where each of the fourth memory access requests is a write request and carries first data to be written into a target storage unit;
and request the memory to access the target storage area, so as to write the first data carried by the multiple fourth memory access requests into the storage units in the target storage area.
Optionally, the apparatus further includes an allocation unit and a bypass storage unit;
the allocation unit is configured to extract the first data from each fourth memory access request and send it to the bypass storage unit for storage;
and the control unit is configured to, after acquiring the first data from the bypass storage unit, write the first data carried by the multiple fourth memory access requests into the storage units in the target storage area.
Optionally, the arbitration unit is configured to:
acquire multiple memory access request groups, where each group includes multiple first memory access requests and the first memory access requests in different groups access different target storage areas;
and arbitrate the priority of the memory access request groups, so that the control unit requests from the memory the target storage area accessed by the first memory access requests in the group with the highest priority.
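One way this group arbitration could be realized is an age-based policy, sketched below; the patent only requires that the highest-priority group be served first, so the specific policy and data layout here are assumptions:

```python
# Group arbitration sketch: each group of first requests targets one storage
# area; the group that has waited longest wins, and its area is the next one
# issued to memory. (Age-based priority is one possible policy.)
def arbitrate_groups(groups):
    """groups: list of (wait_cycles, area_id, requests). Oldest group wins."""
    return max(groups, key=lambda g: g[0])

groups = [(3, 0, ["r0", "r1"]), (7, 2, ["r2"]), (5, 1, ["r3", "r4"])]
winner = arbitrate_groups(groups)
# the group for area 2 has waited 7 cycles and is issued first
```

Age-based selection avoids starving a group whose area is rarely the most popular; round-robin or fixed priority would fit the same interface.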
In a second aspect, an embodiment of the present disclosure provides a memory access processing apparatus, configured to process multiple memory access requests to a memory in parallel, where the memory includes multiple storage areas and each storage area includes multiple storage units; the memory access processing apparatus comprises:
an arbitration unit, configured to determine multiple first memory access requests that access the same target storage area, where different first memory access requests access different target storage units within the target storage area;
and a control unit, configured to request the memory to access the target storage area, so as to write the first data carried by each first memory access request into the target storage unit in the target storage area.
In the embodiments of the present disclosure, for multiple parallel memory access requests, the arbitration unit gathers multiple first memory access requests that access the same target storage area, so that the control unit can request the memory to access the target storage area and write the first data carried by the multiple first memory access requests into the storage units in that area. Because the multiple first memory access requests all access the same target storage area, the memory does not need to stride across long addresses to write the data, which improves the memory's access efficiency.
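The write path can be modeled the same way as the read path, again assuming a fixed unit count per area (the names below are illustrative, not from the patent):

```python
# Hypothetical write-path model: first requests targeting the same storage
# area are applied in one pass over that area, so the memory never strides
# across distant addresses between individual writes.
UNITS_PER_AREA = 4  # assumed number of storage units per storage area

def write_area(memory, area_id, writes):
    """writes: {unit_id: data}; apply all of them within one area access."""
    base = area_id * UNITS_PER_AREA
    for unit_id, data in writes.items():
        memory[base + unit_id] = data

memory = [0] * 8  # 2 areas x 4 units
write_area(memory, 1, {0: 10, 2: 30})  # two threads' writes, one area access
```

All addresses touched by `write_area` lie inside one consecutive region, which is the property the claim relies on.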
Optionally, each memory access request carries first identification information of the storage area it accesses and second identification information of the storage unit it accesses within that storage area;
the arbitration unit is configured to determine the multiple first memory access requests based on the first identification information;
and the data return unit is configured to determine the target storage unit accessed by each first memory access request based on the second identification information.
Optionally, at least one of the first memory access requests is obtained by splitting a memory access request that accesses multiple consecutive storage units.
Optionally, the arbitration unit is configured to:
acquire multiple memory access request groups, where each group includes multiple first memory access requests and the first memory access requests in different groups access different target storage areas;
and arbitrate the priority of the memory access request groups, so that the control unit requests from the memory the target storage area accessed by the first memory access requests in the group with the highest priority.
In a third aspect, an embodiment of the present disclosure provides a processor, where the processor includes the memory access processing apparatus according to any embodiment of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure provides a chip including the processor according to any one of the embodiments of the present disclosure.
In a fifth aspect, an embodiment of the present disclosure provides a board card, where the board card includes a package structure in which at least one chip according to an embodiment of the present disclosure is packaged.
In a sixth aspect, an embodiment of the present disclosure provides an electronic device, where the electronic device includes the chip or the board card according to any embodiment of the present disclosure.
In a seventh aspect, an embodiment of the present disclosure provides a memory access processing method, used to process multiple memory access requests to a memory in parallel, where the memory includes multiple storage areas and each storage area includes multiple storage units; the method includes the following steps:
determining multiple first memory access requests that access the same target storage area, where different first memory access requests access different target storage units within the target storage area;
requesting from the memory the data of every storage unit in the target storage area;
and after the memory returns the data of every storage unit in the target storage area, returning the data of the corresponding target storage unit for each first memory access request.
Optionally, each memory access request carries first identification information of the storage area it accesses and second identification information of the storage unit it accesses within that storage area;
the method further includes:
determining, by the arbitration unit, the multiple first memory access requests based on the first identification information;
and determining, by the data return unit, the target storage unit accessed by each first memory access request based on the second identification information.
Optionally, at least one of the first memory access requests is obtained by merging second memory access requests that access the same storage unit in the same storage area;
the method further includes: sending, by the arbitration unit, each first memory access request to the control unit through multiple instruction channels, where each instruction channel corresponds to one storage unit in the storage area and is used to send the first memory access requests that access the storage unit corresponding to that channel.
Optionally, at least one of the first memory access requests and/or at least one of the second memory access requests is obtained by splitting a memory access request that accesses multiple consecutive storage units.
Optionally, the method further includes:
adding, by the allocation unit, indication information to each first memory access request when the multiple first memory access requests satisfy preset conditions, and sending each first memory access request carrying the indication information to the control unit; the indication information indicates that the first memory access request satisfies the preset conditions, and the preset conditions include: each of the multiple first memory access requests accesses one storage unit, and the thread sending each first memory access request and the storage unit accessed by that request satisfy a preset correspondence;
and after the memory returns the data of every storage unit in the target storage area, returning, by the data return unit and based on the preset correspondence, the data of the corresponding target storage unit to each first memory access request.
Optionally, each memory access request includes bypass information, where the bypass information includes a first correspondence between the thread that sends the memory access request and the storage unit accessed by the request; the method further includes:
extracting, by the allocation unit, the bypass information from each first memory access request and sending it to a bypass storage unit for storage;
and after the memory returns the data of every storage unit in the target storage area, fetching, by the data return unit, the bypass information from the bypass storage unit, and returning, based on the fetched bypass information, the data of the corresponding target storage unit to each first memory access request.
Optionally, in the case where a first memory access request is obtained by merging multiple second memory access requests that access the same storage unit in the same storage area, the first correspondence of that first memory access request includes: the correspondence between the thread of each second memory access request and the target storage unit accessed by the first memory access request;
in the case where a first memory access request is obtained by splitting a third memory access request that accesses multiple consecutive storage units, the first correspondence of that first memory access request includes: the correspondence between the thread that sent the third memory access request and the storage unit accessed by the first memory access request.
Optionally, the method further includes:
acquiring, by the data return unit, a code corresponding to each first memory access request, where the code corresponding to a first memory access request is used to determine the target storage unit in the target storage area accessed by that request, and the number of bits in the code is less than the total number of storage units in the target storage area;
and returning, to each first memory access request, the data of the target storage unit it accesses, based on the code corresponding to that request.
Optionally, the length of the code is determined based on the logarithm (base 2) of the total number of storage units in the target storage area.
Optionally, the method further includes: sending, by the allocation unit, the code corresponding to each first memory access request to the bypass storage unit for storage.
Optionally, the method further includes:
acquiring, by the control unit, multiple fourth memory access requests, where each of the fourth memory access requests is a write request and carries first data to be written into a target storage unit; and requesting the memory to access the target storage area, so as to write the first data carried by the multiple fourth memory access requests into the storage units in the target storage area.
Optionally, the method further includes:
the allocation unit extracts the first data from each fourth memory access request and sends it to the bypass storage unit for storage;
and after acquiring the first data from the bypass storage unit, the control unit writes the first data carried by the multiple fourth memory access requests into the storage units in the target storage area.
Optionally, the method further includes:
acquiring, by the arbitration unit, multiple memory access request groups, where each group includes multiple first memory access requests and the first memory access requests in different groups access different target storage areas;
and arbitrating the priority of the memory access request groups, so that the control unit requests from the memory the target storage area accessed by the first memory access requests in the group with the highest priority.
In an eighth aspect, an embodiment of the present disclosure provides a memory access processing method, used to process multiple memory access requests to a memory in parallel, where the memory includes multiple storage areas and each storage area includes multiple storage units; the method includes the following steps:
determining multiple first memory access requests that access the same target storage area, where different first memory access requests access different target storage units within the target storage area;
and requesting the memory to access the target storage area, so as to write the first data carried by each first memory access request into the target storage unit in the target storage area.
Optionally, each memory access request carries first identification information of the storage area it accesses and second identification information of the storage unit it accesses within that storage area; the method further includes:
determining, by the arbitration unit, the multiple first memory access requests based on the first identification information;
and determining, by the data return unit, the target storage unit accessed by each first memory access request based on the second identification information.
Optionally, at least one of the first memory access requests is obtained by splitting a memory access request that accesses multiple consecutive storage units.
Optionally, the method further includes:
acquiring, by the arbitration unit, multiple memory access request groups, where each group includes multiple first memory access requests and the first memory access requests in different groups access different target storage areas; and arbitrating the priority of the memory access request groups, so that the control unit requests from the memory the target storage area accessed by the first memory access requests in the group with the highest priority.
In a ninth aspect, the embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the embodiments of the present disclosure.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a memory partitioning method according to an embodiment of the disclosure.
Fig. 2 is a schematic structural diagram of a memory access processing apparatus according to an embodiment of the present disclosure.
Fig. 3A is a schematic diagram of a logical address of an embodiment of the present disclosure.
Fig. 3B is a schematic diagram of two memory cells of an embodiment of the disclosure.
FIG. 3C is a schematic diagram of thread processing according to an embodiment of the disclosure.
Fig. 3D is a schematic diagram of the processing of memory access requests of an embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of a processor according to an embodiment of the disclosure.
Fig. 5 is a schematic structural diagram of another memory access processing apparatus according to an embodiment of the present disclosure.
Fig. 6 is a schematic block diagram of another processor according to an embodiment of the disclosure.
Fig. 7 is a schematic diagram of a chip of an embodiment of the disclosure.
Fig. 8 is a schematic diagram of a board card according to an embodiment of the disclosure.
Fig. 9A and 9B are schematic views of an electronic device of an embodiment of the disclosure.
Fig. 10 is a flowchart of a data processing method of an embodiment of the present disclosure.
Fig. 11 is a flow chart of a data processing method of an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "at … …" or "when … …" or "in response to a determination," depending on the context.
In order to make the technical solutions in the embodiments of the present disclosure better understood and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
To achieve high-speed computation, many processors introduce hardware multithreading techniques. Taking a GPU as an example, some GPUs include multiple SMs (Streaming Multiprocessors); the execution unit in an SM is the warp (thread group), and one warp can schedule multiple threads (Thread). One thread, i.e., one instruction, is the smallest unit of execution for processor operations, and groups of threads can be executed in parallel with the support of hardware resources. One warp occupies one SM to run, and multiple warps take turns entering the SM. It is common to group 32 threads into one warp.
The memory access behavior of the threads in a thread group includes accessing the memory to read data or accessing the memory to write data, and the latency associated with accessing memory is becoming one of the bottlenecks in computing systems. The problem is particularly prominent for an external memory, which is not on the same chip as the processor and is characterized by large capacity and low bandwidth. Therefore, in chip designs related to image processing, high-performance computing, and the like, it is important to design the mechanism by which a thread group accesses the memory.
As shown in fig. 1, the present embodiment divides the memory 10 into a plurality of memory areas, indicated by dashed boxes in fig. 1 and shown as memory areas 100 to 10i; each memory area comprises M memory cells (Banks), where the specific value of M can be determined according to the number N of threads schedulable by a warp; fig. 1 takes 32 Banks as an example. To maximize efficiency and make efficient use of hardware resources, M may be equal to N; of course, M greater than N is also optional.
A memory such as an external memory has the characteristics of large capacity, low bandwidth, and limited access paths. When N threads access the memory, the address span accessed by some threads may be large: fig. 1 shows 3 parallel memory access requests, which need to access positions P1, P2, and P3 of the memory, respectively. Since the distance between the 3 positions is long and the address span is large, the data transmission delay of the memory increases and the response efficiency of the memory access requests is low.
Based on this, the disclosed embodiment provides a memory access processing apparatus. As shown in fig. 2, the memory access processing apparatus is configured to process, in parallel, a plurality of memory access requests to a memory 10; the memory is divided into a plurality of memory areas, and each memory area includes a plurality of memory units;
the memory access processing device comprises:
the arbitration unit 201 is configured to determine multiple first memory access requests accessing the same target storage area, where different first memory access requests are used to access different target storage units in the target storage area.
A control unit 202, configured to request the memory for data of each storage unit in the target storage area.
And the data returning unit 203 is configured to return the data in the corresponding target storage unit for each first memory access request after the memory returns the data of each storage unit in the target storage area.
The memory access Processing device of the embodiment of the disclosure may be applied to various types of multithreaded processors such as a GPU, a Neural Network Processing Unit (NPU), or a CPU, and the disclosure does not limit the type of the processor. The processor may schedule multiple threads to process data in parallel.
For example, the memory access processing apparatus of this embodiment may be implemented in an existing processor with hardware multithreading, and after the memory access requests of a plurality of threads in an original thread group are processed by the memory access processing apparatus of this embodiment, the memory access processing apparatus accesses the memory.
The memory of embodiments of the present disclosure includes memory accessible to each thread within a thread group. Taking the GPU as an example, the GPU may include a shared memory (Shared Memory) or a global memory (Global Memory). A shared memory is an on-chip memory; a global memory is an off-chip memory and can be implemented by a Dynamic Random Access Memory (DRAM, also known as video memory). Compared with a global memory, a shared memory has a smaller capacity and the span of positions accessed by multiple threads is not too large; the scheme of this embodiment can be applied to it as needed, but the effect of this embodiment is more remarkable when applied to an external memory with large capacity and low bandwidth.
The embodiment of the disclosure divides a memory into a plurality of memory areas, each comprising a plurality of memory units. When a plurality of parallel memory access requests in a thread group are processed, the arbitration unit can determine a plurality of first memory access requests accessing the same target memory area, and the control unit requests data from the memory with the target memory area as the unit, so that the memory only needs to read and return the data of each memory unit in the target memory area once; the data return unit then returns the data in the corresponding target memory unit to each first memory access request. This ensures the continuity of the data addresses read from the memory and improves the memory access speed.
For example, suppose each memory region comprises M memory units, and the memory access requests of 5 threads in the thread group need to be processed in parallel. The locations accessed by the 5 memory access requests may be far apart. The arbitration unit of this embodiment is able to determine that 2 of the memory access requests access different storage units of the same target storage Area1, and the other 3 access different storage units of another target storage Area2. As shown in fig. 2, for the 2 first memory access requests accessing the target storage Area1, the control unit can request from the memory the data of the M storage units of the target storage Area1 (fig. 2 takes M equal to 32 as an example), and the data return unit then returns the data in the corresponding target storage units to the 2 first memory access requests; for example, as shown in fig. 2, the data of Bank31 and the data of Bank5 are returned to the two first memory access requests respectively.
Therefore, the memory access processing device of this embodiment does not directly execute the multiple memory access requests of the thread group in parallel; instead, it obtains multiple first memory access requests accessing the same target storage area and requests data from the memory at the granularity of a storage area. For the memory, whose processing capacity is relatively limited, only a storage area with continuous addresses needs to be operated on, and all the data of each memory unit in that storage area is returned at once; the data return unit subsequently distributes the data, which improves the execution efficiency of the multiple first memory access requests.
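As a rough behavioral sketch of this flow (the structures and names below, such as `process_requests` and the per-region burst dictionary, are illustrative assumptions rather than the patented hardware), the arbitration, control, and data return steps could be modeled as:

```python
from collections import defaultdict

def process_requests(requests, memory):
    """requests: list of (thread_id, region_id, bank_id).
    memory: region_id -> list of per-Bank data; one lookup models one
    region-granular burst read."""
    # Arbitration unit: group requests by target storage region.
    groups = defaultdict(list)
    for thread_id, region_id, bank_id in requests:
        groups[region_id].append((thread_id, bank_id))

    results = {}
    for region_id, group in groups.items():
        # Control unit: a single region-granular read returns every Bank's data.
        region_data = memory[region_id]
        # Data return unit: hand each thread the data of its target Bank.
        for thread_id, bank_id in group:
            results[thread_id] = region_data[bank_id]
    return results

memory = {0: [f"r0b{b}" for b in range(32)], 1: [f"r1b{b}" for b in range(32)]}
reqs = [(0, 0, 31), (1, 0, 5), (2, 1, 0), (3, 1, 7), (4, 1, 9)]
out = process_requests(reqs, memory)  # thread 0 gets "r0b31", thread 1 gets "r0b5"
```

Grouping first and then fetching a whole region at once is what keeps the memory's reads address-contiguous, regardless of how scattered the original per-thread addresses were.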
For a write request, the memory access request carries the first data to be written into the target storage unit. When the memory access requests are write requests, the control unit of the embodiment of the present disclosure may request the memory to access the target storage area, so as to write the first data carried by the plurality of first memory access requests into the storage units in the target storage area. Because the plurality of first memory access requests all access the same target storage area, the memory does not need to stride across a long address span to write data, which improves the access efficiency of the memory.
In some examples, each storage region of the memory corresponds to identification information, each storage unit in each storage region corresponds to identification information, and each access request carries first identification information of the accessed storage region and second identification information of the accessed storage unit in the corresponding storage region, so that the arbitration unit may be configured to determine the plurality of first access requests based on the first identification information, and the data return unit may be configured to determine a target storage unit accessed by each first access request based on the second identification information. The identification information of each storage area and the identification information of each storage unit may be flexibly configured as needed, which is not limited in this embodiment.
The number of memory cells in each memory region can be set as desired. As an example, each memory region in the memory includes M memory cells Bank, and banks in the same position in each memory region may use the same identification, for example, the numbers 0 to (M-1) are used for representation; of course, it is clear to those skilled in the art that the identifier may be expressed in other ways according to the actual situation, and details are not described here.
The data bit width accessed by each memory access request and the storage bit width of a Bank can be designed according to the minimum storage unit of the memory. For example, if the memory stores data with a byte as the minimum storage unit, the data size operated on by each memory access request could be designed to be a single byte, and the Bank size could also be one byte. However, one byte carries little information: accessing multiple bytes of data would require multiple requests executed one by one, which is inefficient. Optionally, the data bit width accessed by a thread's memory access request may be greater than 1 byte. The data bit width of a thread's memory access request can be the same as or different from the storage bit width of a Bank: if the data bit width of the memory access request is greater than the storage bit width of the Bank, one memory access request accesses multiple storage units; if the data bit width is less than or equal to the storage bit width, the data accessed by one memory access request may fall within one Bank or span two Banks. If the two are designed to be the same, the data accessed by some memory access requests falls exactly within one Bank, and the processing of those memory access requests can be relatively efficient.
A Bank may be further divided into storage locations according to its storage bit width. For example, if the storage bit width of the Bank is multiple bytes, taking 4 bytes as an example, the Bank may be further divided into 4 storage locations, each storing one byte of data.
The memory access request carrying the first identification information and the second identification information can be understood as a logical address of a location of a memory to be accessed by the memory access request, and the logical address has a mapping relation with a physical address of the memory. Based on the memory partitioning, the logical address can be used to determine the memory location in the memory region accessed by the memory access request. In the case that the storage unit is divided into a plurality of storage locations, the logical address may also be used to determine the storage location of the data accessed by the memory access request in the storage unit.
It will be clear to those skilled in the art that the logical addresses may be implemented in a variety of ways. As an example, the logical address may include: an identification of the memory region (which may be referred to as the base address of the Bank), an identification of the memory unit in the memory region, and an identification of the storage location in the memory unit. As shown in fig. 3A, the logical address is represented by 40 bits of data, wherein bits 39 to 7 are used to represent the identification of the memory region, and bits 6 to 2 are used to represent the identification of the Bank in the memory region; taking a Bank storage bit width of 4 bytes as an example, with each storage location being 1 byte, bits 1 to 0 of the logical address can be used to indicate the storage location, within the Bank, of the data accessed by the memory access request.
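Under the fig. 3A layout just described, extracting the three fields from a 40-bit logical address could be sketched as follows (a minimal illustration; the function name and the assumption of 32 four-byte Banks per region follow the example in the text):

```python
def decode_logical_address(addr):
    """Split a 40-bit logical address into (region id, Bank id, byte position),
    following the fig. 3A layout: bits 39-7 identify the memory region,
    bits 6-2 identify the Bank (32 Banks per region), and bits 1-0 give the
    byte position within a 4-byte Bank."""
    byte_pos = addr & 0x3          # bits 1..0
    bank_id = (addr >> 2) & 0x1F   # bits 6..2
    region_id = addr >> 7          # bits 39..7
    return region_id, bank_id, byte_pos

# Region 5, Bank 12, byte 3 packed into one address, then decoded back.
addr = (5 << 7) | (12 << 2) | 3
fields = decode_logical_address(addr)  # (5, 12, 3)
```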
If the data operated on by the memory access request spans multiple bytes, the storage location identifier in the logical address may, as an example, indicate the storage location of the first byte of the accessed data; the multiple bytes of data are then accessed starting from that storage location and continuing through the subsequent storage locations.
By way of example, fig. 3B(a) shows that Bank1 and Bank2 each include 4 bytes. The data bit width accessed by the memory access request is also 4 bytes.

If the data to be accessed by the memory access request is the 4 bytes in Bank1, the logical address points to byte 0 in Bank1, and bits 1 to 0 of the address can be represented by "00"; that is, the 4 bytes starting from byte 0 of Bank1 are accessed, represented by gray squares in fig. 3B(b).

If the data to be accessed by the memory access request runs from the 1st byte of Bank1 to the 0th byte of Bank2, the logical address points to the 1st byte of Bank1, the memory unit identification in the address is that of Bank1, and bits 1 to 0 of the logical address need to be represented by "01"; that is, the 4 bytes starting from the 1st byte of Bank1, represented by gray squares in fig. 3B(c), span into Bank2. The location accessed by the memory access request therefore spans two Banks, and this address non-alignment problem affects subsequent execution efficiency. For example, before the thread group's requests finally reach the memory, a memory access request may need to undergo various processing such as parsing, address conversion, or compression; if the memory access request spans two Banks, its processing requires more computation. Moreover, in some scenarios, after a memory access request is sent from the arbitration unit, it must still pass through many buses, caches, and other units on the processor before reaching the memory, and some bus interfaces require address alignment; alignment therefore improves the processing efficiency of memory access requests.
Based on this, in some examples, the memory access processing apparatus may include an alignment unit, configured to determine, from the multiple memory access requests, memory access requests that access multiple consecutive storage units, split the memory access requests, and obtain the memory access requests that access any of the multiple consecutive storage units respectively.
For example, suppose the data to be accessed by a thread's memory access request Q1 runs from the 2nd storage location of Bank1 to the 1st storage location of Bank2. In this embodiment, the two memory access requests obtained by splitting Q1 may be: a memory access request Q11 for the 2nd to 4th storage locations of Bank1, and a memory access request Q12 for the 1st storage location of Bank2.
In the case that the storage unit includes a plurality of storage locations, the split access request accesses a part of the storage locations in the storage unit. For example, the logic address carried by the memory access request obtained by splitting may indicate the first storage location of the storage unit accessed by the memory access request, and carry location identification information for indicating the valid location.
For example, the logical address of memory access request Q11 includes the identification of memory unit Bank1, and bits 1 to 0 of the logical address are "00", indicating the first storage location of Bank1; the memory access request also carries location identification information indicating that the valid locations are the 2nd to 4th storage locations in Bank1.

The logical address of memory access request Q12 includes the identification of memory unit Bank2, and bits 1 to 0 of the logical address are "00", indicating the first storage location of Bank2; the memory access request also carries location identification information indicating that the valid location is the 1st storage location in Bank2.
Therefore, after the alignment processing, bits 1 to 0 of all memory access requests are "00". When other units on the processor subsequently process the requests, the byte identifiers of the logical addresses in all memory access requests indicate the first storage location of a storage unit; that is, all the addresses are aligned.
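A minimal sketch of this splitting, assuming 4-byte Banks and representing the location identification information as a per-byte valid mask (the function name and data shapes are hypothetical):

```python
def split_unaligned(bank_id, byte_pos, width, bank_bytes=4):
    """Split an access of `width` bytes starting at byte `byte_pos` of
    `bank_id` into per-Bank aligned sub-requests. Each sub-request's
    logical address points at byte 0 of its Bank, and a valid-byte mask
    plays the role of the location identification information."""
    parts = []
    offset, remaining, bank = byte_pos, width, bank_id
    while remaining > 0:
        take = min(bank_bytes - offset, remaining)
        mask = [offset <= i < offset + take for i in range(bank_bytes)]
        parts.append((bank, mask))  # address is now Bank-aligned
        remaining -= take
        offset = 0
        bank += 1
    return parts

# Mirrors Q1 above: 4 bytes starting at the 2nd location of Bank1 split into
# Q11 (Bank1, locations 2-4 valid) and Q12 (Bank2, location 1 valid).
q11_q12 = split_unaligned(1, 1, 4)
```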
For example, in order to process each memory access request in parallel, the number of the alignment units may be the same as the number of memory access requests in parallel in the thread group, and one alignment unit is used to determine whether a memory access request needs to be split.
For a plurality of parallel memory access requests, the arbitration unit distinguishes the target storage areas they access, so that the plurality of parallel memory access requests can be divided into one or more memory access request groups; each memory access request group comprises a plurality of first memory access requests, and the first memory access requests in different memory access request groups access different target storage areas. When there are multiple memory access request groups, the arbitration unit can send the groups to the control unit one after another. As an example, the arbitration unit may also arbitrate the priority of each memory access request group, so that the control unit requests the memory for the target storage area accessed by the first memory access requests in the group with the highest priority. Continuing the foregoing example, the 5 memory access requests are divided into two memory access request groups, which can be executed sequentially. It is clear to those skilled in the art that the priority of each memory access request group can be arbitrated according to any rule: for example, randomly, or according to the address of the target storage area accessed by the first memory access requests in each group, and so on. Although dividing the thread group's requests into several groups executed in sequence may reduce execution efficiency to some extent, for the plurality of first memory access requests of one group, the memory obtains and returns the data of each memory unit in a storage area with continuous addresses at once, without operating across distant addresses, which greatly improves the data return speed; this embodiment can therefore still markedly improve the execution efficiency of the thread group.
For example, the arbitration unit may send the memory access request groups to the control unit in order of priority, so that upon receiving a memory access request group, the control unit requests the memory for the target storage area accessed by the first memory access requests in that group. The plurality of first memory access requests in a memory access request group may be sent in parallel; for example, the memory access processing apparatus may include a plurality of instruction paths, each used to send the memory access requests that access the storage unit corresponding to that instruction path.
By way of example, each memory region includes M Banks, and at least M instruction paths are provided accordingly, one instruction path corresponding to each Bank and used for issuing the memory access requests that access that Bank. Illustratively, Banks at the same position in each memory region may have the same identification; for example, the identifications of the M Banks in each memory region may be represented by the numbers 0 to (M-1). The instruction paths may also be numbered: for example, instruction path 0 sends requests that access Bank0, instruction path 1 sends requests that access Bank1, and so on. On this basis, the multiple instruction paths respond to as many memory access requests as possible at one time, realizing the parallel transmission of multiple memory access requests.
However, in the above design, an instruction path can only transmit one memory access request at a time. If multiple first memory access requests access the same Bank, each must be sent one by one by the corresponding instruction path; the instruction path is blocked, which reduces execution efficiency.
For example, fig. 3C is a schematic diagram of thread processing in the embodiment of the disclosure. In fig. 3C(a), the 32 threads access the 32 Banks exactly in sequence, so their requests are sent in parallel, each through its corresponding instruction path. In fig. 3C(b), the 32 threads are not in sequence, but each still maps to a distinct Bank, so the requests are likewise sent in parallel through the 32 instruction paths. In fig. 3C(c), multiple threads access the same Bank0, so the instruction path of Bank0 is blocked, and those threads must wait for the instruction path of Bank0 to respond one by one.
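The instruction-path constraint above can be sketched as a simple conflict check (illustrative only; `rounds_needed` shows how, without merging, the worst per-Bank conflict count serializes issue):

```python
from collections import Counter

def can_issue_parallel(bank_ids):
    """With one instruction path per Bank, a set of requests can issue in a
    single round only if no two requests target the same Bank."""
    return len(bank_ids) == len(set(bank_ids))

def rounds_needed(bank_ids):
    """Without merging, a blocked instruction path sends its requests one by
    one, so the number of issue rounds equals the worst per-Bank count."""
    return max(Counter(bank_ids).values())

# Fig. 3C(a)/(b): 32 threads hit 32 distinct Banks -> one parallel round.
# Fig. 3C(c): several threads hit Bank0 -> that path serializes them.
all_distinct = can_issue_parallel(list(range(32)))
conflicted = rounds_needed([0, 0, 0, 1])
```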
Therefore, in this embodiment, the arbitration unit may merge second memory access requests that access the same storage unit in the same storage area, so that the first memory access requests all access different storage units of the same storage area; the instruction paths are then not blocked during transmission, and the execution efficiency of the memory access requests is improved.
For example, the splitting of memory access requests described in the foregoing embodiment may be performed first, then the merging of memory access requests, and finally the division into memory access request groups. In this way, at least one of the first memory access requests and/or at least one of the second memory access requests may itself have been obtained by splitting a memory access request that accessed multiple consecutive storage units, so that the split memory access requests can be merged with other memory access requests, reducing the blocking of the instruction paths.
The threads under one warp are generally of the same type; in the memory access scenario of this embodiment, they are either all reads or all writes, and the type information may be carried in the memory access request. The merging of memory access requests in this embodiment can be applied to both types. For simplicity, taking two requests as an example, the cases of accessing the same storage unit in the same storage area may be as follows:
Firstly, both memory access requests access the same storage location in a storage unit; for example, each requests the same byte of data in a storage location.

Secondly, the two memory access requests access different storage locations in one storage unit; for example, one memory access request operates on the 1st byte of data stored in Bank1, and the other operates on the 2nd byte of data stored in Bank1.

Thirdly, the storage locations respectively accessed by the two memory access requests overlap; for example, one memory access request operates on the data of the 1st and 2nd bytes stored in Bank1, and the other operates on the data of the 2nd and 3rd bytes in Bank1.
In all of the above cases, the two requests can be merged into one memory access request. The new memory access request obtained by merging may be obtained by splicing the two original memory access requests, or may be newly generated, so long as it carries the information of the two original requests.
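A sketch of such merging, assuming each request is represented as (region id, Bank id, per-byte valid mask) in the spirit of the earlier alignment discussion (a hypothetical representation, not the patent's exact format):

```python
def try_merge(req_a, req_b):
    """Merge two same-type requests that hit the same Bank of the same
    region. Identical, disjoint, and overlapping byte locations all merge
    into one request; returns None when the region or Bank differs."""
    (region_a, bank_a, mask_a) = req_a
    (region_b, bank_b, mask_b) = req_b
    if (region_a, bank_a) != (region_b, bank_b):
        return None
    # Union of the valid-byte masks covers both original requests.
    merged_mask = [a or b for a, b in zip(mask_a, mask_b)]
    return (region_a, bank_a, merged_mask)

# Overlapping case: bytes 1-2 and bytes 2-3 of the same Bank merge into 1-3.
merged = try_merge((0, 1, [False, True, True, False]),
                   (0, 1, [False, False, True, True]))
```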
As an example, the multiple memory access requests of the thread group currently needing to be processed are QA to QP, i.e., 16 threads sending memory access requests; 5 of them are shown in fig. 3D for simplicity.

QA accesses two consecutive memory units in the same memory region, and is thus split into Qa1 and Qa2.

QB and Qa2 access the same memory unit of the same memory region, and are thus merged into Qa2-B.

QC and QD access the same memory unit in the same memory region, and may be combined into QC-D.

The other memory access requests QF to QP may also undergo the same process, so that the 16 memory access requests are divided into one or more memory access request groups. Each memory access request group can be sent to the control unit by the sending unit in turn, according to priority, so that upon receiving a memory access request group, the control unit requests the memory for the data of the memory area accessed by the first memory access requests in that group.
The control unit requests data from the memory by taking the memory area as granularity, and the data returning unit returns the data in the corresponding target memory unit to each first memory access request after the memory returns the data of each memory unit in the target memory area.
For example, consider Qa1, Qa2-B, QC-D, and QE in one memory access request group. Since QE was neither merged nor split, when the data return unit returns data for QE it is, equivalently, responding to the thread that sent QE.

QC-D is the result of merging QC and QD, which were sent by two different threads and both access data in the same Bank; so when the data return unit returns data for QC-D, it may respond to both the thread that sent QC and the thread that sent QD.

Qa2-B is the result of both splitting and merging; so when the data return unit returns data for Qa2-B, it may respond to the thread corresponding to Qa2 and to the thread that sent QB.

For Qa1, when the data return unit returns data, it may respond to the thread corresponding to Qa1. Since Qa1 and Qa2 were obtained by splitting QA, the data return unit can respond to the thread that sent QA with the data corresponding to Qa1 together with the data corresponding to Qa2.
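The fan-out just described can be sketched as follows, with hypothetical bookkeeping structures: a merged request answers every thread that contributed to it, while the pieces of a split request are reassembled for their single source thread:

```python
def distribute(bank_data, bank_to_threads, thread_splits):
    """Sketch of the data return step (hypothetical structures):
    bank_data: Bank id -> data the memory returned for that Bank;
    bank_to_threads: Bank id -> threads whose (possibly merged) requests
    access that Bank; thread_splits: thread whose original request was
    split -> the ordered Banks its pieces were split across."""
    results = {}
    # A merged request responds to every thread that contributed to it.
    for bank_id, threads in bank_to_threads.items():
        for t in threads:
            results.setdefault(t, []).append(bank_data[bank_id])
    # Pieces of a split request are reassembled, in order, for one thread.
    for t, banks in thread_splits.items():
        results[t] = [bank_data[b] for b in banks]
    return results

# Loosely mirrors the fig. 3D example: thread A's request was split across
# Banks 0 and 1; threads C and D merged on Bank 2; thread B reads Bank 1.
bank_data = {0: "d0", 1: "d1", 2: "d2"}
out = distribute(bank_data, {1: ["B"], 2: ["C", "D"]}, {"A": [0, 1]})
```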
When the data return unit returns data for each first memory access request in a memory access request group, it needs to determine the correspondence between the Bank accessed by each first memory access request and the thread that sent it. In other examples, after a memory access request group is sent from the arbitration unit, the memory access requests may need further processing, such as parsing or address conversion, before the control unit requests access to the memory; some information carried in a memory access request is unrelated to the control unit's access to the memory (such as the correspondence described above) and is needed only when the data return unit returns the data. Based on this, the memory access processing apparatus of this embodiment further includes a bypass information storage unit, which may be used to temporarily store the bypass information in the memory access requests; here, bypass information refers to information that is not needed for the time being and can be temporarily stored. In practical applications, a person skilled in the art may designate one or more items of information in the memory access request as bypass information as needed, which is not limited in this embodiment. The bypass information in a memory access request can be temporarily stored instead of being sent downstream, while the other information in the memory access request is sent downstream for further processing; this improves the sending speed and execution efficiency of the memory access requests.
As an example, the bypass information may include a first correspondence between a thread that sent the memory access request and a memory location accessed by the memory access request; the allocation unit is used for extracting the bypass information from each first memory access request and sending the bypass information to the bypass storage unit for storage; and the data returning unit is used for taking the bypass information out of the bypass storage unit after the memory returns the data of each storage unit in the target storage area, and returning the data in the corresponding target storage unit to each first memory access request based on the taken bypass information. For example, the arbitration unit divides a plurality of access requests of the thread group into one or more access request groups, each access request group is sequentially sent to the allocation unit, and the allocation unit can store the first corresponding relation of the access request group for each access request group.
It is clear to those skilled in the art that the first correspondence may be stored in a variety of ways, for example, a thread that sends an access request corresponds to a thread identifier, the access request carries second identification information of an accessed storage unit in a corresponding storage area, and the storage of the first correspondence may be to store the thread identifier and the second identification information of each access request.
As in the read-merge process described above, multiple threads may correspond to one storage unit. For example, the identifiers of the M storage units in a storage area are represented by M numbers from 0 to M-1. Assuming a thread group of 32 threads in which the memory access requests of thread 31, thread 30, and thread 29 all access Bank0, these three requests can be merged. In the first correspondence, the relationship among Bank0, thread 31, thread 30, and thread 29 can be expressed as:
Bank0: 1110…0;
That is, recording the threads corresponding to Bank0 requires 32 bits, of which 3 bits are set to 1, indicating that thread 31, thread 30, and thread 29 correspond to Bank0; from the perspective of Bank0, three threads need responses. Moreover, since the number of threads in a thread group accessing the same storage unit of the same storage area is unknown in advance, the bitmap of threads corresponding to each Bank must be as wide as the number of threads in the thread group, and the larger M is, the more bits are required in total.
From the thread's perspective, by contrast, the number of Banks accessed by a thread can be bounded: it can be determined from the storage bit width of the storage units and the data bit width accessed by the thread's memory access request. For example, when the storage bit width of a storage unit equals the data bit width accessed by the memory access request, the thread's memory access request accesses one Bank, or two Banks in the misaligned case.
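This bound can be sketched as follows; the function name and the byte-granularity arithmetic are illustrative assumptions, not from the original:

```python
def banks_accessed(addr: int, access_width: int, bank_width: int) -> int:
    """Number of Banks touched by an access of access_width bytes starting at
    byte address addr, with each Bank bank_width bytes wide."""
    first_bank = addr // bank_width
    last_bank = (addr + access_width - 1) // bank_width
    return last_bank - first_bank + 1

# An aligned access the same width as a Bank touches one Bank; a misaligned
# access of that width straddles a boundary and touches two.
print(banks_accessed(0, 4, 4))  # aligned
print(banks_accessed(2, 4, 4))  # misaligned
```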
Therefore, the storage of the correspondence can be compressed by taking the thread as the dimension. For example, each storage unit in a storage area may correspond to a code whose number of bits is less than the total number of storage units in the storage area. When each thread's memory access request accesses exactly one Bank, each thread corresponds to only one Bank code, which markedly reduces the amount of data stored.
For example, the code of the M storage units in a storage area may be determined based on the logarithm of M. If M is 2 to the power n, the minimum code length of a storage unit is n bits; if M is not a power of 2, the code length can be obtained by rounding log2(M) up to the nearest integer. It will be clear to those skilled in the art that the code length in practical applications can be adjusted as needed, which is not limited in this embodiment.
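The code-length rule above can be written down directly; the function name is an illustrative assumption:

```python
import math

def bank_code_length(m: int) -> int:
    """Minimum number of bits needed to encode M distinct storage unit
    identifiers: ceil(log2(M)), with at least 1 bit."""
    assert m >= 1
    return max(1, math.ceil(math.log2(m)))

print(bank_code_length(32))  # a power of 2: exactly n bits
print(bank_code_length(33))  # not a power of 2: rounded up
```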
For example, if a storage area includes 32 storage units numbered 0 to 31 as storage unit identifiers, the code of each storage unit identifier can be reduced to a minimum of 5 bits.
For example, taking threads as the dimension, without encoding, the correspondence between thread 31, thread 30, thread 29 and Bank0 can be expressed as:
thread 31: 0000…1;
thread 30: 0000…1;
thread 29: 0000…1;
the 0 th position is set to 1, which represents Bank 0. The Bank0 code may be 00000 through the encoding of 32 banks. Other Bank codes have the same principle, for example, the Bank1 code can be 00001, and so on, and will not be described herein. Therefore, the data size can be reduced through encoding, a circuit with a smaller area can be adopted for realizing the bypass information storage unit in hardware, and the hardware power consumption is also reduced.
A memory access request group contains the memory access requests of multiple threads. With the Bank-encoding embodiment above, storing the first correspondence of each first memory access request in a group can amount to storing the correspondence between the thread identifier of each first memory access request in the group and the Bank code.
Alternatively, the thread identifier need not be stored. Because the memory access requests of a thread group's multiple threads are divided into multiple memory access request groups, one group may include only some of the threads, and the threads not included need no response. Flag information indicating whether each of the thread group's threads needs a response may be stored in a set order; the flag information may take two values, indicating the needs-response state and the no-response state respectively. For example, it can be represented with at least 1 data bit, with binary "0" and "1" denoting no response needed and response needed, respectively. Of course, it will be clear to those skilled in the art that other forms of flag information are possible and are not exhaustively listed here. Then, for each thread needing a response, the corresponding Bank code is stored.
Alternatively, the data can be stored as an array containing as many elements as there are threads, each element corresponding to one thread; threads needing no response can be marked with flag information such as 0, while for a thread needing a response the corresponding Bank code is stored. If one thread's memory access request accesses multiple storage units and is split into multiple requests, the thread corresponds to multiple Bank codes; as an example, these codes can be concatenated and stored, at the cost of a larger data bit width. Optionally, the arbitration unit may divide the multiple memory access requests obtained by splitting one thread's request into different memory access request groups, increasing the number of groups, so that among the multiple first memory access requests of each group a thread corresponds to only one Bank, which reduces the amount of data when storing the codes.
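The array form can be sketched as follows; the function names are assumptions, and `None` stands in for the no-response flag that hardware would encode in a data bit:

```python
# Per request group: one entry per thread in a fixed order -- the no-response
# flag (here None) when the thread has no request in this group, otherwise the
# thread's Bank code.
NUM_THREADS = 32

def pack_group(thread_to_bank: dict) -> list:
    """thread_to_bank maps thread id -> Bank code for threads in this group."""
    return [thread_to_bank.get(t) for t in range(NUM_THREADS)]

def unpack_group(entries: list) -> list:
    """Recover (thread id, Bank code) pairs for the threads needing a response."""
    return [(t, bank) for t, bank in enumerate(entries) if bank is not None]
```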
Similarly, when the data return unit returns the data for each first memory access request of a memory access request group, it can obtain the code corresponding to each first memory access request from the bypass information storage unit, thereby determine the target storage unit accessed by the request in the target storage area, and return the data of that target storage unit for the first memory access request.
In addition to the first correspondence described above, in some examples the bypass information of a memory access request may further include location identification information and the like. For example, in the foregoing embodiment that requires address alignment, a split first memory access request may only need to operate on part of the storage locations under one Bank, and the first memory access request carries location identification information indicating the effective storage locations it operates on within the corresponding storage unit. A memory access request sent by a thread may also be in this situation: for example, when the storage bit width of a storage unit exceeds 1 byte, the storage unit can be divided into multiple storage locations, and a thread's request that only needs to operate on some of the storage locations in a Bank likewise carries location identification information. Therefore, the bypass information stored by the bypass information storage unit may also include the location identification information. When the data return unit returns the data of the corresponding target storage unit for a first memory access request, it can fetch the location identification information from the bypass information storage unit and respond to the request according to it.
For the first correspondence between the thread of a memory access request and the storage unit it accesses, in some examples the first correspondence may be further optimized to reduce the amount of data stored by the bypass information storage unit. The allocating unit is configured to: when the plurality of first memory access requests meet a preset condition, add indication information to each first memory access request and send each first memory access request carrying the indication information to the control unit; the indication information indicates that the first memory access requests meet the preset condition, and the preset condition includes: each of the plurality of first memory access requests accesses one storage unit, and the thread sending each first memory access request and the storage unit accessed by it satisfy a preset correspondence. The data return unit is configured to, after the memory returns the data of each storage unit in the target storage area, return the data in the corresponding target storage unit for each first memory access request based on the preset correspondence.
The indication information may be carried by widening the signal bit width of the memory access request; or, in some scenarios, some data bits are reserved in the memory access request and can be used as flags expressing a specific meaning as needed, for example a tag signal in the memory access request: writing a specific flag, such as binary 0 or 1, into these data bits makes them indicate the preset condition.
The preset correspondence may be determined flexibly as needed; for example, there may be one or more preset correspondences, which is not limited in this embodiment. The preset correspondence is determined according to correspondences, frequently occurring in practical applications, between threads and the storage units accessed by their first memory access requests.
For example, a common situation in practice is that none of the M threads in a thread group has an address misalignment condition (that is, the thread group contains no first thread), and the storage unit identifiers of the logical addresses in the memory access requests correspond one-to-one to the M storage units in one storage area. The preset correspondence may then be the cases shown in fig. 3c(a) and fig. 3c(b), where Thread0 to Thread31 each access exactly one storage unit in one storage area, and the 32 threads form one memory access request group. In such a one-to-one case, the first correspondence does not need to be stored, and the allocating unit may make each first memory access request carry the indication information. Optionally, a specific flag, for example binary 0 or 1, is written into the tag signal of the memory access request so that the flag indicates the preset condition. The data return unit can determine from the indication information of a first memory access request that its memory access request group meets the preset condition, and directly return the data in the corresponding target storage unit for each first memory access request using the preset correspondence. Thus, when the preset condition is met, the first correspondence need not be written into the bypass information storage unit, which reduces the amount of data stored there, optimizes the power consumption and area of the bypass information storage unit, and allows the data return unit to respond to each request more quickly.
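A minimal sketch of this preset-condition check, assuming the identity mapping (thread t accesses Bank t) as the preset correspondence; function names and the 1-bit tag representation are illustrative:

```python
def is_identity_mapping(requests: list) -> bool:
    """requests[i] = (thread_id, bank_id) for each first memory access request.
    True when every thread accesses exactly the Bank with its own index."""
    return all(thread == bank for thread, bank in requests)

def tag_requests(requests: list) -> list:
    # When the preset condition holds, a 1-bit tag replaces the stored first
    # correspondence; the data return unit can reconstruct it from the tag.
    preset = is_identity_mapping(requests)
    return [(thread, bank, int(preset)) for thread, bank in requests]
```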
For write-type memory access requests, temporarily storing some bypass information in the bypass information storage unit can likewise improve request execution efficiency. For example, a write-type memory access request carries first data to be written into the target storage unit; as described above, in some scenarios a memory access request sent from the arbitration unit does not reach the control unit directly to access the memory. Optionally, in this embodiment, the allocating unit may be configured to extract the first data from each first memory access request and send it to the bypass information storage unit for storage; and the control unit is configured to, after fetching the first data from the bypass information storage unit, write the first data carried by the plurality of first memory access requests into the storage units in the target storage area.
For example, the allocating unit receives a memory access request group sent by the sending unit, where each memory access request carries first data and other information. The allocating unit may write the first data of each request into the bypass information storage unit, while the other information of each request continues downstream and may undergo further processing. After the processed other information of each request reaches the control unit, the control unit fetches the first data of each request from the bypass information storage unit and, combining it with the processed other information, requests the memory to write the data. In this way, unnecessary data transfers are reduced for write requests, the sending efficiency of memory access requests is improved, and the subsequent processing of the other information in the requests also becomes more efficient.
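The write-path flow can be sketched as follows; the class and function names, and the use of a request id as the stash key, are illustrative assumptions:

```python
# Hypothetical sketch of the write-path bypass flow: the write payload is
# stashed while the rest of the request travels the pipeline; the control
# unit joins the two before writing to memory.
class WriteBypassStore:
    def __init__(self):
        self._stash = {}

    def stash(self, request_id, payload):
        # Allocating-unit side: hold the write data out of the pipeline.
        self._stash[request_id] = payload

    def fetch(self, request_id):
        # Control-unit side: retrieve the data when the instruction arrives.
        return self._stash.pop(request_id)

def control_unit_write(store, memory, request_id, bank, offset):
    # Only the processed "other information" (bank, offset) flows here;
    # the payload comes from the bypass store.
    memory[(bank, offset)] = store.fetch(request_id)
```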
Optionally, different bypass information storage units may be provided for different types of memory access requests. For example, the bypass information storage unit may include: a read-operation bypass information storage unit for storing the bypass information of read-type memory access requests; and a write-operation bypass information storage unit for storing the bypass information of write-type memory access requests.
Next, a description is given by way of an embodiment. Referring to fig. 4, a schematic diagram of a processor according to an embodiment of the present disclosure, the memory access processing apparatus of this embodiment may be applied to a processor; as an example, the processor may include a thread block control unit, a thread group synchronization unit, the memory access processing apparatus of this embodiment, and the like.
(1) The thread block control unit is used for issuing the multiple threads of a thread group warp to the memory access processing apparatus, each thread corresponding to a memory access request.
(2) A thread group synchronization unit, configured to regulate the threads (i.e., thread-level instructions) under one warp: specifically, N threads of one thread group are scheduled each time, and the next thread group is fetched only after all threads in the current thread group have received responses. The thread group fetched by the thread group synchronization unit is issued to the address alignment unit.
(3) An address alignment unit, configured to judge whether each thread's memory access request has the address misalignment problem, and if so, split the misaligned thread's memory access request into two memory access requests for that thread. Optionally, there may be N address alignment units, so as to handle the address misalignment of the N threads in parallel.
(4) An arbitration unit, configured to:
First, read merging: if the memory access requests are read requests, judge whether there are multiple second memory access requests for the same storage unit in the same storage area, and if so, merge them into one first memory access request.
Second, dividing memory access request groups: the multiple memory access requests are divided into one or more memory access request groups.
When there are multiple memory access request groups, the priority of each group can be arbitrated, and the groups are sent out in turn through the instruction channels; the multiple first memory access requests of one memory access request group are sent out in parallel.
Third, sending a memory access request group: a data access instruction for accessing the jth Bank is sent through the jth instruction channel, where j is an integer from 0 to M-1. Threads in one batch may access a Bank simultaneously, for example because write-type instructions are not merged. In the case of a Bank conflict, the requests may be sent in order of thread id, for example with smaller ids sent first.
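The read-merging and group-division steps above can be sketched together; the function name and tuple layout are illustrative assumptions:

```python
from collections import defaultdict

def arbitrate(read_requests):
    """read_requests: list of (thread_id, area_id, bank_id) read requests.
    Returns area_id -> list of (bank_id, merged thread ids), i.e. one
    memory access request group per target storage area, where each first
    memory access request targets a distinct Bank."""
    # Read merging: requests hitting the same (area, Bank) become one
    # first memory access request carrying all the merged threads.
    merged = defaultdict(list)
    for thread, area, bank in read_requests:
        merged[(area, bank)].append(thread)

    # Group division: requests of one group share a target storage area.
    groups = defaultdict(list)
    for (area, bank), threads in merged.items():
        groups[area].append((bank, threads))
    return dict(groups)
```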
(5) Each instruction channel, upon receiving a memory access request group sent by the Bank access arbitration unit, sends out one or more first memory access requests of the group in parallel.
(6) An allocating unit, configured to receive the first memory access requests of a memory access request group sent over the multiple instruction channels.
The allocating unit may include one or more selectors SEL for distributing the information carried by the first memory access requests.
The allocating unit further includes a read-operation bypass information storage unit and a write-operation bypass information storage unit.
For example, through the selector SEL, the allocating unit stores the bypass information of a read request into the read-operation bypass information storage unit when a read request is received; when a write request is received, the bypass information of the write request is stored into the write-operation bypass information storage unit.
After the bypass information is extracted from a memory access request, the remaining information of the request can be sent as an instruction to downstream units through the instruction queues. For example, the memory access processing apparatus may implement N' instruction queues, each used to send one instruction, thereby processing N' instructions in parallel. The instructions are ultimately sent to the control unit, which accesses the memory according to one or more instructions. For simplicity, the units after the instruction queues are not shown in fig. 4; it will be clear to those skilled in the art that, in practice, an instruction may also pass through one or more units with other functions as needed before reaching the control unit.
For read-type memory access requests, the bypass information stored in the read-operation bypass information storage unit may include the first correspondence of the foregoing embodiments; or, when a memory access request carries location identification information, that information may be stored as well. The first correspondence may be stored using the encoding method of the foregoing embodiments to reduce the amount of data. The control unit requests the data from the memory, the memory sends the data to the data return unit (not shown in fig. 4) according to the request, and the data return unit fetches the bypass information from the read-operation bypass information storage unit, where fetching may use the decoding counterpart of the foregoing encoding.
The write-operation bypass information storage unit is used to store the bypass information of write-type memory access requests. For example, a memory access request carries data to be written to the memory, which can be temporarily stored in this unit, while the other information of the request is sent to the control unit as an instruction. After receiving the instruction, the control unit fetches the data corresponding to the instruction from the write-operation bypass information storage unit and writes it into the memory.
Referring to fig. 5, another memory access processing apparatus is provided in an embodiment of the present disclosure, configured to process multiple memory access requests in parallel to a memory, where the memory includes multiple memory areas, and each memory area includes multiple memory units; the memory access processing device comprises:
the arbitration unit 501 is configured to determine multiple first memory access requests for accessing the same target storage area, where different first memory access requests are used for accessing different target storage units in the target storage area.
A control unit 502, configured to request the memory to access the target storage area, so as to write the first data carried by each first access request into a target storage unit in the target storage area.
Optionally, each memory access request carries first identification information of an accessed memory area and second identification information of an accessed memory unit in a corresponding memory area;
the arbitration unit is used for determining the plurality of first memory access requests based on the first identification information;
and the data returning unit is used for determining the target storage unit accessed by each first memory access request based on the second identification information.
Optionally, at least one of the first memory access requests is obtained by splitting a memory access request for accessing multiple consecutive storage units.
Optionally, the arbitration unit is configured to: obtaining a plurality of memory access request groups, wherein each memory access request group comprises a plurality of first memory access requests, and the first memory access requests in different memory access request groups are used for accessing different target storage areas; and arbitrating the priority of each memory access request group so that the control unit requests the memory for the target storage area accessed by the first memory access request in the memory access request group with the highest access priority.
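A minimal sketch of the group priority arbitration described above; the arbitration policy used here (lower storage area id wins) is an assumption for illustration, not the patent's actual policy:

```python
def arbitrate_groups(groups: dict) -> list:
    """groups maps target storage area id -> list of first memory access
    requests in that group. Returns the target areas in the order the
    control unit should service them, highest priority first."""
    return sorted(groups)  # assumed fixed-priority policy
```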
Referring to fig. 6, an embodiment of the present disclosure further provides a processor, where the processor includes the memory access processing apparatus described in any of the foregoing embodiments.
Referring to fig. 7, an embodiment of the present disclosure further provides a chip 700, where the chip 700 includes a processor 701, and the processor 701 may adopt the processor described in any of the above embodiments. In some examples, the chip includes a memory 702, the memory 702 coupled to the processor 701, the memory including a plurality of memory regions, each of the memory regions including a plurality of memory cells. The details of the embodiments of the present disclosure are described in the foregoing embodiments, and are not repeated herein.
In addition, the embodiment of the disclosure also provides a board card, which comprises a packaging structure packaged with at least one chip. Referring to fig. 8, an exemplary board 800 is provided, where the board 800 includes the chip 700 and may include other components, including but not limited to: a memory 802, an interface device 804, and a processor 806.
The memory is connected by a bus to the chip in the chip packaging structure and is used for storing data. The memory may include multiple storage areas, each including multiple storage units, for example: DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory), etc.
The interface device is electrically connected with a chip in the chip packaging structure. The interface device is used for realizing data transmission between the chip and an external device 808 (such as a terminal, a server, a camera and the like). In an embodiment, the interface device may include a PCIE interface, and may also be a network interface, or other interfaces, and the disclosure is not limited thereto. The details of the embodiments of the present disclosure are described in the foregoing embodiments, and are not repeated herein.
Referring to fig. 9A, an embodiment of the present disclosure also provides an electronic device, which includes a chip 901 and a memory 902. In some examples, the electronic device includes a memory 902, and the memory 902 is connected to the processor 9011 of the chip 901. Referring to fig. 9B, an embodiment of the present disclosure further provides another electronic device, which includes a board card 800. The details of the embodiments of the present disclosure are described in the foregoing embodiments, and are not repeated herein.
Referring to fig. 10, an embodiment of the present disclosure further provides a memory access processing method, which may be applied to the processor of the foregoing embodiment, the method is used for processing multiple memory access requests in parallel to a memory, where the memory includes multiple memory areas, and each memory area includes multiple memory units; the method may include:
step 1001, determining a plurality of first memory access requests for accessing the same target memory area. Wherein different first memory access requests are used for accessing different target memory units in the target memory area;
step 1002, requesting the memory for data of each storage unit in the target storage area.
And 1003, after the memory returns the data of each storage unit in the target storage area, returning the data in the corresponding target storage unit for each first memory access request.
Optionally, in the method, an arbitration unit determines a plurality of first memory access requests for accessing the same target storage area, where different first memory access requests are used for accessing different target storage units in the target storage area;
requesting, by a control unit, data of each storage unit in the target storage area from the memory;
and after the memory returns the data of each storage unit in the target storage area, the data return unit returns the data in the corresponding target storage unit for each first memory access request.
Optionally, each memory access request carries first identification information of an accessed memory area and second identification information of an accessed memory unit in a corresponding memory area;
the method further comprises the following steps:
determining, by an arbitration unit, the plurality of first memory access requests based on the first identification information;
and determining the target storage unit accessed by each first memory access request by the data returning unit based on the second identification information.
Optionally, at least one of the first memory access requests is obtained by merging second memory access requests accessing the same storage unit in the same storage area;
the method further comprises the following steps: the arbitration unit sends each first memory access request to the control unit through a plurality of instruction channels; each instruction channel corresponds to one storage unit in the storage area and is used for sending a first memory access request for accessing the storage unit corresponding to the instruction channel.
Optionally, at least one first memory access request and/or at least one second memory access request is obtained by splitting a memory access request for accessing multiple consecutive storage units.
Optionally, the method further includes:
an allocating unit adds indication information to each first memory access request when the plurality of first memory access requests meet a preset condition, and sends each first memory access request carrying the indication information to the control unit; the indication information indicates that the first memory access requests meet the preset condition, and the preset condition includes: each of the plurality of first memory access requests accesses one storage unit, and the thread sending each first memory access request and the storage unit accessed by it satisfy a preset correspondence;
and after the data of each storage unit in the target storage area is returned by the memory, the data return unit returns the data in the corresponding target storage unit to each first memory access request based on the preset corresponding relation.
Optionally, each memory access request includes bypass information, where the bypass information includes a first correspondence between a thread that sends the memory access request and a memory location that is accessed by the memory access request; the method further comprises the following steps:
the allocating unit extracts the bypass information from each first memory access request and sends it to a bypass storage unit for storage;
and after the data of each storage unit in the target storage area is returned by the memory, the data return unit takes the bypass information out of the bypass storage unit, and returns the data in the corresponding target storage unit to each first memory access request based on the taken bypass information.
Optionally, in a case that the first memory access request is obtained by merging a plurality of second memory access requests accessing the same storage unit in the same storage area, the first correspondence relationship of the first memory access request includes: the corresponding relation between the thread corresponding to each second memory access request and the target storage unit accessed by the first memory access request is realized;
under the condition that the first memory access request is obtained by splitting a third memory access request for accessing a plurality of continuous storage units, a first corresponding relation of the first memory access request comprises the following steps: and sending a corresponding relation between the thread of the third memory access request and the memory unit accessed by the first memory access request.
Optionally, the method further includes:
a data return unit acquires a code corresponding to each first memory access request, wherein the code corresponding to one first memory access request is used for determining a target storage unit in a target storage area accessed by the first memory access request; the number of coded bits is less than the total number of memory cells in the target memory region;
and returning the data in the target storage unit accessed by the first memory access request to the first memory access request based on the code corresponding to the first memory access request.
Optionally, the length of the code is determined based on a logarithm of the total number of storage units in the target storage area.
Optionally, the method further includes: and the allocating unit sends the codes corresponding to each first memory access request to the bypass storage unit for storage.
Optionally, the method further includes:
the control unit acquires a plurality of fourth memory access requests, wherein each fourth memory access request in the plurality of fourth memory access requests is a write request and carries first data needing to be written into the target storage unit; and requesting the memory to access the target storage area so as to write the first data carried by the fourth access requests into the storage units in the target storage area.
Optionally, the method further includes:
the allocation unit extracts the first data from each fourth memory access request and then sends the first data to a bypass storage unit for storage;
and after the control unit acquires the first data from the bypass storage unit, writing the first data carried by the fourth memory access requests into the storage units in the target storage area.
Optionally, the method further includes:
obtaining a plurality of memory access request groups by an arbitration unit, wherein each memory access request group comprises a plurality of first memory access requests, and the first memory access requests in different memory access request groups are used for accessing different target storage areas;
and arbitrating the priority of each memory access request group, so that the control unit requests access to the target storage area accessed by the first memory access requests in the memory access request group with the highest priority.
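A sketch of the group arbitration step, under our own assumptions about how groups and priorities are represented (the disclosure does not specify a data layout):

```python
# Hypothetical sketch: first memory access requests are bucketed into
# groups by target storage area; arbitration orders the groups so the
# control unit services the highest-priority group first.

def arbitrate(groups):
    """`groups` maps target area -> (priority, [first requests]).
    Return target areas in descending priority order."""
    return [area for area, (prio, _reqs) in
            sorted(groups.items(), key=lambda kv: kv[1][0], reverse=True)]

order = arbitrate({"A": (1, ["r0"]), "B": (3, ["r1", "r2"]), "C": (2, ["r3"])})
```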
Referring to fig. 11, an embodiment of the present disclosure further provides a memory access processing method, which may be applied to the processor of the foregoing embodiments. The method is used for processing a plurality of memory access requests to a memory in parallel, wherein the memory includes a plurality of storage areas, and each storage area includes a plurality of storage units. The method includes the following steps:
Step 1102: determine a plurality of first memory access requests that access the same target storage area, wherein different first memory access requests are used for accessing different target storage units in the target storage area.
Step 1104: request the memory to access the target storage area, so as to write the first data carried by each first memory access request into the corresponding target storage unit in the target storage area.
Optionally, in the method, an arbitration unit determines a plurality of first memory access requests for accessing the same target storage area, where different first memory access requests are used for accessing different target storage units in the target storage area;
and a control unit requests the memory to access the target storage area, so as to write the first data carried by each first memory access request into a target storage unit in the target storage area.
Optionally, each memory access request carries first identification information of an accessed memory area and second identification information of an accessed memory unit in a corresponding memory area; the method further comprises the following steps:
determining, by an arbitration unit, the plurality of first memory access requests based on the first identification information;
and determining the target storage unit accessed by each first memory access request by the data returning unit based on the second identification information.
Optionally, at least one of the first memory access requests is obtained by splitting a memory access request for accessing a plurality of contiguous storage units.
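The splitting-and-grouping step can be sketched as a small normalization pass. The storage-area geometry (four units per area) and all names are assumptions made for illustration only:

```python
# Hypothetical sketch: raw requests that touch several contiguous units
# are split into single-unit first requests, then grouped by the target
# storage area they fall into.

from collections import defaultdict

UNITS_PER_AREA = 4  # assumed geometry, not specified by the disclosure

def normalize(raw):
    """`raw` is a list of (thread, first_unit, num_units) accesses over a
    flat unit address space. Return a mapping: target storage area ->
    list of (thread, unit offset within that area)."""
    by_area = defaultdict(list)
    for thread, first_unit, num_units in raw:
        for unit in range(first_unit, first_unit + num_units):
            area, offset = divmod(unit, UNITS_PER_AREA)
            by_area[area].append((thread, offset))
    return dict(by_area)

# thread 0 reads units 2..4 (straddles two areas); thread 1 reads unit 5
groups = normalize([(0, 2, 3), (1, 5, 1)])
```

Each resulting group targets a single storage area, so the control unit can service it with one access to the memory.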
Optionally, the method further includes:
obtaining a plurality of memory access request groups by an arbitration unit, wherein each memory access request group includes a plurality of first memory access requests, and the first memory access requests in different memory access request groups are used for accessing different target storage areas; and arbitrating the priority of each memory access request group, so that the control unit requests access to the target storage area accessed by the first memory access requests in the memory access request group with the highest priority.
The embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the memory access processing method described in any of the foregoing embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may, in essence or in part, be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in the present specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding descriptions of the method embodiments. The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and when the embodiments of the present disclosure are implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
The foregoing is merely a detailed description of the embodiments of the present disclosure. It should be noted that those skilled in the art may make modifications and improvements without departing from the principles of the embodiments of the present disclosure, and such modifications and improvements should be considered to fall within the scope of the embodiments of the present disclosure.

Claims (23)

1. A memory access processing device is used for processing a plurality of memory access requests in parallel to a memory, wherein the memory comprises a plurality of memory areas, and each memory area comprises a plurality of memory units; the memory access processing device comprises:
the arbitration unit is used for determining a plurality of first memory access requests for accessing the same target storage area, and different first memory access requests are used for accessing different target storage units in the target storage area;
a control unit, configured to request the memory for data of each storage unit in the target storage area;
and the data return unit is used for, after the memory returns the data of each storage unit in the target storage area, returning the data in the corresponding target storage unit for each first memory access request.
2. The device of claim 1, wherein each memory access request carries first identification information of an accessed memory area and second identification information of an accessed memory unit in a corresponding memory area;
the arbitration unit is used for determining the plurality of first memory access requests based on the first identification information;
and the data returning unit is used for determining the target storage unit accessed by each first memory access request based on the second identification information.
3. The device according to claim 1 or 2,
at least one first memory access request is obtained by merging second memory access requests which access the same storage unit in the same storage area;
the arbitration unit is used for sending each first memory access request to the control unit through a plurality of instruction channels; each instruction channel corresponds to one storage unit in the storage area and is used for sending a first memory access request for accessing the storage unit corresponding to the instruction channel.
4. The apparatus of claim 3,
at least one first memory access request and/or at least one second memory access request is obtained by splitting a memory access request for accessing a plurality of contiguous memory units.
5. The apparatus of claim 4, further comprising a blending unit to:
adding indication information to each first memory access request when the plurality of first memory access requests satisfy preset conditions, and sending each first memory access request carrying the indication information to the control unit; the indication information is used to indicate that the first memory access requests satisfy the preset conditions, and the preset conditions include: each of the plurality of first memory access requests accesses one storage unit, and the thread sending each first memory access request and the storage unit accessed by that request satisfy a preset correspondence;
and the data returning unit is used for returning the data in the corresponding target storage unit to each first memory access request based on the preset corresponding relation after the memory returns the data of each storage unit in the target storage area.
6. The apparatus of any one of claims 1 to 5, wherein each memory access request includes bypass information, the bypass information including a first correspondence between the thread sending the memory access request and the storage unit accessed by the memory access request; the memory access processing device further comprises an allocation unit and a bypass storage unit:
the allocation unit is used for extracting the bypass information from each first memory access request and sending the bypass information to the bypass storage unit for storage;
and the data returning unit is used for taking the bypass information out of the bypass storage unit after the memory returns the data of each storage unit in the target storage area, and returning the data in the corresponding target storage unit to each first memory access request based on the taken bypass information.
7. The apparatus of claim 6,
in the case where the first memory access request is obtained by merging a plurality of second memory access requests for accessing the same storage unit in the same storage area, the first correspondence of the first memory access request includes: a correspondence between the thread corresponding to each second memory access request and the target storage unit accessed by the first memory access request;
in the case where the first memory access request is obtained by splitting a third memory access request for accessing a plurality of contiguous storage units, the first correspondence of the first memory access request includes: a correspondence between the thread that sent the third memory access request and the storage unit accessed by the first memory access request.
8. The apparatus according to any one of claims 1 to 7, wherein the data return unit is further configured to:
acquiring a code corresponding to each first memory access request, wherein the code corresponding to a first memory access request is used to determine the target storage unit, within the target storage area, accessed by that request; the number of bits of the code is less than the total number of storage units in the target storage area;
and returning the data in the target storage unit accessed by the first memory access request to the first memory access request based on the code corresponding to the first memory access request.
9. The apparatus of claim 8, wherein the length of the code is determined based on a logarithm of the total number of storage units in the target storage area.
10. The apparatus according to claim 8 or 9, wherein the allocation unit is further configured to:
and sending the code corresponding to each first memory access request to the bypass storage unit for storage.
11. The apparatus according to any one of claims 1 to 10, wherein the control unit is configured to:
acquiring a plurality of fourth memory access requests, wherein each of the fourth memory access requests is a write request and carries first data to be written into a target storage unit;
and requesting the memory to access the target storage area, so as to write the first data carried by the fourth memory access requests into the storage units in the target storage area.
12. The apparatus of claim 11, further comprising an allocation unit and a bypass storage unit;
the allocation unit is used for extracting the first data from each fourth memory access request and then sending the first data to the bypass storage unit for storage;
and the control unit is used for writing the first data carried by the fourth memory access requests into the memory units in the target memory area after the first data is acquired from the bypass memory unit.
13. The apparatus according to any of claims 1 to 10, wherein the arbitration unit is configured to:
obtaining a plurality of memory access request groups, wherein each memory access request group comprises a plurality of first memory access requests, and the first memory access requests in different memory access request groups are used for accessing different target storage areas;
and arbitrating the priority of each memory access request group, so that the control unit requests access to the target storage area accessed by the first memory access requests in the memory access request group with the highest priority.
14. An access processing apparatus for processing a plurality of access requests in parallel to a memory, the memory comprising a plurality of memory regions, each memory region comprising a plurality of memory locations; the memory access processing device comprises:
the arbitration unit is used for determining a plurality of first memory access requests for accessing the same target storage area, and different first memory access requests are used for accessing different target storage units in the target storage area;
and the control unit is used for requesting the memory to access the target storage area so as to write the first data carried by each first memory access request into the target storage unit in the target storage area.
15. A processor comprising the memory access processing apparatus of any one of claims 1 to 14.
16. A chip, characterized in that the chip comprises the processor of claim 15.
17. The chip of claim 16, wherein the chip comprises a memory, the memory coupled to the processor; the memory includes a plurality of memory regions, each memory region including a plurality of memory cells.
18. A board comprising a package structure in which at least one chip according to claim 16 or 17 is packaged.
19. An electronic device, characterized in that the electronic device comprises a chip according to claim 16 or 17, or a board according to claim 18.
20. The electronic device of claim 19, wherein the electronic device comprises a memory, the memory being coupled to the processor of the chip; the memory includes a plurality of memory regions, each memory region including a plurality of memory cells.
21. A memory access processing method is characterized in that the method is used for processing a plurality of memory access requests in parallel to a memory, the memory comprises a plurality of memory areas, and each memory area comprises a plurality of memory units; the method comprises the following steps:
determining a plurality of first memory access requests for accessing the same target storage area, wherein different first memory access requests are used for accessing different target storage units in the target storage area;
requesting data of each storage unit in the target storage area from the memory;
and after the memory returns the data of each storage unit in the target storage area, returning the data in the corresponding target storage unit for each first memory access request.
22. A memory access processing method is characterized in that the method is used for processing a plurality of memory access requests in parallel to a memory, wherein the memory comprises a plurality of memory areas, and each memory area comprises a plurality of memory units; the method comprises the following steps:
determining a plurality of first memory access requests for accessing the same target storage area, wherein different first memory access requests are used for accessing different target storage units in the target storage area;
and requesting the memory to access the target storage area, so as to write the first data carried by each first memory access request into a target storage unit in the target storage area.
23. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of claim 21 or 22.
CN202210772819.4A 2022-06-30 2022-06-30 Memory access processing device and method, processor, chip, board card and electronic equipment Pending CN115033184A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210772819.4A CN115033184A (en) 2022-06-30 2022-06-30 Memory access processing device and method, processor, chip, board card and electronic equipment

Publications (1)

Publication Number Publication Date
CN115033184A true CN115033184A (en) 2022-09-09

Family

ID=83128272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210772819.4A Pending CN115033184A (en) 2022-06-30 2022-06-30 Memory access processing device and method, processor, chip, board card and electronic equipment

Country Status (1)

Country Link
CN (1) CN115033184A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028388A (en) * 2023-01-17 2023-04-28 摩尔线程智能科技(北京)有限责任公司 Caching method, caching device, electronic device, storage medium and program product
CN116028388B (en) * 2023-01-17 2023-12-12 摩尔线程智能科技(北京)有限责任公司 Caching method, caching device, electronic device, storage medium and program product
CN116737083A (en) * 2023-07-03 2023-09-12 摩尔线程智能科技(北京)有限责任公司 Memory access circuit, memory access method, integrated circuit, and electronic device
CN116719479A (en) * 2023-07-03 2023-09-08 摩尔线程智能科技(北京)有限责任公司 Memory access circuit, memory access method, integrated circuit, and electronic device
CN116820344A (en) * 2023-07-03 2023-09-29 摩尔线程智能科技(北京)有限责任公司 Memory access circuit, memory access method, integrated circuit, and electronic device
CN116594570A (en) * 2023-07-03 2023-08-15 摩尔线程智能科技(北京)有限责任公司 Memory access circuit, memory access method, integrated circuit, and electronic device
CN116719479B (en) * 2023-07-03 2024-02-20 摩尔线程智能科技(北京)有限责任公司 Memory access circuit, memory access method, integrated circuit, and electronic device
CN116594570B (en) * 2023-07-03 2024-03-01 摩尔线程智能科技(北京)有限责任公司 Memory access circuit, memory access method, integrated circuit, and electronic device
CN116737083B (en) * 2023-07-03 2024-04-23 摩尔线程智能科技(北京)有限责任公司 Memory access circuit, memory access method, integrated circuit, and electronic device
CN116820344B (en) * 2023-07-03 2024-04-26 摩尔线程智能科技(北京)有限责任公司 Memory access circuit, memory access method, integrated circuit, and electronic device
CN116909946A (en) * 2023-09-13 2023-10-20 北京开源芯片研究院 Access method, device, electronic equipment and readable storage medium
CN116909946B (en) * 2023-09-13 2023-12-22 北京开源芯片研究院 Access method, device, electronic equipment and readable storage medium
CN117891751A (en) * 2024-03-14 2024-04-16 北京壁仞科技开发有限公司 Memory data access method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination