CN114416397A - Chip, memory access method and computer equipment - Google Patents


Info

Publication number: CN114416397A
Application number: CN202111655195.XA
Authority: CN (China)
Other languages: Chinese (zh)
Legal status: Pending
Prior art keywords: address, thread, group, unit, merging
Inventors: 朱志岐, 李越, 王文强, 何博, 徐宁仪
Assignee (original and current): Shanghai Power Tensors Intelligent Technology Co Ltd
Application filed by Shanghai Power Tensors Intelligent Technology Co Ltd; published as CN114416397A.


Classifications

    • G06F9/544 — Interprogram communication: Buffers; Shared memory; Pipes (G — Physics; G06F — Electric digital data processing; G06F9/46 — Multiprogramming arrangements)
    • G06F9/5016 — Allocation of resources, the resource being the memory (G06F9/50 — Allocation of resources, e.g. of the central processing unit [CPU], to service a request)

Abstract

The disclosure provides a chip, a method for accessing memory, and a computer device. The chip comprises a request sending unit and a data return unit. The request sending unit acquires the memory access addresses of the threads in multiple thread groups, merges those addresses, sends memory access requests to the corresponding channels in shared memory based on the merged addresses, and sends merging information describing how the addresses were merged to the data return unit. The data return unit acquires the data returned by the shared memory and returns it to the corresponding threads based on the merging information. By merging the memory access addresses of threads across multiple thread groups, the chip reduces repeated requests to the same address, thereby reducing the number of memory access requests issued to the shared memory and easing the load on the memory access system.

Description

Chip, memory access method and computer equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a chip, a method for accessing a memory, and a computer device.
Background
Hardware such as graphics processing units (GPUs) and artificial intelligence (AI) chips faces high demands on computational throughput and speed, so processor computing power is commonly increased through multi-threaded designs that process data in parallel. These parallel threads are numerous and generate a correspondingly large number of memory access requests. It is therefore necessary to optimize the memory access scheme under multi-threaded conditions to improve multi-threaded memory access efficiency.
Disclosure of Invention
To overcome the problems in the related art, embodiments of the present disclosure provide a chip, a method for accessing a memory, and a computer device, so as to solve the defects in the related art.
According to a first aspect of embodiments of the present disclosure, there is provided a chip, the chip comprising: a request sending unit and a data returning unit; the request sending unit is used for acquiring the memory access addresses of the threads in the multiple thread groups, merging the memory access addresses, sending memory access requests to corresponding channels in the shared memory based on the merged memory access addresses, and sending merging information representing the merging mode of the memory access addresses to the data returning unit; and the data return unit is used for acquiring the data returned by the shared memory and returning the data to the corresponding thread based on the merging information.
Optionally, the merging information includes intra-group merging information and inter-group merging information, where the intra-group merging information is used to represent a merging manner of memory addresses of threads in the same thread group, and the inter-group merging information is used to represent a merging manner of memory addresses of threads between different thread groups.
Optionally, the request sending unit includes: an intra-group merging unit for merging the memory access addresses of the threads in the same thread group to obtain the intra-group merging information; and an inter-group merging unit for merging the memory access addresses of threads across different thread groups to obtain the inter-group merging information.
Optionally, the intra-group merging information includes an intra-group memory access merge identifier, which indicates, for a given thread, whether the memory access address of each thread in its thread group has been merged with that thread's own address. The intra-group merging unit includes: a comparator array for comparing the memory access addresses of the threads in the same thread group and generating a first address equality identifier for each thread according to the comparison results, where a thread's first address equality identifier indicates whether the memory access address of each thread in its thread group is the same as its own; and an AND-gate array for ANDing the thread valid identifier with the first address equality identifier to obtain the intra-group memory access merge identifier, where the thread valid identifier indicates whether the memory access address of each thread in the thread group is valid.
Optionally, the inter-group merging information includes an inter-group memory access merge identifier, which uniquely identifies each thread group participating in inter-group merging. The inter-group merging unit includes a channel address cache control unit comprising a plurality of address cache units, each corresponding to one channel. The channel address cache control unit writes the memory access addresses merged by the intra-group merging unit into the address cache unit of the corresponding channel and merges identical cached addresses within the same address cache unit.
Optionally, the address cache unit is further configured to cache a second address equality identifier, which indicates whether each address in the address cache unit is the same as a target address. The channel address cache control unit is specifically configured to merge an address with the target address when the second address equality identifier corresponding to that address indicates they are the same.
Optionally, the address cache unit is further configured to cache an effective address identifier, which indicates whether each memory access address cached in the address cache unit is valid. The channel address cache control unit is specifically configured to: determine a target address from among the first memory access addresses in the address cache unit, i.e. those whose effective address identifiers indicate that they are valid; and, once the memory access request corresponding to the target address has been sent to the shared memory unit, modify the effective address identifier corresponding to the target address to indicate that the address is invalid.
Optionally, the channel address cache control unit is specifically configured to send, in each transmission, the memory access requests corresponding to the target addresses of the plurality of address cache units to the shared memory.
Optionally, the channel address cache control unit is specifically configured to transmit memory access addresses to the shared memory when a transmission condition is met. The transmission conditions include: the address cache unit corresponding to any channel is full; the address cache units corresponding to all idle channels are non-empty; or no new memory access address merged by the intra-group merging unit is being written into the address cache units.
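The behavior of one address cache unit described above — merging an incoming address with an identical cached valid address, selecting a valid target address, and clearing its effective address identifier once its request has been issued — can be illustrated with a minimal sketch. This is a hypothetical Python model with assumed names and a fixed capacity, not the patent's hardware:

```python
class AddressCacheUnit:
    """Illustrative model of a per-channel address cache unit."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = []  # each slot: {"addr": int, "valid": bool}

    def write(self, addr: int) -> int:
        """Write an address, merging it with an identical valid cached
        address if one exists; return the slot index it occupies."""
        for i, slot in enumerate(self.slots):
            # "second address equality identifier": address already cached
            if slot["valid"] and slot["addr"] == addr:
                return i  # merged; no new slot consumed
        assert len(self.slots) < self.capacity, "address cache unit full"
        self.slots.append({"addr": addr, "valid": True})
        return len(self.slots) - 1

    def pop_target(self):
        """Pick a valid address as the target address and clear its
        effective address identifier (the request has been issued)."""
        for slot in self.slots:
            if slot["valid"]:
                slot["valid"] = False
                return slot["addr"]
        return None  # no valid address left to send

    def is_full(self) -> bool:
        return len(self.slots) >= self.capacity
```

A write of a duplicate address therefore costs no extra slot, and each cached address is sent to the shared memory at most once.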
Optionally, the intra-group merging information includes a thread channel mapping identifier, which indicates, for each thread in a thread group, the channel to which its memory access address corresponds. The request sending unit further includes an address splitting unit for splitting the memory access addresses out of the memory access request of each thread group and generating the thread channel mapping identifier.
Optionally, the data return unit includes: the information cache unit is used for receiving and caching the merging information; and the broadcast control unit is used for returning the data returned by the shared memory to the corresponding thread based on the merging information cached by the information caching unit.
Optionally, the information caching unit includes: the intra-group information caching unit is used for receiving and caching the intra-group merging information from the request sending unit; and an inter-group information caching unit for receiving and caching the inter-group merging information from the request sending unit.
Optionally, the broadcast control unit is specifically configured to: first perform inter-group broadcasting according to the inter-group merging information, so as to return the data returned by the shared memory to each thread group; and then perform intra-group broadcasting according to the intra-group merging information, so as to return that data to each thread within the same thread group.
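The two-stage broadcast can be sketched as follows. The data structures here (a list of group ids as the inter-group merging information, a per-group bit string as the intra-group merge identifier) are assumptions chosen for illustration, not the patent's actual encoding:

```python
def broadcast(data, inter_group_info, intra_group_info, num_threads):
    """Return {(group, thread): data} deliveries for one returned datum.

    inter_group_info: iterable of thread-group ids whose requests were
        merged into the single memory access for this datum.
    intra_group_info: {group_id: bit_string}, where character i of the
        bit string is "1" if thread i of that group should receive it.
    """
    delivered = {}
    for group in inter_group_info:            # inter-group broadcast
        bits = intra_group_info[group]
        for thread in range(num_threads):     # intra-group broadcast
            if bits[thread] == "1":
                delivered[(group, thread)] = data
    return delivered
```

One datum fetched once from the shared memory is thus fanned out first to every merged thread group and then to every merged thread inside each group.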
Optionally, the data return unit further includes a data return cache unit for caching the data to be returned to each thread and, once every thread in a thread group has received its returned data, writing each thread's data back to the thread's corresponding register file in the processor.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for accessing a memory, which is applied to a chip of any one of the embodiments of the present disclosure, the method including: acquiring the memory access addresses of the threads in the multiple thread groups through a request sending unit in the chip, merging the memory access addresses, sending memory access requests to corresponding channels in the shared memory based on the merged memory access addresses, and sending merging information representing the merging mode of the memory access addresses to the data return unit; and acquiring the data returned by the shared memory through a data return unit in the chip, and returning the data to the corresponding thread based on the merging information.
According to a third aspect of embodiments of the present disclosure, there is provided a computer device comprising the chip of any of the embodiments of the present disclosure.
The chip can combine the memory access addresses of the threads in the multiple thread groups, and reduces repeated memory access requests for the same address, so that the number of the memory access requests sent to the shared memory is reduced, and the pressure of a memory access system is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Fig. 1 is a block diagram of a chip of an embodiment of the disclosure.
Fig. 2 is a detailed structural diagram of a chip of an embodiment of the disclosure.
Fig. 3 is a flow chart of a method for intra-group merging in a chip according to an embodiment of the disclosure.
Fig. 4 is a schematic diagram of a method for performing intra-group merging in a chip according to an embodiment of the disclosure.
Fig. 5 is a schematic diagram of an address cache unit in a chip writing a memory access address according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of an address cache unit in a chip reading a memory access address according to an embodiment of the disclosure.
Fig. 7 is a flowchart of a method of accessing memory according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
Fig. 1 is a block diagram of a chip of an embodiment of the disclosure. As shown in fig. 1, an embodiment of the disclosure provides a chip 100 for accessing a memory in a multi-threaded environment. The chip 100 includes:
the request sending unit 110 is configured to obtain memory access addresses of threads in multiple thread groups, merge the memory access addresses, send a memory access request to a corresponding channel in the shared memory 300 based on the merged memory access addresses, and send merge information representing a merge manner of the memory access addresses to the data returning unit 120;
a data returning unit 120, configured to obtain data returned by the shared memory 300, and return the data to a corresponding thread based on the merging information.
In the embodiment of the present disclosure, one or more memory access requests may be sent to the chip 100 by the processor 200, and access to the shared memory 300 is implemented through the chip 100. In the present disclosure, the processor 200 refers to a computing component that interprets computer instructions and processes data in computer software; it may be a central processing unit (CPU), a graphics processor, or an artificial intelligence chip, and the present disclosure is not limited in this regard. The processor 200 may be a separate processor chip, independent of the chip 100, connected to it directly or indirectly through inter-chip circuitry to transfer memory access requests and returned data. Alternatively, the processor 200 may be a processing unit integrated on the same chip as the chip 100 and connected to it directly or indirectly through on-chip circuitry for the same purpose.
The chip 100 of the disclosed embodiment is capable of receiving and processing memory access requests from a plurality of processes of a plurality of processors 200. In a multi-threaded environment, the various threads contained within a process may be divided into one or more thread groups.
The shared memory 300 may be used in a multiprocessor computer system and accessed by different processors. It may be implemented based on physical memory, based on file mapping, or in other manners; the implementation of the shared memory 300 is not limited in this disclosure. The shared memory 300 accessed by the chip 100 may reside on another chip, independent of the chip 100, which the chip 100 accesses directly or indirectly through inter-chip circuitry. Alternatively, it may be a shared memory unit integrated on the same chip as the chip 100, which the chip 100 accesses directly or indirectly through on-chip circuitry.
In the present disclosure, the shared memory 300 may include one or more channels. A channel here is a storage unit instantiated in the shared memory 300. Different channels correspond to different memory access addresses, so the channel to be accessed can be determined from the address carried in a memory access request and the data stored in that channel can be retrieved. In a single-channel shared memory, different memory access addresses access different groups of data bits within the same channel (e.g., the upper 16 bits or the lower 16 bits of the channel); in a multi-channel shared memory, different addresses may access different channels or different groups of data bits within the same channel. In the present disclosure, different channels of a multi-channel shared memory can be accessed simultaneously to improve access efficiency.
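The disclosure does not fix a concrete address-to-channel mapping. A minimal sketch, assuming a word-interleaved mapping over a hypothetical four-channel shared memory (both the channel count and the access width are assumptions for illustration):

```python
NUM_CHANNELS = 4   # assumed channel count
WORD_BYTES = 4     # assumed access width in bytes

def channel_of(addr: int) -> int:
    """Channel selected by a memory access address under an assumed
    word-interleaved mapping: consecutive words go to consecutive
    channels, wrapping around."""
    return (addr // WORD_BYTES) % NUM_CHANNELS
```

Under such an interleaved mapping, threads accessing consecutive words naturally spread across channels, which is what allows the channels to be accessed simultaneously.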
In a thread group containing multiple threads, there are often threads that need to access the same address. The chip 100 of the embodiment of the present disclosure merges the memory access addresses of threads across multiple thread groups in the request sending unit 110, reducing repeated requests to the same address, thereby reducing the number of memory access requests issued to the shared memory 300 and easing the load on the memory access system. Because the requests are merged in the request sending unit 110, the data returning unit 120 must be able to determine which threads requested the data returned by the shared memory unit. The request sending unit 110 therefore records the manner of merging as merging information while merging the addresses, and sends that information to the data returning unit 120, which can then broadcast the returned data to the corresponding threads so that every thread receives the correct returned data.
Fig. 2 is a detailed structural diagram of a chip of an embodiment of the disclosure. As shown in fig. 2, the request sending unit 110 includes:
an intra-group merging unit 112, configured to merge the memory access addresses of each thread in the same thread group to obtain the intra-group merging information;
and the inter-group merging unit 113 is configured to merge memory addresses of threads among different thread groups to obtain the inter-group merging information.
Threads in the same thread group may have identical memory access addresses, and such addresses can be merged; this merging mode is called intra-group merging. Threads in different thread groups may likewise have identical memory access addresses, and these can also be merged; this merging mode is called inter-group merging. Correspondingly, the recorded merging information is divided into two types: intra-group merging information, which describes how the memory access addresses of threads within the same thread group were merged, and inter-group merging information, which describes how the memory access addresses of threads across different thread groups were merged.
Correspondingly, the request sending unit 110 of the chip 100 includes the intra-group merging unit 112 and the inter-group merging unit 113 to implement intra-group and inter-group merging of memory access addresses, respectively. In the chip 100, intra-group merging may be performed before inter-group merging: the intra-group merging unit 112 merges the memory access addresses of the threads in the same thread group and sends the merged addresses to the inter-group merging unit 113, which then merges the addresses across different thread groups. After inter-group merging, the inter-group merging unit 113 generates memory access requests based on the merged addresses and sends them to the shared memory 300 to access the corresponding channels. While merging, the intra-group merging unit 112 records the intra-group merging information and the inter-group merging unit 113 records the inter-group merging information, and each sends its information to the data returning unit 120.
FIG. 3 is a flow chart of a method for intra-group merging in a chip according to an embodiment of the disclosure. As shown in fig. 3, the method of intra-group merging performed in the chip 100 of the embodiment of the present disclosure includes:
step S301: compare the memory access addresses of the threads in the same thread group, and generate each thread's first address equality identifier according to the comparison results;
step S302: AND the thread valid identifier with the first address equality identifier to obtain the intra-group memory access merge identifier.
In step S301, a thread's first address equality identifier indicates whether the memory access address of each thread in its thread group is the same as that thread's own address. In an embodiment, for a thread group containing m threads, an m-bit identifier may be used as the first address equality identifier. Each bit corresponds to one thread in the group, and different bit values indicate whether that thread's memory access address is the same as, or different from, the address of the thread the identifier belongs to. For example, a bit value of "1" indicates that the corresponding thread's address is the same, and "0" that it is different. As an example, in a thread group comprising four threads (thread 0, thread 1, thread 2, and thread 3) whose memory access addresses are address 0, address 1, address 0, and address 2 in sequence, comparing the addresses of threads 0 through 3 against thread 0 yields 1010 as the first address equality identifier of thread 0.
In the embodiment of the present disclosure, the pairwise comparison may start from the lowest-numbered thread: the memory access address of each thread in the group is compared in turn, and the first address equality identifier of each thread is obtained in sequence. In particular, a higher-numbered thread need not be compared again against lower-numbered threads; the bits corresponding to lower-numbered threads in its first address equality identifier simply default to 0. For example, in the four-thread group above, when obtaining the first address equality identifier of thread 2, no comparison with thread 0 or thread 1 is performed; even though thread 2 accesses the same address as thread 0, the bit for thread 0 in thread 2's identifier is recorded as the default 0, i.e., thread 2's first address equality identifier is 0010.
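The comparison scheme above — each thread compared only against itself and higher-numbered threads, with positions for lower-numbered threads defaulting to 0 — can be modeled in a few lines. This is an illustrative sketch of the comparator array's output, not the hardware itself:

```python
def first_addr_equal_ids(addrs):
    """For each thread i, build an m-bit string in which bit j is '1'
    when j >= i and thread j's memory access address equals thread i's;
    bits for lower-numbered threads (j < i) default to '0'."""
    ids = []
    for i, a in enumerate(addrs):
        bits = "".join(
            "1" if j >= i and b == a else "0" for j, b in enumerate(addrs)
        )
        ids.append(bits)
    return ids
```

For the four-thread example with addresses 0, 1, 0, 2 this reproduces the identifiers 1010, 0100, 0010, and 0001 given in the text.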
In step S302, the thread valid identifier indicates whether the memory access address of each thread in the thread group is valid. During intra-group merging, a valid memory access address is one that has not yet been merged with the addresses of other threads in the group and therefore may still be merged; an invalid address is one that has already been merged with another thread's address. For a thread group containing m threads, an m-bit identifier may be used as the thread valid identifier, with each bit corresponding to one thread's memory access address; different bit values indicate that the corresponding address is valid or invalid. In one embodiment, "1" indicates that the address is valid and "0" that it is invalid. In embodiments of the present disclosure, every bit of the thread valid identifier may be initialized to valid when intra-group merging begins.
ANDing a thread's valid identifier with its first address equality identifier yields that thread's intra-group memory access merge identifier. A thread's intra-group memory access merge identifier indicates whether the memory access address of each thread in its thread group has been merged with its own address. For a thread group containing m threads, an m-bit identifier may be used as the intra-group memory access merge identifier, with each bit corresponding to one thread in the group. For any thread A in the group, each bit of thread A's merge identifier indicates, by its value, whether the address of the corresponding thread has been merged with thread A's address.
In an embodiment, a value of "1" may indicate that the memory access address of the thread corresponding to the bit has been merged with thread A's address, and "0" that it has not. In this disclosure, the intra-group merging unit 112 may merge the threads in the group sequentially in thread-number order; before merging the address of a later-numbered thread, the bits of the thread valid identifier corresponding to threads that have already been merged must be cleared, i.e., the positions of already-merged threads are marked invalid.
For example, in a thread group containing thread 0, thread 1, thread 2, and thread 3 whose memory access addresses are address 0, address 1, address 0, and address 2 in sequence, the first address equality identifiers of the four threads are 1010, 0100, 0010, and 0001, respectively, and the thread valid identifier is initialized to 1111. In the first intra-group merge, the thread valid identifier is ANDed with thread 0's first address equality identifier, i.e., 1111 & 1010, and the result 1010 is the intra-group memory access merge identifier corresponding to thread 0. In the second merge, the valid bits of the threads merged in the first round are removed from the thread valid identifier, giving 0101, which is ANDed with thread 1's first address equality identifier; the result 0100 is the intra-group memory access merge identifier corresponding to address 1. Following the same steps, the intra-group memory access merge identifier 0001 corresponding to address 2 is obtained.
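The AND-based merging in this worked example can be reproduced with a short sketch (an illustrative model of the AND-gate array and valid-mask update, not the hardware itself):

```python
def intra_group_merge(first_equal_ids):
    """Sequentially AND the thread valid identifier with each thread's
    first address equality identifier (in thread-number order) and
    return the resulting intra-group memory access merge identifiers."""
    n = len(first_equal_ids)
    valid = (1 << n) - 1                      # thread valid identifier: all '1'
    merge_ids = []
    for eid in first_equal_ids:
        mid = valid & int(eid, 2)             # AND-gate array
        merge_ids.append(format(mid, "0{}b".format(n)))
        valid &= ~mid                         # apply access mask (~merge id)
    return merge_ids
```

For the identifiers 1010, 0100, 0010, 0001 of the example, this yields 1010, 0100, 0000, 0001 — including the all-zero identifier for thread 2, whose address was already merged with thread 0's.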
It is worth noting that the intra-group access merge identifier obtained for the memory access address of thread 2 in the third intra-group merge is 0000, which means that the memory access address of thread 2 was already merged with those of other threads in an earlier merge; therefore, the memory access addresses obtained after intra-group merging need not include the memory access address of thread 2 again, which would generate a duplicate memory access request. The intra-group access merge identifiers completely express the correspondence between each memory access address obtained after merging in the intra-group merging unit 112 and each thread in the thread group, as well as the manner in which the memory access addresses of the threads were merged; therefore, the intra-group access merge identifiers may be used as the intra-group merging information in the embodiment of the present disclosure.
In step S302, before the thread valid identifier is ANDed with the first address equality identifiers of the threads other than the first thread in the thread group, the valid bits of the merged threads need to be removed from the thread valid identifier. In an embodiment, this may be done in a first way: the thread valid identifier is ANDed with the access mask of the previous merge, and the value of the thread valid identifier is updated to the result. The access mask of each merge is obtained by negating the intra-group access merge identifier produced by that merge; it represents the threads in the thread group other than those whose memory access addresses were merged in that merge. Removing the threads merged in earlier merges from the thread valid identifier thus yields the threads in the thread group that remain unmerged after each merge.
For example, in the above example, the access mask of the first merge is 0101; ANDing it with the thread valid identifier, that is, 0101&1111, gives 0101 as the thread valid identifier for the second merge. Similarly, the access mask of the second merge is the negation of 0100, i.e., 1011, so the thread valid identifier for the third merge is 1011&0101, i.e., 0001.
In an embodiment, the processing may also follow a second manner: the thread valid identifier is not updated at each merge; instead, the access mask of each merge is obtained by negating the intra-group access merge identifier produced by that merge and ANDing the result with the access mask of the previous merge. At each merge, the thread valid identifier is ANDed with the access mask of the previous merge, and that result is then ANDed with the first address equality identifier to obtain the intra-group access merge identifier.
For example, in the above example, the thread valid identifier at every merge is 1111, and the access mask of the first merge is 0101. The intra-group access merge identifier of the second merge is then 1111&0101&0100, i.e., 0100, and the access mask of the second merge is the negation of 0100 ANDed with 0101, i.e., 0001. The embodiment of the present disclosure does not limit the specific manner in which the valid bits of merged threads are removed from the thread valid identifier; either of the two manners above, or another manner, may be used.
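The two manners above can be checked against the running example with a short sketch (Python purely for illustration; the hardware uses comparators and gate arrays, and all names below are hypothetical). Both strategies yield the same intra-group access merge identifiers:

```python
W = 0b1111  # 4-bit thread group: bits are threads 0..3, MSB first

def inv(x):
    """Bitwise NOT within the 4-bit width (negation of an identifier)."""
    return ~x & W

# First address equality identifiers for addresses 0, 1, 0, 2 (from the example)
eq = [0b1010, 0b0100, 0b0010, 0b0001]

# First way: update the thread valid identifier after every merge
valid, flags_a = W, []
for e in eq:
    m = valid & e      # intra-group access merge identifier for this merge
    flags_a.append(m)
    valid &= inv(m)    # remove the just-merged threads' valid bits

# Second way: keep the valid identifier fixed, accumulate the access mask
mask, flags_b = W, []
for e in eq:
    m = W & mask & e   # valid identifier AND previous mask AND equality id
    flags_b.append(m)
    mask &= inv(m)     # fold this merge's negated identifier into the mask
```

Both lists come out as [1010, 0100, 0000, 0001], matching the identifiers derived in the text, with thread 2's all-zero identifier marking it as already merged.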
The process by which the intra-group merging unit merges identical memory access addresses within the same thread group to obtain the intra-group access merge identifiers may be as shown in fig. 4. The valid bits of merged threads are removed from the thread valid identifier of the thread group in the second manner, and the result of that removal is referred to as the actual thread valid identifier of the thread group. In fig. 4, adr_equal_flag_m represents the first address equality identifier of thread m in the thread group, cls_flg_m represents the intra-group access merge identifier of thread m, tm_req represents the thread valid identifier of the thread group, mask_m represents the access mask of thread m, and {tm_req & mask_{m-1}} represents the actual thread valid identifier of the thread group obtained by ANDing the thread valid identifier of the thread group with the access mask of thread m-1.
First, for ease of understanding, the address comparison process can be represented by a two-dimensional matrix built from the memory access addresses of the threads in the same thread group. The addresses of all threads are arranged in order along both the horizontal and vertical directions of the matrix. The values on the diagonal are set to 1, the values below and to the left of the diagonal are set to 0, and each value above and to the right of the diagonal is set according to whether the memory access address corresponding to its row is the same as the memory access address corresponding to its column. The value formed by each row of the final matrix is the first address equality identifier of the thread corresponding to that row. After the first address equality identifier of each thread is obtained, the thread valid identifier is ANDed with the first address equality identifier of thread 0, merging the memory access addresses in the thread group that are the same as that of thread 0 to obtain the intra-group access merge identifier of thread 0; the access mask of thread 0 is obtained by negating the intra-group access merge identifier of thread 0.
Then, the thread valid identifier of the thread group is ANDed with the access mask of thread 0 to obtain the actual thread valid identifier of thread 1; the actual thread valid identifier of thread 1 is ANDed with the first address equality identifier of thread 1, merging the memory access addresses of the threads in the thread group that are the same as that of thread 1 to obtain the intra-group access merge identifier of thread 1; and the negation of the intra-group access merge identifier of thread 1 is ANDed with the access mask of thread 0 to obtain the access mask of thread 1. In a similar manner, for each subsequent thread, the thread valid identifier of the thread group is ANDed with the access mask of the previous thread to obtain the actual thread valid identifier of that thread, which is then ANDed with its first address equality identifier, merging the memory access addresses in the thread group that are the same as that thread's to obtain its intra-group access merge identifier; the negation of that identifier is ANDed with the access mask of the previous thread to obtain the access mask of that thread. Finally, the intra-group merge identifier and access mask of every thread in the thread group are obtained.
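The two-dimensional matrix construction described above can be sketched as follows (an illustrative Python model; the function name and MSB-first bit ordering are assumptions, and the real comparison is performed by the comparator array):

```python
def first_addr_equal_flags(addrs):
    """Build the comparison matrix: diagonal entries are 1, entries below-left
    of the diagonal are 0, and each entry above-right is 1 when the row's
    address equals the column's address. Each row, read MSB-first, is that
    thread's first address equality identifier."""
    n = len(addrs)
    rows = []
    for i in range(n):
        row = 0
        for j in range(n):
            if j == i or (j > i and addrs[j] == addrs[i]):
                row |= 1 << (n - 1 - j)
        rows.append(row)
    return rows
```

For the running example (addresses 0, 1, 0, 2) this reproduces the identifiers 1010, 0100, 0010, and 0001.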
As shown in fig. 2, the intra-group merging unit 112 in the embodiment of the present disclosure may include a comparator array 1121 and an AND gate array 1122. The comparator array 1121 is used to implement step S301 in fig. 3. The comparator array 1121 is an integrated circuit composed of a plurality of comparators; each comparator is an electronic component that compares two input signals, outputting 1 when they are equal and 0 when they are not. A comparator may be a single integrated component with this function, or an integrated circuit with this function built from more basic electronic components; this is not limited in the present disclosure. The AND gate array 1122 is used to implement step S302 in fig. 3. The AND gate array 1122 is an integrated circuit including a plurality of AND gates, where an AND gate is an electronic component that performs an AND operation on its input signals and outputs the result. In the embodiment of the present disclosure, the AND gate array 1122 may also include NOT gates in addition to AND gates, used to negate the intra-group access merge identifiers when the access masks are generated.
As shown in fig. 2, the inter-group merging unit 113 in the embodiment of the present disclosure may include a channel address cache control unit 1131, which includes a plurality of address cache units, each corresponding to one channel. The channel address cache control unit 1131 is configured to write the memory access addresses merged by the intra-group merging unit 112 into the address cache units of the corresponding channels, and to merge identical cached addresses within the same address cache unit. In the embodiment of the present disclosure, if the number of channels is k, k address cache units are provided in the channel address cache control unit 1131 of the chip 100. Upon receiving a memory access address merged by the intra-group merging unit 112, the channel address cache control unit 1131 determines, according to that address, the corresponding channel in the shared memory 300, and then writes the address into the corresponding address cache unit. The same address cache unit may receive multiple memory access addresses that come from different thread groups but access the same address; these may be further merged in the inter-group merging unit 113, so that by eliminating identical memory access addresses across adjacent thread groups, the memory access requests sent by the request sending unit 110 to the shared memory 300 are further reduced.
In one embodiment, the address cache unit may be a FIFO (First In First Out) memory. A FIFO memory has the property that data written later cannot be read out earlier than data written earlier, which ensures that the memory access request corresponding to an earlier-written memory access address is processed in time. A FIFO memory reads and writes data based on a write pointer and a read pointer. When the channel address cache control unit 1131 writes memory access addresses into the address cache unit, they are written in sequence starting from the position of the write pointer; after the write, the address cache unit moves the write pointer to the next writable position according to the number of addresses written, that is, the write pointer is incremented by the number of addresses written. When the channel address cache control unit 1131 reads a memory access address from the address cache unit, the target address read out is the address pointed to by the read pointer. When the address cache unit reads the target address, the other cached memory access addresses in the unit that are the same as the target address are read out simultaneously, thereby merging identical cached addresses within the address cache unit. After the target address is read, the address cache unit moves the read pointer to the next readable address, that is, the next address in the unit that differs from the target address just sent. If no readable address remains in the address cache unit, the read pointer is moved to the position of the write pointer. When the read pointer is at the position of the write pointer, the address cache unit has no address to read.
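The pointer behavior described above can be modeled as follows (a minimal Python sketch of a FIFO with the stated write-pointer and read-pointer semantics; the class and field names are hypothetical, and same-address merging on read is omitted here):

```python
class AddrFifo:
    """FIFO address cache unit: the write pointer advances by the number of
    addresses written; the read pointer returns the oldest unread address."""
    def __init__(self, depth):
        self.buf = [None] * depth
        self.wr = 0  # write pointer
        self.rd = 0  # read pointer

    def write(self, addrs):
        for a in addrs:
            self.buf[self.wr] = a
            self.wr += 1  # incremented once per address written

    def read(self):
        if self.rd == self.wr:  # read pointer at write pointer: nothing to read
            return None
        a = self.buf[self.rd]   # target address is where the read pointer points
        self.rd += 1
        return a
```

Addresses come back strictly in write order, so an earlier-written request is never starved by a later one.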
In an embodiment, the address cache unit may further cache second address equality identifiers. A second address equality identifier characterizes each cached address in the address cache unit that is the same as the target address, that is, the address pointed to by the read pointer. For example, suppose address 0, address 1, address 2, and address 3 are written into one address cache unit, where address 0, address 1, and address 3 are the same address and address 2 is a different address; when the read pointer points to address 0, the second address equality identifiers corresponding to address 0, address 1, address 2, and address 3 are 1, 1, 0, and 1, respectively. When the address cache unit reads the target address, the cached addresses whose second address equality identifier is 1 are read out simultaneously, thereby merging identical cached addresses within the unit. After the target address is read out, the next readable address is an address whose second address equality identifier is 0. In the above example, address 0, address 1, and address 3 are read out simultaneously. It should be noted that when the read pointer moves to the next readable address, the target address changes to that address, and all the second address equality identifiers cached in the address cache unit must be updated according to the changed target address. In an embodiment, the second address equality identifier may instead characterize whether a later address in the address cache unit is the same as the address immediately before it, that is, only consecutive addresses are compared. For example, in the address cache unit of the above example, the second address equality identifiers corresponding to address 0, address 1, address 2, and address 3 would then be 0, 1, 0, and 0, respectively.
In this case, when the address cache unit reads the target address, if the second address equality identifier of the next address is 1, that address is read out at the same time, and the address after it is examined in turn; all consecutive addresses whose second address equality identifier is 1 are read out simultaneously, stopping when an address whose identifier is 0 is reached, thereby merging identical cached addresses within the unit. After these addresses are read, the next readable address is the address at which reading stopped, whose second address equality identifier is 0. In the above example, address 0 and address 1 are read out simultaneously, but address 3 is not. It should be noted that although this method may read fewer addresses at a time than the former method, it does not require updating the second address equality identifiers in the address cache unit after each read; the identifiers only need to be recorded when the addresses are written into the unit.
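The consecutive-comparison variant can be sketched like this (illustrative Python; the flag is recorded once at write time and a read drains the run of consecutive equal addresses — the function names are assumptions):

```python
def write_with_eq_flags(buf, flags, addrs):
    """Append addresses; each new address's second address equality flag is 1
    only when it equals the address written immediately before it."""
    for a in addrs:
        flags.append(1 if buf and a == buf[-1] else 0)
        buf.append(a)

def read_run(buf, flags, rd):
    """Read the target address plus every consecutive following address whose
    flag is 1; return the merged positions and the new read pointer."""
    merged = [rd]
    i = rd + 1
    while i < len(buf) and flags[i] == 1:
        merged.append(i)
        i += 1
    return merged, i
```

Replicating the example (addresses 0, 1, and 3 equal, address 2 different) yields flags 0, 1, 0, 0; a read at position 0 merges positions 0 and 1 and leaves position 3 unread, just as the text describes.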
In an embodiment, the address cache unit may further cache valid address identifiers. A valid address identifier characterizes each cached address in the address cache unit that has not yet been sent. When a memory access address merged within a group is written into an address cache unit in the channel address cache control unit 1131, a valid address identifier is generated for it and set to 1. When the address cache unit reads the target address, the valid address identifiers of all cached addresses read out simultaneously, that is, those identical to the target address, are set to 0, thereby merging identical cached addresses within the unit. When the target address is read, some of the addresses read out with it may be cached after the position of the read pointer; to prevent such cached addresses from issuing duplicate requests, a valid address identifier of 0 marks them as already sent, so that the target address read by the address cache unit is always a cached address whose valid address identifier is 1. After the target address is read, the next readable address should likewise be a cached address whose valid address identifier is 1. For example, suppose address 0, address 1, address 2, and address 3 are written into one address cache unit, where address 0, address 1, and address 3 are the same address and address 2 is a different address. When the read pointer points to address 0, the valid address identifiers corresponding to address 0, address 1, address 2, and address 3 are 1, 1, 1, and 1, respectively; after address 0, address 1, and address 3 are read out simultaneously, the read pointer moves to address 2, and the valid address identifiers corresponding to address 0, address 1, address 2, and address 3 become 0, 0, 1, and 0.
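A sketch of the valid-address-identifier scheme (illustrative Python under the same assumptions as the sketches above; the helper name is hypothetical):

```python
def read_merged(buf, valid, rd):
    """Read the target address (at the read pointer) plus every cached copy of
    it anywhere in the unit, clear their valid address identifiers so none of
    them re-issues a request, and return the merged positions together with
    the next readable position (the first entry whose identifier is still 1)."""
    target = buf[rd]
    merged = [i for i in range(len(buf)) if valid[i] and buf[i] == target]
    for i in merged:
        valid[i] = 0
    nxt = next((i for i in range(rd, len(buf)) if valid[i]), len(buf))
    return merged, nxt
```

On the example above (positions 0, 1, and 3 holding the same address), a read at position 0 merges positions 0, 1, and 3, clears their identifiers, and moves the read pointer to position 2.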
Fig. 5 is a schematic diagram of writing memory access addresses into an address cache unit in a chip according to an embodiment of the present disclosure. In fig. 5, from bottom to top are the first, second, third, and fourth positions of the address cache unit; these are four consecutive positions, and the address cache unit may include other positions (not shown) before the first position or after the fourth position. From left to right, the columns represent the written memory access address, the second address equality identifier corresponding to the address at each position, and the valid address identifier corresponding to the address at each position. As shown in fig. 5, before the write, a memory access address (not labeled) identical to the target address at the position of the read pointer has been written at the first position of the address cache unit, and no memory access address has been written at the other three positions. The second address equality identifier corresponding to the first position is therefore 1, no second address equality identifiers are recorded for the other three positions, the valid address identifiers of the four positions are 1, 0, 0, and 0, respectively, and the write pointer points to the second position.
Then two new memory access addresses are written into the address cache unit, while all memory access addresses whose second address equality identifier is 1 are read out. Of the newly written addresses, the first differs from the next target address pointed to by the read pointer after the read, and the second is the same as it. As shown in fig. 5, after the write, because the memory access address at the first position has been read out, the second address equality identifier and valid address identifier corresponding to the first position are both reset to 0; the first and second newly written memory access addresses (not labeled) are written at the second and third positions, respectively, so the second address equality identifier of the second position is set to 0, that of the third position is set to 1, the valid address identifiers of the second and third positions are both set to 1, and the write pointer moves to the fourth position.
Fig. 6 is a schematic diagram of reading a memory access address from an address cache unit in a chip according to an embodiment of the present disclosure. In fig. 6, from bottom to top are the first, second, third, and fourth positions of the address cache unit; these are four consecutive positions, and the unit may include other positions (not shown) before the first position or after the fourth position. From left to right, the columns represent the written memory access address, the second address equality identifier corresponding to the address at each position, and the valid address identifier corresponding to the address at each position. As shown in fig. 6, before the read, memory access addresses (not labeled) have been written at all four positions; the addresses at the first three positions are the same, the address at the fourth position differs from the other three, and the read pointer points to the first position. The second address equality identifiers corresponding to the first three positions are therefore all 1, that of the fourth position is 0, and the valid address identifiers of all four positions are 1. When a memory access address is read out at this time, the target address is the address at the first position, and all addresses whose second address equality identifier is 1 are regarded as read out, that is, the addresses at the first three positions are all read out, and the read pointer moves to the next valid unread address, namely the fourth position. As shown in fig. 6, after the read, since the addresses at the first three positions have been read out, their second address equality identifiers and valid address identifiers are all reset to 0; the read pointer now points to the fourth position, and the second address equality identifiers of the positions holding addresses identical to the one at the fourth position (including the fourth position itself) are set to 1.
In the inter-group merging unit 113, an inter-group access merge identifier representing the thread groups participating in each inter-group merge may be generated according to the thread groups corresponding to the cached addresses merged in the address cache unit, and sent to the data return unit 120 as the inter-group merging information. For example, suppose address 0 from thread group 0, address 1 from thread group 1, address 2 from thread group 2, and address 3 from thread group 3 are written into one address cache unit, where address 0, address 1, and address 3 are the same address and address 2 is a different address; when the read pointer points to address 0, address 0, address 1, and address 3 are read out simultaneously, and the inter-group access merge identifier is recorded as 1101.
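The inter-group access merge identifier of the example can be derived with a short sketch (illustrative Python; one bit per thread group, MSB first, is an assumed encoding chosen to be consistent with the 1101 example):

```python
def inter_group_merge_flag(groups, addrs, rd=0):
    """Set bit g (MSB-first over thread groups) when the address written by
    thread group g equals the target address at the read pointer."""
    n = len(groups)
    target = addrs[rd]
    flag = 0
    for g, a in zip(groups, addrs):
        if a == target:
            flag |= 1 << (n - 1 - g)
    return flag
```

With groups 0-3 writing addresses where groups 0, 1, and 3 match the target, the identifier comes out as 1101.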
In the embodiment of the present disclosure, since the multi-channel shared memory 300 allows different channels to be accessed at the same time, the channel address cache control unit 1131 may, each time it sends the memory access requests corresponding to the target addresses in the address cache units, send the requests for multiple address cache units to the shared memory 300 simultaneously, so that every channel of the shared memory 300 is utilized, improving memory access efficiency and the bandwidth utilization of the memory access system. For example, in an embodiment, if the channel 0 address cache unit 11310 has a target address 0 to send and the channel 1 address cache unit 11311 has a target address 1 to send, target address 0 and target address 1 may be read out from the two units at the same time, and their corresponding memory access requests sent to the shared memory 300 simultaneously to access channel 0 and channel 1 of the shared memory 300, respectively. It should be noted that when a memory access request is sent to the shared memory 300, the channel corresponding to its memory access address should be idle, with no other request being processed, so as to avoid problems such as access conflicts.
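Picking one target address per idle, non-empty channel each cycle can be sketched as follows (illustrative Python; the scheduling policy and names are assumptions, not the disclosed circuit):

```python
def pick_sends(fifos, busy):
    """One read per channel whose address cache unit is non-empty and whose
    shared-memory channel is idle; the chosen targets can be issued to the
    shared memory in the same cycle without access conflicts."""
    return {ch: fifo[0]
            for ch, (fifo, b) in enumerate(zip(fifos, busy))
            if fifo and not b}
```

A busy channel is simply skipped that cycle, which is one way to honor the idle-channel requirement noted above.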
In the embodiment of the present disclosure, the channel address cache control unit 1131 sends memory access addresses to the shared memory 300 only when a sending condition is met. The sending conditions include: the address cache unit corresponding to any channel is full; or the address cache units corresponding to all idle channels are non-empty; or no new memory access address merged by the intra-group merging unit 112 is being written into the address cache units. When the address cache unit of any channel is full, that unit can no longer accept memory access addresses, so its cached addresses must be read out to avoid problems such as new addresses failing to be written, memory access addresses being lost, and the corresponding threads never receiving return data. When the address cache units of all idle channels are non-empty, the bandwidth of the memory access system can be utilized to the greatest extent: as many channels of the shared memory 300 as possible are accessed at the same time, reducing the number of accesses to the shared memory 300 and the power consumption of the memory access system. When no new memory access address merged by the intra-group merging unit 112 is being written into the address cache units, it is difficult to reach the state in which the units of all idle channels are non-empty, so the cached addresses should be sent out promptly to avoid the increase in memory access latency caused by long waits.
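The three sending conditions can be combined into a single predicate (an illustrative sketch that treats all channels as idle for simplicity, which is an assumption this simplified model makes and the hardware does not):

```python
def should_send(fifos, capacity, merging_in_progress):
    """Send when any address cache unit is full, or every (idle) channel's
    unit is non-empty, or no newly merged addresses are still arriving."""
    if any(len(f) >= capacity for f in fifos):
        return True                    # a full unit must be drained
    if all(len(f) > 0 for f in fifos):
        return True                    # maximize channels used per access
    return not merging_in_progress     # flush promptly to bound latency
```

The three clauses mirror the three conditions in the order the text gives them.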
As shown in fig. 2, the request sending unit 110 in the embodiment of the present disclosure may further include an address splitting unit 111. The address splitting unit 111 is configured to split the memory access addresses out of the memory access request of each thread group and to generate thread channel mapping identifiers. A thread channel mapping identifier characterizes the threads in the thread group whose memory access addresses correspond to a given channel. Based on the thread channel mapping, the intra-group merging unit 112 may merge the memory access addresses corresponding to different channels within the same thread group, and the merged addresses may be written directly into the address cache units of the corresponding channels without determining the channel again. In one embodiment, for a thread group containing m threads, an m-bit identifier may be used as the thread channel mapping identifier. For example, in a thread group with four threads, thread 0, thread 1, thread 2, and thread 3, suppose thread 0 accesses address 0, thread 1 accesses address 1, thread 2 accesses address 0, and thread 3 accesses address 2, where address 0 and address 2 correspond to channel 0 and address 1 corresponds to channel 1; then the thread channel mapping identifier corresponding to channel 0 in the thread group is 1011, and that corresponding to channel 1 is 0100. In one embodiment, the thread channel mapping identifiers may also be sent to the data return unit 120 as part of the intra-group merging information.
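The thread channel mapping identifiers of the example can be computed as follows (illustrative Python; the MSB-first bit ordering over threads is an assumption chosen to match the 1011/0100 example):

```python
def thread_channel_maps(thread_addrs, channel_of_addr):
    """One m-bit identifier per channel: bit i (MSB first) is set when
    thread i's memory access address belongs to that channel."""
    m = len(thread_addrs)
    maps = {}
    for i, a in enumerate(thread_addrs):
        ch = channel_of_addr[a]
        maps[ch] = maps.get(ch, 0) | (1 << (m - 1 - i))
    return maps
```

For threads accessing addresses 0, 1, 0, 2 with addresses 0 and 2 on channel 0 and address 1 on channel 1, this yields 1011 for channel 0 and 0100 for channel 1.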
As shown in fig. 2, the data returning unit 120 in the embodiment of the present disclosure may include an information caching unit 121 and a broadcast control unit 122. The information caching unit 121 is configured to receive and cache the merging information sent by the request sending unit 110; the broadcast control unit 122 is configured to return data returned by the shared memory 300 to a corresponding thread based on the merging information cached by the information caching unit 121.
Since the memory access requests sent to the shared memory 300 by the request sending unit 110 are generated from the merged memory access addresses, the data returned by the shared memory 300 corresponds to memory access addresses rather than to specific threads. It is therefore necessary to determine, from the merging information corresponding to each memory access address, which thread groups the address corresponds to and which threads within those groups, so that every thread can receive the correct return data.
In the embodiment of the present disclosure, the merging of the access addresses may be divided into two manners, i.e., intra-group merging and inter-group merging, and the merging information may also be divided into two types, i.e., intra-group merging information and inter-group merging information, and correspondingly, the information caching unit 121 in the data returning unit 120 may also include an intra-group information caching unit 1211 and an inter-group information caching unit 1212, which are respectively configured to receive and cache the intra-group merging information and the inter-group merging information from the request sending unit 110.
In the embodiment of the present disclosure, since the merging of the access addresses is performed by first performing intra-group merging and then performing inter-group merging, the broadcast control unit 122 in the data return unit 120 should broadcast the return data to the corresponding threads in the reverse order, that is, perform inter-group broadcasting according to the inter-group merging information first to return the data returned by the shared memory 300 to each thread group, and then perform intra-group broadcasting according to the intra-group merging information to return the data returned by the shared memory 300 to each thread in the same thread group.
For example, suppose the chip 100 receives memory access requests sent by the processor 200 for thread group 0 and thread group 1. Thread group 0 includes four threads, thread 00, thread 01, thread 02, and thread 03, and thread group 1 includes four threads, thread 10, thread 11, thread 12, and thread 13, where thread 00, thread 10, and thread 11 all access address 0 corresponding to channel 0; thread 01 accesses address 1 corresponding to channel 1; thread 12 accesses address 2 corresponding to channel 2; and thread 02, thread 03, and thread 13 access address 3 corresponding to channel 3. In the request sending unit, the memory access addresses of the four threads in thread group 0 and those of the four threads in thread group 1 are first merged in the intra-group merging unit to obtain merged memory access addresses 00, 01, 02, 10, 11, and 12, where memory access address 00 is the same as memory access address 10 and corresponds to channel 0; memory access address 01 corresponds to channel 1; memory access address 11 corresponds to channel 2; and memory access address 02 is the same as memory access address 12 and corresponds to channel 3. The intra-group merging information corresponding to each memory access address may be recorded and sent to the data return unit for caching.
Then, in the inter-group merging unit, memory access addresses 00 and 10 are merged to generate memory access request 0 for accessing channel 0; memory access request 1 for accessing channel 1 is generated from address 01; memory access request 2 for accessing channel 2 is generated from address 11; and addresses 02 and 12 are merged to generate memory access request 3 for accessing channel 3. The inter-group merging information corresponding to the address of each memory access request is recorded and sent to the data return unit for caching. Memory access requests 0, 1, 2, and 3 are then sent to the corresponding channels of the shared memory. After the data return unit receives data 0 from channel 0, data 1 from channel 1, data 2 from channel 2, and data 3 from channel 3 returned by the shared memory, it first, according to the cached inter-group merging information, returns data 0 to thread group 0 and thread group 1, data 1 to thread group 0, data 2 to thread group 1, and data 3 to thread group 0 and thread group 1. It then, according to the cached intra-group merging information, returns data 0 to thread 00 in thread group 0 and threads 10 and 11 in thread group 1, data 1 to thread 01 in thread group 0, data 2 to thread 12 in thread group 1, and data 3 to threads 02 and 03 in thread group 0 and thread 13 in thread group 1. In this way, every thread in all the thread groups finally receives the correct returned data.
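The two-stage merging and reverse-order broadcasting walked through above can be sketched in a few lines of Python. This is an illustrative software model only; the data structures and names are our assumptions, not taken from the patent, which describes hardware units.

```python
from collections import defaultdict

# Illustrative sketch: model the example above.
# Thread (group, id) -> (channel, address).
accesses = {
    (0, 0): (0, 0x0), (0, 1): (1, 0x1), (0, 2): (3, 0x3), (0, 3): (3, 0x3),
    (1, 0): (0, 0x0), (1, 1): (0, 0x0), (1, 2): (2, 0x2), (1, 3): (3, 0x3),
}

# Stage 1: intra-group merging -- collapse identical addresses within a group.
intra = defaultdict(lambda: defaultdict(list))  # group -> (chan, addr) -> threads
for (g, t), target in accesses.items():
    intra[g][target].append(t)

# Stage 2: inter-group merging -- collapse identical addresses across groups.
inter = defaultdict(list)                       # (chan, addr) -> thread groups
for g, merged in intra.items():
    for target in merged:
        inter[target].append(g)

# One memory access request per surviving (channel, address) pair.
requests = sorted(inter)   # [(0, 0), (1, 1), (2, 2), (3, 3)]

def broadcast(target, data):
    """Return data in reverse merge order: inter-group first, then intra-group."""
    out = {}
    for g in inter[target]:          # inter-group broadcast to each thread group
        for t in intra[g][target]:   # intra-group broadcast to each thread
            out[(g, t)] = data
    return out

# Data 0 (channel 0, address 0) reaches thread 00 plus threads 10 and 11.
print(sorted(broadcast((0, 0x0), "data0")))   # [(0, 0), (1, 0), (1, 1)]
```

Running the sketch on the example yields exactly four memory access requests, one per channel, and the inter-group-then-intra-group broadcast routes data 3 back to threads 02 and 03 in group 0 and thread 13 in group 1.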
As shown in fig. 2, the data return unit 120 in the embodiment of the present disclosure may further include a data return cache unit 123. In the request sending unit 110, when multiple threads in the same thread group access the shared memory 300 through different channels, the memory access addresses corresponding to the threads may reach the memory at different times, so the threads in a thread group may also receive the data returned by the shared memory 300 at different times. The data return cache unit 123 can therefore cache the data returned to each thread in a thread group, and once every thread in the group has received its returned data, write the data of each thread back to the register file corresponding to that thread in the processor 200.
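As a rough functional model of such buffering-until-complete behaviour, the following Python class caches per-thread return data and releases a group only when all of its threads have been served. The class and method names are our own illustrative assumptions, not the patent's:

```python
class DataReturnCache:
    """Sketch of a data return cache: buffer each thread's returned data
    until every thread in its group has data, then release the whole group
    (simulating write-back to the per-thread register files)."""

    def __init__(self, active_threads):
        # active_threads: group id -> set of thread ids still awaiting data
        self.pending = {g: set(ts) for g, ts in active_threads.items()}
        self.buffer = {g: {} for g in active_threads}

    def deliver(self, group, thread, data):
        """Cache one thread's data; return the group's complete data set
        once the group is finished, or None while threads are still pending."""
        self.buffer[group][thread] = data
        self.pending[group].discard(thread)
        if not self.pending[group]:
            return self.buffer[group]
        return None
```

Delivering data for a two-thread group releases nothing until the second thread's data arrives, mirroring the write-back condition described above.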
Corresponding to the embodiment of the chip, the disclosure also provides an embodiment of a method for accessing a memory, which is applied to the chip.
As shown in fig. 7, an embodiment of the present disclosure provides a method for accessing a memory, where the method includes:
step S701: acquiring the memory access addresses of the threads in the multiple thread groups, merging the memory access addresses, sending memory access requests to corresponding channels in the shared memory based on the merged memory access addresses, and sending merging information representing the merging mode of the memory access addresses to the data return unit;
step S702: and acquiring data returned by the shared memory, and returning the data to the corresponding thread based on the merging information.
Optionally, the merging information includes intra-group merging information and inter-group merging information, where the intra-group merging information is used to represent a merging manner of memory addresses of threads in the same thread group, and the inter-group merging information is used to represent a merging manner of memory addresses of threads between different thread groups.
Optionally, step S701 further includes: merging the access addresses of all threads in the same thread group through an in-group merging unit to obtain in-group merging information; and merging the access addresses of the threads among different thread groups through an inter-group merging unit to obtain the inter-group merging information.
Optionally, the intra-group merging information includes an intra-group memory access merge identifier, which characterizes whether the memory access address of each thread in a thread's group is merged with that thread's memory access address. Step S701 further includes: comparing the memory access addresses of the threads in the same thread group through a comparator array, and generating a first address-equality identifier for each thread according to the comparison result, where the first address-equality identifier characterizes whether the memory access address of each thread in the corresponding thread's group is the same as that of the corresponding thread; and performing an AND operation on the thread valid identifier and the first address-equality identifier through an AND gate array to obtain the intra-group memory access merge identifier, where the thread valid identifier characterizes whether the memory access address of each thread in the thread group is valid.
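The comparator array and AND gate array described above can be modelled in software as two bit-matrix computations. This is an illustrative sketch only; the patent describes hardware, and the function and variable names here are assumptions:

```python
def intra_group_merge_flags(addrs, valid):
    """For each thread i of a group, compute a bit vector over threads j:
    bit j is 1 iff thread j's request is valid and shares thread i's address."""
    n = len(addrs)
    # Comparator array: first address-equality identifiers.
    addr_eq = [[int(addrs[i] == addrs[j]) for j in range(n)] for i in range(n)]
    # AND gate array: gate each equality bit with the thread valid identifier.
    return [[addr_eq[i][j] & valid[j] for j in range(n)] for i in range(n)]
```

For thread group 0 of the earlier example (addresses 0, 1, 3, 3, all threads valid), thread 02's row comes out as [0, 0, 1, 1], marking that threads 02 and 03 can be merged; an invalid thread contributes 0 to every row.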
Optionally, the inter-group merging information includes an inter-group memory access merge identifier, which uniquely identifies each thread group participating in inter-group memory access merging. Step S701 further includes: writing, through the channel address cache control unit, the memory access addresses merged by the intra-group merging unit into the address cache units of the corresponding channels, and merging identical addresses cached in the same address cache unit.
Optionally, step S701 further includes: caching a second address-equality identifier through the address cache unit, where the second address-equality identifier characterizes whether each address in the address cache unit is the same as a target address; and merging, through the channel address cache control unit, an address with the target address when the second address-equality identifier corresponding to that address indicates that it is the same as the target address.
Optionally, step S701 further includes: the address cache unit is used for realizing cache effective address identification, and the effective address identification is used for representing whether each memory access address cached in the address cache unit is effective or not; determining a target address from a first memory access address in the address cache unit through the channel address cache control unit, wherein an effective address identifier corresponding to the first memory access address indicates that the first memory access address is effective; and under the condition that the memory access request corresponding to the target address is sent to the shared storage unit, modifying the effective address identifier corresponding to the target address to indicate that the first memory access address is invalid.
Optionally, step S701 further includes: and the access requests corresponding to the target addresses in the plurality of address cache units are all sent to the shared memory at each time through the channel address cache control unit.
Optionally, step S701 further includes: the channel address cache control unit is used for sending the memory access address to the shared memory each time when the sending condition is met; the transmission conditions include: the address cache unit corresponding to any channel is full, or the address cache units corresponding to all idle channels are not empty, or no new memory access address merged by the merging units in the group is written into the address cache unit.
Optionally, the intra-group merging information includes a thread channel mapping identifier, which characterizes, for each channel, the threads in a thread group whose memory access addresses correspond to that channel. Step S701 further includes: splitting the memory access address from the memory access request of each thread group through an address splitting unit, and generating the thread channel mapping identifier.
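The address splitting unit's job, deriving a channel and an in-channel address from each thread's memory access address, can be illustrated with a common word-interleaved mapping. The patent does not specify the mapping; the divide/modulo scheme, channel count, and word size below are assumptions for illustration only:

```python
def split_addresses(thread_addrs, num_channels=4, word_bytes=4):
    """Assumed word-interleaved banking: the word index modulo the channel
    count selects the channel; the quotient is the address within it.
    Returns per-thread (channel, bank address) and a thread-channel mapping."""
    split, mapping = {}, {}
    for tid, addr in thread_addrs.items():
        word = addr // word_bytes
        chan = word % num_channels          # channel select
        bank_addr = word // num_channels    # address within the channel
        split[tid] = (chan, bank_addr)
        mapping[tid] = chan                 # thread channel mapping identifier
    return split, mapping
```

With 4 channels and 4-byte words, byte addresses 0, 4, and 12 land on channels 0, 1, and 3, matching the channel layout of the earlier example.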
Optionally, step S702 further includes: receiving and caching the merged information through an information caching unit; and the broadcast control unit is used for returning the data returned by the shared memory to the corresponding thread based on the merging information cached by the information caching unit.
Optionally, step S702 further includes: the in-group information caching unit receives and caches the in-group combined information from the request sending unit; and realizing receiving and caching the intergroup merging information from the request sending unit through the intergroup information caching unit.
Optionally, step S702 further includes: the broadcasting control unit is used for realizing the interclass broadcasting according to the interclass merging information so as to return the data returned by the shared memory to each thread group; and performing in-group broadcasting according to the in-group merging information so as to return the data returned by the shared memory to each thread in the same thread group.
Optionally, step S702 further includes: and caching the data returned to each thread through a data return cache unit, and respectively writing the data of each thread back to a register file corresponding to the thread in the processor when each thread in the thread group receives the returned data.
The method for accessing a memory provided in the embodiments of the present disclosure may be applied to the chip described in the chip embodiments above, where step S701 is implemented by the request sending unit in the chip and step S702 is implemented by the data return unit in the chip. For the specific process, reference may be made to the description of the chip embodiments above, which is not repeated here for brevity.
In addition, an embodiment of the present disclosure further provides a computer device including the chip described in any of the chip embodiments above. For the specific functions of the chip, reference may be made to the description of the chip embodiments above, which is not repeated here for brevity.
The foregoing describes only specific embodiments of the present disclosure. It should be noted that those skilled in the art can make several modifications and improvements without departing from the principles of the embodiments of the present disclosure, and such modifications and improvements shall also fall within the protection scope of the embodiments of the present disclosure.

Claims (16)

1. A chip, wherein the chip comprises:
a request sending unit and a data returning unit;
the request sending unit is used for acquiring the memory access addresses of the threads in the multiple thread groups, merging the memory access addresses, sending memory access requests to corresponding channels in the shared memory based on the merged memory access addresses, and sending merging information representing the merging mode of the memory access addresses to the data returning unit;
and the data return unit is used for acquiring the data returned by the shared memory and returning the data to the corresponding thread based on the merging information.
2. The chip according to claim 1, wherein the merging information includes intra-group merging information and inter-group merging information, the intra-group merging information is used to characterize a merging manner of memory addresses of threads in the same thread group, and the inter-group merging information is used to characterize a merging manner of memory addresses of threads between different thread groups.
3. The chip according to claim 2, wherein the request sending unit includes:
an intra-group merging unit, configured to merge the memory access addresses of the threads in the same thread group to obtain the intra-group merging information; and
an inter-group merging unit, configured to merge the memory access addresses of threads across different thread groups to obtain the inter-group merging information.
4. The chip according to claim 3, wherein the intra-group merging information includes an intra-group memory access merge identifier, and the intra-group memory access merge identifier is used to characterize whether the memory access address of each thread in a thread's group is merged with that thread's memory access address; the intra-group merging unit includes:
a comparator array, configured to compare the memory access addresses of the threads in the same thread group and generate a first address-equality identifier for each thread according to the comparison result, where the first address-equality identifier is used to characterize whether the memory access address of each thread in the corresponding thread's group is the same as that of the corresponding thread; and
an AND gate array, configured to perform an AND operation on a thread valid identifier and the first address-equality identifier to obtain the intra-group memory access merge identifier, where the thread valid identifier is used to characterize whether the memory access address of each thread in the thread group is valid.
5. The chip according to claim 3 or 4, wherein the inter-group merging information includes an inter-group memory access merge identifier, and the inter-group memory access merge identifier is used to uniquely identify each thread group participating in inter-group memory access merging; the inter-group merging unit includes a channel address cache control unit, the channel address cache control unit includes a plurality of address cache units, and each address cache unit corresponds to one channel;
the channel address cache control unit is configured to write the memory access addresses merged by the intra-group merging unit into the address cache units of the corresponding channels, and to merge identical addresses cached in the same address cache unit.
6. The chip of claim 5, wherein the address cache unit is further configured to cache a second address equality flag, where the second address equality flag is used to characterize whether each address in the address cache unit is the same as a target address;
the channel address cache control unit is specifically configured to: and combining the address and the target address under the condition that a second address equal identifier corresponding to one address represents that the address is the same as the target address.
7. The chip according to claim 5 or 6, wherein the address cache unit is further configured to cache an effective address identifier, where the effective address identifier is used to characterize whether each access address cached in the address cache unit is valid;
the channel address cache control unit is specifically configured to: determining a target address from first memory access addresses in the address cache unit, wherein effective address identifiers corresponding to the first memory access addresses indicate that the first memory access addresses are effective;
and under the condition that the memory access request corresponding to the target address is sent to the shared storage unit, modifying the effective address identifier corresponding to the target address to indicate that the first memory access address is invalid.
8. The chip according to any one of claims 5 to 7, wherein the channel address cache control unit is specifically configured to:
and sending the memory access requests corresponding to the target addresses in the plurality of address cache units to the shared memory each time.
9. The chip according to claim 8, wherein the channel address cache control unit is specifically configured to:
when the transmission condition is met, the memory access address is transmitted to the shared memory; the transmission conditions include: the address cache unit corresponding to any channel is full, or the address cache units corresponding to all idle channels are not empty, or no new memory access address merged by the merging units in the group is written into the address cache unit.
10. The chip according to any one of claims 2 to 9, wherein the intra-group merging information includes a thread channel mapping identifier, and the thread channel mapping identifier is used to characterize, for each channel, the threads in a thread group whose memory access addresses correspond to that channel; the request sending unit further includes:
and the address splitting unit is used for splitting the memory access address from the memory access request of each thread group and generating the thread channel mapping identifier.
11. The chip according to any one of claims 1 to 10, wherein the data return unit comprises:
the information cache unit is used for receiving and caching the merging information; and
and the broadcast control unit is used for returning the data returned by the shared memory to the corresponding thread based on the merging information cached by the information caching unit.
12. The chip of claim 11, wherein the information caching unit comprises:
the intra-group information caching unit is used for receiving and caching the intra-group merging information from the request sending unit; and
and the inter-group information caching unit is used for receiving and caching the inter-group merging information from the request sending unit.
13. The chip according to claim 12, wherein the broadcast control unit is specifically configured to:
firstly, performing intergroup broadcasting according to the intergroup merging information so as to return data returned by the shared memory to each thread group;
and performing in-group broadcasting according to the in-group merging information so as to return the data returned by the shared memory to each thread in the same thread group.
14. The chip of claim 12, wherein the data return unit further comprises:
and the data return cache unit is used for caching the data returned to each thread, and respectively writing the data of each thread back to the register file corresponding to the thread in the processor when each thread in the thread group receives the returned data.
15. A method for accessing a memory, applied to the chip of any one of claims 1 to 14, the method comprising:
acquiring the memory access addresses of the threads in the multiple thread groups through a request sending unit in the chip, merging the memory access addresses, sending memory access requests to corresponding channels in the shared memory based on the merged memory access addresses, and sending merging information representing the merging mode of the memory access addresses to the data return unit;
and acquiring the data returned by the shared memory through a data return unit in the chip, and returning the data to the corresponding thread based on the merging information.
16. A computer device comprising a chip as claimed in any one of claims 1 to 14.
CN202111655195.XA 2021-12-30 2021-12-30 Chip, memory access method and computer equipment Pending CN114416397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111655195.XA CN114416397A (en) 2021-12-30 2021-12-30 Chip, memory access method and computer equipment


Publications (1)

Publication Number Publication Date
CN114416397A true CN114416397A (en) 2022-04-29

Family

ID=81269649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111655195.XA Pending CN114416397A (en) 2021-12-30 2021-12-30 Chip, memory access method and computer equipment

Country Status (1)

Country Link
CN (1) CN114416397A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114595070A (en) * 2022-05-10 2022-06-07 上海登临科技有限公司 Processor, multithreading combination method and electronic equipment
WO2023216444A1 (en) * 2022-05-10 2023-11-16 上海登临科技有限公司 Processor, multi-thread merging method and electronic device
CN116643698A (en) * 2023-05-26 2023-08-25 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium
CN116643698B (en) * 2023-05-26 2024-03-29 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium
CN116680089A (en) * 2023-08-03 2023-09-01 上海登临科技有限公司 Access control structure, access control method, memory system, processor and electronic equipment
CN116680089B (en) * 2023-08-03 2023-11-14 上海登临科技有限公司 Access control structure, access control method, memory system, processor and electronic equipment
CN117707994A (en) * 2024-02-02 2024-03-15 北京象帝先计算技术有限公司 Request buffer, system, component, device and transmission method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination