WO2021114768A1 - Data processing device, method, chip, processor, device and storage medium - Google Patents

Data processing device, method, chip, processor, device and storage medium

Info

Publication number
WO2021114768A1
Authority
WO
WIPO (PCT)
Prior art keywords
data processing
cache
core module
processing request
basic core
Prior art date
Application number
PCT/CN2020/114010
Other languages
English (en)
French (fr)
Other versions
WO2021114768A8 (zh)
Inventor
王晓阳
左航
倪怡芳
Original Assignee
成都海光微电子技术有限公司
Priority date
Filing date
Publication date
Application filed by 成都海光微电子技术有限公司
Publication of WO2021114768A1 publication Critical patent/WO2021114768A1/zh
Publication of WO2021114768A8 publication Critical patent/WO2021114768A8/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G06T1/60 Memory management

Definitions

  • This application relates to the technical field of processors, and specifically to a data processing device, method, chip, processor, device, and storage medium.
  • GPU: Graphics Processing Unit
  • CPU: Central Processing Unit
  • The existing GPU has certain difficulties in scalability, because the compute engines in the GPU are connected to each other through the cache network. If this architecture is to be expanded, for example from four computing engines to eight computing engines, it is difficult to connect more computing engines simply by enlarging the cache network. On the one hand, simply enlarging the cache network lengthens the access paths of the computing engines, which results in a significant decrease in performance; on the other hand, chip wiring resources and the physical process impose limitations, so that directly enlarging the cache network increases process complexity and is difficult to implement.
  • In a first aspect, an embodiment of the present application provides a data processing device. The data processing device includes at least two basic core modules, and each of the basic core modules includes: multiple computing engines, a cache network, multiple transfer switches, multiple cache units, a shared bus, and a core cache;
  • the multiple cache units and the core cache are respectively connected to the cache network, the multiple computing engines are connected to the cache network through the multiple transfer switches, and the multiple transfer switches are serially connected through the shared bus;
  • the shared bus of a first basic core module of the at least two basic core modules is connected to the core cache of a second basic core module; any transfer switch in the first basic core module is configured to, after receiving a first data processing request for accessing a first target cache unit in the second basic core module, transmit the first data processing request to the core cache of the second basic core module through the shared bus of the first basic core module; and the core cache of the second basic core module is configured to access the first target cache unit based on the first data processing request.
  • In the data processing device described above, each basic core module includes multiple computing engines, each computing engine is connected to the cache network through a transfer switch, the multiple transfer switches are serially connected through the shared bus, the shared bus in one basic core module is connected to the core cache in another basic core module, and the core cache in that other basic core module is connected to the cache network in that other basic core module. Through this architecture, the number of computing engines can be expanded.
  • When a transfer switch in one basic core module receives a data processing request for accessing a target cache unit in another basic core module, the transfer switch can transmit the data processing request, through the shared bus connected to it, to the core cache of the other basic core module connected to that shared bus, thereby delivering the data processing request to the other basic core module. The data processing request transmitted to the core cache can then access the target cache unit through the cache network connected to the core cache, so that a computing engine in one basic core module can access a target cache unit in another basic core module without degrading performance or increasing process complexity.
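  • As a minimal illustrative sketch of this request path (the Python names below, such as BasicCoreModule, CoreCache and route_request, are assumptions made for illustration and do not appear in the disclosure), a request whose target cache unit is local enters the local cache network, while a request whose target lies in the other basic core module is forwarded over the shared bus to that module's core cache:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    target_module: str   # e.g. "110a" or "110b"
    address: int

@dataclass
class BasicCoreModule:
    name: str
    cache_units: dict = field(default_factory=dict)   # stands in for the L2 cache banks
    remote_core_cache: object = None                   # core cache of the other module

@dataclass
class CoreCache:
    module: BasicCoreModule
    def access(self, req: Request):
        # the core cache reaches the target cache unit via its local cache network
        return self.module.cache_units.get(req.address)

def route_request(local: BasicCoreModule, req: Request):
    if req.target_module == local.name:
        # local target: through the transfer switch into the local cache network
        return local.cache_units.get(req.address)
    # remote target: transfer switch -> shared bus -> core cache of the other module
    return local.remote_core_cache.access(req)

# two basic core modules connected to each other, as in Fig. 2
a = BasicCoreModule("110a")
b = BasicCoreModule("110b", cache_units={0x40: "first target data"})
a.remote_core_cache, b.remote_core_cache = CoreCache(b), CoreCache(a)
print(route_request(a, Request("110b", 0x40)))   # -> 'first target data'
```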
  • The shared bus of the second basic core module is connected to the core cache of the first basic core module, and any transfer switch in the second basic core module is configured to, after receiving a second data processing request for accessing a second target cache unit in the first basic core module, transmit the second data processing request to the core cache of the first basic core module through the shared bus of the second basic core module; the core cache of the first basic core module is configured to access the second target cache unit based on the second data processing request.
  • Since the shared bus of the first basic core module is connected to the core cache of the second basic core module and the shared bus of the second basic core module is also connected to the core cache of the first basic core module, the two basic core modules are connected to each other, and the computing engine in either of the two interconnected basic core modules can access the target cache unit in the other basic core module without degrading performance or increasing process complexity. In this way, the storage client in one basic core module can access the target cache unit in another basic core module.
  • Each computing engine includes multiple storage clients, each storage client is connected to a cache route in the cache network through a transfer switch, and the core cache is connected to a cache route in the cache network.
  • Because each storage client in each computing engine is connected to a cache route through a transfer switch, the core cache is connected to a cache route in the cache network, and the transfer switches in the basic core module are connected through the shared bus, when any storage client in a computing engine needs to access a cache unit in another basic core module, its data processing request does not go through the cache network but is transmitted to the core cache of the other basic core module through the transfer switches and the shared bus, so that a storage client in one basic core module can access another basic core module.
  • The multiple storage clients included in the multiple computing engines correspond one-to-one to the multiple transfer switches, and each storage client is connected to a cache route in the cache network through its corresponding transfer switch; the cache network includes multiple cache routes arranged in a grid, and each cache route in the cache network is connected to each adjacent cache route.
  • Because each storage client in each computing engine is connected to a cache route through its corresponding transfer switch, and the transfer switches in the basic core module are connected through the shared bus, when any storage client in a computing engine needs to access a cache unit in another basic core module, its data processing request does not go through the cache network but is transmitted to the core cache of the other basic core module through the transfer switches and the shared bus, so that a storage client in one basic core module can access another basic core module.
  • When the first data processing request is a read request, the core cache of the second basic core module is configured to: when the first data processing request is received and the core cache of the second basic core module stores the first target data requested by the first data processing request, return the first target data to the storage client that sent the first data processing request through the shared bus of the first basic core module; and when the first data processing request is received and the first target data is not stored in the core cache of the second basic core module, obtain the first target data from the first target cache unit through the cache network of the second basic core module based on the first data processing request, and return the first target data to the storage client that sent the first data processing request through the shared bus of the first basic core module.
  • When the first data processing request sent by a storage client in a computing engine of the first basic core module is transmitted to the core cache of the second basic core module, if the core cache already stores the first target data requested by the first data processing request, the core cache directly returns the first target data to the client; if the core cache does not store the first target data, the core cache can obtain the first target data from the first target cache unit through the cache network of the second basic core module connected to it and return it to the client. In this way, a storage client in a computing engine of the first basic core module can access a cache unit of the second basic core module.
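  • A minimal sketch of this hit/miss behaviour of the core cache (hypothetical names; the dictionaries stand in for the data actually held by the core cache and the cache units):

```python
class CoreCacheReadSketch:
    """Hypothetical sketch of the core-cache read path described above."""

    def __init__(self, cache_units):
        self.store = {}                 # data already held by the core cache
        self.cache_units = cache_units  # reachable through the local cache network

    def read(self, address):
        if address in self.store:
            # hit: the first target data is returned over the shared bus of the
            # requesting module without entering the local cache network
            return self.store[address]
        # miss: fetch the first target data from the first target cache unit
        # through the cache network, keep a copy, then return it to the client
        data = self.cache_units[address]
        self.store[address] = data
        return data

banks = {0x100: "first target data"}
core_cache = CoreCacheReadSketch(banks)
print(core_cache.read(0x100))   # miss: fetched via the cache network
print(core_cache.read(0x100))   # hit: served directly from the core cache
```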
  • Each transfer switch includes a first port, a second port, a third port, a fourth port, a first data selector, a data buffer, an arbiter, and a second data selector;
  • the first port is configured to be connected to the corresponding storage client, the second port is configured to be connected to a cache route, the third port is configured to be connected to the previous-hop transfer switch through the shared bus, and the fourth port is configured to be connected to the next-hop transfer switch or to the core cache of another basic core module through the shared bus;
  • the first data selector is respectively connected to the first port, the second port, and the data buffer; the arbiter is connected to the data buffer, the third port, and the fourth port; and the second data selector is respectively connected to the first port, the second port, the third port, and the fourth port;
  • the first data selector is configured to send the data processing request of the storage client received by the first port to the cache route connected to the second port, or to the data buffer;
  • the arbiter is configured to receive the data processing requests sent by the data buffer and the third port and, when multiple data processing requests are received, determine which of them is to be responded to with priority and output that data processing request to the shared bus through the fourth port;
  • the second data selector is configured to output the read-back data received by the fourth port to the storage client connected to the first port, or to output it to the shared bus through the third port, and is further configured to output the read-back data received by the second port to the storage client connected to the first port.
  • Through the first data selector, the transfer switch can send the data processing request received from the storage client at the first port to the cache route connected to the second port, or to the data buffer. The arbiter in the transfer switch can receive the data processing requests sent by the data buffer and the third port and, when multiple data processing requests are received, determine which one is to be responded to with priority and output it to the shared bus through the fourth port. Through the second data selector, the transfer switch can output the read-back data received at the fourth port to the storage client connected to the first port or to the shared bus through the third port, and can also output the read-back data received at the second port to the storage client connected to the first port. In this way, a data processing request can be routed to the cache network or to the shared bus through the transfer switch, and returned read-back data can likewise be routed to the storage client or to the shared bus.
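  • The data paths of the transfer switch described above can be sketched as follows (hypothetical Python names; the deque and the lists stand in for the data buffer, the cache route and the shared bus):

```python
from collections import deque

class TransferSwitchSketch:
    """Hypothetical sketch of the transfer-switch data paths described above."""

    def __init__(self, cache_route, shared_bus_out):
        self.data_buffer = deque()            # requests waiting for the shared bus
        self.cache_route = cache_route        # behind the second port
        self.shared_bus_out = shared_bus_out  # fourth port: next hop or core cache

    # first data selector: client request (first port) -> cache route or data buffer
    def on_client_request(self, request, is_remote):
        if is_remote:
            self.data_buffer.append(request)
        else:
            self.cache_route.append(request)

    # arbiter: choose between the data buffer and a request from the previous hop
    def arbitrate(self, bus_request=None):
        if bus_request is not None:            # shared-bus requests win by default
            chosen = bus_request
        elif self.data_buffer:
            chosen = self.data_buffer.popleft()
        else:
            return
        self.shared_bus_out.append(chosen)     # out through the fourth port

    # second data selector: read-back data -> local client or back onto the bus
    def on_readback(self, data, for_local_client, client_out, bus_back_out):
        (client_out if for_local_client else bus_back_out).append(data)

route, bus = [], []
sw = TransferSwitchSketch(route, bus)
sw.on_client_request("read 0x40 in module 110b", is_remote=True)
sw.arbitrate()
print(bus)                                     # ['read 0x40 in module 110b']
client_out, upstream = [], []
sw.on_readback("first target data", True, client_out, upstream)
print(client_out)                              # ['first target data']
```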
  • When the first data processing request is a write request including write data, any transfer switch in the first basic core module is configured to:
  • when the first data processing request is received, store the first data processing request in the data buffer and return a write confirmation message for the first data processing request to the storage client that initiated the first data processing request;
  • when the first data processing request satisfies an output condition, output the first data processing request to the shared bus through the fourth port by means of the arbiter, so as to transmit the first data processing request through the shared bus to the core cache of the second basic core module, so that the core cache of the second basic core module writes the write data into the first target cache unit through the cache network of the second basic core module based on the first data processing request.
  • In this way, when a write request containing write data is received from a storage client, the write request can be stored in the data buffer and a write confirmation message for the write request can be returned to the storage client that initiated the request. The write request is later output to the shared bus by the arbiter and transmitted over the shared bus to the core cache of the second basic core module; based on the write request, the core cache of the second basic core module writes the write data into the target cache unit to be accessed through the cache network of the second basic core module. A quick response to the storage client's write request can thus be achieved.
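  • A minimal sketch of this write path (hypothetical names; the key point is that the write confirmation is returned before the write data actually reaches the target cache unit):

```python
from collections import deque

class WritePathSketch:
    """Hypothetical sketch of the write-request handling described above."""

    def __init__(self):
        self.data_buffer = deque()    # temporarily holds pending write requests

    def on_write_request(self, address, write_data):
        # store the write request in the data buffer ...
        self.data_buffer.append((address, write_data))
        # ... and immediately return a write confirmation message to the client,
        # before the write data has actually reached the target cache unit
        return "write-ack"

    def drain_to_remote_core_cache(self, remote_cache_units):
        # later, when the output condition is met, the arbiter puts the request
        # on the shared bus; the core cache of the other module then writes the
        # data into the target cache unit through its own cache network
        while self.data_buffer:
            address, write_data = self.data_buffer.popleft()
            remote_cache_units[address] = write_data

path = WritePathSketch()
print(path.on_write_request(0x80, "payload"))   # the client gets 'write-ack' at once
remote_banks = {}
path.drain_to_remote_core_cache(remote_banks)
print(remote_banks)                              # {128: 'payload'}
```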
  • In a second aspect, an embodiment of the present application provides a data processing method, which is applied to the data processing device described in the first aspect, and the method includes:
  • after receiving a first data processing request for accessing the first target cache unit in the second basic core module, any transfer switch in the first basic core module transmits the first data processing request to the core cache of the second basic core module through the shared bus of the first basic core module;
  • the core cache of the second basic core module accesses the first target cache unit based on the first data processing request.
  • In the above method, the data processing device includes at least two basic core modules, each basic core module includes multiple computing engines, each computing engine is connected to the cache network through a transfer switch, the multiple transfer switches are serially connected through the shared bus, the shared bus in one basic core module is connected to the core cache in another basic core module, and that core cache is connected to the cache network in its own basic core module. Through this architecture, the number of computing engines can be expanded.
  • When a transfer switch in one basic core module receives a data processing request for accessing a target cache unit in another basic core module, the transfer switch can transmit the data processing request, through the shared bus connected to it, to the core cache of the other basic core module connected to that shared bus, thereby delivering the data processing request to the other basic core module. The data processing request transmitted to the core cache can then access the target cache unit through the cache network connected to the core cache.
  • the shared bus of the second basic core module is connected to the core cache of the first basic core module, and the method further includes:
  • after receiving a second data processing request for accessing the second target cache unit in the first basic core module, any transfer switch in the second basic core module transmits the second data processing request to the core cache of the first basic core module through the shared bus of the second basic core module;
  • the core cache of the first basic core module accesses the second target cache unit based on the second data processing request.
  • Since the shared bus of the second basic core module is also connected to the core cache of the first basic core module, the two basic core modules are connected to each other, and the computing engine in either of the two interconnected basic core modules can access the target cache unit in the other basic core module without degrading performance or increasing process complexity. In this way, the storage client in one basic core module can access the target cache unit in another basic core module.
  • When the first data processing request is a read request, the core cache of the second basic core module accessing the first target cache unit based on the first data processing request includes:
  • when the core cache of the second basic core module receives the first data processing request and stores the first target data requested by the first data processing request, returning the first target data to the storage client that sent the first data processing request through the shared bus of the first basic core module;
  • when the core cache of the second basic core module receives the first data processing request and the first target data is not stored in the core cache of the second basic core module, obtaining, based on the first data processing request, the first target data from the first target cache unit through the cache network of the second basic core module, and returning the first target data to the storage client that sent the first data processing request through the shared bus of the first basic core module.
  • When the first data processing request sent by a storage client in a computing engine of the first basic core module is transmitted to the core cache of the second basic core module, if the core cache already stores the first target data requested by the first data processing request, the core cache directly returns the first target data to the client; if not, the core cache can obtain the first target data from the first target cache unit through the cache network of the second basic core module connected to it and return it to the client. In this way, a storage client in a computing engine of the first basic core module can access a cache unit of the second basic core module.
  • When the first data processing request is a write request containing write data, the core cache of the second basic core module accessing the first target cache unit based on the first data processing request includes:
  • when any transfer switch in the first basic core module receives the first data processing request, storing the first data processing request in the data buffer and returning a write confirmation message for the first data processing request to the storage client that initiated the first data processing request;
  • when the first data processing request satisfies the output condition, outputting, by the arbiter in the transfer switch, the first data processing request to the shared bus through the fourth port of the transfer switch, so as to transmit the first data processing request through the shared bus to the core cache of the second basic core module;
  • writing, by the core cache of the second basic core module based on the first data processing request, the write data into the first target cache unit through the cache network of the second basic core module.
  • In this way, when a write request containing write data is received from a storage client, the write request can be stored in the data buffer and a write confirmation message for the write request can be returned to the storage client that initiated the request. The write request is later output to the shared bus by the arbiter and transmitted over the shared bus to the core cache of the second basic core module; based on the write request, the core cache of the second basic core module writes the write data into the target cache unit to be accessed through the cache network of the second basic core module, so that a quick response to the storage client's write request can be achieved.
  • The arbiter is configured to, when multiple data processing requests come from the shared bus and a storage client respectively, determine the data processing request from the shared bus as the data processing request to be responded to with priority.
  • The arbiter is configured to, when multiple data processing requests all come from the shared bus or all come from a storage client, determine the data processing request received first as the data processing request to be responded to with priority.
  • The arbiter is configured to count the number of times each data processing request temporarily stored in the data buffer has waited, and to select the data processing request with the largest waiting count in the data buffer as the data processing request to be responded to with priority.
  • When the storage duration of data stored in the core cache reaches a preset duration threshold, the data is deleted or set to a state that allows overwriting.
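  • A minimal sketch of such a duration-based clean-up, assuming a simple timestamp per entry (hypothetical names; the embodiment may equally mark the data as overwritable instead of deleting it):

```python
import time

class AgingCoreCacheSketch:
    """Hypothetical sketch of the duration-based clean-up described above."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.entries = {}               # address -> (data, time stored)

    def put(self, address, data):
        self.entries[address] = (data, time.monotonic())

    def clean(self):
        # delete entries whose storage duration has reached the preset threshold
        now = time.monotonic()
        for address in list(self.entries):
            data, stored_at = self.entries[address]
            if now - stored_at >= self.max_age:
                del self.entries[address]

cache = AgingCoreCacheSketch(max_age_seconds=0.0)
cache.put(0x40, "first target data")
cache.clean()
print(cache.entries)   # {} - the stale entry has been cleaned
```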
  • an embodiment of the present application provides a processor, including the data processing device described in the foregoing first aspect.
  • an embodiment of the present application provides a chip including the data processing device described in the first aspect above, and the data processing device is formed on the same semiconductor substrate.
  • an embodiment of the present application provides a processor, including the chip described in the fourth aspect.
  • An embodiment of the present application provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and when the computer program is executed by the processor, the data processing method described in the second aspect is implemented.
  • an embodiment of the present application provides a storage medium in which a computer program is stored, and when the computer program is executed by a processor, the data processing method described in the second aspect is implemented.
  • Fig. 1 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Fig. 2 is a schematic structural diagram of another data processing device provided by an embodiment of the present application.
  • Fig. 3 is a schematic structural diagram of another data processing device provided by an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of a core cache provided by an embodiment of the present application.
  • Fig. 5 is a schematic structural diagram of a transfer switch shown in an embodiment of the present application.
  • Fig. 6 is a schematic structural diagram of another transfer switch shown in an embodiment of the present application.
  • Fig. 7 is a flowchart of a data processing method provided by an embodiment of the present application.
  • Fig. 8 is a flowchart of another data processing method provided by an embodiment of the present application.
  • Fig. 9 is a block diagram of an electronic device provided by an embodiment of the present application.
  • Reference numerals: 100 - data processing device; 110 - basic core module; 111 - computing engine; 112 - cache network; 113 - transfer switch; 114 - cache unit; 115 - shared bus; 116 - core cache; 110a - first basic core module; 110b - second basic core module; 111a - computing engine in the first basic core module; 111b - computing engine in the second basic core module; 112a - cache network in the first basic core module; 112b - cache network in the second basic core module.
  • A GPU chip usually contains four computing engines (Compute Engines); each computing engine can be understood as a core of the GPU, and each computing engine usually contains multiple memory clients (Memory Clients), each of which can be understood as a core of the computing engine. All storage clients are connected to the cache network and access the memory/cache through the cache network. Because the computing engines in the GPU are currently connected to each other through the above-mentioned cache network, the GPU has certain difficulties in scalability. If this architecture is to be expanded, for example from four computing engines to eight computing engines, simply enlarging the cache network lengthens the access paths of the storage clients in the computing engines; in the worst case, a single storage client may need a very long path to access the cache/memory.
  • When expanding from four computing engines to eight computing engines, the cache network would need to be doubled in size. In that case, if a storage client located in the upper-left corner of the GPU needs to access a cache in the lower-right corner, the length of its access path also roughly doubles, which results in a significant decrease in performance. On the other hand, due to the limitations of chip wiring resources and the physical process, expanding from four computing engines to eight computing engines in this way would also greatly increase the difficulty of the manufacturing process.
  • FIG. 1 is a schematic structural diagram of a data processing device 100 provided by an embodiment of the present application.
  • The data processing device 100 may be applied to a processor, and the processor may be a GPU, a deep computing unit (DCU), or a CPU; the CPU may also be a CPU with an integrated GPU. The DCU can be understood as a graphics processor oriented to general-purpose computing (General-Purpose computing on Graphics Processing Units, GPGPU), but a DCU usually does not include the graphics-processing part of a general GPU.
  • GPGPU: General-Purpose computing on Graphics Processing Units
  • The data processing device 100 includes at least two basic core modules 110, and each basic core module 110 includes: a plurality of computing engines 111, a cache network 112, a plurality of transfer switches 113, a plurality of cache units 114, a shared bus 115, and a core cache 116.
  • In each basic core module 110, the plurality of cache units 114 and the core cache 116 are respectively connected to the cache network 112, the plurality of computing engines 111 are connected to the cache network 112 through the plurality of transfer switches 113, and the plurality of transfer switches 113 are serially connected through the shared bus 115.
  • the core cache 116 is configured to be connected to the shared bus 115 in another basic core module 110 to realize the connection of the two basic core modules 110.
  • FIG. 2 is a schematic structural diagram of another data processing device 100 provided by an embodiment of the present application.
  • The first basic core module 110a and the second basic core module 110b of the at least two basic core modules are taken as an example for explanation, as shown in FIG. 2.
  • The shared bus 115a of the first basic core module 110a is connected to the core cache 116b of the second basic core module 110b, and any transfer switch 113a in the first basic core module 110a is configured to, after receiving a first data processing request for accessing the first target cache unit in the second basic core module 110b, transmit the first data processing request to the core cache 116b of the second basic core module 110b through the shared bus 115a of the first basic core module 110a; the core cache 116b of the second basic core module 110b is configured to access the first target cache unit based on the first data processing request.
  • the first target cache unit may be any one of the plurality of cache units 114b in the second basic core module 110b.
  • After receiving the first data processing request, any transfer switch 113a transmits the first data processing request to the core cache 116b of the second basic core module 110b through the shared bus 115a of the first basic core module 110a. This can be understood as follows: if there is another transfer switch 113a between the current transfer switch 113a and the core cache 116b on the shared bus 115a, the current transfer switch 113a, after receiving the first data processing request, transmits it over the shared bus 115a to the next-hop transfer switch 113a, and the next-hop transfer switch 113a continues to transmit the first data processing request downstream until the first data processing request reaches the core cache 116b.
  • the upstream and downstream in this application refer to the direction of data transmission.
  • the first basic core module 110a and the second basic core module 110b may be any two basic core modules having a connection relationship among the at least two basic core modules.
  • the first basic core module 110a and the second basic core module 110b may be two adjacent basic core modules.
  • each basic core module 110 includes four calculation engines 111.
  • the cache network 112 is composed of multiple cache routers (Cache Routers). Any cache unit 114 may be a cache bank, such as an L2 (second level cache) cache bank.
  • For example, each basic core module 110 may be provided with 16 L2 cache banks, and each cache unit shown in FIG. 1 or FIG. 2 may represent 4 L2 cache banks.
  • In the data processing device 100, each basic core module includes multiple computing engines, each computing engine is connected to the cache network through a transfer switch, the multiple transfer switches are serially connected through the shared bus, the shared bus in one basic core module is connected to the core cache in another basic core module, and that core cache is connected to the cache network of its own basic core module. Through this architecture, the number of computing engines can be expanded.
  • When a transfer switch receives a data processing request for accessing a target cache unit in another basic core module, the transfer switch can transmit the data processing request, through the shared bus connected to it, to the core cache of the other basic core module connected to that shared bus, thereby delivering the data processing request to the other basic core module. The data processing request transmitted to the core cache can then access the target cache unit through the cache network connected to the core cache.
  • The shared bus 115b of the second basic core module 110b is connected to the core cache 116a of the first basic core module 110a, and any transfer switch 113b in the second basic core module 110b is configured to, after receiving a second data processing request for accessing the second target cache unit in the first basic core module 110a, transmit the second data processing request to the core cache 116a of the first basic core module 110a through the shared bus 115b of the second basic core module 110b.
  • the core cache 116a of the first basic core module 110a is configured to access the second target cache unit based on the second data processing request.
  • the second target cache unit may be any one of the plurality of cache units 114a in the first basic core module 110a.
  • Since the shared bus 115a of the first basic core module 110a is connected to the core cache 116b of the second basic core module 110b, and the shared bus 115b of the second basic core module 110b is also connected to the core cache 116a of the first basic core module 110a, the two basic core modules are connected to each other. In this way, the computing engine in either of the two interconnected basic core modules can access the target cache unit in the other basic core module without degrading performance or increasing process complexity.
  • Each computing engine 111 may include multiple storage clients; each storage client is connected to a cache route in the cache network 112 through a transfer switch 113, and the core cache 116 is connected to a cache route in the cache network 112.
  • the multiple storage clients in each calculation engine 111 are equivalent to multiple cores in the calculation engine 111.
  • Therefore, this application not only enables the computing engine of one of two interconnected basic core modules to access the target cache unit in the other basic core module, but also enables a storage client in one basic core module to access the target cache unit in another basic core module.
  • the cache network 112 includes a plurality of cache routes arranged in a grid, each cache route in the cache network is connected to each adjacent cache route, and one of the cache routes is connected to the core cache 116;
  • multiple storage clients included in multiple computing engines 111 correspond to multiple transfer switches 113 one-to-one, and each storage client is routed to a cache in the cache network 112 through a corresponding transfer switch 113 connection.
  • FIG. 3 is a schematic structural diagram of another data processing apparatus 100 provided by an embodiment of the present application.
  • any calculation engine 111a in the first basic core module 110a includes multiple storage clients 1111a.
  • the multiple storage clients 1111a included in all the computing engines 111a in a basic core module 110a correspond to the multiple conversion switches 113a one-to-one.
  • the first basic core module 110a has n storage clients 1111a.
  • The cache network 112a in the first basic core module 110a includes a plurality of cache routes 1121a arranged in a grid pattern (also called an array arrangement), and each cache route 1121a in the cache network 112a is connected to each adjacent cache route 1121a. For example, when a cache route 1121a has adjacent cache routes 1121a above, below, and to its left, that cache route 1121a is connected to the adjacent cache routes 1121a above, below, and to its left.
  • Each storage client 1111a is connected to a corresponding transfer switch 113a, and is connected to a cache route 1121a through the corresponding transfer switch 113a.
  • a plurality of transfer switches 113a are serially connected through the shared bus 115a of the first basic core module 110a.
  • the shared bus 115a is connected to the core cache 116b of the second basic core module 110b, and the core cache 116a is connected to a cache route 1121a.
  • Similarly, any computing engine 111b in the second basic core module 110b includes multiple storage clients 1111b, and the multiple storage clients 1111b included in all the computing engines 111b in the second basic core module 110b correspond one-to-one to the plurality of transfer switches 113b.
  • the cache network 112b in the second basic core module 110b includes a plurality of cache routes 1121b arranged in a grid pattern, and each cache route 1121b in the cache network 112b is connected to each adjacent cache route 1121b.
  • Each storage client 1111b is connected to a corresponding transfer switch 113b, and is connected to a cache route 1121b through the transfer switch 113b, and a plurality of transfer switches 113b are serially connected through the shared bus 115b of the second basic core module 110b.
  • the shared bus 115b is connected to the core cache 116a of the first basic core module 110a, and the core cache 116b is connected to a cache route 1121b.
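  • The grid arrangement of cache routes described above can be sketched as follows (hypothetical names; each cache route is linked to its neighbours above, below, to the left and to the right, when they exist):

```python
def grid_neighbors(rows, cols):
    """Hypothetical sketch: in an R x C grid of cache routes, each cache route
    is connected to every adjacent cache route (above, below, left, right)."""
    links = {}
    for r in range(rows):
        for c in range(cols):
            neighbors = []
            if r > 0:
                neighbors.append((r - 1, c))      # the cache route above
            if r < rows - 1:
                neighbors.append((r + 1, c))      # the cache route below
            if c > 0:
                neighbors.append((r, c - 1))      # the cache route to the left
            if c < cols - 1:
                neighbors.append((r, c + 1))      # the cache route to the right
            links[(r, c)] = neighbors
    return links

print(grid_neighbors(2, 2)[(0, 0)])   # [(1, 0), (0, 1)]
```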
  • Depending on the type of the first data processing request, the processing manner of the above-mentioned data processing device 100 also differs. The following explains the process by which a storage client in one basic core module accesses a target cache unit in another basic core module.
  • When the first data processing request is a read request, the core cache 116b of the second basic core module 110b may be configured to: when the first data processing request is received and the core cache 116b stores the first target data requested by the first data processing request, return the first target data to the storage client 1111a that sent the first data processing request through the shared bus 115a of the first basic core module 110a; and when the first data processing request is received and the first target data is not stored in the core cache 116b, obtain the first target data from the first target cache unit through the cache network 112b of the second basic core module 110b based on the first data processing request, and return the first target data to the storage client 1111a that sent the first data processing request through the shared bus 115a of the first basic core module 110a.
  • That is, when the core cache 116b of the second basic core module 110b receives the first data processing request sent by a storage client 1111a of the first basic core module 110a, if the first target data is currently stored in the core cache 116b, the first target data can be returned along the original path (that is, the reverse of the path along which the first data processing request was transmitted from the storage client 1111a to the core cache 116b) to the storage client 1111a that sent the first data processing request. The first target data already stored in the core cache 116b may have been obtained from the cache unit where the first target data is located, and stored in the core cache 116b, when a data processing request for the first target data was received previously.
  • If the first target data is not stored in the core cache 116b, the core cache 116b can act in a manner similar to a storage client 1111b and send the first data processing request into the cache network 112b through the cache route 1121b connected to it, so that the first data processing request is routed to the first target cache unit through the cache network 112b. After the first target cache unit returns the first target data, the cache network 112b routes the first target data to the core cache 116b, and the core cache 116b returns the first target data to the storage client 1111a that sent the first data processing request.
  • the structure of the core cache may be as shown in FIG. 4, which is a schematic diagram of the structure of a core cache provided in an embodiment of the present application.
  • The core cache may include: a cache control (Cache-Control) module, a tag cache (Tag-Cache), a dirty mask (Dirty-Mask) module, and a data cache (Data-Cache).
  • the cache control module is configured to implement the following functions: Write-Buffer, Address-Tag management , Read back data (Read-Return) return, hit-miss check (Hit-Miss Check), etc.
  • the hit or miss check module can be used to determine whether the data requested by the data processing request is hit.
  • If the hit-miss check module determines a hit, it means that the data cache already stores the data requested by the data processing request. In that case, the requested data can be obtained from the data cache, output to the shared bus through the read-back data (Read-Return) module, and returned to the storage client that sent the data processing request.
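  • A minimal sketch of the hit-miss check and read-return path built from a tag cache and a data cache (hypothetical names; the Write-Buffer and Dirty-Mask of Fig. 4 are not exercised here):

```python
class CoreCacheInternalsSketch:
    """Hypothetical sketch of the tag cache, data cache and hit-miss check."""

    def __init__(self):
        self.tag_cache = set()    # address tags of the lines currently held
        self.data_cache = {}      # tag -> data

    def hit_miss_check(self, tag):
        return tag in self.tag_cache

    def read(self, tag, fetch_from_cache_unit):
        if self.hit_miss_check(tag):
            return self.data_cache[tag]        # hit: read-return from the data cache
        data = fetch_from_cache_unit(tag)      # miss: fetch through the cache network
        self.tag_cache.add(tag)
        self.data_cache[tag] = data
        return data

cc = CoreCacheInternalsSketch()
print(cc.read("line-7", lambda tag: f"data for {tag}"))   # miss, line is filled
print(cc.hit_miss_check("line-7"))                        # True: next read is a hit
```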
  • When the core cache 116b of the second basic core module 110b receives a data processing request for the first target data (for example, the first time it receives such a request), after obtaining the first target data from the corresponding target cache unit it stores the first target data in the core cache 116b, so that the core cache 116b can directly return the first target data when the next access request arrives. Considering that the storage space of the core cache 116b is limited, a mechanism for periodically cleaning the cache can be provided. If data that has been cleaned is requested again, the core cache 116b retrieves it from the corresponding target cache unit in the above-mentioned manner. In this way, any storage client 1111a in the first basic core module 110a can access a cache unit in the second basic core module 110b.
  • The core cache 116a of the first basic core module 110a has the same function as the core cache 116b of the second basic core module 110b, and can be configured to: when the second data processing request is received and the core cache 116a stores the second target data requested by the second data processing request, return the second target data to the storage client 1111b that sent the second data processing request through the shared bus 115b of the second basic core module 110b; and when the second data processing request is received and the second target data is not stored in the core cache 116a, obtain the second target data from the second target cache unit through the cache network 112a of the first basic core module 110a, and return the second target data to the storage client 1111b that sent the second data processing request through the shared bus 115b of the second basic core module 110b.
  • any storage client 1111b in the second basic core module 110b can access the cache unit 114a in the first basic core module 110a.
  • Any two basic core modules connected to each other in the above-mentioned data processing apparatus 100 can implement access to the cache unit 114 in another basic core module 110 through the above-mentioned implementation manner.
  • FIG. 5 is a schematic structural diagram of a transfer switch shown in an embodiment of the present application.
  • Each transfer switch 113 may include a first port 1131, a second port 1132, a third port 1133, a fourth port 1134, a first data selector 1135, a data buffer 1136, an arbiter 1137, and a second data selector 1138;
  • the first port 1131 is configured to be connected to the corresponding storage client, the second port 1132 is configured to be connected to a cache route, the third port 1133 is configured to be connected to the previous-hop transfer switch 113 through the shared bus 115, and the fourth port 1134 is configured to be connected to the next-hop transfer switch 113 or to the core cache 116 of another basic core module 110 through the shared bus 115. The first data selector 1135 is respectively connected to the first port 1131, the second port 1132, and the data buffer 1136; the arbiter 1137 is connected to the data buffer 1136, the third port 1133, and the fourth port 1134; and the second data selector 1138 is respectively connected to the first port 1131, the second port 1132, the third port 1133, and the fourth port 1134.
  • any one of the first port 1131, the second port 1132, the third port 1133, and the fourth port 1134 may refer to one port or multiple ports.
  • For example, the first port 1131 may include multiple ports, which can be respectively configured to transmit one or more of a read request, a write request, write data, and a write confirmation message.
  • The first data selector 1135 is configured to send the data processing request of the storage client received by the first port 1131 to the cache route connected to the second port 1132, or to the data buffer 1136, and is further configured to return the write confirmation message received by the second port 1132 to the storage client through the first port 1131.
  • The arbiter 1137 is configured to receive the data processing requests sent by the data buffer 1136 and the third port 1133 and, when multiple data processing requests are received, determine which of them is to be responded to with priority and output that data processing request to the shared bus 115 through the fourth port 1134.
  • The arbiter 1137 may determine, according to a preset strategy, which of the multiple data processing requests should be responded to first. For example, generally speaking, a data processing request from the shared bus 115 has a higher priority than a data processing request from the storage client; for multiple data processing requests from sources of the same priority (that is, all from the storage client or all from the shared bus), a first-in-first-out principle can be adopted (that is, the data processing request received first is responded to first), and the number of times each data processing request temporarily stored in the data buffer 1136 has waited can be counted.
  • For example, if requests from the storage client, such as request 1 and request 2, are temporarily stored in the data buffer 1136 while request 3 arrives from the shared bus 115, the arbiter 1137 will select request 3 as the data processing request to be responded to with priority and increase the waiting counts of request 1 and request 2 by 1.
  • When the arbiter 1137 makes the next decision, if there is still a data processing request from the shared bus 115, the arbiter 1137 still preferentially responds to the data processing request from the shared bus 115 and again increases the waiting count of each data processing request from the storage client, until there is no longer a data processing request from the shared bus 115. At that point, the data processing request with the largest waiting count in the data buffer 1136 is selected as the data processing request to be responded to with priority.
  • In addition, an upper threshold of the waiting count may be set for the data processing requests.
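  • A minimal sketch of this arbitration policy (hypothetical names; how the upper waiting threshold interacts with the bus-first rule is an assumption made here for illustration, namely that a request which has waited too long is promoted ahead of a shared-bus request):

```python
from collections import deque

class ArbiterSketch:
    """Hypothetical sketch of the arbitration policy described above."""

    def __init__(self, wait_threshold=8):
        self.client_queue = deque()   # client requests waiting in the data buffer (FIFO)
        self.wait_counts = {}         # request -> number of decisions it has waited through
        self.wait_threshold = wait_threshold

    def submit_client(self, request):
        self.client_queue.append(request)
        self.wait_counts[request] = 0

    def pick(self, bus_request=None):
        overdue = [r for r in self.client_queue
                   if self.wait_counts[r] >= self.wait_threshold]
        if overdue:
            chosen = overdue[0]            # assumed: starved requests win (see lead-in)
        elif bus_request is not None:
            chosen = bus_request           # shared-bus requests normally have priority
        elif self.client_queue:
            chosen = self.client_queue[0]  # otherwise first-in-first-out
        else:
            return None
        if chosen in self.wait_counts:     # a chosen client request leaves the buffer
            self.client_queue.remove(chosen)
            del self.wait_counts[chosen]
        for r in self.client_queue:        # every request left behind waits one more round
            self.wait_counts[r] += 1
        return chosen

arb = ArbiterSketch(wait_threshold=2)
arb.submit_client("request 1")
arb.submit_client("request 2")
print(arb.pick(bus_request="request 3"))   # 'request 3' from the shared bus wins
print(arb.pick())                           # then 'request 1' (first in, first out)
```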
  • The second data selector 1138 is configured to output the read-back data received by the fourth port 1134 to the storage client connected to the first port 1131, or to output it to the shared bus 115 through the third port 1133, and is further configured to output the read-back data received by the second port 1132 to the storage client connected to the first port 1131.
  • the first data selector 1135 can determine whether the data processing request is routed to the cache route or routed to the data buffer 1136 based on the hash function.
  • the data processing request usually contains the cache address to be accessed.
  • the cache address can usually be represented by a binary number with preset bits.
  • The hash function in this embodiment can perform an exclusive-OR operation on the binary number of the cache address to obtain a new binary number, and the new binary number can be used as the target cache address, so that the data processing request is routed to the corresponding target cache unit according to the target cache address.
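  • A minimal sketch of such an exclusive-OR hash over the binary cache address (hypothetical names; the fold width and the way the result is compared against the local address range are assumptions for illustration):

```python
def xor_fold(address, width=32, fold_bits=4):
    """Hypothetical sketch: XOR-fold the binary cache address into a short
    target cache address used to pick the target cache unit."""
    index = 0
    for shift in range(0, width, fold_bits):
        index ^= (address >> shift) & ((1 << fold_bits) - 1)
    return index

request_address = 0x1234ABCD
target = xor_fold(request_address)
print(f"cache address {request_address:#010x} -> target cache address {target:#x}")
# The first data selector can then compare the target cache address against the
# range owned by the local module to choose between the cache route (local access)
# and the data buffer (access to the other module over the shared bus).
```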
  • FIG. 6 is a schematic structural diagram of another transfer switch shown in an embodiment of the present application.
  • The transfer switch 113 is also provided with a request input register (in_request) 11391, two request data registers (request_data) 11392a and 11392b, a request output register (out_request) 11393, four read-back data registers (read_ret) 11394a, 11394b, 11394c, and 11394d, two bus registers 11395a and 11395b, and an acknowledgement message register (ack) 11396; these registers can all be one-level registers.
  • in_request: request input register
  • request_data: request data register
  • out_request: request output register
  • read_ret: read-back data register
  • ack: acknowledgement message register
  • The request input register 11391 is connected to the first port 1131 and the first data selector 1135, and is configured to send the read request or write request of the storage client received by the first port 1131 to the first data selector 1135. The request output register 11393 is connected to the first data selector 1135 and the second port 1132, and is configured to receive the read request or write request sent by the first data selector 1135 and output it to the cache route through the second port 1132.
  • The request data register 11392a is connected to the first port 1131 and the first data selector 1135, and is configured to send the write data corresponding to the write request received by the first port 1131 to the first data selector 1135; the request data register 11392b is connected to the second port 1132 and the first data selector 1135, and is configured to receive the write data sent by the first data selector 1135 and output it to the cache route through the second port 1132.
  • The acknowledgement message register 11396 is connected to the second port 1132 and the first data selector 1135, and is configured to send the write confirmation message received by the second port 1132 to the first data selector 1135 so that it can be returned to the storage client through the first port 1131.
  • The read-back data registers 11394a, 11394b, 11394c, and 11394d are connected to the first port 1131, the second port 1132, the third port 1133, and the fourth port 1134, respectively, and are all connected to the second data selector 1138. The read-back data register 11394b is configured to send the read-back data received from the cache route through the second port 1132 to the second data selector 1138; the read-back data register 11394d is configured to send the read-back data received from the shared bus 115 through the fourth port 1134 to the second data selector 1138; the read-back data register 11394a is configured to receive the read-back data sent by the second data selector 1138 and return it to the storage client through the first port 1131; and the read-back data register 11394c is configured to receive the read-back data sent by the second data selector 1138 and output it to the shared bus 115 through the third port 1133.
  • When the first data processing request is a write request containing write data, any transfer switch 113a in the first basic core module 110a can be configured to: when the first data processing request is received, store the first data processing request in the data buffer 1136 and return a write confirmation message for the first data processing request to the storage client that initiated the first data processing request; and, when the first data processing request satisfies the output condition, output the first data processing request to the shared bus 115a through the fourth port 1134 by means of the arbiter 1137, so as to transmit the first data processing request through the shared bus 115a to the core cache 116b of the second basic core module 110b, so that the core cache 116b of the second basic core module 110b writes the write data into the first target cache unit through the cache network 112b of the second basic core module 110b based on the first data processing request.
  • In this way, when a write request containing write data is received from a storage client, the write request can be stored in the data buffer and a write confirmation message for the write request can be returned to the storage client that initiated the request immediately. After the actual write request and write data are output to the shared bus by the arbiter, they are transmitted over the shared bus to the core cache of the second basic core module; based on the write request, the core cache of the second basic core module writes the write data into the target cache unit to be accessed through the cache network of the second basic core module. A quick response to the storage client's write request can thus be achieved.
  • FIG. 7 is a flowchart of a data processing method provided by an embodiment of the present application. The method may be applied to the data processing apparatus 100 described in any of the above embodiments. Referring to FIG. 7, the data processing method may include:
  • Step S101: after receiving a first data processing request for accessing the first target cache unit in the second basic core module, any transfer switch in the first basic core module transmits the first data processing request to the core cache of the second basic core module through the shared bus of the first basic core module.
  • Step S102: the core cache of the second basic core module accesses the first target cache unit based on the first data processing request.
  • When any transfer switch in a basic core module receives a data processing request for accessing the target cache unit in another basic core module, the transfer switch can transmit the data processing request, through the shared bus connected to it, to the core cache of the other basic core module connected to that shared bus, so that the data processing request is delivered to the other basic core module. The data processing request transmitted to the core cache can then access the target cache unit through the cache network connected to the core cache, so that the computing engine in one basic core module can access the target cache unit in another basic core module through the above-mentioned shared-bus-based architecture. In this way, the number of computing engines can be expanded without degrading performance or increasing process complexity.
  • FIG. 8 is a flowchart of another data processing method provided by an embodiment of the present application. As shown in FIG. 8, the method may further include:
  • Step S103: after receiving a second data processing request for accessing the second target cache unit in the first basic core module, any transfer switch in the second basic core module transmits the second data processing request to the core cache of the first basic core module through the shared bus of the second basic core module.
  • Step S104: the core cache of the first basic core module accesses the second target cache unit based on the second data processing request.
  • Since the shared bus of the first basic core module is connected to the core cache of the second basic core module, and the shared bus of the second basic core module is also connected to the core cache of the first basic core module, the two basic core modules are connected to each other, and the computing engine in either of the two interconnected basic core modules can access the target cache unit in the other basic core module without degrading performance or increasing process complexity.
  • When the first data processing request is a read request, the core cache of the second basic core module accessing the first target cache unit based on the first data processing request in step S102 may include:
  • when the core cache of the second basic core module receives the first data processing request and stores the first target data requested by the first data processing request, returning the first target data to the storage client that sent the first data processing request through the shared bus of the first basic core module;
  • when the core cache of the second basic core module receives the first data processing request and the first target data does not exist in the core cache of the second basic core module, obtaining, based on the first data processing request, the first target data from the first target cache unit through the cache network of the second basic core module, and returning the first target data to the storage client that sent the first data processing request through the shared bus of the first basic core module.
  • In this way, the storage client in the first basic core module can access the cache unit in the second basic core module.
  • Similarly, the core cache of the first basic core module accessing the second target cache unit based on the second data processing request in step S104 may include:
  • when the core cache of the first basic core module receives the second data processing request and stores the second target data requested by the second data processing request, returning the second target data to the storage client that sent the second data processing request through the shared bus of the second basic core module;
  • when the core cache of the first basic core module receives the second data processing request and the second target data does not exist in the core cache of the first basic core module, obtaining, based on the second data processing request, the second target data from the second target cache unit through the cache network of the first basic core module, and returning the second target data to the storage client that sent the second data processing request through the shared bus of the second basic core module.
  • In this way, the storage client in the second basic core module can access the cache unit in the first basic core module.
  • any two basic core modules connected to each other in the above-mentioned data processing device 100 can implement access to a cache unit in another basic core module through the above-mentioned implementation manner.
  • the core of the second basic core module described in step S102 may include:
  • any switch in the first basic core module receives the first data processing request, it stores the first data processing request in the data buffer, and returns the first data processing request to the storage client that initiated the first data processing request. Process the requested write confirmation message.
  • When the first data processing request satisfies the output condition, the arbiter in that switch outputs the first data processing request to the shared bus through the fourth port of the switch, so that the first data processing request is transmitted over the shared bus to the core cache of the second basic core module.
  • The core cache of the second basic core module then writes the write data into the first target cache unit through the cache network of the second basic core module, based on the first data processing request.
  • In the above implementation, when a write request containing write data is received from a storage client, the write request can be stored in the data buffer and a write confirmation message for the write request can be returned immediately to the storage client that initiated the request. After the actual write request and write data are output to the shared bus by the arbiter, they are transmitted over the shared bus to the core cache of the second basic core module, which, based on the write request, writes the write data into the target cache unit to be accessed through the cache network of the second basic core module. This enables a quick response to the write request of the storage client. A behavioral sketch of this write path is given below.
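As a rough illustration of the write path just described, the sketch below models a switch that buffers a write request, acknowledges the storage client immediately, and later hands the buffered request to the shared bus. The class and method names (SwitchModel, client.write_ack, shared_bus.transmit) and the simplified output condition are assumptions for illustration only; the arbiter in the application also prioritizes requests arriving from the shared bus and counts how long buffered requests have waited.

```python
from collections import deque

# Illustrative sketch (assumed names, not the claimed implementation): a switch that
# buffers write requests, acknowledges the storage client at once, and forwards
# buffered requests onto the shared bus when the arbiter selects them.

class SwitchModel:
    def __init__(self, shared_bus):
        self.data_buffer = deque()   # pending write requests (with their write data)
        self.shared_bus = shared_bus

    def receive_write(self, client, request):
        # Store the write request in the data buffer and return a write
        # confirmation message to the initiating storage client right away.
        self.data_buffer.append(request)
        client.write_ack(request["id"])

    def arbitrate(self):
        # Simplified output condition: forward the oldest buffered request.
        if self.data_buffer:
            request = self.data_buffer.popleft()
            # Output through the fourth port onto the shared bus, towards the
            # core cache of the second basic core module.
            self.shared_bus.transmit(request)
```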
  • An embodiment of the present application also provides a processor, and the processor may include the data processing apparatus 100 provided in any of the foregoing embodiments.
  • The processor may be a GPU or a CPU, or may be the aforementioned DCU, or may be a processor that integrates a GPU (or DCU) and a CPU (it can be understood that the GPU or DCU and the CPU are located on the same chip).
  • an embodiment of the present application further provides a chip, which may include the data processing device 100 provided in any of the above embodiments, and the data processing device 100 is formed on the same semiconductor substrate. It can be understood that, on the chip, all basic core modules included in the data processing device 100 are formed on the same semiconductor substrate.
  • The embodiment of the present application also provides another processor. The processor may include the above-mentioned chip. The processor may be a GPU or a CPU, or may be the above-mentioned DCU, or may be a processor that integrates a GPU (or DCU) and a CPU (it can be understood that the GPU or DCU and the CPU are located on the same chip).
  • FIG. 9 is a block diagram of an electronic device 200 provided by an embodiment of the present application.
  • the electronic device 200 may include: a memory 201 and a processor 202, and the memory 201 and the processor 202 may be connected through a bus.
  • a computer program is stored in the memory 201, and when the computer program is executed by the processor 202, the above-mentioned data processing method can be implemented.
  • the processor 202 may be the aforementioned processor including the data processing apparatus 100.
  • the memory 201 may be, but is not limited to, random access memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, electrically erasable programmable read-only memory, and the like.
  • the electronic device 200 may be, but is not limited to, a smart phone, a personal computer (PC), a tablet computer, a personal digital assistant (PDA), a mobile Internet device (MID), etc.
  • the embodiments of the present application also provide a storage medium in which a computer program is stored, and when the computer program is executed by a processor, the above-mentioned data processing method can be implemented.
  • Each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions configured to implement the specified logical functions.
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings; for example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • Each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
  • the functional modules in the various embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
  • The present application provides a data processing device, method, chip, processor, device, and storage medium, which can realize the expansion of the computing engines without affecting performance or increasing process complexity.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Multi Processors (AREA)

Abstract

A data processing device, method, chip, processor, device, and storage medium. The data processing device comprises at least two basic core modules (110), each basic core module (110) comprising multiple computing engines (111), a cache network (112), multiple switches (113), multiple cache units (114), a shared bus (115), and a core cache (116). The multiple cache units (114) and the core cache (116) are respectively connected to the cache network (112); the multiple computing engines (111) are connected to the cache network (112) through the multiple switches (113); the multiple switches (113) are serially connected through the shared bus (115); and the shared bus (115) is connected to the core cache (116) of another basic core module (110). A switch is configured to, after receiving a data processing request for accessing the other basic core module (110), transmit the data processing request to the core cache (116) of the other basic core module (110) through the shared bus (115), so that the core cache accesses the cache unit (114) of the other basic core module (110) based on the data processing request. The expansion of the computing engines (111) can thus be realized without affecting performance or increasing process complexity.

Description

数据处理装置、方法、芯片、处理器、设备及存储介质
本申请要求于2019年12月11日提交中国专利局,申请号为2019112722834、名称为“数据处理装置、方法、芯片、处理器、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及处理器技术领域,具体而言,涉及一种数据处理装置、方法、芯片、处理器、设备及存储介质。
背景技术
图形处理器(Graphics Processing Unit,GPU),是一种专用于在个人电脑、工作站、游戏机和一些移动设备(如平板电脑、智能手机等)上进行图像和图形相关运算工作的微处理器。GPU能够使显卡减少对中央处理器(Central Processing Unit,CPU)的依赖,并进行部分原本CPU的工作,尤其是对于3D图形的处理。
目前,现有的GPU在可扩展性上存在一定的困难,因为目前GPU中的计算引擎(Compute Engine)是通过缓存网络实现互相连接的,如果要扩展这个架构,例如从四个计算引擎扩展到八个计算引擎,很难简单地通过扩展缓存网络来连接更多的计算引擎。这是因为一方面单纯地将缓存网络扩大,会使计算引擎的访问路径变长,从而会导致性能的明显下降,另一方面,是存在芯片绕线资源的限制和物理工艺的限制,直接将缓存网络扩大会增加工艺复杂度且较难实现。
因此,如何在不影响性能、不提高工艺复杂度的基础上实现计算引擎的扩展是当前亟需解决的问题。
发明内容
本申请实施例所提供的技术方案如下所示:
第一方面,本申请实施例提供一种数据处理装置,所述数据处理装置包括:至少两个基础核心模块,每个所述基础核心模块包括:多个计算引擎、缓存网络、多个转换开关、多个缓存单元、共享总线以及核心缓存;
每个所述基础核心模块中,所述多个缓存单元以及所述核心缓存分别与所述缓存网络连接,所述多个计算引擎通过所述多个转换开关与所述缓存网络连接,所述多个转换开关通过所述共享总线串行连接;
所述至少两个基础核心模块中的第一基础核心模块的共享总线与第二基础核心模块的核心缓存连接,所述第一基础核心模块中的任一转换开关配置成在接收到访问所述第二基础核心模块中的第一目标缓存单元的第一数据处理请求后,通过所述第一基础核心模块的共享总线将所述第一数据处理请求传输至所述第二基础核心模块的核心缓存,所述第二基础核心模块的核心缓存配置成基于所述第一数据处理请求访问所述第一目标缓存单元。
在上述实施方式中,数据处理装置包括至少两个基础核心模块,每个基础核心模块中都包括多个计算引擎,每个计算引擎都通过转换开关与缓存网络连接,多个转换开关通过所述共享总线串行连接,并且一个基础核心模块中的共享总线与另一个基础核心模块中的核心缓存连接,而另一个基础核心模块中的核心缓存是与另一个基础核心模块中的缓存网络连接的,因此,通过上述架构,实现了对计算引擎的数量的扩展。当一基础核心模块中的任一转换开关在接收到访问另一基础核心模块中的目标缓存单元的数据处理请求后,该转换开关能够通过其连接的共享总线将该数据处理请求传输至与共享总线连接的另一基础核心模块的核心缓存,从而实现了将该数据处理请求传输到另一基础核心模块,此时已传输至该核心缓存中的该数据处理请求就能够通过该核心缓存所连接的缓存网络,访问到目标缓存单元,从而通过上述的基于共享总线的架构,实现了一个基础核心模块中的计算引擎对另一基础核心模块中的目标缓存单元的访问。由于在一个基础核心模块中由于没有对缓存网络进行扩展,计算引擎的访问路径没有延长,因此对于一个基础核心模块的性能没 有影响,也没有增加工艺复杂度。因此,能够在不影响性能、不提高工艺复杂度的情况下实现对计算引擎扩展。
在一种可选的实施方式中,所述第二基础核心模块的共享总线与所述第一基础核心模块的核心缓存连接,所述第二基础核心模块中的任一转换开关配置成在接收到访问所述第一基础核心模块中的第二目标缓存单元的第二数据处理请求后,通过所述第二基础核心模块的共享总线将所述第二数据处理请求传输至所述第一基础核心模块的核心缓存,所述第一基础核心模块的核心缓存配置成基于所述第二数据处理请求访问所述第二目标缓存单元。
在上述实施方式中,在第一基础核心模块的共享总线与第二基础核心模块的核心缓存连接基础上,第二基础核心模块的共享总线也与所述第一基础核心模块的核心缓存连接,从而使两个基础核心模块相互连接,能够在不影响性能、不提高工艺复杂度的情况下,使相互连接的两个基础核心模块中的任一基础核心模块中的计算引擎,对另一基础核心模块中的目标缓存单元的访问。从而能够实现一个基础核心模块中的存储客户端对另一基础核心模块中的目标缓存单元的访问。
在一种可选的实施方式中,在每个所述基础核心模块中,每个计算引擎包括多个存储客户端,每个存储客户端通过一个转换开关与缓存网络中的一个缓存路由连接,核心缓存与缓存网络中的一个缓存路由连接。
在上述实施方式中,在每个基础核心模块中,每个计算引擎中的每个存储客户端通过一个转换开关与一个缓存路由连接,核心缓存与缓存网络中的一个缓存路由连接,由于基础核心模块中的转换开关通过共享总线连接,因此在计算引擎中的任一存储客户端需要访问另一基础核心模块中的缓存单元时,其数据处理请求不经过缓存网络,而是通过转换开关以及共享总系传输到另一基础核心模块的核心缓存中,从而能够实现一个基础核心模块中的存储客户端对另一基础核心模块的访问。
在一种可选的实施方式中,在每个所述基础核心模块中,所述多个计算引擎包括的多个存储客户端与所述多个转换开关一一对应,每个存储客户端通过对应的转换开关与缓存网络中的一个缓存路由连接,所述缓存网络包括呈网格状排布的多个缓存路,所述缓存网络中的每个缓存路由与相邻的每个缓存路由连接。
在上述实施方式中,在每个基础核心模块中,每个计算引擎中的每个存储客户端通过一个对应的转换开关与一个缓存路由连接,由于基础核心模块中的转换开关通过共享总线连接,因此在计算引擎中的任一存储客户端需要访问另一基础核心模块中的缓存单元时,其数据处理请求不经过缓存网络,而是通过转换开关以及共享总系传输到另一基础核心模块的核心缓存中,从而能够实现一个基础核心模块中的存储客户端对另一基础核心模块的访问。
在一种可选的实施方式中,所述第一数据处理请求为读请求,所述第二基础核心模块的核心缓存配置成:
在接收到所述第一数据处理请求,且所述第二基础核心模块的核心缓存中存储有所述第一数据处理请求所请求的第一目标数据时,将所述第一目标数据通过所述第一基础核心模块的共享总线返回至发送所述第一数据处理请求的存储客户端;
在接收到所述第一数据处理请求,且所述第二基础核心模块的核心缓存中不存在所述第一目标数据时,基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,从所述第一目标缓存单元中获取所述第一目标数据,并将所述第一目标数据通过所述第一基础核心模块的共享总线返回至发送所述第一数据处理请求的存储客户端。
在上述实施方式中,在第一基础核心模块的计算引擎中的存储客户端发送的第一数据处理请求传输到第二基础核心模块的核心缓存时,如果该核心缓存中已经存储有第一数据处理请求所请求的第一目标数据时,该核心缓存会将第一目标数据直接返回给该客户端,如果该核心缓存没有存储第一目标数据,则该核心缓存能够通过与其连接的第二基础核心 模块的缓存网络从第一目标缓存单元中获取该第一目标数据并返回给该客户端,从而能够实现第一基础核心模块的计算引擎中的存储客户端对第二基础核心模块中的缓存单元的访问。
在一种可选的实施方式中,每个转换开关包括第一端口、第二端口、第三端口、第四端口、第一数据选择器、数据缓冲器、裁决器和第二数据选择器;
其中,所述第一端口配置成与对应的存储客户端连接,所述第二端口配置成与一个缓存路由连接,所述第三端口配置成通过共享总线与上一跳转换开关连接,所述第四端口配置成通过共享总线与下一跳转换开关或另一基础核心模块的核心缓存连接,所述第一数据选择器分别与所述第一端口、所述第二端口和所述数据缓冲器连接,所述裁决器分别与所述数据缓冲器、所述第三端口和所述第四端口连接,所述第二数据选择器分别与所述第一端口、所述第二端口、所述第三端口和所述第四端口连接;
所述第一数据选择器配置成将所述第一端口接收到的存储客户端的数据处理请求发送至与所述第二端口连接的缓存路由,或者发送至所述数据缓冲器;
所述裁决器配置成接收所述数据缓冲器和所述第三端口发送的数据处理请求,并在接收到的数据处理请求为多个时,确定多个数据处理请求中优先响应的数据处理请求,并将所述优先响应的数据处理请求通过所述第四端口输出至共享总线;
所述第二数据选择器配置成将所述第四端口接收到的读回数据输出至与所述第一端口连接的存储客户端,或者通过所述第三端口输出至共享总线,还配置成将所述第二端口接收到的读回数据输出至与所述第一端口连接的存储客户端。
在上述实施方式中,转换开关能够通过第一数据选择器将第一端口接收到的存储客户端发送的数据处理请求发送至与第二端口连接的缓存路由,或者发送至数据缓冲器,转换开关中的裁决器能够接收数据缓冲器和第三端口发送的数据处理请求,并在接收到的数据处理请求为多个时,确定优先响应的数据处理请求,并将优先响应的数据处理请求通过第四端口输出至共享总线;转换开关能够通过第二数据选择器将第四端口接收到的读回数据输出至与第一端口连接的存储客户端,或者通过第三端口输出至共享总线,还配置成将所述第二端口接收到的读回数据输出至与所述第一端口连接的存储客户端。因此,能够通过转换开关将数据处理请求路由到缓存网络或共享总线,或者将返回的读回数据路由至存储客户端或共享总线。
在一种可选的实施方式中,所述第一数据处理请求为包含写入数据的写请求,所述第一基础核心模块中的任一转换开关配置成:
在接收到所述第一数据处理请求时,将所述第一数据处理请求存储在所述数据缓冲器,并向发起所述第一数据处理请求的存储客户端返回针对所述第一数据处理请求的写确认消息;
在所述第一数据处理请求满足输出条件时,通过所述裁决器将所述第一数据处理请求通过所述第四端口输出至共享总线,以通过所述共享总线将所述第一数据处理请求传输至所述第二基础核心模块的缓存核心,以使所述第二基础核心模块的缓存核心基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,将所述写入数据写入所述第一目标缓存单元中。
在上述实施方式中,能够在接收到存储客户端的包含写入数据的写请求时,将写请求存储在数据缓冲器,并向发起请求的存储客户端返回针对该写请求的写确认消息,在写请求被裁决器输出至共享总线,并由共享总线传输至第二基础核心模块的缓存核心时,第二基础核心模块的缓存核心基于该写请求,通过第二基础核心模块的缓存网络,将该写入数据写入需要访问的目标缓存单元中,能够实现对存储客户端的写请求的快速响应。
第二方面,本申请实施例提供一种数据处理方法,应配置成上述第一方面所述的数据处理装置,所述方法包括:
所述第一基础核心模块中的任一转换开关在接收到访问所述第二基础核心模块中的第 一目标缓存单元的第一数据处理请求后,通过所述第一基础核心模块的共享总线将所述第一数据处理请求传输至所述第二基础核心模块的核心缓存;
所述第二基础核心模块的核心缓存基于所述第一数据处理请求访问所述第一目标缓存单元。
在上述实施方式中,数据处理装置包括至少两个基础核心模块,每个基础核心模块中都包括多个计算引擎,每个计算引擎都通过转换开关与缓存网络连接,多个转换开关通过所述共享总线串行连接,并且一个基础核心模块中的共享总线与另一个基础核心模块中的核心缓存连接,而另一个基础核心模块中的核心缓存是与另一个基础核心模块中的缓存网络连接的,因此,通过上述架构,实现了对计算引擎的数量的扩展。当一基础核心模块中的任一转换开关在接收到访问另一基础核心模块中的目标缓存单元的数据处理请求后,该转换开关能够通过其连接的共享总线将该数据处理请求传输至与共享总线连接的另一基础核心模块的核心缓存,从而实现了将该数据处理请求传输到另一基础核心模块,此时已传输至该核心缓存中的该数据处理请求就能够通过该核心缓存所连接的缓存网络,访问到目标缓存单元,从而通过上述的基于共享总线的架构,实现了一个基础核心模块中的计算引擎对另一基础核心模块中的目标缓存单元的访问。由于在一个基础核心模块中由于没有对缓存网络进行扩展,计算引擎的访问路径没有延长,因此对于一个基础核心模块的性能没有影响,也没有增加工艺复杂度。因此,能够在不影响性能、不提高工艺复杂度的情况下实现对计算引擎扩展。
在一种可选的实施方式中,所述第二基础核心模块的共享总线与所述第一基础核心模块的核心缓存连接,所述方法还包括:
所述第二基础核心模块中的任一转换开关在接收到访问所述第一基础核心模块中的第二目标缓存单元的第二数据处理请求后,通过所述第二基础核心模块的共享总线将所述第二数据处理请求传输至所述第一基础核心模块的核心缓存;
所述第一基础核心模块的核心缓存基于所述第二数据处理请求访问所述第二目标缓存单元。
在上述实施方式中,在第一基础核心模块的共享总线与第二基础核心模块的核心缓存连接基础上,第二基础核心模块的共享总线也与所述第一基础核心模块的核心缓存连接,从而使两个基础核心模块相互连接,能够在不影响性能、不提高工艺复杂度的情况下,使相互连接的两个基础核心模块中的任一基础核心模块中的计算引擎,对另一基础核心模块中的目标缓存单元的访问。从而能够实现一个基础核心模块中的存储客户端对另一基础核心模块中的目标缓存单元的访问。
在一种可选的实施方式中,所述第一数据处理请求为读请求,所述第二基础核心模块的核心缓存基于所述第一数据处理请求访问所述第一目标缓存单元,包括:
所述第二基础核心模块的核心缓存在接收到所述第一数据处理请求,且所述第二基础核心模块的核心缓存中存储有所述第一数据处理请求所请求的第一目标数据时,将所述第一目标数据通过所述第一基础核心模块的共享总线返回至发送所述第一数据处理请求的存储客户端;
在接收到所述第一数据处理请求,且所述第二基础核心模块的核心缓存中不存在所述第一目标数据时,基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,从所述第一目标缓存单元中获取所述第一目标数据,并将所述第一目标数据通过所述第一基础核心模块的共享总线返回至发送所述第一数据处理请求的存储客户端。
在上述实施方式中,在第一基础核心模块的计算引擎中的存储客户端发送的第一数据处理请求传输到第二基础核心模块的核心缓存时,如果该核心缓存中已经存储有第一数据处理请求所请求的第一目标数据时,该核心缓存会将第一目标数据直接返回给该客户端,如果该核心缓存没有存储第一目标数据,则该核心缓存能够通过与其连接的第二基础核心模块的缓存网络从第一目标缓存单元中获取该第一目标数据并返回给该客户端,从而能够 实现第一基础核心模块的计算引擎中的存储客户端对第二基础核心模块中的缓存单元的访问。
在一种可选的实施方式中,所述第一数据处理请求为包含写入数据的写请求,所述第二基础核心模块的核心缓存基于所述第一数据处理请求访问所述第一目标缓存单元,包括:
所述第一基础核心模块中的任一转换开关在接收到所述第一数据处理请求时,将所述第一数据处理请求存储在数据缓冲器,并向发起所述第一数据处理请求的存储客户端返回针对所述第一数据处理请求的写确认消息;
在所述第一数据处理请求满足输出条件时,所述任一转换开关中的裁决器将所述第一数据处理请求通过所述任一转换开关的第四端口输出至共享总线,以通过所述共享总线将所述第一数据处理请求传输至所述第二基础核心模块的缓存核心;
所述第二基础核心模块的缓存核心基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,将所述写入数据写入所述第一目标缓存单元中;
所述第二基础核心模块的缓存核心基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,将所述写入数据写入所述第一目标缓存单元中。
在上述实施方式中,能够在接收到存储客户端的包含写入数据的写请求时,将写请求存储在数据缓冲器,并向发起请求的存储客户端返回针对该写请求的写确认消息,在写请求被裁决器输出至共享总线,并由共享总线传输至第二基础核心模块的缓存核心时,第二基础核心模块的缓存核心基于该写请求,通过第二基础核心模块的缓存网络,将该写入数据写入需要访问的目标缓存单元中,能够实现对存储客户端的写请求的快速响应。
在一种可选的实施方式中,所述裁决器配置成当多个所述数据处理请求分别来自于共享总线和存储客户端时,将来自于所述共享总线的数据处理请求确定为优先响应的数据处理请求。
在一种可选的实施方式中,所述裁决器配置成当多个所述数据处理请求均来自于共享总线或存储客户端时,将最先接收到的数据处理请求确定为优先响应的数据处理请求。
在一种可选的实施方式中,所述裁决器配置成对暂存于数据缓冲器中的数据处理请求的等待次数进行计数,并在所述数据缓冲器中选择等待次数最大的数据处理请求作为优先响应的数据处理请求。
在一种可选的实施方式中,当存储在所述核心缓存中的数据的存储时长达到预设的时长阈值时,将所述数据删除或将所述数据设置为允许覆盖状态。
第三方面,本申请实施例提供一种处理器,包括上述第一方面所述的数据处理装置。
第四方面,本申请实施例提供一种芯片,包括上述第一方面所述的数据处理装置,所述数据处理装置形成在同一半导体基板上。
第五方面,本申请实施例提供一种处理器,包括上述第四方面所述的芯片。
第六方面,本申请实施例提供一种电子设备,包括:存储器和处理器,所述存储器中存储有计算机程序,所述计算机程序被所述处理器执行时,实现上述第二方面所述的数据处理方法。
第七方面,本申请实施例提供一种存储介质,所述存储介质中存储有计算机程序,所述计算机程序被处理器执行时,实现上述第二方面所述的数据处理方法。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍。应当理解,以下附图仅示出了本申请的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。
图1是本申请实施例提供的一种数据处理装置的结构示意图。
图2是本申请实施例提供的另一种数据处理装置的结构示意图。
图3是本申请实施例提供的又一种数据处理装置的结构示意图。
图4是本申请实施例提供的一种核心缓存的结构示意图。
图5是本申请实施例示出的一种转换开关的结构示意图。
图6是本申请实施例示出的另一种转换开关的结构示意图。
图7是本申请实施例提供的一种数据处理方法的流程图。
图8是本申请实施例提供的另一种数据处理方法的流程图。
图9是本申请实施例提供的一种电子设备的框图。
图标:100-数据处理装置;110-基础核心模块;111-计算引擎;112-缓存网络;113-转换开关;114-缓存单元;115-共享总线;116-核心缓存;110a-第一基础核心模块;110b-第二基础核心模块;111a-第一基础核心模块中的计算引擎;111b-第二基础核心模块中的计算引擎;112a-第一基础核心模块中缓存网络;112b-第二基础核心模块中缓存网络;113a-第一基础核心模块中的转换开关;113b-第二基础核心模块中的转换开关;114a-第一基础核心模块中的缓存单元;114b-第二基础核心模块中的缓存单元;115a-第一基础核心模块中的共享总线;115b-第二基础核心模块中的共享总线;116a-第一基础核心模块中的核心缓存;116b-第二基础核心模块中的核心缓存;1111a-计算引擎111a中的存储客户端;1111b-计算引擎111b中的存储客户端;1121b-缓存网络112b中的缓存路由;1131-第一端口、1132-第二端口、1133-第三端口、1134-第四端口、1135-第一数据选择器、1136-数据缓冲器、1137-裁决器和1138-第二数据选择器;11391-请求输入寄存器;11392a、11392b-请求数据寄存器;11393-请求输出寄存器;11394a、11394b、11394c、11394d-读回数据寄存器;11395a、11395b-总线寄存器;11396-确认消息寄存器。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。需要说明的是,术语“第一”、“第二”等仅用于区分描述,而不能理解为指示或暗示相对重要性。
在现有技术中,GPU芯片通常包含有四个计算引擎(Compute Engine),每个计算引擎可以理解为GPU的一个核心(Core),每个计算引擎通常包含多个存储客户端(Memory Client),每个存储客户端可以理解为是计算引擎中的一个核心,所有的存储客户端都与缓存网络连接,并通过该缓存网络访问内存/缓存。因为目前GPU中的计算引擎是通过上述的缓存网络实现互相连接的,所以GPU在可扩展性上存在一定的困难。如果要扩展这个架构,例如从四个计算引擎扩展到八个计算引擎,如果单纯地将缓存网络扩大,会使计算引擎中的存储客户端的访问路径变长,在最差的情况下,一个存储客户端可能需要经常很长的路径来访问缓存/内存。例如,在从四个计算引擎扩展到八个计算引擎时,如果采用将缓存网络扩大的方式,则需要对应的将缓存网络扩展到原来的两倍,在此情况下,如果GPU中位于左上角的存储客户端需要访问右下角的缓存,则该存储客户端的访问路径的长度也将扩展到两倍,从而会导致性能的明显下降。另一方面,由于芯片绕线资源的限制和物理工艺的限制,从四个计算引擎扩展到八个计算引擎时,也会大大增加制造工艺难度。
因此,如何在不影响性能、不提高工艺复杂度的基础上实现计算引擎的扩展是本领域技术人员的一大难题。鉴于上述问题,本申请申请人经过长期研究探索,提出以下实施例以解决上述问题。下面结合附图,对本申请实施例作详细说明。在不冲突的情况下,下述的实施例及实施例中的特征可以相互组合。
图1是本申请实施例提供的一种数据处理装置100的结构示意图,该数据处理装置100可以应用于一种处理器,该处理器可以是GPU、深度计算单元(Deep Computing Unit,DCU)或CPU,该CPU也可以是集成了GPU的CPU,该DCU可以理解为一种配置成通用计算的图形处理器(General Purpose Computing on Graphics Processing Units,GPGPU),但是DCU通常不包括一般GPU中的图形处理的部分。
请参照图1,该数据处理装置100包括:至少两个基础核心模块110,每个基础核心模块110包括:多个计算引擎111、缓存网络112、多个转换开关(Switch)113、多个缓存单元114、共享总线(Share Bus)115以及核心缓存(Core Cache)116。
每个基础核心模块110中,多个缓存单元114以及核心缓存116分别与缓存网络112连接,多个计算引擎111通过多个转换开关113与缓存网络112连接,多个转换开关113通过共享总线115串行连接。核心缓存116配置成与另一基础核心模块110中的共享总线115连接,以实现两个基础核心模块110的连接。
图2是本申请实施例提供的另一种数据处理装置100的结构示意图,下面结合图2,以所述至少两个基础核心模块中的第一基础核心模块110a和第二基础核心模块110b为例进行说明。如图2所示,第一基础核心模块110a的共享总线115a与第二基础核心模块110b的核心缓存116b连接,该第一基础核心模块110a中的任一转换开关113a配置成在接收到访问第二基础核心模块110b中的第一目标缓存单元的第一数据处理请求后,通过第一基础核心模块110a的共享总线115a将第一数据处理请求传输至第二基础核心模块110b的核心缓存116b,第二基础核心模块110b的核心缓存116b配置成基于第一数据处理请求访问第一目标缓存单元。该第一目标缓存单元可以是第二基础核心模块110b中的多个缓存单元114b中的任意一个缓存单元。
其中,任一转换开关113a在接收到第一数据处理请求后,通过第一基础核心模块110a的共享总线115a将第一数据处理请求传输至第二基础核心模块110b的核心缓存116b,可以理解为:如果在共享总线115a上,在当前的转换开关113a与核心缓存116b之间还存在其他的转换开关113a,则当前的转换开关113a在接收到该第一数据处理请求后,通过共享总线115a将第一数据处理请求传输至下一跳转换开关113a,并由该下一跳转换开关113a将第一数据处理请求继续向下游传输,直至该第一数据处理请求被传输至核心缓存116b。其中,需要说明的是,本申请中的上游和下游是以数据传输的方向为参照的。
另外,当基础核心模块的数量大于2时,该第一基础核心模块110a和第二基础核心模块110b可以是所述至少两个基础核心模块中具有连接关系的任意两个基础核心模块。例如,当所述至少两个基础核心模块采用线性排列的结构时,第一基础核心模块110a和第二基础核心模块110b可以是左右相邻的两个基础核心模块。
还需要说明的是,每个基础核心模块110中的计算引擎111的数量可以根据需要设置,例如,通常情况下,每个基础核心模块110中包括4个计算引擎111。缓存网络112是由多个缓存路由(Cache Router)组成。任一缓存单元114可以是缓存库(Cache Bank),例如L2(二级缓存)缓存库。示例的,在每个基础核心模块110中可以设置有16个L2缓存库,图1或图2中所示的每个缓存单元可以表示4个L2缓存库。
在上述实施方式中,每个基础核心模块中都包括多个计算引擎,每个计算引擎都通过转换开关与缓存网络连接,多个转换开关通过所述共享总线串行连接,并且一个基础核心模块中的共享总线与另一个基础核心模块中的核心缓存连接,而另一个基础核心模块中的核心缓存是与另一个基础核心模块中的缓存网络连接的,因此,通过上述架构,实现了对计算引擎的数量的扩展。当一基础核心模块中的任一转换开关在接收到访问另一基础核心模块中的目标缓存单元的数据处理请求后,该转换开关能够通过其连接的共享总线将该数据处理请求传输至与共享总线连接的另一基础核心模块的核心缓存,从而实现了将该数据处理请求传输到另一基础核心模块,此时已传输至该核心缓存中的该数据处理请求就能够通过该核心缓存所连接的缓存网络,访问到目标缓存单元,从而通过上述的基于共享总线的架构,实现了一个基础核心模块中的计算引擎对另一基础核心模块中的目标缓存单元的访问。由于在一个基础核心模块中由于没有对缓存网络进行扩展,计算引擎的访问路径没有延长,因此对于一个基础核心模块的性能没有影响,也没有增加工艺复杂度。因此,能够在不影响性能、不提高工艺复杂度的情况下实现对计算引擎扩展。
进一步的,如图2所示,第二基础核心模块110b的共享总线115b与第一基础核心模块110a的核心缓存116a连接,第二基础核心模块110b中的任一转换开关113b配置成在接收到访问第一基础核心模块110a中的第二目标缓存单元的第二数据处理请求后,通过第二基础核心模块110b的共享总线115b将第二数据处理请求传输至第一基础核心模块110a 的核心缓存116a,第一基础核心模块110a的核心缓存116a配置成基于第二数据处理请求访问第二目标缓存单元。该第二目标缓存单元可以是第一基础核心模块110a中的多个缓存单元114a中的任意一个缓存单元。
由此可见,在上述实施方式中,在第一基础核心模块110a的共享总线115a与第二基础核心模块110b的核心缓存116b连接的基础上,第二基础核心模块110b的共享总线115b也与第一基础核心模块110a的核心缓存116a连接,从而使两个基础核心模块相互连接,能够在不影响性能、不提高工艺复杂度的情况下,使相互连接的两个基础核心模块中的任一基础核心模块中的计算引擎能够访问另一基础核心模块中的目标缓存单元。在一种可选的实施方式中,在每个基础核心模块110中,每个计算引擎111可包括多个存储客户端,每个存储客户端通过一个转换开关113与缓存网络112中的一个缓存路由连接,核心缓存116与缓存网络112中的一个缓存路由连接。其中,每个计算引擎111中的多个存储客户端就相当于计算引擎111中的多个核心,一般情况下,GPU中的一个计算引擎中有64个存储客户端。
因此,本申请不仅能够使得相互连接的两个基础核心模块中的其中一个基础核心模块的计算引擎访问另一个基础核心模块中的目标缓存单元,还能够实现一个基础核心模块中的存储客户端对另一基础核心模块中的目标缓存单元的访问。进一步的,缓存网络112包括呈网格状排布的多个缓存路由,缓存网络中的每个缓存路由与相邻的每个缓存路由连接,其中一个缓存路由与核心缓存116连接;此外,在每个基础核心模块110中,多个计算引擎111包括的多个存储客户端与多个转换开关113一一对应,每个存储客户端通过对应的转换开关113与缓存网络112中的一个缓存路由连接。
示例的,图3是本申请实施例提供的又一种数据处理装置100的结构示意图,参见图3,第一基础核心模块110a中的任一计算引擎111a均包括多个存储客户端1111a,第一基础核心模块110a中所有计算引擎111a包括的多个存储客户端1111a与多个转换开关113a一一对应,例如,第一基础核心模块110a中如果有n个存储客户端1111a,则第一基础核心模块110a中就有n个转换开关113a,且每个存储客户端1111a与一个对应的转换开关113a连接。第一基础核心模块110a中的缓存网络112a包括呈网格状排布(或者称为阵列排布)的多个缓存路由1121a,缓存网络112a中的每个缓存路由1121a均与相邻的每个缓存路由1121a连接,例如一个缓存路由1121a的上方、下方或左方均存在相邻的缓存路由1121a时,则该缓存路由1121a与上方、下方和左方的相邻缓存路由1121a均连接。每个存储客户端1111a与一个对应的转换开关113a连接,并通过该对应的转换开关113a与一个缓存路由1121a连接,同时,多个转换开关113a通过第一基础核心模块110a的共享总线115a串行连接,该共享总线115a与第二基础核心模块110b的核心缓存116b连接,该核心缓存116a与一个缓存路由1121a连接。
与第一基础核心模块110a类似,第二基础核心模块110b中的任一计算引擎111b均包括多个存储客户端1111b,第二基础核心模块110b中所有计算引擎111b包括的多个存储客户端1111b与多个转换开关113b一一对应。第二基础核心模块110b中缓存网络112b包括呈网格状排布的多个缓存路由1121b,缓存网络112b中的每个缓存路由1121b均与相邻的每个缓存路由1121b连接。每个存储客户端1111b与一个对应的转换开关113b连接,并通过该转换开关113b与一个缓存路由1121b连接,同时多个转换开关113b通过第二基础核心模块110b的共享总线115b串行连接,该共享总线115b与第一基础核心模块110a的核心缓存116a连接,该核心缓存116b与一个缓存路由1121b连接。
对于不同的数据处理请求,上述的数据处理装置100的处理方式也有所不同,下面分别针对不同类型的数据处理请求,对一个基础核心模块中的存储客户端对另一基础核心模块中的目标缓存单元的访问过程进行说明。
上述第一数据处理请求为读请求时,第二基础核心模块110b的核心缓存116b可以配置成:
在接收到第一数据处理请求,且第二基础核心模块110b的核心缓存116b中存储有第一数据处理请求所请求的第一目标数据时,将第一目标数据通过第一基础核心模块110a的共享总线115a返回至发送第一数据处理请求的存储客户端1111a。
在接收到第一数据处理请求,且第二基础核心模块110b的核心缓存116b中不存在第一目标数据时,基于第一数据处理请求,通过第二基础核心模块116b的缓存网络112b,从第一目标缓存单元中获取第一目标数据,并将第一目标数据通过第一基础核心模块110a的共享总线115a返回至发送第一数据处理请求的存储客户端1111a。
即,在第二基础核心模块110b的核心缓存116b接收到来自第一基础核心模块110a的某一存储客户端1111a发送的第一数据处理请求时,如果核心缓存116b中当前已经存储了第一数据处理请求所请求的第一目标数据,可以将该第一目标数据原路(即第一数据处理请求从储客户端1111a传输至核心缓存116b的路径的相反路径)返回至发送第一数据处理请求的存储客户端1111a。其中,核心缓存116b中已经存储的第一目标数据可以是在上一次接收到请求访问该第一目标数据的数据处理请求时,从第一目标数据所在的缓存单元获取并存储在核心缓存116b中的。
如果核心缓存116b中没有存储第一数据处理请求所请求的第一目标数据,则核心缓存116b可以与一个存储客户端1111b类似,通过与其连接的缓存路由1121b将该第一数据处理请求发送至第二基础核心模块110b的缓存网络112b中,以便将该第一数据处理请求通过缓存网络112b路由到第一目标缓存单元,在从第一目标缓存单元中获取该第一目标数据后,由缓存网络112b将该第一目标数据路由到核心缓存116b中,由核心缓存116b将该第一目标数据原路返回至发送第一数据处理请求的存储客户端1111a。
另外,核心缓存的结构可以如图4所示,图4是本申请实施例提供的一种核心缓存的结构示意图,参见图4,缓存核心可以包括:缓存控制(Cache Control)模块,标签缓存(Tag-Cache),脏数据掩码(Dirty-Mask)模块和数据缓存(Data-Cache),该缓存控制模块配置成实现以下功能:写缓存(Write-Buffer),地址标签(Address-Tag)管理,读回数据(Read-Return)返回,命中未命中校验(Hit-Miss Check)等。示例的,当接收到数据处理请求时,可以通过命中或未命中校验模块确定是否命中数据处理请求所请求的数据,当确定命中校验模块确定时,说明数据缓存已经存储了数据处理请求所请求的数据,从而可以从数据缓存中获取该数据,并通过读回数据模块输出至共享总线,并返回至发送数据处理请求的存储客户端。
其中,第二基础核心模块110b的核心缓存116b在接收到针对该第一目标数据的一次数据处理请求时(例如第一次接收到针对该第一目标数据的数据处理请求时),在从目标缓存单元获取到该第一目标数据后,将该第一目标数据存储在核心缓存116b中,以便下一次访问请求到来时核心缓存116b可以直接返回该第一目标数据。考虑到核心缓存116b中的存储空间有限,可以设置定期清理缓存的机制,例如当存储在核心缓存116b中的数据的存储时长达到预设的时长阈值时,将该数据删除(或者设置为允许覆盖),在该数据被删除后,下一次接收到针对该数据的访问请求时,需要核心缓存116b按照上述的方式从对应的目标缓存单元中重新获取该数据。
通过上述实施方式,能够实现第一基础核心模块110b中的任一存储客户端1111a对第二基础核心模块110b中的缓存单元的访问。
同理,第二数据处理请求为读请求时,第一基础核心模块110a的核心缓存116a与第二基础核心模块110b的核心缓存116b的作用相同,可以配置成:
在接收到第二数据处理请求,且第一基础核心模块110a的核心缓存116a中存储有第二数据处理请求所请求的第二目标数据时,将第二目标数据通过第二基础核心模块110b的共享总线115b返回至发送第二数据处理请求的存储客户端1111b。
在接收到第二数据处理请求,且第一基础核心模块110a的核心缓存116a中不存在第二目标数据时,基于第二数据处理请求,通过第一基础核心模块116a的缓存网络112a,从 第二目标缓存单元中获取第二目标数据,并将第二目标数据通过第二基础核心模块110b的共享总线115b返回至发送第二数据处理请求的存储客户端1111b。
从而能够实现第二基础核心模块110b中的任一存储客户端1111b对第一基础核心模块110a中的缓存单元114a的访问。上述的数据处理装置100中的相互连接的任意两个基础核心模块均可以通过上述的实施方式实现对另一个基础核心模块110中的缓存单元114的访问。
图5是本申请实施例示出的一种转换开关的结构示意图,每个转换开关113均可以包括第一端口1131、第二端口1132、第三端口1133、第四端口1134、第一数据选择器1135、数据缓冲器1136、裁决器(Arbitor)1137和第二数据选择器1138;
其中,第一端口1131配置成与对应的存储客户端连接,第二端口1132配置成与一个缓存路由连接,第三端口1133配置成通过共享总线115与上一跳转换开关113连接,第四端口1134配置成通过共享总线115与下一跳转换开关113或另一基础核心模块110的核心缓存116连接,第一数据选择器1135分别与第一端口1131、第二端口1132和数据缓冲器1136连接,裁决器1137分别与数据缓冲器1136、第三端口1133和第四端口1134连接,第二数据选择器1138分别与第一端口1131、第二端口1132、第三端口1133和第四端口1134连接。其中,需要说明的是第一端口1131、第二端口1132、第三端口1133和第四端口1134中的任一端口可以是指一个端口,也可以是多个端口,例如第一端口1131可以包含多个端口,多个端口可以分别配置成传输读请求、写请求、写入数据、写确认消息中的一种或几种。
第一数据选择器1135配置成将第一端口1131接收到的存储客户端的数据处理请求发送至与第二端口1132连接的缓存路由,或者发送至数据缓冲器1136,以及配置成将第二端口1132接收到的写确认消息通过第一端口1131返回给存储客户端。
裁决器1137配置成接收数据缓冲器1136和第三端口1133发送的数据处理请求,并在接收到的数据处理请求为多个时,确定多个数据处理请求中优先响应的数据处理请求,并将优先响应的数据处理请求通过第四端口1134输出至共享总线115。
其中,当接收到的数据处理请求为多个,裁决器1137可以根据预先设置的策略来确定多个数据处理请求中的哪一个数据处理请求应当优先响应。示例的,通常来说,来自共享总线115的数据处理请求相较于来自存储客户端的数据处理请求拥有更高的优先级,而对于来自相同优先级数据来源(即都来自存储客户端或都来自共享总线)的多个数据处理请求则可以采用先入先出的原则(即先接收到的数据处理请求先进行响应),并且可以对暂存于数据缓冲器1136中的数据处理请求的等待次数进行计数。例如,假设当前数据缓冲器1136中存储有3个数据处理请求,分别为请求1、请求2和请求3,其中请求1、请求2来自存储客户端,请求3来自共享总线115,则裁决器1137会优先将请求3确定为优先响应的数据处理请求,并将请求1和请求2的等待次数加1,在裁决器1137下一次裁决时,如果数据缓冲器1136中依然存在来自共享总线115的数据处理请求时,则裁决器1137还是优先响应来自共享总线115的数据处理请求,并再次将每个来自的存储客户端的数据处理请求的等待次数加1,直至数据缓冲器1136中没有来自共享总线115的数据处理请求时,在当前数据缓冲器1136中选择等待次数最大的数据处理请求作为优先响应的数据处理请求。在一种可选的实施方式中,为了防止某个数据处理请求在数据缓冲器1136等待时间过长,可以为数据处理请求设置一个等待次数的上限阈值,当某一数据处理请求的等待次数达到或超过该上限阈值时,则裁决器1137将该数据处理请求确定为当前优先响应的数据处理请求。
第二数据选择器1138配置成将第四端口1134接收到的读回数据输出至与第一端口1131连接的存储客户端,或者通过第三端口1133输出至共享总线115,还配置成将第二端口1132接收到的读回数据输出至与第一端口1131连接的存储客户端。
其中,第一数据选择器1135可以基于哈希函数来确定该数据处理请求是路由至缓存路 由,还是路由至数据缓冲器1136。示例的,数据处理请求中通常包含要访问的缓存地址,该缓存地址通常可以通过预设比特位的二进制数表征,本实施例中的哈希函数可以对该缓存地址的二进制数进行异或运算,从而得到一个新的二进制数,该新的二进制数即可作为目标缓存地址,从而依据该目标缓存地址将该数据处理请求路由至对应的目标缓存单元。通过上述方式,可以控制使访问本基础核心模块110的数据处理请求被路由至本基础核心模块110的缓存网络112中,而访问另一基础核心模块110的数据处理请求被路由至数据缓冲器1136中以便通过共享总线115到达另一基础核心模块110。
在一种可选的实施方式中,图6是本申请实施例示出的另一种转换开关的结构示意图,如图6所示,转换开关113中还设置有请求输入寄存器(in_request)11391、两个请求数据寄存器(request_data)11392a和11392b、请求输出寄存器(out_request)11393、四个读回数据寄存器(read_ret)11394a、11394b、11394c和11394d、两个总线寄存器11395a和11395b,以及确认消息寄存器(ack)11396,这些寄存器均可以为一级寄存器。
其中,请求输入寄存器11391与第一端口1131和第一数据选择器1135连接,配置成将第一端口1131接收到的存储客户端的读请求或写请求发送至第一数据选择器1135,请求输出寄存器11393与第一数据选择器1135和第二端口1132连接,配置成接收第一数据选择器1135发送的读请求或写请求并通过第二端口1132输出至缓存路由;请求数据寄存11392a与第一端口1131和第一数据选择器1135连接,配置成将第一端口1131接收到的写请求对应的写入数据发送至第一数据选择器1135,请求数据寄存11392b与第二端口1132和第一数据选择器1135连接,配置成接收第一数据选择器1135发送的写入数据并通过第二端口1132输出至缓存路由;确认消息寄存器11396与第二端口1132和第一数据选择器1135连接,配置成接收缓存路由返回的写确认消息,并将写确认消息发送至第一数据选择器1135;总线寄存器11395a与第三端口1133和裁决器1137连接,配置成将第三端口1133接收到的共享总线传来的读请求或写请求(及写入数据)发送至裁决器1137,总线寄存器11395b与第四端口1134和裁决器1137连接,配置成将裁决器1137发送的读请求或写请求(及写入数据)通过第四端口1134发送至共享总线。
读回数据寄存器11394a、11394b、11394c和11394d分别与第一端口1131、第二端口1132、第三端口1133和第四端口1134连接,读回数据寄存器11394a、11394b、11394c和11394d均与第二数据选择器1138连接,其中读回数据寄存器11394b配置成将第二端口1132接收到的来自缓存路由的读回数据发送至第二数据选择器1138,读回数据寄存器11394d配置成将第四端口1134接收到的来自共享总线115的读回数据发送至第二数据选择器1138,读回数据寄存器11394a配置成接收第二数据选择器1138发送的来自缓存路由的读回数据,并通过第一端口1131将读回数据返回至存储客户端,读回数据寄存器11394c配置成接收第二数据选择器1138发送的来自共享总线115的读回数据,并通过第三端口1131将读回数据发送至共享总线115。
基于上述的转换开关结构,在第一数据处理请求为包含写入数据的写请求时,第一基础核心模块110a中的任一转换开关113a可以配置成:
在接收到第一数据处理请求时,将第一数据处理请求存储在数据缓冲器1136,并向发起第一数据处理请求的存储客户端返回针对第一数据处理请求的写确认消息。
在第一数据处理请求满足输出条件时,通过裁决器1137将第一数据处理请求通过第四端口1134输出至共享总线115a,以通过共享总线115a将第一数据处理请求传输至第二基础核心模块110b的缓存核心116b,以使第二基础核心模块110b的缓存核心116b基于第一数据处理请求,通过第二基础核心模块110b的缓存网络112b,将写入数据写入第一目标缓存单元中。
在上述实施方式中,能够在接收到存储客户端的包含写入数据的写请求时,将写请求存储在数据缓冲器,并立即向发起请求的存储客户端返回针对该写请求的写确认消息,而实际的写请求和写入数据在被裁决器输出至共享总线后,由共享总线传输至第二基础核心 模块的缓存核心,第二基础核心模块的缓存核心基于该写请求,通过第二基础核心模块的缓存网络,将写入数据写入需要访问的目标缓存单元中,能够实现对存储客户端的写请求的快速响应。
图7是本申请实施例提供的一种数据处理方法的流程图,该方法可以应用于上述任一实施例所述的数据处理装置100,参见图7,该数据处理方法可以包括:
步骤S101,第一基础核心模块中的任一转换开关在接收到访问第二基础核心模块中的第一目标缓存单元的第一数据处理请求后,通过第一基础核心模块的共享总线将第一数据处理请求传输至第二基础核心模块的核心缓存。
步骤S102,第二基础核心模块的核心缓存基于第一数据处理请求访问第一目标缓存单元。
上述步骤S101至S102的实施方式,与上述图1所示的实施例中示出的实施方式相同,可以参照上述图1所示的实施例中示出的实施方式,不再赘述。
通过上述实施方式,当一基础核心模块中的任一转换开关在接收到访问另一基础核心模块中的目标缓存单元的数据处理请求后,该转换开关能够通过其连接的共享总线将该数据处理请求传输至与共享总线连接的另一基础核心模块的核心缓存,从而实现了将该数据处理请求传输到另一基础核心模块,此时已传输至该核心缓存中的该数据处理请求就能够通过该核心缓存所连接的缓存网络,访问到目标缓存单元,从而通过上述的基于共享总线的架构,实现了一个基础核心模块中的计算引擎对另一基础核心模块中的目标缓存单元的访问,而在一个基础核心模块中由于没有对缓存网络进行扩展,计算引擎的访问路径没有延长,从而对于一个基础核心模块的性能没有影响,也没有增加工艺复杂度。因此,能够在不影响性能、不提高工艺复杂度的情况下实现对计算引擎扩展。
在一种可选的实施方式中,基于图2所示的数据处理装置100,第二基础核心模块的共享总线与第一基础核心模块的核心缓存连接,图8是本申请实施例提供的另一种数据处理方法的流程图,参见图8,该方法还可以包括:
步骤S103,第二基础核心模块中的任一转换开关在接收到访问第一基础核心模块中的第二目标缓存单元的第二数据处理请求后,通过第二基础核心模块的共享总线将第二数据处理请求传输至第一基础核心模块的核心缓存。
步骤S104,第一基础核心模块的核心缓存基于第二数据处理请求访问第二目标缓存单元。
上述步骤S103至S104的实施方式,与上述图2所示的实施例中示出的实施方式相同,可以参照上述图2所示的实施例中示出的实施方式,不再赘述。
由此可见,在上述实施方式中,在第一基础核心模块的共享总线与第二基础核心模块的核心缓存连接基础上,第二基础核心模块的共享总线也与第一基础核心模块的核心缓存连接,从而使两个基础核心模块相互连接,能够在不影响性能、不提高工艺复杂度的情况下,使相互连接的两个基础核心模块中的任一基础核心模块中的计算引擎,对另一基础核心模块中的目标缓存单元的访问。
在一种可选的实施方式中,在第一数据处理请求为读请求时,步骤S102所述的第二基础核心模块的核心缓存基于第一数据处理请求访问第一目标缓存单元,可以包括:
第二基础核心模块的核心缓存在接收到第一数据处理请求,且第二基础核心模块的核心缓存中存储有第一数据处理请求所请求的第一目标数据时,将第一目标数据通过第一基础核心模块的共享总线返回至发送第一数据处理请求的存储客户端。
第二基础核心模块的核心缓存在接收到第一数据处理请求,且第二基础核心模块的核心缓存中不存在第一目标数据时,基于第一数据处理请求,通过第二基础核心模块的缓存网络,从第一目标缓存单元中获取第一目标数据,并将第一目标数据通过第一基础核心模块的共享总线返回至发送第一数据处理请求的存储客户端。
通过上述实施方式,实现了第一基础核心模块中的存储客户端对第二基础核心模块中 的缓存单元的访问。同理,步骤S104所述的第一基础核心模块的核心缓存基于第二数据处理请求访问第二目标缓存单元,可以包括:
第一基础核心模块的核心缓存在接收到第二数据处理请求,且第一基础核心模块的核心缓存中存储有第二数据处理请求所请求的第二目标数据时,将第二目标数据通过第二基础核心模块的共享总线返回至发送第二数据处理请求的存储客户端。
第一基础核心模块的核心缓存在接收到第二数据处理请求,且第一基础核心模块的核心缓存中不存在第二目标数据时,基于第二数据处理请求,通过第一基础核心模块的缓存网络,从第二目标缓存单元中获取第二目标数据,并将第二目标数据通过第二基础核心模块的共享总线返回至发送第二数据处理请求的存储客户端。
上述步骤与图3或图4所示的实施例中示出的实施方式相同,可以参照上述图3或图4所示的实施例中示出的实施方式,不再赘述。
由此可见,上述实施方式中,能够实现第二基础核心模块中的存储客户端对第一基础核心模块中的缓存单元的访问。同理,上述的数据处理装置100中的相互连接的任意两个基础核心模块均可以通过上述的实施方式实现对另一个基础核心模块中的缓存单元的访问。
在一种可选的实施方式中,基于图5或图6所示的转换开关,在第一数据处理请求为包含写入数据的写请求时,步骤S102所述的第二基础核心模块的核心缓存基于第一数据处理请求访问第一目标缓存单元,可以包括:
第一基础核心模块中的任一转换开关在接收到第一数据处理请求时,将第一数据处理请求存储在数据缓冲器,并向发起第一数据处理请求的存储客户端返回针对第一数据处理请求的写确认消息。
在第一数据处理请求满足输出条件时,该任一转换开关中的裁决器将第一数据处理请求通过该任一转换开关的第四端口输出至共享总线,以通过共享总线将第一数据处理请求传输至第二基础核心模块的缓存核心。
第二基础核心模块的缓存核心基于第一数据处理请求,通过第二基础核心模块的缓存网络,将写入数据写入第一目标缓存单元中。
在上述实施方式中,能够在接收到存储客户端的包含写入数据的写请求时,将写请求存储在数据缓冲器,并立即向发起请求的存储客户端返回针对该写请求的写确认消息,而实际的写请求和写入数据在被裁决器输出至共享总线后,由共享总线传输至第二基础核心模块的缓存核心,第二基础核心模块的缓存核心基于该写请求,通过第二基础核心模块的缓存网络,将写入数据写入需要访问的目标缓存单元中,能够实现对存储客户端的写请求的快速响应。
本申请实施例还提供一种处理器,该处理器可以包括上述的任一实施例所提供的数据处理装置100。该处理器可以为GPU或CPU,或者可以是上述的DCU,或者可以是集成了GPU(或DCU)和CPU的处理器(可以理解为GPU或DCU与CPU位于一个芯片上)。
在另一种实施方式中,本申请实施例还提供一种芯片,该芯片可以包括上述的任一实施例所提供的数据处理装置100,该数据处理装置100形成在同一半导体基板上。其中,可以理解的是,在该芯片上,数据处理装置100中包含的所有基础核心模块均形成在同一半导体基板上。
本申请实施例还提供另一种处理器,该处理器可以包括上述的芯片,该处理器可以为GPU或CPU,或者可以是上述的DCU,或者可以是集成了GPU(或DCU)和CPU的处理器(可以理解为GPU或DCU与CPU位于一个芯片上)。
图9是本申请实施例提供的一种电子设备200的框图,参见图9,该电子设备200可以包括:存储器201和处理器202,存储器201和处理器202可以通过总线连接。该存储器201中存储有计算机程序,该计算机程序被处理器202执行时,能够实现上述的数据处理方法。其中处理器202可以为上述的包括数据处理装置100的处理器。存储器201可以是, 但不限于,随机存取存储器,只读存储器,可编程只读存储器,可擦除可编程只读存储器,电可擦除可编程只读存储器等。该电子设备200可以是但不限于,智能手机、个人电脑(Personal Computer,PC)、平板电脑、个人数字助理(Personal Digital Assistant,PDA)、移动上网设备(Mobile Internet Device,MID)等。
本申请实施例还提供一种存储介质,该存储介质中存储有计算机程序,所述计算机程序被处理器执行时,能够实现上述的数据处理方法。
在本申请所提供的实施例中,应该理解到,所揭露的装置和方法,也可以通过其它的方式实现。以上所描述的装置和方法实施例仅仅是示意性的,例如,附图中的流程图和框图显示了根据本申请的多个实施例的方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或代码的一部分,所述模块、程序段或代码的一部分包含一个或多个配置成实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实施方式中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。另外,在本申请各个实施例中的各功能模块可以集成在一起形成一个独立的部分,也可以是各个模块单独存在,也可以两个或两个以上模块集成形成一个独立的部分。
以上所述仅为本申请的优选实施例而已,并不配置成限制本申请,对于本领域的技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。
工业实用性:
本申请提供一种数据处理装置、方法、芯片、处理器、设备及存储介质,能够在不影响性能、不提高工艺复杂度的情况下实现计算引擎的扩展。

Claims (20)

  1. 一种数据处理装置,其特征在于,所述数据处理装置包括:至少两个基础核心模块,每个所述基础核心模块包括:多个计算引擎、缓存网络、多个转换开关、多个缓存单元、共享总线以及核心缓存;
    每个所述基础核心模块中,所述多个缓存单元以及所述核心缓存分别与所述缓存网络连接,所述多个计算引擎通过所述多个转换开关与所述缓存网络连接,所述多个转换开关通过所述共享总线串行连接;
    所述至少两个基础核心模块中的第一基础核心模块的共享总线与第二基础核心模块的核心缓存连接,所述第一基础核心模块中的任一转换开关配置成在接收到访问所述第二基础核心模块中的第一目标缓存单元的第一数据处理请求后,通过所述第一基础核心模块的共享总线将所述第一数据处理请求传输至所述第二基础核心模块的核心缓存,所述第二基础核心模块的核心缓存配置成基于所述第一数据处理请求访问所述第一目标缓存单元。
  2. 根据权利要求1所述的数据处理装置,其特征在于,所述第二基础核心模块的共享总线与所述第一基础核心模块的核心缓存连接,所述第二基础核心模块中的任一转换开关配置成在接收到访问所述第一基础核心模块中的第二目标缓存单元的第二数据处理请求后,通过所述第二基础核心模块的共享总线将所述第二数据处理请求传输至所述第一基础核心模块的核心缓存,所述第一基础核心模块的核心缓存配置成基于所述第二数据处理请求访问所述第二目标缓存单元。
  3. 根据权利要求1或2所述的数据处理装置,其特征在于,在每个所述基础核心模块中,每个所述计算引擎包括多个存储客户端,每个所述存储客户端通过一个所述转换开关与所述缓存网络中的一个缓存路由连接,所述核心缓存与所述缓存网络中的一个缓存路由连接。
  4. 根据权利要求3所述的数据处理装置,其特征在于,在每个所述基础核心模块中,所述多个计算引擎包括的多个存储客户端与所述多个转换开关一一对应,每个所述存储客户端通过对应的所述转换开关与所述缓存网络中的一个缓存路由连接,所述缓存网络包括呈网格状排布的多个缓存路,所述缓存网络中的每个缓存路由与相邻的每个缓存路由连接。
  5. 根据权利要求3所述的数据处理装置,其特征在于,所述第一数据处理请求为读请求,所述第二基础核心模块的核心缓存配置成:
    在接收到所述第一数据处理请求,且所述第二基础核心模块的核心缓存中存储有所述第一数据处理请求所请求的第一目标数据时,将所述第一目标数据通过所述第一基础核心模块的共享总线返回至发送所述第一数据处理请求的存储客户端;
    在接收到所述第一数据处理请求,且所述第二基础核心模块的核心缓存中不存在所述第一目标数据时,基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,从所述第一目标缓存单元中获取所述第一目标数据,并将所述第一目标数据通过所述第一基础核心模块的共享总线返回至发送所述第一数据处理请求的存储客户端。
  6. 根据权利要求1-5中任意一项所述的数据处理装置,其特征在于,每个所述转换开关均包括第一端口、第二端口、第三端口、第四端口、第一数据选择器、数据缓冲器、裁决器和第二数据选择器;
    其中,所述第一端口配置成与对应的存储客户端连接,所述第二端口配置成与一个缓存路由连接,所述第三端口配置成通过共享总线与上一跳转换开关连接,所述第四端口配置成通过共享总线与下一跳转换开关或另一基础核心模块的核心缓存连接,所述第一数据选择器分别与所述第一端口、所述第二端口和所述数据缓冲器连接,所 述裁决器分别与所述数据缓冲器、所述第三端口和所述第四端口连接,所述第二数据选择器分别与所述第一端口、所述第二端口、所述第三端口和所述第四端口连接;
    所述第一数据选择器配置成将所述第一端口接收到的存储客户端的数据处理请求发送至与所述第二端口连接的缓存路由,或者发送至所述数据缓冲器;
    所述裁决器配置成接收所述数据缓冲器和所述第三端口发送的数据处理请求,并在接收到的数据处理请求为多个时,确定多个数据处理请求中优先响应的数据处理请求,并将所述优先响应的数据处理请求通过所述第四端口输出至共享总线;
    所述第二数据选择器配置成将所述第四端口接收到的读回数据输出至与所述第一端口连接的存储客户端,或者通过所述第三端口输出至共享总线,还配置成将所述第二端口接收到的读回数据输出至与所述第一端口连接的存储客户端。
  7. 根据权利要求6所述的数据处理装置,其特征在于,所述第一数据处理请求为包含写入数据的写请求,所述第一基础核心模块中的任一转换开关配置成:
    在接收到所述第一数据处理请求时,将所述第一数据处理请求存储在所述数据缓冲器,并向发起所述第一数据处理请求的存储客户端返回针对所述第一数据处理请求的写确认消息;
    在所述第一数据处理请求满足输出条件时,通过所述裁决器将所述第一数据处理请求通过所述第四端口输出至共享总线,以通过所述共享总线将所述第一数据处理请求传输至所述第二基础核心模块的缓存核心,以使所述第二基础核心模块的缓存核心基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,将所述写入数据写入所述第一目标缓存单元中。
  8. 根据权利要求6所述的数据处理装置,其特征在于,所述裁决器配置成当多个所述数据处理请求分别来自于共享总线和存储客户端时,将来自于所述共享总线的数据处理请求确定为优先响应的数据处理请求。
  9. 根据权利要求6所述的数据处理装置,其特征在于,所述裁决器配置成当多个所述数据处理请求均来自于共享总线或存储客户端时,将最先接收到的数据处理请求确定为优先响应的数据处理请求。
  10. 根据权利要求6所述的数据处理装置,其特征在于,所述裁决器配置成对暂存于数据缓冲器中的数据处理请求的等待次数进行计数,并在所述数据缓冲器中选择等待次数最大的数据处理请求作为优先响应的数据处理请求。
  11. 根据权利要求1-10中任意一项所述的数据处理装置,其特征在于,当存储在所述核心缓存中的数据的存储时长达到预设的时长阈值时,将所述数据删除或将所述数据设置为允许覆盖状态。
  12. 一种数据处理方法,其特征在于,应配置成权利要求1-11任一项所述的数据处理装置,所述方法包括:
    所述第一基础核心模块中的任一转换开关在接收到访问所述第二基础核心模块中的第一目标缓存单元的第一数据处理请求后,通过所述第一基础核心模块的共享总线将所述第一数据处理请求传输至所述第二基础核心模块的核心缓存;
    所述第二基础核心模块的核心缓存基于所述第一数据处理请求访问所述第一目标缓存单元。
  13. 根据权利要求12所述的数据处理方法,其特征在于,所述第二基础核心模块的共享总线与所述第一基础核心模块的核心缓存连接,所述方法还包括:
    所述第二基础核心模块中的任一转换开关在接收到访问所述第一基础核心模块中的第二目标缓存单元的第二数据处理请求后,通过所述第二基础核心模块的共享总线将所述第二数据处理请求传输至所述第一基础核心模块的核心缓存;
    所述第一基础核心模块的核心缓存基于所述第二数据处理请求访问所述第二目标缓存单元。
  14. 根据权利要求12或13所述的数据处理方法,其特征在于,所述第一数据处理请求为读请求,所述第二基础核心模块的核心缓存基于所述第一数据处理请求访问所述第一目标缓存单元,包括:
    所述第二基础核心模块的核心缓存在接收到所述第一数据处理请求,且所述第二基础核心模块的核心缓存中存储有所述第一数据处理请求所请求的第一目标数据时,将所述第一目标数据通过所述第一基础核心模块的共享总线返回至发送所述第一数据处理请求的存储客户端;
    在接收到所述第一数据处理请求,且所述第二基础核心模块的核心缓存中不存在所述第一目标数据时,基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,从所述第一目标缓存单元中获取所述第一目标数据,并将所述第一目标数据通过所述第一基础核心模块的共享总线返回至发送所述第一数据处理请求的存储客户端。
  15. 根据权利要求12所述的数据处理方法,其特征在于,所述第一数据处理请求为包含写入数据的写请求,所述第二基础核心模块的核心缓存基于所述第一数据处理请求访问所述第一目标缓存单元,包括:
    所述第一基础核心模块中的任一转换开关在接收到所述第一数据处理请求时,将所述第一数据处理请求存储在数据缓冲器,并向发起所述第一数据处理请求的存储客户端返回针对所述第一数据处理请求的写确认消息;
    在所述第一数据处理请求满足输出条件时,所述任一转换开关中的裁决器将所述第一数据处理请求通过所述任一转换开关的第四端口输出至共享总线,以通过所述共享总线将所述第一数据处理请求传输至所述第二基础核心模块的缓存核心;
    所述第二基础核心模块的缓存核心基于所述第一数据处理请求,通过所述第二基础核心模块的缓存网络,将所述写入数据写入所述第一目标缓存单元中。
  16. 一种处理器,其特征在于,包括权利要求1-11任一项所述的数据处理装置。
  17. 一种芯片,其特征在于,包括权利要求1-11任一项所述的数据处理装置,所述数据处理装置形成在同一半导体基板上。
  18. 一种处理器,其特征在于,包括权利要求17所述的芯片。
  19. 一种电子设备,其特征在于,包括:存储器和处理器,所述存储器中存储有计算机程序,所述计算机程序被所述处理器执行时,实现权利要求12-15任一项所述的数据处理方法。
  20. 一种存储介质,其特征在于,所述存储介质中存储有计算机程序,所述计算机程序被处理器执行时,实现权利要求12-15任一项所述的数据处理方法。
PCT/CN2020/114010 2019-12-11 2020-09-08 数据处理装置、方法、芯片、处理器、设备及存储介质 WO2021114768A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911272283.4A CN111080510B (zh) 2019-12-11 2019-12-11 数据处理装置、方法、芯片、处理器、设备及存储介质
CN201911272283.4 2019-12-11

Publications (2)

Publication Number Publication Date
WO2021114768A1 true WO2021114768A1 (zh) 2021-06-17
WO2021114768A8 WO2021114768A8 (zh) 2021-07-15

Family

ID=70314023

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114010 WO2021114768A1 (zh) 2019-12-11 2020-09-08 数据处理装置、方法、芯片、处理器、设备及存储介质

Country Status (2)

Country Link
CN (1) CN111080510B (zh)
WO (1) WO2021114768A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111080510B (zh) * 2019-12-11 2021-02-12 海光信息技术股份有限公司 数据处理装置、方法、芯片、处理器、设备及存储介质
CN111881078B (zh) * 2020-07-17 2022-04-19 上海芷锐电子科技有限公司 基于gpgpu芯片的多用户通用计算处理方法和系统
CN112231243B (zh) * 2020-10-29 2023-04-07 海光信息技术股份有限公司 一种数据处理方法、处理器及电子设备
CN114721996B (zh) * 2022-06-09 2022-09-16 南湖实验室 一种分布式原子操作的实现方法与实现装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059875A1 (en) * 2002-09-20 2004-03-25 Vivek Garg Cache sharing for a chip multiprocessor or multiprocessing system
CN101131624A (zh) * 2007-08-17 2008-02-27 杭州华三通信技术有限公司 存储控制系统及其处理节点
CN101794271A (zh) * 2010-03-31 2010-08-04 华为技术有限公司 多核内存一致性的实现方法和装置
CN102801600A (zh) * 2011-05-24 2012-11-28 清华大学 片上网络中缓存一致性的维护方法和片上网络路由
CN103970712A (zh) * 2013-01-16 2014-08-06 马维尔国际贸易有限公司 多个处理器系统中的互连环形网络
CN105808497A (zh) * 2014-12-30 2016-07-27 华为技术有限公司 一种数据处理方法
CN107291629A (zh) * 2016-04-12 2017-10-24 华为技术有限公司 一种用于访问内存的方法和装置
CN111080510A (zh) * 2019-12-11 2020-04-28 海光信息技术有限公司 数据处理装置、方法、芯片、处理器、设备及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2710481B1 (en) * 2011-05-20 2021-02-17 Intel Corporation Decentralized allocation of resources and interconnect structures to support the execution of instruction sequences by a plurality of engines
US9442772B2 (en) * 2011-05-20 2016-09-13 Soft Machines Inc. Global and local interconnect structure comprising routing matrix to support the execution of instruction sequences by a plurality of engines

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059875A1 (en) * 2002-09-20 2004-03-25 Vivek Garg Cache sharing for a chip multiprocessor or multiprocessing system
CN101131624A (zh) * 2007-08-17 2008-02-27 杭州华三通信技术有限公司 存储控制系统及其处理节点
CN101794271A (zh) * 2010-03-31 2010-08-04 华为技术有限公司 多核内存一致性的实现方法和装置
CN102801600A (zh) * 2011-05-24 2012-11-28 清华大学 片上网络中缓存一致性的维护方法和片上网络路由
CN103970712A (zh) * 2013-01-16 2014-08-06 马维尔国际贸易有限公司 多个处理器系统中的互连环形网络
CN105808497A (zh) * 2014-12-30 2016-07-27 华为技术有限公司 一种数据处理方法
CN107291629A (zh) * 2016-04-12 2017-10-24 华为技术有限公司 一种用于访问内存的方法和装置
CN111080510A (zh) * 2019-12-11 2020-04-28 海光信息技术有限公司 数据处理装置、方法、芯片、处理器、设备及存储介质

Also Published As

Publication number Publication date
WO2021114768A8 (zh) 2021-07-15
CN111080510A (zh) 2020-04-28
CN111080510B (zh) 2021-02-12

Similar Documents

Publication Publication Date Title
WO2021114768A1 (zh) 数据处理装置、方法、芯片、处理器、设备及存储介质
US10169080B2 (en) Method for work scheduling in a multi-chip system
US9529532B2 (en) Method and apparatus for memory allocation in a multi-node system
US8190820B2 (en) Optimizing concurrent accesses in a directory-based coherency protocol
US8225027B2 (en) Mapping address bits to improve spread of banks
JP2002304328A (ja) マルチプロセッサシステム用コヒーレンスコントローラ、およびそのようなコントローラを内蔵するモジュールおよびマルチモジュールアーキテクチャマルチプロセッサシステム
US20120185633A1 (en) On-chip router and multi-core system using the same
US10592459B2 (en) Method and system for ordering I/O access in a multi-node environment
US9535873B2 (en) System, computer-implemented method and computer program product for direct communication between hardward accelerators in a computer cluster
JPH0776942B2 (ja) マルチプロセッサ・システムおよびそのデータ伝送装置
JP2010218364A (ja) 情報処理システム、通信制御装置および方法
KR20100135283A (ko) 피어투피어 특수 목적 프로세서 아키텍처 및 방법
US7818509B2 (en) Combined response cancellation for load command
US9965187B2 (en) Near-memory data reorganization engine
TW201543218A (zh) 具有多節點連接的多核網路處理器互連之晶片元件與方法
US10922258B2 (en) Centralized-distributed mixed organization of shared memory for neural network processing
WO2015134098A1 (en) Inter-chip interconnect protocol for a multi-chip system
US9542317B2 (en) System and a method for data processing with management of a cache consistency in a network of processors with cache memories
US20240048475A1 (en) Interconnection device
US10592465B2 (en) Node controller direct socket group memory access
CN111858096B (zh) 一种基于目录的最近距离cache监听读的方法及系统
TW569219B (en) Architecture and method for updating cache data
Gioiosa et al. Exploring data vortex network architectures
US11487695B1 (en) Scalable peer to peer data routing for servers
CN107273318A (zh) 并行处理设备和通信控制方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20900362

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20900362

Country of ref document: EP

Kind code of ref document: A1