WO2023123902A1 - Data transmission processing method in a chip system and related apparatus - Google Patents

Data transmission processing method in a chip system and related apparatus

Info

Publication number
WO2023123902A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
chip
data
chiplet
identifier
Prior art date
Application number
PCT/CN2022/099777
Other languages
English (en)
French (fr)
Inventor
黎立煌
陈宁
王和国
曹庆新
Original Assignee
深圳云天励飞技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳云天励飞技术股份有限公司
Publication of WO2023123902A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution

Definitions

  • The present invention relates to the field of communication technology, and in particular to a data transmission processing method in a chip system and a related apparatus.
  • A chip system may include multiple sub-chips, each capable of processing data independently, and the sub-chips are connected in a given topology so that they can communicate with one another. The sub-chips can also cooperate on a single large-scale computing task in a model-parallel manner to improve the efficiency with which the task is processed. During such cooperative processing the sub-chips must exchange data frequently, and the efficiency of that data transmission affects the processing performance of the entire chip system.
  • The embodiments of the present application disclose a data transmission processing method in a chip system and related apparatuses, which enable efficient data transmission between the sub-chips of the chip system and improve its processing performance.
  • In a first aspect, the present application provides a data transmission processing method in a chip system, the method comprising:
  • the source sub-chip receives configuration parameters and configures a preset stream output table according to the configuration parameters, where the stream output table includes the identifier of a data stream and the identifier of the destination sub-chip of the data stream; the source sub-chip and the destination sub-chip are sub-chips among a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology;
  • the source sub-chip executes a first data processing task to obtain output data; and
  • the source sub-chip generates a plurality of data packets from the output data based on the stream output table and sends them, where each data packet includes the identifier of the data stream and the identifier of the destination sub-chip.
  • In the present application, the source sub-chip can quickly generate and send data packets by querying the configured stream output table. This improves data transmission efficiency, enables efficient data transmission between the sub-chips of the chip system, and improves the processing performance of the chip system.
  • The stream output table further includes one or more of the following: the identifier of the first data processing task, indication information of the number of data packets of the data stream that have been sent, and the start address in the source sub-chip at which the data of the data stream is stored; the data packet further includes the identifier of the first data processing task.
  • In the present application, the stream output table includes a task identifier that distinguishes different tasks, and each generated data packet carries the corresponding task identifier to indicate the task to which its data belongs.
  • From the packet-count indication information in the stream output table and the start address, the storage address of the data to be sent next can be computed quickly, so that the data can be fetched, packed, and sent without delay.
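  • The address computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `SotEntry`, `Packet`, and `next_packet`, and the fixed payload size `PAYLOAD_BYTES`, are assumptions introduced here, since the text does not specify a packet size or a table layout.

```python
from dataclasses import dataclass

PAYLOAD_BYTES = 64  # assumed fixed payload size per packet (not given in the source)


@dataclass
class SotEntry:
    """One row of a hypothetical stream output table (SOT)."""
    stream_id: int
    dest_id: int
    task_id: int
    start_addr: int        # where the stream's data begins in source memory
    sent_packets: int = 0  # number of packets of this stream already sent


@dataclass
class Packet:
    stream_id: int
    dest_id: int
    task_id: int
    payload: bytes


def next_packet(entry: SotEntry, memory: bytes) -> Packet:
    """Build the next packet of the stream and advance the sent counter."""
    # Storage address = start address + packets already sent * payload size.
    addr = entry.start_addr + entry.sent_packets * PAYLOAD_BYTES
    payload = memory[addr:addr + PAYLOAD_BYTES]
    entry.sent_packets += 1
    return Packet(entry.stream_id, entry.dest_id, entry.task_id, payload)


memory = bytes(range(256))
entry = SotEntry(stream_id=7, dest_id=5, task_id=1, start_addr=64)
first = next_packet(entry, memory)
second = next_packet(entry, memory)
```

  • Keeping only a counter and a start address per stream means the source does not need to track a separate read pointer; the next address follows from the two table fields.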
  • Before the source sub-chip generates the plurality of data packets from the output data based on the stream output table and sends them, the method further includes: the source sub-chip receives an unblocking data packet, where the unblocking data packet includes the identifier of the data stream and indicates to the source sub-chip that the destination sub-chip is ready to receive the data stream.
  • In the present application, the destination sub-chip sends the unblocking data packet to the source sub-chip only after it is ready to receive data, which avoids the packet loss that would occur if data arrived before the destination sub-chip was ready.
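  • The unblocking handshake can be illustrated with a small simulation. The class `SourceSubChip` and its methods are hypothetical names chosen for this sketch; the text only states that the source sends a stream after an unblocking data packet carrying that stream's identifier has arrived.

```python
class SourceSubChip:
    """Hypothetical source side: holds a stream back until it is unblocked."""

    def __init__(self):
        self.unblocked = set()  # stream ids the destination has declared ready
        self.sent = []          # (stream_id, payload) pairs actually sent

    def on_unblock(self, stream_id: int) -> None:
        """Record an unblocking data packet for the given stream."""
        self.unblocked.add(stream_id)

    def try_send(self, stream_id: int, payloads: list) -> bool:
        """Send the payloads only if the stream has been unblocked."""
        if stream_id not in self.unblocked:
            return False
        self.sent.extend((stream_id, p) for p in payloads)
        return True


src = SourceSubChip()
blocked = src.try_send(7, [b"a", b"b"])  # destination not ready yet
src.on_unblock(7)                        # unblocking data packet arrives
ok = src.try_send(7, [b"a", b"b"])
```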
  • The source sub-chip sending the plurality of data packets includes: the source sub-chip sends the plurality of data packets based on a port forwarding mapping table, where the port forwarding mapping table includes the mapping between the identifier of the destination sub-chip and a sending port.
  • In the present application, the sending port of a data packet is determined by querying the port forwarding mapping table, so that the data packet can be sent quickly.
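  • A port forwarding mapping table lookup might look like the sketch below. The table contents are invented for illustration, and the port names `d0` to `d3` are borrowed from the four ports described for a sub-chip later in the document; `select_port` is a hypothetical helper.

```python
# Hypothetical port forwarding mapping table:
# destination sub-chip identifier -> sending port of this sub-chip.
port_forwarding_map = {1: "d1", 4: "d2", 5: "d2"}


def select_port(dest_id: int) -> str:
    """Return the sending port configured for a destination sub-chip."""
    try:
        return port_forwarding_map[dest_id]
    except KeyError:
        raise ValueError(f"no sending port configured for destination {dest_id}") from None
```

  • A single dictionary lookup per packet is what makes the port decision fast; how the table is populated is left to the configuration step.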
  • When the data stream has multiple destination sub-chips, the data packet includes the identifiers of the multiple destination sub-chips.
  • In the present application, one data packet can carry the identifiers of multiple destination sub-chips. Compared with the existing approach of sending a separate data packet to each destination, this reduces the number of data packets to be sent and saves transmission bandwidth.
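  • The saving can be made concrete by counting the packets the source has to emit in each case. The sketch below is illustrative only: `MulticastPacket`, `build_multicast`, and `build_unicast` are hypothetical names, and how packets are replicated further along the path is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MulticastPacket:
    stream_id: int
    dest_ids: tuple  # identifiers of all destination sub-chips
    payload: bytes


def build_multicast(stream_id, dest_ids, payloads):
    """One packet per payload, carrying every destination identifier."""
    return [MulticastPacket(stream_id, tuple(dest_ids), p) for p in payloads]


def build_unicast(stream_id, dest_ids, payloads):
    """One packet per payload per destination (the approach being improved on)."""
    return [MulticastPacket(stream_id, (d,), p) for d in dest_ids for p in payloads]


payloads = [b"x", b"y", b"z", b"w"]
dests = [1, 4, 5]
multi = build_multicast(7, dests, payloads)
uni = build_unicast(7, dests, payloads)
```

  • With four payloads and three destinations, the source emits 4 packets instead of 12.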
  • The chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; when the destination sub-chip is a sub-chip of the subsystem, the data packet further includes the identifier of the subsystem.
  • In the present application, the destination of a data packet can be located quickly through the subsystem identifier, which improves the efficiency of data transmission.
  • In a second aspect, the present application provides a data transmission processing method in a chip system, the method comprising:
  • the destination sub-chip receives configuration parameters and configures a preset stream input table according to the configuration parameters, where the stream input table includes the identifier of at least one data stream to be received; the destination sub-chip is one of a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology;
  • when the destination sub-chip receives a data packet, it judges whether the stream input table contains the data stream identifier carried in the data packet; and
  • when the stream input table contains the data stream identifier carried in the data packet, the destination sub-chip stores the data in the data packet.
  • In the present application, the destination sub-chip can quickly receive data packets for processing by querying the configured stream input table. This improves data receiving efficiency, enables efficient data transmission between the sub-chips of the chip system, and improves the processing performance of the chip system.
  • The stream input table further includes one or more of the following: the identifier of the first data processing task, indication information of the number of data packets of the data stream that have been received, and the start address in the destination sub-chip at which the data of the data stream is stored; the data packet further includes the identifier of the first data processing task.
  • In the present application, the stream input table includes a task identifier that distinguishes the data packets of different tasks.
  • From the packet-count indication information in the stream input table and the start address, the storage address for the data in a received data packet can be computed quickly, so that the data can be stored without delay.
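  • The receive path can be sketched in the same spirit. `SitEntry`, `DestinationSubChip`, and the fixed `PAYLOAD_BYTES` are assumptions made for this example; in particular, the text here does not say what happens to a packet whose stream identifier is not in the table, so the sketch simply declines to store it.

```python
from dataclasses import dataclass

PAYLOAD_BYTES = 4  # assumed fixed payload size per packet (not given in the source)


@dataclass
class SitEntry:
    """One row of a hypothetical stream input table (SIT)."""
    stream_id: int
    start_addr: int            # where the stream's data is stored locally
    received_packets: int = 0  # number of packets of this stream received so far


class DestinationSubChip:
    def __init__(self, mem_size: int):
        self.memory = bytearray(mem_size)
        self.sit = {}  # stream_id -> SitEntry

    def configure(self, stream_id: int, start_addr: int) -> None:
        self.sit[stream_id] = SitEntry(stream_id, start_addr)

    def on_packet(self, stream_id: int, payload: bytes) -> bool:
        """Store the payload if the stream is listed in the SIT."""
        entry = self.sit.get(stream_id)
        if entry is None:
            return False  # stream not expected here; not stored in this sketch
        # Storage address = start address + packets received * payload size.
        addr = entry.start_addr + entry.received_packets * PAYLOAD_BYTES
        self.memory[addr:addr + len(payload)] = payload
        entry.received_packets += 1
        return True


dst = DestinationSubChip(32)
dst.configure(stream_id=7, start_addr=8)
a = dst.on_packet(7, b"abcd")
b = dst.on_packet(7, b"efgh")
c = dst.on_packet(9, b"zzzz")
```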
  • Before the destination sub-chip receives the data packet, the method further includes: the destination sub-chip sends an unblocking data packet, where the unblocking data packet includes the identifier of the data stream and the identifier of the source sub-chip and indicates to the source sub-chip that the destination sub-chip is ready to receive the data stream; the source sub-chip is the sub-chip in the chip system that sends the data packet.
  • In the present application, the destination sub-chip sends the unblocking data packet to the source sub-chip only after it is ready to receive data, which avoids the packet loss that would occur if data arrived before the destination sub-chip was ready.
  • The chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; the data packet further includes the subsystem identifier.
  • In the present application, the destination of a data packet can be located quickly through the subsystem identifier, which improves the efficiency of data transmission.
  • In a third aspect, the present application provides a data transmission processing method in a chip system, the method comprising:
  • a controller assigns a first data processing task to a source sub-chip, where the data obtained after the first data processing task is executed is sent to a destination sub-chip in the form of a data stream; the controller, the source sub-chip, and the destination sub-chip are sub-chips among a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology;
  • the controller configures an identifier for the data stream; and
  • the controller sends the identifier of the data stream and the identifier of the destination sub-chip to the source sub-chip, where the two identifiers are to be stored in association in the stream output table of the source sub-chip, and the stream output table is the basis on which the source sub-chip sends data.
  • In the present application, the controller assigns tasks to the source sub-chip and configures its stream output table, so that the source sub-chip can quickly generate and send data packets by querying the configured table. This improves data transmission efficiency, enables efficient data transmission between the sub-chips of the chip system, and improves the processing performance of the chip system.
  • The method further includes:
  • the controller assigns a second data processing task to the destination sub-chip, where the second data processing task is executed on the data obtained after the first data processing task is executed; and
  • the controller sends the identifier of the data stream to the destination sub-chip, where the identifier is to be stored in the stream input table of the destination sub-chip, and the stream input table is the basis on which the destination sub-chip receives data.
  • In the present application, the controller assigns tasks to the destination sub-chip and configures its stream input table, so that the destination sub-chip can quickly receive data packets for processing by querying the configured table. This improves data receiving efficiency, enables efficient data transmission between the sub-chips of the chip system, and improves the processing performance of the chip system.
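  • The controller's role in linking a producer task to a consumer task can be sketched as follows. The `Controller` class, its `link_tasks` method, and the sequential allocation of stream identifiers are assumptions for illustration; the dictionaries stand in for the configuration the controller would send to each sub-chip.

```python
import itertools


class Controller:
    """Hypothetical controller that allocates stream ids and prepares table entries."""

    def __init__(self):
        self._stream_ids = itertools.count(1)  # assumed: ids handed out sequentially
        self.sot_config = {}  # source sub-chip id -> list of (stream_id, dest_id)
        self.sit_config = {}  # destination sub-chip id -> set of stream ids

    def link_tasks(self, source_id: int, dest_id: int) -> int:
        """Create a stream from source to destination and record both table entries."""
        stream_id = next(self._stream_ids)
        # Entry destined for the source's stream output table.
        self.sot_config.setdefault(source_id, []).append((stream_id, dest_id))
        # Entry destined for the destination's stream input table.
        self.sit_config.setdefault(dest_id, set()).add(stream_id)
        return stream_id


ctrl = Controller()
s1 = ctrl.link_tasks(source_id=0, dest_id=5)
s2 = ctrl.link_tasks(source_id=0, dest_id=6)
```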
  • The method further includes:
  • the controller obtains the data transmission status between the sub-chips in the chip system;
  • the controller generates scheduling information for the source sub-chip based on the data transmission status, where the scheduling information indicates the sending port in the source sub-chip through which the data stream is sent to the destination sub-chip; and
  • the controller sends the scheduling information to the source sub-chip.
  • In the present application, the controller schedules the sending port of data packets based on the data transmission status of the entire chip system, which keeps data packets off congested paths and improves the efficiency of data packet transmission.
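  • One simple way to turn a transmission status into scheduling information is to pick the least-loaded candidate port. The text does not state the policy the controller uses, so the "lowest load wins" rule, the `schedule_port` name, and the load figures below are all assumptions for illustration.

```python
def schedule_port(candidate_ports, load_by_port):
    """Pick the candidate port with the lowest observed load.

    Ports missing from load_by_port are treated as idle (load 0);
    ties go to the port listed first.
    """
    return min(candidate_ports, key=lambda port: load_by_port.get(port, 0))


# Hypothetical status: pending packets observed on the path behind each port.
status = {"d1": 10, "d2": 3}
chosen = schedule_port(["d1", "d2"], status)
tie = schedule_port(["d1", "d2"], {"d1": 4, "d2": 4})
```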
  • In a fourth aspect, the present application provides a source sub-chip, the source sub-chip including:
  • a receiving unit, configured to receive configuration parameters;
  • a configuration unit, configured to configure a preset stream output table according to the configuration parameters, where the stream output table includes the identifier of a data stream and the identifier of the destination sub-chip of the data stream; the source sub-chip and the destination sub-chip are sub-chips among a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology;
  • an execution unit, configured to execute a first data processing task to obtain output data;
  • a generating unit, configured to generate a plurality of data packets from the output data based on the stream output table, where each data packet includes the identifier of the data stream and the identifier of the destination sub-chip; and
  • a sending unit, configured to send the plurality of data packets.
  • The stream output table further includes one or more of the following: the identifier of the first data processing task, indication information of the number of data packets of the data stream that have been sent, and the start address in the source sub-chip at which the data of the data stream is stored; the data packet further includes the identifier of the first data processing task.
  • The receiving unit is further configured to receive an unblocking data packet before the generating unit generates the plurality of data packets from the output data based on the stream output table, where the unblocking data packet includes the identifier of the data stream and indicates to the source sub-chip that the destination sub-chip is ready to receive the data stream.
  • The sending unit is specifically configured to send the plurality of data packets based on a port forwarding mapping table, where the port forwarding mapping table includes the mapping between the identifier of the destination sub-chip and a sending port.
  • When the data stream has multiple destination sub-chips, the data packet includes the identifiers of the multiple destination sub-chips.
  • The chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; the data packet further includes the identifier of the subsystem.
  • In a fifth aspect, the present application provides a destination sub-chip, the destination sub-chip including:
  • a receiving unit, configured to receive configuration parameters;
  • a configuration unit, configured to configure a preset stream input table according to the configuration parameters, where the stream input table includes the identifier of at least one data stream to be received; the destination sub-chip is one of a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology;
  • a judging unit, configured to judge, when a data packet is received, whether the stream input table contains the data stream identifier carried in the data packet; and
  • a storage unit, configured to store the data in the data packet when the stream input table contains the data stream identifier carried in the data packet.
  • The stream input table further includes one or more of the following: the identifier of the first data processing task, indication information of the number of data packets of the data stream that have been received, and the start address in the destination sub-chip at which the data of the data stream is stored; the data packet further includes the identifier of the first data processing task.
  • The destination sub-chip further includes a sending unit, configured to send an unblocking data packet before the receiving unit receives the data packet, where the unblocking data packet includes the identifier of the data stream and the identifier of the source sub-chip and indicates to the source sub-chip that the destination sub-chip is ready to receive the data stream; the source sub-chip is the sub-chip in the chip system that sends the data packet.
  • The chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; the data packet further includes the identifier of the subsystem.
  • In a sixth aspect, the present application provides a controller, the controller including:
  • an allocation unit, configured to assign a first data processing task to a source sub-chip, where the data obtained after the first data processing task is executed is sent to a destination sub-chip in the form of a data stream; the controller, the source sub-chip, and the destination sub-chip are sub-chips among a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology;
  • a configuration unit, configured to configure an identifier for the data stream; and
  • a sending unit, configured to send the identifier of the data stream and the identifier of the destination sub-chip to the source sub-chip, where the two identifiers are to be stored in association in the stream output table of the source sub-chip, and the stream output table is the basis on which the source sub-chip sends data.
  • The allocation unit is further configured to assign a second data processing task to the destination sub-chip, where the second data processing task is executed on the data obtained after the first data processing task is executed;
  • the sending unit is further configured to send the identifier of the data stream to the destination sub-chip, where the identifier is to be stored in the stream input table of the destination sub-chip, and the stream input table is the basis on which the destination sub-chip receives data.
  • The controller further includes:
  • an acquisition unit, configured to obtain the data transmission status between the sub-chips in the chip system;
  • a generating unit, configured to generate scheduling information for the source sub-chip based on the data transmission status, where the scheduling information indicates the sending port in the source sub-chip through which the data stream is sent to the destination sub-chip;
  • the sending unit is further configured to send the scheduling information to the source sub-chip.
  • In a seventh aspect, the present application provides a sub-chip that includes a processor, a memory, and a communication port, where the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to call the computer program so that the sub-chip executes the method according to any implementation of the first aspect;
  • the sub-chip is one of a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology.
  • In an eighth aspect, the present application provides a sub-chip that includes a processor, a memory, and a communication port, where the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to call the computer program so that the sub-chip executes the method according to any implementation of the second aspect;
  • the sub-chip is one of a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology.
  • In a ninth aspect, the present application provides a sub-chip that includes a processor, a memory, and a communication port, where the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to call the computer program so that the sub-chip executes the method according to any implementation of the third aspect;
  • the sub-chip is one of a plurality of sub-chips included in the chip system, and the plurality of sub-chips are connected in a preset topology.
  • In a tenth aspect, the present application provides a chip system that includes a source sub-chip, a destination sub-chip, and a controller, where the source sub-chip is the source sub-chip according to any implementation of the fourth aspect, the destination sub-chip is the destination sub-chip according to any implementation of the fifth aspect, and the controller is the controller according to any implementation of the sixth aspect; or,
  • the source sub-chip is the sub-chip described in the seventh aspect,
  • the destination sub-chip is the sub-chip described in the eighth aspect, and
  • the controller is the sub-chip described in the ninth aspect.
  • the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the foregoing computer program is executed by a processor, the method described in any one of the first aspect is implemented; or,
  • the present application provides a computer program product, including a computer program.
  • The fourth to twelfth aspects above all correspond to implementations of the methods provided in the first to third aspects. For the beneficial effects they achieve, refer to the beneficial effects of the corresponding methods; they are not repeated here.
  • FIG. 1 is a schematic diagram of the chip system provided by the present application.
  • FIG. 2 is a schematic structural diagram of a sub-chip provided by the present application.
  • FIGS. 3 to 6 are schematic diagrams of the chip system provided by the present application.
  • FIG. 7 is a schematic diagram of sub-chipset division provided by the present application.
  • FIG. 8 is a schematic flowchart of a data transmission processing method in the chip system provided by the present application.
  • FIGS. 9A and 9B are schematic diagrams of the data processing flow provided by the present application.
  • FIG. 10 is a schematic diagram of the data packet structure provided by the present application.
  • FIG. 11 is a schematic diagram of the ports of a sub-chip in the chip system provided by the present application.
  • FIG. 12 is a schematic structural diagram of a data packet provided by the present application.
  • FIG. 13 is a schematic flowchart of the routing implementation solution provided by the present application.
  • FIG. 14 is a schematic diagram of the direction coordinate system provided by the present application.
  • FIG. 15 is a schematic diagram of the direction coordinate system constructed on the basis of a sub-chip provided by the present application.
  • FIG. 16 is a schematic diagram of the ports of a sub-chip in the chip system provided by the present application.
  • FIGS. 17 to 19 are schematic structural diagrams of virtual devices provided by the present application.
  • FIG. 20 is a schematic structural diagram of a physical device provided by the present application.
  • FIG. 1 is a schematic structural diagram of a chip system provided by an embodiment of the present application.
  • The chip system 110 includes a plurality of sub-chips (16 sub-chips are exemplarily shown in FIG. 1), and the sub-chips are connected according to a preset topology.
  • For example, the 16 sub-chips in FIG. 1 can be arranged in a matrix, in which case each sub-chip is connected to two, three, or four neighboring sub-chips.
  • Each sub-chip has its own memory, and FIG. 1 schematically shows the memory of some sub-chips.
  • The memory can be, for example, synchronous dynamic random access memory (SDRAM) or double data rate synchronous dynamic random access memory (DDR SDRAM), which may be abbreviated as DDR.
  • Each sub-chip in the chip system 110 has complete processing capability and can perform tasks independently. Multiple sub-chips in the chip system 110 can also cooperate with each other to execute large-scale processing tasks.
  • FIG. 2 exemplarily shows a schematic structural diagram of a sub-chip in the above chip system 110.
  • The sub-chip can be structured as a network-on-chip (NoC). As shown, the sub-chip may include a processing module, a routing module, a static memory, a memory controller, and four ports (d0, d1, d2, and d3).
  • The processing module is the control unit (CU) of the sub-chip and is responsible for managing each processing flow in the sub-chip.
  • the above-mentioned routing module is responsible for data synchronization inside the sub-chip, data synchronization between sub-chips, data broadcasting and data transmission.
  • the routing module also includes a control unit, which is used to manage the routing process in the routing module.
  • the routing module also includes a local buffer, which can be used to temporarily store data to be processed.
  • The routing module also includes a port forwarding mapping module (forwarding-port mapper, FPM), which may be a hardware module or a software module. The FPM stores a port forwarding mapping table that contains the mapping between destination sub-chips and sending ports and is used to map a data packet to the corresponding port for sending.
  • The routing module stores a stream input table (SIT) and a stream output table (SOT), which are used for data transmission between sub-chips and are described in detail later.
  • The static memory may be, for example, a static random-access memory (SRAM), and is used for storing data in the sub-chip.
  • the above-mentioned memory controller is connected to the memory corresponding to the sub-chip, and the memory controller may be, for example, a DDR controller or the like.
  • the above four ports are network interfaces of the sub-chips, which can realize data transmission between the above-mentioned sub-chips.
  • the connection between the above sub-chips and sub-chips is realized through the four ports.
  • the sub-chip may also include two such ports, for example, sub-chip 0 , sub-chip 3 , sub-chip 12 and sub-chip 15 in FIG. 1 .
  • the sub-chip may also include three such ports, such as the sub-chip 4 in FIG. 1 .
  • FIG. 3 exemplarily shows a chip system 120, which also includes a plurality of sub-chips (8 sub-chips are exemplarily shown in FIG. 3). The sub-chips of the chip system 120 are arranged and connected in the form of a cuboid, a connection method that can shorten the data transmission path between sub-chips as much as possible.
  • For the sub-chips in the chip system 120, reference may be made to the corresponding description of FIG. 2 above, and details are not repeated here.
  • The above-mentioned chip system 110 or chip system 120 can be used as a chip subsystem, and a plurality of such subsystems can form a larger chip system, for example, the chip system 130 shown in FIG. 4.
  • The chip system 130 may include multiple of the above-mentioned chip subsystems.
  • FIG. 4 shows an example including 8 subsystems.
  • Each subsystem of the multiple subsystems may be the above-mentioned system-on-a-chip 110 or system-on-a-chip 120 .
  • Each subsystem can be regarded as a whole, then the multiple subsystems can be connected through a preset topological connection relationship, for example, the connection can be arranged in the form of a cuboid, as shown in FIG. 4 .
  • the above-mentioned chip system 110 is used as an example to illustrate the connection structure diagram of each subsystem in the chip system 130 , as shown in FIG. 5 .
  • the chip system 130 includes 8 subsystems, each of the 8 subsystems includes 16 sub-chips, and the 16 sub-chips can be arranged and connected in a matrix. Between two adjacent subsystems, the connection between the two subsystems can be realized by connecting any sub-chip in one subsystem to any sub-chip in the other subsystem.
  • the chiplets arranged at the corners of the matrix in each subsystem are exemplarily used as the chiplets connected to another subsystem. For example, in subsystem 0, the connection between subsystem 0 and subsystem 1 is established through chiplet 3 in subsystem 0 and chiplet 0 in subsystem 1.
  • connection between the subsystem 0 and the subsystem 2 is established through the chiplet 12 in the subsystem 0 and the chiplet 0 in the subsystem 2 .
  • connection between subsystem 0 and subsystem 4 is established through chiplet 0 in subsystem 0 and chiplet 0 in subsystem 4 .
  • the foregoing subsystems may further include a control bus, for example, refer to the chip system 140 shown in FIG. 6 .
  • the system-on-a-chip 140 includes a central controller (the central controller may be a sub-chip or a control module in the system-on-a-chip 140 ), and the central controller may manage the task processing flow in the system-on-a-chip 140 .
  • the control bus is connected with the central controller for each subsystem to receive control instructions from the central controller.
  • each subsystem may be connected to the control line by a sub-chip, and after receiving the control command, the sub-chip may forward it to the corresponding sub-chip in the same subsystem.
  • each chiplet in each subsystem is connected to the control line for directly receiving control commands. This application does not limit the specific connection of the controller.
  • the system-on-a-chip provided in the embodiment of the present application also includes a central controller for managing the task processing flow of the entire system-on-a-chip.
  • the central controller may be a sub-chip or a control module in the chip system.
  • the central controller can obtain the load conditions and resource usage conditions of all sub-chips in the chip system, so as to assign tasks to each sub-chip based on these conditions.
  • a task scheduler (scheduler) in the central controller can be used to assign tasks to these sub-chips based on information such as load conditions and resource usage conditions of each sub-chip.
  • each sub-chip is capable of processing data independently. But in the case of large data tasks, the processing efficiency of a single sub-chip is low.
  • the multiple sub-chips included in the above chip system may be divided into multiple sub-chip groups, and each sub-chip group includes at least one sub-chip. In this way, data tasks can be processed with the sub-chipset as a processing unit, thereby improving processing efficiency.
  • FIG. 7 takes the above chip system 110 as an example, and divides 16 sub-chips in the chip system into 9 sub-chip groups. Refer to the division situation shown in FIG. 7 for details, and each sub-chip group includes at least one sub-chip.
  • multiple subchips of the same subchipset may be adjacent subchips, such as subchipset 3 , subchipset 4 , subchipset 7 , and subchipset 8 .
  • the sub-chips of the same sub-chipset may be non-adjacent sub-chips, for example, the sub-chipset 2 is composed of non-adjacent sub-chips 1 and 12 .
  • The chip system provided in the embodiment of the present application may implement data task processing by data parallelism, model parallelism, or model parallelism plus data parallelism, where:
  • Data parallelism refers to dividing the data to be processed into several data blocks and assigning the data blocks to different sub-chipsets, each of which runs the same processing program on its allocated data. For example, assume that the data to be processed is divided into 3 data blocks and that 3 sub-chipsets can run the same processing program. The first data block is then sent to the first sub-chipset for processing, the second data block to the second sub-chipset, and the third data block to the third sub-chipset.
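  • The data-parallel split described above can be sketched as follows. This is a minimal illustration only: the helper names `split_into_blocks` and `process` are hypothetical, and the doubling operation merely stands in for whatever identical program each sub-chipset would run.

```python
def split_into_blocks(data, num_blocks):
    """Split data into num_blocks contiguous blocks of near-equal size."""
    base, extra = divmod(len(data), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The first `extra` blocks take one additional element each.
        size = base + (1 if i < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks


def process(block):
    # Placeholder for the identical processing program of every sub-chipset.
    return [x * 2 for x in block]


data = list(range(9))
blocks = split_into_blocks(data, 3)
# Block i is assigned to sub-chipset i; all sub-chipsets run the same program.
results = {group_id: process(block) for group_id, block in enumerate(blocks)}
```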
  • Model parallelism means that multiple sub-chipsets jointly complete a data processing task, with each sub-chipset executing only part of the steps of the whole task (that part may be one or more processing steps). For example, assuming that a data processing task requires three steps, two sub-chipsets can be configured to jointly complete the task: the first sub-chipset completes the first two steps, and the second sub-chipset obtains the processed data from the first sub-chipset and completes the third step. Alternatively, three sub-chipsets can be configured to jointly complete the task: the first sub-chipset completes the first step, the second sub-chipset obtains the processed data from the first sub-chipset and completes the second step, and the third sub-chipset obtains the processed data from the second sub-chipset and completes the third step. That is, each sub-chipset may complete one or more steps, which may be determined according to the load and resource usage of the sub-chipset.
  • The method of model parallelism plus data parallelism combines the above two methods to process data. For example, if a data processing task needs three steps to complete, three sub-chipsets can be configured to jointly complete the task: the first sub-chipset completes the first step, the second sub-chipset obtains the processed data from the first sub-chipset and completes the second step, and the third sub-chipset obtains the processed data from the second sub-chipset and completes the third step. However, since the processing of the first step is relatively complicated, it takes more time to complete.
  • one or more sub-chipsets can be configured to jointly execute the processing task of the first step.
  • a fourth sub-chipset may be further configured to perform the processing task of the first step together with the aforementioned first sub-chipset.
  • the data for processing in the first step may be divided into two parts, one part is sent to the first sub-chipset for processing, and the other part is sent to the fourth sub-chipset for processing. Then, the processed data of the first sub-chipset and the fourth sub-chipset are sent to the second sub-chipset for processing in the second step.
  • Within model parallelism, every processing step may use data parallelism, or only some processing steps may use it; this is determined by the specific implementation and is not limited by this application.
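  • The combination of model parallelism and data parallelism described above can be sketched as a three-step pipeline whose first step is shared by two sub-chipsets. The functions `step1`, `step2`, `step3` and `run_pipeline` are hypothetical stand-ins, and the arithmetic inside them is arbitrary; only the division of work mirrors the text.

```python
def step1(xs):
    # Stand-in for the costly first step.
    return [x + 1 for x in xs]


def step2(xs):
    return [x * 10 for x in xs]


def step3(xs):
    return sum(xs)


def run_pipeline(data):
    half = len(data) // 2
    # Data parallelism inside step 1: the first and the fourth sub-chipset
    # each process one part of the input with the same program.
    out_first = step1(data[:half])
    out_fourth = step1(data[half:])
    # Model parallelism: the second sub-chipset runs step 2 on both outputs,
    # and the third sub-chipset runs step 3 on the result.
    out_second = step2(out_first + out_fourth)
    return step3(out_second)


result = run_pipeline([1, 2, 3, 4])
```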
  • the central controller of the chip system may distribute data processing tasks to each sub-chipset.
  • Using model parallelism or model parallelism plus data parallelism to realize data task processing requires data transmission between sub-chips. Data transmission will generate time delay and reduce processing efficiency.
  • an embodiment of the present application provides a data transmission processing method in the chip system.
  • the data transmission processing method provided by the embodiment of the present application includes but is not limited to the following steps:
  • The source sub-chip receives configuration parameters and configures a preset stream output table according to them; the stream output table includes the identifier of the data stream and the identifier of the destination sub-chip of the data stream. The source sub-chip executes the first data processing task to obtain output data.
  • When data tasks are processed using model parallelism or model parallelism plus data parallelism, the above-mentioned source sub-chip can be any sub-chip in a first sub-chipset, where the first sub-chipset is a sub-chipset in the chip system that performs the processing task of the first step of a target task.
  • the first step may include one or more processing steps of the target task.
  • the chip system may be the chip system 110 , the chip system 120 , the chip system 130 , or the chip system 140 described above.
  • the above-mentioned stream output table is initialized and stored in the source sub-chip in advance, and is used for the source sub-chip to transmit corresponding data.
  • the stream output table is the basis for the source chip to send data.
  • the flow output table includes the identifier of the data flow and the identifier of the destination sub-chip to which the data flow goes.
  • The stream output table may also include one or more of the following: the identifier of the task to which the data included in the data stream belongs, indication information of the number of data packets sent in the data stream, and the starting address in the source sub-chip at which the data included in the data stream is stored.
  • one sub-chip sends data to another sub-chip, and the data is packaged into multiple data packets, and these data packets are numbered and sent in sequence, and these consecutively sent data packets form a data stream.
  • For example, each data packet can carry 1kb of data. If the total size of the transmitted data is 64kb, the data can be split and encapsulated into 64 data packets for transmission, and these 64 data packets form a data stream.
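  • The packet arithmetic above can be checked with a short sketch. It assumes that "1kb" means 1024 bytes and that packets are numbered from 0; both are assumptions made for illustration, and `packetize` is a hypothetical helper name.

```python
PACKET_PAYLOAD_BYTES = 1024  # assumption: "1kb" is read as 1024 bytes


def packetize(data: bytes, payload_size: int = PACKET_PAYLOAD_BYTES):
    """Split data into (sequence_number, payload) pairs sent in order."""
    return [
        (seq, data[offset:offset + payload_size])
        for seq, offset in enumerate(range(0, len(data), payload_size))
    ]


# 64 units of output data yield 64 packets, which together form one stream.
packets = packetize(bytes(64 * 1024))
```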
  • the stream output table may be initialized by the central controller of the chip system. Specifically, based on the foregoing description, it can be seen that the central controller allocates each data processing task to each sub-chipset, then the central controller can configure a corresponding task identifier for each data processing task to distinguish different tasks.
  • the central controller also configures a sub-chipset to execute each processing step of the data processing task. Therefore, the central controller can know the flow direction of the data flow corresponding to the data processing task, and configure a data flow identifier for the corresponding data flow to distinguish different data flows. For ease of understanding, an example is given below.
  • For a data processing task that includes eight steps, the central controller can assign the processing steps based on the data size of the task, the processing complexity of the eight steps, and the load and resource usage of each sub-chipset in the chip system. For example, as shown in FIG. 9A, the central controller configures step 1 of the data processing task to be performed by sub-chipset 1, step 2 and step 3 by sub-chipset 2, step 4 by sub-chipset 3, step 5, step 6 and step 7 by sub-chipset 4, and step 8 by sub-chipset 5.
  • Here, sub-chipset 1 includes sub-chip 4, sub-chipset 2 includes sub-chip 5, sub-chipset 3 includes sub-chip 6, sub-chipset 4 includes sub-chip 8, and sub-chipset 5 includes sub-chip 0 and sub-chip 9.
  • Sub-chipset 1 can send the data processed in step 1 to sub-chipset 2, sub-chipset 4 and sub-chipset 5 respectively. When the same data is sent to each of them, the identifiers of the data streams from sub-chipset 1 to these sub-chipsets are the same, for example, all are 17.
  • the data obtained by sub-chipset 2 after completing the processing of step 2 and step 3 is sent to sub-chipset 3 for processing in step 4, and the identifier of the data flow from sub-chipset 2 to sub-chipset 3 may be 18 .
  • the data obtained by the sub-chipset 3 after completing the processing of step 4 is sent to the sub-chipset 5 to perform the processing of step 8, and the identifier of the data flow from the sub-chipset 3 to the sub-chipset 5 may be 19 .
  • The data obtained by sub-chipset 4 after completing the processing of step 5, step 6 and step 7 is sent to sub-chipset 5 for the processing of step 8, and the identifier of the data stream from sub-chipset 4 to sub-chipset 5 may be 20.
  • Data transmission between sub-chipsets is in fact data transmission between the sub-chips in those sub-chipsets. For example, when sub-chipset 1 sends the data processed in step 1 to sub-chipset 5, sub-chip 4 actually sends that data to sub-chip 0 and sub-chip 9. The same applies to the other sub-chipsets and is not repeated here.
  • In another case, step 2 needs data 1 of the data produced by step 1, step 5 needs data 2 of that data, and step 8 needs data 3 of that data. Sub-chipset 1 can then send data 1 to sub-chipset 2, data 2 to sub-chipset 4, and data 3 to sub-chipset 5. Since the data sent to sub-chipset 2, sub-chipset 4 and sub-chipset 5 differ, the identifiers of the corresponding data streams also differ. For example, the identifier of the data stream from sub-chipset 1 to sub-chipset 2 can be 15, that of the data stream from sub-chipset 1 to sub-chipset 4 can be 16, and that of the data stream from sub-chipset 1 to sub-chipset 5 can be 17.
  • For the identifiers of data streams between the other sub-chipsets, reference may be made to the foregoing description of FIG. 9A, which is not repeated here.
  • Data transmission between processing steps inside a sub-chip of a sub-chipset does not require a data stream identifier.
  • For example, the data obtained after step 2 is sent to the corresponding module in sub-chip 5 for processing, and this internal data transmission of sub-chip 5 does not require a data stream identifier.
  • The same holds for data transmission inside sub-chip 8 of sub-chipset 4, which also does not require a data stream identifier.
  • the central controller can know the flow direction of the corresponding data flow, that is, can know the identification of the destination chiplet of the data flow.
  • The central controller can store in association the identifier of the above data processing task, the identifier of the data stream corresponding to the task, the identifier of the source sub-chip of the data stream, the identifier of the destination sub-chip of the data stream, the processing steps of the source sub-chip, and other such information.
  • The associated information can be stored in the form of a table, which can be called a data flow table (stream table, ST). For ease of understanding, see Table 1 for an example.
  • the above-mentioned Table 1 exemplarily shows the information associated with step 1 of the data processing task, and the information associated with other steps is the same, and will not be repeated here.
  • The data flow table includes the task identifier (task identity document, TID), the source sub-chip identifier (source mask, S_mask), the steps executed by the source sub-chip, the destination sub-chip identifier (destination mask, D_mask), the steps executed by the destination sub-chip, and the data stream identifier (stream identity document, SID).
  • the above identifier of the task refers to the identifier of the corresponding data processing task.
  • the identifier of the task may be, for example, 1 or other identifiers, which is not limited in the present application.
  • the identification of the source sub-chip and the steps performed by the source sub-chip indicate the identification of the sub-chip that performs step 1.
  • the identification of the target chiplet and the steps executed by the target chiplet indicate the destination of the processed data in step 1, and the corresponding processing steps executed by the destination.
  • the identifier of the data stream indicates the identifier of the data stream formed by the source chiplet sending the data processed in step 1 to each corresponding destination chiplet. Refer to Table 1 for the specific identification, but the identification shown in Table 1 is only an example, and other identification symbols can be used instead. In the program executed by the computer, these identifications can be expressed in binary or hexadecimal, etc. The application does not limit the expression of various signs.
  • source sub-chips and destination sub-chips described in the embodiments of the present application are for data streams, and the source sub-chips and destination sub-chips corresponding to different data streams may be different.
  • the content included in the data flow table is not limited to the items shown in Table 1 above.
  • the content included in the data flow table may be part or all of the items shown in Table 1 above.
  • Other content can also be included, such as other steps performed by the destination sub-chip (for example, the above-mentioned sub-chip 8 processes step 6 and step 7 in addition to step 5, so the data flow table can include information on step 6 and step 7).
  • the data flow table is stored in the central controller, then the central controller can initialize the flow output table in the source sub-chip based on the data flow table. Specifically, the central controller can find the related information corresponding to the above-mentioned source sub-chip in the data flow table, that is, find information such as the corresponding task ID, the ID of the destination sub-chip, and the ID of the data flow. Then, the associated information is sent to the source chiplet. After receiving the associated information, the source subchip fills the information into the stream output table.
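  • The lookup described above can be sketched as a filter over the data flow table. The rows below use the example values given in the text for step 1 (task 1; source sub-chip 4; data streams 15, 16 and 17 going to sub-chip 5, sub-chip 8, and sub-chips 0 and 9). Storing identifiers as plain integers and tuples, and the names `StreamTableRow` and `config_for_source`, are assumptions made for illustration rather than the format used by the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamTableRow:
    tid: int          # task identifier (TID)
    s_mask: int       # source sub-chip identifier (S_mask), plain integer here
    src_step: int     # step executed by the source sub-chip
    d_mask: tuple     # destination sub-chip identifier(s) (D_mask)
    dst_step: int     # step executed by the destination sub-chip
    sid: int          # data stream identifier (SID)


STREAM_TABLE = [
    StreamTableRow(1, 4, 1, (5,), 2, 15),
    StreamTableRow(1, 4, 1, (8,), 5, 16),
    StreamTableRow(1, 4, 1, (0, 9), 8, 17),
]


def config_for_source(table, source_id):
    """Collect the configuration parameters the controller sends to a source."""
    return [
        {"TID": row.tid, "SID": row.sid, "D_mask": row.d_mask}
        for row in table
        if row.s_mask == source_id
    ]


# The source sub-chip would copy these rows into its stream output table.
sot_config = config_for_source(STREAM_TABLE, 4)
```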
  • One or more items in the associated information corresponding to the source chip in the data flow table are the above configuration parameters.
  • The stream output table in the source sub-chip may also include indication information of the number of data packets that have been sent for the data stream and the starting address in the source sub-chip at which the data included in the data stream is stored. These two items may be filled into the stream output table by the source sub-chip based on its own information.
  • the source chip receives a data processing task assigned by the central controller (the data processing task may be the first data processing task in S801 above), and then completes the preset processing steps to obtain the processed data (the data is the output data obtained by executing the first data processing task in S801 above).
  • the processed data is stored in the memory buffer of the source sub-chip.
  • the source chiplet can know the size of the processed data and the starting address of storage.
  • the processed data is the data to be sent, and the size of the data packets to be sent is preset, so the number of data packets to be sent can be known by knowing the size of the processed data. Therefore, the source chip can fill in the number of data packets to be sent and the storage start address of the data to be sent into the above-mentioned stream output table. So far, the initialization of the stream output table in the source chip is completed.
  • the first data processing task received by the source sub-chip may be specific task data and/or task execution instructions, etc., and the source sub-chip may cache the task in a preset storage space after receiving the first data processing task, for subsequent implementation.
  • the above Table 2 exemplarily shows the information corresponding to task 1 in the stream output table. It can be seen that the TID, SID and D_mask corresponding to task 1 in Table 2 are obtained by the central controller from the above Table 1 and sent to the source chip, so they are the same as the above Table 1.
  • the start address (start address, S_addr) in Table 2 refers to the start address used to store the data included in the data stream to be sent in the source chiplet.
  • the data packet count (count packet, C_packet) in Table 2 refers to the indication information of the number of data packets that have been sent in the data flow. The data packet count may be a countdown. For example, at initialization, the data packet count is the number of all data packets included in the data flow. Then, each time the source chip sends a data packet, the data packet count is decremented by one.
  • the flow output table may include information corresponding to multiple tasks, which are distinguished by task identifiers.
  • the identifiers of data flows of different tasks may be the same or different, depending on the data flow table of the central controller.
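  • The countdown variant of the stream output table entry described above can be sketched as follows. The field names follow Table 2 as described in the text (TID, SID, D_mask, S_addr, C_packet); the class `StreamOutEntry`, the helpers, the 1024-byte packet size and the address value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamOutEntry:
    tid: int
    sid: int
    d_mask: tuple
    s_addr: int = 0    # start address of the data to be sent
    c_packet: int = 0  # countdown of packets still to be sent


def finish_init(entry, data_size, s_addr, packet_size=1024):
    """Fill in the two items the source sub-chip derives from its own data."""
    entry.s_addr = s_addr
    # Ceiling division: a partial last packet still counts as one packet.
    entry.c_packet = (data_size + packet_size - 1) // packet_size


def on_packet_sent(entry):
    """Decrement the countdown; return True once the whole stream is sent."""
    if entry.c_packet == 0:
        raise RuntimeError("no packets left to send for this stream")
    entry.c_packet -= 1
    return entry.c_packet == 0


entry = StreamOutEntry(tid=1, sid=17, d_mask=(0, 9))
finish_init(entry, data_size=64 * 1024, s_addr=0x1000)
```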
  • the source chiplet may complete the initialization of the stream output table before executing the first data processing task assigned by the central controller.
  • For example, the source sub-chip can initialize the number of sent data packets of the data stream in the stream output table to zero, and initialize the start address for storing the data included in the data stream to the start address of a designated storage space.
  • the source sub-chip executes the data processing task assigned by the central controller, and the processed data can be stored in the designated storage space.
  • When the source sub-chip sends the processed data, the number of sent data packets of the data stream in the stream output table is increased by one each time a data packet is sent.
  • the source chiplet generates multiple data packets from the output data based on the flow output table, and the data packets include an identifier of the data flow and an identifier of the destination chiplet.
  • a data packet corresponding to the data stream may be generated based on the stream output table.
  • the data packet may include the type (type) of the data packet, the identifier of the task, the identifier of the data flow, the identifier of the destination sub-chip, the serial number of the data packet and data.
  • The identifier of the destination sub-chip in the data packet may identify one or more destination sub-chips. If the data in the data packet corresponds to one destination sub-chip, the data packet carries the identifier of that destination sub-chip; if the data corresponds to multiple destination sub-chips, the data packet carries the identifiers of all of them. For example, data stream 17 corresponds to two destination sub-chips, namely sub-chip 0 and sub-chip 9, so the data packets in data stream 17 include the identifiers of sub-chip 0 and sub-chip 9.
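  • One way to carry several destination identifiers in a single field is a bitmask with one bit per sub-chip, which would fit the name D_mask. The text does not state the actual encoding, so the following is purely an assumed scheme with hypothetical helper names.

```python
def encode_d_mask(dest_ids):
    """Set bit i for every destination sub-chip i (assumed encoding)."""
    mask = 0
    for chip_id in dest_ids:
        mask |= 1 << chip_id
    return mask


def decode_d_mask(mask):
    """Return the sorted list of sub-chip identifiers whose bit is set."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Data stream 17 goes to sub-chip 0 and sub-chip 9.
d_mask_17 = encode_d_mask([0, 9])
```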
  • The source sub-chip obtains, based on the task identifier and the data stream identifier in the above stream output table, the start address at which the data to be sent (that is, the above output data) is stored, reads the data to be sent from that start address, and generates the above data packets.
  • the generated data packet may also carry sideband information, and the sideband information may include one or more information of task identifier, data flow identifier, or destination chiplet identifier. These sideband information may not be encapsulated in the data packet, but sent together with the data packet.
  • The information included in the data packet can only be known by the routing module of the sub-chip, and other modules such as ports in the sub-chip are not aware of it. Therefore, to facilitate fast forwarding of the data packet, the data packet may be configured to carry the above sideband information.
  • To facilitate understanding of the format of the data packet and of the sideband information, refer to FIG. 10 for an example. The format of the data packet shown in FIG. 10 and the corresponding sideband information are only examples. In a specific implementation, the data packet may also include other information, and the sideband information may also include more information, which is not limited by this application.
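  • Packet generation with accompanying sideband information can be sketched as follows. The fields follow the list given above (type, task identifier, data stream identifier, destination identifier, sequence number, data); the class names, the `build_packets` helper, the 1024-byte payload and the byte-array model of memory are assumptions, and the layout in FIG. 10 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sideband:
    # Travels next to the packet so that ports need not parse the packet.
    tid: int
    sid: int
    d_mask: tuple


@dataclass(frozen=True)
class DataPacket:
    ptype: str
    tid: int
    sid: int
    d_mask: tuple
    seq: int
    data: bytes


def build_packets(memory, s_addr, size, tid, sid, d_mask, payload=1024):
    """Read `size` bytes starting at s_addr and wrap them into packets."""
    out = []
    for seq, offset in enumerate(range(0, size, payload)):
        end = min(offset + payload, size)
        chunk = bytes(memory[s_addr + offset:s_addr + end])
        packet = DataPacket("data", tid, sid, d_mask, seq, chunk)
        out.append((packet, Sideband(tid, sid, d_mask)))
    return out


memory = bytearray(range(256)) * 16  # 4096 bytes standing in for local SRAM
pairs = build_packets(memory, s_addr=1024, size=2500, tid=1, sid=17, d_mask=(0, 9))
```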
  • the source sub-chip sends the above multiple data packets to the destination sub-chip.
  • the data packet can be sent to the destination sub-chip.
  • the source chip can send the data packet based on the port forwarding mapping table.
  • As described above, the routing module of the source sub-chip includes a port forwarding mapping module (FPM), which stores a port forwarding mapping table; the table records the mapping relationship between the identifier of a destination sub-chip and a sending port.
  • The source sub-chip can query the port forwarding mapping table with the identifier of the destination sub-chip of the data packet to find the corresponding sending port, and then send the data packet out of that port. For ease of understanding, see FIG. 11, in which sub-chip 0 is assumed to be the source sub-chip and sub-chip 1 the destination sub-chip.
  • the port forwarding mapping table in the sub-chip 0 stores the association relationship between the identifier of the sub-chip 1 and the port d1 of the sub-chip 0 .
  • When sub-chip 0 has a data packet to send to sub-chip 1, sub-chip 0 queries the port forwarding mapping table with the identifier of sub-chip 1, learns that the sending port of the data packet is d1, and sends the data packet from port d1.
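  • The port lookup in this example amounts to a table keyed by destination identifier. A minimal sketch, assuming a dictionary-backed table and a hypothetical class name `ForwardingPortMapper`:

```python
class ForwardingPortMapper:
    """Maps a destination sub-chip identifier to a local sending port."""

    def __init__(self, table):
        self._table = dict(table)

    def port_for(self, dest_id):
        try:
            return self._table[dest_id]
        except KeyError:
            raise LookupError(f"no sending port for sub-chip {dest_id}") from None


# Table held by sub-chip 0 in the FIG. 11 example: sub-chip 1 is reached via d1.
fpm_chip0 = ForwardingPortMapper({1: "d1"})
send_port = fpm_chip0.port_for(1)
```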
  • the destination chiplet receives the data packet.
  • After the above-mentioned source sub-chip sends out the data packet, the data packet travels along its route to the destination sub-chip.
  • The destination sub-chip receives the data packet. For example, taking the above-mentioned FIG. 11 as an example, sub-chip 0 sends a data packet from port d1, the data packet arrives at port d3 of sub-chip 1, and sub-chip 1 receives the data packet through port d3.
  • The destination sub-chip judges whether the stream input table contains the data stream identifier carried in the received data packet, where the stream input table includes the identifier of at least one data stream to be received. When the stream input table contains the data stream identifier carried in the received data packet, the destination sub-chip stores the data in the received data packet.
  • the destination sub-chip may first store the data in the data packet into a buffer for subsequent processing.
  • the destination chiplet can store the data in the data packet based on the flow input table.
  • the stream input table is the basis for the destination chiplet to receive data.
  • the flow input table may include an identification of a task and an identification of a data flow.
  • the stream input table may be initialized by the central controller of the chip system. Based on the foregoing description, it can be seen that the data flow table is stored in the central controller. The central controller can query the data flow table based on the identifier of the target chiplet, obtain information related to the target chiplet and send it to the target chiplet.
  • the information associated with the target chiplet includes information such as an identifier of a corresponding task and an identifier of a data flow.
  • the destination sub-chip can write the information into its own stream input table.
  • One or more items of associated information corresponding to the destination sub-chip in the data flow table are configuration parameters for configuring the flow input table.
  • The identifier of a data stream in the stream input table is the identifier of a data stream to be received by the destination sub-chip. That is, only if the data stream identifier in a data packet is among the data stream identifiers in the stream input table does the destination sub-chip store the data in that data packet for subsequent processing.
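  • The acceptance check described above can be sketched as a membership test on the stream input table. Keying entries by the pair (task identifier, data stream identifier) and the class name `StreamInTable` are assumptions; what happens to a packet that is not accepted is not specified here, so the sketch only reports that it was not stored.

```python
class StreamInTable:
    """Holds the data streams a destination sub-chip expects to receive."""

    def __init__(self):
        self._buffers = {}  # (tid, sid) -> list of received payloads

    def expect(self, tid, sid):
        self._buffers[(tid, sid)] = []

    def receive(self, tid, sid, payload):
        """Store the payload if the stream is expected; report the outcome."""
        buffer = self._buffers.get((tid, sid))
        if buffer is None:
            return False  # stream not listed in the table: data is not stored
        buffer.append(payload)
        return True

    def stored(self, tid, sid):
        return list(self._buffers[(tid, sid)])


sit = StreamInTable()
sit.expect(tid=1, sid=15)
accepted = sit.receive(1, 15, b"abc")
rejected = sit.receive(1, 16, b"xyz")
```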
  • the above-mentioned stream input table may also include information indicating the number of data packets of the received data stream and the starting address of the destination sub-chip for storing data included in the data stream.
  • the information may be filled in the stream input table by the destination chiplet based on its own information.
  • the destination chiplet may configure a designated storage space for the data stream to be received, and then initialize the start address of the designated storage space into the stream input table.
  • For example, the destination sub-chip can initialize the number of received data packets of the data stream in the stream input table to zero; then, while receiving the data packets of the data stream, it adds one to that number each time a data packet is received.
  • Alternatively, the destination sub-chip can obtain the total number of data packets included in the data stream to be received from the source sub-chip or the controller, and initialize the number of data packets of the corresponding data stream in the stream input table to that total. Then, while receiving the data packets of the data stream, it reduces that number by one each time a data packet is received.
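  • The two counting schemes above differ only in the initial value and the direction of the update. A minimal sketch with a hypothetical `ReceiveCounter` class; note that only the countdown scheme can tell on its own when the stream is complete, since it knows the total in advance.

```python
class ReceiveCounter:
    """Counts up from zero, or down from a known total if one is given."""

    def __init__(self, total=None):
        self.total = total
        self.c_packet = 0 if total is None else total

    def on_packet(self):
        if self.total is None:
            self.c_packet += 1          # count-up scheme
        else:
            if self.c_packet == 0:
                raise RuntimeError("more packets than announced")
            self.c_packet -= 1          # countdown scheme

    def stream_complete(self):
        # Decidable only when the total was known at initialization.
        return self.total is not None and self.c_packet == 0


count_up = ReceiveCounter()
count_down = ReceiveCounter(total=3)
for _ in range(3):
    count_up.on_packet()
    count_down.on_packet()
```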
  • The above Table 3 exemplarily shows the information corresponding to task 1 in the stream input table. The TID and SID corresponding to task 1 in Table 3 are obtained by the central controller from the above Table 1 and sent to the destination sub-chip, so they are the same as in Table 1.
  • the start address (start address, S_addr) in Table 3 refers to the start address used to store the data included in the data stream 15 in the above-mentioned target chiplet.
  • the data packet count (count packet, C_packet) in Table 3 refers to the indication information of the number of data packets of the received data flow 15 .
  • the data packet count may be a countdown, for example, at initialization, the data packet count is the number of all data packets included in the data flow, and then, each time the destination chip receives a data packet, the data packet count is decremented by one.
  • the flow input table may include information corresponding to multiple tasks, which are distinguished by task identifiers, and the identifiers of data flows of different tasks may be the same or different, depending on the data flow table of the central controller.
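  • the countdown variant of the stream input table described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the class name `StreamInputEntry`, the dictionary keyed by (TID, SID), and the concrete start address and packet total are assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class StreamInputEntry:
    s_addr: int      # S_addr: start address reserved for the stream's data
    c_packet: int    # C_packet: packets still expected (countdown variant)

# Keyed by (TID, SID); the address and the total of 3 packets are illustrative values.
stream_input_table = {(1, 15): StreamInputEntry(s_addr=0x1000, c_packet=3)}

def on_packet_received(table, tid, sid):
    """Decrement the countdown; return True once the whole stream has arrived."""
    entry = table[(tid, sid)]
    entry.c_packet -= 1
    return entry.c_packet == 0

# Receive three packets of data stream 15 of task 1.
done = [on_packet_received(stream_input_table, 1, 15) for _ in range(3)]
```

  • keying the table by both identifiers reflects that data stream identifiers of different tasks may coincide, so the task identifier is needed to tell the entries apart.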
  • the destination chiplet may initialize the above stream input table by receiving a header packet (header packet) from the source chiplet.
  • the header data packet may include the type of the data packet, the identifier of the task, the identifier of the data flow, the identifier of the destination sub-chip, the total number of data packets included in the data flow, and the start address used to store the data of the data flow in the destination sub-chip.
  • the header data packet may also carry sideband information, and the content of the sideband information may be the same as that of the aforementioned sideband information, which will not be repeated here.
  • For ease of understanding the format of the header packet, refer to FIG. 12 for an example.
  • N_PK represents the total number of data packets included in the data stream
  • Address represents the start address used to store the data of the data stream in the above-mentioned destination sub-chip. For the other identifiers, refer to the previous description; they are not repeated here.
  • the format of the data packet shown in Figure 12 and the corresponding sideband information are only examples. In a specific implementation, the data packet can also include other information, and the sideband information can also include more information. This application does not limit this.
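  • as an illustration of how a header packet could initialize the stream input table, consider the following sketch. The field names mirror the items listed above (N_PK, Address), but the class `HeaderPacket`, the string encoding of the packet type, and the dictionary layout of the table are assumptions made for the example; the actual bit-level format is the one shown in FIG. 12.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HeaderPacket:
    pkt_type: str   # packet type; the string "header" is a placeholder encoding
    tid: int        # task identifier
    sid: int        # data stream identifier
    dest_id: int    # destination sub-chip identifier
    n_pk: int       # N_PK: total number of packets in the stream
    address: int    # Address: start address for the stream's data

def init_stream_input(table, header, my_id):
    """Create the (TID, SID) entry when the header is addressed to this sub-chip."""
    if header.pkt_type != "header" or header.dest_id != my_id:
        return False
    table[(header.tid, header.sid)] = {"S_addr": header.address, "C_packet": header.n_pk}
    return True

table = {}
accepted = init_stream_input(table, HeaderPacket("header", 1, 15, 6, 4, 0x2000), my_id=6)
```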
  • the destination chiplet can send an unblocking data packet to the source chiplet, and the unblocking data packet is used to indicate to the source chiplet that the destination chiplet is ready to receive the corresponding data flow.
  • the unblocking data packet may include a data packet type, an identifier of a task, an identifier of a data flow, and an identifier of a chiplet receiving the data packet.
  • the sub-chip receiving the data packet is the above-mentioned source sub-chip.
  • the unblocking data packet may also carry sideband information, and the content of the sideband information may be the same as that of the aforementioned sideband information, which will not be repeated here.
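  • the role of the unblocking data packet can be sketched as a simple gate on the source side: the source sub-chip starts sending a data flow only after the matching unblocking packet has arrived. The names `UnblockPacket` and `SourceGate` and the string packet type are hypothetical and serve only this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnblockPacket:
    pkt_type: str   # placeholder encoding for the packet type
    tid: int        # task identifier
    sid: int        # data stream identifier
    recv_id: int    # identifier of the sub-chip receiving this packet (the source)

class SourceGate:
    """Tracks which (TID, SID) streams the destination has unblocked."""

    def __init__(self, my_id):
        self.my_id = my_id
        self.unblocked = set()

    def on_unblock(self, pkt):
        # Accept the packet only if it is an unblocking packet addressed to this sub-chip.
        if pkt.pkt_type == "unblock" and pkt.recv_id == self.my_id:
            self.unblocked.add((pkt.tid, pkt.sid))

    def may_send(self, tid, sid):
        return (tid, sid) in self.unblocked

gate = SourceGate(my_id=5)
before = gate.may_send(1, 15)
gate.on_unblock(UnblockPacket("unblock", 1, 15, recv_id=5))
after = gate.may_send(1, 15)
```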
  • after the destination sub-chip initializes the flow input table, when it receives a data packet sent by the source sub-chip, it can first obtain the data flow identifier in the data packet and compare it with the data flow identifiers in its own flow input table. If the flow input table includes the data flow identifier in the received data packet, the flow input table can be queried based on the task identifier and the data flow identifier in the data packet to obtain the start address for storing the corresponding data; the corresponding storage address is then calculated from the start address, and the data of the data packet is stored at that storage address.
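  • the lookup-and-store step can be sketched as below. The text does not specify how the storage address is derived from the start address, so the sketch assumes a fixed payload size and a packet number counted from zero, giving address = S_addr + pkt_no × payload_size; the function name and the dictionary-based packet are likewise illustrative.

```python
def store_packet(table, memory, packet, payload_size):
    """Store a packet's data if its stream is listed in the stream input table.

    Assumption: `pkt_no` counts from 0 and every payload slot is `payload_size`
    bytes, so the storage address is S_addr + pkt_no * payload_size.
    """
    entry = table.get((packet["tid"], packet["sid"]))
    if entry is None:
        return None                      # stream not expected: data is not stored
    addr = entry["S_addr"] + packet["pkt_no"] * payload_size
    memory[addr:addr + len(packet["data"])] = packet["data"]
    return addr

memory = bytearray(64)
table = {(1, 15): {"S_addr": 16}}
addr = store_packet(table, memory, {"tid": 1, "sid": 15, "pkt_no": 2, "data": b"abcd"}, 4)
ignored = store_packet(table, memory, {"tid": 1, "sid": 99, "pkt_no": 0, "data": b"zzzz"}, 4)
```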
  • the central controller assigns a second data processing task to the target chiplet, and the second data processing task is executed based on the data obtained after the first data processing task is executed. Then, the source sub-chip executes the first data processing task to obtain output data, and sends the output data to the destination sub-chip in the form of a data stream.
  • the destination chiplet receives and stores the output data based on its own stream input table. Then, the destination chiplet can read the output data from the storage space based on the known storage address to execute its own second data processing task.
  • the system-on-a-chip may include multiple subsystems, and the subsystems are connected according to a preset topology. If the source sub-chip and the destination sub-chip are not in the same subsystem, a data packet transmitted between the two sub-chips also includes the identifier of the subsystem where the sub-chip receiving the data packet is located. For example, if the source chiplet sends a data packet (such as a data packet included in the above data flow) to the destination chiplet, the packet also includes the identifier of the subsystem where the destination chiplet is located.
  • conversely, if the destination chiplet sends a data packet to the source chiplet, the packet also includes the identifier of the subsystem where the source chiplet is located. Carrying the subsystem identifier in the data packet enables data transmission between sub-chips across subsystems, which facilitates the processing of large-scale data tasks and improves processing efficiency.
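  • the cross-subsystem rule can be illustrated with a short sketch. The function `address_packet` and the field name `subsystem_id` are assumptions for the example; the only behaviour taken from the text is that the receiver's subsystem identifier is added when sender and receiver are in different subsystems.

```python
def address_packet(packet, sender_subsystem, receiver_subsystem):
    """Return a copy of `packet`, adding the receiver's subsystem identifier when needed."""
    addressed = dict(packet)
    if sender_subsystem != receiver_subsystem:
        addressed["subsystem_id"] = receiver_subsystem
    return addressed

same = address_packet({"sid": 15, "dest_id": 6}, sender_subsystem=0, receiver_subsystem=0)
cross = address_packet({"sid": 15, "dest_id": 6}, sender_subsystem=0, receiver_subsystem=2)
```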
  • the embodiment of the present application can realize efficient data transmission in the chip system based on the above-mentioned initialized stream output table and stream input table, and improve the data processing performance of the chip system.
  • the embodiment of the present application may provide a routing implementation manner, so that data can be flexibly scheduled and efficiently transmitted among the sub-chips of the chip system.
  • the following takes the first sub-chip as an example for introduction.
  • routing implementation provided by the embodiment of the present application includes but is not limited to the following steps:
  • the first sub-chip acquires a first data packet, where the first data packet includes the identification of the target sub-chip; the first sub-chip and the target sub-chip are sub-chips among the multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology.
  • the chip system may be the chip system 110 , the chip system 120 , the chip system 130 , or the chip system 140 described above.
  • the first sub-chip may be any sub-chip in any one of these chip systems.
  • the first sub-chip may be the source sub-chip described in FIG. 13; in that case, the first sub-chip obtains the first data packet by generating it.
  • alternatively, the first sub-chip may be an intermediate sub-chip on the path along which data packets are transmitted from the source sub-chip to the destination sub-chip; in that case, the first sub-chip obtains the first data packet by receiving it.
  • the system-on-a-chip where the first sub-chip is located is the first system-on-a-chip.
  • the first sub-chip receives the first data packet from another sub-chip in the first system-on-a-chip.
  • the first data packet may include one or more items of information such as the packet type (type), task identifier, data stream identifier, destination sub-chip identifier, packet number, and data.
  • the above-mentioned first data packet may also carry sideband information, and the sideband information may include one or more information of a task identifier, a data flow identifier, or a destination chiplet identifier.
  • for the content included in the first data packet, reference may be made to the description of the data packet in step S802 above, which will not be repeated here.
  • the first sub-chip sends the data in the first data packet based on the data transmission between sub-chips in the chip system.
  • the above-mentioned first sub-chip may send the data in the above-mentioned first data packet based on the congestion situation of its own port.
  • each sub-chip includes a plurality of ports for communicating with other sub-chips.
  • each port is configured with a corresponding sending buffer, and the sending buffer is used to store data to be sent.
  • after the first sub-chip receives the first data packet, it parses the first data packet to obtain the identifier of the destination sub-chip in the first data packet. If the identifier of the target chiplet indicates that the first chiplet is the target chiplet, the first chiplet extracts the data in the first data packet and stores it for subsequent processing. Otherwise, the first sub-chip looks up the sending port of the first data packet in its own forwarding mapping table by using the identifier of the destination sub-chip as an index. For the introduction of the forwarding mapping table, reference may be made to the corresponding description in the foregoing description of FIG. 2, which will not be repeated here. If multiple sending ports are found, the specific sending port may be determined based on the congestion conditions of the sending buffers of those ports. Specifically, in order to improve data transmission efficiency, the port whose sending buffer holds the least amount of data to be sent may be selected to send the first data packet.
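  • the port selection described above can be sketched as follows. The representation of the forwarding mapping table as a dictionary from destination identifier to a list of candidate ports, and of buffer occupancy as a plain number per port, are assumptions made for the example.

```python
def select_port(forwarding_table, send_buffers, dest_id, my_id):
    """Return None if this sub-chip is the destination, else the least-loaded candidate port."""
    if dest_id == my_id:
        return None                       # data is consumed locally, nothing to forward
    candidates = forwarding_table[dest_id]
    # Pick the candidate whose sending buffer holds the least pending data.
    return min(candidates, key=lambda port: send_buffers[port])

forwarding_table = {7: ["d1", "d2"]}      # destination id -> candidate sending ports
send_buffers = {"d1": 5, "d2": 2}         # pending amount of data per port
port = select_port(forwarding_table, send_buffers, dest_id=7, my_id=5)
local = select_port(forwarding_table, send_buffers, dest_id=5, my_id=5)
```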
  • when the first data packet carries the identifiers of multiple target chiplets and one of them identifies the first chiplet, the first chiplet extracts the data in the first data packet and stores it for subsequent processing.
  • the first chiplet then uses the identifiers of the remaining target chiplets as indexes to look up the sending port of the first data packet in its own forwarding mapping table.
  • if only one target sub-chip identifier remains, then, in the same way, after the corresponding sending ports are found, the port with the least amount of data to be sent in its sending buffer is selected to send the data of the first data packet. Specifically, the data is repackaged into a data packet for transmission, and the target sub-chip identifiers in the repackaged data packet no longer include the identifier of the first sub-chip but only the identifiers of the remaining target sub-chips.
  • if multiple target sub-chip identifiers remain, the first sub-chip looks up the corresponding sending port for each of them in its own forwarding mapping table. If the sending ports found are the same, the data included in the first data packet may be copied to generate a new data packet that includes the identifiers of the remaining destination chiplets, and the new data packet is sent from that common sending port. Similarly, the sending port may be the port with the least amount of data to be sent in its sending buffer among the ports found.
  • for example, suppose the remaining target sub-chips are sub-chip A and sub-chip B.
  • the above-mentioned first sub-chip looks up the sending port mapped with the identifier of the sub-chip A and the sending port mapped with the identifier of the sub-chip B in its own forwarding mapping table. Assuming that the found sending ports are different, the first sub-chip may regenerate two data packets: data packet A and data packet B.
  • Both data packets include the data included in the above-mentioned first data packet, wherein the identifier of the target sub-chip included in the data packet A is the identifier of the sub-chip A, and the identifier of the target sub-chip included in the data packet B is the identifier of the sub-chip B. Then, the data packet A and the data packet B are sent through the respectively found sending ports. Similarly, the sending port may be the port with the least amount of data to be sent in the sending buffer among the found sending ports.
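  • the multi-destination handling above amounts to grouping the remaining destination identifiers by sending port and emitting one packet per port. The sketch below illustrates this under simplifying assumptions: each destination maps to a single port (congestion-based choice among several ports is omitted), the identifiers 6 and 9 stand in for sub-chips A and B, and the port names are hypothetical.

```python
def split_multicast(dest_ids, my_id, forwarding_table, data):
    """Return (consume_locally, {port: packet}) for a multi-destination packet."""
    consume = my_id in dest_ids
    by_port = {}
    for dest in dest_ids:
        if dest == my_id:
            continue                       # the local identifier is dropped when repackaging
        port = forwarding_table[dest]      # one port per destination in this sketch
        by_port.setdefault(port, []).append(dest)
    packets = {port: {"dest_ids": dests, "data": data} for port, dests in by_port.items()}
    return consume, packets

# Different ports: two packets, one per remaining destination.
consume, packets = split_multicast([5, 6, 9], 5, {6: "d1", 9: "d3"}, b"payload")
# Same port: a single packet carrying both remaining identifiers.
_, merged = split_multicast([5, 6, 9], 5, {6: "d1", 9: "d1"}, b"payload")
```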
  • the first chiplet may send the data in the first data packet to the target chiplet based on the principle of minimum bandwidth consumption.
  • the principle of minimum bandwidth consumption refers to the principle of sending data to the destination sub-chip with the minimum transmission bandwidth.
  • FIG. 14 exemplarily shows a schematic diagram of a direction coordinate system centered on the first chiplet.
  • the directional coordinate system includes four directional axes: a first directional axis, a second directional axis, a third directional axis and a fourth directional axis.
  • the four direction axes all diverge outward from the center of the first sub-chip.
  • the first direction axis and the second direction axis are collinear and opposite in direction; the third direction axis and the fourth direction axis are collinear and opposite in direction.
  • the direction coordinate system also includes four regions: a first region, a second region, a third region and a fourth region.
  • the first area is bounded by the first direction axis and the third direction axis
  • the second area is bounded by the second direction axis and the third direction axis
  • the third area is bounded by the second direction axis and the fourth direction axis
  • the fourth region is bounded by the first direction axis and the fourth direction axis.
  • the row of the first sub-chip is located on at least one of the first direction axis and the second direction axis
  • the column of the first sub-chip is located on at least one of the third direction axis and the fourth direction axis.
  • An example can be seen in FIG. 15.
  • the sub-chip 5 in the chip system is the first sub-chip
  • a direction coordinate system is established with the sub-chip 5 as the center.
  • the second row, in which the chiplet 5 is located, lies on the first direction axis and the second direction axis
  • the second column, in which the chiplet 5 is located, lies on the third direction axis and the fourth direction axis.
  • Chiplet 2 and the chiplet 3 are located in the first region of the directional coordinate system.
  • Chiplet 0 is located in the second area of the orientation coordinate system.
  • Chiplet 8 and chiplet 12 are located in the third region of the directional coordinate system.
  • the chiplet 10 , the chiplet 11 , the chiplet 14 and the chiplet 15 are located in the fourth area of the directional coordinate system.
  • the sub-chip 0 is the first sub-chip in the above-mentioned chip system in FIG. 15 .
  • a direction coordinate system is established centering on the sub-chip 0 .
  • the first row of the chiplet 0 is located on the first directional axis
  • the first column of the chiplet 0 is located on the fourth directional axis. Then, except for the sub-chips in the row and column where the sub-chip 0 is located, all other sub-chips are located in the fourth area of the coordinate system in this direction.
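  • the placement of sub-chips in the direction coordinate system can be reproduced with a small classifier. The sketch assumes a 4×4 grid numbered row by row from 0 to 15, and assumes the axis orientation implied by the examples (first axis toward higher columns, second toward lower columns, third toward lower rows, fourth toward higher rows); FIG. 15 itself is not reproduced here, so both points are inferences, and the function name `locate` is hypothetical.

```python
def locate(center, target, cols=4):
    """Classify `target` relative to `center` as an axis, an area, or the center itself."""
    d_row = target // cols - center // cols   # positive: target is in a lower row
    d_col = target % cols - center % cols     # positive: target is further right
    if d_row == 0 and d_col == 0:
        return "center"
    if d_row == 0:
        return "axis1" if d_col > 0 else "axis2"
    if d_col == 0:
        return "axis3" if d_row < 0 else "axis4"
    if d_row < 0:
        return "area1" if d_col > 0 else "area2"
    return "area4" if d_col > 0 else "area3"

around_5 = {t: locate(5, t) for t in range(16)}
around_0 = {t: locate(0, t) for t in range(16)}
```

  • with sub-chip 5 as the center this yields the areas listed above, and with sub-chip 0 as the center every sub-chip outside row 0 and column 0 falls into the fourth area.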
  • the above-mentioned minimum bandwidth consumption principle includes: when the destination chiplet included in the above-mentioned first data packet is on a target direction axis, the above-mentioned first chiplet sends the data of the first data packet along the direction of the target direction axis; the target direction axis is the first direction axis, the second direction axis, the third direction axis or the fourth direction axis.
  • For ease of understanding, the following description takes the above-mentioned FIG. 15 as an example.
  • the sub-chip 5 is the above-mentioned first sub-chip, and it receives a data packet, and the identifier of the destination sub-chip in the data packet indicates that the destination sub-chip is the sub-chip 7 . If the data packet only includes the identifier of one destination chiplet, then, since the chiplet 7 is located on the first direction axis, the chiplet 5 sends the data packet along the direction of the first direction axis. That is, the sub-chip 5 first sends the data packet to the sub-chip 6 , and then the sub-chip 6 forwards the data packet to the sub-chip 7 .
  • if the data packet includes the identifiers of multiple target chiplets, the chiplet 7, as one of the target chiplets, is located on the first direction axis. Therefore, the chiplet 5 copies the data in the data packet to generate a new data packet, and sends the new data packet along the direction of the first direction axis. That is, the sub-chip 5 first sends the new data packet to the sub-chip 6, and then the sub-chip 6 forwards it to the sub-chip 7.
  • the newly generated data packet includes the identification of the chiplet 7 .
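  • forwarding along a direction axis can be sketched as a walk of one grid step per hop. As before, the 4×4 row-by-row numbering and the orientation of the four axes are assumptions inferred from the examples, and `path_along_axis` is a hypothetical helper name.

```python
# Row and column step for each direction axis (assumed orientation, see lead-in).
STEP = {"axis1": (0, 1), "axis2": (0, -1), "axis3": (-1, 0), "axis4": (1, 0)}

def path_along_axis(src, dst, axis, rows=4, cols=4):
    """List the sub-chips visited when forwarding from `src` to `dst` along `axis`."""
    d_row, d_col = STEP[axis]
    row, col = divmod(src, cols)
    hops = []
    while True:
        row, col = row + d_row, col + d_col
        if not (0 <= row < rows and 0 <= col < cols):
            # Walked off the grid without meeting dst.
            raise ValueError("destination is not on the given axis")
        hops.append(row * cols + col)
        if hops[-1] == dst:
            return hops

to_7 = path_along_axis(5, 7, "axis1")
to_13 = path_along_axis(5, 13, "axis4")
```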
  • the first data packet includes identifiers of the first destination chiplet and the second destination chiplet.
  • the minimum bandwidth consumption principle also includes: in the directional coordinate system centered on the first chiplet, when the first target chiplet and the second target chiplet are respectively located in two adjacent areas among the first area, the second area, the third area and the fourth area of the coordinate system, the first chiplet sends the second data packet along the direction of the common direction axis.
  • the second data packet includes the data, identifiers of the first destination chiplet and the second destination chiplet.
  • the common direction axis is the direction axis of the common boundary of the two adjacent regions.
  • the sub-chip 5 is the above-mentioned first sub-chip, and it receives a data packet, and the identification of the destination sub-chip in the data packet indicates that the destination sub-chips are the sub-chip 8 and the sub-chip 14 .
  • the sub-chip 8 is located in the third area and the sub-chip 14 is located in the fourth area; these two areas are adjacent, and their common boundary is the fourth direction axis. Therefore, the chiplet 5 sends the data packet along the direction of the fourth direction axis. That is, the sub-chip 5 first sends the data packet to the sub-chip 9, and then the sub-chip 9 forwards it further.
  • the sub-chip 9 can also be regarded as the above-mentioned first sub-chip, and a direction coordinate system is established with the sub-chip 9 as the center, and then data is forwarded based on the above-mentioned minimum bandwidth consumption principle.
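  • finding the common direction axis of two adjacent areas reduces to intersecting their boundary sets. The sketch below encodes the boundaries exactly as listed in the description of the direction coordinate system; the names `BOUNDS` and `common_axis` are hypothetical.

```python
# Boundary axes of each area, as listed in the description of the coordinate system.
BOUNDS = {
    "area1": {"axis1", "axis3"},
    "area2": {"axis2", "axis3"},
    "area3": {"axis2", "axis4"},
    "area4": {"axis1", "axis4"},
}

def common_axis(area_a, area_b):
    """Return the shared boundary axis of two adjacent areas, or None otherwise."""
    shared = BOUNDS[area_a] & BOUNDS[area_b]
    # Exactly one shared axis means the areas are adjacent; diagonal areas share none.
    return next(iter(shared)) if len(shared) == 1 else None

axis_8_14 = common_axis("area3", "area4")   # destinations 8 and 14 seen from sub-chip 5
diagonal = common_axis("area1", "area3")
```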
  • the first data packet includes identifiers of the first destination chiplet and the second destination chiplet.
  • the aforementioned minimum bandwidth consumption principle also includes: in the directional coordinate system centered on the first chiplet, when the first target chiplet is in the first area of the coordinate system and the second target chiplet is in the third area, the first chiplet sends the third data packet along the direction of one of the two boundary direction axes of the first area, and sends the fourth data packet along the direction of one of the two boundary direction axes of the third area.
  • the third data packet includes the data and the identifier of the first destination chiplet.
  • the fourth data packet includes the data and the identifier of the second destination chiplet.
  • the sub-chip 5 is the above-mentioned first sub-chip, and it receives a data packet, and the identification of the destination sub-chip in the data packet indicates that the destination sub-chips are the sub-chip 2 and the sub-chip 12 .
  • the chiplet 2 is located in the first area, and the chiplet 12 is located in the third area.
  • the chiplet 5 can regenerate two data packets: data packet A and data packet B based on the data in the received data packet.
  • the data packet A includes the data and the identification of the sub-chip 2
  • the data packet B includes the data and the identification of the sub-chip 12.
  • the data packet A is sent along the direction of the first direction axis or the third direction axis.
  • the data packet A is sent along the direction of the first direction axis, that is, the data packet A is first sent to the sub-chip 6 , and then the sub-chip 6 forwards the data packet A to the sub-chip 2 .
  • the chiplet 5 sends the data packet B along the direction of the second direction axis or the fourth direction axis.
  • the data packet B is sent along the direction of the fourth direction axis, that is, the data packet B is first sent to the sub-chip 9 , and then the sub-chip 9 continues to forward it further.
  • the first data packet includes identifiers of the first destination chiplet and the second destination chiplet.
  • the above minimum bandwidth consumption principle also includes: in the directional coordinate system centered on the first chiplet, when the first target chiplet is in the second area of the coordinate system and the second target chiplet is in the fourth area, the first chiplet sends the third data packet along the direction of one of the two boundary direction axes of the second area, and sends the fourth data packet along the direction of one of the two boundary direction axes of the fourth area.
  • the third data packet includes the data and the identifier of the first destination chiplet.
  • the fourth data packet includes the data and the identifier of the second destination chiplet.
  • the chiplet 5 is the above-mentioned first chiplet, and it receives a data packet, and the identification of the target chiplet in the data packet indicates that the target chiplets are chiplet 0 and chiplet 10 .
  • Chiplet 0 is located in the second area
  • chiplet 10 is located in the fourth area.
  • the chiplet 5 can regenerate two data packets: data packet C and data packet D based on the data in the received data packet.
  • the data packet C includes the data and the identification of the sub-chip 0
  • the data packet D includes the data and the identification of the sub-chip 10.
  • the data packet C is sent along the direction of the second direction axis or the third direction axis.
  • the data packet C is sent along the direction of the second direction axis, that is, the data packet C is first sent to the sub-chip 4 , and then the sub-chip 4 forwards the data packet C to the sub-chip 0 .
  • the chiplet 5 sends the data packet D along the direction of the first direction axis or the fourth direction axis.
  • the data packet D is sent along the direction of the first direction axis, that is, the data packet D is first sent to the sub-chip 6 , and then the sub-chip 6 forwards it to the sub-chip 10 .
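  • the two diagonal cases (first and third areas, second and fourth areas) can be sketched together: because diagonal areas share no boundary axis, the data is duplicated into two packets, each sent along one boundary axis of its destination's area. The helper name `split_for_diagonal` and the packet layout are assumptions; which of the two boundary axes is chosen is left to the caller, as in the text.

```python
BOUNDS = {"area1": ("axis1", "axis3"), "area2": ("axis2", "axis3"),
          "area3": ("axis2", "axis4"), "area4": ("axis1", "axis4")}

def split_for_diagonal(data, dest_a, area_a, axis_a, dest_b, area_b, axis_b):
    """Build one packet per destination, each routed along a boundary axis of its area."""
    if set(BOUNDS[area_a]) & set(BOUNDS[area_b]):
        raise ValueError("areas share a boundary axis, so they are not diagonal")
    if axis_a not in BOUNDS[area_a] or axis_b not in BOUNDS[area_b]:
        raise ValueError("each packet must leave along a boundary axis of its area")
    return [
        {"axis": axis_a, "dest_ids": [dest_a], "data": data},
        {"axis": axis_b, "dest_ids": [dest_b], "data": data},
    ]

# Destinations 2 (first area) and 12 (third area), seen from sub-chip 5.
pkt_a, pkt_b = split_for_diagonal(b"x", 2, "area1", "axis1", 12, "area3", "axis4")
# Destinations 0 (second area) and 10 (fourth area), seen from sub-chip 5.
pkt_c, pkt_d = split_for_diagonal(b"x", 0, "area2", "axis2", 10, "area4", "axis1")
```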
  • the first data packet includes identifiers of the first destination chiplet and the second destination chiplet.
  • the minimum bandwidth consumption principle further includes: in the directional coordinate system centered on the first sub-chip, when the first target sub-chip is in a target area and the second target sub-chip is on a boundary direction axis of that target area, the first chiplet sends the fifth data packet along the direction of that boundary direction axis.
  • the fifth data packet includes the data and identifications of the first and second destination chiplets.
  • the target area is the first area, the second area, the third area or the fourth area.
  • the sub-chip 5 is the above-mentioned first sub-chip, and it receives a data packet, and the identification of the destination sub-chip in the data packet indicates that the destination sub-chips are the sub-chip 14 and the sub-chip 9 .
  • the chiplet 14 is located in the fourth area, and the chiplet 9 is located on the fourth direction axis.
  • since the fourth direction axis is a boundary direction axis of the fourth area, the chiplet 5 can send the received data packet along the direction of the fourth direction axis, that is, to the chiplet 9.
  • after the chiplet 9 receives the data packet, it stores the data in the data packet, and also copies the data to generate a new data packet.
  • the new data packet includes the identification of the sub-chip 14 , and then the new data packet is sent to the sub-chip 13 or the sub-chip 10 , and then forwarded to the sub-chip 14 by the sub-chip 13 or the sub-chip 10 .
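  • the area-plus-boundary-axis case can be sketched in two parts: the first sub-chip checks that the axis carrying one destination also bounds the other destination's area, and the on-path destination then keeps a copy and re-emits the packet for the remaining destination. All names in the sketch are hypothetical, and the packet is modelled as a plain dictionary.

```python
BOUNDS = {"area1": {"axis1", "axis3"}, "area2": {"axis2", "axis3"},
          "area3": {"axis2", "axis4"}, "area4": {"axis1", "axis4"}}

def shared_route_axis(area, axis):
    """Return `axis` if it bounds `area`, meaning one packet can serve both destinations."""
    return axis if axis in BOUNDS[area] else None

def consume_and_forward(packet, my_id):
    """At a destination on the path: keep the data and re-emit for remaining destinations."""
    stored = packet["data"] if my_id in packet["dest_ids"] else None
    remaining = [d for d in packet["dest_ids"] if d != my_id]
    new_packet = {"dest_ids": remaining, "data": packet["data"]} if remaining else None
    return stored, new_packet

# Sub-chip 5: destination 14 is in the fourth area, destination 9 on the fourth axis.
axis = shared_route_axis("area4", "axis4")
# Sub-chip 9: stores the data and regenerates a packet that names only sub-chip 14.
stored, new_packet = consume_and_forward({"dest_ids": [14, 9], "data": b"x"}, my_id=9)
```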
  • the above-mentioned first sub-chip may receive scheduling information from a central controller in the chip system, and correspondingly send data based on the scheduling information.
  • the central controller of the chip system may also be responsible for data scheduling in the chip system. Specifically, the central controller obtains the data transmission status of each sub-chip through the control bus, and by analyzing the data transmission status it can learn the congestion status of each transmission path and/or the port congestion status of each sub-chip. Based on this information, it formulates a data transmission strategy and sends it to each sub-chip in the form of scheduling information. Each sub-chip sends the corresponding data based on the scheduling information issued by the controller, thereby reducing the probability of congestion and improving data transmission efficiency.
  • the scheduling information sent by the central controller to a sub-chip may include identifiers of one or more sub-chips and identifiers of ports corresponding to the one or more sub-chips.
  • after receiving the scheduling information, the chiplet updates the information into its own forwarding mapping table for subsequent data forwarding. For ease of understanding, refer to FIG. 16 as an example.
  • Fig. 16 exemplarily shows a structural diagram of a system-on-a-chip. Assuming that the sub-chip 0 is used as the central controller of the system-on-a-chip, the sub-chip 0 can communicate with the other sub-chips in the system-on-a-chip through a control bus (not shown in Fig. 16). Specifically, the sub-chip 0 can collect information such as the amount of data to be sent at the ports of each sub-chip through the control bus; based on this information, the congestion situation of each port can be analyzed, and then the congestion situation of each transmission path can be analyzed. Based on the results of these analyses, sub-chip 0 can comprehensively formulate a data transmission strategy for the chip system, which is sent to each sub-chip in the form of scheduling information.
  • the chiplet 0 finds that the port d2 of the chiplet 1 is relatively idle, and the port d1 of the chiplet 5 is also relatively idle. Then, the sub-chip 0 sends a scheduling message to the sub-chip 1, and the scheduling information includes the identification of the sub-chip 6 and the identification of the port d2. After sub-chip 1 receives the scheduling information, it updates the information that the destination sub-chip is sub-chip 6 and the corresponding sending port is port d2 into its own forwarding mapping table. In addition, the sub-chip 0 sends a scheduling message to the sub-chip 5, and the scheduling information includes the identification of the sub-chip 6 and the identification of the port d1.
  • after receiving the scheduling information, the sub-chip 5 updates the information that the destination sub-chip is the sub-chip 6 and the corresponding sending port is the port d1 into its own forwarding mapping table. Then, when sub-chip 1 receives a data packet destined for sub-chip 6, it queries its own forwarding mapping table and learns that the sending port is d2, so it sends the data packet through port d2. After the data packet arrives at the sub-chip 5, the sub-chip 5 queries its own forwarding mapping table and learns that the sending port is d1, so it sends the data packet through the port d1 and delivers the data packet to the sub-chip 6.
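  • the FIG. 16 example can be replayed with a minimal sketch in which scheduling information is a mapping from destination sub-chip identifier to port identifier that each sub-chip merges into its forwarding mapping table. The class `Chiplet` and its method names are assumptions for illustration only.

```python
class Chiplet:
    def __init__(self, name):
        self.name = name
        self.forwarding_table = {}          # destination id -> sending port

    def apply_scheduling(self, info):
        """Merge scheduling information (destination id -> port id) into the table."""
        self.forwarding_table.update(info)

    def port_for(self, dest_id):
        return self.forwarding_table[dest_id]

chiplet1, chiplet5 = Chiplet(1), Chiplet(5)
# Scheduling information issued by the central controller (sub-chip 0 in the example).
chiplet1.apply_scheduling({6: "d2"})
chiplet5.apply_scheduling({6: "d1"})
# A packet destined for sub-chip 6 leaves sub-chip 1 via d2, then sub-chip 5 via d1.
route = [(chiplet1.name, chiplet1.port_for(6)), (chiplet5.name, chiplet5.port_for(6))]
```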
  • the embodiment of the present application transmits the received data based on the data transmission situation in the chip system, so as to flexibly schedule the data transmission, improve the data transmission efficiency, and further improve the processing performance of the chip system.
  • each device includes a corresponding hardware structure and/or software module for performing each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • the embodiments of the present application may divide the device into functional modules according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules. It should be noted that the division of modules in this embodiment of the present application is schematic, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 17 shows a schematic diagram of a specific logical structure of the device, which may be the above-mentioned source sub-chip.
  • the device 1700 includes:
  • a receiving unit 1701 configured to receive configuration parameters
  • the configuration unit 1702 is configured to configure a preset stream output table according to the aforementioned configuration parameters; the aforementioned stream output table includes the identification of the data stream and the identification of the target sub-chip of the aforementioned data stream; the aforementioned device 1700 and the aforementioned target sub-chip are sub-chips among the multiple sub-chips included in a chip system, and the aforementioned multiple sub-chips are connected in a preset topology;
  • an execution unit 1703 configured to execute a first data processing task to obtain output data
  • a generating unit 1704 configured to generate multiple data packets from the aforementioned output data based on the aforementioned stream output table, where the aforementioned data packets include the identifier of the aforementioned data stream and the identifier of the aforementioned destination chiplet;
  • a sending unit 1705 configured to send the aforementioned multiple data packets.
  • the aforementioned stream output table further includes one or more of the following: the identifier of the aforementioned first data processing task, indication information of the number of data packets of the aforementioned data stream that have been sent, and the start address in the aforementioned device 1700 used to store the data included in the aforementioned data stream; the aforementioned data packet also includes the identification of the aforementioned first data processing task.
  • the aforementioned receiving unit 1701 is further configured to receive an unblocking data packet before the aforementioned generating unit 1704 generates multiple data packets from the aforementioned output data based on the aforementioned stream output table; wherein the aforementioned unblocking data packet includes the identification of the aforementioned data stream, and is used to indicate to the aforementioned device 1700 that the aforementioned destination chiplet is ready to receive the aforementioned data stream.
  • the aforementioned sending unit 1705 is specifically configured to: send the aforementioned multiple data packets based on a port forwarding mapping table; wherein, the aforementioned port forwarding mapping table includes a mapping relationship between the identification of the aforementioned destination sub-chip and the sending port.
  • the aforementioned data packet when there are multiple destination chiplets for the aforementioned data stream, the aforementioned data packet includes identifiers of the aforementioned multiple destination chiplets.
  • the aforementioned chip system includes a subsystem, the aforementioned subsystem includes at least two sub-chips, and the aforementioned subsystem is configured with a subsystem identifier; when the aforementioned target sub-chip is a sub-chip of the aforementioned subsystem, the aforementioned data packet Also includes the identification of the aforementioned subsystems.
  • FIG. 18 shows a specific logical structural diagram of the device, which may be the above-mentioned target sub-chip.
  • the device 1800 includes:
  • the configuration unit 1802 is configured to configure a preset stream input table according to the configuration parameters, where the stream input table includes at least one identifier of a data stream to be received; the device 1800 is a sub-chip among the multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology;
  • a judging unit 1803 configured to, when a data packet is received, judge whether the stream input table contains the data stream identifier in the data packet;
  • the storage unit 1804 is configured to store the data in the data packet when the stream input table contains the data stream identifier in the data packet.
  • the stream input table further includes one or more of: the identifier of the first data processing task, indication information of the number of data packets of the data stream that have been received, and a start address in the device 1800 used to store the data included in the data stream; the data packet also includes the identifier of the first data processing task.
  • the device 1800 further includes a sending unit configured to send an unblocking data packet before the receiving unit 1801 receives the data packet; the unblocking data packet includes the identifier of the data stream and the identifier of the source sub-chip, and indicates to the source sub-chip that the device 1800 is ready to receive the data stream; the source sub-chip is the sub-chip in the chip system that sends the data packet.
  • the chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; the data packet further includes the subsystem identifier.
  • FIG. 19 shows a specific logical structural diagram of the device, which may be the above-mentioned central controller.
  • the device 1900 includes:
  • the allocation unit 1901 is configured to allocate a first data processing task to the source sub-chip, where the data obtained after the first data processing task is executed is sent to the destination sub-chip in the form of a data stream; the device 1900, the source sub-chip and the destination sub-chip are sub-chips among the multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology;
  • a configuration unit 1902 configured to configure an identifier for the data stream;
  • the sending unit 1903 is configured to send the identifier of the data stream and the identifier of the destination sub-chip to the source sub-chip; the identifier of the data stream and the identifier of the destination sub-chip are stored in association in the stream output table of the source sub-chip, and the stream output table is the basis on which the source sub-chip sends data.
  • the allocation unit 1901 is further configured to allocate a second data processing task to the destination sub-chip, where the second data processing task is executed based on the data obtained after the first data processing task is executed;
  • the sending unit 1903 is further configured to send the identifier of the data stream to the destination sub-chip; the identifier of the data stream is stored in the stream input table of the destination sub-chip, and the stream input table is the basis on which the destination sub-chip receives data.
  • the device 1900 further includes:
  • an acquisition unit configured to acquire the data transmission status between sub-chips in the chip system;
  • a generating unit configured to generate scheduling information for the source sub-chip based on the data transmission status, where the scheduling information indicates the sending port of the source sub-chip from which the data stream is sent to the destination sub-chip;
  • the sending unit 1903 is further configured to send the scheduling information to the source sub-chip.
  • FIG. 20 is a schematic diagram of a specific hardware structure of the device provided by the present application.
  • the device 2000 includes: a processor 2001, a memory 2002 and a communication port 2003.
  • the processor 2001, the communication port 2003 and the memory 2002 may be connected to each other, or connected through a bus 2004.
  • the memory 2002 is used to store computer programs and data of the device 2000, and the memory 2002 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or compact disc read-only memory (CD-ROM).
  • the communication port 2003 includes a sending port and a receiving port.
  • the number of the communication port 2003 may be multiple, and is used to support the device 2000 to communicate, for example, to receive or send data or messages.
  • the communication port 2003 may be the ports d0, d1, d2 and d3 shown in FIG. 2 above.
  • the processor 2001 may be a central processing unit, a general processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component or any combination thereof.
  • the processor can also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
  • the processor 2001 may be the processing module shown in FIG. 2 above.
  • in the case where the device 2000 is the source sub-chip in FIG. 8 and its possible implementations, the processor 2001 in the device 2000 can read the program stored in the memory 2002 so that the device 2000 executes the operations performed by the source sub-chip as described in FIG. 8 and its specific embodiments.
  • in the case where the device 2000 is the destination sub-chip in FIG. 8 and its possible implementations, the processor 2001 in the device 2000 can read the program stored in the memory 2002 so that the device 2000 executes the operations performed by the destination sub-chip as described in FIG. 8 and its specific embodiments.
  • in the case where the device 2000 is the central controller in FIG. 8 and its possible implementations, the processor 2001 in the device 2000 can read the program stored in the memory 2002 so that the device 2000 executes the operations performed by the central controller as described in FIG. 8 and its specific embodiments.
  • the embodiment of the present application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the operations performed by the source sub-chip in FIG. 8 and any of its possible method embodiments are implemented.
  • the embodiment of the present application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the operations performed by the destination sub-chip in FIG. 8 and any of its possible method embodiments are implemented.
  • the embodiment of the present application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the operations performed by the central controller in FIG. 8 and any of its possible method embodiments are implemented.
  • the embodiment of the present application also provides a computer program product.
  • when the computer program product is read and executed by a computer, the operations performed by the source sub-chip in FIG. 8 and any of its possible method embodiments are implemented.
  • the embodiment of the present application also provides a computer program product.
  • when the computer program product is read and executed by a computer, the operations performed by the destination sub-chip in FIG. 8 and any of its possible method embodiments are implemented.
  • the embodiment of the present application also provides a computer program product.
  • when the computer program product is read and executed by a computer, the operations performed by the central controller in FIG. 8 and any of its possible method embodiments are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

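A data transmission processing method in a chip system and a related apparatus. The method includes: a source sub-chip receives configuration parameters and configures a preset stream output table according to the configuration parameters; the stream output table includes an identifier of a data stream and an identifier of a destination sub-chip of the data stream; the source sub-chip and the destination sub-chip are sub-chips among multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology; the source sub-chip executes a first data processing task to obtain output data; the source sub-chip generates multiple data packets from the output data based on the stream output table and sends them, where each data packet includes the identifier of the data stream and the identifier of the destination sub-chip. The method enables efficient data transmission between sub-chips in the chip system and improves the processing performance of the chip system.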

Description

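Data transmission processing method in a chip system and related apparatus
Technical Field
The present invention relates to the field of communication technologies, and in particular to a data transmission processing method in a chip system and a related apparatus.
This application claims priority to Chinese patent application No. 202111633371.X, filed with the China Patent Office on December 28, 2021 and entitled "Data transmission processing method in a chip system and related apparatus", the entire contents of which are incorporated herein by reference.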
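Background
A chip system may include multiple sub-chips, each capable of processing data independently, and the sub-chips are connected in a certain topology so that they can communicate with one another. The sub-chips may also cooperate on a single large computing task by means of model parallelism to improve processing efficiency. During cooperative processing, the sub-chips need to exchange data frequently, and the efficiency of this data transmission affects the processing performance of the whole chip system.
Technical Problem
Embodiments of this application disclose a data transmission processing method in a chip system and a related apparatus, which enable efficient data transmission between sub-chips in the chip system and improve the processing performance of the chip system.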
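In a first aspect, this application provides a data transmission processing method in a chip system, the method including:
A source sub-chip receives configuration parameters and configures a preset stream output table according to the configuration parameters; the stream output table includes an identifier of a data stream and an identifier of a destination sub-chip of the data stream; the source sub-chip and the destination sub-chip are sub-chips among multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology;
the source sub-chip executes a first data processing task to obtain output data;
the source sub-chip generates multiple data packets from the output data based on the stream output table and sends them, where each data packet includes the identifier of the data stream and the identifier of the destination sub-chip.
In this application, the stream output table in the source sub-chip is configured in advance, so the source sub-chip can generate and send data packets quickly by looking up the configured table. This improves sending efficiency, enables efficient data transmission between sub-chips in the chip system, and improves the processing performance of the chip system.
In a possible implementation, the stream output table further includes one or more of: an identifier of the first data processing task, indication information of the number of data packets of the data stream that have been sent, and a start address in the source sub-chip used to store the data included in the data stream; the data packet further includes the identifier of the first data processing task.
In this application, the task identifier in the stream output table distinguishes the information of different tasks, and the generated data packets carry the corresponding task identifier to indicate the task to which their data belongs. The packet-count indication and the start address in the stream output table allow the storage address of the data to be sent to be computed quickly, so the data can be fetched, packed and sent quickly.
In a possible implementation, before the source sub-chip generates multiple data packets from the output data based on the stream output table and sends them, the method further includes: the source sub-chip receives an unblocking data packet, where the unblocking data packet includes the identifier of the data stream and indicates to the source sub-chip that the destination sub-chip is ready to receive the data stream.
In this application, the destination sub-chip sends the unblocking data packet to the source sub-chip only after it is ready to receive data, which avoids problems such as packet loss caused by data arriving before the destination sub-chip is ready.
In a possible implementation, the source sub-chip sending the multiple data packets includes: the source sub-chip sends the multiple data packets based on a port forwarding mapping table, where the port forwarding mapping table includes a mapping between the identifier of the destination sub-chip and a sending port.
In this application, the sending port of a data packet is determined by looking up the port forwarding mapping table, which enables fast packet transmission.
In a possible implementation, when the data stream has multiple destination sub-chips, the data packet includes the identifiers of the multiple destination sub-chips.
In this application, a data packet can carry the identifiers of multiple destination sub-chips. Compared with the existing approach of sending one data packet per destination, this reduces the number of data packets sent and saves transmission bandwidth.
In a possible implementation, the chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; when the destination sub-chip is a sub-chip of the subsystem, the data packet further includes the subsystem identifier.
In this application, the subsystem identifier allows the destination of a data packet to be located quickly, which improves data transmission efficiency.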
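In a second aspect, this application provides a data transmission processing method in a chip system, the method including:
A destination sub-chip receives configuration parameters and configures a preset stream input table according to the configuration parameters, where the stream input table includes at least one identifier of a data stream to be received; the destination sub-chip is a sub-chip among multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology;
when receiving a data packet, the destination sub-chip determines whether the stream input table contains the data stream identifier in the data packet;
when the stream input table contains the data stream identifier in the data packet, the destination sub-chip stores the data in the data packet.
In this application, the stream input table in the destination sub-chip is configured in advance, so the destination sub-chip can receive and process data packets quickly by looking up the configured table. This improves receiving efficiency, enables efficient data transmission between sub-chips in the chip system, and improves the processing performance of the chip system.
In a possible implementation, the stream input table further includes one or more of: the identifier of the first data processing task, indication information of the number of data packets of the data stream that have been received, and a start address in the destination sub-chip used to store the data included in the data stream; the data packet further includes the identifier of the first data processing task.
In this application, the task identifier in the stream input table distinguishes the data packets of different tasks. The packet-count indication and the start address in the stream input table allow the storage address of the data in a received data packet to be computed quickly, so the data can be stored quickly.
In a possible implementation, before the destination sub-chip receives the data packet, the method further includes: the destination sub-chip sends an unblocking data packet, where the unblocking data packet includes the identifier of the data stream and the identifier of the source sub-chip and indicates to the source sub-chip that the destination sub-chip is ready to receive the data stream; the source sub-chip is the sub-chip in the chip system that sends the data packet.
In this application, the destination sub-chip sends the unblocking data packet to the source sub-chip only after it is ready to receive data, which avoids problems such as packet loss caused by data arriving before the destination sub-chip is ready.
In a possible implementation, the chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; the data packet further includes the subsystem identifier.
In this application, the subsystem identifier allows the destination of a data packet to be located quickly, which improves data transmission efficiency.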
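In a third aspect, this application provides a data transmission processing method in a chip system, the method including:
A controller assigns a first data processing task to a source sub-chip, where the data obtained after the first data processing task is executed is sent to a destination sub-chip in the form of a data stream; the controller, the source sub-chip and the destination sub-chip are sub-chips among multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology;
the controller configures an identifier for the data stream;
the controller sends the identifier of the data stream and the identifier of the destination sub-chip to the source sub-chip, where the identifier of the data stream and the identifier of the destination sub-chip are stored in association in the stream output table of the source sub-chip, and the stream output table is the basis on which the source sub-chip sends data.
In this application, the controller assigns the task to the source sub-chip and configures its stream output table, so the source sub-chip can generate and send data packets quickly by looking up the configured table. This improves sending efficiency, enables efficient data transmission between sub-chips in the chip system, and improves the processing performance of the chip system.
In a possible implementation, the method further includes:
the controller assigns a second data processing task to the destination sub-chip, where the second data processing task is executed based on the data obtained after the first data processing task is executed;
the controller sends the identifier of the data stream to the destination sub-chip, where the identifier of the data stream is stored in the stream input table of the destination sub-chip, and the stream input table is the basis on which the destination sub-chip receives data.
In this application, the controller assigns the task to the destination sub-chip and configures its stream input table, so the destination sub-chip can receive and process data packets quickly by looking up the configured table. This improves receiving efficiency, enables efficient data transmission between sub-chips in the chip system, and improves the processing performance of the chip system.
In a possible implementation, the method further includes:
the controller obtains the data transmission status between sub-chips in the chip system;
the controller generates scheduling information for the source sub-chip based on the data transmission status, where the scheduling information indicates the sending port of the source sub-chip from which the data stream is sent to the destination sub-chip;
the controller sends the scheduling information to the source sub-chip.
In this application, the controller schedules the sending ports of data packets based on the data transmission status of the whole chip system, which keeps data packets off congested paths and improves transmission efficiency.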
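In a fourth aspect, this application provides a source sub-chip, the source sub-chip including:
a receiving unit, configured to receive configuration parameters;
a configuration unit, configured to configure a preset stream output table according to the configuration parameters; the stream output table includes an identifier of a data stream and an identifier of a destination sub-chip of the data stream; the source sub-chip and the destination sub-chip are sub-chips among multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology;
an execution unit, configured to execute a first data processing task to obtain output data;
a generating unit, configured to generate multiple data packets from the output data based on the stream output table, where each data packet includes the identifier of the data stream and the identifier of the destination sub-chip;
a sending unit, configured to send the multiple data packets.
In a possible implementation, the stream output table further includes one or more of: the identifier of the first data processing task, indication information of the number of data packets of the data stream that have been sent, and a start address in the source sub-chip used to store the data included in the data stream;
the data packet further includes the identifier of the first data processing task.
In a possible implementation, the receiving unit is further configured to, before the generating unit generates multiple data packets from the output data based on the stream output table,
receive an unblocking data packet, where the unblocking data packet includes the identifier of the data stream and indicates to the source sub-chip that the destination sub-chip is ready to receive the data stream.
In a possible implementation, the sending unit is specifically configured to:
send the multiple data packets based on a port forwarding mapping table, where the port forwarding mapping table includes a mapping between the identifier of the destination sub-chip and a sending port.
In a possible implementation, when the data stream has multiple destination sub-chips, the data packet includes the identifiers of the multiple destination sub-chips.
In a possible implementation, the chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier;
when the destination sub-chip is a sub-chip of the subsystem, the data packet further includes the subsystem identifier.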
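In a fifth aspect, this application provides a destination sub-chip, the destination sub-chip including:
a receiving unit, configured to receive configuration parameters;
a configuration unit, configured to configure a preset stream input table according to the configuration parameters, where the stream input table includes at least one identifier of a data stream to be received; the destination sub-chip is a sub-chip among multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology;
a judging unit, configured to, when a data packet is received, determine whether the stream input table contains the data stream identifier in the data packet;
a storage unit, configured to store the data in the data packet when the stream input table contains the data stream identifier in the data packet.
In a possible implementation, the stream input table further includes one or more of: the identifier of the first data processing task, indication information of the number of data packets of the data stream that have been received, and a start address in the destination sub-chip used to store the data included in the data stream;
the data packet further includes the identifier of the first data processing task.
In a possible implementation, the destination sub-chip further includes a sending unit, configured to, before the receiving unit receives the data packet,
send an unblocking data packet, where the unblocking data packet includes the identifier of the data stream and the identifier of the source sub-chip and indicates to the source sub-chip that the destination sub-chip is ready to receive the data stream; the source sub-chip is the sub-chip in the chip system that sends the data packet.
In a possible implementation, the chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier;
the data packet further includes the subsystem identifier.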
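In a sixth aspect, this application provides a controller, the controller including:
an allocation unit, configured to assign a first data processing task to a source sub-chip, where the data obtained after the first data processing task is executed is sent to a destination sub-chip in the form of a data stream; the controller, the source sub-chip and the destination sub-chip are sub-chips among multiple sub-chips included in the chip system, and the multiple sub-chips are connected in a preset topology;
a configuration unit, configured to configure an identifier for the data stream;
a sending unit, configured to send the identifier of the data stream and the identifier of the destination sub-chip to the source sub-chip, where the identifier of the data stream and the identifier of the destination sub-chip are stored in association in the stream output table of the source sub-chip, and the stream output table is the basis on which the source sub-chip sends data.
In a possible implementation, the allocation unit is further configured to assign a second data processing task to the destination sub-chip, where the second data processing task is executed based on the data obtained after the first data processing task is executed;
the sending unit is further configured to send the identifier of the data stream to the destination sub-chip, where the identifier of the data stream is stored in the stream input table of the destination sub-chip, and the stream input table is the basis on which the destination sub-chip receives data.
In a possible implementation, the controller further includes:
an obtaining unit, configured to obtain the data transmission status between sub-chips in the chip system;
a generating unit, configured to generate scheduling information for the source sub-chip based on the data transmission status, where the scheduling information indicates the sending port of the source sub-chip from which the data stream is sent to the destination sub-chip;
the sending unit is further configured to send the scheduling information to the source sub-chip.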
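In a seventh aspect, this application provides a sub-chip including a processor, a memory and a communication port, where the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to invoke the computer program so that the sub-chip performs the method of any implementation of the first aspect;
the sub-chip is a sub-chip among multiple sub-chips included in a chip system, and the multiple sub-chips are connected in a preset topology.
In an eighth aspect, this application provides a sub-chip including a processor, a memory and a communication port, where the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to invoke the computer program so that the sub-chip performs the method of any implementation of the second aspect;
the sub-chip is a sub-chip among multiple sub-chips included in a chip system, and the multiple sub-chips are connected in a preset topology.
In a ninth aspect, this application provides a sub-chip including a processor, a memory and a communication port, where the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to invoke the computer program so that the sub-chip performs the method of any implementation of the third aspect;
the sub-chip is a sub-chip among multiple sub-chips included in a chip system, and the multiple sub-chips are connected in a preset topology.
In a tenth aspect, this application provides a chip system including a source sub-chip, a destination sub-chip and a controller, where the source sub-chip is the source sub-chip of any implementation of the fourth aspect, the destination sub-chip is the destination sub-chip of any implementation of the fifth aspect, and the controller is the controller of any implementation of the sixth aspect; or
the source sub-chip is the sub-chip of the seventh aspect, the destination sub-chip is the sub-chip of the eighth aspect, and the controller is the sub-chip of the ninth aspect.
In an eleventh aspect, this application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the method of any implementation of the first aspect is implemented; or
when the computer program is executed by a processor, the method of any implementation of the second aspect is implemented; or
when the computer program is executed by a processor, the method of any implementation of the third aspect is implemented.
In a twelfth aspect, this application provides a computer program product including a computer program; when the computer program is executed by a processor, the method of any implementation of the first aspect is implemented; or
when the computer program is executed by a processor, the method of any implementation of the second aspect is implemented; or
when the computer program is executed by a processor, the method of any implementation of the third aspect is implemented.
It can be understood that the fourth to twelfth aspects all correspond to, and are used to perform, the methods provided in any of the first to third aspects. For their beneficial effects, refer to those of the corresponding methods; they are not repeated here.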
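Brief Description of the Drawings
The drawings used in the embodiments of this application are introduced below.
FIG. 1 is a schematic diagram of a chip system provided by this application;
FIG. 2 is a schematic structural diagram of a sub-chip provided by this application;
FIG. 3 to FIG. 6 are schematic diagrams of chip systems provided by this application;
FIG. 7 is a schematic diagram of the division into sub-chip groups provided by this application;
FIG. 8 is a schematic flowchart of the data transmission processing method in a chip system provided by this application;
FIG. 9A and FIG. 9B are schematic diagrams of data processing flows provided by this application;
FIG. 10 is a schematic diagram of a data packet structure provided by this application;
FIG. 11 is a schematic diagram of the ports of sub-chips in a chip system provided by this application;
FIG. 12 is a schematic structural diagram of a data packet provided by this application;
FIG. 13 is a schematic flowchart of a routing implementation provided by this application;
FIG. 14 is a schematic diagram of a direction coordinate system provided by this application;
FIG. 15 is a schematic diagram of a direction coordinate system built around a sub-chip provided by this application;
FIG. 16 is a schematic diagram of the ports of sub-chips in a chip system provided by this application;
FIG. 17 to FIG. 19 are schematic structural diagrams of virtual apparatuses provided by this application;
FIG. 20 is a schematic structural diagram of a physical apparatus provided by this application.
Embodiments of the Invention
The embodiments of this application are described below with reference to the drawings.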
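FIG. 1 is a schematic structural diagram of a chip system provided by an embodiment of this application. The chip system 110 includes multiple sub-chips (FIG. 1 shows 16 sub-chips as an example), and the sub-chips are connected according to a preset topology. For example, the 16 sub-chips in FIG. 1 may be arranged as a matrix, with each sub-chip connected to two, three or four neighboring sub-chips.
Each sub-chip has its own memory; FIG. 1 shows the memory of some sub-chips as an example. The memory may be, for example, synchronous dynamic random access memory (SDRAM) or double data rate SDRAM (DDR SDRAM), which may be abbreviated as DDR. Each sub-chip in the chip system 110 has complete processing capability and can execute tasks independently. Multiple sub-chips in the chip system 110 can, of course, also cooperate to execute large processing tasks.
Referring to FIG. 2, FIG. 2 shows an example structure of a sub-chip in the chip system 110. The sub-chip may be organized as a network-on-chip (NoC). As shown, the sub-chip may include a processing module, a routing module, a static memory, a memory controller and four ports (d0, d1, d2 and d3).
The processing module is the control unit (CU) of the sub-chip and manages the processing flows in the sub-chip.
The routing module is responsible for data synchronization inside the sub-chip, data synchronization between sub-chips, data broadcasting and data transmission. The routing module also includes a control unit that manages the routing flow in the routing module, and a local buffer that can temporarily store data to be processed. The routing module further includes a forwarding-port mapper (FPM), which may be a hardware module or a software module. The FPM stores a port forwarding mapping table that includes the mapping between destination sub-chips and sending ports and can be used to map a data packet to the corresponding port for sending. The routing module stores a stream in table (SIT, the stream input table) and a stream out table (SOT, the stream output table), which are used for data transmission between sub-chips and are described in detail later.
The static memory may be, for example, static random-access memory (SRAM), and is used to store data in the sub-chip.
The memory controller is connected to the memory of the sub-chip and may be, for example, a DDR controller.
The four ports (d0, d1, d2 and d3) are the network interfaces of the sub-chip and carry data between sub-chips; the connections between sub-chips are made through these four ports. Optionally, a sub-chip may include only two such ports, for example sub-chip 0, sub-chip 3, sub-chip 12 and sub-chip 15 in FIG. 1, or only three such ports, for example sub-chip 4 in FIG. 1.
The chip system provided by the embodiments of this application is not limited to the structure shown in FIG. 1 and may have other structures, for example that of FIG. 3. FIG. 3 shows a chip system 120 that likewise includes multiple sub-chips (FIG. 3 shows 8 sub-chips as an example). The sub-chips of the chip system 120 are arranged and connected in the form of a cuboid, which keeps the data transmission paths between sub-chips as short as possible. For the structure of the sub-chips in the chip system 120, see the description of FIG. 2 above.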
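In a possible implementation, a very large task needs more sub-chips working together to improve processing efficiency. In that case the chip system 110 or the chip system 120 can serve as one subsystem, and several such subsystems can form a larger chip system, for example the chip system 130 shown in FIG. 4.
The chip system 130 may include multiple such subsystems (FIG. 4 shows 8 subsystems as an example), each of which may be the chip system 110 or the chip system 120. Each subsystem can be regarded as a whole, and the subsystems can be connected according to a preset topology, for example arranged and connected in the form of a cuboid as shown in FIG. 4. To make the connections between the subsystems of the chip system 130 easier to understand, FIG. 5 shows the connection structure for the case where each subsystem is the chip system 110.
As FIG. 5 shows, the chip system 130 includes 8 subsystems, each with 16 sub-chips that may be arranged and connected as a matrix. Two adjacent subsystems can be connected by linking any sub-chip of one subsystem to any sub-chip of the other. In FIG. 5, the sub-chips at the corners of each matrix are used, as an example, to connect to other subsystems. For subsystem 0, for instance, the connection to subsystem 1 is made between sub-chip 3 of subsystem 0 and sub-chip 0 of subsystem 1; the connection to subsystem 2 is made between sub-chip 12 of subsystem 0 and sub-chip 0 of subsystem 2; and the connection to subsystem 4 is made between sub-chip 0 of subsystem 0 and sub-chip 0 of subsystem 4.
In a possible implementation, a control bus may also be provided between the subsystems; see for example the chip system 140 shown in FIG. 6. The chip system 140 includes a central controller (which may be a sub-chip or a control module of the chip system 140) that manages the task processing flow in the chip system 140. The control bus is connected to the central controller so that each subsystem can receive control instructions from the central controller. In a specific implementation, one sub-chip of each subsystem may be connected to the control bus and forward received control instructions to the relevant sub-chips in the same subsystem. Alternatively, every sub-chip of every subsystem may be connected to the control bus to receive control instructions directly. This application does not limit how the controller is connected.
Besides the chip system 140, the other chip systems provided by the embodiments of this application (for example the chip system 110, the chip system 120 and the chip system 130) also include a central controller that manages the task processing flow of the whole chip system. Likewise, the central controller may be a sub-chip or a control module of the chip system. The central controller can obtain the load and resource usage of all sub-chips in the chip system and assign tasks to the sub-chips accordingly. For example, a task scheduler in the central controller may assign tasks to the sub-chips based on their load, resource usage and similar information.
In the chip system provided by the embodiments of this application, every sub-chip can process data independently. For a large data task, however, a single sub-chip is inefficient. To improve processing efficiency, the sub-chips of the chip system can be divided into multiple sub-chip groups, each including at least one sub-chip, and data tasks can then be processed with the sub-chip group as the processing unit. See FIG. 7 for an illustration of sub-chip groups.
FIG. 7 takes the chip system 110 as an example and divides its 16 sub-chips into 9 sub-chip groups as shown in the figure, each including at least one sub-chip. The sub-chips of one group may be adjacent, as in sub-chip group 3, sub-chip group 4, sub-chip group 7 and sub-chip group 8, or non-adjacent, as in sub-chip group 2, which consists of the non-adjacent sub-chip 1 and sub-chip 12.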
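The chip system provided by the embodiments of this application may process data tasks using data parallelism, model parallelism, or model parallelism combined with data parallelism:
Data parallelism means dividing the data to be processed into several data blocks and assigning the blocks to different sub-chip groups, each of which runs the same processing program on its assigned data. For example, suppose the data is divided into 3 blocks and 3 sub-chip groups can run the same processing program. The 1st block is then sent to the 1st sub-chip group, the 2nd block to the 2nd sub-chip group, and the 3rd block to the 3rd sub-chip group.
Model parallelism means that multiple sub-chip groups complete one data processing task together, with each group executing only some of the steps of the task (one or more processing steps). For example, suppose a task takes 3 steps. Two sub-chip groups can be configured: the 1st group performs the first 2 steps, and the 2nd group obtains the processed data from the 1st group and performs the 3rd step. Alternatively, three sub-chip groups can be configured: the 1st group performs step 1, the 2nd group obtains the processed data from the 1st group and performs step 2, and the 3rd group obtains the processed data from the 2nd group and performs step 3. Each group may thus perform one or more steps, depending on the load and resource usage of the groups.
Model parallelism combined with data parallelism uses both approaches. For example, suppose a task takes 3 steps and three sub-chip groups are configured, with the 1st, 2nd and 3rd groups performing steps 1, 2 and 3 respectively, each on the data produced by the previous group. If step 1 is complex and time-consuming, one or more additional sub-chip groups can be configured to share it. For example, a 4th sub-chip group can be added to perform step 1 together with the 1st group: the input data of step 1 is split into two parts, one sent to the 1st group and the other to the 4th group, and the data processed by both groups is then sent to the 2nd group for step 2.
It should be noted that, in the combined approach, data parallelism may be applied to every processing step or only to some of them, depending on the specific implementation; this application does not limit this.
In a specific implementation, the central controller of the chip system can assign data processing tasks to the sub-chip groups. Processing data tasks with model parallelism, or with model parallelism combined with data parallelism, requires data to be transmitted between sub-chips, and this transmission introduces delay that lowers processing efficiency. To enable efficient data transmission between sub-chips in the chip system and improve its processing performance, the embodiments of this application provide a data transmission processing method in a chip system.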
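Referring to FIG. 8, the data transmission processing method provided by the embodiments of this application includes, but is not limited to, the following steps:
S801. The source sub-chip receives configuration parameters and configures a preset stream output table according to the configuration parameters; the stream output table includes the identifier of a data stream and the identifier of the destination sub-chip of the data stream; the source sub-chip executes a first data processing task to obtain output data.
In a specific implementation, when a data task is processed with model parallelism or with model parallelism combined with data parallelism, the source sub-chip may be any sub-chip of a first sub-chip group, which is the sub-chip group of the chip system that executes the processing task of a first step of a target task. The first step may include one or more processing steps of the target task. The chip system may be the chip system 110, 120, 130 or 140 described above.
The stream output table is initialized and stored in the source sub-chip in advance and is the basis on which the source sub-chip sends data. Specifically, it includes the identifier of the data stream and the identifier of the destination sub-chip to which the data stream is sent. In a possible implementation, it may further include one or more of: the identifier of the task to which the data in the data stream belongs, indication information of the number of data packets of the data stream that have been sent, and the start address in the source sub-chip used to store the data included in the data stream.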
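Regarding data streams: when one sub-chip sends data to another, the data is encapsulated into multiple data packets that are numbered in order and sent, and these consecutively sent packets form a data stream. In one possible implementation, each data packet carries 1 kb of data; if the total data to be transmitted is 64 kb, it is split and encapsulated into 64 data packets, and these 64 packets form one data stream.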
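The packetization described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the `DataPacket` class, the `packetize` function and the reading of "1 kb" as 1024 bytes are assumptions made for the example.

```python
from dataclasses import dataclass

PAYLOAD_BYTES = 1024  # assumption: "1 kb" of data per packet read as 1024 bytes


@dataclass
class DataPacket:
    sid: int        # identifier of the data stream the packet belongs to
    seq: int        # sequence number of the packet within the stream
    payload: bytes  # slice of the data carried by this packet


def packetize(sid, data, payload_bytes=PAYLOAD_BYTES):
    """Split `data` into consecutively numbered packets that form one stream."""
    return [
        DataPacket(sid, seq, data[offset:offset + payload_bytes])
        for seq, offset in enumerate(range(0, len(data), payload_bytes))
    ]


stream = packetize(sid=17, data=bytes(64 * 1024))
print(len(stream))  # 64 packets for 64 kb of data
```

The last packet of a stream may be shorter than the others when the data size is not a multiple of the payload size.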
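In a specific implementation, the central controller of the chip system can initialize the stream output table. As described above, the central controller assigns the data processing tasks to the sub-chip groups, so it can configure a task identifier for each data processing task to distinguish different tasks.
In addition, for a given data processing task, the central controller also configures which sub-chip groups execute its processing steps. The central controller therefore knows the direction of the data streams of the task and can configure a data stream identifier for each stream to distinguish different data streams. An example follows.
See FIG. 9A. Suppose a data processing task includes eight processing steps. The central controller can allocate the work based on the data volume of the task, the complexity of the eight steps, and the load and resource usage of the sub-chip groups in the chip system. For example, as shown in FIG. 9A, the central controller configures sub-chip group 1 to execute step 1, sub-chip group 2 to execute steps 2 and 3, sub-chip group 3 to execute step 4, sub-chip group 4 to execute steps 5, 6 and 7, and sub-chip group 5 to execute step 8. Sub-chip group 1 includes sub-chip 4; sub-chip group 2 includes sub-chip 5; sub-chip group 3 includes sub-chips 6 and 7; sub-chip group 4 includes sub-chip 8; and sub-chip group 5 includes sub-chips 0 and 9.
In FIG. 9A, suppose steps 2, 5 and 8 all need the data produced by step 1. Sub-chip group 1 then sends that data to sub-chip groups 2, 4 and 5. Because the same data is sent to all three groups, the data streams from sub-chip group 1 to sub-chip group 2, to sub-chip group 4 and to sub-chip group 5 share the same identifier, for example 17. The data obtained by sub-chip group 2 after steps 2 and 3 is sent to sub-chip group 3 for step 4, and this data stream may have the identifier 18. The data obtained by sub-chip group 3 after step 4 is sent to sub-chip group 5 for step 8, and this data stream may have the identifier 19. The data obtained by sub-chip group 4 after steps 5, 6 and 7 is sent to sub-chip group 5 for step 8, and this data stream may have the identifier 20.
In a specific implementation, data transmission between sub-chip groups is in fact data transmission between the sub-chips in those groups. For example, when sub-chip group 1 sends the data produced by step 1 to sub-chip group 5, sub-chip 4 of sub-chip group 1 actually sends that data to sub-chip 0 and sub-chip 9 of sub-chip group 5. The same applies to the other cases.
In another possible implementation, see FIG. 9B. Suppose steps 2, 5 and 8 all need data produced by step 1, but the data they need is different or not entirely the same. For example, step 2 needs data 1, step 5 needs data 2, and step 8 needs data 3 from the output of step 1. Sub-chip group 1 then sends data 1 to sub-chip group 2, data 2 to sub-chip group 4, and data 3 to sub-chip group 5. Because the data sent to these three groups differs, the corresponding data streams have different identifiers. For example, the data stream from sub-chip group 1 to sub-chip group 2 may have the identifier 15, the one to sub-chip group 4 the identifier 16, and the one to sub-chip group 5 the identifier 17. For the identifiers of the data streams between the other sub-chip groups, see the description of FIG. 9A above.
In a specific implementation, data passed between processing steps inside a sub-chip group needs no data stream identifier. In sub-chip group 2, for example, the data produced by step 2 is passed to the corresponding module of sub-chip 5 for step 3, and this transfer inside sub-chip 5 needs no data stream identifier. The same applies to data transfers inside sub-chip 8 of sub-chip group 4.
Based on the above, the central controller knows the direction of each data stream, that is, the identifier of its destination sub-chip. The central controller can store in association the identifier of the data processing task, the identifier of the corresponding data stream, the identifier of the source sub-chip of the data stream, the identifier of the destination sub-chip of the data stream, the step processed by the source sub-chip, and similar information. This associated information may be stored as a table, which may be called a stream table (ST). See Table 1 for an example.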
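Table 1
Figure 202816dest_path_image001
Based on FIG. 9B, Table 1 shows the information associated with step 1 of the data processing task; the information associated with the other steps is analogous. As Table 1 shows, the stream table includes the task identifier (task identity document, TID), the source sub-chip identifier (source mask, S_mask), the step executed by the source sub-chip, the destination sub-chip identifier (destination mask, D_mask), the step executed by the destination sub-chip, and the data stream identifier (stream identity document, SID).
The task identifier is the identifier of the corresponding data processing task. It may be, for example, 1 or some other identifying symbol; this application does not limit this.
The source sub-chip identifier and the step executed by the source sub-chip indicate which sub-chip executes step 1. The destination sub-chip identifier and the step executed by the destination sub-chip indicate where the data produced by step 1 is sent and which processing step is executed there. The data stream identifier identifies the data stream formed when the source sub-chip sends the data produced by step 1 to each destination sub-chip. The specific identifiers are shown in Table 1, but they are only examples and may be replaced by other symbols; in a program executed by a computer, these identifiers may be represented in binary, hexadecimal and so on. This application does not limit how the identifiers are represented.
It should be noted that the source sub-chip and the destination sub-chip in the embodiments of this application are defined with respect to a data stream; different data streams may have different source and destination sub-chips.
Optionally, the content of the stream table is not limited to the items shown in Table 1. In a specific implementation, the stream table may include some or all of the items shown in Table 1, and may also include other content, such as other steps executed by the destination sub-chip (for example, sub-chip 8 processes steps 6 and 7 in addition to step 5, so the stream table may include information about steps 6 and 7).
Because the central controller stores the stream table, it can initialize the stream output table in the source sub-chip based on it. Specifically, the central controller looks up the information associated with the source sub-chip in the stream table, that is, the corresponding task identifier, destination sub-chip identifier, data stream identifier and so on, and sends this information to the source sub-chip. After receiving it, the source sub-chip fills it into the stream output table. One or more items of the information associated with the source sub-chip in the stream table constitute the configuration parameters mentioned above.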
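As a rough sketch of how configuration parameters might be derived from the stream table, the example below filters stream-table rows by source or destination sub-chip. The row values follow the FIG. 9B description in the text (sub-chip 4 sends streams 15, 16 and 17 to sub-chip 5, sub-chip 8, and sub-chips 0 and 9); the class name `StreamTableRow` and the functions `sot_config_for` and `sit_config_for` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamTableRow:
    tid: int       # task identifier (TID)
    s_mask: int    # source sub-chip identifier (S_mask)
    d_mask: tuple  # destination sub-chip identifiers (D_mask)
    sid: int       # data stream identifier (SID)


# Rows for step 1 of task 1, following the FIG. 9B example in the text.
STREAM_TABLE = [
    StreamTableRow(tid=1, s_mask=4, d_mask=(5,), sid=15),
    StreamTableRow(tid=1, s_mask=4, d_mask=(8,), sid=16),
    StreamTableRow(tid=1, s_mask=4, d_mask=(0, 9), sid=17),
]


def sot_config_for(source_id, table=STREAM_TABLE):
    """Configuration parameters the controller sends to a source sub-chip."""
    return [
        {"TID": row.tid, "SID": row.sid, "D_mask": row.d_mask}
        for row in table
        if row.s_mask == source_id
    ]


def sit_config_for(dest_id, table=STREAM_TABLE):
    """Configuration parameters the controller sends to a destination sub-chip."""
    return [
        {"TID": row.tid, "SID": row.sid}
        for row in table
        if dest_id in row.d_mask
    ]


print(sit_config_for(9))  # [{'TID': 1, 'SID': 17}]
```

Keeping a single stream table in the controller means the source-side and destination-side tables are derived from the same rows and cannot disagree about a stream identifier.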
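In a possible implementation, as described above, the stream output table in the source sub-chip may further include the indication information of the number of data packets of the data stream that have been sent and the start address in the source sub-chip used to store the data included in the data stream. The source sub-chip may fill in these two items based on its own information.
For example, the source sub-chip receives the data processing task assigned by the central controller (this may be the first data processing task in S801) and completes the preset processing steps to obtain the processed data (the output data obtained by executing the first data processing task in S801). The processed data is stored in a memory buffer of the source sub-chip, so the source sub-chip knows its size and its storage start address. The processed data is the data to be sent and the packet size is preset, so knowing the size of the processed data gives the number of data packets to be sent. The source sub-chip can therefore fill the number of data packets to be sent and the storage start address of the data to be sent into the stream output table. This completes the initialization of the stream output table in the source sub-chip.
The first data processing task received by the source sub-chip may be specific task data and/or task execution instructions. After receiving the task, the source sub-chip may cache it in a preset storage space for later execution.
See Table 2 for an example of the stream output table.
Table 2
Figure 548347dest_path_image002
Table 2 shows the information corresponding to task 1 in the stream output table. The TID, SID and D_mask of task 1 in Table 2 are obtained by the central controller from Table 1 and sent to the source sub-chip, so they are the same as in Table 1. The start address (start address, S_addr) in Table 2 is the start address in the source sub-chip used to store the data included in the data stream to be sent. The packet count (count packet, C_packet) in Table 2 is the indication information of the number of data packets of the data stream that have been sent. The packet count may be a countdown: at initialization it holds the total number of data packets in the data stream, and it is decremented by one each time the source sub-chip sends a data packet.
In addition, the stream output table may hold information for multiple tasks, distinguished by task identifier. The data stream identifiers of different tasks may be the same or different, as determined by the stream table of the central controller.
Alternatively, the source sub-chip may complete the initialization of the stream output table before executing the first data processing task assigned by the central controller. Specifically, the source sub-chip may initialize the number of sent data packets of the data stream in the stream output table to zero, and initialize the start address used to store the data of the data stream to the start address of a designated storage space. The data obtained when the source sub-chip executes the assigned data processing task can then be stored in that designated storage space. While sending the processed data, the source sub-chip increments the number of sent data packets in the stream output table by one for each data packet sent.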
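The following sketch illustrates how the start address and the packet count in a stream output table entry could be combined to locate the data of the next packet. It uses the count-up variant described above (the counter starts at zero and is incremented per packet sent). The `SotEntry` class, its method names and the fixed payload size of 1024 bytes are assumptions for illustration only.

```python
from dataclasses import dataclass

PAYLOAD_BYTES = 1024  # assumption: preset payload size of one data packet


@dataclass
class SotEntry:
    tid: int           # task identifier (TID)
    sid: int           # data stream identifier (SID)
    d_mask: tuple      # destination sub-chip identifiers (D_mask)
    s_addr: int        # start address of the stream's data (S_addr)
    c_packet: int = 0  # packets already sent (C_packet, count-up variant)

    def next_read_addr(self):
        """Address of the payload for the next packet to send."""
        return self.s_addr + self.c_packet * PAYLOAD_BYTES

    def mark_sent(self):
        """Record that one more packet of the stream has been sent."""
        self.c_packet += 1


entry = SotEntry(tid=1, sid=17, d_mask=(0, 9), s_addr=0x1000)
entry.mark_sent()
print(hex(entry.next_read_addr()))  # 0x1400
```

With the countdown variant, the entry would additionally need the total packet count so that the number already sent can be recovered from the remaining count.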
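S802. The source sub-chip generates multiple data packets from the output data based on the stream output table, where each data packet includes the identifier of the data stream and the identifier of the destination sub-chip.
After the source sub-chip has initialized the stream output table and obtained the output data by executing the first data processing task, it can generate the data packets of the corresponding data stream based on the stream output table. A data packet may include the packet type (type), the task identifier, the data stream identifier, the destination sub-chip identifier, the packet number and the data. The destination sub-chip identifier in a data packet may identify one or more destination sub-chips. If the data in the packet has one destination sub-chip, the packet carries the identifier of that sub-chip; if it has multiple destination sub-chips, the packet carries the identifiers of all of them. For example, in Table 2, data stream 17 has two destination sub-chips, sub-chip 0 and sub-chip 9, so the data packets of data stream 17 include the identifiers of sub-chip 0 and sub-chip 9.
In a specific implementation, the source sub-chip uses the task identifier and the data stream identifier to look up, in the stream output table, the start address at which the data to be sent (the output data) is stored, and reads the data to be sent from that start address to generate the data packets.
In a possible implementation, the generated data packets may also carry sideband information, which may include one or more of the task identifier, the data stream identifier and the destination sub-chip identifier. The sideband information need not be encapsulated inside the data packet but is sent along with it. In a specific implementation, only the routing module of a sub-chip can see the information inside a data packet; other modules of the sub-chip, such as the ports, cannot. The data packets can therefore be configured to carry sideband information so that they can be forwarded quickly. See FIG. 10 for an example of the format of a data packet and of its sideband information. The formats shown in FIG. 10 are only examples; in a specific implementation the data packet may include other information and the sideband information may include more information, and this application does not limit this.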
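S803. The source sub-chip sends the multiple data packets to the destination sub-chip.
After generating the data packets based on the stream output table, the source sub-chip can send them to the destination sub-chip. Specifically, the source sub-chip may send the data packets based on the port forwarding mapping table. As described in the introduction of the chip system, the routing module of the source sub-chip includes the forwarding-port mapper (FPM), which stores a port forwarding mapping table containing the mapping between destination sub-chip identifiers and sending ports.
Specifically, the source sub-chip looks up the port forwarding mapping table with the identifier of the destination sub-chip of the data packet, finds the corresponding sending port, and sends the data packet from that port. See FIG. 11. In FIG. 11, suppose sub-chip 0 is the source sub-chip and sub-chip 1 is the destination sub-chip. The port forwarding mapping table in sub-chip 0 stores the association between the identifier of sub-chip 1 and port d1 of sub-chip 0. When sub-chip 0 has a data packet to send to sub-chip 1, it looks up the table with the identifier of sub-chip 1, learns that the sending port is d1, and sends the data packet from port d1.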
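The lookup in the port forwarding mapping table amounts to a keyed table access, as the small sketch below shows. Only the mapping from sub-chip 1 to port d1 comes from the FIG. 11 example; the second entry, the table name and the function name are invented for illustration.

```python
# Port forwarding mapping table of sub-chip 0: destination sub-chip id -> sending port.
# Only the entry 1 -> "d1" comes from the FIG. 11 example; the other entry is made up.
PORT_FORWARDING_TABLE = {1: "d1", 4: "d2"}


def select_sending_port(dest_id, table=PORT_FORWARDING_TABLE):
    """Return the sending port mapped to a destination sub-chip identifier."""
    if dest_id not in table:
        raise LookupError(f"no sending port configured for sub-chip {dest_id}")
    return table[dest_id]


print(select_sending_port(1))  # d1
```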
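S804. The destination sub-chip receives the data packets.
After the source sub-chip sends a data packet, the packet travels along its path and arrives at the destination sub-chip, which receives it. Taking FIG. 11 as an example, sub-chip 0 sends the data packet from port d1, the packet arrives at the destination sub-chip through port d3 of sub-chip 1, and sub-chip 1 receives the packet through port d3.
S805. The destination sub-chip determines whether the stream input table contains the data stream identifier in the received data packet, where the stream input table includes at least one data stream identifier to be received; when the stream input table contains the data stream identifier in the received data packet, the destination sub-chip stores the data in the received data packet.
In a specific implementation, after receiving a data packet addressed to itself, the destination sub-chip may first store the data in the packet in a buffer for later processing. Specifically, the destination sub-chip may store the data in the packet based on the stream input table, which is the basis on which the destination sub-chip receives data.
The stream input table may include the task identifier and the data stream identifier. In a specific implementation, the central controller of the chip system can initialize the stream input table. As described above, the central controller stores the stream table. It can look up the stream table with the identifier of the destination sub-chip, obtain the information associated with the destination sub-chip, and send it to the destination sub-chip. This information includes the corresponding task identifier, data stream identifier and so on. After receiving the information, the destination sub-chip writes it into its own stream input table. One or more items of the information associated with the destination sub-chip in the stream table constitute the configuration parameters used to configure the stream input table. The data stream identifiers in the stream input table are the identifiers of the data streams that the destination sub-chip is to receive: only when the data stream identifier in a data packet is one of the data stream identifiers in the stream input table does the destination sub-chip take the data in the packet and store it for later processing.
The stream input table may further include the indication information of the number of data packets of the data stream that have been received, the start address in the destination sub-chip used to store the data included in the data stream, and similar information, which the destination sub-chip may fill in based on its own information. For example, the destination sub-chip may allocate a designated storage space for the data stream to be received and initialize the start address of that space into the stream input table. The destination sub-chip may also initialize the number of received data packets in the stream input table to zero and increment it by one for each data packet of the data stream it receives. Alternatively, the destination sub-chip may learn the total number of data packets in the data stream to be received from the source sub-chip or the controller, initialize the count in the stream input table to that total, and decrement it by one for each data packet of the data stream it receives.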
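See Table 3 for an example of the stream input table.
Table 3
Figure 167547dest_path_image003
Table 3 shows the information corresponding to task 1 in the stream input table. The TID and SID of task 1 in Table 3 are obtained by the central controller from Table 1 and sent to the destination sub-chip, so they are the same as in Table 1. The start address (start address, S_addr) in Table 3 is the start address in the destination sub-chip used to store the data included in data stream 15. The packet count (count packet, C_packet) in Table 3 is the indication information of the number of data packets of data stream 15 that have been received. The packet count may be a countdown: at initialization it holds the total number of data packets in the data stream, and it is decremented by one each time the destination sub-chip receives a data packet.
In addition, the stream input table may hold information for multiple tasks, distinguished by task identifier. The data stream identifiers of different tasks may be the same or different, as determined by the stream table of the central controller.
In another possible implementation, the destination sub-chip may initialize the stream input table by receiving a header packet from the source sub-chip. The header packet may include the packet type, the task identifier, the data stream identifier, the destination sub-chip identifier, the total number of data packets in the data stream, and the start address in the destination sub-chip used to store the data of the data stream. Optionally, the header packet may also carry sideband information with the same content as the sideband information described above. See FIG. 12 for an example of the format of the header packet.
In FIG. 12, N_PK denotes the total number of data packets in the data stream and Address denotes the start address in the destination sub-chip used to store the data of the data stream; the other fields are as described above. The packet format and sideband information shown in FIG. 12 are only examples; in a specific implementation the packet may include other information and the sideband information may include more information, and this application does not limit this.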
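In a possible implementation, once the stream input table of the destination sub-chip has been initialized, the destination sub-chip is ready to receive the corresponding data stream. The destination sub-chip can therefore send an unblocking data packet to the source sub-chip to indicate that it is ready to receive the corresponding data stream.
The unblocking data packet may include the packet type, the task identifier, the data stream identifier and the identifier of the sub-chip that is to receive the packet, which is the source sub-chip. Optionally, the unblocking data packet may also carry sideband information with the same content as the sideband information described above.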
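On the source side, the unblocking handshake can be thought of as a gate that holds back a stream until its unblocking data packet has arrived. The sketch below is one simple way to model this under the assumption that a single unblocking packet per stream identifier is sufficient; the `StreamGate` class and its method names are hypothetical and do not come from the application.

```python
class StreamGate:
    """Tracks which data streams the source sub-chip is allowed to send."""

    def __init__(self):
        self._unblocked = set()

    def on_unblocking_packet(self, sid):
        # The destination reports that it is ready to receive stream `sid`.
        self._unblocked.add(sid)

    def may_send(self, sid):
        return sid in self._unblocked


gate = StreamGate()
print(gate.may_send(15))  # False: no unblocking packet received yet
gate.on_unblocking_packet(15)
print(gate.may_send(15))  # True
```

For a stream with several destination sub-chips, a real design would also have to decide whether to wait for all destinations; the text does not specify this, so the sketch leaves it out.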
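Based on the above, after the destination sub-chip has initialized the stream input table, when it receives a data packet from the source sub-chip it first compares the data stream identifier in the packet with the data stream identifiers in its own stream input table. If the stream input table includes the data stream identifier of the received packet, the destination sub-chip looks up the stream input table with the task identifier and the data stream identifier in the packet to obtain the corresponding storage start address, computes the corresponding storage address from that start address, and stores the data of the packet at that address.
In a possible implementation, the central controller assigns a second data processing task to the destination sub-chip, and the second data processing task is executed based on the data obtained after the first data processing task is completed. The source sub-chip executes the first data processing task to obtain the output data and sends it to the destination sub-chip as a data stream. The destination sub-chip receives and stores the output data based on its own stream input table. The destination sub-chip can then read the output data from the storage space at the known storage address and use it to execute its second data processing task.
In a possible implementation, as described above, the chip system may include multiple subsystems connected according to a preset topology. If the source sub-chip and the destination sub-chip are not in the same subsystem, the data packets transmitted between them also include the identifier of the subsystem in which the receiving sub-chip is located. For example, when the source sub-chip sends a data packet to the destination sub-chip (such as a data packet of the data stream), the packet also includes the identifier of the subsystem of the destination sub-chip. When the destination sub-chip sends a data packet to the source sub-chip (such as the unblocking data packet), the packet also includes the identifier of the subsystem of the source sub-chip. The subsystem identifier in the data packet enables data transmission between sub-chips in different subsystems, which helps in processing large data tasks and improves processing efficiency.
In summary, with the initialized stream output table and stream input table, the embodiments of this application can transmit data efficiently within the chip system and improve its data processing performance.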
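The receive path described above (check the stream identifier against the stream input table, compute the storage address, store the data) might look roughly like the following. The address formula, start address plus packet number times a fixed payload size, is an assumption; the text only says the address is computed from the start address. The `DestinationSubChip` class and its members are illustrative names.

```python
PAYLOAD_BYTES = 1024  # assumption: preset payload size of one data packet


class DestinationSubChip:
    def __init__(self, memory_bytes):
        self.memory = bytearray(memory_bytes)
        self.sit = {}  # (TID, SID) -> {"S_addr": int, "C_packet": int}

    def configure_stream(self, tid, sid, s_addr):
        """Add an expected stream to the stream input table."""
        self.sit[(tid, sid)] = {"S_addr": s_addr, "C_packet": 0}

    def receive(self, tid, sid, seq, payload):
        """Store the payload if the stream is expected; return whether it was stored."""
        entry = self.sit.get((tid, sid))
        if entry is None:
            return False  # stream identifier not in the SIT: data is not stored
        addr = entry["S_addr"] + seq * PAYLOAD_BYTES
        if addr + len(payload) > len(self.memory):
            raise ValueError("payload does not fit in the reserved memory")
        self.memory[addr:addr + len(payload)] = payload
        entry["C_packet"] += 1  # count-up variant of the packet counter
        return True


chip = DestinationSubChip(memory_bytes=8192)
chip.configure_stream(tid=1, sid=15, s_addr=2048)
print(chip.receive(1, 15, seq=1, payload=b"abc"))  # True
print(chip.receive(1, 99, seq=0, payload=b"abc"))  # False
```

Addressing by packet number rather than by arrival order lets packets be stored correctly even if they arrive out of order, which is one possible reason to carry a packet number in each packet.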
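In a possible implementation, to further improve data transmission efficiency within the chip system, the embodiments of this application provide a routing implementation that allows data to be scheduled flexibly and transmitted efficiently between the sub-chips of the chip system. The following description takes a first sub-chip as an example.
Referring to FIG. 13, the routing implementation provided by the embodiments of this application includes, but is not limited to, the following steps:
S1301. The first sub-chip obtains a first data packet, where the first data packet includes the identifier of a destination sub-chip; the first sub-chip and the destination sub-chip are sub-chips of a chip system, and the multiple sub-chips of the chip system are connected in a preset topology.
The chip system may be the chip system 110, 120, 130 or 140 described above, and the first sub-chip may be any sub-chip of any of these chip systems.
For example, the first sub-chip may be the source sub-chip described with reference to FIG. 8, in which case the first sub-chip obtains the first data packet by generating it. For how the first data packet is generated, see the description of step S802 above.
Alternatively, the first sub-chip may be a sub-chip that the data packet passes through on its way from the source sub-chip to the destination sub-chip, in which case the first sub-chip obtains the first data packet by receiving it.
For ease of description, the chip system in which the first sub-chip is located is called the first chip system. The first sub-chip receives the first data packet from another sub-chip of the first chip system.
In a specific implementation, the first data packet may include one or more of the packet type (type), the task identifier, the data stream identifier, the destination sub-chip identifier, the packet number and the data. In a possible implementation, the first data packet may also carry sideband information, which may include one or more of the task identifier, the data stream identifier and the destination sub-chip identifier. For the content of the first data packet, see the description of data packets in step S802 above.
S1302. The first sub-chip sends the data in the first data packet based on the data transmission status between the sub-chips of the chip system.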
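In a specific implementation, the data transmission status between sub-chips of the chip system can take several forms. Several possible implementations are described below.
In a first possible implementation, the first sub-chip may send the data in the first data packet based on the congestion of its own ports.
Specifically, as described in the introduction of the chip system, each sub-chip includes multiple ports for communicating with other sub-chips. Each port is configured with a sending buffer that stores data waiting to be sent.
After receiving the first data packet, the first sub-chip parses it to obtain the destination sub-chip identifier. If the identifier indicates that the first sub-chip is itself the destination, the first sub-chip extracts and stores the data in the first data packet for later processing. Otherwise, the first sub-chip uses the destination sub-chip identifier as an index to look up the sending port of the first data packet in its own forwarding mapping table (see the description of FIG. 2 above). If the lookup returns multiple sending ports, the specific sending port can be chosen based on the congestion of their sending buffers. Specifically, to improve transmission efficiency, the first data packet can be sent from the port whose sending buffer holds the least data waiting to be sent.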
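Choosing the least congested of several candidate ports reduces to taking a minimum over the amount of data queued in each sending buffer. The sketch below assumes the queue sizes are available as a dictionary keyed by port name; the function name and the sample values are illustrative.

```python
def pick_least_congested(candidate_ports, queued_bytes):
    """Among the candidate ports, return the one with the least queued data."""
    return min(candidate_ports, key=lambda port: queued_bytes[port])


# Hypothetical state: two ports lead to the destination, d2 has the shorter queue.
queued = {"d0": 0, "d1": 4096, "d2": 1024, "d3": 512}
print(pick_least_congested(["d1", "d2"], queued))  # d2
```

Only ports returned by the forwarding lookup are considered, so an idle port that does not lead to the destination (d0 in the sample) is never chosen.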
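In a possible implementation, if the first data packet includes the identifiers of multiple destination sub-chips and the first sub-chip is one of them, the first sub-chip extracts and stores the data in the first data packet for later processing. The first sub-chip also uses the identifiers of the remaining destination sub-chips as indexes to look up, in its own forwarding mapping table, the sending ports for the data of the first data packet.
If only one destination sub-chip identifier remains, the first sub-chip likewise finds the corresponding sending ports and sends the data of the first data packet from the port whose sending buffer holds the least data waiting to be sent. Specifically, the data is re-encapsulated into a new data packet for sending; the destination sub-chip identifiers in the new packet no longer include the identifier of the first sub-chip and include only the identifier of the remaining destination sub-chip.
If multiple destination sub-chip identifiers remain, the first sub-chip looks up the sending port for each of them in its own forwarding mapping table. If the sending ports found are the same, the first sub-chip can copy the data of the first data packet into a newly generated data packet that includes the identifiers of all remaining destination sub-chips, and send the new packet from that common sending port. Again, the sending port may be the one, among the ports found, whose sending buffer holds the least data waiting to be sent.
Alternatively, suppose two destination sub-chips remain, sub-chip A and sub-chip B. The first sub-chip looks up, in its own forwarding mapping table, the sending port mapped to the identifier of sub-chip A and the sending port mapped to the identifier of sub-chip B. If the ports found are different, the first sub-chip can generate two new data packets, packet A and packet B. Both contain the data of the first data packet; packet A carries the identifier of sub-chip A as its destination sub-chip identifier, and packet B carries the identifier of sub-chip B. Packet A and packet B are then sent from their respective sending ports. Again, each sending port may be the one, among the ports found, whose sending buffer holds the least data waiting to be sent.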
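The multi-destination handling above can be summarized as: deliver locally if the first sub-chip is itself a destination, then group the remaining destinations by sending port and emit one packet per port. The sketch below makes the simplifying assumption that each destination maps to exactly one port; the function name and the sample forwarding table are hypothetical.

```python
from collections import defaultdict


def split_by_port(self_id, dest_ids, port_table):
    """Return (deliver_locally, {port: [destination ids]}) for one packet."""
    deliver_locally = self_id in dest_ids
    by_port = defaultdict(list)
    for dest in dest_ids:
        if dest != self_id:
            by_port[port_table[dest]].append(dest)
    return deliver_locally, dict(by_port)


# Hypothetical forwarding table of sub-chip 5.
table = {2: "d1", 8: "d2", 14: "d2"}
print(split_by_port(5, [5, 8, 14, 2], table))
# (True, {'d2': [8, 14], 'd1': [2]})
```

Destinations that share a port travel in a single packet carrying both identifiers, which is where the bandwidth saving over one packet per destination comes from.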
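In a second possible implementation, the first sub-chip may send the data in the first data packet to the destination sub-chip based on a minimum-bandwidth-consumption principle, that is, the principle of delivering the data to the destination sub-chip using the least transmission bandwidth.
To make this implementation easier to understand, the directional coordinate system built around the first sub-chip is introduced first. Figure 14 shows an example of such a coordinate system. It includes four direction axes: a first, a second, a third, and a fourth direction axis, all radiating outward from the first sub-chip. The first and second direction axes are collinear and point in opposite directions, as are the third and fourth direction axes. The coordinate system also includes four regions: the first region is bounded by the first and third direction axes; the second region by the second and third direction axes; the third region by the second and fourth direction axes; and the fourth region by the first and fourth direction axes.
In the chip system, the row in which the first sub-chip is located lies on at least one of the first and second direction axes, and the column in which it is located lies on at least one of the third and fourth direction axes. See Figure 15 for an example. Assuming sub-chip 5 in the chip system is the first sub-chip, a directional coordinate system is built with sub-chip 5 at its center. In this coordinate system, the second row, in which sub-chip 5 is located, lies on the first and second direction axes, and the second column, in which sub-chip 5 is located, lies on the third and fourth direction axes. Sub-chips 2 and 3 are then in the first region, sub-chip 0 is in the second region, sub-chips 8 and 12 are in the third region, and sub-chips 10, 11, 14, and 15 are in the fourth region.
In a possible implementation, if sub-chip 0 in the chip system of Figure 15 is the first sub-chip, a directional coordinate system is built with sub-chip 0 at its center. In this coordinate system, the first row, in which sub-chip 0 is located, lies on the first direction axis, and the first column, in which sub-chip 0 is located, lies on the fourth direction axis. Apart from the sub-chips in the same row or column as sub-chip 0, all other sub-chips are in the fourth region.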
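The directional coordinate system can be illustrated with a small classifier. The sketch assumes that Figure 15 shows a 4x4 grid numbered row by row from 0 to 15, with the first axis pointing toward higher column numbers and the fourth axis toward higher row numbers. That layout is inferred from the examples in the text rather than stated in it, and the function name `locate` is hypothetical.

```python
def locate(center: int, target: int, cols: int = 4) -> str:
    """Classify target relative to a coordinate system centered on center.

    Assumes row-major numbering on a grid with `cols` columns (an inference
    from the Figure 15 examples, not a statement of the application).
    """
    center_row, center_col = divmod(center, cols)
    target_row, target_col = divmod(target, cols)
    d_row, d_col = target_row - center_row, target_col - center_col

    if d_row == 0 and d_col == 0:
        return "self"
    if d_row == 0:  # same row: first or second direction axis
        return "axis1" if d_col > 0 else "axis2"
    if d_col == 0:  # same column: third or fourth direction axis
        return "axis3" if d_row < 0 else "axis4"
    if d_row < 0:  # between the third axis and the first/second axis
        return "region1" if d_col > 0 else "region2"
    return "region4" if d_col > 0 else "region3"


# Regions seen from sub-chip 5, grouped by classification.
seen_from_5 = {}
for chip in range(16):
    seen_from_5.setdefault(locate(5, chip), []).append(chip)
```

Under this assumed layout the classifier reproduces the assignments given above for sub-chip 5, and for sub-chip 0 it places every sub-chip outside row 1 and column 1 in the fourth region.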
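In a possible implementation, the minimum-bandwidth-consumption principle includes: when a destination sub-chip included in the first data packet is on a target direction axis, the first sub-chip sends the data of the first data packet along that target direction axis, where the target direction axis is the first, second, third, or fourth direction axis. This is illustrated below with reference to Figure 15.
In Figure 15, assume sub-chip 5 is the first sub-chip and receives a data packet whose destination identifier indicates sub-chip 7. If the packet includes only one destination identifier, then, because sub-chip 7 is on the first direction axis, sub-chip 5 sends the packet along the first direction axis: sub-chip 5 first sends the packet to sub-chip 6, which forwards it to sub-chip 7. If the packet includes multiple destination identifiers and sub-chip 7, as one of the destinations, is on the first direction axis, sub-chip 5 copies the data in the packet to generate a new data packet that includes the identifier of sub-chip 7, and sends the new packet along the first direction axis: sub-chip 5 first sends the new packet to sub-chip 6, which forwards it to sub-chip 7.
In a possible implementation, the first data packet includes identifiers of a first destination sub-chip and a second destination sub-chip. The minimum-bandwidth-consumption principle further includes: in the directional coordinate system centered on the first sub-chip, when the first and second destination sub-chips are located in two adjacent regions among the first, second, third, and fourth regions, the first sub-chip sends a second data packet along a common direction axis. The second data packet includes the data and the identifiers of the first and second destination sub-chips. The common direction axis is the direction axis that forms the shared boundary of the two adjacent regions. This is illustrated below with reference to Figure 15.
In Figure 15, assume sub-chip 5 is the first sub-chip and receives a data packet whose destination identifiers indicate sub-chip 8 and sub-chip 14. Sub-chip 8 is in the third region and sub-chip 14 is in the fourth region; the two regions are adjacent and share the fourth direction axis as their boundary. Sub-chip 5 therefore sends the packet along the fourth direction axis: it first sends the packet to sub-chip 9, which forwards it further. Specifically, sub-chip 9 can in turn be regarded as the first sub-chip: a directional coordinate system is built with sub-chip 9 at its center, and the data is forwarded according to the same minimum-bandwidth-consumption principle.
In a possible implementation, the first data packet includes identifiers of a first destination sub-chip and a second destination sub-chip. The minimum-bandwidth-consumption principle further includes: in the directional coordinate system centered on the first sub-chip, when the first destination sub-chip is in the first region and the second destination sub-chip is in the third region, the first sub-chip sends a third data packet along one of the two direction axes bounding the first region, and sends a fourth data packet along one of the two direction axes bounding the third region. The third data packet includes the data and the identifier of the first destination sub-chip. The fourth data packet includes the data and the identifier of the second destination sub-chip. This is illustrated below with reference to Figure 15.
In Figure 15, assume sub-chip 5 is the first sub-chip and receives a data packet whose destination identifiers indicate sub-chip 2 and sub-chip 12. Sub-chip 2 is in the first region and sub-chip 12 is in the third region. Sub-chip 5 may then generate two new data packets from the data in the received packet: data packet A, which includes the data and the identifier of sub-chip 2, and data packet B, which includes the data and the identifier of sub-chip 12. Data packet A is sent along the first or third direction axis; for example, along the first direction axis, it is first sent to sub-chip 6, which forwards it to sub-chip 2. Sub-chip 5 also sends data packet B along the second or fourth direction axis; for example, along the fourth direction axis, it is first sent to sub-chip 9, which forwards it further.
In a possible implementation, the first data packet includes identifiers of a first destination sub-chip and a second destination sub-chip. The minimum-bandwidth-consumption principle further includes: in the directional coordinate system centered on the first sub-chip, when the first destination sub-chip is in the second region and the second destination sub-chip is in the fourth region, the first sub-chip sends a third data packet along one of the two direction axes bounding the second region, and sends a fourth data packet along one of the two direction axes bounding the fourth region. The third data packet includes the data and the identifier of the first destination sub-chip. The fourth data packet includes the data and the identifier of the second destination sub-chip. This is illustrated below with reference to Figure 15.
In Figure 15, assume sub-chip 5 is the first sub-chip and receives a data packet whose destination identifiers indicate sub-chip 0 and sub-chip 10. Sub-chip 0 is in the second region and sub-chip 10 is in the fourth region. Sub-chip 5 may then generate two new data packets from the data in the received packet: data packet C, which includes the data and the identifier of sub-chip 0, and data packet D, which includes the data and the identifier of sub-chip 10. Data packet C is sent along the second or third direction axis; for example, along the second direction axis, it is first sent to sub-chip 4, which forwards it to sub-chip 0. Sub-chip 5 also sends data packet D along the first or fourth direction axis; for example, along the first direction axis, it is first sent to sub-chip 6, which forwards it to sub-chip 10.
In a possible implementation, the first data packet includes identifiers of a first destination sub-chip and a second destination sub-chip. The minimum-bandwidth-consumption principle further includes: in the directional coordinate system centered on the first sub-chip, when the first destination sub-chip is in a target region and the second destination sub-chip is on a direction axis bounding that target region, the first sub-chip sends a fifth data packet along that boundary direction axis. The fifth data packet includes the data and the identifiers of the first and second destination sub-chips. The target region is the first, second, third, or fourth region. This is illustrated below with reference to Figure 15.
In Figure 15, assume sub-chip 5 is the first sub-chip and receives a data packet whose destination identifiers indicate sub-chip 14 and sub-chip 9. Sub-chip 14 is in the fourth region and sub-chip 9 is on the fourth direction axis, which is a boundary axis of the fourth region. Sub-chip 5 may therefore send the received packet along the fourth direction axis, that is, to sub-chip 9. After receiving the packet, sub-chip 9 stores the data in it, then copies the data to generate a new data packet that includes the identifier of sub-chip 14. The new packet is sent to sub-chip 13 or sub-chip 10, which forwards it to sub-chip 14.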
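The two-destination rules of the minimum-bandwidth-consumption principle can be sketched as a small planner. As before, the 4x4 row-major grid is an assumption inferred from the Figure 15 examples, and the names `locate`, `REGION_AXES`, and `plan_two_destinations` are hypothetical. Where the text allows either boundary axis of a region, the sketch simply takes the first one listed; cases the text does not describe return `None`.

```python
from typing import List, Optional, Tuple

# Boundary axes of each region, as described for Figure 14.
REGION_AXES = {
    "region1": ("axis1", "axis3"),
    "region2": ("axis2", "axis3"),
    "region3": ("axis2", "axis4"),
    "region4": ("axis1", "axis4"),
}


def locate(center: int, target: int, cols: int = 4) -> str:
    """Classify target relative to center on an assumed row-major grid."""
    d_row = target // cols - center // cols
    d_col = target % cols - center % cols
    if d_row == 0 and d_col == 0:
        return "self"
    if d_row == 0:
        return "axis1" if d_col > 0 else "axis2"
    if d_col == 0:
        return "axis3" if d_row < 0 else "axis4"
    if d_row < 0:
        return "region1" if d_col > 0 else "region2"
    return "region4" if d_col > 0 else "region3"


def plan_two_destinations(
    center: int, dest_a: int, dest_b: int, cols: int = 4
) -> Optional[List[Tuple[str, List[int]]]]:
    """Return a list of (axis, destination ids) packets to send, or None."""
    loc_a, loc_b = locate(center, dest_a, cols), locate(center, dest_b, cols)

    # One destination in a region, the other on an axis bounding that region:
    # a single packet along that axis.
    for region, axis in ((loc_a, loc_b), (loc_b, loc_a)):
        if region in REGION_AXES and axis in REGION_AXES[region]:
            return [(axis, [dest_a, dest_b])]

    if loc_a in REGION_AXES and loc_b in REGION_AXES:
        shared = set(REGION_AXES[loc_a]) & set(REGION_AXES[loc_b])
        if len(shared) == 1:
            # Adjacent regions: a single packet along the common axis.
            return [(shared.pop(), [dest_a, dest_b])]
        if not shared:
            # Opposite regions: one packet per destination, each along a
            # boundary axis of its own region (first listed axis chosen).
            return [
                (REGION_AXES[loc_a][0], [dest_a]),
                (REGION_AXES[loc_b][0], [dest_b]),
            ]
    return None  # combination not covered by the rules sketched here
```

For sub-chip 5, `plan_two_destinations(5, 8, 14)` yields one packet along the fourth axis and `plan_two_destinations(5, 0, 10)` yields one packet along the second axis and one along the first, matching the examples above.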
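In a third possible implementation, the first sub-chip may receive scheduling information from the central controller of the chip system and send data accordingly.
Specifically, as can be seen from the earlier description of the chip system, the central controller may also be responsible for data scheduling in the chip system. The central controller obtains the data transmission conditions of each sub-chip over the control bus. By analyzing them, it can learn the congestion of each transmission path and/or of the ports of each sub-chip, formulate a data transmission policy on that basis, and deliver the policy to the sub-chips in the form of scheduling information. Each sub-chip sends data according to the scheduling information delivered by the controller, which reduces the probability of congestion and improves data transmission efficiency.
For example, the scheduling information that the central controller sends to a sub-chip may include the identifiers of one or more sub-chips and the identifiers of the corresponding ports leading to them. After receiving the scheduling information, the sub-chip updates its own forwarding mapping table with this information for use in subsequent data forwarding. See Figure 16 for an example.
Figure 16 shows an example structure of a chip system. Assume sub-chip 0 acts as the central controller of the chip system and can communicate with the other sub-chips over a control bus (not shown in Figure 16). Specifically, sub-chip 0 can collect, over the control bus, information such as the amount of data waiting to be sent on the ports of each sub-chip, and from this information determine the congestion of each port and, in turn, of each transmission path. Based on the results, sub-chip 0 can formulate an overall data transmission policy for the chip system and deliver it to the sub-chips in the form of scheduling information.
For example, for data transmitted from sub-chip 1 to sub-chip 6, sub-chip 0 determines through analysis that port d2 of sub-chip 1 is relatively idle and that port d1 of sub-chip 5 is also relatively idle. Sub-chip 0 then sends sub-chip 1 scheduling information that includes the identifier of sub-chip 6 and the identifier of port d2. After receiving it, sub-chip 1 updates its own forwarding mapping table to record that the sending port for destination sub-chip 6 is port d2. Sub-chip 0 also sends sub-chip 5 scheduling information that includes the identifier of sub-chip 6 and the identifier of port d1. After receiving it, sub-chip 5 updates its own forwarding mapping table to record that the sending port for destination sub-chip 6 is port d1. Then, when sub-chip 1 receives a data packet bound for sub-chip 6, it looks up its forwarding mapping table, finds that the sending port is d2, and sends the packet out of port d2. When the packet reaches sub-chip 5, sub-chip 5 looks up its forwarding mapping table, finds that the sending port is d1, and sends the packet out of port d1, delivering it to sub-chip 6.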
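The exchange of scheduling information in this example can be sketched as follows. The class `SubChip`, the helper `make_schedule`, the candidate port lists, and the queue depths are all assumptions made for illustration; the application specifies only that the scheduling information pairs a destination identifier with a port identifier and that the sub-chip writes it into its forwarding mapping table.

```python
from typing import Dict, List


class SubChip:
    """Minimal model of a sub-chip holding a forwarding mapping table."""

    def __init__(self, chip_id: int) -> None:
        self.chip_id = chip_id
        self.forwarding_table: Dict[int, str] = {}

    def apply_schedule(self, schedule: Dict[int, str]) -> None:
        # Update the table with the destination -> port pairs received.
        self.forwarding_table.update(schedule)

    def egress_port(self, dest_id: int) -> str:
        return self.forwarding_table[dest_id]


def make_schedule(
    dest_id: int, candidate_ports: List[str], pending: Dict[str, int]
) -> Dict[int, str]:
    """Controller side: pick the least busy candidate port for dest_id."""
    best = min(candidate_ports, key=lambda port: pending[port])
    return {dest_id: best}


# Hypothetical port loads collected by the controller (sub-chip 0).
chip1, chip5 = SubChip(1), SubChip(5)
chip1.apply_schedule(make_schedule(6, ["d1", "d2"], {"d1": 900, "d2": 20}))
chip5.apply_schedule(make_schedule(6, ["d1", "d3"], {"d1": 10, "d3": 700}))

# Ports used along the path 1 -> 5 -> 6.
path_ports = [chip1.egress_port(6), chip5.egress_port(6)]
```

With these assumed loads, the packet leaves sub-chip 1 through `d2` and sub-chip 5 through `d1`, as in the example above.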
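In summary, the embodiments of this application transmit received data based on the data transmission conditions within the chip system, so that the sending of data can be scheduled flexibly, improving data transmission efficiency and, in turn, the processing performance of the chip system.
The foregoing mainly describes the data transmission processing method in a chip system provided in the embodiments of this application. It can be understood that, to implement the corresponding functions, each device includes the hardware structures and/or software modules that perform those functions. In combination with the units and steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
In the embodiments of this application, devices may be divided into functional modules according to the above method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments of this application is illustrative and is only a division by logical function; other divisions are possible in actual implementations.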
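Where each functional module corresponds to one function, Figure 17 shows a specific logical structure of an apparatus, which may be the source sub-chip described above. The apparatus 1700 includes:
a receiving unit 1701, configured to receive configuration parameters;
a configuration unit 1702, configured to configure a preset flow output table according to the configuration parameters, where the flow output table includes an identifier of a data flow and an identifier of a destination sub-chip of the data flow; the apparatus 1700 and the destination sub-chip are sub-chips among multiple sub-chips included in a chip system, and the multiple sub-chips are connected in a preset topology;
an execution unit 1703, configured to execute a first data processing task to obtain output data;
a generating unit 1704, configured to generate multiple data packets from the output data based on the flow output table, where the data packets include the identifier of the data flow and the identifier of the destination sub-chip;
a sending unit 1705, configured to send the multiple data packets.
In a possible implementation, the flow output table further includes one or more of: an identifier of the first data processing task, information indicating the number of data packets of the data flow that have been sent, and a start address in the apparatus 1700 for storing the data included in the data flow; the data packets further include the identifier of the first data processing task.
In a possible implementation, the receiving unit 1701 is further configured to receive an unblock data packet before the generating unit generates the multiple data packets from the output data based on the flow output table, where the unblock data packet includes the identifier of the data flow and is used to indicate to the apparatus 1700 that the destination sub-chip is ready to receive the data flow.
In a possible implementation, the sending unit 1705 is specifically configured to send the multiple data packets based on a port forwarding mapping table, where the port forwarding mapping table includes a mapping between the identifier of the destination sub-chip and a sending port.
In a possible implementation, when the data flow has multiple destination sub-chips, the data packets include the identifiers of the multiple destination sub-chips.
In a possible implementation, the chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; when the destination sub-chip is a sub-chip of the subsystem, the data packets further include the identifier of the subsystem.
For the specific operations and beneficial effects of the units of the apparatus 1700 shown in Figure 17, refer to the corresponding descriptions of Figure 8 and its possible method embodiments above; details are not repeated here.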
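How the generating unit might turn output data into packets using the flow output table can be sketched as follows. The `Packet` fields mirror the identifiers named above; the fixed `chunk_size`, the table layout (flow identifier mapped to a list of destination identifiers), and the function name `packetize` are assumptions for illustration, since the text does not specify a packet size or table encoding.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    flow_id: int          # identifier of the data flow
    dest_ids: List[int]   # identifier(s) of the destination sub-chip(s)
    seq_no: int           # packet number within the flow
    payload: bytes


def packetize(
    flow_id: int,
    output: bytes,
    flow_output_table: Dict[int, List[int]],
    chunk_size: int,
) -> List[Packet]:
    """Split output data into packets tagged from the flow output table."""
    dest_ids = flow_output_table[flow_id]
    return [
        Packet(flow_id, list(dest_ids), seq_no, output[offset:offset + chunk_size])
        for seq_no, offset in enumerate(range(0, len(output), chunk_size))
    ]


# Hypothetical flow 7 destined for sub-chips 3 and 9.
flow_output_table = {7: [3, 9]}
packets = packetize(7, b"abcdefghij", flow_output_table, chunk_size=4)
```

Each resulting packet carries the flow identifier and both destination identifiers, so intermediate sub-chips can route it without consulting the source.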
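Where each functional module corresponds to one function, Figure 18 shows a specific logical structure of an apparatus, which may be the destination sub-chip described above. The apparatus 1800 includes:
a receiving unit 1801, configured to receive configuration parameters;
a configuration unit 1802, configured to configure a preset flow input table according to the configuration parameters, where the flow input table includes at least one identifier of a data flow to be received; the apparatus 1800 is a sub-chip among multiple sub-chips included in a chip system, and the multiple sub-chips are connected in a preset topology;
a judging unit 1803, configured to determine, when a data packet is received, whether the flow input table contains the data flow identifier in the data packet;
a storage unit 1804, configured to store the data in the data packet when the data packet contains the identifier of a data flow to be received.
In a possible implementation, the flow input table further includes one or more of: the identifier of the first data processing task, information indicating the number of data packets of the data flow that have been received, and a start address in the apparatus 1800 for storing the data included in the data flow; the data packet further includes the identifier of the first data processing task.
In a possible implementation, the apparatus 1800 further includes a sending unit, configured to send an unblock data packet before the receiving unit 1801 receives the data packet, where the unblock data packet includes the identifier of the data flow and the identifier of the source sub-chip and is used to indicate to the source sub-chip that the apparatus 1800 is ready to receive the data flow; the source sub-chip is the sub-chip in the chip system that sends the data packet.
In a possible implementation, the chip system includes a subsystem, the subsystem includes at least two sub-chips, and the subsystem is configured with a subsystem identifier; the data packet further includes the identifier of the subsystem.
For the specific operations and beneficial effects of the units of the apparatus 1800 shown in Figure 18, refer to the corresponding descriptions of Figure 8 and its possible method embodiments above; details are not repeated here.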
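The check performed by the judging and storage units can be sketched as below. The flow input table is modeled as a dictionary from flow identifier to a received-packet counter, and packets for unknown flows are simply ignored; both choices, like the function name `on_packet`, are assumptions for the example, as this passage does not say how unmatched packets are handled.

```python
from typing import Dict


def on_packet(
    flow_id: int,
    payload: bytes,
    flow_input_table: Dict[int, int],
    storage: Dict[int, bytearray],
) -> bool:
    """Store the payload if the flow is expected; return whether it was stored."""
    if flow_id not in flow_input_table:
        # Flow not configured for this sub-chip: do not store (assumption).
        return False
    storage.setdefault(flow_id, bytearray()).extend(payload)
    flow_input_table[flow_id] += 1  # count of packets received for this flow
    return True


# Hypothetical destination sub-chip configured to receive flow 7 only.
flow_input_table = {7: 0}
storage: Dict[int, bytearray] = {}
results = [
    on_packet(7, b"abcd", flow_input_table, storage),
    on_packet(8, b"zzzz", flow_input_table, storage),
    on_packet(7, b"ef", flow_input_table, storage),
]
```

Only the two packets of flow 7 are stored, and the counter in the flow input table records that two packets have arrived.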
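Where each functional module corresponds to one function, Figure 19 shows a specific logical structure of an apparatus, which may be the central controller described above. The apparatus 1900 includes:
an allocation unit 1901, configured to allocate a first data processing task to a source sub-chip, where the data obtained after the first data processing task is completed is sent to a destination sub-chip in the form of a data flow; the apparatus 1900, the source sub-chip, and the destination sub-chip are sub-chips among multiple sub-chips included in a chip system, and the multiple sub-chips are connected in a preset topology;
a configuration unit 1902, configured to configure an identifier for the data flow;
a sending unit 1903, configured to send the identifier of the data flow and the identifier of the destination sub-chip to the source sub-chip, where the identifier of the data flow and the identifier of the destination sub-chip are stored in association in a flow output table of the source sub-chip, and the flow output table is the basis on which the source sub-chip sends data.
In a possible implementation, the allocation unit 1901 is further configured to allocate a second data processing task to the destination sub-chip, where the second data processing task is executed based on the data obtained after the first data processing task is completed;
the sending unit 1903 is further configured to send the identifier of the data flow to the destination sub-chip, where the identifier of the data flow is stored in a flow input table of the destination sub-chip, and the flow input table is the basis on which the destination sub-chip receives data.
In a possible implementation, the apparatus 1900 further includes:
an obtaining unit, configured to obtain the data transmission conditions between sub-chips in the chip system;
a generating unit, configured to generate scheduling information for the source sub-chip based on the data transmission conditions, where the scheduling information indicates the sending port in the source sub-chip through which the data flow is sent to the destination sub-chip;
the sending unit 1903 is further configured to send the scheduling information to the source sub-chip.
For the specific operations and beneficial effects of the units of the apparatus 1900 shown in Figure 19, refer to the corresponding descriptions of Figure 8 and Figure 13 and their possible method embodiments above; details are not repeated here.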
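The controller-side configuration of a data flow can be sketched as follows. In the application the identifiers are sent to the sub-chips as configuration parameters; the sketch simplifies this to direct writes into two dictionaries. The class `CentralController`, the sequential flow identifiers, and the table layouts are assumptions for illustration.

```python
import itertools
from typing import Dict, List


class CentralController:
    """Assigns flow identifiers and fills in the sub-chips' flow tables."""

    def __init__(self) -> None:
        self._flow_ids = itertools.count(1)  # hypothetical numbering scheme

    def configure_flow(
        self,
        flow_output_table: Dict[int, List[int]],
        flow_input_table: Dict[int, int],
        dest_id: int,
    ) -> int:
        flow_id = next(self._flow_ids)
        # Source side: flow identifier stored with the destination identifier.
        flow_output_table[flow_id] = [dest_id]
        # Destination side: flow identifier registered, zero packets received.
        flow_input_table[flow_id] = 0
        return flow_id


controller = CentralController()
source_table: Dict[int, List[int]] = {}
dest_table: Dict[int, int] = {}
first = controller.configure_flow(source_table, dest_table, dest_id=6)
second = controller.configure_flow(source_table, dest_table, dest_id=6)
```

After these two calls the source sub-chip knows where flows 1 and 2 go, and the destination sub-chip knows to accept both.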
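Figure 20 shows a specific hardware structure of an apparatus provided in this application. The apparatus 2000 includes a processor 2001, a memory 2002, and a communication port 2003. The processor 2001, the communication port 2003, and the memory 2002 may be connected to one another directly or through a bus 2004.
For example, the memory 2002 is used to store the computer programs and data of the apparatus 2000. The memory 2002 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM). For example, the memory 2002 may be the static memory shown in Figure 2 above.
The communication port 2003 includes a sending port and a receiving port. There may be multiple communication ports 2003, which support the apparatus 2000 in communicating, for example in receiving or sending data or messages. For example, the communication ports 2003 may be the ports d0, d1, d2, and d3 shown in Figure 2 above.
For example, the processor 2001 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. For example, the processor 2001 may be the processing module shown in Figure 2 above.
In a possible implementation, the apparatus 2000 is the source sub-chip in Figure 8 and its possible implementations. The processor 2001 in the apparatus 2000 may then be used to read the program stored in the memory 2002, so that the apparatus 2000 performs the operations performed by the source sub-chip as described in Figure 8 and its specific embodiments.
In a possible implementation, the apparatus 2000 is the destination sub-chip in Figure 8 and its possible implementations. The processor 2001 in the apparatus 2000 may then be used to read the program stored in the memory 2002, so that the apparatus 2000 performs the operations performed by the destination sub-chip as described in Figure 8 and its specific embodiments.
In a possible implementation, the apparatus 2000 is the central controller in Figure 8 and its possible implementations. The processor 2001 in the apparatus 2000 may then be used to read the program stored in the memory 2002, so that the apparatus 2000 performs the operations performed by the central controller as described in Figure 8 and its specific embodiments.
For the specific operations and beneficial effects of the units of the apparatus 2000 shown in Figure 20, refer to the corresponding descriptions of Figure 8 and its specific method embodiments above; details are not repeated here.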
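The embodiments of this application further provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the operations performed by the source sub-chip in any of the embodiments of Figure 8 and its possible method embodiments.
The embodiments of this application further provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the operations performed by the destination sub-chip in any of the embodiments of Figure 8 and its possible method embodiments.
The embodiments of this application further provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the operations performed by the central controller in any of the embodiments of Figure 8 and its possible method embodiments.
The embodiments of this application further provide a computer program product. When the computer program product is read and executed by a computer, the operations performed by the source sub-chip in any of the embodiments of Figure 8 and its possible method embodiments are implemented.
The embodiments of this application further provide a computer program product. When the computer program product is read and executed by a computer, the operations performed by the destination sub-chip in any of the embodiments of Figure 8 and its possible method embodiments are implemented.
The embodiments of this application further provide a computer program product. When the computer program product is read and executed by a computer, the operations performed by the central controller in any of the embodiments of Figure 8 and its possible method embodiments are implemented.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.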

Claims (21)

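  1. A data transmission processing method in a chip system, characterized in that the method comprises:
    a source sub-chip receiving configuration parameters and configuring a preset flow output table according to the configuration parameters; the flow output table comprising an identifier of a data flow and an identifier of a destination sub-chip of the data flow; the source sub-chip and the destination sub-chip being sub-chips among multiple sub-chips comprised in a chip system, the multiple sub-chips being connected in a preset topology;
    the source sub-chip executing a first data processing task to obtain output data;
    the source sub-chip generating multiple data packets from the output data based on the flow output table and sending them, the data packets comprising the identifier of the data flow and the identifier of the destination sub-chip.
  2. The method according to claim 1, characterized in that the flow output table further comprises one or more of: an identifier of the first data processing task, information indicating the number of data packets of the data flow that have been sent, and a start address in the source sub-chip for storing the data comprised in the data flow;
    the data packets further comprise the identifier of the first data processing task.
  3. The method according to claim 1, characterized in that, before the source sub-chip generates the multiple data packets from the output data based on the flow output table and sends them, the method further comprises:
    the source sub-chip receiving an unblock data packet; wherein the unblock data packet comprises the identifier of the data flow and is used to indicate to the source sub-chip that the destination sub-chip is ready to receive the data flow.
  4. The method according to claim 1, characterized in that the source sub-chip sending the multiple data packets comprises:
    the source sub-chip sending the multiple data packets based on a port forwarding mapping table; wherein the port forwarding mapping table comprises a mapping between the identifier of the destination sub-chip and a sending port.
  5. The method according to claim 1, characterized in that, when the data flow has multiple destination sub-chips, the data packets comprise the identifiers of the multiple destination sub-chips.
  6. The method according to any one of claims 1-5, characterized in that the chip system comprises a subsystem, the subsystem comprises at least two sub-chips, and the subsystem is configured with a subsystem identifier;
    when the destination sub-chip is a sub-chip of the subsystem, the data packets further comprise the identifier of the subsystem.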
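  7. A data transmission processing method in a chip system, characterized in that the method comprises:
    a destination sub-chip receiving configuration parameters and configuring a preset flow input table according to the configuration parameters, the flow input table comprising at least one identifier of a data flow to be received; the destination sub-chip being a sub-chip among multiple sub-chips comprised in a chip system, the multiple sub-chips being connected in a preset topology;
    the destination sub-chip, upon receiving a data packet, determining whether the flow input table contains the data flow identifier in the data packet;
    when the flow input table contains the data flow identifier in the data packet, storing the data in the data packet.
  8. The method according to claim 7, characterized in that the flow input table further comprises one or more of: the identifier of the first data processing task, information indicating the number of data packets of the data flow that have been received, and a start address in the destination sub-chip for storing the data comprised in the data flow;
    the data packet further comprises the identifier of the first data processing task.
  9. The method according to claim 7, characterized in that, before the destination sub-chip receives the data packet, the method further comprises:
    the destination sub-chip sending an unblock data packet; wherein the unblock data packet comprises the identifier of the data flow and an identifier of a source sub-chip and is used to indicate to the source sub-chip that the destination sub-chip is ready to receive the data flow; the source sub-chip is the sub-chip in the chip system that sends the data packet.
  10. The method according to any one of claims 7-9, characterized in that the chip system comprises a subsystem, the subsystem comprises at least two sub-chips, and the subsystem is configured with a subsystem identifier;
    the data packet further comprises the identifier of the subsystem.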
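  11. A data transmission processing method in a chip system, characterized in that the method comprises:
    a controller allocating a first data processing task to a source sub-chip, the data obtained after the first data processing task is completed being sent to a destination sub-chip in the form of a data flow; the controller, the source sub-chip, and the destination sub-chip being sub-chips among multiple sub-chips comprised in a chip system, the multiple sub-chips being connected in a preset topology;
    the controller configuring an identifier for the data flow;
    the controller sending the identifier of the data flow and the identifier of the destination sub-chip to the source sub-chip; wherein the identifier of the data flow and the identifier of the destination sub-chip are stored in association in a flow output table of the source sub-chip; the flow output table is the basis on which the source sub-chip sends data.
  12. The method according to claim 11, characterized in that the method further comprises:
    the controller allocating a second data processing task to the destination sub-chip, the second data processing task being executed based on the data obtained after the first data processing task is completed;
    the controller sending the identifier of the data flow to the destination sub-chip; wherein the identifier of the data flow is stored in a flow input table of the destination sub-chip, and the flow input table is the basis on which the destination sub-chip receives data.
  13. The method according to claim 11 or 12, characterized in that the method further comprises:
    the controller obtaining the data transmission conditions between sub-chips in the chip system;
    the controller generating scheduling information for the source sub-chip based on the data transmission conditions, the scheduling information indicating the sending port in the source sub-chip through which the data flow is sent to the destination sub-chip;
    the controller sending the scheduling information to the source sub-chip.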
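  14. A source sub-chip, characterized in that the source sub-chip comprises:
    a receiving unit, configured to receive configuration parameters;
    a configuration unit, configured to configure a preset flow output table according to the configuration parameters; the flow output table comprising an identifier of a data flow and an identifier of a destination sub-chip of the data flow; the source sub-chip and the destination sub-chip being sub-chips among multiple sub-chips comprised in a chip system, the multiple sub-chips being connected in a preset topology;
    an execution unit, configured to execute a first data processing task to obtain output data;
    a generating unit, configured to generate multiple data packets from the output data based on the flow output table, the data packets comprising the identifier of the data flow and the identifier of the destination sub-chip;
    a sending unit, configured to send the multiple data packets.
  15. A destination sub-chip, characterized in that the destination sub-chip comprises:
    a receiving unit, configured to receive configuration parameters;
    a configuration unit, configured to configure a preset flow input table according to the configuration parameters, the flow input table comprising at least one identifier of a data flow to be received; the destination sub-chip being a sub-chip among multiple sub-chips comprised in a chip system, the multiple sub-chips being connected in a preset topology;
    a judging unit, configured to determine, when a data packet is received, whether the flow input table contains the data flow identifier in the data packet;
    a storage unit, configured to store the data in the data packet when the flow input table contains the data flow identifier in the data packet.
  16. A controller, characterized in that the controller comprises:
    an allocation unit, configured to allocate a first data processing task to a source sub-chip, the data obtained after the first data processing task is completed being sent to a destination sub-chip in the form of a data flow; the controller, the source sub-chip, and the destination sub-chip being sub-chips among multiple sub-chips comprised in a chip system, the multiple sub-chips being connected in a preset topology;
    a configuration unit, configured to configure an identifier for the data flow;
    a sending unit, configured to send the identifier of the data flow and the identifier of the destination sub-chip to the source sub-chip; wherein the identifier of the data flow and the identifier of the destination sub-chip are stored in association in a flow output table of the source sub-chip; the flow output table is the basis on which the source sub-chip sends data.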
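  17. A sub-chip, characterized by comprising a processor, a memory, and a communication port; wherein the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to invoke the computer program so that the sub-chip performs the method according to any one of claims 1-6;
    the sub-chip is a sub-chip among multiple sub-chips comprised in a chip system, and the multiple sub-chips are connected in a preset topology.
  18. A sub-chip, characterized by comprising a processor, a memory, and a communication port; wherein the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to invoke the computer program so that the sub-chip performs the method according to any one of claims 7-10;
    the sub-chip is a sub-chip among multiple sub-chips comprised in a chip system, and the multiple sub-chips are connected in a preset topology.
  19. A sub-chip, characterized by comprising a processor, a memory, and a communication port; wherein the memory and the communication port are coupled to the processor, the communication port is used to send and receive data, the memory is used to store a computer program, and the processor is used to invoke the computer program so that the sub-chip performs the method according to any one of claims 11-13;
    the sub-chip is a sub-chip among multiple sub-chips comprised in a chip system, and the multiple sub-chips are connected in a preset topology.
  20. A chip system, characterized in that the chip system comprises a source sub-chip, a destination sub-chip, and a controller; wherein the source sub-chip is the source sub-chip according to claim 14, the destination sub-chip is the destination sub-chip according to claim 15, and the controller is the controller according to claim 16; or,
    the source sub-chip is the sub-chip according to claim 17, the destination sub-chip is the sub-chip according to claim 18, and the controller is the sub-chip according to claim 19.
  21. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-6; or,
    the computer program, when executed by a processor, implements the method according to any one of claims 7-10; or,
    the computer program, when executed by a processor, implements the method according to any one of claims 11-13.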
PCT/CN2022/099777 2021-12-28 2022-06-20 Data transmission processing method in a chip system and related apparatus WO2023123902A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111633371.XA CN114328623A (zh) 2021-12-28 2021-12-28 Data transmission processing method in a chip system and related apparatus
CN202111633371.X 2021-12-28

Publications (1)

Publication Number Publication Date
WO2023123902A1 true WO2023123902A1 (zh) 2023-07-06

Family

ID=81014166

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099777 WO2023123902A1 (zh) 2021-12-28 2022-06-20 Data transmission processing method in a chip system and related apparatus

Country Status (2)

Country Link
CN (1) CN114328623A (zh)
WO (1) WO2023123902A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117041186A (zh) * 2023-10-07 2023-11-10 苏州仰思坪半导体有限公司 数据传输方法、芯片系统、计算设备及存储介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297130A (zh) * 2021-12-28 2022-04-08 深圳云天励飞技术股份有限公司 芯片系统中的数据传输处理方法及相关装置
CN114328623A (zh) * 2021-12-28 2022-04-12 深圳云天励飞技术股份有限公司 芯片系统中的数据传输处理方法及相关装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103067295A (zh) * 2013-01-04 2013-04-24 华为技术有限公司 业务传输的方法、装置与系统
WO2013147805A1 (en) * 2012-03-29 2013-10-03 Intel Corporation Techniques for using an assigned switch identification at an input/output device
CN111274197A (zh) * 2018-12-05 2020-06-12 锐迪科(重庆)微电子科技有限公司 数据处理装置及方法
CN112532714A (zh) * 2020-11-25 2021-03-19 北京金山云网络技术有限公司 一种数据处理方法、处理装置、服务器及存储介质
CN112667557A (zh) * 2021-03-16 2021-04-16 南京蓝洋智能科技有限公司 一种适用于chiplet架构的数据传输方法
CN114328623A (zh) * 2021-12-28 2022-04-12 深圳云天励飞技术股份有限公司 芯片系统中的数据传输处理方法及相关装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117041186A (zh) * 2023-10-07 2023-11-10 苏州仰思坪半导体有限公司 数据传输方法、芯片系统、计算设备及存储介质
CN117041186B (zh) * 2023-10-07 2024-01-30 苏州仰思坪半导体有限公司 数据传输方法、芯片系统、计算设备及存储介质

Also Published As

Publication number Publication date
CN114328623A (zh) 2022-04-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913191

Country of ref document: EP

Kind code of ref document: A1