CN118277069A - Load balancing method in chip system and related device - Google Patents



Publication number
CN118277069A
Authority
CN
China
Legal status
Pending
Application number
CN202211711004.1A
Other languages
Chinese (zh)
Inventor
黎立煌
王和国
曹庆新
Current Assignee
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Application filed by Shenzhen Intellifusion Technologies Co Ltd filed Critical Shenzhen Intellifusion Technologies Co Ltd
Publication of CN118277069A

Abstract

The embodiment of the application provides a load balancing method in a chip system and a related device, wherein the chip system comprises n chipsets, n is an integer greater than 1, and each chipset comprises one or more chips; the chip system is used for processing n kinds of tasks; the ith kind of task is preset to be executed by the ith chipset, where i takes integer values from 1 to n. The method comprises the following steps: a first chipset of the n chipsets performs the following operations based on task scheduling information: acquiring first to-be-processed data from a second chipset of the n chipsets, wherein the task scheduling information indicates the kinds of tasks executed by the first chipset after task scheduling; and executing a first kind of task based on the first to-be-processed data to obtain processed first data, wherein the first kind of task is a task preset to be executed by the second chipset. The application can fully utilize the computing resources of the chips to improve the processing performance of the chip system.

Description

Load balancing method in chip system and related device
Technical Field
The present invention relates to the field of load balancing technologies, and in particular, to a load balancing method in a chip system and a related device.
Background
A chip system may include a plurality of chips, each capable of processing data individually, connected in a topology that enables them to communicate with one another. Moreover, the plurality of chips in the chip system can cooperatively process a single large-scale computing task in a model-parallel manner, so as to reduce the demand on chip memory. In the process of cooperatively processing tasks, how to fully utilize the computing resources of the chips to improve the processing performance of the chip system is a technical problem to be solved.
Disclosure of Invention
The embodiment of the application discloses a load balancing method and a related device in a chip system, which can fully utilize the computing resources of a chip to improve the processing performance of the chip system.
In a first aspect, the present application provides a load balancing method in a chip system, the chip system comprising n chipsets, wherein n is an integer greater than 1 and each chipset comprises one or more chips; the chip system is used for processing n kinds of tasks; an ith kind of task is preset to be executed by an ith chipset, where i takes integer values from 1 to n; the method comprises the following steps:
a first chipset of the n chipsets performs the following operations based on the task scheduling information:
Acquiring first to-be-processed data from a second chipset of the n chipsets; the task scheduling information is used for indicating the kinds of tasks executed by the first chipset after task scheduling;
Executing a first kind of task based on the first data to be processed to obtain processed first data; the first kind of task is a task preset to be executed by the second chipset.
Optionally, the task scheduling information is stored in a memory of the first chipset.
In this scheme, a chipset in the chip system acquires data from another chipset based on the preset task scheduling information and executes that chipset's kind of task, helping it share the load. Load balancing among the chipsets in the chip system is thus achieved, and the processing efficiency and performance of the entire chip system can be improved.
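The software-scheduled helping behavior described above can be sketched roughly in Python. All names here (`Chipset`, `run_scheduled_tasks`, the callable task kinds) are illustrative assumptions, not identifiers from the patent:

```python
# Hypothetical sketch of software-scheduled load balancing between chipsets.
# A chipset is preset to execute one kind of task; the task scheduling
# information can additionally direct another chipset to help it.

class Chipset:
    def __init__(self, index, own_task):
        self.index = index        # i-th chipset
        self.own_task = own_task  # preset i-th kind of task (a callable here)
        self.outbox = []          # to-be-processed data a helper may pull

    def pull_pending_data(self):
        """Hand queued to-be-processed data to a helping chipset."""
        return self.outbox.pop(0) if self.outbox else None

def run_scheduled_tasks(first, second, scheduling_info):
    """First chipset helps the second chipset per the scheduling info.

    scheduling_info maps a kind of task to the chipset that should
    execute it after task scheduling.
    """
    results = []
    if scheduling_info.get(second.own_task) is first:
        data = second.pull_pending_data()          # acquire first to-be-processed data
        if data is not None:
            results.append(second.own_task(data))  # run the second chipset's task kind
    return results
```

In a real chip system the scheduling information would reside in the first chipset's memory and the pull would be an inter-chip transfer; the dictionary and list above merely stand in for those mechanisms.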
In a possible implementation manner, the task scheduling information is further used to indicate the number of times the first chipset executes each kind of task after task scheduling, where different executions of the same kind of task process different data.
In this scheme, indicating the number of executions of each kind of task in the task scheduling information enables finer-grained load balancing and thus a better load balancing effect.
In a possible implementation manner, the first chipset further performs the following operations based on the task scheduling information:
Acquiring a first task parameter from the second chipset, wherein the first task parameter is a parameter for executing the first type of task in the second chipset;
the performing a first kind of task based on the first to-be-processed data includes:
And executing the first kind of task based on the first task parameter and the first to-be-processed data.
In this scheme, the first chipset can acquire the corresponding task parameters from the second chipset and use them to execute the second chipset's processing task, helping the second chipset share its load. Load balancing among the chipsets in the chip system is thus achieved, and the processing efficiency and performance of the entire chip system can be improved.
In a possible implementation manner, after the first kind of task is performed based on the first to-be-processed data to obtain the processed first data, the method further includes:
executing a second kind of task based on the processed first data to obtain processed second data; the second kind of task is a task preset to be executed by the first chipset;
and transmitting the processed second data to a third chipset in the n chipsets.
The data output after the first kind of task is processed is the input data of the second kind of task. In this scheme, after the first chipset helps the second chipset process the first kind of task, it can process the second kind of task based on the output data, so that the computing resources of the first chipset are fully utilized and the processing efficiency of the chip system is improved.
In a possible implementation manner, after the first kind of task is performed based on the first to-be-processed data to obtain the processed first data, the method further includes:
Executing a second kind of task based on the processed first data to obtain processed second data; the second kind of task is a task preset to be executed by the first chipset;
And executing a third kind of task based on the processed second data, wherein the third kind of task is a task preset to be executed by a third chip set in the n chip sets.
Optionally, before the performing the third kind of task based on the processed second data, the method further includes: acquiring second task parameters from the third chipset, wherein the second task parameters are parameters for executing the third kind of task in the third chipset; the performing a third kind of task based on the processed second data includes: and executing the third kind of task based on the second task parameter and the processed second data.
The data output after the first kind of task is processed is the input data of the second kind of task, and the data output after the second kind of task is processed is the input data of the third kind of task. In this scheme, after the first chipset helps the second chipset process the first kind of task, it can process the second kind of task based on that output, and can further help the third chipset by processing the third kind of task based on the output of the second kind of task, so that the computing resources of the first chipset are fully utilized and the processing efficiency of the chip system is improved.
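The chained execution in this implementation — the help-out task, the chipset's own task, then a third chipset's task — can be sketched as follows (the function name and parameter names are hypothetical):

```python
def help_and_chain(first_data, first_kind, second_kind, third_kind):
    """One chipset runs three kinds of tasks back to back.

    first_kind:  preset for the second chipset (the help-out task)
    second_kind: the first chipset's own preset kind of task
    third_kind:  preset for the third chipset (its parameters would be
                 fetched from the third chipset beforehand)
    """
    processed_first = first_kind(first_data)        # output feeds the next kind
    processed_second = second_kind(processed_first)
    return third_kind(processed_second)             # helping the third chipset
```

Each kind of task consumes the previous kind's output, which is why a single chipset can usefully run all three in sequence.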
In a possible implementation manner, after performing the first kind of task based on the first to-be-processed data and obtaining the processed first data, the method further includes: and transmitting the processed first data to a fourth chip set in the n chip sets.
In this scheme, the data output after the first kind of task is processed is the input data of the kind of task preset to be executed by the fourth chipset. After the first chipset helps the second chipset process the first kind of task and obtains the output first data, it may send the first data to the fourth chipset so that the fourth chipset can execute its own processing task. In this way, too, the first chipset fully utilizes its own computing resources to help other chipsets share the load, thereby improving the processing efficiency of the chip system.
In a possible implementation manner, the first chipset further performs the following operations based on the task scheduling information:
Executing a second kind of task to obtain processed third data; the second kind of task is a task preset to be executed by the first chipset;
executing a third kind of task based on the processed third data to obtain processed fourth data; the third kind of task is a task preset to be executed by a third chipset of the n chipsets;
and transmitting the processed fourth data to a fourth chipset in the n chipsets.
In this scheme, the data output after the first kind of task is processed is the input data of the second kind of task, the data output after the second kind of task is processed is the input data of the third kind of task, and the data output after the third kind of task is processed is the input data of the kind of task preset to be executed by the fourth chipset. After the first chipset helps the second chipset process the first kind of task and obtains the output data, it executes its own second kind of task and then helps the third chipset execute the third kind of task, fully utilizing its own computing resources to help other chipsets share the load and improving the processing efficiency of the chip system.
In a possible implementation manner, the method further includes:
The first chipset obtains second data to be processed from a fifth chipset in the n chipsets;
Executing a fourth kind of task based on the second to-be-processed data to obtain processed fifth data; the fourth kind of task is a task preset to be executed by the fifth chipset, and the task scheduling information does not instruct the first chipset to execute the fourth kind of task.
In this scheme, besides software scheduling (scheduling based on the task scheduling information), the chipsets may also implement task scheduling through hardware; that is, the tasks executed in a chipset may be determined autonomously by hardware rather than under software scheduling control. Combining software scheduling with hardware scheduling allows the system to respond more quickly to load balancing demands among the chipsets, so that load balancing is achieved rapidly and the processing efficiency and performance of the chip system are improved.
In a possible implementation manner, the chip system processes the n kinds of tasks through a deep learning neural network; the deep learning neural network comprises n sub-networks, and each sub-network comprises a plurality of network layers of the deep learning neural network; the ith chip set is preset for realizing the function of the ith sub-network; the function of the ith sub-network is a function realized by the ith task.
Optionally, the n kinds of tasks are data processing tasks in the forward propagation process of the deep learning neural network.
In the scheme, the chip system can process various tasks based on the deep learning neural network, and the functions of all sub-networks of the deep learning neural network can be rapidly realized through the method.
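One simple way to realize the preset mapping — the ith chipset implementing the ith sub-network — is a near-even contiguous split of the network's layers. The following sketch is an illustrative assumption, not the patent's prescribed method:

```python
def split_network_into_subnetworks(layers, n):
    """Partition a deep network's layers into n contiguous sub-networks,
    one per chipset, keeping the layer counts as even as possible."""
    base, extra = divmod(len(layers), n)
    subnetworks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread the remainder over early groups
        subnetworks.append(layers[start:start + size])
        start += size
    return subnetworks
```

In practice the split would weigh each layer's compute cost rather than just counting layers, but the contiguous structure — sub-network i feeding sub-network i+1 — is what lets each chipset own one kind of task.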
In a possible implementation manner, the ith data obtained after the ith chipset executes the ith task is stored in a memory of the ith chipset, where the ith data is used for parameter calculation in a back propagation process of the ith sub-network.
Optionally, the n kinds of tasks are data processing tasks in the back propagation process of the deep learning neural network.
In this scheme, the data obtained after a chipset executes its task in the forward propagation process is stored in its local memory, so that in the back propagation process the data can be read directly from local memory for parameter-correction calculations. This avoids fetching the data from elsewhere, saves the bandwidth resources of the chip system, and improves back propagation efficiency.
In a second aspect, the present application provides a chip system comprising n chip sets, wherein n is an integer greater than 1, each of the chip sets comprising one or more chips; the chip system is used for processing n kinds of tasks; presetting and executing an ith type of task by an ith chipset, wherein the value of the i is an integer from 1 to n; the n chip sets comprise a first chip set for performing the method of any of the first aspects.
In a third aspect, the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the method according to any one of the first aspects.
It will be appreciated that both the second and third aspects above are used to perform the method provided in any one of the first aspects. Therefore, for the advantages they achieve, reference may be made to the advantages of the corresponding method, which are not repeated here.
Drawings
The drawings that are required to be used in the embodiments of the present application will be described below.
Fig. 1 to fig. 3A are schematic diagrams of a chip system according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a chip structure according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a data processing task provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of batch processing data by a system-on-chip provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the internal structure of a chipset according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a data queue to be processed in a chipset according to an embodiment of the present application;
FIG. 10 is a schematic diagram of resource utilization of a chip system according to an embodiment of the present application;
FIG. 11 is a schematic diagram of forward and reverse propagation of a system-on-chip provided by an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
First, the chip system provided by the embodiment of the application is described. The chip system can be composed of a plurality of chips; for example, reference may be made to the chip systems shown in fig. 1, fig. 2 and fig. 3. Each of these takes 8 chips as an example, and in a specific implementation the number of chips included in the chip system is not limited by the embodiment of the present application.
Fig. 1 shows a chip system 100 in which the chips are connected in a ring structure, that is, the 8 chips are connected end to end to form a ring; the specific connections can be seen in fig. 1. For example, data to be processed may be input into the chip system 100 from chip 0, and the data obtained after processing by the chip system 100 may be output from chip 7.
Fig. 2 shows a chip system 200 in which the chips are connected in a mesh structure, i.e. each chip is connected to its adjacent chips. For example, chip 1 may be connected to chips 0, 2 and 6; the specific connections of the other chips can be seen in fig. 2. Illustratively, the data to be processed may be input into the chip system 200 from chip 0, and the data obtained after processing by the chip system 200 may be output from chip 7.
Fig. 3 shows a chip system 300, wherein chips within the chip system 300 are connected in a cubic structure, see in particular fig. 3. It should be noted that, in a specific implementation, the 8 chips of the chip system 300 may be disposed on a planar substrate (e.g., a silicon substrate), but the connection manner between the chips maintains the connection manner of the cubic structure shown in fig. 3. Illustratively, the data to be processed may be input from the chip 0 to the chip system 300, and the data obtained after the processing by the chip system 300 may be output from the chip 7.
In one possible implementation, a plurality of chips in the chip system provided by the embodiment of the present application may be divided into n chip sets, where n is an integer greater than 1. Each chipset may include one or more chips. In a specific implementation, the chip system may process data in a model parallel manner, and then each chipset may be assigned a task for fixedly executing a certain portion of the entire data processing task, which will be described in detail later, and which will not be described in detail herein.
For example, referring to fig. 1, the 8 chips of the chip system 100 in fig. 1 are divided into 4 chipsets, each including 2 chips. Specifically, chip 0 and chip 1 constitute chipset 1, chip 2 and chip 3 constitute chipset 2, chip 4 and chip 5 constitute chipset 3, and chip 6 and chip 7 constitute chipset 4.
For example, referring to fig. 2, the 8 chips of the chip system 200 in fig. 2 are divided into 4 chipsets, wherein chip 0 and chip 1 constitute chipset 1, chip 2 and chip 3 constitute chipset 2, chip 4 and chip 5 constitute chipset 3, and chip 6 and chip 7 constitute chipset 4.
For example, referring to fig. 3, the 8 chips of the chip system 300 in fig. 3 are divided into 3 chipsets, wherein chip 0 and chip 1 constitute chipset 1, chip 2 and chip 3 constitute chipset 2, and chip 4, chip 5, chip 6 and chip 7 constitute chipset 3.
In the embodiment of the application, the number of the chip groups obtained by dividing the chips in the chip system is not limited, and the number of the chips included in each chip group is not limited, and the chip groups can be particularly divided according to actual conditions.
In a possible implementation manner, the chip system provided by the embodiment of the present application may also be the chip system 3A00 shown in fig. 3A. The chip system 3A00 illustratively includes 16 chips, and every two chips may form a chipset (e.g., each rectangular dashed frame in the figure is a chipset). In a specific implementation, any number of chips may form a chipset, which is not limited by the embodiment of the present application.
To facilitate understanding of the chip architecture in the chip system described above, fig. 4 schematically illustrates a schematic structure of a chip 400. Chip 400 may be any of the chips described above in the system-on-chip. Chip 400 may include a processor 401, memory 402, static storage 403, and communication ports 404. Wherein the processor 401, the memory 402, the static storage 403 and the communication port 404 are interconnected by a bus 405.
Processor 401 may be responsible for the operation and management of various process flows in chip 400. By way of example, the processor 401 may be a central processing unit, a neural network processor, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A processor may also be a combination that performs a computational function, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and so forth.
The memory 402 may be used to temporarily store data required for the operation of the processor 401. The memory 402 may be, for example, a double data rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM), which may be referred to simply as DDR.
Static storage 403 is used to store computer programs and data. By way of example, static storage 403 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), etc.
The communication ports 404 include a transmitting port and a receiving port, and the number of the communication ports 404 may be plural, so as to support the chip 400 to perform communication, for example, to receive or transmit data or messages.
It should be noted that, the chip system and the chip structure provided by the embodiments of the present application are not limited to the foregoing description, and the foregoing description is merely an example, and does not limit the embodiments of the present application.
In a specific implementation, the multiple chip sets in the chip system can cooperatively process a single large-scale data processing task in a model parallel mode, but how to realize efficient cooperation of the multiple chip sets so as to fully utilize the computing resources of each chip to improve the processing performance of the chip system is a technical problem to be solved urgently.
Specifically, the chip system may be used to perform a large data processing task that includes a plurality of processing steps. Because of the large data volume of such a task, a single chipset cannot complete it alone, but the task can be performed in a model-parallel fashion: the data processing task is divided into a plurality of subtasks, and each chipset only needs to execute one fixed subtask, so that a single chipset can handle its share with its available computing resources. In addition, when a single chipset fixedly completes a certain subtask and batch data needs to be processed, the task parameters read into its memory can be reused for the whole batch rather than being reloaded for each piece of data, which reduces memory bandwidth occupation and improves memory bandwidth utilization.
For convenience of description, each of the subtasks divided above may be referred to as a kind of task; the data processing task is thus divided into a plurality of kinds of tasks. Illustratively, the data processing task may be divided into n kinds of tasks, and each of the n chipsets of the chip system described above executes one of the n kinds. Specifically, the ith kind of task is preset to be executed by the ith chipset, where i takes integer values from 1 to n. For ease of understanding, an example follows.
See, for example, fig. 6, which illustrates a model of a data processing task. The data processing task may comprise a plurality of processing steps (denoted by L), illustrated here by 32 processing steps (L1 to L32). Assuming that the chip system includes four chipsets, the 32 processing steps can be divided into four kinds of tasks: L1 to L8 form task kind 1, L9 to L17 form task kind 2, L18 to L27 form task kind 3, and L28 to L32 form task kind 4. Illustratively, the steps may be grouped based on the magnitude of their workloads so that the workloads of the four kinds of tasks are as nearly equal as possible; the embodiment of the application does not limit the specific division manner. Note that the numbers of the processing steps in fig. 6 do not constrain the order in which the steps are processed — the numbering serves only as a label, not as an execution order. For example, L17 may be performed before or after L18; the embodiment of the application is not limited in this respect.
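A workload-aware grouping like the one fig. 6 illustrates might be sketched as a simple greedy contiguous partition. This is an illustrative assumption — the patent does not prescribe a division algorithm:

```python
def partition_steps_by_workload(workloads, n):
    """Split processing steps L1..Lk into n contiguous kinds of tasks so
    that the per-kind workload is as even as possible (greedy sketch)."""
    target = sum(workloads) / n
    kinds, current, acc = [], [], 0.0
    remaining_groups = n
    for idx, w in enumerate(workloads):
        current.append(idx + 1)  # step numbers are 1-based (L1, L2, ...)
        acc += w
        steps_left = len(workloads) - idx - 1
        # close this kind once it reaches the average workload, but leave
        # at least one step for each remaining kind
        if remaining_groups > 1 and acc >= target and steps_left >= remaining_groups - 1:
            kinds.append(current)
            current, acc = [], 0.0
            remaining_groups -= 1
    kinds.append(current)
    return kinds
```

With equal per-step workloads and n = 4, this yields the even 8/8/8/8 split; with uneven workloads the group boundaries shift, which is how groupings like L9–L17 (9 steps) versus L28–L32 (5 steps) can arise.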
For convenience of the following description, task kind 1 may be denoted task 1, task kind 2 denoted task 2, task kind 3 denoted task 3, and task kind 4 denoted task 4. Dividing the data processing task into four kinds of tasks is only an example; in a specific implementation the data processing task can be divided into any number of kinds, and the number of kinds is not limited in the embodiment of the present application.
The four kinds of tasks obtained by the division can be correspondingly distributed to the four chipsets in the chip system for execution. Take the chip system of fig. 1 as an example: it comprises four chipsets, namely chipset 1, chipset 2, chipset 3 and chipset 4. Then task 1 may be assigned to chipset 1 for execution, task 2 to chipset 2, task 3 to chipset 3, and task 4 to chipset 4.
The task of a kind is allocated to a chipset for execution, and a computer program and parameters for executing the task of the kind may be configured in the chipset in advance, so that the chipset may call the computer program and parameters to process the received data after receiving the data to be processed.
When there is a large amount of data to be processed, the data can be input into the chip system in batches. Assuming that m pieces of data are to be processed, the m pieces may be divided into β batches of α pieces each, i.e., m = α × β; this α may be referred to as the batch size. The data is then input batch by batch into the chip system for processing. For ease of understanding, an example follows.
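The batching arithmetic m = α × β can be expressed directly; `split_into_batches` is a hypothetical helper, and m is assumed to be an exact multiple of α:

```python
def split_into_batches(data, batch_size):
    """Divide m pieces of data into β batches of α = batch_size pieces
    each, i.e. m = α × β (m is assumed to be a multiple of α)."""
    m = len(data)
    assert m % batch_size == 0, "m must equal α × β"
    return [data[i:i + batch_size] for i in range(0, m, batch_size)]
```

For example, m = 12 pieces with batch size α = 4 yields β = 3 batches.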
See, for example, fig. 7. Take a chip system including four chipsets as an example: chipset 1 executes task 1, chipset 2 executes task 2, chipset 3 executes task 3, and chipset 4 executes task 4. Referring first to fig. 7 (a), there are batches of data waiting to be processed, denoted B1, B2, B3, …, Bβ. The β batches are input into the chip system in sequence and processed in turn by chipset 1, chipset 2, chipset 3 and chipset 4 to obtain the final processed data, which is then output. The β batches of finally processed data may be denoted B'1, B'2, B'3, …, B'β.
Referring to fig. 7 (b), the flow of the chipsets processing data in parallel is described by way of example. The first data input into the chip system is the first batch B1. B1 is first input into chipset 1; after chipset 1 completes the task-1 processing of B1, the once-processed first batch B11 is obtained. B11 is then input into chipset 2 for task-2 processing. In addition, after chipset 1 sends B11 to chipset 2, it receives the second batch B2 for processing.
After chipset 2 performs task 2 on B11, the twice-processed first batch B111 is obtained and sent to chipset 3. In addition, while chipset 2 is processing B11, chipset 1 performs task 1 on the second batch B2 to obtain the once-processed second batch B21. After chipset 2 sends B111 to chipset 3, it receives B21 from chipset 1 for processing. After chipset 1 sends B21 to chipset 2, it receives the third batch B3 for processing.
After chipset 3 performs task 3 on B111, the thrice-processed first batch B1111 is obtained and sent to chipset 4. In addition, while chipset 3 is processing B111, chipset 2 performs task 2 on B21 to obtain the twice-processed second batch B211, and chipset 1 performs task 1 on B3 to obtain the once-processed third batch B31. After chipset 3 sends B1111 to chipset 4, it receives B211 from chipset 2 for processing. After chipset 2 sends B211 to chipset 3, it receives B31 from chipset 1 for processing. After chipset 1 sends B31 to chipset 2, it receives the fourth batch B4 for processing.
After chipset 4 performs task 4 on B1111, the finally processed first batch B'1 is obtained and output. In addition, while chipset 4 is processing B1111, chipset 3 performs task 3 on B211 to obtain the thrice-processed second batch B2111, chipset 2 performs task 2 on B31 to obtain the twice-processed third batch B311, and chipset 1 performs task 1 on B4 to obtain the once-processed fourth batch B41. After chipset 4 outputs B'1, it receives B2111 from chipset 3 for processing. After chipset 3 sends B2111 to chipset 4, it receives B311 from chipset 2 for processing. After chipset 2 sends B311 to chipset 3, it receives B41 from chipset 1 for processing. After chipset 1 sends B41 to chipset 2, it receives the fifth batch B5 for processing.
The above example, in connection with fig. 7 (b), describes the flow of the chipsets processing data in parallel; the flow for obtaining B'2, B'3, …, B'β is analogous to that for obtaining B'1 and is not repeated here.
The flow of processing data in parallel by the chipset described above in connection with fig. 7 (b) is only an example and is not meant to limit the embodiments of the present application.
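The staggered fill-and-drain pattern of fig. 7 (b) amounts to a simple time-step schedule: at step t, chipset i (1-based) works on batch t − i + 1 when that batch exists. A hypothetical sketch:

```python
def pipeline_schedule(num_batches, num_stages):
    """Time-step schedule for the fig. 7 pipeline: at 0-based step t,
    0-based stage s processes batch t - s, when that batch exists;
    otherwise the stage is idle (pipeline fill or drain)."""
    schedule = []
    for t in range(num_batches + num_stages - 1):
        active = {}
        for stage in range(num_stages):
            batch = t - stage
            if 0 <= batch < num_batches:
                active[f"chipset {stage + 1}"] = f"B{batch + 1}"
        schedule.append(active)
    return schedule
```

For 5 batches and 4 chipsets, the fourth step (t = 3) is the first fully loaded one — all four chipsets busy at once — which is the steady state that fig. 7 (b) builds up to, and the final steps show the drain as later batches leave the pipeline.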
In a possible embodiment, a chipset in the above chip system is not limited to sending its processed data to a single chipset; it may send the processed data to one or more chipsets. For example, after chipset 2 performs task 2 on B11 to obtain the twice-processed first batch B111, it may send B111 to chipset 3 and also to chipset 4. Chipset 4 may then, for example, perform task-4 processing on this B111 together with the later-received B1111 to obtain the above-mentioned B'1. The description here is illustrative only and is not to be construed as limiting the embodiments of the application.
In the above description, although the data processing task is divided into n kinds of tasks in such a way that the workloads of the n kinds of tasks are as even as possible, an absolutely even division cannot be achieved. This makes the task amounts among the chipsets uneven, so that some chipsets are busy while others are idle. In this case, the computing resources of the relatively idle chipsets are wasted, and the processing efficiency of the entire chip system is reduced. In addition, because a model-parallel data processing mode is adopted, the chipsets processing the earlier steps are busy at the beginning, while a chipset processing a later step can only perform further processing after receiving data processed in the previous step; it is therefore in a waiting state, and the computing resources of the waiting chipset are wasted, which reduces the processing efficiency of the chip system.
In order to fully utilize computing resources in a chip set and improve the processing efficiency of a chip system, the embodiment of the application provides a load balancing method in the chip system. Referring to fig. 5, the load balancing method in the chip system provided by the embodiment of the application includes, but is not limited to, the following steps:
S501, the first chip set obtains first data to be processed from the second chip set based on task scheduling information; the task scheduling information is used to indicate a kind of task to be executed by the first chipset after task scheduling.
The first chipset and the second chipset may be, for example, chipsets in the chip system described above.
In one possible implementation, the first chipset and the second chipset may be adjacently connected chipsets. Adjacently connected chipsets are two chipsets that can communicate directly, without other chipsets relaying the communicated data. For example, in the chip system shown in fig. 1, chipset 1 is connected adjacent to chipset 2 and chipset 4, respectively; chipset 2 is connected adjacent to chipset 1 and chipset 3, respectively; and chipset 3 is connected adjacent to chipset 2 and chipset 4, respectively. For example, in the chip system shown in fig. 2 described above, chipset 2 is connected adjacent to chipset 1, chipset 3, and chipset 4, respectively, and the other chipsets are connected similarly. For example, in the chip system shown in fig. 3 described above, chipset 1, chipset 2, and chipset 3 are connected adjacent to each other.
In another possible embodiment, the first chipset and the second chipset may be non-adjacently connected chipsets. For example, in the chip system shown in fig. 1 described above, chip set 1 and chip set 3 are non-adjacently connected chip sets, and chip set 2 and chip set 4 are non-adjacently connected chip sets. For example, in the chip system shown in fig. 2 described above, chip set 1 and chip set 3 are non-adjacently connected chip sets.
Based on the foregoing, it can be seen that, among n chipsets in the chip system, the i-th chipset is preset to perform the i-th kind of task. In order to further achieve load balancing among the n chip sets, task scheduling information may be configured in advance for each chip set, so that when a data processing task is specifically executed, each chip set may dynamically execute tasks of itself or other chip sets according to the task scheduling information configured in advance.
The task scheduling information previously configured for each chipset may include information of the kind of task performed by the chipset after task scheduling. In a possible implementation, the task scheduling information may further include the number of times each task category is executed by the chip after task scheduling. The data processed by different execution times of the same kind of task is different.
Illustratively, taking the above-mentioned chipset 1 and chipset 2 as an example, there are multiple batches (batch) of data to be processed in chipset 1, and each batch of data needs to be processed by task 1. If the task scheduling information in chipset 2 indicates that chipset 2 performs task 1 of chipset 1 twice, chipset 2 may obtain two batches of data to be processed from chipset 1 based on its task scheduling information, and then perform task 1 on the two batches. In this process, chipset 2 performs task 1 twice; that is, performing a kind of task on one batch of data counts as performing that kind of task once. In this example, the granularity at which one chipset obtains data from another chipset is a batch. In another possible implementation, the granularity at which one chipset acquires data from another may be part of one batch; that is, performing a kind of task on part of an acquired batch may also count as performing that kind of task once. The embodiment of the application does not limit the granularity of the data processed by one execution of a certain kind of task. For convenience of description, the embodiment of the application takes the granularity of a batch as an example.
To facilitate understanding of the task scheduling information, taking the above-described chipset 1 as an example, the task scheduling information of chipset 1 may be listed as follows: [task1: 16, task2: 5, task3: 0, task4: 0]. This task scheduling information indicates that chipset 1 executes task 1 sixteen times, executes task 2 five times, and executes task 3 and task 4 zero times. Alternatively, the task scheduling information may indicate only the kinds and numbers of tasks that need to be performed; for example, the task scheduling information [task1: 16, task2: 5, task3: 0, task4: 0] may be expressed as [task1: 16, task2: 5], which likewise instructs chipset 1 to execute task 1 sixteen times and task 2 five times. It should be noted that this representation of the task scheduling information is only an example and does not limit the embodiments of the present application. The task kinds may also be represented by other identifiers, which the embodiments of the present application do not limit.
In one possible implementation, the task scheduling information of the chipset may only instruct the chipset to perform the task type that is preset to be performed by the chipset, i.e. without helping other chipsets to perform the task.
In a possible implementation, when a certain kind of task only needs to be executed once, the task scheduling information may omit the number of executions, and the chipset defaults to executing that kind of task once. For example, if the task scheduling information of chipset 1 is represented as [task1: 16, task2], it indicates that chipset 1 executes task 1 sixteen times and task 2 once.
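As a sketch, the compact forms above can be expanded into explicit per-kind execution counts. The dict encoding and the `normalize` helper below are assumptions for illustration only, not a format defined by the patent:

```python
def normalize(schedule, n_kinds=4):
    """Expand a compact task-scheduling-information form such as
    {"task1": 16, "task2": 5} into explicit per-kind execution counts.
    A value of None means the count was omitted, i.e. 'execute once by
    default'; a missing kind means 'execute zero times'."""
    full = {}
    for i in range(1, n_kinds + 1):
        count = schedule.get(f"task{i}", 0)   # omitted kind -> 0 times
        full[f"task{i}"] = 1 if count is None else count
    return full

# [task1: 16, task2: 5, task3: 0, task4: 0] and its two compact forms:
print(normalize({"task1": 16, "task2": 5, "task3": 0, "task4": 0}))
print(normalize({"task1": 16, "task2": 5}))     # omitted kinds -> 0 times
print(normalize({"task1": 16, "task2": None}))  # [task1: 16, task2] -> task2 once
```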
In one possible implementation, the task scheduling information may be calculated by a chip or a chipset in the chip system. In another possible implementation, the task scheduling information may be calculated by another computing device (e.g., a computing device such as a server) independent of the chip system, and then the calculated task scheduling information is sent to the chipset of the chip system. Or transferred to a chipset in a chip system by copying and migration. The embodiment of the application does not limit the calculation main body of the task scheduling information and does not limit the mode of transferring the task scheduling information into the chip set. The manner in which the task scheduling information is calculated will be described later and will not be described in detail here.
In a possible implementation manner, the task scheduling information may be stored in advance in a memory of the first chipset. In another possible implementation, the task scheduling information may be downloaded locally from another device, such as a server, after the first chipset is powered up. The embodiments of the present application are not limited in this regard.
In a specific implementation, the task scheduling information of the first chipset instructs the first chipset to execute the first kind of task. Since the first kind of task is preset to be executed by the second chipset, the first chipset may acquire the first data to be processed from the second chipset. The first data to be processed may be data in a task queue cached in the second chipset. For example, the first data to be processed may include data of one or more batches.
In a possible implementation manner, if the task scheduling information of the first chipset indicates the number of times the first chipset performs the first kind of task, assuming the number of times is q, where q is an integer greater than 0, the first chipset may obtain q batches of data from the second chipset. That is, the first data to be processed may include q batches of data.
S502, the first chipset executes a first kind of task based on the first data to be processed to obtain processed first data; the first type of task is a task preset to be executed by the second chipset.
The first chipset obtains the first data to be processed from the second chipset, and may perform the first kind of task processing on the first data to be processed.
In a possible implementation manner, the first chipset has no task parameter for executing the task of the first kind, and then the first chipset may further obtain the task parameter of the task of the first kind from the second chipset. Then, processing of a first kind of task is performed on the first data to be processed based on the acquired task parameters.
In a possible implementation manner, the task parameters of the first type of task may be stored in the first chipset in advance, so after the first chipset acquires the first data to be processed, the task parameters do not need to be acquired from the second chipset, and the first data to be processed may be processed by the first type of task based on the task parameters stored in advance.
In order to facilitate understanding of the flow of the first chipset to obtain the first data to be processed and the task parameters from the second chipset, the following description will be given.
See, for example, fig. 8. A control unit 801 and a memory 802 comprised by a chipset are exemplarily shown. The control unit 801 may be a control unit of a certain chip in the chipset, or may be a combination of control units of a plurality of chips included in the chipset, which is not limited by the embodiment of the present application. The control unit 801 may directly or indirectly invoke computing and memory resources of the entire chipset. The control unit 801 may be, for example, the processor 401 in fig. 4 described above.
The memory 802 may be a memory in a certain chip included in the chipset, or may be a combination of memories of a plurality of chips included in the chipset, i.e., may be regarded as one memory system. The memory 802 may include one or more of memory, static memory, or registers, for example. The memory 802 may be used to store indication information, task parameters, task description queues, and pending data queues.
The indication information may include pointer information and the queue length S of the task description queue and the data queue to be processed. The task description queue and the data queue to be processed have the same length; assuming the length is S, S may be an integer greater than 0. The queues operate according to first-in-first-out (FIFO) rules. The pointers include a head pointer and a tail pointer: the head pointer points to the heads of the task description queue and the data queue to be processed, and the tail pointer points to their tails, as can be seen by way of example in fig. 8. The head pointer describes the addresses of the heads of the task description queue and the pending data queue, and the tail pointer describes the addresses of their tails.
The indication information further includes task scheduling information of the chipset, and the task scheduling information may be referred to in the foregoing description, which is not repeated herein.
The task parameters refer to task parameters of the chipset for executing tasks of the kind preset to be executed by the chipset. For example, if the chipset 1 presets the task of the task 1, the task parameters stored in the chipset 1 are the task parameters of the task 1.
The task description queue includes descriptions of pieces of information about tasks that may instruct a chipset to complete execution of a corresponding kind of task. For example, description information of batch data to be processed, description information of a computer program executing tasks, and the like may be included. The description information may include, for example, a storage address corresponding to the batch data, a storage address of a task parameter, a storage address of data obtained after processing, and the like, and the description information about the task is not limited in the embodiment of the present application. The task description queue can be a description queue correspondingly created after the chip receives the batch data to be processed, and then the description information of one batch data is inserted into the queue after each batch data is received.
The pending data queue may include pending batch data. The queue of data to be processed may be a queue created correspondingly after the chip receives the batch data to be processed, and then one batch data is inserted into the queue every time one batch data is received.
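A minimal behavioral sketch of the paired queues of fig. 8, assuming a ring-buffer layout (the class and field names are illustrative assumptions): the task description queue and the data queue to be processed share the same length S, and their head and tail pointers move in step under FIFO rules.

```python
class BatchQueues:
    """Paired task description queue and to-be-processed data queue of
    equal length S, sharing one head pointer and one tail pointer."""
    def __init__(self, s):
        self.s = s                      # queue length S
        self.desc = [None] * s          # task description queue
        self.data = [None] * s          # to-be-processed data queue
        self.head = 0                   # head pointer (next batch to take)
        self.tail = 0                   # tail pointer (next free slot)
        self.count = 0

    def enqueue(self, description, batch):
        assert self.count < self.s, "queues full"
        self.desc[self.tail] = description
        self.data[self.tail] = batch
        self.tail = (self.tail + 1) % self.s   # both tails move in step
        self.count += 1

    def dequeue(self):
        assert self.count > 0, "queues empty"
        item = (self.desc[self.head], self.data[self.head])
        self.desc[self.head] = self.data[self.head] = None
        self.head = (self.head + 1) % self.s   # both heads move in step
        self.count -= 1
        return item

q = BatchQueues(4)
q.enqueue({"params_addr": "addr_of_task_params"}, "batch_1")
q.enqueue({"params_addr": "addr_of_task_params"}, "batch_2")
print(q.dequeue())   # FIFO: batch_1 and its task description come out first
```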
In one possible implementation, the indication information and the task parameters may be stored in a static memory, and the task description queue and the data queue to be processed may be stored in a memory. In another possible implementation, the indication information may be stored in a memory specifically configured for use by the control unit 801, the task parameters may be stored in a memory shared by a plurality of chips in the chipset, and the task description queue and the data queue to be processed may be stored in a memory. It should be noted that, the embodiment of the present application does not limit the specific storage location of the information in the chipset.
In one possible implementation, a distributed lock mechanism is provided in the chipset. By the distributed lock mechanism, resources in the chip sets can only be accessed and operated by one chip set (the chip set itself or another chip set) at a time, and the consistency of data in the second chip set is ensured. The first chipset may access and operate resources in the second chipset by acquiring a distributed lock (lock) of the second chipset.
Based on the description of fig. 8, in the process of the first chipset acquiring the first data to be processed and the task parameters from the second chipset, the first chipset may acquire the distributed lock of the second chipset first. The first chipset may then access a head pointer in the indication information stored in the second chipset memory based on the distributed lock. The task description of the corresponding batch data and batch data may be obtained based on the address pointed to by the head pointer, and the task parameters may be read from the memory of the second chipset based on the memory address of the task parameters indicated in the task description. After the first chipset has acquired the data, the head pointer in the second chipset may be modified such that the head pointer points to the current task description queue and the queue head of the data queue to be processed in the second chipset. The first chipset may then release the distributed lock of the second chipset.
Illustratively, the first chipset described above may modify the head pointer based on the number of batches obtained from the second chipset, the number of batches obtained being determined based on the task scheduling information of the first chipset; see in particular the description above. For example, assume that the control unit 801 and the memory 802 shown in fig. 8 described above are the control unit and the memory in the second chipset, and that the first chipset acquires data of two batches from the second chipset. Then, the first chipset may obtain the task descriptions of batch_1 and batch_2 from the task description queue of the second chipset, and obtain the data of batch_1 and batch_2 from the data queue to be processed. After the first chipset obtains the task descriptions and the data of batch_1 and batch_2, that data may be deleted from the queues of the second chipset. The first chipset may then modify the head pointer in the indication information of the second chipset memory to point to the storage address of the task description of batch_3 and the storage address of the data of batch_3.
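The fetch flow just described can be sketched as follows, with a `threading.Lock` standing in for the distributed lock and plain Python structures standing in for the second chipset's memory; all names are illustrative assumptions:

```python
import threading

def fetch_batches(second_chipset, q):
    """Take q batches (with their task descriptions) from the head of the
    second chipset's queues under its distributed lock."""
    with second_chipset["lock"]:                  # acquire distributed lock
        taken = []
        for _ in range(q):
            head = second_chipset["head"]
            taken.append((second_chipset["descs"][head],
                          second_chipset["batches"][head]))
            second_chipset["head"] = head + 1     # modify the head pointer
        return taken                              # lock released on exit

chipset2 = {"lock": threading.Lock(), "head": 0,
            "descs": ["d1", "d2", "d3"], "batches": ["b1", "b2", "b3"]}
print(fetch_batches(chipset2, 2))   # takes b1 and b2; head now names b3
```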
After the first chipset obtains the batch data to be processed, the task parameters and the task description from the second chipset, the first type of task processing can be performed on the batch data to be processed based on the information, and the processed first data is obtained.
The above describes the process of the first chipset obtaining data from the second chipset, in one possible embodiment, the first chipset may send the data to the second chipset and store the data in the memory of the second chipset. For ease of understanding, the following description is exemplary in connection with the description of FIG. 8 above. The control unit 801 and the memory 802 shown in fig. 8 are assumed to be the control unit and the memory in the second chipset. First, the first chipset obtains the distributed lock of the second chipset. Based on the distributed lock, the first chipset may modify a tail pointer in a second chipset memory.
Illustratively, the first chipset may modify the tail pointer based on the number of batches of data stored to the second chipset. For ease of understanding, reference may be made to fig. 9, which takes the data queue to be processed as an example. For example, assuming that the first chipset stores one batch of data to the second chipset, the tail pointer is shifted toward the tail of the queue by one length, changing from pointing to the storage address of the data of batch_S to pointing to the address of one unstored datum. The address of the unstored datum is an address applied for storing the data to be written. The tail pointer of the task description queue is modified in the same way; the tail pointers of the task description queue and the data queue to be processed may, for example, be modified simultaneously, i.e., moved synchronously.
After the first chipset modifies the tail pointer in the second chipset, it adds the task description of one batch at the address of the task description queue pointed to by the tail pointer, and sends the batch data to the storage address of the data queue to be processed pointed to by the tail pointer for storage. One batch is taken here as an example; a plurality of batches may be handled in a specific implementation, and the processing flow is the same and is not repeated here. After the first chipset stores the data into the second chipset, it releases the distributed lock of the second chipset, so that its access to and operation on the second chipset are finished.
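The counterpart store flow can be sketched in the same style, with the same stand-ins (a `threading.Lock` for the distributed lock, plain lists for the queues; all names are illustrative assumptions): acquire the second chipset's distributed lock, write the task description and batch data at the slots named by the tail pointer, advance the tail pointer by one length, and release the lock.

```python
import threading

def store_batch(second_chipset, description, batch):
    """Append one batch (and its task description) at the tail of the
    second chipset's queues under its distributed lock."""
    with second_chipset["lock"]:                  # acquire distributed lock
        tail = second_chipset["tail"]
        second_chipset["descs"].insert(tail, description)   # task description queue
        second_chipset["batches"].insert(tail, batch)       # pending data queue
        second_chipset["tail"] = tail + 1         # both tails move in step
    # distributed lock released: access to the second chipset is finished

chipset2 = {"lock": threading.Lock(), "tail": 1,
            "descs": ["d1"], "batches": ["batch_1"]}
store_batch(chipset2, "d2", "batch_2")
print(chipset2["batches"])   # batch_2 now sits at the queue tail
```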
In a possible implementation manner, the task scheduling information of the first chipset indicates that the first chipset performs the first kind of task but does not perform the second kind of task. The second kind of task is a task preset to be executed by the first chipset. The second kind of task is used for further processing the data processed by the first kind of task.
Then, after the first chipset obtains the data from the second chipset to execute the processing of the first kind of task and obtain the processed first data, the first chipset may add the processed first data to its own data queue to be processed, and add the task description of the processed first data to its own task description queue. Then, the processed first data may be acquired from the data queue to be processed by other chipsets in the chipset, a task description of the processed first data is acquired from the task description queue, task parameters of a second type of task are acquired, and processing of the second type of task is performed on the processed first data based on the acquired information.
In a possible implementation manner, the task scheduling information of the first chipset instructs the first chipset to perform the first kind of task and to perform the second kind of task.
Then, after the first chipset obtains the data from the second chipset to execute the processing of the first kind of task and obtain the processed first data, the first chipset may add the processed first data to its own data queue to be processed, and add the task description of the processed first data to its own task description queue. Then, the first chipset may obtain the processed first data from the to-be-processed data queue, obtain a task description of the processed first data from the task description queue, obtain task parameters of a second type of task, and perform processing of the second type of task on the processed first data based on the obtained information.
The first chipset may perform the second kind of task processing on the processed first data, and then may obtain processed second data. The first chipset may then send the processed second data to a third chipset in the chip system. The third chipset is pre-configured to perform a third class of tasks. The third kind of task may be for further processing the data processed by the second kind of task.
In a possible embodiment, the task scheduling information of the first chipset instructs the first chipset to perform the first kind of task, the second kind of task and the third kind of task.
Then, after the first chipset obtains the data from the second chipset to execute the processing of the first kind of task and obtain the processed first data, the first chipset may add the processed first data to its own data queue to be processed, and add the task description of the processed first data to its own task description queue. Then, the first chipset may obtain the processed first data from the to-be-processed data queue, obtain a task description of the processed first data from the task description queue, obtain task parameters of a second type of task, and perform processing of the second type of task on the processed first data based on the obtained information.
The first chipset may perform the second kind of task processing on the processed first data, and then may obtain processed second data. Then, optionally, the first chipset may obtain the task description and the task parameters for performing the task processing of the third kind from the third chipset. And then, processing the processed second data into a third kind of task based on the task description and the task parameters. Optionally, the first chipset sends the data obtained after the third kind of task processing to a fourth chipset in the chip system. The fourth chipset is preset to execute a fourth kind of task. The fourth kind of task may be for further processing the data processed by the third kind of task.
In a possible implementation manner, the task scheduling information of the first chipset instructs the first chipset to perform the third kind of task.
Then, the first chipset may obtain the data to be processed from the data to be processed queue in the third chipset, obtain the task description of the data to be processed from the task description queue in the third chipset, and obtain the task parameters of the third kind of task. Then, the first chipset performs a third kind of task processing on the data to be processed based on the acquired information, and obtains the processed data. The first chipset may then send the processed data to a fourth chipset.
In a possible implementation manner, the task scheduling information of the first chipset instructs the first chipset to execute the second kind of task preset to be executed by itself and execute the third kind of task.
Then, the first chipset may obtain the data to be processed from its own data queue to be processed, obtain the task description of the data to be processed from its own task description queue, and obtain the task parameters of the second kind of task. And processing the data to be processed by a second type of task based on the acquired information to obtain the processed data. Then, optionally, the first chipset may obtain the task description and the task parameters for performing the task processing of the third kind from the third chipset. And then, processing the processed data for a third kind of task based on the acquired task description and task parameters. Optionally, the first chipset sends the data obtained after the third kind of task processing to the fourth chipset.
In a possible implementation manner, the task scheduling information of the first chipset instructs the first chipset to execute a task of a second kind preset to be executed by the first chipset.
Then, the first chipset may obtain the data to be processed from its own data queue to be processed, obtain the task description of the data to be processed from its own task description queue, and obtain the task parameters of the second kind of task. And processing the data to be processed by a second type of task based on the acquired information to obtain the processed data. The first chipset may then optionally send the processed data to the third chipset.
To facilitate understanding of the relationship between the first chipset, the second chipset, the third chipset, and the fourth chipset described above, an illustration is made. Based on the foregoing, the chip system includes n chip sets. Then, the first chipset may be an i-th chipset of the n chipsets. The second chipset may be an i-1 th chipset of the n chipsets. The third chipset may be an i+1th chipset of the n chipsets. The fourth chipset may be an i+2th chipset of the n chipsets. The i+2 is less than or equal to n.
Illustratively, taking the chip system shown in fig. 1 or fig. 2 as an example, the first chipset may be, for example, chipset 2 in the chip system; the second chipset may be, for example, chipset 1; the third chipset may be, for example, chipset 3; and the fourth chipset may be, for example, chipset 4. The example given here is merely for convenience of understanding the present solution and is not to be construed as limiting the embodiments of the present application.
The above examples list several ways in which the task scheduling information of the first chipset instructs the first chipset to perform tasks. The several processing modes are as follows:
The task scheduling information of the first chipset instructs the first chipset to perform the first kind of task but not the second kind of task;
The task scheduling information of the first chipset indicates the first chipset to execute the first kind of task and execute the second kind of task;
the task scheduling information of the first chipset instructs the first chipset to execute the first kind of task, the second kind of task and the third kind of task;
The task scheduling information of the first chipset indicates the first chipset to execute the third kind of task;
The task scheduling information of the first chip set indicates the first chip set to execute a second type of task preset to be executed and execute the third type of task;
the task scheduling information of the first chipset indicates the first chipset to execute a task of a second kind preset to be executed by the first chipset.
In a possible implementation manner, the task scheduling information of the first chipset may instruct the first chipset to perform one or more of the several processing manners, which is not limited in the embodiment of the present application.
In one possible implementation, the above several processing manners are only examples, and do not limit the embodiments of the present application. The task scheduling information of the first chipset may also instruct the first chipset to perform processing of other kinds of tasks, for example. For example, the task scheduling information of the first chipset may also instruct the first chipset to perform processing of a fifth kind of task that is preset to be performed by a fifth chipset in the chip system. The fifth kind of task may be, for example, for further processing of the data processed by the fourth kind of task described above. In a specific implementation, the task scheduling information of the first chipset may indicate that the first chipset executes a task preset to be executed by any chipset in the chipset system, which is not limited in the embodiment of the present application.
In summary, in the present solution, the chipset in the chip system acquires data from other chipsets based on the preset task scheduling information to execute the tasks of the other chipsets, so as to help the chipsets share the load, thereby implementing load balancing of each chipset in the chip system, and improving the processing efficiency and performance of the whole chip system.
In one possible implementation manner, in addition to performing tasks based on the task scheduling information, the first chipset may perform tasks based on its own hardware logic. Specifically, the hardware logic is not controlled by the task scheduling information, and may instruct the first chipset to perform tasks independently of it.
In one possible implementation, the first chipset performs tasks based on the indication of task scheduling information with a higher priority than performing tasks based on hardware logic. That is, if the first chipset can obtain an instruction to perform a certain kind of task based on the task scheduling information during the same period of time, the first chipset may preferentially perform the task based on the instruction of the task scheduling information. And if the task scheduling information does not instruct the first chipset to perform the task during the period of time, the first chipset may perform the task based on the instruction of its hardware logic.
In one possible implementation, the hardware logic of the first chipset includes monitoring the data queue to be processed of the first chipset. If there is data to be processed in the data queue to be processed of the first chipset, and the task scheduling information does not instruct the first chipset to perform a task at this time, the hardware logic may instruct the first chipset to perform processing of the second kind of task on the data in the data queue to be processed.
In a possible implementation manner, if there is no data to be processed in the data-to-be-processed queue of the first chipset, and the task scheduling information does not instruct the first chipset to perform a task at this time, the hardware logic may instruct the first chipset to acquire data to be processed from other chipsets in the chip system and perform, on the acquired data, processing of the kinds of tasks corresponding to those other chipsets. That is, the first chipset shares the processing load of the other chipsets.
For example, if there is no data to be processed in the data-to-be-processed queue of the first chipset, and the task scheduling information does not instruct the first chipset to perform a task at this time, the hardware logic may instruct the first chipset to acquire data to be processed from the second chipset and perform processing of the first kind of task on the acquired data, thereby sharing the processing load of the second chipset.
Likewise, the hardware logic may instruct the first chipset to acquire data to be processed from the third chipset and perform processing of the third kind of task on the acquired data, thereby sharing the processing load of the third chipset.
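The priority scheme described above — task scheduling information first, then the chipset's own pending queue, and finally helping other chipsets — can be sketched in a few lines. This is a hedged illustration: `next_action`, the queue layout and the returned tuples are all hypothetical names, not interfaces defined by the application.

```python
from collections import deque

def next_action(schedule_instruction, own_queue, neighbor_queues):
    """Pick what a chipset does next, in the priority order described above:
    1. an explicit instruction from the task scheduling information;
    2. hardware logic: process the chipset's own pending-data queue;
    3. hardware logic: fetch pending data from another chipset's queue.
    """
    if schedule_instruction is not None:        # scheduling info has top priority
        return ("scheduled", schedule_instruction)
    if own_queue:                               # own preset kind of task
        return ("own_task", own_queue.popleft())
    for task_kind, queue in neighbor_queues.items():
        if queue:                               # share another chipset's load
            return ("help", task_kind, queue.popleft())
    return ("idle",)

# Example: no scheduling instruction and an empty own queue, so the
# chipset helps with task 1 from the second chipset's queue.
action = next_action(None, deque(), {"task1": deque(["batch0"])})
```

In this sketch, the second and third branches correspond to the fallback behaviour of the hardware logic when the task scheduling information is silent for the current period.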
In a specific implementation, the hardware logic may be implemented by a hardware module. The hardware modules may be, for example, integrated circuits, field programmable gate arrays or other programmable logic devices, transistor logic devices, hardware components or any combination thereof, and the like. The embodiment of the application does not limit the hardware module, so long as the hardware logic described above can be realized, which is within the protection scope of the embodiment of the application.
The first chipset performs tasks in combination with the task scheduling information and the hardware logic, so that the resource utilization of the chipset can be further improved. In the chip system, each chipset can combine the task scheduling information and hardware logic to execute tasks, so that the resource utilization of the chip system can be further improved. Illustratively, referring to fig. 10, fig. 10 shows a resource utilization schematic of the chip system performing tasks under various conditions. In fig. 10, the horizontal axis represents the batch number of the data processed by the chip system, and the vertical axis represents the resource utilization. Among the utilization curves shown in fig. 10: curve ① represents the resource utilization in the case where each chipset in the chip system processes only its own preset task according to hardware logic; curve ② represents the case where each chipset processes its own preset task and the task of the second chipset according to hardware logic only; curve ③ represents the case where each chipset processes tasks only according to the indication of the task scheduling information; curve ④ represents the case where each chipset processes tasks according to the indication of the task scheduling information, and processes its own preset task and the task of the second chipset according to hardware logic; and curve ⑤ represents the case where each chipset processes tasks according to the indication of the task scheduling information, and processes its own preset task, the task of the second chipset and the task of the third chipset according to hardware logic.
From curve ① to curve ⑤, the resource utilization of the chip system becomes progressively higher. It can thus be seen that executing tasks in combination with the task scheduling information and the hardware logic can further improve the resource utilization of the chip system.
The way in which the task scheduling information is calculated is described below.
In one possible implementation manner, the task scheduling information of each of the n chipsets may be calculated based on the execution time of each kind of task and the number of batches that need to be processed. Illustratively, for ease of understanding, this is described below with reference to Tables 1 and 2.
TABLE 1
Table 1 shows information such as the preset data amounts of each chipset in the chip system before task scheduling. In Table 1, the preset executing chipset, the execution time and the number of batches to be processed corresponding to each kind of task can be seen. For example, task 1 is preset to be executed by chipset 1; the time required to execute task 1 once (for example, to perform task 1 processing on one batch of data) is t_1; and the amount of data to be processed by task 1 is β batches. Then, in the whole data processing process (the process in which the chip system processes the β batches of data), the time required by chipset 1 is β × t_1. Similarly, the time spent by chipset i during the whole data processing is β × t_i, where i takes values from 1 to n.
The values t_1, t_2, t_3, …, t_n above are known. Illustratively, they may be obtained through testing, or may be configured at the time of shipment, etc.; the embodiments of the present application do not limit this.
Because of the differences among t_1, t_2, t_3, …, t_n, the time spent by each chipset is not equal throughout the process. A chipset that spends a long time drags down the progress of the whole data processing process and reduces the processing efficiency of the chip system. For this reason, task scheduling can make the time spent by each chipset in the whole data processing process tend to be equal, thereby improving the processing efficiency of the chip system. Based on this idea, a chipset that takes less time can be scheduled to help a chipset that takes more time to perform certain tasks. For ease of understanding, see Table 2 for an example.
TABLE 2
Table 2 above shows the number of times each kind of task is executed by each chipset in the chip system after task scheduling. Illustratively, the number of times chipset i performs task j after task scheduling is denoted C_{i,j}, where i and j take integer values from 1 to n. Each C_{i,j} is an unknown variable that needs to be determined by calculation. Based on Table 1 above, the amount of data to be processed for task j is β batches, so the numbers of executions of task j distributed over all the chipsets must add up to β. In addition, the time T_i spent by chipset i in the whole data processing process after task scheduling can be calculated by combining Table 1 and Table 2. The following two equations can be derived from this:

C_{1,j} + C_{2,j} + … + C_{n,j} = β,  for j = 1, 2, …, n    (1)

T_i = C_{i,1}·t_1 + C_{i,2}·t_2 + … + C_{i,n}·t_n,  for i = 1, 2, …, n    (2)
Then, C_{i,j} can be calculated from equation (2) under the conditions that equation (1) is satisfied and that T_1, T_2, T_3, …, T_n tend to be as equal as possible. Since the number of unknown variables exceeds the number of equations, there may be a plurality of optimal solutions. After a plurality of solutions are obtained through calculation, one of them can be selected for use; the embodiment of the application does not limit which solution is selected. After C_{i,j} is calculated, the kinds of tasks executed by each chipset after task scheduling and the number of times each is executed can be obtained, thereby obtaining the task scheduling information of each chipset.
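The application leaves the solving method for equations (1) and (2) open. As an illustration only, a simple greedy heuristic — assign each of the β batches of each kind of task to the currently least-loaded chipset, most expensive tasks first — produces a C_{i,j} that satisfies equation (1) exactly and makes the T_i of equation (2) nearly equal. The function name and data layout below are hypothetical, not the patent's prescribed solver.

```python
import heapq

def balance(t, beta):
    """Greedy sketch for equations (1)-(2): C[i][j] = number of times
    chipset i executes task j. Every column of C sums to beta (eq. (1)),
    and each batch goes to the chipset with the smallest running T_i."""
    n = len(t)
    C = [[0] * n for _ in range(n)]
    loads = [(0.0, i) for i in range(n)]         # (current T_i, chipset index)
    heapq.heapify(loads)
    for j in sorted(range(n), key=lambda k: -t[k]):  # longest tasks first
        for _ in range(beta):
            T_i, i = heapq.heappop(loads)
            C[i][j] += 1
            heapq.heappush(loads, (T_i + t[j], i))
    return C
```

With this heuristic the spread max(T_i) − min(T_i) is bounded by the largest single-task time, which is the same "as equal as possible" goal that the equation-based formulation expresses.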
In a possible implementation manner, the above C_{i,j} may also be calculated in another manner, likewise based on the execution time of each kind of task and the number of batches that need to be processed. Specifically, the time taken to pass one batch of data through all n kinds of tasks can be calculated from the execution times in Table 1 as T_batch = t_1 + t_2 + … + t_n. Then, the proportion of the time taken to execute task i once to the time taken to execute one batch of data is P_i = t_i / T_batch. The amount of data processed by chipset i during the entire data processing, measured in batches, can be obtained in combination with Table 2 as D_i = C_{i,1}·P_1 + C_{i,2}·P_2 + … + C_{i,n}·P_n. Theoretically, the ideal state is that D_i equals β/n, which means that the amount of data processed by each chipset is equal during the whole data processing, so that the time spent by each chipset is also equal, thereby fully utilizing the computing resources of each chipset. However, this ideal state is generally not achieved exactly; therefore, D_i only needs to be as close to β/n as possible in the calculation. In addition, in this way of calculation, equation (1) above must still be satisfied. The following two equations can be derived from this:

C_{1,j} + C_{2,j} + … + C_{n,j} = β,  for j = 1, 2, …, n    (3)

D_i = C_{i,1}·P_1 + C_{i,2}·P_2 + … + C_{i,n}·P_n,  for i = 1, 2, …, n    (4)
Then, C_{i,j} can be calculated from equation (4) under the conditions that equation (3) is satisfied and that D_1, D_2, D_3, …, D_n tend to β/n as much as possible. Since the number of unknown variables exceeds the number of equations, there may be a plurality of optimal solutions. After a plurality of solutions are obtained through calculation, one of them can be selected for use; the embodiment of the application does not limit which solution is selected. After C_{i,j} is calculated, the kinds of tasks executed by each chipset after task scheduling and the number of times each is executed can be obtained, thereby obtaining the task scheduling information of each chipset.
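The quantities in equations (3) and (4) can be evaluated directly for any candidate assignment. The sketch below uses a small hypothetical two-chipset example (t_1 = 1, t_2 = 3, β = 4, so P = [0.25, 0.75]) to show that the D_i always sum to β while individually only approximating the ideal β/n:

```python
def data_share(C, t):
    """Equation (4): T_batch = sum(t_j), P_j = t_j / T_batch,
    D_i = sum_j C[i][j] * P_j (data processed by chipset i, in batches)."""
    T_batch = sum(t)
    P = [tj / T_batch for tj in t]
    return [sum(c * p for c, p in zip(row, P)) for row in C]

# Hypothetical example: chipset 1 runs task 1 four times and task 2 once;
# chipset 2 runs task 2 three times. Each column sums to beta = 4 (eq. (3)).
C = [[4, 1],
     [0, 3]]
D = data_share(C, [1.0, 3.0])   # D = [1.75, 2.25]
```

Here the totals sum to β = 4, and the scheduling calculation would try to push both D values toward β/n = 2.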
In one possible implementation, beyond the processing manners listed above that the task scheduling information indicates for the first chipset, the first chipset may only help the chipsets adjacently connected with it to share the processing load, in addition to performing its own preset task. In this case, the calculation of the task scheduling information follows the same concept as the calculation manners described above, namely achieving load balancing of each chipset, but the specific calculation process differs. The following is an exemplary illustration.
For ease of understanding, the chip system 100 shown in fig. 1 above is taken as an example. The chip system 100 includes chipset 1, chipset 2, chipset 3 and chipset 4. Chipset 1 is adjacently connected with chipset 2 and chipset 4; chipset 2 with chipset 1 and chipset 3; chipset 3 with chipset 2 and chipset 4; and chipset 4 with chipset 3 and chipset 1. Table 3 can then be obtained based on the rule that each chipset only helps its adjacently connected chipsets to share the processing load, in addition to performing its own preset task.
TABLE 3
Similarly, the amount of data to be processed for each kind of task is β, and each kind of task can only be executed by its preset chipset and that chipset's two adjacently connected chipsets, so the following equation set can be obtained:

C_{1,1} + C_{2,1} + C_{4,1} = β
C_{1,2} + C_{2,2} + C_{3,2} = β
C_{2,3} + C_{3,3} + C_{4,3} = β
C_{1,4} + C_{3,4} + C_{4,4} = β    (5)
In addition, the time spent by each chipset during the whole data processing after task scheduling can be calculated by combining Tables 1 and 3, thereby obtaining the following equation set:

T_1 = C_{1,1}·t_1 + C_{1,2}·t_2 + C_{1,4}·t_4
T_2 = C_{2,1}·t_1 + C_{2,2}·t_2 + C_{2,3}·t_3
T_3 = C_{3,2}·t_2 + C_{3,3}·t_3 + C_{3,4}·t_4
T_4 = C_{4,1}·t_1 + C_{4,3}·t_3 + C_{4,4}·t_4    (6)
Then, the unknown variables in Table 3 above can be calculated from equation set (6) under the conditions that equation set (5) is satisfied and that T_1, T_2, T_3 and T_4 are as equal as possible. Since the number of unknown variables exceeds the number of equations, there may be a plurality of optimal solutions. After a plurality of solutions are obtained through calculation, one of them can be selected for use; the embodiment of the application does not limit which solution is selected. After the solution is calculated, the kinds of tasks executed by each chipset after task scheduling and the number of times each is executed can be obtained, thereby obtaining the task scheduling information of each chipset.
In one possible implementation, the unknown variables in Table 3 may be calculated in another way, likewise based on the execution time of each kind of task and the number of batches that need to be processed. Referring to equation set (4) above, the following equation set may be obtained:

D_1 = C_{1,1}·P_1 + C_{1,2}·P_2 + C_{1,4}·P_4
D_2 = C_{2,1}·P_1 + C_{2,2}·P_2 + C_{2,3}·P_3
D_3 = C_{3,2}·P_2 + C_{3,3}·P_3 + C_{3,4}·P_4
D_4 = C_{4,1}·P_1 + C_{4,3}·P_3 + C_{4,4}·P_4    (7)
Reference may be made to the foregoing description for the physical meaning of D_1, D_2, D_3, D_4, P_1, P_2, P_3 and P_4 in equation set (7); details are not repeated here. Then, the unknown variables in Table 3 above can be calculated from equation set (7) under the conditions that equation set (5) is satisfied and that D_1, D_2, D_3 and D_4 tend to β/4 as much as possible. Since the number of unknown variables exceeds the number of equations, there may be a plurality of optimal solutions. After a plurality of solutions are obtained through calculation, one of them can be selected for use; the embodiment of the application does not limit which solution is selected. After the solution is calculated, the kinds of tasks executed by each chipset after task scheduling and the number of times each is executed can be obtained, thereby obtaining the task scheduling information of each chipset.
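The neighbor-only constraint of equation set (5) amounts to restricting which chipsets may execute each kind of task. The sketch below extends the same greedy idea with an "allowed" mask for the ring topology of the chip system 100; it is illustrative only, and not the application's prescribed solver.

```python
def balance_ring(t, beta, allowed):
    """Greedy assignment where task j may only run on the chipsets in
    allowed[j] (itself plus its two ring neighbours, per eq. set (5))."""
    n = len(t)
    C = [[0] * n for _ in range(n)]
    T = [0.0] * n
    for j in sorted(range(n), key=lambda k: -t[k]):   # longest tasks first
        for _ in range(beta):
            i = min(allowed[j], key=lambda c: T[c])   # least-loaded permitted helper
            C[i][j] += 1
            T[i] += t[j]
    return C, T

# Ring of fig. 1 (0-based indices): task j is allowed on chipset j
# and its two adjacently connected chipsets.
allowed = [{(j - 1) % 4, j, (j + 1) % 4} for j in range(4)]
```

Every batch still lands somewhere, so each column of C sums to β, and no batch is ever placed on a chipset outside the adjacency mask.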
Illustratively, the embodiment of the present application performs a simulation experiment based on the above equation set (5) and equation set (6). In one possible implementation, the values β = 64, α = 16, t_1 = 0.1446 cycles, t_2 = 0.32629 cycles, t_3 = 0.2611 cycles and t_4 = 0.268 cycles may be taken, where "cycles" denotes clock cycles; in the simulation experiment, one second includes 100 million clock cycles. The results of the simulation calculation can be seen in Table 4.
TABLE 4
The kinds of tasks performed by each chipset and the number of times each is performed, obtained through the simulation experiment, are shown in Table 4. It can be seen that each kind of task is executed 64 times in total and can be assigned to one or more chipsets for execution. Table 4 also shows the execution time (in cycles) of each chipset, calculated from the kinds of tasks executed by each chipset and their numbers of executions obtained by the simulation. It can be seen that the execution times of the chipsets tend toward the average, with the greatest execution-time overhead belonging to chipset 2, which spends 16.119 cycles; these 16.119 cycles are therefore the execution time of the entire chip system. In addition, the ideal execution time of each chipset is 16 cycles. The resource utilization of the chip system can be expressed as the ratio of the ideal execution time to the time spent by the chipset with the greatest time overhead, i.e. 16/16.119 ≈ 0.993. It can be seen that the utilization rate of the chip system obtained by simulation is very high, which means that the computing performance of the chip system is very good. It should be noted that the contents shown in Table 4 are only data obtained from one possible simulation experiment and do not limit the embodiments of the present application.
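The 16-cycle ideal quoted above follows directly from the simulation constants: the four per-task times sum to roughly one cycle per batch, and 64 batches divided evenly over 4 chipsets give about 16 cycles each. A quick arithmetic check, using only the constants stated in the text:

```python
t = [0.1446, 0.32629, 0.2611, 0.268]   # t1..t4 from the simulation, in cycles
beta, n = 64, 4

T_batch = sum(t)                # time for one batch through all four kinds of tasks
ideal = beta * T_batch / n      # perfectly balanced per-chipset time
utilization = ideal / 16.119    # ratio against the slowest chipset's simulated time
# T_batch is ~1 cycle, so ideal is ~16 cycles and utilization is close to 1.
```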
The above description merely takes the chip system 100 shown in fig. 1 as an example and is not to be construed as limiting the embodiments of the present application. The calculation manner for other chip systems (such as the chip systems shown in fig. 2, 3 or 3A) may refer to the above description and is not repeated here.
In a possible implementation manner, the data processing task may be a data processing task of a deep learning neural network. Based on the above description, the data processing task includes a plurality of processing steps (see the exemplary description of fig. 6 above), and each of the plurality of processing steps may be the processing step of one network layer of the deep learning neural network. The data processing task is divided into n kinds of tasks, each kind of task comprising a plurality of processing steps, so each kind of task comprises the processing tasks of a plurality of network layers of the deep learning neural network. The plurality of network layers implementing each kind of task may be regarded as one sub-network of the deep learning neural network, which then includes n sub-networks. The i-th chipset being preset to execute the i-th kind of task thus means that the i-th chipset is preset to implement the function of the i-th sub-network.
In a possible implementation manner, the n kinds of tasks may be data processing tasks in the forward propagation process of the deep learning neural network.
In a possible implementation manner, the i-th data obtained after the i-th chipset executes the i-th kind of task is stored in the memory of the i-th chipset. The i-th data may be used for parameter calculation during the back propagation of the i-th sub-network. For ease of understanding, reference may be made to fig. 11. In fig. 11, the chipsets during forward propagation and back propagation are the same chipsets, drawn twice for clarity of illustration. Memory 1, memory 2, memory 3 and memory 4 correspond to the memories in chipset 1, chipset 2, chipset 3 and chipset 4, respectively. W represents a task parameter, and O represents data obtained after processing. Taking W1 and O1 in memory 1 as an example, W1 is the task parameter of task 1, and O1 is the data obtained after performing task 1 processing on one or more batches of data; the others are similar and are not described again. In a specific implementation, the processed data obtained by each chipset may be stored in its own memory. After chipset 4 completes the processing of the last kind of task in the whole data processing task, back propagation may be performed based on the processed data.
Back propagation is a process of deriving the gradients of all network-layer parameters through reverse chained derivation based on the loss function, and then optimizing the parameters through an optimization algorithm. This process requires the processed data generated by each chipset during forward propagation. Specifically, optimizing parameter W1 requires data O1, optimizing parameter W2 requires data O2, optimizing parameter W3 requires data O3, and optimizing parameter W4 requires data O4. Therefore, compared with prior implementations that store the processed data elsewhere, the embodiment of the application avoids the need to read the data from elsewhere, thereby saving the bandwidth resources of the chip system and improving the efficiency of back propagation.
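The storage scheme above — each chipset keeping its forward output O_i in its own memory so that back propagation needs no off-chip read — can be sketched with a toy pipeline. Everything below is illustrative: a scalar multiply stands in for a sub-network, and the parameter update is schematic rather than an exact gradient derivation.

```python
class ChipsetSim:
    """Toy model of one chipset: forward caches O_i locally,
    backward reuses the cached O_i for the parameter update."""
    def __init__(self, w):
        self.w = w           # task parameter W_i
        self.memory = None   # local memory holding O_i

    def forward(self, x):
        self.memory = self.w * x       # O_i stored in the chipset's own memory
        return self.memory

    def backward(self, grad, lr=0.1):
        # schematic update: uses the locally cached O_i, no remote read needed
        self.w -= lr * grad * self.memory
        return grad * self.w           # gradient passed to the previous stage

pipeline = [ChipsetSim(w) for w in (1.0, 2.0, 0.5)]
x = 2.0
for c in pipeline:                     # forward propagation through the stages
    x = c.forward(x)
g = 1.0
for c in reversed(pipeline):           # back propagation reuses each cached O_i
    g = c.backward(g)
```

The point of the sketch is only that `backward` never fetches O_i from anywhere but the chipset's own `memory`, which is the bandwidth saving described above.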
In a possible implementation manner, the n kinds of tasks may be data processing tasks in the back propagation process of the deep learning neural network. Specifically, based on the above description, the back propagation process is in effect a parameter optimization process, and the parameter optimization task of each chipset can then be regarded as one kind of task as described above. For example, in the foregoing fig. 11, the optimization task for the task parameter W1 of chipset 1 is one kind of task, which may be denoted task' 1. Similarly, during the back propagation process, the task of optimizing the task parameter W2 of chipset 2 is also one kind of task, which may be denoted task' 2. The rest are similar and are not repeated.
In summary, in the present solution, the chipset in the chip system acquires data from other chipsets based on the preset task scheduling information to execute the tasks of the other chipsets, so as to help the chipsets share the load, thereby implementing load balancing of each chipset in the chip system, and improving the processing efficiency and performance of the whole chip system.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (16)

1. A method of load balancing in a chip system, wherein the chip system comprises n chip sets, n being an integer greater than 1, each chip set comprising one or more chips; the chip system is used for processing n kinds of tasks; presetting an ith type of task to be executed by an ith chipset, wherein the value of i is an integer from 1 to n; the method comprises the following steps:
a first chipset of the n chipsets performs the following operations based on task scheduling information:
Acquiring first data to be processed from a second chipset in the n chipsets; the task scheduling information is used for indicating the type of the task executed by the first chipset after task scheduling;
Executing a first kind of task based on the first data to be processed to obtain processed first data; the first kind of task is a task preset to be executed by the second chipset.
2. The method of claim 1, wherein the task scheduling information is further used to indicate the number of times the first chipset performs each kind of task after task scheduling, and data processed in different executions of a task of the same kind is different.
3. The method of claim 1, wherein the task scheduling information is stored in a memory of the first chipset.
4. The method of claim 1, wherein the first chipset further performs the following operations based on the task scheduling information:
Acquiring a first task parameter from the second chipset, wherein the first task parameter is a parameter used for executing the first kind of task in the second chipset;
The performing a first kind of task based on the first data to be processed includes:
And executing the first kind of task based on the first task parameter and the first data to be processed.
5. The method according to any one of claims 1-4, wherein the performing a task of a first kind based on the first data to be processed, after obtaining the processed first data, further comprises:
Executing a second kind of task based on the processed first data to obtain processed second data; the second kind of task is a task preset to be executed by the first chipset;
and sending the processed second data to a third chipset in the n chipsets.
6. The method according to any one of claims 1-4, wherein the performing a task of a first kind based on the first data to be processed, after obtaining the processed first data, further comprises:
Executing a second kind of task based on the processed first data to obtain processed second data; the second kind of task is a task preset to be executed by the first chipset;
and executing a third kind of task based on the processed second data, wherein the third kind of task is a task preset to be executed by a third chip set in the n chip sets.
7. The method of claim 6, wherein prior to performing a third type of task based on the processed second data, further comprising:
acquiring second task parameters from the third chipset, wherein the second task parameters are parameters used for executing the third kind of task in the third chipset;
The performing a third kind of task based on the processed second data includes:
And executing the third kind of task based on the second task parameters and the processed second data.
8. The method according to any one of claims 1-4, wherein performing a task of a first kind based on the first data to be processed, after obtaining the processed first data, further comprises:
and sending the processed first data to a fourth chipset in the n chipsets.
9. The method of any of claims 1-4, wherein the first chipset further performs the following operations based on the task scheduling information:
Executing a second kind of task to obtain processed third data; the second kind of task is a task preset to be executed by the first chipset;
Executing a third kind of task based on the processed third data to obtain processed fourth data; the third kind of task is a task preset to be executed by a third chipset of the n chipsets;
and sending the processed fourth data to a fourth chipset in the n chipsets.
10. The method according to any one of claims 1-4, further comprising:
The first chipset obtains second data to be processed from a fifth chipset in the n chipsets;
Executing a fourth kind of task based on the second data to be processed to obtain processed fifth data; the fourth kind of task is a task preset to be executed by the fifth chipset; the task scheduling information does not indicate the first chipset to perform the fourth class of tasks.
11. The method of any one of claims 1-4, wherein the chip system processes the n categories of tasks through a deep learning neural network; the deep learning neural network comprises n sub-networks, each sub-network comprising a plurality of network layers of the deep learning neural network; the ith chip set is preset for realizing the function of the ith sub-network; the function of the ith sub-network is the function realized by the ith type of task.
12. The method of claim 11, wherein the n categories of tasks are data processing tasks during forward propagation of the deep learning neural network.
13. The method according to claim 12, characterized in that the i-th data obtained after the i-th chipset performs the i-th kind of task is stored in a memory of the i-th chipset, the i-th data being used for parameter calculation during back propagation of the i-th sub-network.
14. The method of claim 11, wherein the n categories of tasks are data processing tasks in the deep learning neural network back propagation process.
15. A chip system comprising n chip sets, n being an integer greater than 1, each chip set comprising one or more chips; the chip system is used for processing n kinds of tasks; presetting an ith type of task to be executed by an ith chipset, wherein the value of i is an integer from 1 to n; the n chip sets comprising a first chip set for performing the method of any of claims 1-14.
16. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method of any of claims 1-14.
CN202211711004.1A 2022-12-29 Load balancing method in chip system and related device Pending CN118277069A (en)

Publications (1)

CN118277069A (en), published 2024-07-02
