CN113495791B - Task processing system, method and chip - Google Patents

Task processing system, method and chip

Info

Publication number
CN113495791B
Authority
CN
China
Prior art keywords
task
computing cluster
chip
main
auxiliary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111041363.6A
Other languages
Chinese (zh)
Other versions
CN113495791A (en)
Inventor
刘伟
刘彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Suiyuan Technology Co ltd
Original Assignee
Shanghai Enflame Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Enflame Technology Co ltd
Priority to CN202111041363.6A
Publication of CN113495791A
Application granted
Publication of CN113495791B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

An embodiment of the invention discloses a task processing system, a task processing method, and a chip. The system comprises a main computing cluster and an auxiliary computing cluster that are communicatively connected. The main computing cluster is used to process main chip tasks, i.e. tasks the main computing cluster handles independently; the auxiliary computing cluster is used to process auxiliary chip tasks, i.e. tasks processed concurrently with the main computing cluster. The technical scheme of the embodiment enables efficient utilization of computing cluster resources and improves the performance of the chip system.

Description

Task processing system, method and chip
Technical Field
The embodiment of the invention relates to the technical field of chips, in particular to a task processing system, a task processing method and a chip.
Background
A chip, also called a microcircuit, a microchip, or an integrated circuit (IC), is a small piece of silicon containing an integrated circuit and is often part of a computer or other electronic device. An artificial intelligence chip can efficiently process artificial intelligence models and is a key development direction in the chip field.
An artificial intelligence chip usually contains multiple computing clusters. Each cluster mostly works independently, but some scenarios require cooperation. For example, in forward inference, multiple computing clusters share weight data; if several clusters need the same piece of weight data at the same time, the chip must copy that data once for each computing cluster that requires it, so each cluster can use it in its computation. In backward gradient computation, each computing cluster calculates one piece of gradient data, and all of these pieces must be reduced to a single piece across the whole chip.
In implementing the invention, the inventors found the following defect in the prior art: when computing clusters participate in cooperative computations such as weight replication and gradient reduction, subsequent chip tasks such as data transfer and computation must be delayed, which lengthens the overall computation time of the chip and degrades its performance.
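The two cooperative patterns described in the background, forward weight broadcast and backward gradient reduction, can be sketched as follows. This is an illustrative Python model, not part of the patent; all function names are hypothetical.

```python
# Illustrative sketch (not from the patent) of the two cooperative patterns:
# forward weight broadcast (one copy per cluster) and backward gradient
# reduction (element-wise combination of one gradient vector per cluster).

def broadcast_weights(weights, num_clusters):
    """Make one independent copy of the shared weights per computing cluster."""
    return [list(weights) for _ in range(num_clusters)]

def reduce_gradients(per_cluster_grads):
    """Reduce one gradient vector per cluster into a single vector (sum)."""
    return [sum(vals) for vals in zip(*per_cluster_grads)]
```

In the prior art these two operations occupy the computing clusters themselves; the embodiments below move them to a dedicated auxiliary cluster.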
Disclosure of Invention
The embodiment of the invention provides a task processing system, a task processing method, and a chip, to realize efficient utilization of computing cluster resources and improve the performance of the chip system.
In a first aspect, an embodiment of the present invention provides a task processing system configured on a chip, including a main computing cluster and an auxiliary computing cluster that maintain a communication connection between them; wherein:
the main computing cluster is used for processing main chip tasks, the main chip tasks including tasks processed independently by the main computing cluster;
the auxiliary computing cluster is used for processing auxiliary chip tasks, the auxiliary chip tasks including tasks processed concurrently with the main computing cluster.
In a second aspect, an embodiment of the present invention provides a task processing method applied to a chip, including:
processing main chip tasks through a main computing cluster in the chip; and
processing auxiliary chip tasks through an auxiliary computing cluster in the chip.
In a third aspect, an embodiment of the present invention further provides a chip, where the chip includes the task processing system provided in any embodiment of the present invention.
By configuring a communicatively connected main computing cluster and auxiliary computing cluster in the chip, processing main chip tasks (tasks the main computing cluster handles independently) through the main computing cluster, and processing auxiliary chip tasks (tasks processed concurrently with the main computing cluster) through the auxiliary computing cluster, the embodiment of the invention solves the technical problem of conventional chips in which main chip tasks are delayed in cooperative computing-cluster scenarios, causing long computation time and poor performance. It realizes efficient utilization of computing cluster resources and improves the performance of the chip system.
Drawings
Fig. 1 is a schematic diagram of a task processing system according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a task processing system in the prior art.
Fig. 3 is a schematic workflow diagram of a task processing system in the prior art.
Fig. 4 is a schematic structural diagram of a task processing system according to an embodiment of the present invention.
Fig. 5 is a schematic workflow diagram of a task processing system according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an auxiliary computing cluster according to an embodiment of the present invention.
Fig. 7 is a flowchart of a task processing method according to a second embodiment of the present invention.
Fig. 8 is a schematic diagram of a chip according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.
It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The terms "first" and "second," and the like, in the description, claims, and drawings of embodiments of the invention are used to distinguish different objects, not to describe a particular order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may include steps or elements not listed.
Example one
Fig. 1 is a schematic diagram of a task processing system according to an embodiment of the present invention. The task processing system may be configured in a chip; optionally, the chip may be an artificial intelligence chip. Its structure includes a main computing cluster 110 and an auxiliary computing cluster 120, which are communicatively coupled.
The main computing cluster 110 is used to process main chip tasks, which include tasks the main computing cluster 110 processes independently; the auxiliary computing cluster 120 is used to process auxiliary chip tasks, which include tasks processed concurrently with the main computing cluster 110.
Specifically, a chip may be configured with multiple main computing clusters 110; the specific number may be determined by the volume of main chip tasks the chip must process and by the chip's performance requirements. The main computing clusters 110 are disposed in the chip and may be connected to the chip's on-chip bus to communicate externally.
Each main computing cluster 110 may include one or more computing cores, giving it the ability to process computing tasks, so it may be used to process main chip tasks. The main chip tasks include tasks that the main computing clusters 110 process independently, i.e. tasks that several main computing clusters 110 can complete independently of one another. While processing main chip tasks, the main computing clusters 110 need not communicate with each other; the processing in any main computing cluster 110 is unaffected by the others and can proceed in parallel.
Correspondingly, a chip may be configured with one or more auxiliary computing clusters 120, the number again determined by the volume of auxiliary chip tasks and the chip's performance requirements. The auxiliary computing cluster 120 is disposed in the chip and may likewise be connected to the on-chip bus for external communication, so that the communication connection between the main computing clusters 110 and the auxiliary computing cluster 120 can be realized through the on-chip bus.
The auxiliary computing cluster 120 also has the capability to process computing tasks and may be used to process auxiliary chip tasks. The auxiliary chip tasks, i.e. tasks processed concurrently with the main computing clusters 110, may include tasks associated with data in at least two main computing clusters 110. Optionally, such a task may involve multiple main computing clusters 110 within a single chip, or multiple main computing clusters 110 across different chips; embodiments of the present invention are not limited in this respect. Through communication with the main computing clusters 110, the auxiliary computing cluster 120 can complete the portion of an auxiliary chip task that requires data interaction with them and process the remaining computation independently. The auxiliary chip task therefore does not occupy resources of the main computing clusters 110, its processing does not interfere with the main chip tasks, and the two can proceed in parallel.
By contrast, fig. 2 is a schematic diagram of the architecture of a task processing system in the prior art, likewise an intra-chip system. As shown in fig. 2, each chip is configured with computing clusters 1 to N, and may also be configured with storage modules 1 to M and chip interconnection modules, all connected into a system through on-chip buses. Correspondingly, fig. 3 is a schematic workflow diagram of this prior-art task processing system. As shown in fig. 3, computing clusters 1 to N work relatively independently; at some point they must handle auxiliary tasks, such as forward weight replication and backward gradient reduction in artificial intelligence training chips. Because an auxiliary task occupies computing cluster resources, it can only be serialized with the clusters' tasks 0, 1, and 2. For data synchronization auxiliary tasks spanning multiple clusters, even if only one computing cluster is invoked and the others are idle, the auxiliary task must complete before subsequent tasks can be processed. In the embodiment above, by configuring the auxiliary computing cluster 120, auxiliary chip tasks are delivered to the auxiliary computing cluster 120 for processing, so that they proceed concurrently with the main chip tasks processed by the main computing clusters 110.
Fig. 4 is a schematic structural diagram of a task processing system according to an embodiment of the present invention. As shown in fig. 4, an auxiliary computing cluster is introduced into the system and also connected to the bus, so that it can access the data in the storage modules. Correspondingly, fig. 5 is a schematic workflow diagram of a task processing system according to an embodiment of the present invention. As shown in fig. 5, the auxiliary computing cluster can process auxiliary chip tasks, including auxiliary computations with a small amount of computation and transfers of data between different storage modules.
In an optional embodiment of the invention, the auxiliary chip task may comprise a target data replication task; the auxiliary computing cluster 120 may be specifically configured to: determine a target storage module storing the target data; read the target storage module to obtain the target data; and write the target data into each target main computing cluster in turn.
The target data replication task is a task of copying data that multiple main computing clusters 110 all require to each of those clusters. The target data is the data required by the multiple main computing clusters 110. The target storage module is storage space configured in the chip for storing the target data. The target main computing clusters are the main computing clusters 110 that require the target data.
Correspondingly, in a scenario where multiple main computing clusters 110 in a chip cooperate, they may all require the same piece of data at the same time. If the clusters independently and simultaneously access the storage space holding that data, bus and storage contention arises and system performance degrades. To avoid this, the storage space can be accessed a single time, the data read once, and a copy delivered to each main computing cluster 110. For example, in forward inference an artificial intelligence chip may share weight data among multiple main computing clusters 110, and the chip then needs to copy the weight data for each main computing cluster 110.
To implement this process, a target data replication task may be generated. This task must be processed concurrently with the main computing clusters 110 and may therefore be handled by the auxiliary computing cluster 120 as an auxiliary chip task. The auxiliary computing cluster 120 determines the target storage module storing the target data, accesses it through the on-chip bus, and reads it to obtain the target data. It then performs the replication, making one copy of the target data for each target main computing cluster that requires it, and writes the copies into the target main computing clusters in turn. The order in which the auxiliary computing cluster 120 writes the target data into the target main computing clusters may be determined by a preset rule and is not limited here.
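The replication flow just described, a single read of the source followed by one write per target cluster, can be sketched as a minimal Python model. The dictionaries standing in for storage modules and cluster memories are illustrative assumptions, not structures named in the patent.

```python
def replicate_to_clusters(storage_modules, target_module, key, target_clusters):
    """Auxiliary-cluster sketch of the target data replication task:
    read `key` from one storage module a single time, then write an
    independent copy into each target main cluster's local memory."""
    data = storage_modules[target_module][key]   # single access to the target storage module
    for cluster_mem in target_clusters:          # write order follows a preset rule
        cluster_mem[key] = list(data)            # each cluster gets its own copy
    return len(target_clusters)
```

Because the source is read only once, the main clusters avoid the bus and storage contention that simultaneous independent reads would cause.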
If the target data replication task were processed by a main computing cluster 110, the main chip task being processed in that cluster, or any subsequent task, would be delayed, lengthening the chip's computation time and degrading performance. In the embodiment above, the target data replication task is processed by the auxiliary computing cluster 120: each target main computing cluster is involved only at the moment the auxiliary computing cluster 120 writes the target data into it, and every main computing cluster 110 can continue processing its own main chip tasks while the auxiliary computing cluster 120 performs the replication.
In an optional embodiment of the present invention, the auxiliary chip task may comprise a target data reduction task; the auxiliary computing cluster 120 may be specifically configured to: determine the target main computing clusters storing target data; read each target main computing cluster to obtain the target data it stores; and perform data reduction on the pieces of target data according to a data reduction policy.
The target data reduction task is a task of reducing specific pieces of data stored separately in multiple main computing clusters 110 into a single piece of data. The target data is the data that is stored separately in the multiple main computing clusters 110 and needs to be reduced. The target main computing clusters are the main computing clusters 110 storing the target data. The data reduction policy is a preset policy specifying the operations used to reduce the target data into one piece of data, and may include the read order of the target main computing clusters. The data reduction processing is the operation of combining the pieces of target data into a single piece of data.
Correspondingly, when multiple main computing clusters 110 in a chip cooperate, the data stored separately in those clusters may need to be reduced to one piece of data; the multiple pieces held by the main computing clusters 110 are then combined into one through the corresponding reduction computation. For example, in the backward gradient computation of an artificial intelligence chip, each main computing cluster 110 calculates one piece of gradient data, and these pieces must be reduced to a single piece across the whole chip.
To implement this process, a target data reduction task may be generated. This task must be processed concurrently with the main computing clusters 110 and may therefore be handled by the auxiliary computing cluster 120 as an auxiliary chip task. The auxiliary computing cluster 120 determines the target main computing clusters storing the target data, accesses them through the on-chip bus, and reads them to obtain the target data. It then performs the reduction computation, applying the data reduction policy to combine the multiple pieces of target data into one.
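This pull-style reduction, where the auxiliary cluster reads each target main cluster in the order given by the policy and combines the results, can be sketched as follows. Element-wise summation stands in for the reduction operation; the policy is reduced to a simple read order, both illustrative assumptions.

```python
def pull_reduce(cluster_memories, key, read_order):
    """Auxiliary-cluster sketch of the pull-based target data reduction task:
    read `key` from each target main cluster in the policy's read order,
    then combine the pieces element-wise (sum used as the reduction op)."""
    parts = [cluster_memories[i][key] for i in read_order]  # reads over the on-chip bus
    return [sum(vals) for vals in zip(*parts)]
```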
In another optional embodiment of the present invention, for the target data reduction task, the main computing cluster 110 may be specifically configured to: determine the associated main computing clusters storing target data; and, in coordination with each associated main computing cluster, send each piece of target data to the auxiliary computing cluster 120 in an agreed sending order. The auxiliary computing cluster 120 may be specifically configured to: receive each piece of target data; and perform data reduction on the received pieces according to a data reduction policy.
The associated main computing clusters are the multiple main computing clusters 110 storing the target data. The agreed sending order is the order in which the associated main computing clusters execute their send operations when transmitting target data to the auxiliary computing cluster 120.
Correspondingly, when the auxiliary computing cluster 120 processes the target data reduction task in this way, one main computing cluster 110 may determine the other associated main computing clusters storing target data; coordinating with them, each cluster then sends its stored target data to the auxiliary computing cluster 120 in the agreed order. The auxiliary computing cluster 120 receives each piece of target data and performs the reduction computation, applying the data reduction policy to combine the multiple pieces into one.
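The push-style variant, where the associated main clusters send their data in the agreed order and the auxiliary cluster accumulates as it receives, can be sketched with a queue standing in for the communication channel. The queue, the sender count, and the element-wise sum are illustrative assumptions.

```python
from queue import Queue

def push_reduce(channel, num_senders):
    """Auxiliary-cluster sketch of the push-based target data reduction task:
    receive one piece of target data from each associated main cluster
    (sent in the agreed order) and accumulate element-wise."""
    total = None
    for _ in range(num_senders):
        part = channel.get()             # next cluster's data, in agreed order
        total = part if total is None else [a + b for a, b in zip(total, part)]
    return total
```

Accumulating on arrival means the auxiliary cluster never needs all pieces resident at once, which suits a small local cache.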
If the target data reduction task were processed by a main computing cluster 110, the main chip task being processed in that cluster, or any subsequent task, would be delayed, lengthening the chip's computation time and degrading performance. In the embodiments above, the auxiliary computing cluster 120 processes the target data reduction task; a main computing cluster 110 is involved only when the auxiliary computing cluster 120 reads its target data, or when it sends its stored target data to the auxiliary computing cluster 120, and every main computing cluster 110 can continue processing its own main chip tasks while the auxiliary computing cluster 120 performs the reduction computation.
In an optional embodiment of the present invention, the auxiliary computing cluster 120 may be further configured to: when it is determined that no auxiliary chip task exists, process main chip tasks in parallel with the main computing clusters 110.
Correspondingly, if it is determined that no auxiliary chip task exists, the chip's function can be realized by the main computing clusters 110 processing main chip tasks independently of one another. The auxiliary computing cluster 120 then has no auxiliary chip task to process; it can communicate externally through the bus and take over part of the main chip tasks that the main computing clusters 110 would otherwise process. This avoids leaving the auxiliary computing cluster 120 idle and further improves the utilization of the chip's computing power and the chip's performance.
In an optional embodiment of the present invention, the auxiliary computing cluster 120 may be specifically configured to: when it is determined that the main computing clusters 110 satisfy a task parallel processing condition, process main chip tasks in parallel with the main computing clusters 110 according to cluster priority; and when it is determined that the main computing clusters 110 do not satisfy the task parallel processing condition, process pre-scheduled main chip tasks in series with the main computing clusters 110.
The task parallel processing condition comprises a first task parallel processing condition and/or a second task parallel processing condition. The first condition is that an idle cluster exists among the main computing clusters 110; the second condition is that the chip's storage bandwidth has not reached a full-load threshold.
Specifically, an idle cluster is a main computing cluster 110 with no task to process at the current moment. The chip storage bandwidth is the bandwidth the main computing clusters 110 occupy when storing data on the chip, and the full-load threshold is the maximum bandwidth they are allowed to occupy. The cluster priority is the priority with which the auxiliary computing cluster 120 and each main computing cluster 110 occupy chip resources when processing main chip tasks in parallel; the cluster with the higher priority occupies chip resources first to process its main chip task. A pre-scheduled main chip task is one that has no data dependency on any other task.
Correspondingly, the first task parallel processing condition indicates that, at the current moment, some main computing cluster 110 is idle with nothing to process; the main computing clusters 110 then generate little external communication, and the auxiliary computing cluster 120 is unlikely to contend with them for resources. The second condition indicates that storage bandwidth is still available on the chip, so the auxiliary computing cluster 120 will not contend with the main computing clusters 110 for it. Thus, if the main computing clusters 110 satisfy the first and/or second condition, the auxiliary computing cluster 120 may process main chip tasks in parallel with them, following the cluster priorities: if contention for storage bandwidth or any other resource occurs, the computing cluster with the higher priority processes its main chip task first. Optionally, the main computing clusters 110 may be set to high priority and the auxiliary computing cluster 120 to low priority.
If the main computing clusters 110 do not satisfy the task parallel processing condition, contention for storage bandwidth or other resources is likely if the auxiliary computing cluster 120 processes tasks in parallel with them. In that case, since a pre-scheduled main chip task has no data dependency on any other task, the auxiliary computing cluster 120 may process pre-scheduled main chip tasks in series with the main computing clusters 110.
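The scheduling decision described above can be condensed into a small sketch: the auxiliary cluster joins in parallel (at low priority) when either parallel-processing condition holds, and otherwise falls back to serially processing pre-scheduled tasks. The function name, mode strings, and numeric bandwidth values are illustrative, not taken from the patent.

```python
def schedule_idle_auxiliary(idle_main_cluster_exists, bandwidth_used, full_load_threshold):
    """Sketch of how an auxiliary cluster with no auxiliary chip task
    joins main-chip work. First condition: an idle main cluster exists.
    Second condition: storage bandwidth is below the full-load threshold."""
    if idle_main_cluster_exists or bandwidth_used < full_load_threshold:
        return 'parallel-low-priority'   # run alongside main clusters, yielding on contention
    return 'serial-prescheduled'         # take only tasks with no data dependencies, in series
```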
In the embodiment above, when the auxiliary computing cluster 120 has no auxiliary chip task to process, it shares the main chip tasks of the main computing clusters 110 under a reasonable task processing rule. This avoids leaving any computing cluster idle for long, avoids resource and bandwidth contention among the clusters, preferentially guarantees the task processing efficiency and reliability of the main computing clusters 110, and improves system performance.
In an optional embodiment of the present invention, the auxiliary computing cluster 120 further maintains a communication connection with an associated auxiliary computing cluster of an associated chip, and is further configured to perform data interaction with the associated auxiliary computing cluster so that the two jointly process auxiliary chip tasks.
The associated chip may be any of a plurality of chips that maintain communication connections with one another. The associated auxiliary computing cluster is the auxiliary computing cluster configured in the associated chip.
Correspondingly, if multiple chips are configured in the system and maintain communication connections with one another, the auxiliary computing cluster 120 may maintain a communication connection with the associated auxiliary computing cluster of an associated chip and exchange data with it, so that the associated auxiliary computing cluster jointly processes auxiliary chip tasks; for example, in a distributed system, data of the associated chip may be transferred or computed by the auxiliary computing cluster 120. This further improves system performance.
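One way such cross-chip cooperation could look, purely as an illustrative assumption (the patent does not specify the protocol): each chip's auxiliary cluster first reduces its own chip's pieces locally, then combines its partial result with the partial result received from the peer chip's associated auxiliary cluster.

```python
def cross_chip_reduce(local_parts, peer_partial):
    """Hypothetical two-chip cooperation sketch: reduce this chip's pieces
    locally, then combine with the partial result from the associated
    auxiliary cluster on the peer chip."""
    local_partial = [sum(vals) for vals in zip(*local_parts)]   # on-chip reduction
    return [a + b for a, b in zip(local_partial, peer_partial)] # cross-chip combine
```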
Fig. 6 is a schematic diagram of an auxiliary computing cluster 120 according to an embodiment of the present invention. In an alternative embodiment of the present invention, as shown in fig. 6, the auxiliary computing cluster 120 may include a computing core module 1201, a cache module 1202, and a DMA (Direct Memory Access) module 1203.
The computing core module 1201 is communicatively connected to the DMA module 1203 and is configured to receive cache data sent by the DMA module 1203 and perform task processing according to the cache data; the DMA module 1203 is communicatively connected to the cache module 1202 and is configured to read the cache data from the cache module 1202; the cache module 1202 is configured to store the cache data.
Accordingly, the computing core module 1201, by receiving the cache data and performing task processing according to it, gives the auxiliary computing cluster 120 the capability to process computational tasks. The DMA module 1203, by reading the cache data, and the cache module 1202, by storing it, further enable the auxiliary computing cluster 120 to process the auxiliary chip tasks and/or the main chip tasks.
Optionally, the workflow of the auxiliary computing cluster 120 may specifically include a workflow in a target data replication task scenario and a workflow in a target data reduction task scenario.
Specifically, in the target data replication task scenario, the auxiliary computing cluster 120 may read target data from a designated system storage module or a designated main computing cluster through the DMA module 1203, make multiple copies of the target data, and write the copies into a plurality of different system storage modules or main computing clusters. In the target data reduction task scenario, the auxiliary computing cluster 120 may read target data from a plurality of system storage modules or main computing clusters through the DMA module 1203 and write it into the cache module 1202; the computing core module 1201 then reads the target data from the cache module 1202, performs the reduction computation, and writes the result back to the cache module 1202 after the computation is completed; finally, the DMA module 1203 reads the result data from the cache module 1202 and writes it to the designated storage module or main computing cluster.
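A toy model of these two workflows is sketched below. The class and method names are assumed for illustration, lists stand in for the hardware storage and cache modules, and summation stands in for the reduction computation:

```python
# Toy model: the DMA path moves data between storage modules / main
# clusters and the cache module; the computing core reduces cached data
# and writes the result back through the cache.

class AuxiliaryCluster:
    def __init__(self):
        self.cache = []  # stands in for the cache module 1202

    def copy_task(self, source, destinations):
        """Replication: DMA reads the target data once and writes one copy
        to each destination storage module or main computing cluster."""
        for dest in destinations:
            dest.extend(source)

    def reduce_task(self, sources, destination):
        """Reduction: DMA gathers data into the cache, the computing core
        reduces it, and DMA writes the result to the destination."""
        for src in sources:
            self.cache.extend(src)       # DMA: source -> cache module
        result = sum(self.cache)         # computing core: reduction computation
        self.cache = [result]            # result written back to the cache
        destination.append(result)       # DMA: cache -> destination
```

For example, `reduce_task([[1, 2], [3, 4]], out)` leaves `out` holding the single reduced value 10.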
According to the technical scheme of the embodiment of the present invention, a main computing cluster and an auxiliary computing cluster that maintain a communication connection are configured in the chip; the main computing cluster processes main chip tasks, namely tasks the main computing cluster processes independently, and the auxiliary computing cluster processes auxiliary chip tasks, namely tasks processed jointly with the main computing clusters. This solves the technical problems of long computation time and poor performance caused by delayed main chip tasks in existing chips in scenarios where computing clusters must cooperate with one another, achieves efficient utilization of computing cluster resources, and improves the performance of the chip system.
Example two
Fig. 7 is a flowchart of a task processing method provided in the second embodiment of the present invention. This embodiment is applicable to scenarios in which chip tasks are processed, and the method may be executed by a chip configured with the task processing system provided in the embodiments of the present invention. Accordingly, as shown in fig. 7, the method includes the following operations:
S210, processing a main chip task through a main computing cluster in the chip.
S220, processing auxiliary chip tasks through the auxiliary computing cluster in the chip.
Wherein the main chip task comprises a task independently processed by a main computing cluster, and the auxiliary chip task comprises a task processed jointly with the main computing clusters.
In an optional embodiment of the present invention, the auxiliary chip task comprises a target data replication task; processing the auxiliary chip task through the auxiliary computing cluster within the chip may include: determining, through the auxiliary computing cluster within the chip, a target storage module storing target data; reading the target storage module through the auxiliary computing cluster within the chip to acquire the target data; and writing the target data into target main computing clusters in sequence through the auxiliary computing cluster within the chip.
In an optional embodiment of the present invention, the auxiliary chip task comprises a target data reduction task; processing the auxiliary chip task through the auxiliary computing cluster within the chip may include: determining, through the auxiliary computing cluster within the chip, the target main computing clusters storing target data; reading each target main computing cluster through the auxiliary computing cluster within the chip to obtain the target data stored by each target main computing cluster; and performing data reduction processing on each piece of target data according to a data reduction processing strategy through the auxiliary computing cluster within the chip.
In an optional embodiment of the present invention, the auxiliary chip task comprises a target data reduction task; processing the auxiliary chip task through the auxiliary computing cluster within the chip may include: determining, through a main computing cluster within the chip, the associated main computing clusters storing target data; sending, by the main computing cluster within the chip jointly with each associated main computing cluster, each piece of target data to the auxiliary computing cluster in turn according to an agreed data sending order; receiving each piece of target data through the auxiliary computing cluster within the chip; and performing data reduction processing on each piece of target data according to a data reduction processing strategy through the auxiliary computing cluster within the chip.
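This push-based variant can be sketched as follows. All names are assumed; a dictionary stands in for the data held by the main computing clusters, and summation stands in for the reduction strategy:

```python
# Sketch of the push-based reduction: each associated main cluster sends
# its piece of target data to the auxiliary cluster in an agreed order,
# and the auxiliary cluster reduces the received pieces.

def push_based_reduce(cluster_data, agreed_send_order, reduce_fn=sum):
    received = []
    for cluster_id in agreed_send_order:               # agreed data sending order
        received.append(cluster_data[cluster_id])      # auxiliary cluster receives
    return reduce_fn(received)                         # data reduction processing

# Usage: three main clusters each hold one value; the helper sums them.
total = push_based_reduce({"c0": 3, "c1": 4, "c2": 5}, ["c0", "c1", "c2"])
```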
In an optional embodiment of the present invention, the method may further comprise: processing, through the auxiliary computing cluster within the chip, the main chip task in parallel with the main computing cluster if it is determined that no auxiliary chip task exists.
In an optional embodiment of the present invention, processing the main chip task in parallel with the main computing cluster through the auxiliary computing cluster within the chip, in the case where it is determined that no auxiliary chip task exists, may include: processing, through the auxiliary computing cluster within the chip, the main chip task in parallel with the main computing cluster according to the cluster priority when it is determined that the main computing cluster satisfies the task parallel processing condition; and processing a pre-scheduled main chip task in series with the main computing cluster through the auxiliary computing cluster within the chip when it is determined that the main computing cluster does not satisfy the task parallel processing condition. Wherein the task parallel processing condition comprises a first task parallel processing condition and/or a second task parallel processing condition; the first task parallel processing condition comprises that an idle cluster exists among the main computing clusters; and the second task parallel processing condition comprises that the memory bandwidth of the chip has not reached a full-load threshold.
In an optional embodiment of the present invention, the auxiliary computing cluster further maintains a communication connection with an associated auxiliary computing cluster of an associated chip; the method may further comprise: performing data interaction between the auxiliary computing cluster within the chip and the associated auxiliary computing cluster, so that the associated auxiliary computing cluster jointly processes the auxiliary chip task.
In an optional embodiment of the present invention, the auxiliary computing cluster includes a computing core module, a cache module, and a DMA module; the computing core module is communicatively connected to the DMA module, and the DMA module is communicatively connected to the cache module. Processing the auxiliary chip task through the auxiliary computing cluster within the chip may include: receiving, through the computing core module of the auxiliary computing cluster within the chip, cache data sent by the DMA module, and performing task processing according to the cache data; reading the cache data from the cache module through the DMA module of the auxiliary computing cluster within the chip; and storing the cache data through the cache module of the auxiliary computing cluster within the chip.
According to the technical scheme of the embodiment of the present invention, a main computing cluster and an auxiliary computing cluster that maintain a communication connection are configured in the chip; the main computing cluster processes main chip tasks, namely tasks the main computing cluster processes independently, and the auxiliary computing cluster processes auxiliary chip tasks, namely tasks processed jointly with the main computing clusters. This solves the technical problems of long computation time and poor performance caused by delayed main chip tasks in existing chips in scenarios where computing clusters must cooperate with one another, achieves efficient utilization of computing cluster resources, and improves the performance of the chip system.
Example three
Fig. 8 is a schematic diagram of a chip provided with a task processing system according to a third embodiment of the present invention. Accordingly, as shown in fig. 8, the chip 31 is configured with a task processing system including: the primary computing cluster 311 and the secondary computing cluster 312, the primary computing cluster 311 and the secondary computing cluster 312 maintaining a communicative connection.
Wherein, the main computing cluster 311 is configured to process main chip tasks, the main chip tasks including tasks independently processed by the main computing cluster 311; the auxiliary computing cluster 312 is configured to process auxiliary chip tasks, the auxiliary chip tasks including tasks processed jointly with the main computing cluster 311.
In an optional embodiment of the present invention, the auxiliary chip task comprises a target data replication task; the auxiliary computing cluster 312 may be specifically configured to: determine a target storage module storing target data; read the target storage module to acquire the target data; and write the target data into target main computing clusters in sequence.
In an optional embodiment of the present invention, the auxiliary chip task comprises a target data reduction task; the auxiliary computing cluster 312 may be specifically configured to: determine the target main computing clusters storing target data; read each target main computing cluster to obtain the target data stored by each target main computing cluster; and perform data reduction processing on each piece of target data according to a data reduction processing strategy.
In an optional embodiment of the present invention, the auxiliary chip task comprises a target data reduction task; the main computing cluster 311 may be specifically configured to: determine the associated main computing clusters storing target data; and, jointly with each associated main computing cluster, send each piece of target data to the auxiliary computing cluster in turn according to an agreed data sending order. The auxiliary computing cluster 312 may be specifically configured to: receive each piece of target data; and perform data reduction processing on each piece of target data according to a data reduction processing strategy.
In an optional embodiment of the present invention, the auxiliary computing cluster 312 may be further configured to: process the main chip task in parallel with the main computing cluster 311 if it is determined that no auxiliary chip task exists.
In an optional embodiment of the present invention, the auxiliary computing cluster 312 may be specifically configured to: process the main chip task in parallel with the main computing cluster 311 according to the cluster priority when it is determined that the main computing cluster 311 satisfies the task parallel processing condition; and process a pre-scheduled main chip task in series with the main computing cluster 311 when it is determined that the main computing cluster 311 does not satisfy the task parallel processing condition. Wherein the task parallel processing condition comprises a first task parallel processing condition and/or a second task parallel processing condition; the first task parallel processing condition comprises that an idle cluster exists among the main computing clusters 311; and the second task parallel processing condition comprises that the memory bandwidth of the chip has not reached a full-load threshold.
In an optional embodiment of the present invention, the auxiliary computing cluster 312 further maintains a communication connection with an associated auxiliary computing cluster of an associated chip; the auxiliary computing cluster 312 may be further configured to: perform data interaction with the associated auxiliary computing cluster, so that the associated auxiliary computing cluster jointly processes the auxiliary chip task.
In an optional embodiment of the present invention, the auxiliary computing cluster 312 comprises a computing core module, a cache module, and a Direct Memory Access (DMA) module; wherein: the computing core module is communicatively connected to the DMA module and is configured to receive cache data sent by the DMA module and perform task processing according to the cache data; the DMA module is communicatively connected to the cache module and is configured to read the cache data from the cache module; and the cache module is configured to store the cache data.
According to the technical scheme of the embodiment of the present invention, a main computing cluster and an auxiliary computing cluster that maintain a communication connection are configured in the chip; the main computing cluster processes main chip tasks, namely tasks the main computing cluster processes independently, and the auxiliary computing cluster processes auxiliary chip tasks, namely tasks processed jointly with the main computing clusters. This solves the technical problems of long computation time and poor performance caused by delayed main chip tasks in existing chips in scenarios where computing clusters must cooperate with one another, achieves efficient utilization of computing cluster resources, and improves the performance of the chip system.

Claims (9)

1. A task processing system, configured on a chip, comprising a primary computing cluster and a secondary computing cluster, the primary computing cluster and the secondary computing cluster being communicatively coupled; wherein:
the main computing cluster is used for processing a main chip task; the main chip task comprises a task independently processed by a main computing cluster;
the auxiliary computing cluster is used for processing auxiliary chip tasks; the auxiliary chip tasks comprise tasks processed jointly with the main computing clusters; wherein the tasks processed jointly with the main computing clusters are tasks associated with data in at least two of the main computing clusters;
the auxiliary computing cluster is further configured to: process the main chip task in parallel with the main computing cluster according to the cluster priority when it is determined that the main computing cluster satisfies the task parallel processing condition; and process a pre-scheduled main chip task in series with the main computing cluster when it is determined that the main computing cluster does not satisfy the task parallel processing condition; wherein the task parallel processing condition comprises: a first task parallel processing condition and/or a second task parallel processing condition; the first task parallel processing condition comprises that an idle cluster exists among the main computing clusters; the second task parallel processing condition comprises that the memory bandwidth of the chip has not reached a full-load threshold.
2. The system of claim 1, wherein the auxiliary chip tasks include a target data replication task; the auxiliary computing cluster is specifically configured to:
determining a target storage module for storing target data;
reading the target storage module to obtain the target data;
and writing the target data into target main computing clusters in sequence.
3. The system of claim 1, wherein the auxiliary chip task comprises a target data reduction task; the target data reduction task is a task of reducing specified pieces of data stored in a plurality of main computing clusters into one piece of data; the auxiliary computing cluster is specifically configured to:
determining the target main computing clusters storing target data;
reading each target main computing cluster to obtain the target data stored by each target main computing cluster;
and performing data reduction processing on each piece of target data according to a data reduction processing strategy.
4. The system of claim 1, wherein the auxiliary chip task comprises a target data reduction task; the target data reduction task is a task of reducing specified pieces of data stored in a plurality of main computing clusters into one piece of data; the main computing cluster is specifically configured to:
determining the associated main computing clusters storing target data;
and, jointly with each associated main computing cluster, sending each piece of target data to the auxiliary computing cluster in turn according to an agreed data sending order;
the auxiliary computing cluster is specifically configured to:
receiving each piece of the target data;
and performing data reduction processing on each piece of target data according to a data reduction processing strategy.
5. The system of claim 1, wherein the auxiliary computing cluster is further configured to:
processing the main chip task in parallel with the main computing cluster if it is determined that no auxiliary chip task exists.
6. The system of any of claims 1-5, wherein the auxiliary computing cluster is further communicatively coupled to an associated auxiliary computing cluster of an associated chip; the auxiliary computing cluster is further configured to:
performing data interaction with the associated auxiliary computing cluster, so that the associated auxiliary computing cluster jointly processes the auxiliary chip task.
7. The system of any of claims 1-5, wherein the auxiliary computing cluster comprises a computing core module, a cache module, and a Direct Memory Access (DMA) module; wherein:
the computing core module is communicatively connected to the DMA module and is configured to receive cache data sent by the DMA module and perform task processing according to the cache data;
the DMA module is in communication connection with the cache module and is used for reading the cache data from the cache module;
the cache module is used for storing the cache data.
8. A task processing method is applied to a chip and comprises the following steps:
processing a main chip task through a main computing cluster in the chip;
processing an auxiliary chip task by an auxiliary computing cluster in the chip;
wherein the main chip task comprises a task independently processed by a main computing cluster; the auxiliary chip task comprises a task processed jointly with the main computing clusters; the tasks processed jointly with the main computing clusters are tasks associated with data in at least two of the main computing clusters;
the method further comprises the following steps:
through the auxiliary computing cluster, under the condition that the main computing cluster is determined to meet the task parallel processing condition, the main chip task is processed in parallel with the main computing cluster according to the cluster priority;
through the auxiliary computing cluster, processing a pre-scheduled main chip task in series with the main computing cluster under the condition that it is determined that the main computing cluster does not satisfy the task parallel processing condition;
wherein the task parallel processing condition comprises: a first task parallel processing condition and/or a second task parallel processing condition; the first task parallel processing condition comprises that an idle cluster exists among the main computing clusters; the second task parallel processing condition comprises that the memory bandwidth of the chip has not reached a full-load threshold.
9. A chip, characterized in that it comprises a task processing system according to any one of claims 1 to 7.
CN202111041363.6A 2021-09-07 2021-09-07 Task processing system, method and chip Active CN113495791B (en)


Publications (2)

Publication Number Publication Date
CN113495791A CN113495791A (en) 2021-10-12
CN113495791B true CN113495791B (en) 2021-12-14






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306
Patentee after: Shanghai Suiyuan Technology Co.,Ltd.
Country or region after: China
Address before: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306
Patentee before: SHANGHAI ENFLAME TECHNOLOGY Co.,Ltd.
Country or region before: China