CN114691313A - Data processing method and device of system on chip - Google Patents

Data processing method and device of system on chip Download PDF

Info

Publication number
CN114691313A
CN114691313A (application number CN202011631083.6A)
Authority
CN
China
Prior art keywords
task
data
computing device
binding
cluster
Prior art date
Legal status
Pending
Application number
CN202011631083.6A
Other languages
Chinese (zh)
Inventor
Inventor not announced
Current Assignee
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority: CN202011631083.6A
Publication: CN114691313A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Advance Control (AREA)

Abstract

The invention provides a data processing method and a data processing device for a system on chip. The multi-core computing device is designed as a hierarchical structure: as a system on chip, it comprises at least one cluster, and each cluster comprises a plurality of processor cores. In other words, the multi-core computing device is organized in a system-on-chip, cluster, processor-core hierarchy. The data processing method and device can improve the computing capacity and processing efficiency of the system on chip.

Description

Data processing method and device of system on chip
Technical Field
The present invention relates to the field of chips, and in particular, to a method and an apparatus for processing data of a system on a chip.
Background
With the development of deep learning and big data, intelligent applications such as deep learning are notable for their large input data volumes and high demands on platform computing capability. Conventional general-purpose processors (e.g., CPUs) have difficulty meeting these computing requirements. To meet the intelligent processing requirements of computer vision, speech, natural language processing, data mining and other fields under complex scenarios, a heterogeneous computing system formed by a general-purpose processor and other special-purpose processors (coprocessors) is generally adopted to improve the computing power of the computer. In such a heterogeneous system, the read/write overhead of the coprocessor has a great influence on the computing capacity and computing efficiency of the system, so reducing the read/write overhead of the coprocessor is a problem of wide concern.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing method and apparatus for a system on chip.
According to an aspect of the present disclosure, there is provided a data processing method of a system on chip, the method including:
receiving tasks and related task grouping information;
issuing tasks in the related task groups to a cluster designated in the computing device; wherein the related task grouping is determined according to the related task grouping information;
and executing the tasks in the related task groups by using the designated cluster, and temporarily storing output data of at least one task in the related task groups in an on-chip storage resource of the computing device.
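The method steps above can be illustrated with a minimal sketch. All names, data structures and the way grouping information is represented here are illustrative assumptions, not the patent's actual implementation: tasks belonging to the related task group are issued to a designated cluster, and each task's output is kept in on-chip storage for the next related task.

```python
# Hypothetical sketch of the three method steps: receive tasks and related
# task grouping information, issue the group to a designated cluster, and
# temporarily keep outputs in an on-chip storage resource. All names assumed.

def process(tasks, grouping_info, device):
    # Determine the related task group from the grouping information.
    group = [t for t in tasks if t in grouping_info["group_members"]]
    cluster = device["clusters"][grouping_info["designated_cluster"]]
    on_chip = {}                      # stands in for the on-chip storage resource
    for task in group:
        # Output of each task stays on chip for use by the next related task.
        on_chip[task] = f"output_of_{task}"
    cluster["executed"].extend(group)
    return on_chip

device = {"clusters": [{"executed": []}, {"executed": []}]}
info = {"group_members": {"t1", "t2"}, "designated_cluster": 0}
out = process(["t1", "t2", "t3"], info, device)
assert device["clusters"][0]["executed"] == ["t1", "t2"]
```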
According to another aspect of the present disclosure, there is provided a data processing apparatus, characterized in that the apparatus comprises:
the task scheduling module is used for receiving tasks and related task grouping information and issuing the tasks in the related task grouping to a specified cluster of the computing device; wherein the related task grouping is determined according to the related task grouping information;
and the computing device comprises at least one cluster and is used for executing the tasks in the related task groups by using the specified cluster and temporarily storing the output data of at least one task in the related task groups in the on-chip storage resources of the computing device.
According to another aspect of the present disclosure, a system on chip is provided, wherein the system on chip comprises the data processing apparatus of the present disclosure.
According to another aspect of the present disclosure, a board card is provided, where the board card includes the system on chip of the present disclosure.
The data processing method and the data processing device can improve the processing efficiency and the computing capacity of the system on chip.
Drawings
Fig. 1 is a schematic structural diagram of a board card according to an embodiment;
FIG. 2 is a block diagram of a combinatorial processing device in a chip of an embodiment;
FIG. 3 is a schematic diagram illustrating an internal structure of a single core according to an embodiment;
FIG. 4 is a schematic diagram of an internal structure of a multi-core computing device according to an embodiment;
FIG. 5 is a diagram of a software architecture of a system on a chip, according to an embodiment;
FIG. 6 is a flowchart of a data processing method of a system on a chip according to an embodiment;
FIG. 7 is a flowchart of a method for implementing step S603 according to an embodiment;
fig. 8 is a flowchart of a method for implementing step S603 according to another embodiment.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board card 10 includes a chip 101, a System on Chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial intelligence arithmetic unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of computer vision, speech, natural language processing, data mining and other fields under complex scenarios. Deep learning technology in particular is widely applied in the field of cloud intelligence, and one remarkable characteristic of cloud intelligence applications is the large input data size, which places high requirements on the storage and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a memory device 104 for storing data, which includes one or more memory units 105. The memory device 104 is connected to the control device 106 and the chip 101 through a bus for data transfer. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single-chip microcomputer (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor performing deep learning or machine learning computations; it may interact with the processing device 203 through the interface device 202 to collectively complete the user-specified operations.
The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to a storage device on the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into a control cache on the computing device 201. Alternatively, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control, including but not limited to data transfer and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor among a central processing unit (CPU), a graphics processing unit (GPU) or another general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered alone, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed; it may be a DRAM or DDR memory, is typically 16 GB or larger in size, and stores data of the computing device 201 and/or the processing device 203.
Fig. 3 shows the internal structure of the computing device 201 in its single-core form. The single-core computing device 301 is used to process input data in fields such as computer vision, speech, natural language and data mining, and includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used for storing or transporting related data, and includes a neuron storage unit (neuron RAM, NRAM)331, a parameter storage unit (weight RAM, WRAM)332, and a Direct Memory Access (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; WRAM 332 is used to store the convolution kernel of the deep learning network, i.e. the weight; the DMA 333 is connected to the DRAM204 via the bus 34, and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal structure of the computing apparatus 201 with multiple cores. The multi-core computing device 41 is designed in a hierarchical structure, and the multi-core computing device 41 is a system on a chip and includes at least one cluster (cluster), each cluster including a plurality of processor cores, in other words, the multi-core computing device 41 is formed in a system on a chip-cluster-processor core hierarchy.
In a system-on-chip hierarchy, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be multiple external memory controllers 401 (two are shown by way of example in the figure), used to access an external memory device, such as the DRAM 204 in fig. 2, so as to read data from or write data to off-chip in response to access requests issued by the processor cores. The peripheral communication module 402 is used to receive control signals from the processing device 203 through the interface device 202 and start the computing device 201 to execute a task. The on-chip interconnect module 403 connects the external memory controllers 401, the peripheral communication module 402 and the plurality of clusters 405, and transmits data and control signals between the modules. The synchronization module 404 is a global synchronization barrier controller (GBC) used to coordinate the work progress of the clusters and ensure synchronization of information. The plurality of clusters 405 are the computing cores of the multi-core computing device 41; four are exemplarily shown in the figure, and as hardware advances, the multi-core computing device 41 of the present disclosure may further include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
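The three-level hierarchy described above can be sketched as nested containers. This is a purely illustrative model (class and attribute names are assumptions, not from the patent): a system on chip holds clusters, and each cluster holds processor cores.

```python
# Hypothetical sketch of the system-on-chip -> cluster -> processor-core
# hierarchy. Names and default counts are illustrative assumptions only.

class ProcessorCore:
    def __init__(self, core_id):
        self.core_id = core_id

class Cluster:
    def __init__(self, cluster_id, num_cores=4):
        self.cluster_id = cluster_id
        self.cores = [ProcessorCore(i) for i in range(num_cores)]

class MultiCoreComputingDevice:
    """System on chip containing at least one cluster."""
    def __init__(self, num_clusters=4, cores_per_cluster=4):
        self.clusters = [Cluster(i, cores_per_cluster)
                         for i in range(num_clusters)]

    def total_cores(self):
        return sum(len(c.cores) for c in self.clusters)

# Four clusters of four cores each, matching the example shown in the figure.
soc = MultiCoreComputingDevice(num_clusters=4, cores_per_cluster=4)
assert soc.total_cores() == 16
```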
Viewed at the cluster level, as shown in FIG. 4, each cluster 405 includes a plurality of processor cores (IPU core)406 and a memory core (MEM core) 407.
Four processor cores 406 are exemplarily shown in the figure; the present disclosure does not limit their number. The internal architecture is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3, again including three main modules: a control module 51, an operation module 52 and a storage module 53. The functions and structures of the control module 51, the operation module 52 and the storage module 53 are substantially the same as those of the control module 31, the operation module 32 and the storage module 33, and are not described again. It should be particularly noted that the storage module 53 includes an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534. The IODMA 533 controls access between NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 controls access between NRAM 531/WRAM 532 and the shared memory unit (SRAM) 408.
Returning to fig. 4, the storage core 407 is primarily used for storage and communication, i.e., storing data shared among the processor cores 406 or intermediate results, as well as performing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the storage core 407 has scalar operation capability and may perform scalar operations.
The memory core 407 includes an SRAM 408, a broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 in the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 separately, but is transferred among the processor cores 406 through the SRAM 408. The memory core 407 only needs to rapidly distribute the multiplexed data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
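The saving from staging multiplexed data in the shared SRAM can be made concrete with a toy access count. This is a simplified illustrative model, not a measurement from the patent: without the SRAM transfer station, each core fetches the shared data off-chip itself; with it, the memory core fetches once and distributes on-chip.

```python
# Illustrative (hypothetical) count of off-chip DRAM reads for data that is
# multiplexed by every processor core in a cluster.

def offchip_reads_without_sram(num_cores):
    # Each processor core fetches the shared data from DRAM itself.
    return num_cores

def offchip_reads_with_sram(num_cores):
    # The memory core fetches once into SRAM, then distributes on-chip.
    return 1

cores = 4
assert offchip_reads_without_sram(cores) == 4
assert offchip_reads_with_sram(cores) == 1   # a 4x reduction in this toy model
```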
Broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication among processor cores 406, communication among cluster 405, and data transfer between cluster 405 and DRAM204, respectively. As will be described separately below.
The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM408 to all processor cores 406, which is a special case of multicast.
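The three inter-core communication modes differ only in the set of destination cores, which makes broadcast a special case of multicast. The following sketch (function and variable names are illustrative assumptions) models delivery of one copy of SRAM data to each target core:

```python
# Hypothetical model of the broadcast bus's communication modes: unicast
# (one target), multicast (a subset of cores), broadcast (all cores).

def transfer(sram_data, all_cores, targets):
    """Deliver a copy of sram_data to each core id in targets."""
    return {core: sram_data for core in all_cores if core in targets}

all_cores = [0, 1, 2, 3]
unicast   = transfer("weights", all_cores, targets=[2])
multicast = transfer("weights", all_cores, targets=[1, 3])
broadcast = transfer("weights", all_cores, targets=all_cores)

assert list(unicast) == [2]
assert list(multicast) == [1, 3]
assert list(broadcast) == all_cores   # broadcast = multicast to every core
```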
CDMA 410 is used to control access to SRAM408 between different clusters 405 within the same computing device 201.
The GDMA 411 cooperates with the external memory controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 can be achieved via two channels. The first channel directly connects the DRAM 204 with the NRAM 531 or WRAM 532 through the IODMA 533. The second channel first transfers data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534. Although the second channel seemingly requires more components and a longer data path, in some embodiments its bandwidth is substantially greater than that of the first channel, so communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be more efficient over the second channel. Embodiments of the present disclosure may select the data transmission channel according to their own hardware conditions.
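The channel-selection decision described above reduces to comparing the effective bandwidth of the two paths. The sketch below is an assumption-laden illustration (the bandwidth figures and function names are invented for the example; the patent does not give numbers):

```python
# Hedged sketch: choose the DRAM <-> NRAM/WRAM transfer channel by bandwidth.
# Channel 1: DRAM <-> NRAM/WRAM directly via IODMA.
# Channel 2: DRAM <-> SRAM via GDMA, then SRAM <-> NRAM/WRAM via MVDMA.
# Bandwidth values below are made-up placeholders, not from the patent.

def pick_channel(bw_direct_iodma, bw_via_sram):
    """Return the identifier of the higher-bandwidth channel."""
    return "direct_iodma" if bw_direct_iodma >= bw_via_sram else "via_sram"

# In some embodiments the SRAM path has substantially greater bandwidth,
# so it wins despite the longer data path.
assert pick_channel(bw_direct_iodma=50, bw_via_sram=200) == "via_sram"
assert pick_channel(bw_direct_iodma=100, bw_via_sram=80) == "direct_iodma"
```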
In other embodiments, the functionality of the GDMA 411 and that of the IODMA 533 may be integrated in the same component. For convenience of description, the GDMA 411 and the IODMA 533 are treated as different components; implementations by those skilled in the art that achieve the same functions and technical effects fall within the scope of the disclosure. Further, the functions of the GDMA 411, the IODMA 533, the CDMA 410 and the MVDMA 534 may be implemented by the same component.
Fig. 5 shows a schematic diagram of a software architecture 50 of the chip 101 in an embodiment of the disclosure. From top to bottom, the software architecture 50 includes an artificial intelligence and machine learning application layer 501, a machine learning framework layer 502, a machine learning programming library layer 503, a machine learning runtime library and system tools (toolkit) layer 504, and a driver layer 505.
The application layer 501 is used for implementing various artificial intelligence and machine learning, especially for deep learning, such as face recognition, unmanned driving, speech recognition, natural language processing, and processing of images, speech and text by robots.
The machine learning framework layer 502 provides tools that help developers understand and design machine learning models. It allows development with little code, helping to create powerful artificial intelligence software without in-depth knowledge of technically complex algorithms. Common frameworks include TensorFlow, Caffe, MXNet, PyTorch, and the like. The machine learning framework layer 502 can process the entire computation graph of a neural network. The computation graph may include a plurality of operators or kernel functions (kernels), such as convolution, pooling and relu functions, and the operators are connected according to certain rules to form the computation graph of the entire neural network.
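A computation graph of this kind can be represented minimally as an adjacency mapping from each operator to the operators whose outputs it consumes. The operator names below are illustrative, not taken from any specific framework:

```python
# Minimal sketch of a neural-network computation graph: operators as nodes,
# edges expressing data dependencies (one operator's output feeds the next).

graph = {
    "conv1": [],            # no predecessors: consumes the network input
    "relu1": ["conv1"],     # relu1 depends on conv1's output
    "pool1": ["relu1"],
    "conv2": ["pool1"],
}

def predecessors(op):
    """Operators whose outputs this operator consumes."""
    return graph[op]

assert predecessors("relu1") == ["conv1"]
assert predecessors("conv1") == []      # graph input
```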
The machine learning programming library layer 503 includes basic operators required for machine learning application development, and the machine learning framework layer 502 can conveniently call the operators to implement a deep neural network model and other machine learning algorithms.
In the machine learning runtime library and system tools layer 504, the runtime library provides a set of upper-level programming interfaces to the system-on-chip hardware, used for interaction with the hardware and for hardware resource scheduling. The system tools comprise various tools for system-on-chip software development: for example, a compiler can compile assembly language and output an executable binary file that runs on the system on chip; a debugger is used to debug software code on the system-on-chip hardware; a performance optimization tool is used to optimize hardware performance; and an automated hardware diagnostic tool can perform power consumption testing, PCIe link state diagnostics, hardware stress testing, multi-card interconnect state testing, and the like.
The driver layer 505, which contains information about the hardware devices of the system-on-chip, enables the software to communicate with the corresponding hardware devices.
Fig. 6 shows a flowchart of a data processing method of a system on chip according to an embodiment of the present disclosure. This embodiment is described with reference to fig. 2 to 5. As shown in fig. 6, the method for processing system-on-chip data provided in this embodiment may include the following steps:
step S601, receiving task and related task grouping information;
in an embodiment of the present disclosure, referring to fig. 4, task scheduling module 412 of chip 101 receives tasks and related task grouping information. The related task of the chip 101 may be to perform calculation of an operator (operator) of a neural network or a kernel function (kernel), wherein the operator of the neural network may also be represented in the form of the kernel function. Since the amount of computation of tasks executed by the system on chip is very large, and related tasks are related to each other, for example, the input of a current task needs to utilize the output of a previous task, at this time, there may be a large overhead in reading and writing data between related tasks. In the embodiment of the disclosure, data read-write overhead between related tasks can be reduced by binding the related tasks into related task groups. Wherein, the related task may refer to a task having a data dependency relationship.
In the embodiment of the present disclosure, the dependency relationships of the tasks may be determined by upper-layer software of the chip 101, for example, the machine learning framework layer 502 or the machine learning runtime library and system tools layer 504 (such as a compiler in the system tools), which decide whether to bind related tasks into a group; a dependency relationship indicates that there is data dependency between the related tasks. When the task belongs to a neural network, the dependency relationship may be determined according to the computation graph structure of the neural network: the connection relationships between tasks in the computation graph represent the dependency relationships between related tasks.
When the upper layer software determines that there is a dependency relationship between tasks, the software may mark the relevant task as a relevant task group and send the relevant task group information to the task scheduling module 412 of the chip 101. When the upper layer software judges that the task does not have the dependency relationship, the software can mark the task as a single task without binding other tasks into related task groups.
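The upper-layer decision in the two paragraphs above can be sketched as grouping tasks by their data-dependency edges: tasks connected by dependencies are bound into one related task group, and an isolated task stays single. The grouping algorithm below (connected components via union-find) is one plausible realization, not the patent's prescribed method:

```python
# Hypothetical sketch of the upper-layer software's grouping decision.
# deps: set of (producer, consumer) pairs taken from the computation graph.

def group_related_tasks(tasks, deps):
    """Return task groups; tasks linked by any dependency share a group."""
    parent = {t: t for t in tasks}

    def find(t):                       # union-find with path halving
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in deps:
        parent[find(a)] = find(b)

    groups = {}
    for t in tasks:
        groups.setdefault(find(t), []).append(t)
    # Largest group first, just for stable inspection in this example.
    return sorted(groups.values(), key=len, reverse=True)

tasks = ["t1", "t2", "t3", "t4"]
deps = {("t1", "t2"), ("t2", "t3")}    # t4 has no dependency: single task
groups = group_related_tasks(tasks, deps)
assert sorted(groups[0]) == ["t1", "t2", "t3"]   # bound as a related group
assert groups[1] == ["t4"]                        # marked as a single task
```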
For example, the related task group may include a start marking task, a related task (a task having a data dependency relationship, the related task including at least two tasks) and an end marking task, and the start marking task and the end marking task are used for indicating the start and the end of the related task. Accordingly, the related task grouping information may include information such as start information of the related task, related task information (e.g., identification of the related task, scale of the related task, etc.), and end information of the related task.
For example, the related task grouping information may be in various forms, for example, the related task grouping information may be bound to the tasks of the related task grouping. For another example, the related task grouping information and the related task grouping may be separately issued to the multi-core computing device 41.
For example, the related task group information may be bound into the related task group by the driver layer 505 of the software. For example, the software may bind the start information of the related task to the first task in the related task group, and mark the first task of the related task as a start binding task, which is a start marking task; the software can mark the intermediate task in the related tasks as a binding task; the software may bind the end information to the last task of the related tasks to mark the last task of the related tasks as an end bound task, i.e., an end-marker task.
As another example, the software may bind the start information to the first task of the associated task group, i.e., the start marker task, and the end information to the last task of the associated task group, i.e., the end marker task. The execution of the related tasks in the related task group is between the start marking task and the end marking task. The start marking task and the end marking task may be performed without performing a data read-write process.
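The marker binding described in the two examples above amounts to tagging the first task of a group as the start-binding task, the last as the end-binding task, and any intermediate tasks as bound. The role names in this sketch are assumptions for illustration:

```python
# Sketch (field and role names assumed) of binding grouping information into a
# related task group: first task carries the start information, last task the
# end information, intermediate tasks are plain binding tasks.

def bind_group_markers(group):
    marked = []
    for i, task in enumerate(group):
        if i == 0:
            role = "start_binding"     # start-marking task
        elif i == len(group) - 1:
            role = "end_binding"       # end-marking task
        else:
            role = "binding"           # intermediate related task
        marked.append({"task": task, "role": role})
    return marked

marked = bind_group_markers(["t1", "t2", "t3"])
assert [m["role"] for m in marked] == ["start_binding", "binding", "end_binding"]
```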
Step S602, tasks in the related task group are issued to a designated cluster in a computing device, wherein the related task group is determined according to the related task grouping information;
in the embodiment of the present disclosure, the task scheduling module 412 may issue at least one task in the relevant task group to a specified cluster according to the relevant task group information, so that the at least one task in the relevant task group may be processed by the same cluster. The designated cluster may be determined based on the size of the tasks in the associated task group.
As an alternative embodiment, referring to fig. 4, the task scheduling module 412 issues related tasks to the multi-core computing device 41, locks at least one cluster 405 in the multi-core computing device 41 according to the start information in the related task grouping information, and takes the locked at least one cluster as a designated cluster. After the at least one cluster 405 is locked, the at least one cluster 405 can only be used for processing the task in the bound related task group, so that the processing result of the previous task in the related task group can reside in the cluster 405 for the current task to use, thereby improving the processing efficiency.
For example, after receiving the tasks from the upper-layer software, the task scheduling module 412 may store the bound related task group in the same task queue inside the task scheduling module 412, where the task queue is a queue that the task scheduling module 412 maintains internally for scheduling tasks. When issuing the related tasks to the multi-core computing device 41, the task scheduling module 412 may send the tasks in the same task queue to the locked cluster 405, so that the locked cluster 405 is used only for processing the tasks in the bound related task group.
The number of locked clusters 405 is not limited in this embodiment, and the locked clusters 405 may be a single cluster or a plurality of clusters 405. Each cluster may include multiple processor cores (IPU core) or may include a single processor core. Optionally, the number of clusters that need to be locked may be determined according to the size of the tasks in the relevant task group. For example, when the size of a task requires a single cluster to execute, then the single cluster may be locked. When the size of the task requires two clusters to execute, then both clusters may be locked. Further, the tasks stored in the same task queue of the task scheduling module 412 may be tasks with the same task size, that is, the number of clusters required for task execution in the related task group is the same, so that the data transmission overhead between clusters may be reduced.
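As an illustrative sketch (the per-cluster capacity and function names below are assumed, not specified in this disclosure), the number of clusters to lock can be derived from the uniform task size shared by the tasks in one task queue:

```python
# Hypothetical sketch: decide how many clusters to lock for a related task
# group. The rule (ceil of task size over per-cluster capacity) and the
# capacity value are illustrative assumptions.
import math

CLUSTER_CAPACITY = 4  # e.g., processor cores per cluster; assumed value

def clusters_to_lock(task_sizes):
    """Tasks in one queue are assumed to share the same size, so the
    group-wide requirement equals any single task's requirement."""
    assert len(set(task_sizes)) == 1, "tasks in a queue share one size"
    return math.ceil(task_sizes[0] / CLUSTER_CAPACITY)

n = clusters_to_lock([4, 4, 4])  # fits a single cluster
m = clusters_to_lock([8, 8])     # needs two clusters
```

Keeping one size per queue, as the text notes, avoids inter-cluster data transfers when tasks in the same group run back to back.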
Step S603, executing the tasks in the related task groups by using the designated cluster, and temporarily storing output data of at least one task in the related task groups in an on-chip storage resource of the computing device.
In an embodiment of the present disclosure, referring to FIG. 4, the multi-core computing device 41 uses the locked cluster 405 to execute tasks in the relevant task group without executing other tasks.
In an alternative embodiment, the task scheduling module may allocate the tasks in the related task group to the locked cluster 405 for execution, and data that needs to be reused in the related tasks, for example, the execution result (output data) of the related tasks with data dependency may reside in an on-chip storage resource inside the cluster, for example, in the SRAM408, so that the current task may directly obtain the execution result of its previous related task from the SRAM408 without reading data from a storage resource outside the cluster, for example, a cache or a DRAM, thereby reducing read-write overhead between related tasks in the same task group and improving processing efficiency.
Conventionally, after the execution of the previous associated task is finished, the execution result of the previous associated task is generally written back to the storage resource DRAM outside the cluster. If data access is frequently performed between the on-chip storage resource SRAM408 of the cluster and the storage resource DRAM outside the cluster, the computing efficiency of the on-chip system is greatly reduced, and the computing performance of the on-chip system is affected. In the embodiment of the disclosure, at least one cluster in the computing device is locked, and data (such as input data or execution result data) in the execution process of the related task is temporarily stored in the on-chip storage resource inside the cluster, so that a frequent data read-write process is avoided, and data read-write overhead and computation delay are reduced.
The present embodiment does not limit the type of on-chip storage resources inside the cluster; for example, the storage may be SRAM, WRAM (Weight RAM), NRAM (Neural RAM), or the like. Herein, SRAM is taken as an example to illustrate the embodiments of the present disclosure.
In the embodiment of the present disclosure, because the on-chip storage resource is very limited, sometimes the on-chip storage resource cannot satisfy the data storage space required by the related task. Therefore, in order to implement that the data of the related tasks in the execution process resides in the on-chip storage resources, in this embodiment, the upper layer software may further segment the tasks, that is, the upper layer software not only binds the tasks related to each other, but also may further decompose the oversized tasks into smaller tasks, so that the cluster of the computing device may execute the tasks.
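As an illustrative sketch (the SRAM capacity and names below are assumed, not from this disclosure), the upper-layer software splitting an oversized task so that each piece's working set fits the on-chip storage resource might look as follows:

```python
# Hypothetical sketch: upper-layer software decomposing an oversized task
# into pieces whose data volume fits the cluster's on-chip SRAM.
SRAM_BYTES = 2 * 1024 * 1024  # assumed 2 MiB per-cluster SRAM

def split_task(total_bytes, sram_bytes=SRAM_BYTES):
    """Split a task's data volume into (offset, size) chunks, each no
    larger than the on-chip storage capacity."""
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(sram_bytes, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks

pieces = split_task(5 * 1024 * 1024)  # a 5 MiB task splits into 3 pieces
```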
Optionally, the method may further include:
and releasing the locking of the cluster according to the end information in the task grouping information.
In embodiments of the present disclosure, the multi-core computing device 41 may unlock the locked cluster 405 according to the end information in the task group information it receives. Referring to fig. 4, after receiving the end information in the task grouping information, for example, when the locked cluster 405 performs the end marking task, the locked cluster 405 releases the occupied computing resources and resumes the normal data processing flow of the cluster 405, for example, data can be accessed from a storage resource outside the cluster.
It should be clear that each cluster comprises at least one processor core, and when only one processor core is included in each cluster, then the above-described method can perform scheduling of tasks at the granularity of processor cores. For example, the task scheduler may lock a single processor core of the computing device, send at least one task of the related task packet to the same processor core of the computing device, and temporarily store data (e.g., output data of at least one task) to be reused in the related task packet in an on-chip storage resource of the processor core, such as WRAM or NRAM.
Fig. 7 shows a flowchart of a method for implementing step S603 according to an embodiment of the present disclosure. In one embodiment of the disclosure, the related task grouping information may be bound to each related task, and the task scheduling module 412 may send the related task to which the related task grouping information is bound to at least one cluster of the multi-core computing device 41, and the cluster may execute the related task it receives. At this time, the related task packet only includes the related task, and also carries the related task packet information. For example, the first task in the related tasks is a start marking task, i.e., a start binding task. The last task in the related tasks is an end-marker task, i.e., an end-binding task. As shown in fig. 7, the step S603 may include:
step S6031, receiving the binding start task, locking the cluster executing the relevant task packet, reading and writing data according to the specification of the data read-write tag, and executing the binding start task.
In an embodiment of the present disclosure, referring to fig. 4, the multi-core computing device 41 receives the start binding task sent by the task scheduling module 412, and locks the cluster 405 for executing the relevant task group. The start binding task represents a first related task in the bound related task group. After the multi-core computing device 41 locks the cluster 405 for executing the group of related tasks, the cluster 405 can only be used for executing the tasks in the group of related tasks and cannot be used for executing other tasks.
For example, referring to fig. 4, after receiving the binding start task, the cluster 405 reads data read/write tags from the parameter table, where the data read/write tags may include a content read tag and a content write tag. In the embodiment of the present disclosure, each task in the related task group may be provided with a corresponding data read-write tag, and in the process of executing the task, the computing device may obtain input data required by the task according to the content read tag, and write output data of the task into a specified storage resource according to the content write tag. The parameter table may be stored in an external storage resource DRAM or a Cache (Cache), and the cluster 405 may obtain a data read-write tag in the parameter table from the storage resource, and perform a corresponding data read-write operation according to the data read-write tag.
The content read tag indicates whether the input data is read from an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41. According to the system-on-chip data processing method of this embodiment, the data held in the SRAM408 of the multi-core computing device 41 is the resident data that the previous task in the bound related task group stored in the SRAM408.
The content write tag indicates whether to write the output data back to an external storage resource (e.g., DRAM) of the multi-core computing device 41 and whether to reside in an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41. Illustratively, the content write tag may be defined according to the following table:
[The content write tag values are defined in a table that appears only as an image in the original; per the surrounding text, tag values such as 00 and 01 indicate that the output data resides in the on-chip storage resource.]
Illustratively, the cluster 405 may read and write data through an application binary interface (ABI) according to the content read tag and the content write tag read from the parameter table. For example, the output of the previous task in the related task group may be used as the input of the current task; in this case, the content write tag of the previous task may be 00 or 01, and the content read tag of the current task indicates that the input data is read from the on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41.
After the previous task is executed, the cluster 405 writes the output data of the previous task into the on-chip storage resource (e.g., SRAM) according to the content write tag of the previous task. When executing the current task, the cluster 405 reads the output data of the previous task from the on-chip storage resource (e.g., SRAM) according to the content read tag of the current task, without reading the data from the DRAM.
In an embodiment of the disclosure, the binding start task is the first task in the related task group; the data read tag corresponding to the binding start task may indicate that data is read from an external storage resource of the computing device, and the data write tag corresponding to the binding start task may indicate that the output data of the binding start task resides in an on-chip storage resource of the computing device. Therefore, in the process of executing the binding start task, the computing device can, according to the data read-write tags, read the input data of the binding start task from the external storage resource of the computing device and temporarily store the output data of the binding start task in the on-chip storage resource of the computing device.
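The tag-driven data movement described above can be sketched as follows. The flag names are hypothetical: the actual tag encoding is given only in the original table, and the logic below merely mirrors the behavior the text describes.

```python
# Hypothetical sketch of tag-driven data movement for one task. The three
# boolean flags stand in for the content read tag and content write tag;
# sram and dram are modeled as plain dictionaries.

def run_task(task, sram, dram, compute):
    # Content read tag: choose the input source.
    src = sram if task["read_from_sram"] else dram
    data = src[task["input"]]
    out = compute(data)
    # Content write tag: residency and/or write-back of the output.
    if task["reside_in_sram"]:
        sram[task["output"]] = out  # stays on-chip for the next task
    if task["write_back_to_dram"]:
        dram[task["output"]] = out
    return out

sram, dram = {}, {"x": 3}
# A binding start task: read from DRAM, keep the result resident in SRAM.
start = {"input": "x", "output": "y", "read_from_sram": False,
         "reside_in_sram": True, "write_back_to_dram": False}
run_task(start, sram, dram, compute=lambda v: v * 2)
```

A subsequent binding task would then set `read_from_sram` so it fetches `y` directly from on-chip storage rather than from DRAM.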
Step S6032, receiving the binding task, reading and writing data according to the specification of the data read-write label, and executing the binding task;
in an embodiment of the present disclosure, referring to FIG. 4, the task scheduling module 412 continues to send the bind task after sending the start bind task. The bound task is an intermediate task in the bound related task group.
Referring to fig. 4, after the cluster 405 receives the binding task, it reads the content read tag and the content write tag from the parameter table, and starts to execute the task. Referring to the above description, each binding task may correspond to a data read-write tag, and the content read tag and the content write tag may define whether to read input data from an on-chip storage resource of the computing device and whether to write output data back to an external storage resource of the computing device. At least one intermediate task in the group of related tasks may be an associated task that needs to be executed on the cluster, and the content read tag of the binding task may be a read of input data from an on-chip storage resource (e.g., SRAM408) of the multicore computing device 41. The content write tag of the bind task may be the write back of the output data into an on-chip storage resource (e.g., SRAM 408). The cluster 405 reads and writes data through the application binary interface according to the content read tag and the content write tag of the binding task.
For example, there may be multiple binding tasks, in which case all of the binding tasks are executed by repeating step S6032. Alternatively, the related task group may contain only the start binding task and the end binding task; in that case, since there is no binding task, step S6032 need not be performed.
Step S6033, receiving the binding end task, reading and writing data according to the specification of the data read-write tag, executing the binding end task, and then releasing the locked cluster.
In an embodiment of the present disclosure, referring to fig. 4, after the cluster 405 finishes executing the binding tasks, the task scheduling module 412 sends the end binding task. The end binding task represents the last task in the bound related task group. After the cluster 405 receives the end binding task, it reads and writes data through the application binary interface according to the content read tag and the content write tag read from the parameter table. Referring to the above description, the content read tag and the content write tag may define whether to read input data from an on-chip storage resource of the computing device and whether to write output data back to an external storage resource of the computing device. The content read tag of the end binding task may indicate that input data is read from an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41. The content write tag of the end binding task may indicate that the output data is written back to a storage resource (e.g., DRAM) outside the cluster.
After the bound task is finished, the locked cluster 405 is released, that is, the cluster 405 can also be used to execute other tasks, and the normal data processing flow is recovered, for example, data can be accessed in a storage resource (e.g., DRAM) outside the cluster.
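The overall flow of steps S6031 to S6033 can be sketched as a simple lock lifecycle on the cluster side; the class and field names below are illustrative only, not part of this disclosure:

```python
# Hypothetical sketch of a cluster handling the start binding, binding,
# and end binding tasks of one related task group (steps S6031-S6033).

class Cluster:
    def __init__(self):
        self.locked_group = None  # id of the group this cluster is locked to
        self.log = []             # execution order, for illustration

    def handle(self, task):
        if task["role"] == "start_binding":
            self.locked_group = task["group"]   # S6031: lock the cluster
        # A locked cluster only accepts tasks from its bound group.
        assert self.locked_group == task["group"], "cluster rejects other work"
        self.log.append(task["id"])             # execute the task
        if task["role"] == "end_binding":
            self.locked_group = None            # S6033: release the lock

c = Cluster()
for t in [{"id": "t0", "role": "start_binding", "group": "g0"},
          {"id": "t1", "role": "binding", "group": "g0"},
          {"id": "t2", "role": "end_binding", "group": "g0"}]:
    c.handle(t)
```

After the end binding task, `locked_group` is cleared, modeling the cluster resuming its normal data processing flow and accepting other tasks.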
Fig. 8 shows a flowchart of a method of implementing step S603 according to another embodiment of the present disclosure. The present embodiment is different from the embodiment shown in fig. 7 in that, in an embodiment of the present disclosure, the related task group may include the related task, and a start marking task for marking the start of the related task and an end marking task for marking the end of the related task, and the start marking task and the end marking task may not involve reading and writing any data. As shown in fig. 8, the step S603 may include:
step S7031, receiving and executing the start marking task, and locking the cluster executing the related task group.
In an embodiment of the present disclosure, referring to fig. 4, the multi-core computing device 41 receives the start marking task sent by the task scheduling module 412, and locks the cluster 405 executing the related task group. The start marking task is used to identify the start of the bound related task group. After the multi-core computing device 41 locks the cluster 405 for executing the related task group, the cluster 405 is only available for executing the tasks in the related task group and cannot be used for executing other tasks.
Step S7032, receiving the related tasks, reading and writing data according to the specification of the data read-write tags, and executing the related tasks.
In an embodiment of the present disclosure, referring to fig. 4, the data read/write tags of the related tasks may include a content write tag and a content read tag. The content read tag of the first of the related tasks may indicate that data is read from an external storage resource (e.g., DRAM) of the locked cluster, and the content write tag of the first of the related tasks may indicate that output data is resident in an on-chip storage resource (e.g., SRAM408) of the multicore computing device 41. The content read tag of the last task of the related task may indicate that the input data is read from an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41, and the content write tag of the last task of the related task may indicate that the output data is written back to an external storage resource (e.g., DRAM) of the multi-core computing device 41. The content read tag of an intermediate task between the first task and the last task in the related task may be to read input data from an on-chip storage resource (e.g., SRAM408) of the multicore computing device 41, and the content write tag of the intermediate task may be to reside output data in an on-chip storage resource (e.g., SRAM408) of the multicore computing device 41.
The locked cluster can, according to the data read-write tags of the related tasks, acquire the data required for executing each related task, and execute each related task respectively.
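The position-dependent tag assignment described above can be sketched as follows; the flag names are hypothetical and only mirror the first/intermediate/last behavior in the text:

```python
# Hypothetical sketch: derive each related task's read/write tags from its
# position in the related task group. Flag names are illustrative stand-ins
# for the content read tag and content write tag.

def tags_for(index, count):
    first = index == 0
    last = index == count - 1
    return {
        "read_from_sram": not first,   # first task reads from external DRAM
        "reside_in_sram": not last,    # intermediate outputs stay on-chip
        "write_back_to_dram": last,    # only the final result leaves the chip
    }

tags = [tags_for(i, 3) for i in range(3)]
```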
Step S7033, receiving and executing an end-marker task, and releasing the locked cluster.
In an embodiment of the present disclosure, referring to fig. 4, after the bound cluster 405 finishes executing the bound related tasks, it receives the end marking task sent by the task scheduling module 412. The end marking task is used to identify the end of the bound related task group. After the cluster 405 completes the end marking task, the locked cluster 405 is released, i.e., the cluster 405 may also be used to perform other tasks, and the normal data processing flow is resumed; for example, data may be accessed from a storage resource (e.g., DRAM) outside the cluster.
Referring to fig. 2 and 4, a data processing apparatus for executing the data processing method of the system on chip according to an embodiment of the present disclosure is described. As shown in fig. 2 and 4, the data processing apparatus may be implemented by the chip 101, and the apparatus may include:
the task scheduling module 412 is configured to receive tasks and related task grouping information, and issue tasks in the related task grouping to a cluster designated in a computing device, where the related task grouping is determined according to the task grouping information;
and the computing device 41 comprises at least one cluster and is used for executing the tasks in the related task groups by using the specified cluster and temporarily storing the output data of at least one task in the related task groups in the on-chip storage resources of the computing device.
Optionally, the computing device 41 is configured to lock at least one cluster in the computing device according to the start information in the task grouping information, and use the locked at least one cluster as the designated cluster. The computing device is further configured to unlock the cluster according to the end information in the task grouping information.
In an embodiment of the present disclosure, referring to fig. 4, the task scheduling module 412 of the chip 101 receives tasks and related task grouping information. The related task of the chip 101 may be computation of a neural network operator or a kernel function (kernel), where a neural network operator may also be represented in the form of a kernel function. Because the amount of computation of tasks executed by the system on chip is very large and related tasks depend on one another (for example, the input of a current task needs to utilize the output of a previous task), there may be a large overhead in reading and writing data between related tasks. In the embodiment of the disclosure, this data read-write overhead can be reduced by binding the related tasks into related task groups, where a related task refers to a task having a data dependency relationship.
In the embodiment of the present disclosure, the dependency relationship of the tasks may be determined by upper-layer software of the chip 101, for example, a machine learning framework layer 502, a machine learning runtime library, a system tool (toolkit) layer 504 (such as a compiler in the system tools), and the like, which determine whether to bind the related tasks into a group; the dependency relationship indicates that there is data dependency between the related tasks. When the task is a neural network computation, the dependency relationship may be determined according to the computation graph structure of the neural network, and the connection relationships between tasks in the computation graph of the neural network may represent the dependency relationships between related tasks.
When the upper layer software determines that there is a dependency relationship between tasks, the software may mark the relevant task as a relevant task group and send the relevant task group information to the task scheduling module 412 of the chip 101. When the upper layer software judges that the task does not have the dependency relationship, the software can mark the task as a single task without being bound with other tasks into related task groups.
For example, the related task group may include a start marking task, related tasks (tasks having a data dependency relationship; the group includes at least two such tasks), and an end marking task, where the start marking task and the end marking task identify the start and the end of the related tasks. Accordingly, the related task grouping information may include information such as start information of the related tasks, related task information (e.g., identification of the related tasks, scale of the related tasks, etc.), and end information of the related tasks.
For example, the related task grouping information may be in various forms, for example, the related task grouping information may be bound to tasks in the related task grouping. For another example, the related task grouping information and the related task grouping may be separately issued to the multi-core computing device 41.
For example, the related task group information may be bound into the related task group by the driver layer 505 of the software. For example, the software may bind the start information of the related task to the first task in the related task group, and mark the first task of the related task as a start-binding task, which is a start-marking task; the software can mark the intermediate task in the related tasks as a binding task; the software may bind the end information to the last task of the related tasks to mark the last task of the related tasks as an end bound task, i.e., an end-marker task.
As another example, the software may bind the start information to the first task of the associated task group, i.e., the start marker task, and the end information to the last task of the associated task group, i.e., the end marker task. The execution of the related tasks in the related task group is between the start marking task and the end marking task. The start marking task and the end marking task may be performed without performing a data read-write process.
In an embodiment of the present disclosure, referring to fig. 4, the task scheduling module 412 may issue the related tasks to the multi-core computing device 41, and lock at least one cluster 405 in the multi-core computing device 41 according to the start information in the related task grouping information. After the at least one cluster 405 is locked, the at least one cluster 405 can only be used for processing the task in the bound related task group, so that the processing result of the previous task in the related task group can reside in the cluster 405 for the current task to use, thereby improving the processing efficiency.
For example, after receiving the task of the upper layer software, the task scheduling module 412 may store the bound related task packet in the same task queue (tune) in the task scheduling module 412, where the task queue is a queue that the task scheduling module 412 maintains internally for scheduling the task. The task scheduling module 412 may send the tasks in the same task queue to the locked cluster 405 when issuing the related tasks to the multi-core computing device 41, so that the locked cluster 405 is only used for processing the tasks in the bound related task group.
The number of locked clusters 405 is not limited in this embodiment, and the locked clusters 405 may be a single cluster or a plurality of clusters 405. Each cluster may include multiple processor cores (IPU core) or may include a single processor core. Optionally, the number of clusters that need to be locked may be determined according to the size of the tasks in the relevant task group. For example, when the size of a task requires a single cluster to execute, then the single cluster may be locked. When the size of the task requires two clusters to execute, then both clusters may be locked. Further, the tasks stored in the same task queue of the task scheduling module 412 may be tasks with the same task size, that is, the number of clusters required for task execution in the related task group is the same, so that the data transmission overhead between clusters may be reduced.
In an embodiment of the present disclosure, referring to FIG. 4, the multi-core computing device 41 uses the locked cluster 405 to execute tasks in the relevant task group without executing other tasks.
In an alternative embodiment, the tasks in the related task group may be allocated to the locked cluster 405 for execution, and the execution results of the related tasks may reside in an on-chip storage resource inside the cluster, for example, in the SRAM408, so that the current task may directly obtain the execution results of its previous related task from the SRAM408 without reading data from a storage resource outside the cluster, for example, a cache or a DRAM, thereby reducing read-write overhead between related tasks in the same task group and improving processing efficiency.
Conventionally, after the execution of the previous associated task is finished, the execution result of the previous associated task is generally written back to the storage resource DRAM outside the cluster. If data access is frequently performed between the on-chip storage resource SRAM408 of the cluster and the storage resource DRAM outside the cluster, the computational efficiency of the on-chip system is greatly reduced, and the computational performance of the on-chip system is affected. In the embodiment of the present disclosure, at least one cluster in the computing device is locked, and data (such as input data or execution result data) in the execution process of the relevant task is temporarily stored in the on-chip storage resource inside the cluster, so that a frequent data read-write process is avoided, and data read-write overhead and computation delay are reduced.
The present embodiment does not limit the type of on-chip storage resources inside the cluster, and for example, the storage may be SRAM, WRAM (Weight RAM), nram (neural RAM), or the like, and herein, the SRAM is taken as an example to illustrate the embodiments of the present disclosure.
In the embodiment of the disclosure, because the on-chip storage resources are very limited, sometimes the on-chip storage resources cannot meet the data storage space required by the related task. Therefore, in order to implement that the data of the related tasks in the execution process resides in the on-chip storage resources, in this embodiment, the upper layer software may further segment the tasks, that is, the upper layer software not only binds the tasks related to each other, but also may further decompose the oversized tasks into smaller tasks, so that the cluster of the computing device may execute the tasks.
In embodiments of the present disclosure, the multi-core computing device 41 may unlock the locked cluster 405 according to the end information in the task group information it receives. Referring to fig. 4, after receiving the end information in the task grouping information, for example, when the locked cluster 405 performs the end marking task, the locked cluster 405 releases the occupied computing resources and resumes the normal data processing flow of the cluster 405, for example, data can be accessed from a storage resource outside the cluster.
Referring to fig. 4, a computing device of the data processing apparatus according to an embodiment of the disclosure is illustrated; it may be implemented by the multi-core computing device 41 of the chip 101. In one embodiment of the disclosure, the related task grouping information may be bound to each related task, and the task scheduling module 412 may send the related tasks to which the related task grouping information is bound to at least one cluster of the multi-core computing device 41, and the cluster may execute the related tasks it receives. In this case, the related task group includes only the related tasks, which also carry the related task grouping information. For example, the first task among the related tasks is the start marking task, i.e., the start binding task, and the last task among the related tasks is the end marking task, i.e., the end binding task.
In this embodiment, as shown in fig. 4, the multi-core computing device 41 is configured to receive a binding start task, lock a cluster 405 that executes a related task packet, read and write data according to the specification of a data read-write tag, and execute the binding start task.
In this embodiment, referring to fig. 4, the multi-core computing device 41 receives the start binding task sent by the task scheduling module 412, and locks the cluster 405 for executing the related task group. The start binding task represents the first related task in the bound related task group. After the multi-core computing device 41 locks the cluster 405 for executing the related task group, the cluster 405 can only be used for executing the tasks in the related task group and cannot be used for executing other tasks.
For example, referring to fig. 4, after receiving the binding start task, the cluster 405 reads data read/write tags from the parameter table, where the data read/write tags may include a content read tag and a content write tag. In the embodiment of the present disclosure, each task in the related task group may be provided with a corresponding data read-write tag, and in the process of executing the task, the computing device may obtain input data required by the task according to the content read tag, and write output data of the task into a specified storage resource according to the content write tag. The parameter table may be stored in an external storage resource DRAM or a Cache (Cache), and the cluster 405 may obtain a data read-write tag in the parameter table from the storage resource, and perform a corresponding data read-write operation according to the data read-write tag.
The content read tag indicates whether to read input data from an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41. According to the system-on-chip data processing method of this embodiment, the data held in the SRAM408 of the multi-core computing device 41 is the resident output data that the previous task in the bound related task packet stored in the SRAM408.
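A rough sketch of the content read tag behavior follows. The parameter-table layout and key names here are assumptions for illustration, not the disclosed format; the point is only that each task's read tag selects whether its input comes from on-chip SRAM (where the previous bound task left its resident output) or from external DRAM:

```python
# Rough sketch: a parameter table (layout and key names are assumptions)
# tells each task whether its input comes from on-chip SRAM — where the
# previous bound task left its resident output — or from external DRAM.

PARAM_TABLE = {
    "task0": {"read_from_sram": False},  # first task: inputs come from DRAM
    "task1": {"read_from_sram": True},   # later task: reuse resident SRAM data
}

def fetch_input(task_id, sram, dram):
    # Choose the input source according to the task's content read tag.
    tag = PARAM_TABLE[task_id]
    source = sram if tag["read_from_sram"] else dram
    return source[task_id]

dram = {"task0": [1, 2, 3]}
sram = {"task1": [4, 5, 6]}  # resident output left for task1 by its predecessor
assert fetch_input("task0", sram, dram) == [1, 2, 3]
assert fetch_input("task1", sram, dram) == [4, 5, 6]
```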
The content write tag indicates whether to write the output data back to an external storage resource (e.g., DRAM) of the multi-core computing device 41 and whether to reside in an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41. Illustratively, the content write tag may be defined according to the following table:
[Table: content write tag definitions, specifying whether the output data is written back to the external storage resource and whether it resides in the on-chip storage resource; original figure not reproduced.]
illustratively, the cluster 405 may read and write data through an application binary interface (ABI) according to the content read tags and content write tags read from the parameter table. For example, the output of the previous task of the related task packet may be used as the input of the current task; in this case, the content write tag of the previous task may be 00 or 01, and the content read tag of the current task indicates that the input data is read from the on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41.
After the previous task is executed, the cluster 405 writes the output data of the previous task into an on-chip storage resource (e.g., SRAM) according to the content write tag of the previous task. When executing the current task, the cluster 405 reads, according to the content read tag of the current task, the output data of the previous task from the on-chip storage resource (e.g., SRAM), without reading the data from the DRAM. In an embodiment of the disclosure, the start binding task is the first task in the related task packet; the data read tag corresponding to the start binding task may indicate reading data from an external storage resource of the computing device, and the data write tag corresponding to the start binding task may indicate residing the output data of the start binding task in an on-chip storage resource of the computing device. Therefore, in the process of executing the start binding task, the computing device can, according to the data read-write tag, read the input data of the start binding task from the external storage resource of the computing device and temporarily store the output data of the start binding task in the on-chip storage resource of the computing device.
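The passage above notes that a content write tag of 00 or 01 on the previous task lets the current task read its output from on-chip SRAM. The original encoding table is in a figure not reproduced in this text, so the bit layout in the sketch below is an assumption consistent only with that remark: bit 1 clear meaning "reside the output in SRAM", bit 0 set meaning "also write back to DRAM".

```python
# Hedged sketch of a 2-bit content write tag. The real encoding comes from
# the table in the original figure (not reproduced); this layout is an
# assumption: values 00/01 reside output on chip, values 01/11 write to DRAM.

def apply_write_tag(tag, name, output, sram, dram):
    if tag & 0b10 == 0:   # values 00 and 01: keep the output resident on chip
        sram[name] = output
    if tag & 0b01:        # values 01 and 11: also write back to external DRAM
        dram[name] = output

sram, dram = {}, {}
apply_write_tag(0b00, "prev", [7, 8], sram, dram)  # reside only
assert sram["prev"] == [7, 8] and "prev" not in dram
apply_write_tag(0b01, "last", [9], sram, dram)     # reside and write back
assert sram["last"] == [9] and dram["last"] == [9]
```

With tag 00 or 01 the data stays resident on chip, so the next bound task's content read tag can point it at SRAM rather than DRAM.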
In this embodiment, the multi-core computing device 41 is further configured to receive a binding task, read and write data according to the specification of the data read-write tag, and execute the binding task.
in an embodiment of the present disclosure, referring to FIG. 4, the task scheduling module 412 continues to send the bind task after sending the start bind task. The bound task is an intermediate task in the bound related task group.
Referring to FIG. 4, after the cluster 405 of the multi-core computing device 41 receives the binding task, it reads the content read tag and the content write tag from the parameter table and starts to execute the task. As described above, each binding task may correspond to a data read-write tag, and the content read tag and the content write tag may define whether to read input data from an on-chip storage resource of the computing device and whether to write output data back to an external storage resource of the computing device. At least one intermediate task in the group of related tasks may be an associated task that needs to be executed on the cluster; the content read tag of the binding task may indicate reading input data from an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41, and the content write tag of the binding task may indicate residing the output data in an on-chip storage resource (e.g., SRAM408). The cluster 405 reads and writes data through the application binary interface according to the content read tag and the content write tag of the binding task.
For example, there may be multiple binding tasks, and all of them are executed by repeatedly performing step S6032. Alternatively, the related task packet may contain only the start binding task and the end binding task; in that case, since there is no binding task, step S6032 is not performed.
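The overall flow just described — a start binding task that locks the cluster, zero or more intermediate binding tasks, then an end binding task that releases it — can be sketched as a simple driver loop. When the packet contains only the start and end tasks, the middle loop runs zero times. Function and value names are illustrative:

```python
# Minimal driver for the bound-group flow: start binding task locks the
# cluster, intermediate binding tasks repeat, end binding task releases it.

def run_related_group(start_task, middle_tasks, end_task):
    log = [("lock+run", start_task)]       # lock cluster, run start task
    for task in middle_tasks:              # step repeated per binding task
        log.append(("run", task))
    log.append(("run+unlock", end_task))   # run end task, release cluster
    return log

assert run_related_group("start", ["bind1", "bind2"], "end") == [
    ("lock+run", "start"), ("run", "bind1"), ("run", "bind2"), ("run+unlock", "end"),
]
# Packet containing only the start and end binding tasks:
assert run_related_group("start", [], "end") == [
    ("lock+run", "start"), ("run+unlock", "end"),
]
```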
In this embodiment, the multi-core computing device 41 is further configured to receive an end binding task, read and write data according to the specification of the data read-write tag, execute the end binding task, and then release the locked cluster.
In an embodiment of the present disclosure, referring to fig. 4, after the cluster 405 finishes executing the binding tasks, it receives the end binding task sent by the task scheduling module 412. The end binding task represents the last task in the bound related task group. After the cluster 405 receives the end binding task, it reads and writes data through the application binary interface according to the content read tag and the content write tag read from the parameter table. As described above, the content read tag and the content write tag may define whether to read input data from an on-chip storage resource of the computing device and whether to write output data back to an external storage resource of the computing device. The content read tag of the end binding task may indicate reading input data from an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41, and the content write tag of the end binding task may indicate writing the output data back to a storage resource (e.g., DRAM) external to the cluster.
After the end binding task is finished, the multi-core computing device 41 releases the locked cluster 405; that is, the cluster 405 may again be used to execute other tasks and resumes the normal data processing flow, for example, accessing data in a storage resource (e.g., DRAM) outside the cluster.
Referring to fig. 4, a computing device of a system-on-chip data processing device according to another embodiment of the present disclosure is illustrated, which may be implemented by the multi-core computing device 41. The present embodiment differs from the foregoing embodiments in that, in one embodiment of the present disclosure, the related task packet may include the related tasks together with a start marking task that marks the start of the related tasks and an end marking task that marks their end; the start marking task and the end marking task may not involve reading or writing any data.
In this embodiment, the multi-core computing device 41 is configured to receive and execute the start marking task, and lock the cluster executing the related task packet.
In an embodiment of the present disclosure, referring to fig. 4, the multi-core computing device 41 receives the start marking task sent by the task scheduling module 412, and locks the cluster 405 executing the related task packet. The start marking task is used to identify the start of a bound related task packet. After the multi-core computing device 41 locks the cluster 405 for executing the group of related tasks, the cluster 405 is only available for executing the tasks in that group and is not available for executing other tasks.
In this embodiment, the multi-core computing device 41 is further configured to receive related tasks, read and write data according to the specification of the data read-write tag, and execute the related tasks.
In an embodiment of the present disclosure, referring to fig. 4, the data read-write tags of the related tasks may include a content write tag and a content read tag. The content read tag of the first of the related tasks may indicate that data is read from an external storage resource (e.g., DRAM) of the locked cluster, and its content write tag may indicate that the output data resides in an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41. The content read tag of the last of the related tasks may indicate that the input data is read from an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41, and its content write tag may indicate that the output data is written back to an external storage resource (e.g., DRAM) of the multi-core computing device 41. The content read tag of an intermediate task between the first task and the last task may indicate reading input data from an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41, and its content write tag may indicate residing the output data in an on-chip storage resource (e.g., SRAM408) of the multi-core computing device 41.
The locked cluster can, according to the data read-write tag of each related task, acquire the data required for executing that task and execute each related task in turn.
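The tag pattern described above — first task reads from DRAM and resides its output in SRAM, intermediate tasks read from and reside in SRAM, last task reads from SRAM and writes back to DRAM — can be sketched as follows. The dictionary layout for the tags is an assumed representation for illustration only:

```python
# Sketch of the per-position tag pattern for a marker-delimited related task
# group. The dict layout is an illustrative assumption, not a disclosed format.

def tags_for_group(n_tasks):
    tags = []
    for i in range(n_tasks):
        first, last = (i == 0), (i == n_tasks - 1)
        tags.append({
            "read_from_sram": not first,   # only the first task reads DRAM
            "write_to_dram": last,         # only the last task writes back
            "reside_in_sram": not last,    # all but the last stay on chip
        })
    return tags

tags = tags_for_group(3)
assert tags[0] == {"read_from_sram": False, "write_to_dram": False, "reside_in_sram": True}
assert tags[1] == {"read_from_sram": True, "write_to_dram": False, "reside_in_sram": True}
assert tags[2] == {"read_from_sram": True, "write_to_dram": True, "reside_in_sram": False}
```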
In this embodiment, the multi-core computing device 41 is further configured to receive and execute an end-marking task, and release the locked cluster.
In an embodiment of the present disclosure, referring to fig. 4, after the bound cluster 405 finishes executing the bound related tasks, it receives the end marking task sent by the task scheduling module 412. The end marking task is used to identify the end of the bound related task packet. After the cluster 405 completes the end marking task, the locked cluster 405 is released, i.e., the cluster 405 may again be used to perform other tasks and resumes the normal data processing flow; for example, data may be accessed in a storage resource (e.g., DRAM) outside the cluster.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
It is further noted that, although the various steps in the flowcharts of figs. 6-8 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not performed in a strict order and may be performed in other orders. Also, at least some of the steps in figs. 6-8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the order of performance of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
It should be understood that the above-described apparatus embodiments are merely illustrative and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.
In addition, unless otherwise specified, each functional unit/module in each embodiment of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.
If the integrated unit/module is implemented in hardware, the hardware may be digital circuits, analog circuits, and so on. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the artificial intelligence processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the memory unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), or hybrid memory cube (HMC).
The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. All possible combinations of the technical features in the above embodiments may not be described for the sake of brevity, but should be construed as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The foregoing may be better understood in light of the following clauses:
clause a1, a method for data processing in a system on a chip, the method comprising:
receiving tasks and related task grouping information;
issuing tasks in the related task groups to a cluster designated in the computing device; wherein the related task grouping is determined according to the related task grouping information;
and executing the tasks in the related task groups by using the designated cluster, and temporarily storing output data of at least one task in the related task groups in an on-chip storage resource of the computing device.
Clause A2, the method according to clause A1, wherein the related task grouping information includes start information, the method comprising:
and locking at least one cluster in the computing devices as the designated cluster according to the starting information in the task grouping information.
Clause A3, the method according to clause A1, wherein the task grouping information includes end information; the method further comprises the following steps:
and according to the end information in the task grouping information, unlocking the designated cluster.
Clause A4, the method according to any of clauses A1-3, wherein the related task groups include a start bound task, a bound task, and an end bound task; the executing tasks in the relevant task group using the designated cluster further comprises:
receiving the binding starting task, locking a cluster for executing related task groups, reading and writing data according to a data read-write label, and executing the binding starting task; wherein the start binding task includes start information in the task grouping information;
receiving the binding task, reading and writing data according to the data reading and writing label, and executing the binding task;
and receiving the binding ending task, reading and writing data according to the specification of the data reading and writing label, and executing the binding ending task.
Clause A5, the method of clause A4, wherein the data read-write tags include a content write tag, the content write tag indicating whether to write the output data back to an external storage resource of the computing device and whether to reside the output data in an on-chip storage resource of the computing device;
the reading and writing of data according to the data reading and writing label and the execution of the binding starting task comprise:
and according to the data write tag, residing the output data of the start binding task in the on-chip storage resource of the computing device.
Clause A6, the method of clause A5, wherein the data read-write tag comprises a content read tag, the content read tag indicating whether input data is read from an on-chip storage resource of the computing device;
the reading and writing of data according to the data read-write tag and the execution of the binding task comprise:
reading input data from an on-chip storage resource of the computing device according to the data read tag;
and according to the data write label, residing the output data of the binding task in an on-chip storage resource of the computing device.
Clause A7 and the method according to clause A6, wherein the reading and writing of data according to the provision of the data read-write tag, and the execution of the binding end task, include:
and writing the output data of the binding task back to the external storage resource of the computing device according to the data write-in tag.
Clause A8, the method of any of clauses A1-3, wherein the performing tasks in the group of related tasks using the designated cluster further comprises:
receiving and executing a start marking task, and locking a cluster for executing related task groups;
receiving related tasks, reading and writing data according to the provisions of the data reading and writing labels, executing the related tasks, and residing output data of other tasks except the last task in the related tasks in on-chip storage resources of the computing device;
and receiving and executing an end marking task, and releasing the locked cluster.
Clause A9, a data processing apparatus, characterized in that the apparatus comprises:
the task scheduling module is used for receiving tasks and related task grouping information and issuing the tasks in the related task grouping to a specified cluster of the computing device; wherein the related task grouping is determined according to the related task grouping information;
and the computing device comprises at least one cluster and is used for executing the tasks in the related task group by using the specified cluster and temporarily storing the output data of at least one task in the related task group in an on-chip storage resource of the computing device.
Clause a 10, the apparatus according to clause A9, wherein the task grouping information includes start information and end information;
the computing device is used for locking at least one cluster in the computing device as the designated cluster according to the starting information in the task grouping information.
Clause a 11, the apparatus according to clause A9, characterized in that,
the computing device is further used for unlocking the cluster according to the end information in the task grouping information.
Clause a 12, the apparatus according to any of clauses A9-11, wherein the related task group comprises a start bound task, a bound task, and an end bound task; the computing device is configured to execute the task in the related task group using the designated cluster, and specifically includes:
the system comprises a task processing module, a task binding module and a task binding module, wherein the task binding module is used for receiving a binding starting task, locking a cluster for executing related task groups, reading and writing data according to the specification of a data reading and writing label and executing the binding starting task; wherein the start binding task includes start information in the task grouping information;
the data reading and writing device is used for receiving the binding task, reading and writing data according to the specification of the data reading and writing label and executing the binding task;
and the binding end task is used for receiving the binding end task, reading and writing data according to the specification of the data reading and writing label, and executing the binding end task.
Clause a 13, the apparatus of clause a 12, wherein the data read-write tag comprises a content write tag, the content write tag indicating whether to write the output data back to an external storage resource of the computing device and whether to reside the output data in an on-chip storage resource of the computing device;
the computing device is specifically configured to reside the output data of the start binding task in an on-chip storage resource of the computing device according to the data write tag.
Clause a 14, the apparatus of clause a 13, wherein the data read-write tag comprises a content read tag, the content read tag indicating whether to read input data from an on-chip storage resource of the computing device;
the computing device is specifically configured to:
reading input data from an on-chip storage resource of the computing device according to the data read tag;
and according to the data write label, residing the output data of the binding task in an on-chip storage resource of the computing device.
Clause A15, the apparatus according to clause A14, characterized in that the computing device is specifically configured to:
and writing the output data of the binding task back to the external storage resource of the computing device according to the data write-in tag.
Clause a 16, the apparatus according to any of clauses A9-11, wherein the computing apparatus is configured to execute the tasks in the related task group using the locked cluster, specifically:
for receiving and executing the start marking task, and locking the cluster for executing the related task group;
for receiving the related tasks, reading and writing data according to the specification of the data read-write tag, executing the related tasks, and residing the output data of the tasks other than the last task among the related tasks in an on-chip storage resource of the computing device;
and for receiving and executing the end marking task, and releasing the locked cluster.
Clause a 17, a system-on-chip, comprising the data processing apparatus of any of clauses A9-16.
Clause a 18, a board, wherein the board comprises the system-on-chip of clause a 17.
The embodiments of the present disclosure have been described in detail above, and specific examples are used herein to explain the principles and implementations of the present disclosure; the above description of the embodiments is provided only to help understand the method and core ideas of the present disclosure. Meanwhile, those skilled in the art may, based on the ideas of the present disclosure, make changes to the specific embodiments and the application scope. In view of the above, the contents of this specification should not be construed as limiting the present disclosure.

Claims (18)

1. A method of data processing for a system on a chip, the method comprising:
receiving tasks and related task grouping information;
issuing tasks in the related task groups to a cluster designated in the computing device; wherein the related task grouping is determined according to the related task grouping information;
and executing the tasks in the related task groups by using the designated cluster, and temporarily storing output data of at least one task in the related task groups in an on-chip storage resource of the computing device.
2. The method of claim 1, wherein the relevant task grouping information comprises start information, the method comprising:
and locking at least one cluster in the computing devices as the designated cluster according to the starting information in the task grouping information.
3. The method of claim 1, wherein the task grouping information includes end information; the method further comprises the following steps:
and releasing the lock of the designated cluster according to the end information in the task grouping information.
4. The method of any of claims 1-3, wherein the related task groups include a start binding task, a binding task, and an end binding task; the executing tasks in the relevant task group using the designated cluster further comprises:
receiving the binding starting task, locking a cluster for executing related task groups, reading and writing data according to a data read-write label, and executing the binding starting task; wherein the start binding task includes start information in the task grouping information;
receiving the binding task, reading and writing data according to the data read-write label, and executing the binding task;
and receiving the binding ending task, reading and writing data according to the specification of the data reading and writing label, and executing the binding ending task.
5. The method of claim 4, wherein the data read-write tag comprises a content write tag that indicates whether to write the output data back to an external storage resource of the computing device and whether to reside the output data in an on-chip storage resource of the computing device;
the reading and writing of data according to the data reading and writing label and the execution of the binding starting task comprise:
and according to the data write label, residing the output data of the starting binding task in an on-chip storage resource of the computing device.
6. The method of claim 5, wherein the data read-write tag comprises a content read tag indicating whether to read input data from an on-chip storage resource of the computing device;
reading and writing data according to the data reading and writing label, and executing the binding task, wherein the binding task comprises the following steps:
reading input data from an on-chip storage resource of the computing device according to the data read tag;
and according to the data write label, residing the output data of the binding task in an on-chip storage resource of the computing device.
7. The method of claim 6, wherein reading and writing data as specified by the data read/write tag, performing the end binding task comprises:
and writing the output data of the binding task back to the external storage resource of the computing device according to the data writing label.
8. The method of any of claims 1-3, wherein the performing tasks in the relevant task group using the specified cluster, further comprises:
receiving and executing a start marking task, and locking a cluster for executing related task groups;
receiving related tasks, reading and writing data according to the provisions of the data reading and writing labels, executing the related tasks, and residing output data of other tasks except the last task in the related tasks in on-chip storage resources of the computing device;
and receiving and executing an end marking task, and releasing the locked cluster.
9. A data processing apparatus, characterized in that the apparatus comprises:
the task scheduling module is used for receiving tasks and related task grouping information and issuing the tasks in the related task grouping to a specified cluster of the computing device; wherein the related task grouping is determined according to the related task grouping information;
and the computing device comprises at least one cluster and is used for executing the tasks in the related task groups by using the specified cluster and temporarily storing the output data of at least one task in the related task groups in the on-chip storage resources of the computing device.
10. The apparatus of claim 9, wherein the task grouping information comprises start information and end information;
the computing device is used for locking at least one cluster in the computing device as the designated cluster according to the starting information in the task grouping information.
11. The apparatus of claim 9,
the computing device is further used for unlocking the cluster according to the end information in the task grouping information.
12. The apparatus according to any of claims 9-11, wherein the related task groups comprise a start binding task, a binding task, and an end binding task; the computing device is configured to execute the task in the related task group using the designated cluster, and specifically includes:
the system comprises a task processing module, a task binding module and a task binding module, wherein the task binding module is used for receiving a binding starting task, locking a cluster for executing related task groups, reading and writing data according to the specification of a data reading and writing label and executing the binding starting task; wherein the start binding task includes start information in the task grouping information;
the data reading and writing device is used for receiving the binding task, reading and writing data according to the specification of the data reading and writing label and executing the binding task;
and the binding end task is used for receiving the binding end task, reading and writing data according to the specification of the data read-write label, and executing the binding end task.
13. The apparatus of claim 12, wherein the data read/write tags comprise a content write tag indicating whether to write the output data back to an external storage resource of the computing device and whether to reside in an on-chip storage resource of the computing device;
the computing device is specifically configured to reside the output data of the start binding task in an on-chip storage resource of the computing device according to the data write tag.
14. The apparatus of claim 13, wherein the data read-write tag comprises a content read tag indicating whether to read input data from an on-chip storage resource of the computing device;
the computing device is specifically configured to:
reading input data from an on-chip storage resource of the computing device according to the data reading tag;
and according to the data write label, residing the output data of the binding task in an on-chip storage resource of the computing device.
15. The apparatus of claim 14, wherein the computing device is specifically configured to:
and writing the output data of the binding task back to the external storage resource of the computing device according to the data write-in tag.
16. The apparatus according to any of claims 9-11, wherein the computing apparatus is configured to execute the task in the related task group using the locked cluster, specifically:
for receiving and executing the start marking task, and locking the cluster for executing the related task group;
for receiving the related tasks, reading and writing data according to the specification of the data read-write tag, executing the related tasks, and residing the output data of the tasks other than the last task among the related tasks in an on-chip storage resource of the computing device;
and for receiving and executing the end marking task, and releasing the locked cluster.
17. A system-on-chip, characterized in that it comprises a data processing device according to any one of claims 9 to 16.
18. A board comprising the system-on-chip of claim 17.
CN202011631083.6A 2020-12-30 2020-12-30 Data processing method and device of system on chip Pending CN114691313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011631083.6A CN114691313A (en) 2020-12-30 2020-12-30 Data processing method and device of system on chip


Publications (1)

Publication Number Publication Date
CN114691313A true CN114691313A (en) 2022-07-01

Family

ID=82133413




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination