CN117667209A - Data processing method and related product - Google Patents

Data processing method and related product

Info

Publication number
CN117667209A
Authority
CN
China
Prior art keywords
result
processor core
computing
processor
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311482407.8A
Other languages
Chinese (zh)
Inventor
Name withheld at the applicant's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202311482407.8A priority Critical patent/CN117667209A/en
Publication of CN117667209A publication Critical patent/CN117667209A/en
Pending legal-status Critical Current

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Advance Control (AREA)

Abstract

The present disclosure discloses a data processing method, comprising: a processor core of a computing device acquires the input data corresponding to the current task and determines the non-zero values in its input data using a computing instruction; meanwhile, using an IO instruction, the processor core stores the dimensions of its non-zero values from the previous task to the corresponding positions in the global memory, according to the offset of the storage positions of those non-zero values in the previous task; the processor core fetches the first reduction result and the second reduction result from the shared memory using a move instruction, and determines the storage positions of its non-zero values in the current task from the first reduction result and the second reduction result.

Description

Data processing method and related product
Technical Field
The present disclosure relates generally to the field of intelligent computing, and more particularly to the field of neural networks. More particularly, the present disclosure relates to data processing methods, computing devices, computer readable storage media, and computer program products.
Background
In intelligent computing systems, common operations in neural network model algorithms, such as convolution and pooling, are packaged as operators by a programming framework for direct invocation by programmers. TensorFlow, PyTorch, and the like are currently popular deep learning frameworks. In these programming frameworks, computational graphs are typically used to describe the computation of machine learning algorithms, with tensors representing all data in the computational graph and operators representing the various operations.
In the CUDA implementation, the where operator in PyTorch invokes cub::DeviceSelect::Flagged, which internally implements a synchronization mechanism between thread blocks: each thread block first initializes its own state in global memory (to something like "not ready") and then starts processing its own data block; at a certain moment during execution, the current thread block must wait for the previous thread block to update to the "ready" state before it can continue reading that block's data for subsequent operations, and at an appropriate moment it updates its own state to "ready" so that subsequent thread blocks can read the data they need. The problem being solved here is the storage location of the data. This approach accesses global memory directly and wastes bandwidth.
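For reference, a minimal PyTorch sketch of the operator semantics discussed above; the random boolean mask is hypothetical and only the shapes mirror the example used later in this disclosure:

```python
import torch

# Hypothetical boolean mask with the shape used in the later example.
mask = torch.rand(24, 832, 832) > 0.5

# nonzero/where-style operator: returns the multi-dimensional indices of all
# non-zero (True) elements. On CUDA this is the code path that, as noted above,
# relies on cub-based stream compaction (cub::DeviceSelect::Flagged).
indices = torch.nonzero(mask)          # shape [N, 3], one row per non-zero element
indices_t = indices.t().contiguous()   # shape [3, N], the layout used in this disclosure
```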
In addition, the artificial intelligence processor has its own inter-core synchronization mechanism, but the general method of acquiring inter-core data causes a serious LLC (last-level cache) pending problem, which greatly reduces operator performance.
Disclosure of Invention
To at least partially solve one or more of the technical problems mentioned in the background, the present disclosure provides solutions from a number of aspects.
In a first aspect, the present disclosure discloses a data processing method comprising:
A processor core of a computing device acquires the input data corresponding to the current task and determines the non-zero values in its input data using a computing instruction; meanwhile, using an IO instruction, the processor core stores the dimensions of its non-zero values from the previous task to the corresponding positions in the global memory according to the storage positions of those non-zero values in the previous task;
the processor core determines the dimensions of the non-zero values of its corresponding input data in the current task using a computing instruction; at the same time, using a move instruction, the processor core stores the number of non-zero values corresponding to the processor core in the current task to the corresponding position in a shared memory, and a first reduction is performed on the non-zero value counts corresponding to the processor cores in the current computing cluster to obtain a first reduction result; the first reduction result comprises the non-zero value counts corresponding to the current computing cluster;
the processor core performs a second reduction on the non-zero value counts corresponding to the current computing cluster in the first reduction result using an IO instruction to obtain a second reduction result; the second reduction result is stored in a global memory;
the memory control circuit of the computing cluster uses an IO instruction to move the second reduction result from the global memory to the corresponding position of the shared memory of the computing cluster for storage;
the processor core obtains the first reduction result and the second reduction result from the shared memory using a move instruction, and determines the storage positions of its non-zero values in the current task using the first reduction result and the second reduction result.
Preferably, in this embodiment, the first reduction result is stored in a corresponding location in the shared memory.
Preferably, in this embodiment, within a computing cluster, the sum of the non-zero value counts corresponding to the first (i+1) processor cores is taken as the reduction result of the (i+1)-th processor core; the reduction results of the processor cores in the computing cluster constitute the first reduction result.
Preferably, in this embodiment, the sum of the non-zero value counts corresponding to the first (i+1) computing clusters of the computing device is taken as the reduction result of the (i+1)-th computing cluster; the reduction results of the computing clusters of the computing device constitute the second reduction result.
Preferably, in this embodiment, the step of determining the storage positions of the non-zero values of the processor core in the current task using the first reduction result and the second reduction result includes:
acquiring the reduction result of the target processor core from the first reduction result;
acquiring the reduction result of the target computing cluster from the second reduction result;
and adding the reduction result of the target processor core and the reduction result of the target computing cluster, the obtained result being the starting storage position of the non-zero values in the input data of the current processor core.
Preferably, in the present embodiment, the step of determining, by the processor core using the calculation instruction, a dimension of a non-zero value of the input data corresponding to the processor core in the current task includes:
and determining the dimension of the input data assigned to the processor core according to the type of the task, the number of computing clusters in the computing device, the number of processor cores in each computing cluster, the input data scale of the task, and the maximum data scale processed by the computing device each time.
Preferably, in this embodiment, the data processing method further includes:
and the memory control circuit of the computing cluster performs a zero-setting operation on the position in the global memory where the second reduction result is stored.
In a second aspect, the present disclosure provides a computing device for performing data processing, comprising:
a processor configured to execute program instructions; and
a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform the method according to the first aspect of the present disclosure.
In a third aspect, the present disclosure provides a computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform the method according to the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the method of the first aspect of the present disclosure.
With the scheme provided by the present disclosure, the inter-core communication resources on the artificial intelligence processor chip can be fully exploited so that hardware synchronizes the pipelines of different cores, achieving parallelism between inter-core reduction and intra-core computation.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3a illustrates a schematic internal architecture of a single processor core of a single core computing device of an embodiment of the present disclosure;
FIG. 3b shows a simplified schematic diagram of the internal structure of a multi-core computing device of an embodiment of the present disclosure;
FIG. 4 shows a simplified schematic of the internal architecture of a computing device when it is multi-core;
FIG. 5 illustrates a data processing method flow diagram of an embodiment of the present disclosure;
FIG. 6 illustrates a schematic diagram of communication between computing clusters and between the processor cores of a computing cluster in a computing device of an embodiment of the disclosure;
FIG. 7a illustrates one of the input data schematics of a computing device;
FIG. 7b shows a second schematic diagram of input data of the computing device;
FIG. 7c illustrates a third schematic diagram of input data to the computing device;
FIG. 8a illustrates one of the memory schematics of the shared memory corresponding to the processor cores of the computing device;
FIG. 8b illustrates a second storage schematic of a shared memory corresponding to a processor core of a computing device;
FIG. 8c illustrates a first reduction result stored in a shared memory corresponding to a processor core of a computing device;
FIG. 9a illustrates one of the storage schematics of the second reduction result stored in the global memory corresponding to the computing clusters of the computing device;
FIG. 9b illustrates a schematic diagram of a shared memory of a computing device storing a first reduction result and a second reduction result;
FIG. 10 illustrates a block diagram of a hardware configuration of a computing device in which various aspects of embodiments of the disclosure may be implemented.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that may be made by those skilled in the art without inventive effort are within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to detection" depending on the context.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary hardware architecture
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a System on Chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the field of cloud intelligence; a notable feature of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, having huge off-chip storage, on-chip storage, and strong computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and storage means 204.
The computing device 201 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively accomplish the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control including, but not limited to, data handling and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of general-purpose and/or special-purpose processors such as a central processing unit (CPU) or graphics processing unit (GPU), including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, when considered alone, the computing device 201 of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure; however, when the computing device 201 and the processing device 203 are considered together, they are regarded as forming a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed; it may be a DRAM serving as DDR memory, typically 16 GB or larger in size, for storing data of the computing device 201 and/or the processing device 203.
When the computing device 201 runs the neural network, the processing device 203 is generally required to compile the neural network to obtain an executable file, where the executable file includes device information, that is, which device in the heterogeneous computer system the executable file needs to be executed on. The executable file is assembled and linked to obtain an executable program of the neural network, and the executable program is stored in the storage device 204.
The processing device 203 may read an executable program from a storage location of the executable program and obtain a plurality of tasks of the program according to the executable program. These tasks are distributed via the interface means 202 to the computing means 201 for execution, ultimately obtaining the result of the operation.
Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 in fig. 2 is a single-core device. The computing device 301 is configured to process input data in fields such as computer vision, speech, natural language, and data mining, and includes three modules: a control module 31 (also referred to as a controller), an operation module 32 (also referred to as an operator), and a storage module 33 (also referred to as a memory).
The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 311 and an instruction decode unit (instruction decode unit, IDU) 312. The instruction fetching unit 311 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 312 decodes the fetched instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used for storing or transferring related data, and includes a neuron storage unit (NRAM) 331, a weight storage unit (WRAM) 332, and a direct memory access module (direct memory access, DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results of calculation; WRAM 332 is used to store the convolution kernels, i.e., weights, of the deep learning network; DMA 333 is coupled to DRAM 204 via bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204. It should be noted that the NRAM and WRAM herein may be two memory areas formed by dividing the same memory in the logical memory space, or may be two independent memories, which is not specifically limited here.
Fig. 3b shows a simplified schematic diagram of the internal architecture of the computing device 201 when it is multi-core. The multi-core computing device may be abstracted using a hierarchical hardware model. As shown, the multi-core computing device may be abstracted into three levels, namely the chip level (Chip) 360, the processor cluster level (Cluster) 370, and the processor core level (Core) 380. The embodiments of the present disclosure mainly concern the data transfer of the storage units and the computing units, so the drawings and description only briefly show and introduce the relevant computing structures, and other parts are omitted.
At the chip level, local DDR memory is included on each chip, each processor chip acts as a compute and control unit, and each processor chip includes multiple processors as compute units.
At the processor cluster level, each multiprocessor includes a plurality of accelerator cores as control and computation units, and further has a shared memory SRAM as a memory unit.
At the processor core level, each accelerator core includes an array of local memory and local processing units. NFU refers to a neural arithmetic unit (Neuron Function Unit) for performing convolution calculations. The structure of a single processor core may be similar to the block diagram of a single core computing device shown in FIG. 3a and will not be described in detail herein.
In the multi-core computing device, the storage model includes the board-level global memory, the SRAM (shared memory) on each Cluster, and the NRAM, WRAM, registers, and the like on each Core. For better performance, the data movement and the memory/computation balance between the memory levels below the Card may be explicitly controlled. The SRAM is included in a memory processing unit (Memory Process Unit Core, abbreviated MPU, or Mem Core). Core refers to an intelligent processing core (Intelligent Process Unit Core, abbreviated IPU Core or Core) in the multi-core computing device. One IPU Core contains NRAM, WRAM, an NFU, and so on. Cluster refers to a processor cluster or computing cluster; typically a multi-core computing device comprises several Clusters, and one Cluster comprises 1 Mem Core + N IPU Cores.
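As a minimal illustration only (hypothetical Python classes, not part of the disclosure), the Chip-Cluster-Core hierarchy described above can be modeled as follows; the 8 x 4 configuration matches the worked example later in this description:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IPUCore:
    """Intelligent processing core: private NRAM, WRAM and an NFU."""
    core_id: int

@dataclass
class Cluster:
    """One memory core (MPU) owning the shared SRAM, plus N IPU cores."""
    cluster_id: int
    cores: List[IPUCore] = field(default_factory=list)

@dataclass
class Chip:
    """System on chip: several clusters sharing the board-level global (DDR) memory."""
    clusters: List[Cluster] = field(default_factory=list)

# Worked example later in this disclosure: 8 clusters x 4 IPU cores.
chip = Chip(clusters=[
    Cluster(cluster_id=c, cores=[IPUCore(core_id=i) for i in range(4)])
    for c in range(8)
])
```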
Fig. 4 shows a simplified schematic diagram of the internal architecture of the computing device 201 of fig. 2 when it is multi-core. The multi-core computing device may be abstracted using a hierarchical hardware model. As shown, the multi-core computing device 400 is a system-on-chip that includes at least one compute cluster (cluster), each of which in turn includes a plurality of processor cores, in other words, the multi-core computing device 400 is formed in a system-on-chip-compute cluster-processor core hierarchy.
At the system-on-chip level, as shown, the multi-core computing device 400 includes an external memory controller 41, a peripheral communication module 42, an on-chip interconnect module 43, a global synchronization module 44, and a plurality of computing clusters 45.
There may be a plurality of external memory controllers 41, 2 being shown by way of example, for accessing external memory devices (e.g., DRAM 204 in FIG. 2) to read data from or write data to off-chip in response to access requests issued by the processor cores. The peripheral communication module 42 is configured to receive a control signal from the processing device (203 of fig. 2) via the interface device (202 of fig. 2) and to initiate the computing device (201 of fig. 2) to perform a task. The on-chip interconnect module 43 connects the external memory controller 41, the peripheral communication module 42, and the plurality of computing clusters 45 for transmitting data and control signals between the respective modules. The global synchronization module 44 is, for example, a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each computing cluster to ensure synchronization of information. The plurality of computing clusters 45 are the computing cores of the multi-core computing device 400, 4 on each die being illustratively shown, the multi-core computing device 400 of the present disclosure may also include 8, 16, 64, or even more computing clusters 45 as hardware evolves. The computing clusters 45 are used to efficiently execute the deep learning algorithm.
At the level of the compute clusters, as shown in fig. 4, each compute cluster 45 includes a plurality of processor cores 406 as control and compute units, and a shared memory core 407 as a memory unit. Further, each computing cluster may further include a local synchronization module 412, configured to coordinate the working progress of each processor core in the computing cluster, so as to ensure synchronization of information. The processor cores 406 are illustratively shown as 4, and the present disclosure does not limit the number of processor cores 406.
The storage cores 407 are mainly used for storing and communicating, i.e., storing shared data or intermediate results between the processor cores 406, and executing communication between the compute clusters 45 and the DRAM 204, communication between the compute clusters 45, communication between the processor cores 406, and the like. In other embodiments, the memory core 407 has scalar operation capabilities to perform scalar operations.
The memory core 407 includes a shared memory unit (SRAM) 408, a broadcast bus 409, a compute cluster direct memory access module (cluster direct memory access, CDMA) 410, and a global direct memory access module (global direct memory access, GDMA) 411. The SRAM 408 plays the role of a high-performance data transfer station: data multiplexed between different processor cores 406 in the same computing cluster 45 does not need to be fetched from the DRAM 204 by each processor core 406 separately, but is transferred between the processor cores 406 through the SRAM 408. The memory core 407 only needs to rapidly distribute the multiplexed data from the SRAM 408 to the processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses. Broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication between processor cores 406, communication between compute clusters 45, and data transfer between compute clusters 45 and DRAM 204, respectively.
At the level of the processor cores, the structure of a single processor core may be similar to the block diagram of a single core computing device shown in FIG. 3a, and will not be described in detail herein.
Data processing scheme
The following describes a data processing scheme provided by an embodiment of the present disclosure, as shown in fig. 5, comprising the following steps:
Step 501), a processor core of a computing device acquires the input data corresponding to the current task and determines the non-zero values in its input data using a computing instruction; at the same time, using an IO instruction, the processor core stores the dimensions of its non-zero values from the previous task to the corresponding positions in the global memory according to the storage positions of those non-zero values in the previous task.
Fig. 6 shows a schematic diagram of communication between computing clusters and between the processor cores of a computing cluster in a computing device according to an embodiment of the disclosure. The processor cores IPU0 and IPU1 in computing cluster Cluster0 determine the non-zero values in their corresponding input data through computing instructions. Suppose the computing device launches a U8 task: the computing device has 8 computing clusters in total, each with one MPU (Memory Processing Unit) and 4 processor cores. The input data is a tensor of shape [24, 832, 832] and data type bool, containing 1661076 non-zero values; the output data is a tensor of shape [3, 1661076] and data type int32. At this scale, the data needs to be processed in 74 iterations, with each processor core handling at most 7040 inputs per iteration, as shown in fig. 7a. The range of data offsets handled by Cluster1_IPU1 in the 3rd iteration is:
[7040 x 32 x 2 + 7040 x 5, 7040 x 32 x 2 + 7040 x 6 - 1], i.e., [485760, 485760 + 7040 - 1]. There is no need to pay attention to the end position; only the start offset 485760 and the data count 7040 matter. From the start offset 485760, the corresponding coordinate can be calculated as [0, 583, 704]. The period of the lowest dimension is 832 (less than 7040), the period of the second dimension is 832 x 832 (greater than 7040), and the period of the highest dimension is 24 x 832 x 832 (greater than 7040). For a dimension whose period is greater than 7040, only part of one period of data needs to be generated; for a dimension whose period is less than 7040, one full period of data is generated first and then extended by repeated copying.
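The task split and the offset-to-coordinate computation above can be checked with a short sketch (plain Python written only to reproduce the arithmetic; on the device this is performed with scalar and vector instructions):

```python
import math

shape = (24, 832, 832)                 # input tensor shape from the example
num_clusters, cores_per_cluster = 8, 4
per_core = 7040                        # maximum elements one core processes per iteration

total = shape[0] * shape[1] * shape[2]                        # 16,613,376 elements
per_iteration = per_core * num_clusters * cores_per_cluster   # 225,280 elements
iterations = math.ceil(total / per_iteration)                 # 74 iterations

# Start offset of Cluster1/IPU1 (global core index 1 * 4 + 1 = 5) in the 3rd iteration.
iteration, cluster, core = 2, 1, 1
global_core = cluster * cores_per_cluster + core
start = per_iteration * iteration + per_core * global_core    # 485,760

# Decompose the linear offset into the 3-D coordinate [d0, d1, d2].
d0, rest = divmod(start, shape[1] * shape[2])
d1, d2 = divmod(rest, shape[2])
print(iterations, start, [d0, d1, d2])                        # 74 485760 [0, 583, 704]
```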
As can be appreciated from the illustration in FIG. 7a, a 704-831 sequence and a 0-703 sequence need to be generated. For the 704-831 sequence, a 704-735 sequence is first written to the on-chip memory of the computing device using scalar instructions; then 32 is added to the numbers already generated using vector instructions to obtain a new sequence; then 64 is added to the first two sequences to obtain another new sequence; and so on, doubling each time, until the desired sequence is generated, as shown in FIG. 7b.
For the second dimension, as shown in fig. 7c, 128 copies of 583, 832 copies of 584, 832 copies of 585, ..., 832 copies of 591, and 256 copies of 592 need to be generated using vector instructions; for the highest dimension, 7040 zeros need to be generated using vector instructions. Since only the period of the lowest dimension is less than 7040, only the data of the lowest dimension needs periodic copying. Each copy duplicates all of the previously generated data; if the amount would exceed what is required, the copy is directly truncated to the required amount.
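A plain-Python sketch of the index-sequence generation described above (hypothetical helpers; on the device the seed values are written with scalar instructions and the doubling steps are vector add/copy instructions):

```python
def doubling_sequence(start, count, seed_len=32):
    """Build [start, start+1, ..., start+count-1]: seed the first seed_len values
    (scalar writes), then repeatedly append a shifted copy of everything generated
    so far (vector adds), truncating once enough values exist."""
    seq = list(range(start, start + min(seed_len, count)))
    while len(seq) < count:
        step = len(seq)
        seq = (seq + [v + step for v in seq])[:count]
    return seq

def periodic_copy(one_period, total):
    """Extend one full period to `total` values by repeatedly appending a copy of
    everything generated so far, truncating the final copy as required."""
    out = list(one_period)
    while len(out) < total:
        out = (out + out)[:total]
    return out

# Lowest dimension for the example: 704..831 followed by repeated 0..831 periods, 7040 values.
head = doubling_sequence(704, 128)                      # 704, 705, ..., 831
period = doubling_sequence(0, 832)                      # 0, 1, ..., 831
lowest_dim = head + periodic_copy(period, 7040 - 128)   # 7040 index values in total
```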
Step 502), the processor core determines the dimensions of the non-zero values of its corresponding input data in the current task using a computing instruction; at the same time, using a move instruction, the processor core stores the number of non-zero values corresponding to the processor core in the current task to the corresponding position in the shared memory, and a first reduction is performed on the non-zero value counts corresponding to the processor cores in the current computing cluster to obtain a first reduction result; the first reduction result comprises the non-zero value counts corresponding to the current computing cluster.
In this embodiment, assume that the number of non-zero values of processor core IPU1 in computing Cluster1 is 7039 and the number of non-zero values of each of the remaining 31 processor cores is 7040; the cores need to exchange their non-zero counts in order to determine the storage positions of the data. In order to run in parallel with the computation flow, a finer-grained way of controlling inter-core synchronization is required, as shown in fig. 8a. The non-zero counts of the input data of IPU0-IPU3 of Cluster0, IPU0-IPU3 of Cluster2, IPU0-IPU3 of Cluster3, IPU0-IPU3 of Cluster4, IPU0-IPU3 of Cluster5, IPU0-IPU3 of Cluster6 and IPU0-IPU3 of Cluster7 are all 7040, and they are stored in the corresponding positions in the shared memories of the corresponding computing clusters through move instructions. As shown in fig. 8b, the non-zero count of the input data of IPU0 of Cluster1 is 7040, that of IPU1 of Cluster1 is 7039, that of IPU2 of Cluster1 is 7040, and that of IPU3 of Cluster1 is 7040; they are likewise stored in the corresponding positions in the shared memory of the corresponding computing cluster through move instructions. As shown in FIG. 6, a barrier instruction is inserted; after the non-zero counts of the input data of all processor cores of the computing device have been stored in the corresponding positions, the non-zero counts in the computing device are reduced. The reduction rule is as follows:
r_i = n_0 + n_1 + ... + n_i, where i denotes the sequence number of a processor core in the computing cluster and takes natural-number values, and n_i denotes the number of non-zero values in the input data of the (i+1)-th processor core. In short, within a computing cluster, the sum of the non-zero counts corresponding to the first (i+1) processor cores is taken as the reduction result of the (i+1)-th processor core; the reduction results of the processor cores in the computing cluster constitute the first reduction result. As can be seen from fig. 8a, 8b and 8c, the reduction results are stored in the corresponding locations of the shared memory.
The reduction rule is illustrated by way of example with Cluster1, as shown in FIG. 8c. For IPU0 of Cluster1, the original non-zero count of IPU0, namely 7040, is stored in the non-zero count storage location of IPU0. For IPU1 of Cluster1, the sum of the non-zero count of the input data corresponding to IPU0 and the non-zero count of the input data corresponding to IPU1 is stored in the original non-zero count storage location of IPU1, namely: 7040+7039=14079. For IPU2 of Cluster1, the original non-zero count storage location of IPU2 stores the sum of the non-zero counts of the input data corresponding to IPU0, IPU1 and IPU2, namely: 7040+7039+7040=21119. For IPU3 of Cluster1, the original non-zero count storage location of IPU3 stores the sum of the non-zero counts of the input data corresponding to IPU0, IPU1, IPU2 and IPU3, namely: 7040+7039+7040+7040=28159.
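The first reduction is simply an inclusive prefix sum over the per-core non-zero counts within a cluster; the following minimal sketch (hypothetical helper, plain Python) reproduces the Cluster1 numbers above:

```python
def first_reduction(core_counts):
    """Inclusive prefix sum: the (i+1)-th entry is the sum of the non-zero counts of
    the first (i+1) processor cores (on the device this is carried out on the
    cluster's shared memory using move instructions)."""
    result, running = [], 0
    for count in core_counts:
        running += count
        result.append(running)
    return result

# Cluster1 in the example: IPU0..IPU3 hold 7040, 7039, 7040, 7040 non-zero values.
print(first_reduction([7040, 7039, 7040, 7040]))   # [7040, 14079, 21119, 28159]
```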
Step 503), the processor core performs a second reduction on the non-zero value counts corresponding to the current computing cluster in the first reduction result using an IO instruction to obtain a second reduction result; the second reduction result is stored in the global memory.
As shown in fig. 6, a barrier instruction is inserted; after the first reduction results corresponding to all the computing clusters of the computing device have been stored in the corresponding positions, any processor core is selected to perform the second reduction on the first reduction results based on an IO instruction. In this embodiment, IPU0 in each computing cluster of the computing device performs the second reduction on the first reduction result based on an IO instruction, as shown in fig. 9a. The non-zero counts of the input data corresponding to Cluster0 (abbreviated C0), Cluster2 (C2), Cluster3 (C3), Cluster4 (C4), Cluster5 (C5), Cluster6 (C6) and Cluster7 (C7) are all 28160, and the non-zero count of the input data corresponding to Cluster1 (C1) is 28159. The rule of the second reduction is:
R_i = N_0 + N_1 + ... + N_i, where i denotes the sequence number of a computing cluster in the computing device and takes natural-number values, and N_i denotes the number of non-zero values in the input data of the (i+1)-th computing cluster. In short, the sum of the non-zero counts corresponding to the first (i+1) computing clusters of the computing device is taken as the reduction result of the (i+1)-th computing cluster; the reduction results of the computing clusters of the computing device constitute the second reduction result.
As can be seen from fig. 9a, the reduction result for the 1st computing cluster is the non-zero count of the 1st computing cluster, i.e., 28160. The reduction result for the 2nd computing cluster is the sum of the non-zero counts of the 1st and 2nd computing clusters, i.e., 28160+28159=56319. In the same way, the reduction result for the 3rd computing cluster is 28160+28159+28160=84479; for the 4th, 112639; for the 5th, 140799; for the 6th, 168959; for the 7th, 197119; and for the 8th, the sum of the non-zero counts of all eight computing clusters, i.e., 28160+28159+28160+28160+28160+28160+28160+28160=225279. The above reduction results are stored in the corresponding positions of the global memory.
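The second reduction follows the same inclusive prefix-sum rule, but over the per-cluster totals; a short sketch reproducing the figures above (reusing the hypothetical first_reduction helper from the previous sketch, since the rule is identical):

```python
# Per-cluster non-zero totals from the example: Cluster1 holds 28159, the other seven 28160.
cluster_counts = [28160, 28159, 28160, 28160, 28160, 28160, 28160, 28160]

# Same inclusive prefix-sum rule, applied across clusters instead of cores.
second = first_reduction(cluster_counts)
print(second)   # [28160, 56319, 84479, 112639, 140799, 168959, 197119, 225279]
```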
Step 504), the memory control circuit of the computing cluster uses an IO instruction to move the second reduction result from the global memory to the corresponding position of the shared memory of the computing cluster for storage.
As shown in fig. 6, a barrier instruction is inserted; after the second reduction results corresponding to all the computing clusters of the computing device have been stored in the corresponding positions, a memory control circuit is selected to perform the copy operation. The computing device has at least one computing cluster, and each computing cluster has a corresponding memory control circuit. The memory control circuit uses an IO instruction to move the second reduction result from the global memory to the corresponding position of the shared memory of the computing cluster for storage, as shown in fig. 9b. Taking Cluster1 as an example, the shared memory stores the first reduction result corresponding to each processor core in the computing cluster, namely 7040, 14079, 21119, 28159; the following area stores the second reduction result, namely 28160, 56319, 84479, 112639, 140799, 168959, 197119, 225279.
Step 505), the processor core obtains the first reduction result and the second reduction result from the shared memory using a move instruction, and determines the storage positions of its non-zero values in the current task using the first reduction result and the second reduction result.
As shown in fig. 6, after a barrier instruction is inserted and the second reduction results corresponding to all the computing clusters of the computing device have been stored in the corresponding positions, the processor core uses a move instruction to fetch the second reduction result and the first reduction result it needs from the shared memory, and uses them to determine the starting position of its own storage. Taking IPU1 of Cluster1 as an example, it needs to fetch the second reduction result 28160 corresponding to its computing cluster and the first reduction result 7040 among the processor cores within the computing cluster; IPU1 uses 28160 and 7040 to determine that its storage position is 28160+7040=35200. Given that the total number of non-zero values in the first 2 iterations is 7040 x 32 x 2 = 450560, the final storage position is 450560+35200. To determine the final storage position in the next iteration, the total non-zero count after this iteration needs to be updated, namely: 450560+225279.
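Putting the pieces together, the starting output offset of a given core can be sketched as follows (hypothetical helper, plain Python; the reduction entry of the preceding cluster and of the preceding core play the role of exclusive prefix sums, exactly as in the IPU1-of-Cluster1 example above):

```python
def output_start_offset(cluster, core, first_result, second_result, written_so_far):
    """Starting output position for (cluster, core) in the current iteration.

    first_result  : first reduction of the core's own cluster (intra-cluster prefix sums)
    second_result : second reduction across clusters (inter-cluster prefix sums)
    written_so_far: total non-zero values emitted in all previous iterations
    """
    before_cluster = second_result[cluster - 1] if cluster > 0 else 0
    before_core = first_result[core - 1] if core > 0 else 0
    return written_so_far + before_cluster + before_core

# IPU1 of Cluster1 in the 3rd iteration of the example (all earlier iterations were dense).
first_c1 = [7040, 14079, 21119, 28159]
second = [28160, 56319, 84479, 112639, 140799, 168959, 197119, 225279]
offset = output_start_offset(1, 1, first_c1, second, written_so_far=7040 * 32 * 2)
print(offset)   # 485760

# For the next iteration, written_so_far is advanced by the last second-reduction entry:
next_written_so_far = 7040 * 32 * 2 + second[-1]   # 450560 + 225279
```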
With the scheme provided by the present disclosure, the inter-core communication resources on the artificial intelligence processor chip can be fully exploited so that hardware synchronizes the pipelines of different cores, achieving parallelism between inter-core reduction and intra-core computation.
As shown in fig. 6, when the storage position is being determined, the memory control circuit of the computing cluster performs a zero-setting operation on the position in the global memory where the second reduction result is stored.
The present disclosure also provides a computing device that may be used to perform the data processing method described previously.
Fig. 10 illustrates a block diagram of a hardware configuration of a computing device 1200 in which various aspects of embodiments of the disclosure may be implemented. As shown, the computing device 1200 may include a processor 1210 and a memory 1220. In the computing device 1200 of fig. 10, only the constituent elements related to the present embodiment are shown. It will therefore be apparent to those of ordinary skill in the art that the computing device 1200 may also include common constituent elements other than those shown in fig. 10, such as a display.
The computing apparatus 1200 may correspond to a computing device having various processing functions, e.g., functions for programming, compiling source code. For example, the computing apparatus 1200 may be implemented as various types of devices, such as a Personal Computer (PC), a server device, a mobile device, and so forth.
A processor 1210 configured to execute program instructions to control all functions of the computing device 1200. For example, the processor 1210 controls all functions of the computing device 1200 by executing programs stored in the memory 1220 on the computing device 1200. Processor 1210 may be implemented by a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Application Processor (AP), an artificial intelligence processor chip (IPU), etc. provided in computing device 1200. However, the present disclosure is not limited thereto.
Memory 1220 is hardware for storing various data processed in the computing device 1200. For example, the memory 1220 may store processed data and data to be processed in the computing device 1200, such as data that has been processed or is to be processed by the processor 1210, e.g., source code before compilation and assembly instructions after compilation. Further, the memory 1220 may store program instructions of applications, drivers, etc. to be run by the computing device 1200; for example, the memory 1220 may store various programs related to the data processing method to be executed by the processor 1210. The memory 1220 may be a DRAM, but the present disclosure is not limited thereto. The memory 1220 may include at least one of volatile memory or nonvolatile memory. The nonvolatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, phase-change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), ferroelectric RAM (FRAM), and the like. The volatile memory may include dynamic RAM (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), PRAM, MRAM, RRAM, ferroelectric RAM (FeRAM), and the like. In an embodiment, the memory 1220 may include at least one of a hard disk drive (HDD), a solid state drive (SSD), a compact flash (CF) card, a secure digital (SD) card, a micro secure digital (Micro-SD) card, a mini secure digital (Mini-SD) card, an extreme digital (xD) card, caches, or a memory stick.
In summary, the specific functions implemented by the memory 1220 and the processor 1210 of the computing device 1200 provided in the embodiments of the present disclosure may be explained in comparison with the foregoing embodiments in the present disclosure, and the technical effects of the foregoing embodiments may be achieved, which will not be repeated herein.
In an embodiment of the present disclosure, there is also provided a computer-readable storage medium in which program instructions are stored, which when loaded and executed by a processor, cause the processor to perform the method of processing data in a computational graph described in the embodiments of the present disclosure. In an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program or instructions which, when executed by a processor, implement a method for processing data in a computational graph according to the embodiments described in the present disclosure.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of actions described. Thus, one of ordinary skill in the art will appreciate in light of the present disclosure or teachings that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or some aspects of this disclosure. In addition, the description of some embodiments of the present disclosure is also focused on, depending on the scenario. In view of this, those skilled in the art will appreciate that portions of one embodiment of the disclosure that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are divided herein by taking into account the logic function, and there may be other manners of dividing the units when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the disclosure. Furthermore, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e. as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPU, GPU, FPGA, DSP and ASICs, etc. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, variable resistance memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and are therefore to cover all equivalents or alternatives falling within the scope of these claims.

Claims (10)

1. A data processing method, comprising:
a processor core of a computing device acquires input data corresponding to a current task and determines, using a compute instruction, the non-zero values in the input data corresponding to the processor core; meanwhile, using an IO instruction, the processor core stores the dimensions of the non-zero values of the processor core in the previous task to the corresponding positions in a global memory according to the offset of the non-zero-value storage position of the processor core in the previous task;
the processor core determines, using a compute instruction, the dimensions of the non-zero values in the input data corresponding to the processor core in the current task; simultaneously, using a move instruction, the processor core stores the non-zero values corresponding to the processor core in the current task to the corresponding positions in a shared memory, and performs a first reduction on the non-zero values corresponding to the processor cores in the current computing cluster to obtain a first reduction result, the first reduction result comprising the non-zero values corresponding to the current computing cluster;
the processor core performs, using an IO instruction, a second reduction on the non-zero values corresponding to the current computing cluster in the first reduction result to obtain a second reduction result, the second reduction result being stored in a global memory;
a memory control circuit of the computing cluster uses an IO instruction to move the second reduction result from the global memory to a corresponding position in the shared memory of the computing cluster for storage;
the processor core obtains the first reduction result and the second reduction result from the shared memory using a move instruction, and determines the non-zero-value storage position of the processor core in the current task using the first reduction result and the second reduction result.
2. The method of claim 1, wherein the first reduction result is stored in a corresponding location in the shared memory.
3. The method of claim 1, wherein, within a computing cluster, the result of adding the non-zero values corresponding to the first i processor cores is taken as the reduction result of the (i+1)-th processor core; the reduction results of all processor cores in the computing cluster constitute the first reduction result.
4. The method of claim 1, wherein the result of adding the non-zero values corresponding to the first i computing clusters of the computing device is taken as the reduction result of the (i+1)-th computing cluster; the reduction results of all computing clusters of the computing device constitute the second reduction result.
5. The method of claim 1, wherein determining the non-zero-value storage position of the processor core in the current task using the first reduction result and the second reduction result comprises:
acquiring the reduction result of the target processor core from the first reduction result;
acquiring the reduction result of the target computing cluster from the second reduction result;
and adding the reduction result of the target processor core to the reduction result of the target computing cluster, the obtained sum being the starting storage position of the non-zero values in the input data of the current processor core.
6. The method of claim 1, wherein the step of the processor core determining, using a compute instruction, the dimensions of the non-zero values in the input data corresponding to the processor core in the current task comprises:
determining the dimensions of the input data corresponding to the processor core according to the type of the task, the number of computing clusters in the computing device, the number of processor cores in each computing cluster, the input data scale of the task, and the maximum data scale processed by the computing device each time.
7. The method of claim 1, wherein the data processing method further comprises:
the memory control circuit of the computing cluster performs a zeroing operation on the location in the global memory where the second reduction result is stored.
8. A computing device for performing data processing, comprising:
a processor configured to execute program instructions; and
a memory configured to store the program instructions which, when loaded and executed by the processor, cause the processor to perform the method according to any one of claims 1-7.
9. A computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform the method according to any of claims 1-7.
10. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the method of any of claims 1-7.
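For illustration only (this sketch is not part of the claims and is not asserted to be the patented implementation), the following Python/NumPy code simulates the two-level reduction of claims 1-5, interpreting "adding the non-zero values" as summing the per-core counts of non-zero values, which is the natural reading when the goal is to compute storage offsets. The first reduction is an exclusive prefix sum over the processor cores of a computing cluster, the second reduction is an exclusive prefix sum over the computing clusters of the device, and their sum gives each core's starting non-zero-value storage position. All names (NUM_CLUSTERS, CORES_PER_CLUSTER, nonzero_offsets) and the array-based stand-ins for shared and global memory are assumptions made for exposition.

```python
import numpy as np

NUM_CLUSTERS = 4        # computing clusters in the device (assumed value)
CORES_PER_CLUSTER = 4   # processor cores per computing cluster (assumed value)

def nonzero_offsets(per_core_data):
    """per_core_data[c][k] is the slice of input data handled by core k of cluster c."""
    # Each core counts the non-zero values in its slice (compute instruction).
    counts = np.array([[np.count_nonzero(per_core_data[c][k])
                        for k in range(CORES_PER_CLUSTER)]
                       for c in range(NUM_CLUSTERS)])

    # First reduction (within each cluster, via shared memory): the reduction
    # result of core k is the sum of the counts of the first k cores,
    # i.e. an exclusive prefix sum over cores.
    first_reduction = np.zeros_like(counts)
    first_reduction[:, 1:] = np.cumsum(counts[:, :-1], axis=1)

    # Second reduction (across clusters, via global memory): the reduction
    # result of cluster c is the sum of the total counts of the first c
    # clusters, i.e. an exclusive prefix sum over clusters.
    cluster_totals = counts.sum(axis=1)
    second_reduction = np.zeros(NUM_CLUSTERS, dtype=counts.dtype)
    second_reduction[1:] = np.cumsum(cluster_totals[:-1])

    # Each core adds its own in-cluster reduction result to its cluster's
    # reduction result to obtain its starting storage position (cf. claim 5).
    return first_reduction + second_reduction[:, None]

# Usage example: two values per core, some of them zero.
data = [[np.array([0, 3]), np.array([5, 0]), np.array([0, 0]), np.array([7, 8])]
        for _ in range(NUM_CLUSTERS)]
print(nonzero_offsets(data))
# Expected rows: [0 1 2 2], [4 5 6 6], [8 9 10 10], [12 13 14 14]
```

In this hypothetical configuration, a core's starting position equals the count of non-zero values in all earlier clusters plus the count in the earlier cores of its own cluster, which matches the addition of the two reduction results described in claim 5.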
CN202311482407.8A 2023-11-08 2023-11-08 Data processing method and related product Pending CN117667209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311482407.8A CN117667209A (en) 2023-11-08 2023-11-08 Data processing method and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311482407.8A CN117667209A (en) 2023-11-08 2023-11-08 Data processing method and related product

Publications (1)

Publication Number Publication Date
CN117667209A true CN117667209A (en) 2024-03-08

Family

ID=90069070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311482407.8A Pending CN117667209A (en) 2023-11-08 2023-11-08 Data processing method and related product

Country Status (1)

Country Link
CN (1) CN117667209A (en)

Similar Documents

Publication Publication Date Title
WO2022161318A1 (en) Data processing device and method, and related products
WO2024093292A1 (en) Automatic operator fusion method for computational graph and related product
WO2023071238A1 (en) Computational graph compiling and scheduling methods and related products
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN117667209A (en) Data processing method and related product
CN114281561A (en) Processing unit, synchronization method for a processing unit and corresponding product
CN112948001A (en) Method for setting tensor hardware configuration, readable storage medium and device
CN116185378A (en) Optimization method of calculation graph, data processing method and related products
CN117931430A (en) Method for realizing DFT performance optimization by processing device and data processing system
CN115373646A (en) Information expansion method, device and related product
CN113792867B (en) Arithmetic circuit, chip and board card
CN113742266B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113791996B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
WO2023016382A1 (en) Method for system on chip, and related product thereof
WO2022111013A1 (en) Device supporting multiple access modes, method and readable storage medium
CN117648091A (en) Compiling method of calculation graph and related product
CN117093263A (en) Processor, chip, board card and method
CN117667212A (en) Instruction control device, method, processor, chip and board card
CN117742566A (en) Memory access processing device, processor, chip, board card and instruction execution method
CN117520254A (en) Processor, chip, board card and method
CN114625370A (en) Method, device and heterogeneous system for data layout between host and device
CN115437693A (en) Computing device operating according to multi-operation instruction and single-operation instruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination