CN117389731A - Data processing method and device, chip, device and storage medium - Google Patents

Info

Publication number
CN117389731A
Authority
CN
China
Prior art keywords
data
thread
computing
module
data processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311368621.0A
Other languages
Chinese (zh)
Other versions
CN117389731B (en)
Inventor
周华民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinfeng Microelectronics Co ltd
Original Assignee
Shanghai Xinfeng Microelectronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xinfeng Microelectronics Co ltd filed Critical Shanghai Xinfeng Microelectronics Co ltd
Priority to CN202311368621.0A priority Critical patent/CN117389731B/en
Publication of CN117389731A publication Critical patent/CN117389731A/en
Application granted granted Critical
Publication of CN117389731B publication Critical patent/CN117389731B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multi Processors (AREA)

Abstract

The application discloses a data processing method and apparatus, a chip, a device, and a storage medium, wherein the data processing apparatus comprises an NPU (neural processing unit), the NPU comprises a computing module and a data operation module, and the data processing method comprises the following steps: when executing a data processing task, synchronizing resource matching information corresponding to the data processing task to the computing module through the data operation module; the data processing task comprises at least a first thread and a second thread; the resource matching information characterizes whether the data carrying operations corresponding to the first thread and the second thread are completed; and executing, through the computing module, the computing operation corresponding to the first thread or the second thread based on the resource matching information. In this way, both generality and hardware utilization can be achieved, and the efficiency and performance of data processing are effectively improved.

Description

Data processing method and device, chip, device and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, a chip, a device, and a storage medium.
Background
The neural network processor (Neural Processing Unit, NPU) is a chip dedicated to performing deep learning computation. It is one of the prominent technologies in the field of artificial intelligence in recent years and is widely applied in artificial intelligence applications such as autonomous driving, face recognition, and intelligent speech.
To achieve high hardware utilization, two approaches are common: compiler optimization combined with hardware loop-level synchronization, and multithreading. The former lacks generality, while the latter cannot solve the problem of mismatch between computing resources and data resources.
Therefore, common data processing methods cannot achieve both generality and high hardware utilization, so the efficiency and performance of data processing are limited.
Disclosure of Invention
The embodiment of the application provides a data processing method and device, a chip, equipment and a storage medium, which can consider the universality and the hardware utilization rate and effectively improve the efficiency and the performance of data processing.
The technical scheme of the embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides a data processing method applied to a data processing apparatus, where the data processing apparatus includes an NPU, the NPU includes a computing module and a data operation module, and the method includes:
when executing the data processing task, synchronizing the resource matching information corresponding to the data processing task to the computing module through the data operation module; the data processing task at least comprises a first thread and a second thread; the resource matching information characterizes whether the data carrying operation corresponding to the first thread and the second thread is completed or not;
and executing, through the computing module, the computing operation corresponding to the first thread or the second thread based on the resource matching information.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including: a synchronization unit, an execution unit,
the synchronization unit is used for synchronizing the resource matching information corresponding to the data processing task to the calculation module through the data operation module when the data processing task is executed; the data processing task at least comprises a first thread and a second thread; the resource matching information characterizes whether the data carrying operation corresponding to the first thread and the second thread is completed or not;
the execution unit is used for executing the computing operation corresponding to the first thread or the second thread based on the resource matching information through the computing module.
In a third aspect, an embodiment of the present application provides a data processing chip, where the data processing chip includes an NPU, and the NPU includes a computing module and a data operation module; the data processing chip is configured to implement the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a data processing apparatus, where the data processing apparatus includes an NPU, where the NPU includes a computing module and a data operation module; the data processing device is arranged to implement the method as described in the first aspect.
In a fifth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a program which, when executed by a processor, implements a method as described in the first aspect.
The embodiments of the application provide a data processing method and apparatus, a chip, a device, and a storage medium, wherein the data processing apparatus comprises an NPU (neural processing unit), the NPU comprises a computing module and a data operation module, and when a data processing task is executed, resource matching information corresponding to the data processing task is synchronized to the computing module through the data operation module; the data processing task comprises at least a first thread and a second thread; the resource matching information characterizes whether the data carrying operations corresponding to the first thread and the second thread are completed; and the computing operation corresponding to the first thread or the second thread is executed by the computing module based on the resource matching information. It can be seen that, in the embodiments of the present application, by synchronizing the resource matching information, the computing module can determine the completion status of data handling operations, and can then perform computing operations according to that status. That is, for a data processing task, on the basis of improving generality through multithreading, an appropriate computing operation can be flexibly and dynamically executed according to the completion status of the synchronized data handling operations, so that computing resources and data resources are decoupled at the hardware level. This solves the problem of mismatch between computing resources and data resources, takes both generality and hardware utilization into account, and effectively improves the efficiency and performance of data processing.
Drawings
FIG. 1 is a first schematic diagram of computing and data handling;
FIG. 2 is a second schematic diagram of computing and data handling;
FIG. 3 is a third schematic diagram of computing and data handling;
FIG. 4 is a fourth schematic diagram of computing and data handling;
FIG. 5 is a fifth schematic diagram of computing and data handling;
FIG. 6 is a schematic diagram of an implementation framework of a data processing method according to an embodiment of the present application;
fig. 7 is a schematic implementation flow chart of a data processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an instruction sequence according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating an execution sequence of a data handling operation according to an embodiment of the present application;
fig. 10 is a schematic diagram of synchronization of resource matching information according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating an execution sequence of a computing operation according to an embodiment of the present application;
FIG. 12 is a diagram of a multi-threaded implementation framework according to an embodiment of the present application;
FIG. 13 is a schematic diagram of an execution sequence of multiple threads according to an embodiment of the present application;
FIG. 14 is a second frame diagram of a multithreading implementation according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 16 is a schematic diagram illustrating a structure of a data processing chip according to an embodiment of the present disclosure;
Fig. 17 is a schematic diagram of the composition structure of a data processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to be limiting. It should be noted that, for convenience of description, only a portion related to the related application is shown in the drawings.
The neural network processor (Neural Processing Unit, NPU) can accelerate the operation of neural networks, solving the problem that traditional chips are inefficient at neural network operations. In general, most NPUs need to be used together with other chips, such as a central processing unit (Central Processing Unit, CPU) or a graphics processor (Graphics Processing Unit, GPU), to complete an entire computing task.
The working principle of the NPU mainly involves two aspects: the computing unit and data storage.
The computing unit is the core component of the NPU and is specifically designed for neural network computation. The computing unit of the NPU generally adopts matrix computation, vector computation, and similar modes, and can rapidly perform matrix multiplication, convolution, and the like. Compared with traditional CPUs and GPUs, the computing unit of the NPU has higher computing efficiency and lower energy consumption, and can complete neural network computing tasks more efficiently.
Data storage is another key component of the NPU. Because neural network models are typically very large, the NPU needs sufficient memory capacity to store model parameters and intermediate results. The data storage of the NPU typically employs a combination of cache and memory so that data can be accessed and read more quickly.
The NPU is characterized by high efficiency, low latency, stability, and programmability. Because the NPU is designed for deep learning computation, it has very high computing efficiency and energy efficiency and can complete large-scale neural network computing tasks in a short time. The low latency of the NPU means that it can respond to a computation request and perform the computing task within a short time. This is because the NPU design separates compute-intensive and memory-intensive tasks, avoiding the contention between computation and memory access that occurs in CPUs and GPUs. In addition, the computing unit of the NPU is optimized, adopting more efficient matrix and vector computation modes, and can rapidly execute operations such as large-scale matrix multiplication and convolution. Stability is another major feature of the NPU. NPUs typically have good fault tolerance and reliability, and maintain stable computing performance even under high-load, complex computing tasks. This is because the computing unit and data storage of the NPU are carefully designed and tested, so that efficient and stable performance is maintained over long periods of operation. Finally, NPUs typically have some programmability, and their parameters and configurations can be adjusted by software to accommodate different computing tasks. This means that the NPU is applicable not only to fixed deep learning models, but also to different algorithms, frameworks, and datasets.
FIG. 1 is a first schematic diagram of computing and data handling. As shown in FIG. 1, after the data handling 11 process is completed, a computing operation 21, such as a multiply-accumulate (Multiply and Accumulate, MAC) operation, is performed; then the data handling 12 process starts, and finally a computing operation 22, such as another MAC operation, is performed. It can be seen that, because there is a dependency between computation and data handling, the two cannot be performed in parallel, and an execution gap arises.
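The serialized schedule of FIG. 1 can be sketched with a toy timing model (Python is used purely for illustration; the durations and function name are hypothetical, not from the patent):

```python
# Minimal sketch of the serialized schedule in FIG. 1: each compute
# operation must wait for its own data-handling (DMA) step, so the
# compute unit idles (the "gap") while data is being moved.
# Durations are hypothetical illustration values.

def serialized_schedule(ops):
    """ops: list of (dma_time, compute_time) pairs for one thread."""
    t = 0
    compute_busy = 0
    for dma, mac in ops:
        t += dma          # data handling 1x: compute unit idle (the gap)
        t += mac          # computing operation 2x (e.g. a MAC pass)
        compute_busy += mac
    utilization = compute_busy / t
    return t, utilization

total, util = serialized_schedule([(4, 6), (4, 6)])
print(total, util)  # 20 cycles total; the compute unit is busy for only 60% of them
```

Under this toy model the gap shows up directly as the lost 40% of compute-unit time, which is the inefficiency the two approaches discussed below try to eliminate.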
The NPU is a domain-specific architecture (Domain-Specific Architecture, DSA) processor, and high hardware utilization is one of its most important advantages. To achieve this goal, two approaches are common: one is compiler optimization combined with hardware loop (cycle) level synchronization, and the other is multithreading.
When combining compiler optimization with hardware loop-level synchronization, the compiler arranges appropriate instruction-stream and data-stream sequences, and the hardware guarantees the exact number of loops needed for each instruction execution and data handling step to complete, so that the computation execution hardware units and the data handling hardware units stay synchronized and a higher hardware utilization is achieved.
Fig. 2 is a second schematic diagram of computing and data handling. As shown in fig. 2, after its computation is completed, the upper-stage hardware unit outputs data D1 to the lower-stage hardware unit A; after hardware unit A completes its computation, it outputs data D2 and data D3 to the lower-stage hardware units B1 and B2 respectively; data D4 output by hardware unit B1 and data D5 output by hardware unit B2 are then output to the lower-stage hardware unit C; and after hardware unit C completes its computation, it outputs data D6, data D7, and data D8 to the lower-stage hardware units D1, D2, and D3 respectively. That is, in each cycle every hardware unit completes a certain computation and outputs data to the next-stage hardware unit for the next cycle, so that computation and data are well matched and high hardware utilization is achieved. To achieve such synchronization, software must split the computing task well, and the compiler must order the loop-level computing operations and data operations.
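The loop-level dataflow of FIG. 2 can be sketched as a breadth-first traversal of the hardware-unit graph, one stage per cycle (a simplification for illustration; the unit names follow the figure, while the function name and the one-unit-one-cycle assumption are hypothetical):

```python
# Sketch of FIG. 2: each hardware unit finishes a computation in one
# cycle and hands its outputs to the next-stage units in the following
# cycle. Breadth-first levels of the unit graph therefore correspond to
# cycles. This is an idealized model, not the patent's hardware.

def pipeline_cycles(graph, start):
    """Return the list of hardware units active in each cycle."""
    active, cycles = [start], []
    seen = set(active)
    while active:
        cycles.append(active)
        nxt = []
        for unit in active:
            for succ in graph.get(unit, []):
                if succ not in seen:
                    seen.add(succ)
                    nxt.append(succ)
        active = nxt
    return cycles

# Fan-out of FIG. 2: A -> B1, B2 -> C -> D1, D2, D3
graph = {"A": ["B1", "B2"], "B1": ["C"], "B2": ["C"],
         "C": ["D1", "D2", "D3"]}
print(pipeline_cycles(graph, "A"))
```

The point of the compiler/hardware co-design is that every level of this graph is kept busy on every cycle, which is exactly what breaks down when the instruction stream is not fixed at compile time.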
However, the manner of compiler optimization combined with hardware loop-level synchronization requires that a fixed instruction stream and data stream be generated at compile time, and that the execution time of each step be tightly controlled by the hardware to achieve the desired result. As a result, the optimization must be adapted to each model individually, so the approach lacks generality; and because the compiler produces a fixed data stream and instruction stream, software programmability is very low and the approach is unfriendly to software applications.
In the process of realizing high hardware utilization rate through multithreading, the same hardware resource can be shared by a plurality of threads, so that the gap generated by the execution of the single-thread hardware resource is filled by different threads, and the hardware utilization rate is improved.
FIG. 3 is a third schematic diagram of computing and data handling. As shown in FIG. 3, two threads, thread t1 (thread 1) and thread t2 (thread 2), may issue in order through an issue queue (Issue Queue) and store the final data in an internal store (CUBE). Thread t1 and thread t2 each correspond to a plurality of computing operations and a plurality of data handling operations. For example, thread t1 may include 2 computing operations, namely computing operation 21 and computing operation 22, and correspondingly 2 data handling operations, namely data handling 11 and data handling 12; thread t2 may likewise include computing operation 21 and computing operation 22, and correspondingly data handling 11 and data handling 12. Thread t1 and thread t2 may use the same hardware resource to perform computing operations and data handling operations in sequence, so that the gap left by single-threaded execution of the hardware resource is filled by the other thread.
Fig. 4 is a fourth schematic diagram of computing and data handling. As shown in fig. 4, thread t1 and thread t2 again each correspond to 2 computing operations (computing operation 21 and computing operation 22) and 2 data handling operations (data handling 11 and data handling 12), and may use the same hardware resource to perform computing operations and data handling operations in sequence, so that the gap left by single-threaded execution of the hardware resource is filled by the other thread.
Therefore, achieving high hardware utilization through multithreading no longer relies on the compiler, offers strong software programmability, and has better generality and usability.
Multithreading provides more software programmability to improve hardware utilization and no longer depends on a compiler, so it has better generality; and because multiple threads share resources and different threads have different compute-data dependencies, the gaps caused by compute-data dependencies are filled to some extent.
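The gap-filling effect of multithreading can be sketched with a toy model in which one DMA engine and one compute unit advance independently and each computing operation waits only for its own data (hypothetical durations; a simplification of the issue-queue behavior in FIG. 3 and FIG. 4, not the patent's scheduler):

```python
# Sketch of FIG. 3 / FIG. 4: two threads share one compute unit and one
# DMA engine. While one thread computes, the other thread's data
# handling can run in parallel, shrinking the single-thread gap.
# Durations and the greedy round-robin model are illustration-only.

def interleaved_schedule(threads):
    """threads: dict name -> list of (dma_time, compute_time) pairs.
    The DMA engine and compute unit each advance independently; a
    compute op may start only after its own DMA has finished."""
    dma_free = 0      # time when the DMA engine is next free
    compute_free = 0  # time when the compute unit is next free
    # Issue ops round-robin across threads, as an issue queue would.
    queue = [op for per_round in zip(*threads.values()) for op in per_round]
    for dma, mac in queue:
        dma_free += dma                        # DMA runs back-to-back
        start = max(compute_free, dma_free)    # wait for own data only
        compute_free = start + mac
    return compute_free

makespan = interleaved_schedule({"t1": [(4, 6), (4, 6)],
                                 "t2": [(4, 6), (4, 6)]})
print(makespan)  # 28 cycles for 4 ops, vs 40 if each op serialized DMA+compute
```

In this toy model the four operations would take 40 cycles fully serialized, but only 28 when DMA and compute overlap across threads, which is the utilization gain the multithreading approach targets.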
However, since the computing operations of different threads are performed sequentially within the computing resource, a mismatch between computing resources and data resources still arises.
Fig. 5 is a fifth schematic diagram of computing and data handling. As shown in fig. 5, assuming that the data corresponding to data handling 11 of thread t2 is carried from an external Double Data Rate (DDR) memory, when the external access bus is unavailable, the data handling 11 operation of thread t2 must be delayed, which in turn delays computing operation 21 of thread t2 and thus produces an execution gap.
It can be seen that the multithreaded data processing method only improves single-thread execution efficiency; the problem of mismatch between computing resources and data resources encountered under multithreading remains unsolved.
Thus, compiler optimization combined with hardware loop-level synchronization lacks generality, while the multithreading implementation cannot solve the problem of mismatch between computing resources and data resources. That is, common data processing methods cannot achieve both generality and high hardware utilization, so the efficiency and performance of data processing are limited.
To solve the above problems, in the embodiments of the present application, a data processing apparatus includes an NPU, the NPU includes a computing module and a data operation module, and when executing a data processing task, the data operation module synchronizes resource matching information corresponding to the data processing task to the computing module; the data processing task comprises at least a first thread and a second thread; the resource matching information characterizes whether the data carrying operations corresponding to the first thread and the second thread are completed; and the computing module executes the computing operation corresponding to the first thread or the second thread based on the resource matching information. In this way, by synchronizing the resource matching information, the computing module can determine the completion status of data handling operations and perform computing operations accordingly. That is, for a data processing task, on the basis of improving generality through multithreading, an appropriate computing operation can be flexibly and dynamically executed according to the completion status of the synchronized data handling operations, so that computing resources and data resources are decoupled at the hardware level, the mismatch between computing resources and data resources is resolved, and, with both generality and hardware utilization taken into account, the efficiency and performance of data processing are effectively improved.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
An embodiment of the present application provides a data processing method, which may be applied to a data processing apparatus, where the data processing apparatus may be configured with an NPU, and the NPU includes a computing module and a data operating module.
It will be appreciated that in the embodiments of the present application, the data processing method may also be applied to a data processing chip configured with an NPU, where the NPU includes a calculation module and a data operation module.
It is understood that in the embodiments of the present application, the data processing method may also be applied to a data processing apparatus integrated with a data processing device or a data processing chip.
In the embodiments of the present application, the data processing apparatus may be any form of electronic device, chip, integrated circuit (integrated circuit, IC), application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or the like. For example, the data processing method provided in the embodiments of the present application may be implemented by a chip, and the chip may integrate an MCU, an NPU, DDR memory, and the like.
Illustratively, in an embodiment of the present application, fig. 6 is a schematic diagram of an implementation framework of the data processing method set forth in the embodiment of the present application. As shown in fig. 6, the data processing apparatus 60 may be configured with a neural processing unit (NPU) 61 including a computing module 611 and a data operation module 612. The computing module 611 is configured to perform computing operations, and the data operation module 612 is configured to perform data handling operations.
Further, in an embodiment of the present application, fig. 7 is a schematic flow chart of an implementation of a data processing method according to an embodiment of the present application, as shown in fig. 7, where in an embodiment of the present application, a method for performing data processing by a data processing device may include the following steps:
step 101, synchronizing resource matching information corresponding to a data processing task to a computing module through a data operation module when the data processing task is executed; the data processing task at least comprises a first thread and a second thread; the resource matching information characterizes whether the data handling operation corresponding to the first thread and the second thread is completed.
In the embodiment of the application, when the data processing device executes the data processing task, the data operation module can synchronize the resource matching information corresponding to the data processing task to the calculation module.
It should be noted that, in the embodiments of the present application, the data processing apparatus may perform the data processing task in response to the received data processing instruction. The data processing task may be any type of task, for example, the data processing task may be image processing, music playing, or face recognition, which is not specifically limited in this application.
Further, in the embodiment of the present application, during the process of executing the data processing task, the data processing device may first determine N threads corresponding to the data processing task, where N is an integer greater than or equal to 2.
That is, in the embodiment of the present application, when executing a data processing task, the same hardware resource may be selectively shared by multiple threads, so that the hardware utilization may be further improved.
Illustratively, in some embodiments, the data processing tasks include at least a first thread and a second thread. For example, for data processing task 1, there may be 2 threads, thread t1 and thread t2; for data processing task 2, there may be 3 threads, thread t3, thread t4, and thread t5.
It should be noted that, in an embodiment of the present application, the first thread may include at least one first computing operation, and at least one first data handling operation corresponding to the at least one first computing operation; the second thread may also include at least one second computing operation, and at least one second data handling operation corresponding to the at least one second computing operation.
That is, in embodiments of the present application, for any one thread corresponding to a data processing task, the thread may include at least one computing operation, and at least one data handling operation corresponding to the at least one computing operation. Wherein the computing operations may be performed by computing modules in the data processing apparatus and the data handling operations may be performed by data manipulation modules in the data processing apparatus.
Illustratively, in some embodiments, thread t1 may include 2 computing operations, computing operation 21 and computing operation 22, respectively, and correspondingly, thread t1 may also include 2 data handling operations, data handling 11 and data handling 12, respectively. Wherein the computing operation 21 corresponds to the data handling 11 and the computing operation 22 corresponds to the data handling 12.
Illustratively, in some embodiments, thread t2 may include 3 computing operations, respectively computing operation 21, computing operation 22, and computing operation 23, and correspondingly, thread t2 may also include 3 data handling operations, respectively data handling 11, data handling 12, and data handling 13. Wherein computing operation 21 corresponds to data handling 11, computing operation 22 corresponds to data handling 12, and computing operation 23 corresponds to data handling 13.
It should be noted that, in the embodiment of the present application, the resource matching information may indicate whether the data handling operations corresponding to the first thread and the second thread are completed. The data operation module can synchronize the completion condition of any one data carrying operation included in any one thread to the calculation module through the resource matching information.
That is, in the embodiment of the present application, by synchronizing the resource matching information, the computing module may learn whether the data handling operation corresponding to a computing operation has been completed, that is, whether the data resource required by that computing operation has finished being carried.
Further, in the embodiment of the present application, after determining the plurality of threads included in the data processing task, the data processing apparatus may further determine an instruction sequence corresponding to the data processing task. The instruction sequence may be used to determine an execution sequence of a plurality of threads, and may also be used to determine an execution sequence of at least one computing operation corresponding to any one thread.
In some embodiments, fig. 8 is a schematic diagram of an instruction sequence set forth in an embodiment of the present application. As shown in fig. 8, it is assumed that a data processing task includes 2 threads, namely a first thread and a second thread, where the first thread includes 2 computing operations, namely computing operation 21 of the first thread and computing operation 22 of the first thread, and the second thread also includes 2 computing operations, namely computing operation 21 of the second thread and computing operation 22 of the second thread. The instruction sequence determined for the data processing task may then be, in time order: computing operation 21 of the first thread, computing operation 22 of the first thread, computing operation 21 of the second thread, and computing operation 22 of the second thread.
It is to be appreciated that, in the embodiments of the present application, since the at least one computing operation included in any one thread corresponds one-to-one to the at least one data handling operation, the instruction sequence can also be considered to determine the execution sequence of the at least one data handling operation corresponding to any one thread.
Illustratively, in some embodiments, when the determined instruction sequence corresponding to the data processing task is, in time order, computing operation 21 of the first thread, computing operation 22 of the first thread, computing operation 21 of the second thread, and computing operation 22 of the second thread, the corresponding data handling operations may be determined to be, in time order, data handling 11 of the first thread, data handling 12 of the first thread, data handling 11 of the second thread, and data handling 12 of the second thread.
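As a concrete illustration (the labels below are hypothetical, not taken from the embodiment), the instruction sequence can be modeled as an ordered list of (thread, computing operation) pairs; since each computing operation corresponds one-to-one to a data handling operation, the same list also determines the data handling order:

```python
# Hypothetical encoding of the fig. 8 instruction sequence: each entry is
# (thread, computing operation). All names are illustrative only.
instruction_sequence = [
    ("t1", "compute_21"), ("t1", "compute_22"),
    ("t2", "compute_21"), ("t2", "compute_22"),
]

# Each computing operation is paired one-to-one with a data handling
# operation, so the handling order follows directly from the same list.
handling_for = {"compute_21": "handling_11", "compute_22": "handling_12"}
handling_sequence = [(t, handling_for[op]) for t, op in instruction_sequence]
```

The derived `handling_sequence` is exactly the time-ordered list of data handling operations described above.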
Further, in the embodiment of the present application, after the k-th data handling operation indicated by the instruction sequence corresponding to the data processing task is completed by the data operation module, in the case that the data resource corresponding to the (k+1)-th data handling operation indicated by the instruction sequence is not executable, the (k+2)-th data handling operation indicated by the instruction sequence may be selected to be executed by the data operation module.

Further, in the embodiment of the present application, after the k-th data handling operation indicated by the instruction sequence is completed by the data operation module, in the case that the data resource corresponding to the (k+1)-th data handling operation indicated by the instruction sequence is executable, the (k+1)-th data handling operation may be executed by the data operation module.

In the embodiment of the present application, k is an integer greater than 0.
That is, in the embodiment of the present application, when the data operation module in the data processing apparatus performs data processing based on the order of the data handling operations of each thread indicated by the instruction sequence corresponding to the data processing task, after completing one of the data handling operations indicated by the instruction sequence, the data operation module may determine whether the data resource corresponding to the next data handling operation indicated by the instruction sequence is executable, that is, whether the data resource can support that data handling operation. If not, the data operation module may first skip that data handling operation and move on to the next data handling operation indicated by the instruction sequence, that is, continue to determine whether the data resource corresponding to that next data handling operation is executable.
Accordingly, in the embodiment of the present application, when the data operation module in the data processing apparatus performs data processing based on the order of the data handling operations of each thread indicated by the instruction sequence corresponding to the data processing task, after completing one of the data handling operations indicated by the instruction sequence, the data operation module may determine whether the data resource corresponding to the next data handling operation indicated by the instruction sequence is executable, that is, whether the data resource can support that data handling operation. If so, the data operation module may acquire the data resource corresponding to that data handling operation, thereby completing the data handling operation.
Illustratively, in some embodiments, it is assumed that the data processing task includes 2 threads, such as thread t1 and thread t2, where thread t1 corresponds to 2 data handling operations, such as data handling 11 of thread t1 and data handling 12 of thread t1, and thread t2 corresponds to 3 data handling operations, such as data handling 11 of thread t2, data handling 12 of thread t2, and data handling 13 of thread t2. After completing data handling 11 of thread t2, the data operation module may determine, based on the instruction sequence, that the next data handling operation is data handling 12 of thread t1. If it determines that the data resource cannot support data handling 12 of thread t1, the data operation module may not execute it, but instead determine whether the corresponding data resource is executable for the data handling operation that follows data handling 12 of thread t1 in the instruction sequence.
Illustratively, in some embodiments, it is assumed that the data processing task includes 2 threads, such as thread t1 and thread t2, where thread t1 corresponds to 2 data handling operations, such as data handling 11 of thread t1 and data handling 12 of thread t1, and thread t2 corresponds to 3 data handling operations, such as data handling 11 of thread t2, data handling 12 of thread t2, and data handling 13 of thread t2. After completing data handling 11 of thread t2, the data operation module may determine, based on the instruction sequence, that the next data handling operation is data handling 12 of thread t1. If it determines that the data resource can support data handling 12 of thread t1, the data operation module may acquire the data resource corresponding to data handling 12 of thread t1, thereby completing data handling 12 of thread t1.
Therefore, in the embodiment of the application, before executing the data handling operation, the data operation module needs to determine whether the data resource is executable, so that the corresponding data handling operation can be completed under the condition that the data resource is executable; accordingly, in the event that the data resource is not executable, execution of the corresponding data handling operation is skipped.
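The skip-and-revisit behavior described above can be sketched in software as follows; this is a minimal model under stated assumptions, where operations are labeled (thread, op) pairs and the `resource_ready` predicate stands in for the hardware's resource check (all names are illustrative, not from the embodiment):

```python
def run_handling(sequence, resource_ready):
    """Execute data handling operations in instruction-sequence order,
    skipping (rather than waiting on) any operation whose data resource is
    not executable, and revisiting skipped operations on a later pass.
    Operations of the same thread are never reordered."""
    pending = list(sequence)
    executed = []
    while pending:
        progressed = False
        blocked = set()  # threads whose next operation had to be skipped
        for item in list(pending):
            thread, _ = item
            if thread in blocked:
                continue  # preserve per-thread order behind a skipped op
            if resource_ready(item, executed):
                pending.remove(item)
                executed.append(item)
                progressed = True
            else:
                blocked.add(thread)  # skip for now, revisit next pass
        if not progressed:
            raise RuntimeError("no data handling operation is executable")
    return executed

# Illustrative resource model: the resource of t1's data handling 12 only
# becomes executable after t2's data handling 12 has completed.
def resource_ready(item, executed):
    return item != ("t1", "h12") or ("t2", "h12") in executed

order = run_handling(
    [("t1", "h11"), ("t2", "h11"), ("t1", "h12"), ("t2", "h12")],
    resource_ready,
)
# order == [("t1","h11"), ("t2","h11"), ("t2","h12"), ("t1","h12")]
```

With this stall model, the resulting order matches the fig. 9 scenario: data handling 12 of thread t1 is skipped once and executed after data handling 12 of thread t2.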
For example, in some embodiments, fig. 9 is a schematic diagram illustrating an execution sequence of the data handling operations according to the embodiments of the present application. As shown in fig. 9, it is assumed that the data processing task includes 2 threads, namely thread t1 and thread t2, where thread t1 corresponds to 2 data handling operations, such as data handling 11 of thread t1 and data handling 12 of thread t1; thread t2 corresponds to 2 data handling operations, such as data handling 11 of thread t2 and data handling 12 of thread t2. The instruction sequence corresponding to the data processing task is data handling 11 of thread t1, data handling 11 of thread t2, data handling 12 of thread t1, and data handling 12 of thread t2. In the process of executing the data handling operations, after data handling 11 of thread t2 is completed, the data operation module determines that the data resource corresponding to data handling 12 of thread t1 is not executable; the data operation module therefore skips data handling 12 of thread t1, first completes data handling 12 of thread t2, whose data resource is executable, and then executes data handling 12 of thread t1. That is, the execution sequence of the data handling operations performed by the data operation module is data handling 11 of thread t1, data handling 11 of thread t2, data handling 12 of thread t2, and data handling 12 of thread t1.
That is, in the embodiment of the present application, the actual execution order of the data handling operations performed by the data operation module is not necessarily the same as the instruction sequence corresponding to the data processing task. When the data operation module performs data handling, in addition to referring to the corresponding instruction sequence, it needs to determine whether the data resource corresponding to the data handling operation to be performed is executable. In the case where the data resource cannot support the corresponding data handling operation, the data operation module does not wait on that data handling operation, but chooses to skip it and performs the next data handling operation indicated by the instruction sequence.
In the embodiment of the present application, although the actual execution order of the data handling operations may differ from the instruction sequence corresponding to the data processing task, the data operation module still needs to execute the multiple data handling operations of the same thread in the order indicated by the instruction sequence. For example, across the 3 threads t1, t2, and t3, the data operation module may execute in an order other than the order indicated by the instruction sequence; but for the 3 data handling operations corresponding to thread t1, such as data handling 11, data handling 12, and data handling 13, the data operation module needs to execute in the order indicated by the instruction sequence, that is, if the instruction sequence is data handling 11, data handling 12, data handling 13, the actual execution sequence is also data handling 11, data handling 12, data handling 13.
Further, in the embodiment of the present application, after completing one data handling operation, the resource matching information may be updated by the data operation module.
It should be noted that, in the embodiment of the present application, the data operation module may update the resource matching information in real time; specifically, it may choose to update the resource matching information each time one data handling operation is completed.
Illustratively, in some embodiments, after completing the k-th data handling operation indicated by the instruction sequence, the data operation module may update the resource matching information, so that it may be determined from the resource matching information that the k-th data handling operation has been completed. Then, after completing the (k+1)-th data handling operation indicated by the instruction sequence, the data operation module may update the resource matching information again, so that it may be determined from the resource matching information that the (k+1)-th data handling operation has been completed.
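A minimal sketch of this bookkeeping (the class and method names are assumptions for illustration, not the embodiment's interface): the data operation module records each completed handling operation in a completion set, which is what gets synchronized to the computing module:

```python
class ResourceMatchingInfo:
    """Toy model of the resource matching information: a set of completed
    data handling operations, updated after each operation completes."""
    def __init__(self):
        self.completed = set()

    def mark_completed(self, thread, op):
        # Called by the data operation module right after an operation ends.
        self.completed.add((thread, op))

    def is_completed(self, thread, op):
        # Queried by the computing module after synchronization.
        return (thread, op) in self.completed

info = ResourceMatchingInfo()
info.mark_completed("t1", "handling_11")   # k-th handling operation done
info.mark_completed("t1", "handling_12")   # (k+1)-th done: update again
```

After each `mark_completed` call, the updated set tells the computing module exactly which handling operations have finished.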
Further, in the embodiment of the present application, when synchronizing the resource matching information corresponding to the data processing task to the computing module through the data operation module, the data operation module may choose to synchronize the resource matching information to the computing module based on a synchronization event.
It should be noted that, in the embodiment of the present application, the synchronization Event may include an Event (Event object), which may be a multithreaded synchronization means. In addition to using the synchronization event to perform the synchronization processing of the resource matching information, the data operation module may also synchronize the resource matching information to the computing module using any other synchronization method, which is not specifically limited in this application.
Illustratively, in some embodiments, fig. 10 is a schematic diagram illustrating synchronization of resource matching information according to an embodiment of the present application, and as shown in fig. 10, based on a synchronization Event (Event sync) 30, the data operation module 612 may synchronize the resource matching information updated in real time to the calculation module 611.
It will be appreciated that in embodiments of the present application, by synchronizing the resource matching information, the computing module may further perform subsequent computing operations based on the resource matching information.
Step 102, executing, by a computing module, computing operations corresponding to the first thread or the second thread based on the resource matching information.
In an embodiment of the present application, when executing the data processing task, the data processing apparatus may further execute, by the computing module, a computing operation corresponding to the first thread or the second thread based on the resource matching information after synchronizing, by the data operating module, the resource matching information corresponding to the data processing task to the computing module.
It should be noted that, in the embodiment of the present application, the resource matching information may indicate the completion status of any data handling operation included in any thread, so after obtaining the resource matching information synchronized by the data operation module, the computing module may perform the corresponding computing operation in combination with the resource matching information.
Further, in the embodiment of the present application, in the process of executing, by the computing module, the computing operation corresponding to the first thread or the second thread based on the resource matching information, after the computing module completes the k-th computing operation indicated by the instruction sequence, in the case that it is determined based on the resource matching information that the (k+1)-th data handling operation indicated by the instruction sequence is not completed, the (k+2)-th computing operation indicated by the instruction sequence may be selected to be executed by the computing module.

Further, in the embodiment of the present application, in the process of executing, by the computing module, the computing operation corresponding to the first thread or the second thread based on the resource matching information, after the computing module completes the k-th computing operation indicated by the instruction sequence, in the case that it is determined based on the resource matching information that the (k+1)-th data handling operation indicated by the instruction sequence has been completed, the (k+1)-th computing operation may be executed by the computing module.
That is, in the embodiment of the present application, when the computing module in the data processing apparatus performs data processing based on the order of the computing operations of each thread indicated by the instruction sequence corresponding to the data processing task, after completing one of the computing operations indicated by the instruction sequence, the computing module may determine, according to the synchronized resource matching information, whether the data handling operation corresponding to the next computing operation indicated by the instruction sequence is completed, that is, whether the data resource corresponding to that data handling operation is executable. If not, the computing module may first skip that computing operation and move on to the next computing operation indicated by the instruction sequence, that is, continue to determine whether the data handling operation corresponding to that next computing operation is completed.
Accordingly, in the embodiment of the present application, when the computing module in the data processing apparatus performs data processing based on the order of the computing operations of each thread indicated by the instruction sequence corresponding to the data processing task, after completing one of the computing operations indicated by the instruction sequence, the computing module may determine, according to the synchronized resource matching information, whether the data handling operation corresponding to the next computing operation indicated by the instruction sequence is completed, that is, whether the data resource corresponding to that data handling operation is executable. If so, the computing module may complete the corresponding computing operation using the data resource corresponding to that data handling operation.
Illustratively, in some embodiments, it is assumed that the data processing task includes 2 threads, such as thread t1 and thread t2, where thread t1 corresponds to 2 computing operations and 2 data handling operations, such as computing operation 21 of thread t1 and computing operation 22 of thread t1, with the corresponding data handling 11 of thread t1 and data handling 12 of thread t1; thread t2 corresponds to 3 computing operations and 3 data handling operations, such as computing operation 21 of thread t2, computing operation 22 of thread t2, and computing operation 23 of thread t2, with the corresponding data handling 11 of thread t2, data handling 12 of thread t2, and data handling 13 of thread t2. After completing computing operation 21 of thread t2, the computing module may determine, based on the instruction sequence, that the next computing operation is computing operation 22 of thread t1. If it determines, according to the resource matching information synchronized by the data operation module, that data handling 12 of thread t1 is not completed, the computing module may not execute computing operation 22 of thread t1, but instead determine, for the computing operation that follows computing operation 22 of thread t1 in the instruction sequence, whether the corresponding data handling operation is completed.
Illustratively, in some embodiments, it is assumed that the data processing task includes 2 threads, such as thread t1 and thread t2, where thread t1 corresponds to 2 computing operations and 2 data handling operations, such as computing operation 21 of thread t1 and computing operation 22 of thread t1, with the corresponding data handling 11 of thread t1 and data handling 12 of thread t1; thread t2 corresponds to 3 computing operations and 3 data handling operations, such as computing operation 21 of thread t2, computing operation 22 of thread t2, and computing operation 23 of thread t2, with the corresponding data handling 11 of thread t2, data handling 12 of thread t2, and data handling 13 of thread t2. After completing computing operation 21 of thread t2, the computing module may determine, based on the instruction sequence, that the next computing operation is computing operation 22 of thread t1. If it determines, according to the resource matching information synchronized by the data operation module, that data handling 12 of thread t1 is completed, the computing module may complete computing operation 22 of thread t1 using the data resource corresponding to data handling 12 of thread t1.
Therefore, in the embodiment of the present application, before executing the computing operation, the computing module needs to determine whether the corresponding data handling operation is completed according to the resource matching information, so that the computing module can execute the corresponding computing operation by using the corresponding data resource when the data handling operation is completed; accordingly, in the event that the corresponding data handling operation is not completed, execution of the corresponding computing operation is skipped.
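The computing module's selection rule above can be sketched as a single scheduling decision (the names are hypothetical): among the pending computing operations in instruction-sequence order, pick the first one whose paired data handling operation is reported complete, without overtaking an earlier pending operation of the same thread:

```python
def next_compute_op(pending, completed_handlings, pairing):
    """Return the first pending computing operation whose paired data
    handling operation appears in the synchronized completion set, or None
    if no operation is currently executable. A thread is blocked as soon as
    one of its operations is skipped, so per-thread order is preserved."""
    blocked = set()
    for thread, op in pending:
        if thread in blocked:
            continue
        if (thread, pairing[op]) in completed_handlings:
            return (thread, op)
        blocked.add(thread)
    return None

pairing = {"c21": "h11", "c22": "h12"}
pending = [("t1", "c22"), ("t2", "c21"), ("t2", "c22")]
synced = {("t2", "h11"), ("t2", "h12")}   # t1's h12 not yet reported done
# t1's c22 is skipped (its handling is incomplete); t2's c21 is chosen.
```

This mirrors the example above: the computing module bypasses computing operation 22 of thread t1 and activates the next thread whose data is ready.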
For example, in some embodiments, fig. 11 is a schematic diagram illustrating an execution sequence of the computing operations set forth in the embodiments of the present application. As shown in fig. 11, it is assumed that the data processing task includes 2 threads, namely thread t1 and thread t2, where thread t1 corresponds to 2 computing operations, such as computing operation 21 of thread t1 and computing operation 22 of thread t1; thread t2 corresponds to 2 computing operations, such as computing operation 21 of thread t2 and computing operation 22 of thread t2. The instruction sequence corresponding to the data processing task is computing operation 21 of thread t1, computing operation 21 of thread t2, computing operation 22 of thread t1, and computing operation 22 of thread t2. In the process of executing the computing operations, after completing computing operation 21 of thread t2, the computing module determines that the data handling operation corresponding to computing operation 22 of thread t1 is not completed, that is, the corresponding data resource is not executable; the computing module therefore skips computing operation 22 of thread t1, first completes computing operation 22 of thread t2, whose corresponding data handling operation has been completed, and then executes computing operation 22 of thread t1. That is, the execution sequence of the computing operations performed by the computing module is computing operation 21 of thread t1, computing operation 21 of thread t2, computing operation 22 of thread t2, and computing operation 22 of thread t1.
That is, in the embodiment of the present application, the actual execution order of the computing operations performed by the computing module is not necessarily the same as the instruction sequence corresponding to the data processing task. When the computing module performs computing operations, in addition to referring to the corresponding instruction sequence, it needs to determine, according to the resource matching information, whether the corresponding data handling operation is completed, that is, whether the corresponding data resource is executable. In the case where the data resource cannot support the corresponding computing operation, the computing module does not wait on that computing operation, but chooses to skip it and performs the next computing operation indicated by the instruction sequence.
It should be noted that, in the embodiment of the present application, although the actual execution order of the computing operations may differ from the instruction sequence corresponding to the data processing task, the computing module still needs to execute the computing operations of the same thread in the order indicated by the instruction sequence. For example, across the 3 threads t1, t2, and t3, the computing module may execute in an order other than the order indicated by the instruction sequence; but for the 3 computing operations corresponding to thread t2, such as computing operation 21, computing operation 22, and computing operation 23, the computing module needs to execute in the order indicated by the instruction sequence, that is, if the instruction sequence is computing operation 21, computing operation 22, computing operation 23, the actual execution sequence is also computing operation 21, computing operation 22, computing operation 23.
That is, in the embodiment of the present application, by synchronization of the resource matching information, the computing module may learn whether the data handling operation corresponding to the computing operation has been completed, that is, determine whether the data resource corresponding to the computing operation to be performed is executable, and may choose to perform the computing operation if the corresponding data handling operation has been completed, and not perform the computing operation if the corresponding data handling operation has not been completed.
It may be appreciated that, in the embodiments of the present application, for a multithreaded data processing task, in the process of executing computing operations, the computing module may refer, based on the resource matching information, to the completion of the data handling operations performed by the data operation module, so that an executable computing operation may be selected according to that completion. Such a processing manner separates the computing resources from the data resources at the hardware level, thereby solving the problem of mismatching between computing resources and data resources.
In summary, with the data processing method set forth in steps 101 to 102, the multithreaded instructions corresponding to the data processing task are not executed strictly in sequence; instead, executable instructions are selected for execution according to the completion of the data handling operations. The execution of the computing operations refers to the completion of the data handling operations through the synchronization of the resource matching information between the computing module and the data operation module, so that the computing resources and the data resources are separated at the hardware level, the process is transparent to software, and the appropriate thread is selected and activated according to the execution situation.
For example, in some embodiments, fig. 12 is a diagram of a multithreading implementation framework according to an embodiment of the present application. As shown in fig. 12, the computing module 611 may be configured to perform the computing operations corresponding to thread t1 and thread t2 and includes a corresponding first command arbitration section 611a; the data operation module 612 may be configured to perform the data handling operations corresponding to thread t1 and thread t2 and also includes a corresponding second command arbitration section 612a. In the course of performing a data handling operation, the data operation module 612 may determine whether to perform that operation based on whether its data resource is executable; the computing module 611 and the data operation module 612 may synchronize their respective execution states, that is, synchronize the resource matching information, through the intermediate synchronization event 30, so that the computing module 611 may determine the completion of the data handling operations according to the resource matching information, and the corresponding first command arbitration section 611a may dynamically select an appropriate instruction for execution according to that completion.
For example, in some embodiments, fig. 13 is a schematic diagram of the execution sequence of the multithreading proposed in the embodiments of the present application. As shown in fig. 13, assume that the data processing task includes 2 threads, namely thread t1 and thread t2, where thread t1 corresponds to 2 computing operations and 2 data handling operations, such as computing operation 21 of thread t1 and computing operation 22 of thread t1, and data handling 11 of thread t1 and data handling 12 of thread t1; thread t2 corresponds to 2 computing operations and 2 data handling operations, such as computing operation 21 of thread t2 and computing operation 22 of thread t2, and data handling 11 of thread t2 and data handling 12 of thread t2. The instruction sequence corresponding to the data processing task is computing operation 21 of thread t1, computing operation 21 of thread t2, computing operation 22 of thread t1, and computing operation 22 of thread t2. After completing data handling 11 of thread t2, the data operation module may determine, based on the instruction sequence, that the next data handling operation is data handling 12 of thread t1; if it determines that the data resource cannot support data handling 12 of thread t1, the data operation module may not execute it, but instead execute the data handling operation that follows it in the instruction sequence, that is, data handling 12 of thread t2. Meanwhile, the data operation module may update the resource matching information and synchronize the resource matching information updated in real time to the computing module.
Accordingly, in the process of executing the computing operations, after completing computing operation 21 of thread t2, the computing module determines, based on the resource matching information, that the data handling operation corresponding to computing operation 22 of thread t1 is not completed, that is, the corresponding data resource is not executable. At this time, the computing module skips computing operation 22 of thread t1, first completes computing operation 22 of thread t2, whose corresponding data handling operation has been completed, and then executes computing operation 22 of thread t1. It can be seen that, mirroring the execution sequence of the data handling operations performed by the data operation module, the execution sequence of the computing operations performed by the computing module is computing operation 21 of thread t1, computing operation 21 of thread t2, computing operation 22 of thread t2, and computing operation 22 of thread t1.
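The fig. 13 scenario can be simulated end to end with a short sketch (all labels and the stall condition are hypothetical assumptions for illustration). The data operation module executes one handling operation per round, skipping stalled resources; after each one it updates the resource matching information and "synchronizes" it to the computing module, which then runs every computing operation whose paired handling operation is complete, preserving per-thread order:

```python
pairing = {"c21": "h11", "c22": "h12"}
handle_seq = [("t1", "h11"), ("t2", "h11"), ("t1", "h12"), ("t2", "h12")]
compute_seq = [("t1", "c21"), ("t2", "c21"), ("t1", "c22"), ("t2", "c22")]

def resource_ready(item, finished):
    # Assumed stall: t1's h12 resource frees only after t2's h12 completes.
    return item != ("t1", "h12") or ("t2", "h12") in finished

matching_info = set()            # the synchronized completion set
handle_order, compute_order = [], []
pending_h, pending_c = list(handle_seq), list(compute_seq)

while pending_h or pending_c:
    progressed = False
    # Data operation module: run the first executable handling operation.
    blocked = set()
    for item in list(pending_h):
        if item[0] in blocked:
            continue
        if resource_ready(item, handle_order):
            pending_h.remove(item)
            handle_order.append(item)
            matching_info.add(item)      # update + synchronize (Event sync)
            progressed = True
            break
        blocked.add(item[0])
    # Computing module: run every op whose paired handling is reported done.
    blocked = set()
    for item in list(pending_c):
        thread, op = item
        if thread in blocked:
            continue
        if (thread, pairing[op]) in matching_info:
            pending_c.remove(item)
            compute_order.append(item)
            progressed = True
        else:
            blocked.add(thread)
    if not progressed:
        raise RuntimeError("stuck: no operation is executable")

# handle_order  == [("t1","h11"), ("t2","h11"), ("t2","h12"), ("t1","h12")]
# compute_order == [("t1","c21"), ("t2","c21"), ("t2","c22"), ("t1","c22")]
```

Both resulting orders match the sequences described for fig. 13: each module skips the stalled thread t1 operation once and activates thread t2 first.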
Therefore, the data processing method provided by the embodiment of the application is a more reasonable and efficient instruction and data synchronization mode suitable for the NPU. It enables efficient multithreaded execution on the NPU and, to a great extent, alleviates the problems of poor NPU generality and a low level of programmability. Meanwhile, through the synchronization of the resource matching information, the computing operations are not executed strictly in the order in which the instructions enter; instead, threads are flexibly activated according to the resource matching situation, so the efficiency is higher and the effect of multithreading is truly brought into play.
The embodiments of the present application provide a data processing method applied to a data processing apparatus, where the data processing apparatus includes a neural network processing unit (NPU), and the NPU includes a computing module and a data operation module. When a data processing task is executed, resource matching information corresponding to the data processing task is synchronized to the computing module by the data operation module; the data processing task includes at least a first thread and a second thread, and the resource matching information characterizes whether the data handling operations corresponding to the first thread and the second thread are completed; the computing module then executes a computing operation corresponding to the first thread or the second thread based on the resource matching information. It can be seen that, in the embodiments of the present application, by synchronizing the resource matching information, the computing module can determine the completion status of the data handling operations and execute computing operations accordingly. That is, for a data processing task, on the basis of improving versatility through multithreading, an appropriate computing operation can be flexibly and dynamically selected according to the completion status of the synchronized data handling operations, so that computing resources and data resources are decoupled at the hardware level. This solves the problem of mismatch between computing resources and data resources, takes both versatility and hardware utilization into account, and effectively improves the efficiency and performance of data processing.
Based on the above embodiments, a further embodiment of the present application provides a data processing method, which can utilize multithreading to improve hardware utilization rate, achieve good versatility, and solve the problem of mismatching between computing resources and data resources existing in multithreading.
It should be noted that, in the embodiments of the present application, the data processing method may be applied to a data processing apparatus configured with an NPU. It may also be applied to a data processing chip provided with the NPU, or to a data processing device integrated with the data processing apparatus or chip. In each case, the NPU includes a computing module and a data operation module.
It should be noted that, in the embodiments of the present application, an intelligent instruction synchronization design may be added to a multithreaded data processing task, namely the resource matching information synchronized in real time between the computing module and the data operation module. Based on this resource matching information, computing operations are not executed strictly in the order in which the instructions arrive; instead, threads are flexibly activated according to the resource matching condition, thereby resolving the situation in which computing resources and data resources are not matched.
Further, in the embodiment of the present application, when executing the data processing task, the resource matching information corresponding to the data processing task may be synchronized to the computing module by the data operation module. Wherein the data processing task comprises at least a first thread and a second thread.
Further, in the embodiment of the present application, during the process of executing the data processing task, the data processing device may first determine a plurality of threads corresponding to the data processing task, where the plurality of threads are selected to share the same hardware resource, so that the hardware utilization rate may be further improved.
Further, in an embodiment of the present application, for any one thread corresponding to a data processing task, the thread may include at least one computing operation, and at least one data handling operation corresponding to the at least one computing operation. Wherein the computing operations may be performed by computing modules in the data processing apparatus and the data handling operations may be performed by data manipulation modules in the data processing apparatus.
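As an illustration of this thread structure, a thread can be modeled as a bundle of computing operations and their corresponding data handling operations. The class and field names below (`NpuThread`, `compute_ops`, `handling_ops`) are hypothetical, introduced only for this sketch.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NpuThread:
    """Hypothetical model: one thread bundles its computing operations
    with the data handling operations that feed them."""
    name: str
    compute_ops: List[str] = field(default_factory=list)
    handling_ops: List[str] = field(default_factory=list)

# The two-thread task of fig. 13, expressed in this model.
t1 = NpuThread("t1", ["compute21", "compute22"], ["handling11", "handling12"])
t2 = NpuThread("t2", ["compute21", "compute22"], ["handling11", "handling12"])
task = [t1, t2]

# Each computing operation has a corresponding data handling operation.
assert all(len(t.compute_ops) == len(t.handling_ops) for t in task)
```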
It should be noted that, in the embodiment of the present application, the resource matching information may indicate whether the data handling operations corresponding to the first thread and the second thread are completed. The data operation module can synchronize the completion condition of any one data carrying operation included in any one thread to the calculation module through the resource matching information.
That is, in the embodiment of the present application, by synchronizing the resource matching information, the computing module may learn whether the data handling operation corresponding to the computing operation has been completed, that is, whether the handling process has been completed by the data resource corresponding to the computing operation.
Further, in the embodiments of the present application, after determining the plurality of threads included in the data processing task, the data processing apparatus may further determine an instruction sequence corresponding to the data processing task. The instruction sequence may be used to determine both the execution order of the plurality of threads and the execution order of the at least one computing operation corresponding to any one thread.
Further, in the embodiments of the present application, after the data operation module completes the kth data handling operation indicated by the instruction sequence corresponding to the data processing task, if the data resource corresponding to the (k+1)th data handling operation indicated by the instruction sequence is not executable, the data operation module may instead execute the (k+2)th data handling operation indicated by the instruction sequence. If the data resource corresponding to the (k+1)th data handling operation indicated by the instruction sequence is executable, the data operation module executes the (k+1)th data handling operation.
Therefore, in the embodiment of the application, before executing the data handling operation, the data operation module needs to determine whether the data resource is executable, so that the corresponding data handling operation can be completed under the condition that the data resource is executable; accordingly, in the event that the data resource is not executable, execution of the corresponding data handling operation is skipped.
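The data-side scheduling rule above can be sketched as follows. This is a hedged illustration, not the embodiment's implementation: `schedule_handling` and the `executable` predicate are assumed names, and deferred operations are simply retried at the end once their resources are presumed free.

```python
def schedule_handling(ops, executable):
    """Run data handling operations in instruction-sequence order,
    skipping any operation whose data resource is not executable
    and retrying it after the later operations."""
    done, deferred = [], []
    for op in ops:
        (done if executable(op) else deferred).append(op)
    # deferred operations are retried once their data resources free up
    return done + deferred

# The fig. 13 scenario: the resource for t1's data handling 12 is busy.
order = schedule_handling(
    ["t1.handling11", "t2.handling11", "t1.handling12", "t2.handling12"],
    lambda op: op != "t1.handling12",
)
print(order)
# → ['t1.handling11', 't2.handling11', 't2.handling12', 't1.handling12']
```

The skipped operation is not dropped; it is merely deferred until its data resource becomes executable, matching the behavior described for the data operation module.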
Further, in an embodiment of the present application, after completing one data handling operation, the data operation module may update the resource matching information, and may then synchronize the resource matching information to the computing module based on a synchronization event.
It should be noted that, in the embodiment of the present application, the resource matching information may determine the completion condition of any one data handling operation included in any one thread, so after obtaining the resource matching information synchronized by the data operation module, the calculation module may perform the corresponding calculation operation in combination with the resource matching information.
Further, in the embodiments of the present application, in the process of the computing module executing a computing operation corresponding to the first thread or the second thread based on the resource matching information, after the computing module completes the kth computing operation indicated by the instruction sequence, if it is determined based on the resource matching information that the (k+1)th data handling operation indicated by the instruction sequence has not been completed, the computing module may instead execute the (k+2)th computing operation indicated by the instruction sequence. If it is determined based on the resource matching information that the (k+1)th data handling operation has been completed, the computing module executes the (k+1)th computing operation.
Therefore, in the embodiment of the present application, before executing the computing operation, the computing module needs to determine whether the corresponding data handling operation is completed according to the resource matching information, so that the computing module can execute the corresponding computing operation by using the corresponding data resource when the data handling operation is completed; accordingly, in the event that the corresponding data handling operation is not completed, execution of the corresponding computing operation is skipped.
That is, in the embodiment of the present application, by synchronization of the resource matching information, the computing module may learn whether the data handling operation corresponding to the computing operation has been completed, that is, determine whether the data resource corresponding to the computing operation to be performed is executable, and may choose to perform the computing operation if the corresponding data handling operation has been completed, and not perform the computing operation if the corresponding data handling operation has not been completed.
In summary, for the multithreaded data processing task, the data processing method provided in the embodiments of the present application may refer to the completion condition of the data handling operation corresponding to the data operation module based on the resource matching information in the process of executing the computing operation by the computing module, so that the executable computing operation may be selected according to the completion condition of the data handling operation, and such a processing manner may separate the computing resource and the data resource from the hardware layer, thereby solving the problem of mismatching between the computing resource and the data resource.
For example, in some embodiments, fig. 14 is a second framework diagram of the multithreading implementation proposed in the embodiments of the present application. As shown in fig. 14, the computing module 611 may be configured to perform the computing operations corresponding to a plurality of threads t1, t2, t3, and t4 and includes a corresponding first command arbitration section 611a; the data operation module 612 may be configured to perform the data handling operations corresponding to the threads t1, t2, t3, and t4 and includes a corresponding second command arbitration section 612a. During execution, the data operation module 612 may determine whether to process a data handling operation based on whether the data resource corresponding to that operation is executable, and after the data handling operation is completed, the data may be written through direct memory access (Direct Memory Access, DMA). The computing module 611 and the data operation module 612 may synchronize their respective execution states, that is, the resource matching information, through the intermediate synchronization event 30, so that the computing module 611 can determine the completion status of the data handling operations from the resource matching information, and the corresponding first command arbitration section 611a can dynamically select an appropriate instruction to execute, that is, a computing operation, according to that completion status.
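The synchronization event between the two modules can be modeled in software with a condition variable. This is an illustrative sketch under stated assumptions: the class and method names (`ResourceMatchInfo`, `mark_done`, `wait_done`) are invented for this example and do not come from the embodiment, which describes a hardware mechanism.

```python
import threading

class ResourceMatchInfo:
    """Shared resource matching information, guarded by a condition
    variable that plays the role of the synchronization event."""

    def __init__(self):
        self._done = set()
        self._cond = threading.Condition()

    def mark_done(self, op):
        """Data operation module side: record that a data handling
        operation completed, then notify the computing module."""
        with self._cond:
            self._done.add(op)
            self._cond.notify_all()

    def wait_done(self, op, timeout=None):
        """Computing module side: block until the data handling
        operation `op` is reported complete (or the timeout expires)."""
        with self._cond:
            return self._cond.wait_for(lambda: op in self._done, timeout)

info = ResourceMatchInfo()
# Simulate the data operation module finishing a handling operation soon.
worker = threading.Timer(0.01, info.mark_done, args=("t1.handling12",))
worker.start()
assert info.wait_done("t1.handling12", timeout=1.0)
worker.join()
```

In the hardware described, the computing module would not block but skip to another ready operation; the condition variable here only illustrates the notification path of the synchronization event.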
Therefore, the data processing method provided by the embodiments of the present application constitutes a more reasonable and efficient instruction and data synchronization scheme suitable for the NPU. It enables efficient multithreaded execution on the NPU and, to a great extent, alleviates the problems of poor NPU versatility and a low level of programmability. Meanwhile, through the synchronization of the resource matching information, computing operations are not executed strictly in the order in which the instructions arrive; instead, threads are flexibly activated according to the resource matching condition, which yields higher efficiency and truly realizes the benefit of multithreading.
The embodiments of the present application provide a data processing method applied to a data processing apparatus, where the data processing apparatus includes a neural network processing unit (NPU), and the NPU includes a computing module and a data operation module. When a data processing task is executed, resource matching information corresponding to the data processing task is synchronized to the computing module by the data operation module; the data processing task includes at least a first thread and a second thread, and the resource matching information characterizes whether the data handling operations corresponding to the first thread and the second thread are completed; the computing module then executes a computing operation corresponding to the first thread or the second thread based on the resource matching information. It can be seen that, in the embodiments of the present application, by synchronizing the resource matching information, the computing module can determine the completion status of the data handling operations and execute computing operations accordingly. That is, for a data processing task, on the basis of improving versatility through multithreading, an appropriate computing operation can be flexibly and dynamically selected according to the completion status of the synchronized data handling operations, so that computing resources and data resources are decoupled at the hardware level. This solves the problem of mismatch between computing resources and data resources, takes both versatility and hardware utilization into account, and effectively improves the efficiency and performance of data processing.
Based on the above embodiments, in another embodiment of the present application, fig. 15 is a schematic diagram illustrating a composition structure of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 15, a data processing apparatus 60 according to an embodiment of the present application may include: a synchronization unit 62 and an execution unit 63, wherein:
the synchronization unit 62 is configured to synchronize, when executing a data processing task, resource matching information corresponding to the data processing task to the calculation module through the data operation module; the data processing task at least comprises a first thread and a second thread; the resource matching information characterizes whether the data carrying operation corresponding to the first thread and the second thread is completed or not;
the executing unit 63 is configured to execute, by using the computing module, a computing operation corresponding to the first thread or the second thread based on the resource matching information.
In an embodiment of the present application, further, fig. 16 is a schematic diagram of the composition structure of a data processing chip according to an embodiment of the present application. As shown in fig. 16, the data processing chip 160 according to an embodiment of the present application may include a neural network processing unit 61, where the neural network processing unit 61 includes a computing module 611 and a data operation module 612. The data processing chip 160 may be used to implement the data processing method set forth in the above embodiments.
In an embodiment of the present application, further, fig. 17 is a schematic diagram of the composition structure of a data processing device according to an embodiment of the present application. As shown in fig. 17, the data processing device 170 (which may be a terminal device when implemented) according to an embodiment of the present application may include a neural network processing unit 61, where the neural network processing unit 61 includes a computing module 611 and a data operation module 612. The data processing device 170 may be used to implement the data processing method set forth in the above embodiments.
The present embodiment provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the data processing method as described above.
Specifically, the program instructions corresponding to a data processing method in this embodiment may be stored on a storage medium such as an optical disc, a hard disk, or a USB flash drive. When the program instructions corresponding to the data processing method in the storage medium are read or executed by an electronic device, the method includes the following steps:
when executing the data processing task, synchronizing the resource matching information corresponding to the data processing task to the computing module through the data operation module; the data processing task at least comprises a first thread and a second thread; the resource matching information characterizes whether the data carrying operation corresponding to the first thread and the second thread is completed or not;
And executing the computing operation corresponding to the first thread or the second thread based on the resource matching information through the computing module.
In the embodiments of the present application, the processor may be at least one of an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a digital signal processing device (Digital Signal Processing Device, DSPD), a programmable logic device (Programmable Logic Device, PLD), a field programmable gate array (Field Programmable Gate Array, FPGA), a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, or a microprocessor. It will be appreciated that the electronic device implementing the above processor function may differ for different apparatuses, and the embodiments of the present application are not specifically limited in this respect.
In addition, each functional module in the present embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional modules.
If the integrated units are implemented in the form of software functional modules and are not sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this embodiment may be embodied essentially, in whole or in part, in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method of this embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
The embodiments of the present application provide a data processing apparatus, a chip, a device, and a storage medium. When a data processing task is executed, resource matching information corresponding to the data processing task is synchronized to the computing module by the data operation module; the data processing task includes at least a first thread and a second thread, and the resource matching information characterizes whether the data handling operations corresponding to the first thread and the second thread are completed; the computing module then executes a computing operation corresponding to the first thread or the second thread based on the resource matching information. It can be seen that, in the embodiments of the present application, by synchronizing the resource matching information, the computing module can determine the completion status of the data handling operations and execute computing operations accordingly. That is, for a data processing task, on the basis of improving versatility through multithreading, an appropriate computing operation can be flexibly and dynamically selected according to the completion status of the synchronized data handling operations, so that computing resources and data resources are decoupled at the hardware level. This solves the problem of mismatch between computing resources and data resources, takes both versatility and hardware utilization into account, and effectively improves the efficiency and performance of data processing.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of implementations of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block and/or flow of the flowchart illustrations and/or block diagrams, and combinations of blocks and/or flow diagrams in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application.

Claims (12)

1. A data processing method applied to a data processing apparatus, wherein the data processing apparatus includes a neural network processing unit NPU, the NPU including a calculation module and a data operation module, the method comprising:
When executing the data processing task, synchronizing the resource matching information corresponding to the data processing task to the computing module through the data operation module; the data processing task at least comprises a first thread and a second thread; the resource matching information characterizes whether the data carrying operation corresponding to the first thread and the second thread is completed or not;
and executing the computing operation corresponding to the first thread or the second thread based on the resource matching information through the computing module.
2. The method according to claim 1, wherein the method further comprises:
after finishing the kth data handling operation indicated by the instruction sequence corresponding to the data processing task through the data operation module, executing the kth+2 data handling operation indicated by the instruction sequence through the data operation module under the condition that the data resource corresponding to the kth+1 data handling operation indicated by the instruction sequence is not executable; where k is an integer greater than 0.
3. The method according to claim 2, wherein the method further comprises:
after finishing the kth data handling operation indicated by the instruction sequence through the data operation module, executing the kth+1 data handling operation through the data operation module under the condition that the data resource corresponding to the kth+1 data handling operation indicated by the instruction sequence is executable.
4. The method of claim 2, wherein the performing, by the computing module, a computing operation corresponding to the first thread or the second thread based on the resource matching information comprises:
after completion of the kth computing operation indicated by the instruction sequence by the computing module, in a case where it is determined that the kth+1th data handling operation indicated by the instruction sequence is not completed based on the resource matching information, executing the kth+2th computing operation indicated by the instruction sequence by the computing module.
5. The method of claim 3, wherein the performing, by the computing module, a computing operation corresponding to the first thread or the second thread based on the resource matching information comprises:
after completion of the kth computing operation indicated by the instruction sequence by the computing module, in a case where it is determined that the kth+1 data handling operation indicated by the instruction sequence has been completed based on the resource matching information, the kth+1 computing operation is performed by the computing module.
6. The method according to any one of claims 2 to 5, wherein,
the first thread comprises at least one first computing operation and at least one first data handling operation corresponding to the at least one first computing operation;
The second thread includes at least one second computing operation, and at least one second data handling operation corresponding to the at least one second computing operation.
7. The method according to any one of claims 2-5, further comprising:
after completing one data handling operation, the resource matching information is updated by the data operation module.
8. The method of claim 7, wherein synchronizing, by the data manipulation module, resource matching information corresponding to a data processing task to the computing module comprises:
and synchronizing, by the data operation module, the resource matching information to the computing module based on a synchronization event.
9. A data processing apparatus, characterized in that the data processing apparatus comprises: a synchronization unit, an execution unit,
the synchronization unit is used for synchronizing the resource matching information corresponding to the data processing task to the calculation module through the data operation module when the data processing task is executed; the data processing task at least comprises a first thread and a second thread; the resource matching information characterizes whether the data carrying operation corresponding to the first thread and the second thread is completed or not;
The execution unit is used for executing the computing operation corresponding to the first thread or the second thread based on the resource matching information through the computing module.
10. The data processing chip is characterized by comprising an NPU, wherein the NPU comprises a calculation module and a data operation module; the data processing chip is configured to implement the method of any of claims 1-8.
11. A data processing device, wherein the data processing device comprises an NPU, the NPU comprising a computing module and a data operating module; the data processing device being adapted to implement the method of any of claims 1-8.
12. A computer readable storage medium, on which a program is stored, which program, when being executed by a processor, implements the method according to any of claims 1-8.
CN202311368621.0A 2023-10-20 2023-10-20 Data processing method and device, chip, device and storage medium Active CN117389731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311368621.0A CN117389731B (en) 2023-10-20 2023-10-20 Data processing method and device, chip, device and storage medium


Publications (2)

Publication Number Publication Date
CN117389731A true CN117389731A (en) 2024-01-12
CN117389731B CN117389731B (en) 2024-04-02

Family

ID=89467915


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120110585A1 (en) * 2010-10-29 2012-05-03 International Business Machines Corporation Energy consumption optimization in a data-processing system
US20150082324A1 (en) * 2013-09-18 2015-03-19 International Business Machines Corporation Efficient Interrupt Handling
WO2016014263A2 (en) * 2014-07-24 2016-01-28 Iniguez Alfonso System and method for parallel processing using dynamically configurable proactive co-processing cells
US20170364361A1 (en) * 2016-06-17 2017-12-21 Via Alliance Semiconductor Co., Ltd. Multi-threading processor and a scheduling method thereof
US20210012185A1 (en) * 2019-07-11 2021-01-14 Arm Limited Managing control data
CN114328316A (en) * 2021-11-22 2022-04-12 Beijing Smartchip Microelectronics Technology Co., Ltd. DMA controller, SOC system and data transfer method based on DMA controller
CN114661474A (en) * 2022-03-30 2022-06-24 Alibaba (China) Co., Ltd. Information processing method, apparatus, device, storage medium, and program product
CN114661353A (en) * 2022-03-31 2022-06-24 Chengdu Denglin Technology Co., Ltd. Data handling device and processor supporting multithreading
WO2022151970A1 (en) * 2021-01-14 2022-07-21 Huawei Technologies Co., Ltd. Data transmission method, system, and computing node
CN115454644A (en) * 2022-09-26 2022-12-09 Shanghai Lepu Yunzhi Technology Co., Ltd. Task thread processing method and device for real-time monitoring data
WO2022266842A1 (en) * 2021-06-22 2022-12-29 Huawei Technologies Co., Ltd. Multi-thread data processing method and apparatus
CN115686625A (en) * 2021-07-31 2023-02-03 Huawei Technologies Co., Ltd. Integrated chip and instruction processing method
CN116719764A (en) * 2023-08-07 2023-09-08 Suzhou Yangsiping Semiconductor Co., Ltd. Data synchronization method, system and related device

Also Published As

Publication number Publication date
CN117389731B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US10002402B2 (en) Learning convolution neural networks on heterogeneous CPU-GPU platform
US8615646B2 (en) Unanimous branch instructions in a parallel thread processor
US5949996A (en) Processor having a variable number of stages in a pipeline
US10346212B2 (en) Approach for a configurable phase-based priority scheduler
CN100562892C Image processing engine and image processing system comprising the image processing engine
US10402223B1 (en) Scheduling hardware resources for offloading functions in a heterogeneous computing system
CN112799726B (en) Data processing device, method and related product
US20110078418A1 (en) Support for Non-Local Returns in Parallel Thread SIMD Engine
TW202024922A (en) Method and apparatus for accessing tensor data
JPH01177127A (en) Information processor
RU2450329C2 (en) Efficient interrupt return address save mechanism
CN117389731B (en) Data processing method and device, chip, device and storage medium
US11823303B2 (en) Data processing method and apparatus
CN114706813B (en) Multi-core heterogeneous system-on-chip, asymmetric synchronization method, computing device and medium
CN112433773B (en) Configuration information recording method and device for reconfigurable processor
JP5630798B1 (en) Processor and method
JP2008102599A (en) Processor
CN110689475A (en) Image data processing method, system, electronic equipment and storage medium
Kang et al. Tensor virtualization technique to support efficient data reorganization for CNN accelerators
CN117667198A Instruction synchronization control method, synchronization controller, processor, chip and board
CN116301874A (en) Code compiling method, electronic device and storage medium
CN113568665B (en) Data processing device
CN111915014B Processing method and device for artificial intelligence instructions, board card, mainboard and electronic device
EP4206999A1 (en) Artificial intelligence core, artificial intelligence core system, and loading/storing method of artificial intelligence core system
JPS6259829B2 (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant