WO2020253383A1 - Many-core processor-based streaming data processing method and computing device - Google Patents

Many-core processor-based streaming data processing method and computing device Download PDF

Info

Publication number
WO2020253383A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
computing
core
many
groups
Prior art date
Application number
PCT/CN2020/087013
Other languages
English (en)
French (fr)
Inventor
何伟
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京灵汐科技有限公司 filed Critical 北京灵汐科技有限公司
Publication of WO2020253383A1 publication Critical patent/WO2020253383A1/zh

Classifications

    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/48 Program initiating; program switching, e.g. by interrupt
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the technical field of processors, and in particular to a stream data processing method and computing equipment based on many-core processors.
  • the present invention provides a many-core processor-based streaming data processing method and computing device that overcomes the above problems or at least partially solves the above problems.
  • a streaming data processing method based on a many-core processor includes a plurality of computing cores, and the method includes:
  • Receiving data including images, video, audio or data from sensors
  • inputting the data into a corresponding computing core group among the N computing core groups to perform a data processing task includes:
  • the data is input to one computing core group; or the data is input to multiple computing core groups in the N computing core groups to perform different data processing tasks.
  • the inputting the data into a corresponding computing core group among the N computing core groups to perform a data processing task includes:
  • the data is input into the first computing core group to execute the first data processing task.
  • the inputting the data into the first computing core group to perform the first data processing task includes:
  • the data is divided into multiple data blocks, and the multiple data blocks are divided into M groups; each of the M groups includes the same number K of data blocks, where M is greater than or equal to 2 and K is greater than or equal to 1;
  • the K data blocks in the first of the M groups are input in parallel into the computing core corresponding to each of the K data blocks, and the computing cores perform calculation processing on their corresponding data blocks.
  • by dividing the data into subtasks according to the processing capability of each computing core, the many-core processor's processing of image data can be improved, along with the processing accuracy.
  • after the K data blocks in the first of the M groups are input in parallel into the computing core corresponding to each of the K data blocks and the computing cores perform calculation processing on them, the method further includes:
  • K data blocks in the second group in the M groups are input in parallel to the K computing cores, and the K computing cores perform calculation processing on the K data blocks in the second group.
  • whether in the parallel or the serial operation-transmission mode, a pipelined sequential working mode can be formed among the computing cores, ensuring orderly data processing while improving processing efficiency.
  • during the parallel input, any computing core, while performing its in-core operations, transmits at least part of the data whose computation is complete to the next computing core in the first computing core group, to execute the computation of the next layer in the first data processing task.
  • each of the computing cores in the many-core processor includes:
  • the weight storage unit is used to store at least one of data, data weights, and operation instructions required when performing neural network calculations;
  • a calculation unit used to access the data stored in the weight storage unit, and perform arithmetic processing on the data
  • the control unit generates operation instructions and controls the calculation unit to perform data operations based on the operation instructions
  • Routing is used to send and receive data, and perform data communication within the computing core or between multiple computing cores.
  • the communication modes between the multiple computing cores include at least one of a 2D mesh mode, a 3D ring mode, and a one-to-one direct-connection mode.
  • a computing device including a many-core processor for running a computer program, wherein:
  • any one of the many-core processor-based streaming data processing methods described above is adopted.
  • the computing device further includes:
  • the storage device is used to store a computer program, which is loaded and executed by the processor when the computer program is running in the computing device.
  • the present invention provides a more efficient flow data processing method based on many-core processors.
  • the many-core processor is divided into N computing core groups, and the data it receives is then input into the corresponding computing core group among the N groups, which executes data processing tasks in parallel and outputs the data processing results.
  • data can be arranged in a pipeline mode to be input by time period to ensure the accuracy of data processing.
  • the solution provided by the present invention can improve the many-core processor's data processing speed while reducing the energy consumption and data delay of the many-core processor, thereby improving the overall computing efficiency of the many-core processor.
  • Fig. 1 shows a schematic structural diagram of a computing core according to an embodiment of the present invention
  • Figure 2 shows a schematic diagram of a many-core processor architecture according to an embodiment of the present invention
  • FIG. 3 shows a schematic diagram of the architecture of a many-core processor in 2D mode according to an embodiment of the present invention
  • FIG. 4 shows a schematic flow chart of a stream data processing method based on a many-core processor according to an embodiment of the present invention
  • FIG. 5 shows a schematic diagram of a many-core processor divided into multiple computing core groups according to an embodiment of the present invention
  • Fig. 6 shows a schematic diagram of dividing input data into blocks and executing processing according to an embodiment of the present invention
  • Figure 7 shows a schematic diagram of a data division process according to an embodiment of the present invention.
  • Fig. 8 shows a schematic diagram of neural network mapping according to an embodiment of the present invention.
  • Fig. 9 shows a schematic diagram of input and output of a many-core processor according to an embodiment of the present invention.
  • FIG. 10 shows a schematic diagram of a parallel operation transmission mode of a many-core processor according to an embodiment of the present invention
  • FIG. 11 shows a schematic diagram of a serial mode of arithmetic transmission of a many-core processor according to an embodiment of the present invention
  • Fig. 12 shows a schematic diagram of a timing operation in a parallel working mode in a many-core processor according to an embodiment of the present invention.
  • Storage and processing integrated many-core architecture can effectively improve the computing power and efficiency of the chip.
  • integrated storage and processing means that storage is localized and merged with the processor. On the one hand, this greatly saves the energy consumed by moving data; on the other hand, in deep learning, when a static graph is computed through a neural network, the computed graph can remain unchanged during training or inference applications.
  • each computing core may include modules such as a weight storage unit, a calculation unit, a routing unit, and a control unit.
  • the weight storage unit is used to store at least one of data, data weights, and operation instructions required when performing neural network calculations or other data processing algorithms, and may also include temporary data or other data.
  • the calculation unit is used to access the data stored in the weight storage unit and perform arithmetic processing on the data.
  • the calculation unit can include multipliers, adders, and other special-function processing modules, and is responsible for computations such as matrix-vector operations or logic operations.
  • the control unit generates operation instructions and controls the calculation unit to perform data operations based on the operation instructions.
  • the control unit can specifically control the flow of data, select the corresponding operation and so on. Routing is used to send and receive data, and carry out data communication within the computing core or between multiple computing cores. In the working process of the computing core, the data required by the computing unit is input from the routing to the computing unit, and the computing unit performs computing processing and then outputs through the routing.
  • each computing core may also include a cache unit or other units, which is not limited in the present invention.
  • the weight storage units of multiple computing cores included in the many-core processor can be separated out and merged into one weight storage core.
  • the weight storage core can simultaneously provide weights to computing core 1, computing core 2, and computing core 3; in this mode, partial weight sharing can be realized.
  • computing core 1, computing core 2, and computing core 3 can be homogeneous computing cores.
  • when selecting which computing cores' weight storage units in the many-core processor to separate out and merge into a weight storage core, the selection can be made according to different requirements or the characteristics of each computing core, which the present invention does not limit.
  • the communication modes between the computing cores in the many-core processor may include at least one of a 2D mesh mode, a 3D ring mode, and a one-to-one direct-connection mode.
  • the 2D mesh mode is shown in Fig. 3: each computing core is connected via routing to the computing cores in the four directions above, below, left, and right. In this mode, data can be routed to any designated computing core.
  • the embodiment of the present invention provides a stream data processing method based on the many-core processor.
  • the streaming data processing method of the many-core processor may include:
  • Step S401 Divide a plurality of computing cores into N computing core groups, and the N computing core groups execute different data processing methods.
  • the data processing method can be a neural network algorithm or other data processing algorithms.
  • multiple computing cores in the many-core processor can be divided into N computing core groups, and each computing core group can execute different data processing methods, where N is an integer greater than zero.
  • the many-core processor is divided into three computing core groups, and three different neural network algorithms (neural network algorithm A, neural network algorithm B, and neural network algorithm C) can be mapped to different computing core groups.
  • the computing cores operate independently, the three neural networks can simultaneously receive data and perform corresponding data processing in parallel.
  • they can be divided according to computing power requirements or other requirements corresponding to different algorithms, which is not limited in the present invention.
  • Step S402 Receive data, the data including image, video, audio or data transmitted from the sensor.
  • the data that needs the many-core processor to perform calculations may be image data, audio data, video data, or other data to be processed that needs to be calculated by the many-core processor.
  • Step S403 Input the data into the corresponding computing core group among the N computing core groups to perform data processing tasks. Further, the data may be input to one computing core group; or the data may be input to multiple computing core groups of the N computing core groups to perform different data processing tasks.
  • Step S404 output the processed data.
  • the many-core processor After the many-core processor receives the data, it can be input to the computing core group for corresponding data processing.
  • the input data needs to execute the neural network A algorithm
  • the data is input to the computing core group corresponding to the neural network A algorithm for data processing, and the processing result is output.
  • any set of data may be preprocessed before being input to the computing core for calculation processing. That is, the foregoing step S403 inputs the data into the first computing core group to perform the first data processing task, and the specific method is as follows:
  • each of the M groups includes the same number K of data blocks, where M is greater than or equal to 2 and K is greater than or equal to 1.
  • the data may be divided into M groups each including the same number K of data blocks, according to the amount of computation each computing core can perform in a specified time.
  • FIG. 6 is a schematic diagram of dividing input data into blocks and executing processing. Taking the data divided into 9 blocks and three groups as an example for description, the embodiment of the present invention does not limit how many blocks and groups the data is divided into, and FIG. 6 is only a schematic diagram. As shown in Figure 6, the data is divided into 9 data blocks, and the 9 data blocks are divided into three groups, and each group includes 3 data blocks.
  • the first data block, the second data block, and the third data block are allocated computing core 1, computing core 2, and computing core 3.
  • the first data block, the second data block, and the third data block are input in parallel to the computing core 1, the computing core 2, and the computing core 3.
  • Core 3 performs calculation processing on the first data block, the second data block, and the third data block.
  • Fig. 7 shows a schematic diagram of a data division process according to an embodiment of the present invention.
  • the data can be divided into equal or unequal tasks according to the processing capacity of each core within a specified time.
  • time T1 corresponds to Input1 (input 1)
  • time T2 corresponds to Input2 (input 2)
  • time T3 corresponds to Input3 (input 3)
  • the data block in each group of data may correspond to a calculation subtask in the entire calculation task of the data or a collection of multiple subtasks.
  • a neural network is composed of multiple layers of calculation, divided into an input layer, an output layer and a hidden layer in the middle, as shown in Figure 8.
  • the circles represent neurons
  • the arrows in front of the output layer and the hidden layer represent weights.
  • the weights are matrices whose dimensions are determined by the numbers of neurons in the two adjacent layers.
  • the weights are distributedly stored in the computing cores (or weight storage cores), that is, the weights of a certain layer in the neural network algorithm are allocated to one or several computing cores (or weight storage cores).
  • when the preprocessed input data is input into the many-core processor, it is first sent to the computing core corresponding to the first layer for processing; once processed, the data is sent to the pre-planned computing core corresponding to the second layer, and then processed in turn by the computing cores corresponding to the third, fourth, ..., nth layers, until the output layer of the neural network is completed.
  • the mapping of a simple neural network onto computing cores is illustrated: the 8-layer network is distributed across 8 different computing cores, each completing one layer of neural network operations; after passing through the 8 cores, the network output is emitted by computing core 7.
  • the divided groups of data blocks can be sequentially input into the pre-configured computing core group in the many-core processor in a time period for the first-layer neural network operation.
  • after the K data blocks in the first of the M groups are input in parallel into the computing core corresponding to each of the K data blocks and the computing cores perform calculation processing on them, the method may further include:
  • Step S403-4: the K data blocks in the second of the M groups are input in parallel into the K computing cores, which perform calculation processing on them. Further, during the parallel input of steps S403-3 and S403-4, any computing core, while performing its in-core operations, transmits at least part of the completed data to the next computing core in the first computing core group, to execute the next layer of computation in the first data processing task.
  • the multiple computing cores may process the multiple data blocks in an operation-transmission parallel mode or an operation-transmission serial mode. That is, data transfer for in-core operations in the many-core processor can proceed in two ways, and in either mode a pipelined sequential working mode can be formed among the multiple computing cores.
  • FIG. 10 shows a schematic diagram of a parallel operation transmission mode of a many-core processor according to an embodiment of the present invention.
  • in this mode, any computing core, while performing its in-core operations, transmits at least part of the data whose computation is complete to the next target computing core. That is, while in-core operations proceed, partially completed data is transferred to the corresponding next target core; in other words, in-core computation and inter-core data transmission can occur simultaneously.
  • the lengths of the time periods may or may not be equal.
  • the specific duration of each time period is determined by when the last piece of data is transmitted to the target computing core, which also triggers the next time period.
  • FIG. 11 shows a schematic diagram of the operation transmission serial mode of a many-core processor according to an embodiment of the present invention.
  • when multiple computing cores process the multiple data blocks in the operation-transmission serial mode, for any computing core:
  • after the computing core completes its in-core operations, all of the computed data is transmitted to the next target computing core.
  • only then can the data be transferred to the corresponding next target computing core. That is, at any given moment a single computing core is doing only one of the two: in-core computation and inter-core data transmission are executed serially.
  • the processing of a single set of data can be performed directly using the above process.
  • the computation tasks of the corresponding data processing are each divided into multiple computation subtasks; the computation subtasks of the first set of data are input sequentially into the multiple computing cores of the many-core processor for processing. After the many-core processor completes the first computation subtask of the first set, its multiple computing cores work in parallel at the same time; after the first computing core finishes the final computation subtask of the first set, the second set of data is input into the many-core processor for processing, and so on for the third through Nth sets.
  • the many-core pipeline architecture can also support multi-task parallelism and inter-task interaction. That is, when there are multiple sets of data to process, the many-core processor selects multiple computing core groups, each including multiple computing cores; the multiple sets of data are input into different computing core groups, and each group executes its set's calculation processing in parallel.
  • the data processing algorithms executed by each computing core group may be the same or different.
  • an embodiment of the present invention also provides a computing device including a many-core processor for running a computer program, where:
  • the stream data processing method based on the many-core processor described in any of the foregoing embodiments is adopted.
  • the computing device may further include: a storage device for storing a computer program, and the computer program is loaded and executed by the processor when the computer program runs in the computing device.
  • the embodiment of the present invention provides a more efficient many-core processor-based streaming data processing method and computing device.
  • the many-core processor is divided into N computing core groups, and the data it receives is then input into the corresponding computing core group among the N groups to perform data processing tasks and output the data processing results.
  • data can be arranged in a pipeline mode to be input by time period to ensure the accuracy of data processing.
  • the solution provided based on the embodiments of the present invention can increase the many-core processor's data processing speed while reducing the energy consumption and data delay of the many-core processor, thereby improving the overall computing efficiency of the many-core processor.
  • modules, units, or components in the embodiments can be combined into one module, unit, or component, and can additionally be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Processing (AREA)

Abstract

The present invention provides a many-core processor-based streaming data processing method and computing device. The many-core processor includes multiple computing cores, and the method includes: dividing the multiple computing cores into N computing core groups, the N computing core groups executing different data processing methods; receiving data, the data including image, video, audio, or data transmitted from a sensor; inputting the data into the corresponding computing core group among the N computing core groups to perform a data processing task; and outputting the processed data. The solution provided by the present invention can increase the many-core processor's data processing speed while reducing its energy consumption and data latency, thereby improving its overall computing efficiency.

Description

Many-core processor-based streaming data processing method and computing device
Technical field
The present invention relates to the technical field of processors, and in particular to a streaming data processing method and computing device based on a many-core processor.
Background art
In the present era, artificial intelligence technology is advancing rapidly, affecting people's production and daily life in every respect and driving the development and progress of the world. In recent years, researchers have found neural network algorithms to be highly effective for processing unstructured data, in tasks such as face recognition, speech recognition, and image classification. With the exponential growth of such unstructured data, the demands on processor computing power keep increasing. The computing power of traditional central processing units (CPUs) and digital signal processors (DSPs) can no longer meet these demands; therefore, how to improve processor computing power and efficiency is an urgent problem.
Summary of the invention
In view of the above problems, the present invention provides a many-core processor-based streaming data processing method and computing device that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a many-core processor-based streaming data processing method is provided, the many-core processor including multiple computing cores, and the method including:
dividing the multiple computing cores into N computing core groups, the N computing core groups executing different data processing methods;
receiving data, the data including image, video, audio, or data transmitted from a sensor;
inputting the data into the corresponding computing core group among the N computing core groups to perform a data processing task;
outputting the processed data.
Optionally, inputting the data into the corresponding computing core group among the N computing core groups to perform a data processing task includes:
inputting the data into one computing core group; or inputting the data into multiple computing core groups among the N computing core groups to perform different data processing tasks.
Optionally, inputting the data into the corresponding computing core group among the N computing core groups to perform a data processing task includes:
inputting the data into a first computing core group to perform a first data processing task.
Optionally, inputting the data into the first computing core group to perform the first data processing task includes:
dividing the data into multiple data blocks and dividing the multiple data blocks into M groups, each of the M groups including the same number K of data blocks, where M is greater than or equal to 2 and K is greater than or equal to 1;
allocating K corresponding computing cores to the K data blocks;
inputting the K data blocks in the first of the M groups in parallel into the computing core corresponding to each of the K data blocks, the computing cores performing calculation processing on their corresponding data blocks.
By dividing the data into subtasks according to the processing capability of each computing core, the many-core processor's processing of image data can be improved, along with the processing accuracy.
Optionally, after inputting the K data blocks in the first of the M groups in parallel into the computing core corresponding to each of the K data blocks and having the computing cores perform calculation processing on them, the method further includes:
inputting the K data blocks in the second of the M groups in parallel into the K computing cores, the K computing cores performing calculation processing on the K data blocks in the second group. Whether the many-core processor adopts the operation-transmission parallel mode or the serial mode, a pipelined sequential working mode can be formed among the cores, ensuring orderly data processing while improving processing efficiency.
Optionally, during the parallel input, any computing core, while performing its in-core operations, transmits at least part of the data whose computation is complete to the next computing core in the first computing core group, to execute the computation of the next layer in the first data processing task.
Optionally, each computing core in the many-core processor includes:
a weight storage unit for storing at least one of the data, data weights, and operation instructions required when performing neural network calculations;
a calculation unit for accessing the data stored in the weight storage unit and performing arithmetic processing on the data;
a control unit for generating operation instructions and controlling the calculation unit to perform data operations based on the operation instructions;
a routing unit for sending and receiving data and for data communication within the computing core or between multiple computing cores.
By providing a private weight storage unit in each computing core of the many-core processor, each computing core can quickly read the data it needs when performing operations.
Optionally, the communication modes between the multiple computing cores include at least one of a 2D mesh mode, a 3D ring mode, and a one-to-one direct-connection mode.
According to another aspect of the present invention, a computing device is further provided, including a many-core processor for running a computer program, where:
when performing data processing, the many-core processor adopts any one of the many-core processor-based streaming data processing methods described above.
Optionally, the computing device further includes:
a storage device for storing a computer program, the computer program being loaded and executed by the processor when it runs in the computing device.
The present invention provides a more efficient many-core processor-based streaming data processing method: the many-core processor is first divided into N computing core groups, and the data received by the many-core processor is then input into the corresponding computing core group among the N groups, which executes data processing tasks in parallel and outputs the results. In the present invention, for each computing core, data can be arranged in a pipeline and input by time period, ensuring the accuracy of data processing. The solution provided by the present invention can increase the many-core processor's data processing speed while reducing its energy consumption and data latency, thereby improving its overall computing efficiency.
The above description is only an overview of the technical solution of the present invention. To enable a clearer understanding of the technical means of the present invention so that it may be implemented according to the contents of the specification, and to make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
From the following detailed description of specific embodiments of the present invention in conjunction with the accompanying drawings, the above and other objects, advantages, and features of the present invention will become clearer to those skilled in the art.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 is a schematic structural diagram of a computing core according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a many-core processor architecture according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a many-core processor architecture in 2D mode according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a many-core processor-based streaming data processing method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a many-core processor divided into multiple computing core groups according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of dividing input data into blocks and processing them according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a data division process according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of a neural network mapping according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of the input and output of a many-core processor according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of the operation-transmission parallel mode of a many-core processor according to an embodiment of the present invention;
Fig. 11 is a schematic diagram of the operation-transmission serial mode of a many-core processor according to an embodiment of the present invention;
Fig. 12 is a schematic diagram of the timing of the parallel working mode in a many-core processor according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thoroughly understood and its scope fully conveyed to those skilled in the art.
Adopting an integrated storage-and-processing many-core architecture can effectively improve the computing power and efficiency of a chip. Integrated storage and processing means that storage is localized and merged with the processor: on the one hand, this greatly saves the energy consumed by moving data; on the other hand, in deep learning, when a static graph is computed through a neural network, the computed graph can remain unchanged during training or inference applications.
An embodiment of the present invention provides a many-core processor that includes multiple computing cores. Fig. 1 shows the structure of a computing core; as shown, each computing core may include modules such as a weight storage unit, a calculation unit, a routing unit, and a control unit. The weight storage unit stores at least one of the data, data weights, and operation instructions required when performing neural network calculations or other data processing algorithms, and may also hold temporary or other data. The calculation unit accesses the data stored in the weight storage unit and performs arithmetic processing on it; it may include multipliers, adders, and other special-function processing modules, and is responsible for computations such as matrix-vector operations or logic operations. The control unit generates operation instructions and controls the calculation unit to perform data operations based on them; it can specifically control the flow of data, select the corresponding operations, and so on. The routing unit sends and receives data, providing data communication within the computing core or between multiple computing cores. During operation, the data required by the calculation unit is input from the routing unit to the calculation unit, which processes it and outputs the result through the routing unit. Beyond the above, each computing core may also include a cache unit or other units, which the present invention does not limit.
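The four modules of a computing core described above can be sketched as follows. This is a minimal, hypothetical Python illustration; the class and method names (ComputeCore, load_weights, route_to) and the toy matrix-vector computation are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch of the computing-core modules described above
# (weight storage, calculation, control, routing). All names are
# hypothetical; the patent does not specify an implementation.

class ComputeCore:
    def __init__(self, core_id):
        self.core_id = core_id
        self.weight_store = {}   # weight storage unit: weights / instructions
        self.inbox = []          # routing unit: received data
        self.outbox = []         # routing unit: data to send

    def load_weights(self, layer, weights):
        # control unit pre-loads the weights this core will need
        self.weight_store[layer] = weights

    def compute(self, layer, vector):
        # calculation unit: a toy matrix-vector product using stored weights
        weights = self.weight_store[layer]
        return [sum(w * x for w, x in zip(row, vector)) for row in weights]

    def route_to(self, other, data):
        # routing unit: inter-core communication
        other.inbox.append(data)


core = ComputeCore(0)
core.load_weights("layer0", [[1, 0], [0, 2]])
result = core.compute("layer0", [3, 4])   # [3, 8]
```

The private weight store mirrors the point made in the summary: each core can read the data it needs for its operations locally, without fetching it from a shared memory.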
In addition, the weight storage units of multiple computing cores in the many-core processor can be separated out and merged into one weight storage core. As shown in Fig. 2, the weight storage core can simultaneously provide weights to computing core 1, computing core 2, and computing core 3; in this mode, partial weight sharing can be realized, and computing cores 1, 2, and 3 can be homogeneous computing cores. In practice, when selecting which computing cores' weight storage units to separate out and merge into a weight storage core, the selection can be made according to different requirements or the characteristics of each computing core, which the present invention does not limit.
Within the many-core processor, the communication modes between computing cores can include at least one of a 2D mesh mode, a 3D ring mode, and a one-to-one direct-connection mode. The 2D mesh mode is shown in Fig. 3: each computing core is connected via routing to the computing cores in the four directions above, below, left, and right. In this mode, data can be routed to any designated computing core.
For the many-core processor with multiple computing cores introduced in the above embodiment, an embodiment of the present invention provides a many-core processor-based streaming data processing method. As shown in Fig. 4, the method may include:
Step S401: divide the multiple computing cores into N computing core groups, the N computing core groups executing different data processing methods. The data processing method may be a neural network algorithm or another data processing algorithm.
In this embodiment, the multiple computing cores in the many-core processor can be divided into N computing core groups, each of which can execute a different data processing method, where N is an integer greater than zero. As shown in Fig. 5, the many-core processor is divided into three computing core groups, and three different neural network algorithms (neural network algorithm A, neural network algorithm B, and neural network algorithm C) can each be mapped to a different computing core group. Since the computing cores operate independently, the three neural networks can receive data simultaneously and perform the corresponding data processing in parallel. In practice, the multiple computing cores can be divided according to the computing power requirements of the different algorithms or other requirements, which the present invention does not limit.
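The division of cores into groups, one per algorithm, can be sketched as below. This is a hypothetical illustration of step S401; the group sizes and the `group_cores` helper are assumptions chosen to mirror Fig. 5's three-group example, not values from the patent.

```python
# Hypothetical sketch of dividing cores into N groups, one per algorithm,
# as in Fig. 5 (three groups for neural network algorithms A, B, C).

def group_cores(core_ids, group_sizes):
    """Split core_ids into consecutive groups of the given sizes."""
    groups, start = [], 0
    for size in group_sizes:
        groups.append(core_ids[start:start + size])
        start += size
    return groups

cores = list(range(12))
# e.g. algorithm A needs 6 cores, B needs 4, C needs 2 (illustrative sizes)
groups = group_cores(cores, [6, 4, 2])
mapping = dict(zip(["A", "B", "C"], groups))
# mapping["A"] == [0, 1, 2, 3, 4, 5]
```

Because each group owns disjoint cores, the three algorithms can accept input and run in parallel, as the embodiment describes.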
Step S402: receive data, the data including image, video, audio, or data transmitted from a sensor. In practice, the data on which the many-core processor must perform computation may be image data, audio data, video data, or other to-be-processed data transmitted from a sensor.
Step S403: input the data into the corresponding computing core group among the N computing core groups to perform a data processing task. Further, the data may be input into one computing core group, or into multiple computing core groups among the N groups to perform different data processing tasks.
Step S404: output the processed data.
After receiving the data, the many-core processor can input it into a computing core group for the corresponding data processing.
For example, if the input data needs to execute neural network algorithm A, the data is input into the computing core group corresponding to neural network algorithm A for data processing, and the processing result is output.
Optionally, before any set of data is input into the computing cores for calculation processing, it may first be preprocessed. That is, step S403 above inputs the data into the first computing core group to perform the first data processing task, as follows:
S403-1: divide the data into multiple data blocks, and divide the multiple data blocks into M groups, each of the M groups including the same number K of data blocks, where M is greater than or equal to 2 and K is greater than or equal to 1. When dividing the data blocks, the data can be divided into M groups of the same number K of data blocks according to the amount of computation each computing core can perform in a specified time.
Fig. 6 is a schematic diagram of dividing input data into blocks and processing them. The description takes dividing the data into 9 blocks in three groups as an example; the embodiment of the present invention does not limit how many blocks or groups the data is divided into, and Fig. 6 is only a schematic diagram. As shown in Fig. 6, the data is divided into 9 data blocks, and the 9 data blocks are divided into three groups, each group including 3 data blocks.
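Step S403-1's grouping can be sketched as follows. This is a minimal assumption-laden sketch: it uses equal-sized blocks for simplicity, whereas the patent also allows unequal task amounts (Fig. 7); the `partition` function and the example sizes are illustrative only.

```python
# A minimal sketch of step S403-1: divide data into M groups of K blocks
# each (Fig. 6 uses M = 3, K = 3, i.e. 9 blocks). Equal-sized blocks are
# assumed here for simplicity; the patent allows unequal task amounts.

def partition(data, m, k):
    """Split `data` into m groups, each containing k equal-sized blocks."""
    total = m * k
    if len(data) % total:
        raise ValueError("data length must be divisible by M * K")
    size = len(data) // total
    blocks = [data[i * size:(i + 1) * size] for i in range(total)]
    return [blocks[g * k:(g + 1) * k] for g in range(m)]

groups = partition(list(range(18)), m=3, k=3)
# 3 groups x 3 blocks x 2 elements; group 0 holds [0,1], [2,3], [4,5]
```

Each of the K blocks in a group is then assigned to one of K cores (step S403-2), so the K blocks of one group can be processed in parallel.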
S403-2: allocate K corresponding computing cores to the K data blocks.
As shown in Fig. 6, computing core 1, computing core 2, and computing core 3 are allocated to the first, second, and third data blocks.
S403-3: input the K data blocks in the first of the M groups in parallel into the computing core corresponding to each of the K data blocks, and have the computing cores perform calculation processing on their corresponding data blocks.
As shown in Fig. 6, the first, second, and third data blocks are input in parallel into computing core 1, computing core 2, and computing core 3, which perform calculation processing on them.
Fig. 7 shows a schematic diagram of the data division process according to an embodiment of the present invention. As shown in Fig. 7, the data can be divided into equal or unequal task amounts according to the processing capability of each core within a specified time. In Fig. 7, time T1 corresponds to Input1 (input 1), time T2 corresponds to Input2 (input 2), time T3 corresponds to Input3 (input 3), and so on. The data block in each group of data may correspond to one computation subtask within the data's overall computation task, or to a collection of multiple subtasks.
因此,图6中的数据块大小可以不相等。
In general, a neural network consists of multiple layers of computation, divided into an input layer, an output layer, and intermediate hidden layers, as shown in Fig. 8. In Fig. 8, the circles represent neurons, and the arrows preceding the output layer and the hidden layers represent weights; the weights form matrices whose dimensions are determined by the numbers of neurons in the two adjacent layers. Before the chip processor runs, the weights are stored in a distributed manner across the computing cores (or weight storage cores); that is, the weights of a given layer of the neural network algorithm are allocated to one or several computing cores (or weight storage cores). When preprocessed input data enters the many-core processor, it is first sent to the cores corresponding to the first layer for processing; once that processing is complete, the data is sent to the cores pre-assigned to the second layer, and then processed in turn by the cores corresponding to the third, fourth, ..., and n-th layers, until the output layer of the neural network is completed. Fig. 9 illustrates the core mapping of a simple neural network: an 8-layer network is deployed across 8 different computing cores, each core completing one layer of the computation; after passing through the 8 cores, the network output is emitted by computing core 7.
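The layer-to-core mapping of Fig. 9 amounts to function composition along a chain of cores, which can be sketched as below. The "layers" here are placeholder functions, an assumption for illustration; in the actual processor each stage would be a core applying its stored layer weights.

```python
# Minimal sketch of the Fig. 9 mapping: an 8-layer network deployed on
# 8 cores, each core applying exactly one layer, so data flows through
# the cores in pipeline order and emerges from the last core.

def make_pipeline(layers):
    """Model one core per layer; running data through the cores in
    order is equivalent to composing the layers."""
    def run(x):
        for core_id, layer in enumerate(layers):
            x = layer(x)          # the computation done by this core
        return x
    return run

# Placeholder "layers": core k adds k to its input (purely illustrative).
pipeline = make_pipeline([lambda x, k=k: x + k for k in range(8)])
result = pipeline(0)
```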
Therefore, after the data has been divided into multiple data blocks, the groups of blocks can be fed, in successive time slots, into the pre-configured core group of the many-core processor to run the first layer of the neural network. After step S403-3 above inputs the K data blocks of the first of the M groups in parallel into the computing core corresponding to each block for processing, the method may further include:
Step S403-4: inputting the K data blocks of the second of the M groups in parallel into the K computing cores, the K computing cores performing computation on the second group's K data blocks. Further, during the parallel input of steps S403-3 and S403-4, for any computing core, while the core performs its intra-core computation, at least part of the completed results may be transmitted to the next computing core in the first core group, so as to execute the computation of the next layer of the first data processing task.
As shown in Fig. 6, during the computation on the first, second, and third data blocks, if the partial results already completed are sufficient for computing core 4 to begin computing, core 4 starts its computation; it will be understood that within the same time slot, core 4 and cores 1, 2, and 3 are all performing computation.
Optionally, when multiple computing cores perform processing, they may process the data blocks in either a compute-transmit parallel mode or a compute-transmit serial mode. In other words, data transfer for intra-core computation in the many-core processor can proceed in two ways, the compute-transmit parallel mode and the compute-transmit serial mode; in either mode, the multiple cores can form a pipelined, sequential working pattern.
Fig. 10 is a schematic diagram of the compute-transmit parallel mode of the many-core processor according to an embodiment of the present invention. When multiple cores process the data blocks in this mode, any core transmits at least part of its completed results to the next destination core while still performing its intra-core computation. That is, in this mode, while computation proceeds inside a core, partially completed results are transmitted to the corresponding next destination core, so intra-core computation and inter-core data transfer can occur simultaneously.
Taking the parallel compute-transmit mode of Fig. 10 as an example: first, input data is fed in time order into computing core 1 for processing; meanwhile, the output data produced by core 1 is routed to computing core 5 as its input. After all current computation has finished, a further wait is needed before all the data has been transferred; once all transfers complete, the current time slot T1 ends. After slot T2 begins, computation proceeds inside core 5, and once all of its data has been processed and transferred to the next destination core (core 2), slot T2 ends. In slot T3, core 2 performs processing and transfer. After T3, in the subsequent slots (T4-T6), data is processed and the corresponding transfers completed in the order of cores 3, 6, and 9. Note that the lengths of the time slots may or may not be equal: the duration of each slot is determined by the moment the last piece of data arrives at its destination core, which also triggers the next slot.
Fig. 11 is a schematic diagram of the compute-transmit serial mode of the many-core processor according to an embodiment of the present invention. When multiple cores process the data blocks in this mode, any core transmits all of its completed results to the next destination core only after finishing its intra-core computation. In this mode, only when all intra-core computation has completed, i.e. when all input data has been processed, is the data transmitted to the corresponding next destination core. At any given moment, only one of the two activities is underway within a single core: intra-core computation and inter-core data transfer execute serially.
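The difference between the two modes of Figs. 10 and 11 can be captured with a toy timing model. The cost figures and the simple max/sum rule are assumptions for illustration only; they are not taken from the specification.

```python
# Toy timing model contrasting the two modes above: in serial mode a
# core finishes all computation before transmitting, so the stage time
# is compute + transmit; in parallel mode transmission overlaps the
# computation, so the stage ends when the slower of the two finishes.

def stage_time(compute, transmit, parallel):
    """Duration of one core's time slot under assumed unit costs."""
    if parallel:
        return max(compute, transmit)   # transfer overlaps compute
    return compute + transmit           # transfer starts after compute ends

# Three pipeline stages with assumed costs: 4 units compute, 3 transmit.
serial_total = sum(stage_time(4, 3, parallel=False) for _ in range(3))
parallel_total = sum(stage_time(4, 3, parallel=True) for _ in range(3))
```

Under these assumed costs the parallel mode shortens each slot, which is consistent with the text's point that overlapping computation and transfer reduces overall latency.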
The processing of a single set of data can follow the above flow directly. When multiple sets of data are processed, i.e., when the same neural network algorithm handles different tasks (for example, face recognition on different images), the multiple sets can likewise be ordered based on the above steps, and the computing task corresponding to each set is divided into multiple compute subtasks; the subtasks of the first set of data are fed in sequence into the multiple cores of the many-core processor for processing; after the many-core processor completes the first subtask of the first set, its multiple cores all work in parallel; and once the first core of the many-core processor has finished processing the last subtask of the first set, the second set of data is fed into the processor for processing, and so on for the third through the N-th sets.
As shown in Fig. 12, suppose two sets of data are to be processed, whose computing tasks are task 1 and task 2 respectively. Before they are input into the many-core processor, tasks 1 and 2 must each first be divided into subtasks; each individual task (task 1, task 2) may be split into different compute subtasks, which are then fed into the processor in sequence. After the first subtask completes (T1 through T7 in Fig. 12), all the cores work in parallel, each processing a different layer of the network. Moreover, once the last subtask of task 1 has finished processing on the first core (core 1), the first subtask of task 2 can immediately enter core 1 for processing, achieving seamless handover between tasks.
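The seamless handover of Fig. 12 can be sketched with a unit-time pipeline model. This is a hedged illustration: the unit step time, the function name, and the makespan formula are assumptions introduced here, not part of the specification.

```python
# Sketch of the Fig. 12 multi-task pipeline: subtasks flow through a
# chain of cores, one unit step per core. As soon as core 0 has emitted
# the last subtask of one task, the first subtask of the next task
# enters core 0, so tasks abut with no idle slot.

def schedule(num_cores, subtasks_per_task, num_tasks):
    """Return (core-0 entry step of each subtask, total makespan) for a
    unit-time pipeline of `num_cores` chained cores."""
    starts, step = {}, 0
    for t in range(num_tasks):
        for s in range(subtasks_per_task):
            starts[(t, s)] = step   # core 0 is busy one step per subtask
            step += 1
    # The last subtask enters core 0 at step-1 and still crosses all cores.
    makespan = (step - 1) + num_cores
    return starts, makespan

# Two tasks of 7 subtasks on a 7-core chain, loosely mirroring Fig. 12.
starts, makespan = schedule(num_cores=7, subtasks_per_task=7, num_tasks=2)
```

In this model task 2's first subtask enters core 0 exactly one step after task 1's last subtask did, which is the "no gap between tasks" property described in the text.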
Building on the pipelined, sequential working mode for a single task, the many-core pipeline architecture can also support multi-task parallelism and inter-task interaction. That is, when multiple sets of data are to be processed, the many-core processor, when configuring cores for the sets, selects multiple core groups, each comprising multiple computing cores; the sets of data are input into different core groups of the many-core processor, and the core groups execute the computation on the sets in parallel. The data processing algorithms executed by the core groups may be the same or different.
Based on the same inventive concept, an embodiment of the present invention further provides a computing device comprising a many-core processor for running a computer program, wherein
when performing data processing, the many-core processor employs the streaming data processing method based on a many-core processor described in any of the above embodiments.
Optionally, the computing device may further include a storage device for storing a computer program which, when run on the computing device, is loaded and executed by a processor.
Embodiments of the present invention provide a more efficient streaming data processing method based on a many-core processor, and a computing device: the many-core processor is first divided into N core groups, and the data received by the processor is then input into the corresponding core group among the N groups to execute the data processing task and output the processing result. In embodiments of the present invention, for each computing core the data can be arranged in a pipelined fashion and input by time slot, ensuring the correctness of the processing. The solution provided by the embodiments can increase the many-core processor's data processing speed while reducing its power consumption and data latency, thereby improving the processor's overall computational efficiency.
The description provided herein sets forth numerous specific details. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure an understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the devices of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may in addition be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments rather than others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
At this point, those skilled in the art will appreciate that, although multiple exemplary embodiments of the invention have been shown and described in detail herein, many other variations or modifications conforming to the principles of the invention can still be determined or derived directly from the disclosure without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and deemed to cover all such other variations or modifications.

Claims (10)

  1. A streaming data processing method based on a many-core processor, characterized in that the many-core processor comprises a plurality of computing cores, and the method comprises:
    dividing the plurality of computing cores into N core groups, the N core groups executing different data processing methods;
    receiving data, the data comprising images, video, audio, or data transmitted from a sensor;
    inputting the data into the corresponding core group among the N core groups to execute a data processing task;
    outputting the processed data.
  2. The method of claim 1, wherein inputting the data into the corresponding core group among the N core groups to execute a data processing task comprises:
    inputting the data into a single core group; or inputting the data into multiple core groups among the N core groups, so as to execute different data processing tasks.
  3. The method of claim 1 or 2, wherein inputting the data into the corresponding core group among the N core groups to execute a data processing task comprises:
    inputting the data into a first core group to execute a first data processing task.
  4. The method of claim 3, wherein inputting the data into the first core group to execute the first data processing task comprises:
    dividing the data into multiple data blocks, and dividing the multiple data blocks into M groups, each of the M groups comprising the same number K of data blocks, where M is greater than or equal to 2 and K is greater than or equal to 1;
    allocating K corresponding computing cores for the K data blocks;
    inputting the K data blocks of the first of the M groups in parallel into the computing core corresponding to each of the K data blocks, the computing cores performing computation on their corresponding data blocks.
  5. The method of claim 4, wherein, after inputting the K data blocks of the first of the M groups in parallel into the computing core corresponding to each of the K data blocks and having the computing cores perform computation on their corresponding data blocks, the method further comprises:
    inputting the K data blocks of the second of the M groups in parallel into the K computing cores, the K computing cores performing computation on the second group's K data blocks.
  6. The method of claim 4 or 5, wherein, during the parallel input, for any computing core, while the core performs its intra-core computation, at least part of the completed results is transmitted to the next computing core in the first core group, so as to execute the computation of the next layer of the first data processing task.
  7. The method of any one of claims 1-6, wherein each of the computing cores in the many-core processor comprises:
    a weight storage unit, configured to store at least one of the data, data weights, and operation instructions required for executing neural network computation;
    a computing unit, configured to access the data stored in the weight storage unit and perform arithmetic processing on the data;
    a control unit, configured to generate operation instructions and to control the computing unit to perform data operations based on the operation instructions;
    a router, configured to send and receive data, handling data communication within the computing core or between multiple computing cores.
  8. The method of claim 7, wherein communication between the multiple computing cores comprises at least one of a 2D mesh mode, a 3D ring mode, and a one-to-one direct-connection mode.
  9. A computing device comprising a many-core processor for running a computer program, characterized in that
    when performing data processing, the many-core processor employs the streaming data processing method based on a many-core processor of any one of claims 1-8.
  10. The computing device of claim 9, wherein the computing device further comprises:
    a storage device, configured to store a computer program which, when run on the computing device, is loaded and executed by a processor.
PCT/CN2020/087013 2019-06-21 2020-04-26 Streaming data processing method based on many-core processor, and computing device WO2020253383A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910540896.5A CN112114942A (zh) 2019-06-21 2019-06-21 Streaming data processing method based on many-core processor, and computing device
CN201910540896.5 2019-06-21

Publications (1)

Publication Number Publication Date
WO2020253383A1 (zh) 2020-12-24

Family

ID=73796165


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112631982A (zh) * 2020-12-25 2021-04-09 清华大学 基于众核架构的数据交换方法及装置
CN112835719B (zh) * 2021-02-10 2023-10-31 北京灵汐科技有限公司 任务处理的方法和装置、众核系统、计算机可读介质
CN115098262B (zh) * 2022-06-27 2024-04-23 清华大学 一种多神经网络任务处理方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156271A (zh) * 2014-08-01 2014-11-19 浪潮(北京)电子信息产业有限公司 一种协同计算集群负载均衡的方法及系统
US20180275967A1 (en) * 2017-03-27 2018-09-27 Microsoft Technology Licensing, Llc Neural network for program synthesis
CN109359736A (zh) * 2017-04-06 2019-02-19 上海寒武纪信息科技有限公司 网络处理器和网络运算方法
CN110502330A (zh) * 2018-05-16 2019-11-26 上海寒武纪信息科技有限公司 处理器及处理方法


Also Published As

Publication number Publication date
CN112114942A (zh) 2020-12-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20827580; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20827580; Country of ref document: EP; Kind code of ref document: A1)