TWI805820B - Neural processing system

Neural processing system

Info

Publication number
TWI805820B
Authority
TW
Taiwan
Prior art keywords
end module
neural processing
operation result
processing unit
data
Prior art date
Application number
TW108127870A
Other languages
Chinese (zh)
Other versions
TW202011279A (en)
Inventor
宋陳煜
朴峻奭
趙胤校
Original Assignee
Samsung Electronics Co., Ltd. (South Korea)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd. (South Korea)
Publication of TW202011279A
Application granted
Publication of TWI805820B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00: Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/04: Generating or distributing clock signals or signals derived directly therefrom
    • G06F1/06: Clock generators producing several clock signals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00: Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26: Power supply means, e.g. regulation thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Power Sources (AREA)
  • Image Analysis (AREA)
  • Electrotherapy Devices (AREA)
  • Feedback Control In General (AREA)
  • Hardware Redundancy (AREA)

Abstract

A neural processing system includes a first frontend module, a second frontend module, a first backend module, and a second backend module. The first frontend module executes a feature extraction operation using a first feature map and a first weight, and outputs a first operation result and a second operation result. The second frontend module executes the feature extraction operation using a second feature map and a second weight, and outputs a third operation result and a fourth operation result. The first backend module receives the first operation result provided from the first frontend module and the fourth operation result provided from the second frontend module via a second bridge, and sums the first operation result and the fourth operation result. The second backend module receives the third operation result provided from the second frontend module and the second operation result provided from the first frontend module via a first bridge, and sums the third operation result and the second operation result.

Description

Neural processing system

The present disclosure relates to a neural processing system.

Deep learning refers to a class of computation based on deep learning architectures that use sets of algorithms to model high-level abstractions of input data with a deep graph having multiple processing levels arranged in a hierarchy. In general, a deep learning architecture may include multiple layers of neurons and parameters. Among deep learning architectures, the convolutional neural network (CNN) is widely used in many artificial intelligence and machine learning applications, such as image classification, image caption creation, visual question answering, and autonomous vehicles.

Because a CNN system includes many parameters and must perform many operations, for example for image classification, it is highly complex. Accordingly, when implementing a CNN system, the cost of the hardware resources becomes a problem, as does the power those hardware resources consume. In particular, for CNNs implemented in recent mobile systems (e.g., mobile communication devices), an architecture is required that can implement artificial intelligence at low cost and with low power consumption.

Aspects of the present disclosure provide a neural network system capable of implementing artificial intelligence at low cost and with low power consumption.

However, aspects of the present disclosure are not limited to those set forth herein. The above and other aspects of the present disclosure will become more apparent to one of ordinary skill in the art to which the present disclosure pertains by reference to the detailed description given below.

According to an aspect of the present disclosure, a neural processing system includes a first frontend module, a second frontend module, a first backend module, and a second backend module. The first frontend module executes a feature extraction operation using a first feature map and a first weight, and outputs a first operation result and a second operation result. The second frontend module executes the feature extraction operation using a second feature map and a second weight, and outputs a third operation result and a fourth operation result. The first backend module receives the first operation result provided from the first frontend module and the fourth operation result provided from the second frontend module through a second bridge, and sums the first operation result and the fourth operation result. The second backend module receives the third operation result provided from the second frontend module and the second operation result provided from the first frontend module through a first bridge, and sums the third operation result and the second operation result.

According to another aspect of the present disclosure, a neural processing system includes a first neural processing unit, a bridge unit, and a second neural processing unit. The first neural processing unit includes a first frontend module and a first backend module. The bridge unit is electrically connected to the first neural processing unit. The second neural processing unit operates in a clock domain different from that of the first neural processing unit. The first frontend module provides, to the first backend module, a part of a first operation result obtained by executing a feature extraction operation using a first feature map and a first weight. The bridge unit provides, to the first backend module, a part of a second operation result produced in the second neural processing unit. The first backend module sums the part of the first operation result and the part of the second operation result.

According to still another aspect of the present disclosure, a neural processing system includes a first neural processing unit, a second neural processing unit, and a workload manager. The first neural processing unit includes a first frontend module and a first backend module. The second neural processing unit includes a second frontend module and a second backend module. The workload manager distributes first data, among the data on which feature extraction is to be performed, to the first neural processing unit, and distributes second data among that data to the second neural processing unit. The first frontend module executes a feature extraction operation on the first data using a first feature map and a first weight, and outputs a first operation result and a second operation result. The second frontend module executes the feature extraction operation on the second data using a second feature map and a second weight, and outputs a third operation result and a fourth operation result. The first backend module sums the first operation result and the fourth operation result. The second backend module sums the third operation result and the second operation result.

FIG. 1 is a schematic diagram illustrating a computing system according to an embodiment of the present disclosure.

Referring to FIG. 1, a computing system 1 according to an embodiment of the present disclosure includes a neural processing system 10, a clock management unit (CMU) 20, a processor 30, and a memory 40. The neural processing system 10, the processor 30, and the memory 40 may transmit and receive data through a bus 90. The neural processing system 10 may be or may include one or more neural network processors, which may implement a convolutional neural network (CNN), for example, by executing instructions and processing data. However, the present disclosure is not limited thereto; the neural processing system 10 may alternatively be implemented by a processor that handles arbitrary vector operations, matrix operations, and the like. The neural processing system 10 may include instructions stored therein, or may execute instructions stored in the memory 40 or received dynamically from an external source. The neural processing system 10 may also include memory that is dynamically updated during the learning processes described herein, so that learned content is updated to reflect new learning. An example of a neural network processor is a graphics processing unit (GPU), but more than one processor (e.g., multiple graphics processors) may be used to implement the neural processing system 10. Accordingly, the neural processing system 10 as used herein includes at least a neural network processor, but may also be considered to include functionally separable but interdependent software modules, functionally separable but interdependent circuit modules of individual circuit elements, data and memory specific to each module and/or unit, and the other elements described herein. Meanwhile, although the neural processing system 10 is illustrated in FIG. 1 and described with reference to FIG. 1 as being separate from the clock management unit 20, the processor 30, and the memory 40, the functions implemented by the neural processing system 10 may be implemented in part through, or using the resources of, the clock management unit 20, the processor 30, and the memory 40.

In addition, the computing system 1 in FIG. 1 may be a computer system including one or more computing devices, each including one or more processors. The processors of the computing system 1 are tangible and non-transitory. The term "non-transitory" expressly disavows ephemeral characteristics, such as those of a carrier wave or signal or of another form that exists only transitorily at any place at any time. A processor is an article of manufacture and/or a machine component. A processor of a computer system for implementing the neural processing system 10 in FIG. 1, or the other embodiments herein, is configured to execute software instructions to perform functions as described in the various embodiments herein. A processor of a computer system may be a general-purpose processor, part of an application-specific integrated circuit (ASIC), a microprocessor, a microcomputer, a processor chip, a controller, a microcontroller, a digital signal processor (DSP), a state machine, or a programmable logic device. A processor of a computer system may also be a logic circuit including a programmable gate array (PGA) (e.g., a field-programmable gate array (FPGA)), or another type of circuit including discrete gate and/or transistor logic. A processor may be a central processing unit (CPU), a graphics processing unit (GPU), or both. In addition, any processor described herein may include multiple processors, parallel processors, or both. Multiple processors may be included in, or coupled to, a single device or multiple devices.

A computer system implementing the computing system 1 in FIG. 1 may perform all or part of the methods described herein. For example, functions such as feature extraction, summation, and activation described herein may be implemented by a computer system executing software instructions on one or more of the processors described herein.

In the present embodiment, the neural processing system 10 may implement and/or process a neural network that includes multiple layers (e.g., feature extraction layers and feature classification layers). Here, the feature extraction layers correspond to the initial layers of the neural network and may be used, for example, to extract low-level features such as edges and gradients from an input image. The feature classification layers, on the other hand, correspond to the secondary layers of the neural network and may be used, for example, to extract more complex, high-level features such as a face, eyes, or a nose from the input image. In other words, the feature extraction layers may be viewed as extracting low-level features before the feature classification layers extract the more complex, high-level features. The feature classification layers correspond to fully-connected layers.

To extract features from an input image, the neural processing system 10 may apply a filter or kernel to the input image or to a feature map. For example, the neural processing system 10 may use a convolution filter or convolution kernel to perform a convolution operation on the input image or feature map. In addition, the neural processing system 10 may operate with weights corresponding to the feature maps, the weights being determined according to the purpose of the specific implementation.
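
As an illustration of the kernel-based feature extraction just described, the following sketch computes a valid-mode 2D convolution in plain Python/NumPy. As in most CNN frameworks, it is technically a cross-correlation; the toy image and the edge-detection kernel are made-up example values, not anything prescribed by the disclosure.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply-accumulate over the kernel-sized window.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 edge-detection kernel applied to a toy 5x5 "image".
image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)
feature_map = conv2d(image, edge_kernel)   # 3x3 output feature map
```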

In the present embodiment, it should be particularly noted that the neural processing system 10 includes a plurality of neural processing units, including a first neural processing unit 100a and a second neural processing unit 100b. The first neural processing unit 100a and the second neural processing unit 100b may be implemented by physically separate neural network processors as described above, and/or by logically and/or functionally separate software modules executed by the same or by different physically separate neural network processors. For convenience of explanation, the neural processing system 10 is shown in this embodiment as including the first neural processing unit 100a and the second neural processing unit 100b, but the scope of the present disclosure is not limited thereto. Depending on the purpose of the specific implementation, the neural processing system 10 may include n neural processing units (where n is a natural number of 2 or more).

Using multiple neural processing units, such as the first neural processing unit 100a and the second neural processing unit 100b described herein, provides several practical opportunities to reduce cost and/or power consumption.

The clock management unit 20 generates a first clock signal CLK1 and a second clock signal CLK2 for driving the neural processing system 10, and provides them to the first neural processing unit 100a and the second neural processing unit 100b. Accordingly, the first neural processing unit 100a is driven according to the first clock signal CLK1, and the second neural processing unit 100b is driven according to the second clock signal CLK2. As explained herein, the different clocks of different neural processing units, such as the first neural processing unit 100a and the second neural processing unit 100b, may be selectively controlled in a manner that reduces power consumption, increases power consumption, reduces processing speed, or increases processing speed.

In some embodiments of the present disclosure, the frequencies of the first clock signal CLK1 and the second clock signal CLK2 may differ from each other. In other words, the clock domain in which the first neural processing unit 100a operates may be different from the clock domain in which the second neural processing unit 100b operates.

The clock management unit 20 may control the frequency of each of the first clock signal CLK1 and the second clock signal CLK2 as needed. In addition, the clock management unit 20 may perform clock gating on the first clock signal CLK1 and the second clock signal CLK2 as needed.
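
The following is a minimal behavioral sketch of this kind of clock control. It is only a software model under assumed clock names and frequencies; a real clock management unit is hardware (PLLs, dividers, and gating cells), and none of these method names come from the disclosure.

```python
class ClockManagementUnit:
    """Behavioral model: tracks per-clock frequency and gating state."""

    def __init__(self, freqs_mhz: dict):
        self.freqs = dict(freqs_mhz)                  # e.g. {"CLK1": 800, "CLK2": 400}
        self.gated = {name: False for name in self.freqs}

    def set_frequency(self, clk: str, mhz: int):
        """Scale a clock up (more performance) or down (less dynamic power)."""
        self.freqs[clk] = mhz

    def gate(self, clk: str):
        """Clock-gate an idle unit so it stops switching and saving dynamic power."""
        self.gated[clk] = True

    def ungate(self, clk: str):
        self.gated[clk] = False

cmu = ClockManagementUnit({"CLK1": 800, "CLK2": 400})
cmu.set_frequency("CLK1", 1000)   # boost the first neural processing unit
cmu.gate("CLK2")                  # second unit idle: stop its clock
```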

The processor 30 is a processor that performs general arithmetic operations, as distinguished from the artificial intelligence operations, vector operations, matrix operations, and the like processed by the neural processing system 10. The processor 30 may include, for example, a central processing unit (CPU) or a graphics processing unit (GPU), but the scope of the present disclosure is not limited thereto. In the present embodiment, the processor 30 may generally control the computing system 1.

The memory 40 may store data used when the processor 30 executes application programs or controls the computing system 1. The memory 40 may also be used to store data for the neural processing system 10, although the neural processing system 10 may include its own memory for storing instructions and data. The memory 40 may be, for example, a dynamic random-access memory (DRAM), but the scope of the present disclosure is not limited thereto. In the present embodiment, image data to be processed by the neural processing system 10, for example using a CNN, may be stored in the memory 40.

FIG. 2 is a block diagram illustrating a neural processing system according to an embodiment of the present disclosure.

Referring to FIG. 2, the neural processing system 10 according to an embodiment of the present disclosure includes a first neural processing unit 100a and a second neural processing unit 100b. A bridge unit 110 is provided between the first neural processing unit 100a and the second neural processing unit 100b. As described above, the first neural processing unit 100a and the second neural processing unit 100b may be physically and functionally separate. As explained herein, the use of one or more bridges, for example in the bridge unit 110, enhances the practical ability to selectively control the first neural processing unit 100a and the second neural processing unit 100b in a manner that reduces power consumption, increases power consumption, reduces processing speed, or increases processing speed.

First, the bridge unit 110 includes a first bridge 111 and a second bridge 112. The first bridge 111 transfers intermediate results produced by operations of the first neural processing unit 100a to the second neural processing unit 100b. The second bridge 112 transfers intermediate results produced by operations of the second neural processing unit 100b to the first neural processing unit 100a.

To this end, the first neural processing unit 100a and the second neural processing unit 100b may operate in mutually different clock domains. In this case, the bridge unit 110 may be electrically connected to the first neural processing unit 100a and to the second neural processing unit 100b operating in a clock domain different from that of the first neural processing unit 100a.

Accordingly, when the first neural processing unit 100a and the second neural processing unit 100b operate in mutually different clock domains, the first bridge 111 and the second bridge 112 included in the bridge unit 110 are implemented as asynchronous bridges so that data can be transferred between the different clock domains.
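
The behavior of such an asynchronous bridge can be sketched with a bounded queue between two threads running at different rates. In hardware this would typically be an asynchronous FIFO with synchronized read/write pointers; the Python below models only the ordering and backpressure behavior, and the sleep intervals are made-up stand-ins for the two clock periods.

```python
import queue
import threading
import time

bridge = queue.Queue(maxsize=4)            # bounded, like a hardware FIFO

def producer_domain():                     # e.g. a frontend clocked by CLK1
    for value in range(8):
        bridge.put(value)                  # blocks if the FIFO is full (backpressure)
        time.sleep(0.001)                  # "fast" clock

def consumer_domain(results):              # e.g. a backend clocked by CLK2
    for _ in range(8):
        results.append(bridge.get())       # blocks until data is available
        time.sleep(0.003)                  # "slow" clock

received = []
t1 = threading.Thread(target=producer_domain)
t2 = threading.Thread(target=consumer_domain, args=(received,))
t1.start()
t2.start()
t1.join()
t2.join()
assert received == list(range(8))          # data crosses domains in order
```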

In the present embodiment, the first neural processing unit 100a includes a first frontend module 102a and a first backend module 104a, and the second neural processing unit 100b includes a second frontend module 102b and a second backend module 104b. The first neural processing unit 100a may process first data DATA1 among the data to be processed by the neural processing system 10, and the second neural processing unit 100b may process second data DATA2 among that data. Specifically, the first frontend module 102a executes a feature extraction operation on the first data DATA1 using the first feature map and the first weight, and outputs a first operation result R11 and a second operation result R12. Likewise, the second frontend module 102b executes the feature extraction operation on the second data DATA2 using the second feature map and the second weight, and outputs a third operation result R21 and a fourth operation result R22.

The first backend module 104a receives the first operation result R11 provided from the first frontend module 102a and the fourth operation result R22 provided from the second frontend module 102b through the second bridge 112, and sums the first operation result R11 and the fourth operation result R22. Likewise, the second backend module 104b receives the third operation result R21 provided from the second frontend module 102b and the second operation result R12 provided from the first frontend module 102a through the first bridge 111, and sums the third operation result R21 and the second operation result R12.
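
This cross-summing dataflow can be condensed into a short behavioral sketch. The Python below is a minimal model under assumed shapes, not the patented hardware: an elementwise multiply stands in for the real feature extraction, and the bridges are simply function arguments.

```python
import numpy as np

def frontend(feature_map: np.ndarray, weight: np.ndarray):
    """Feature extraction; returns two partial results (e.g. two channel groups)."""
    result = feature_map * weight            # stand-in for the real convolution
    half = result.shape[0] // 2
    return result[:half], result[half:]      # (kept locally, sent over a bridge)

def backend(local_part: np.ndarray, bridged_part: np.ndarray) -> np.ndarray:
    """Sums the local partial result with the one received through a bridge."""
    return local_part + bridged_part

fmap1, w1 = np.random.rand(4, 8), np.random.rand(4, 8)
fmap2, w2 = np.random.rand(4, 8), np.random.rand(4, 8)

r11, r12 = frontend(fmap1, w1)    # first frontend module
r21, r22 = frontend(fmap2, w2)    # second frontend module

out1 = backend(r11, r22)          # first backend: R11 + R22 (via the second bridge)
out2 = backend(r21, r12)          # second backend: R21 + R12 (via the first bridge)
```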

In some embodiments of the present disclosure, the first frontend module 102a and the first backend module 104a are driven according to the first clock signal CLK1, and the second frontend module 102b and the second backend module 104b may be driven according to the second clock signal CLK2, whose frequency differs from that of the first clock signal CLK1. That is, the first frontend module 102a and the first backend module 104a may operate in a clock domain different from that of the second frontend module 102b and the second backend module 104b.

Meanwhile, in the present embodiment, the first backend module 104a may provide first write-back data WB DATA1 to the first frontend module 102a, and the second backend module 104b may provide second write-back data WB DATA2 to the second frontend module 102b. The first write-back data WB DATA1 and the second write-back data WB DATA2 are input to the first frontend module 102a and the second frontend module 102b, respectively, to allow the feature extraction operation to be repeated.
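
This write-back path is what lets the same frontend/backend pair evaluate a network layer by layer. A minimal sketch follows, with an elementwise multiply standing in for the frontend's feature extraction and ReLU for the backend's activation stage; the layer count and shapes are illustrative assumptions.

```python
import numpy as np

def frontend_op(fmap: np.ndarray, weight: np.ndarray) -> np.ndarray:
    return fmap * weight                     # stand-in for fetch/dispatch/MAC

fmap = np.random.rand(4, 4)                  # initial input feature map
weights = [np.random.rand(4, 4) for _ in range(3)]   # three "layers"

for w in weights:
    partial = frontend_op(fmap, w)           # frontend: feature extraction
    activated = np.maximum(partial, 0.0)     # backend: activation (ReLU)
    fmap = activated                         # write-back: output feeds the next layer
```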

Referring now to FIG. 3, a more detailed structure of the neural processing system 10 according to an embodiment of the present disclosure will be described.

FIG. 3 is a block diagram illustrating a neural processing system according to an embodiment of the present disclosure.

Referring to FIG. 3, the first frontend module 102a included in the first neural processing unit 100a of the neural processing system 10 according to an embodiment of the present disclosure includes a plurality of first internal memories 1021a and 1022a, a plurality of first fetch units 1023a and 1024a, a plurality of first dispatch units 1025a and 1026a, and a first MAC (multiply-and-accumulate) array 1027a.

The first internal memories 1021a and 1022a may store the first feature map and the first weight used by the first frontend module 102a for the feature extraction operations on data DATA11 and DATA12. In the present embodiment, the first internal memories 1021a and 1022a may be implemented as static random-access memory (SRAM), but the scope of the present disclosure is not limited thereto.

The first fetch units 1023a and 1024a fetch the first feature map and the first weight from the first internal memories 1021a and 1022a, respectively, and transfer the first feature map and the first weight to the first dispatch units 1025a and 1026a.

The first dispatch units 1025a and 1026a transfer the fetched first feature map and first weight to the first MAC array 1027a on a per-channel basis. For example, the first dispatch units 1025a and 1026a may select a weight and a corresponding feature map for each of k channels (where k is a natural number), and may transfer the weight and the corresponding feature map to the first MAC array 1027a.

The first MAC array 1027a performs multiply-accumulate operations on the data transferred from the first dispatch units 1025a and 1026a. For example, the first MAC array 1027a performs a multiply-accumulate operation on the data for each of the k channels. The first MAC array 1027a then outputs the first operation result R11 and the second operation result R12.
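
The per-channel multiply-accumulate can be sketched as follows, assuming k channels of equal-sized feature-map and weight tiles delivered by the dispatch units. The even split of the outputs into R11 and R12 at the end is only illustrative; FIG. 4 below uses an interleaved channel grouping instead.

```python
import numpy as np

def mac_array(feature_tiles: np.ndarray, weight_tiles: np.ndarray) -> np.ndarray:
    """feature_tiles, weight_tiles: shape (k, n). One accumulated sum per channel."""
    return np.einsum("kn,kn->k", feature_tiles, weight_tiles)

k, n = 6, 16
features = np.random.rand(k, n)
weights = np.random.rand(k, n)
per_channel = mac_array(features, weights)    # shape (6,): one MAC result per channel

r11, r12 = per_channel[:3], per_channel[3:]   # split into the two operation results
```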

Then, as described above, the first operation result R11 is provided to the first backend module 104a, and the second operation result R12 may be provided through the first bridge 111 to the second backend module 104b of the second neural processing unit 100b.

Meanwhile, the first backend module 104a included in the first neural processing unit 100a of the neural processing system 10 according to an embodiment of the present disclosure includes a first summation unit 1041a, a first activation unit 1043a, and a first write-back unit 1045a.

The first summation unit 1041a performs a summation operation on the first operation result R11 and the fourth operation result R22 to produce a summation result. Here, the fourth operation result R22 may be provided from the second frontend module 102b of the second neural processing unit 100b through the second bridge 112.

The first activation unit 1043a may perform an activation operation on the result of the summation operation to produce an activation result. In some embodiments of the present disclosure, the activation operation may include an operation using an activation function (e.g., a rectified linear unit (ReLU), a sigmoid function, or a hyperbolic tangent function (tanh)), but the scope of the present disclosure is not limited thereto.
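
For reference, the three activation functions named above are shown in a minimal NumPy sketch; the input vector is a made-up stand-in for a summation result.

```python
import numpy as np

def relu(x):    return np.maximum(x, 0.0)          # clamps negatives to zero
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))    # squashes into (0, 1)
def tanh(x):    return np.tanh(x)                  # squashes into (-1, 1)

summed = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])     # output of the summation unit
print(relu(summed))      # [0.  0.  0.  0.5 2. ]
print(sigmoid(summed))   # values in (0, 1)
print(tanh(summed))      # values in (-1, 1)
```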

The first write-back unit 1045a performs a write-back operation that provides the result of the activation operation to the first frontend module 102a. Specifically, the first write-back unit 1045a may store the result of the activation operation in the first internal memories 1021a and 1022a.

Meanwhile, the second frontend module 102b included in the second neural processing unit 100b of the neural processing system 10 according to an embodiment of the present disclosure includes a plurality of second internal memories 1021b and 1022b, a plurality of second fetch units 1023b and 1024b, a plurality of second dispatch units 1025b and 1026b, and a second MAC array 1027b.

The second internal memories 1021b and 1022b may store the second feature map and the second weight used by the second frontend module 102b for the feature extraction operations on data DATA21 and DATA22. In the present embodiment, the second internal memories 1021b and 1022b may be implemented as SRAM, but the scope of the present disclosure is not limited thereto.

The second fetch units 1023b and 1024b fetch the second feature map and the second weight from the second internal memories 1021b and 1022b, respectively, and transfer the second feature map and the second weight to the second dispatch units 1025b and 1026b.

The second dispatch units 1025b and 1026b transfer the fetched second feature map and second weight to the second MAC array 1027b on a per-channel basis. For example, the second dispatch units 1025b and 1026b may select a weight and a corresponding feature map for each of k channels (where k is a natural number), and may transfer the weight and the corresponding feature map to the second MAC array 1027b.

The second MAC array 1027b performs multiply-accumulate operations on the data transferred from the second dispatch units 1025b and 1026b. For example, the second MAC array 1027b performs a multiply-accumulate operation on the data for each of the k channels. The second MAC array 1027b then outputs the third operation result R21 and the fourth operation result R22.

Then, as described above, the third operation result R21 is provided to the second backend module 104b, and the fourth operation result R22 may be provided through the second bridge 112 to the first backend module 104a of the first neural processing unit 100a.

Meanwhile, the second backend module 104b included in the second neural processing unit 100b of the neural processing system 10 according to an embodiment of the present disclosure includes a second summation unit 1041b, a second activation unit 1043b, and a second write-back unit 1045b.

The second summation unit 1041b performs a summation operation on the third operation result R21 and the second operation result R12 to produce a summation result. Here, the second operation result R12 may be provided from the first frontend module 102a of the first neural processing unit 100a through the first bridge 111.

The second activation unit 1043b may perform an activation operation on the result of the summation operation to produce an activation result. In some embodiments of the present disclosure, the activation operation may include an operation using an activation function (e.g., a rectified linear unit (ReLU), a sigmoid function, or a hyperbolic tangent function (tanh)), but the scope of the present disclosure is not limited thereto.

The second write-back unit 1045b performs a write-back operation for providing the result of the activation operation to the second frontend module 102b. Specifically, the second write-back unit 1045b may store the result of the activation operation in the second internal memories 1021b and 1022b.

FIGS. 4 and 5 are block diagrams illustrating frontend modules of a neural processing system according to an embodiment of the present disclosure.

Referring to FIG. 4, each of the first internal memories 1021a and 1022a stores the first feature map and the first weight used for the feature extraction operations on data DATA11 and data DATA12. The first fetch units 1023a and 1024a fetch the first feature map and the first weight from the first internal memories 1021a and 1022a, respectively, and transfer the first feature map and the first weight to the first dispatch units 1025a and 1026a.

The first dispatch unit 1025a selects a weight and a corresponding feature map for each of the six channels of data DATA11, and transfers the weight and the corresponding feature map to the first MAC array 1027a. The first dispatch unit 1026a selects a weight and a corresponding feature map for each of the six channels of data DATA12, and transfers the weight and the corresponding feature map to the first MAC array 1027a.

The first MAC array 1027a performs a multiply-accumulate operation on the data transferred from the first dispatch units 1025a and 1026a for each of the six channels.

In the present embodiment, among the operation results output from the first MAC array 1027a, the first operation result R11 corresponds to the results of the multiply-accumulate operations for the first, third, and sixth channels, and the second operation result R12 corresponds to the results of the multiply-accumulate operations for the second, fourth, and fifth channels.

The first operation result R11 is provided to the first summation unit 1041a of the first backend module 104a, and the second operation result R12 is provided to the first bridge 111 for transfer to the second neural processing unit 100b, which operates in the other clock domain. Conversely, the first summation unit 1041a of the first backend module 104a receives, through the second bridge 112, an operation result of the second neural processing unit 100b operating in the other clock domain, namely the fourth operation result R22.

Next, referring to FIG. 5, each of the second internal memories 1021b and 1022b stores the second feature map and the second weight used for the feature extraction operations on data DATA21 and data DATA22. The second fetch units 1023b and 1024b fetch the second feature map and the second weight from the second internal memories 1021b and 1022b, respectively, and transfer the second feature map and the second weight to the second dispatch units 1025b and 1026b.

The second dispatch unit 1025b selects a weight and a corresponding feature map for each of the six channels of data DATA21, and transfers the selected weight and the corresponding feature map to the second MAC array 1027b. The second dispatch unit 1026b selects a weight and a corresponding feature map for each of the six channels of data DATA22, and transfers the selected weight and the corresponding feature map to the second MAC array 1027b.

The second MAC array 1027b performs a multiply-accumulate operation on the data transferred from the second dispatch units 1025b and 1026b for each of the six channels.

In the present embodiment, among the operation results output from the second MAC array 1027b, the third operation result R21 corresponds to the results of the multiply-accumulate operations for the second, fourth, and fifth channels, and the fourth operation result R22 corresponds to the results of the multiply-accumulate operations for the first, third, and sixth channels.

The third operation result R21 is provided to the second summation unit 1041b of the second backend module 104b, and the fourth operation result R22 is provided to the second bridge 112 for transfer to the first neural processing unit 100a, which operates in the other clock domain. Conversely, the second summation unit 1041b of the second backend module 104b receives, through the first bridge 111, an operation result of the first neural processing unit 100a operating in the other clock domain, namely the second operation result R12.

FIG. 6 is a block diagram illustrating backend modules of a neural processing system according to an embodiment of the present disclosure.

Referring to FIG. 6, the first summation unit 1041a performs a summation operation on the first operation result R11 and the fourth operation result R22 for each channel to produce a summation result. In FIGS. 4 and 5, since the first operation result R11 includes the values of three of the six channels and the fourth operation result R22 likewise includes the values of three channels, the summation is performed over three channels.
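
The channel bookkeeping of FIGS. 4 through 6 can be made concrete with a small worked example. The grouping below follows the text (channels 1, 3, 6 versus channels 2, 4, 5, written 0-indexed), while the MAC values themselves are random placeholders.

```python
import numpy as np

group_a = [0, 2, 5]        # channels 1, 3, 6 (kept as R11, sent across as R22)
group_b = [1, 3, 4]        # channels 2, 4, 5 (sent across as R12, kept as R21)

mac1 = np.random.rand(6)   # first MAC array: one value per channel
mac2 = np.random.rand(6)   # second MAC array: one value per channel

r11, r12 = mac1[group_a], mac1[group_b]
r21, r22 = mac2[group_b], mac2[group_a]

sum1 = r11 + r22           # first summation unit: channels 1, 3, 6 of both units
sum2 = r21 + r12           # second summation unit: channels 2, 4, 5 of both units
```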

Subsequently, the first activation unit 1043a performs an activation operation on the result of the summation operation for each channel to produce an activation result, and the first write-back unit 1045a performs, for each channel, a write-back operation that provides the result of the activation operation to the first frontend module 102a. For example, the first write-back unit 1045a may write the data of the activation result corresponding to the first channel back to the first internal memory 1021a, and may write the data corresponding to the second and third channels back to the first internal memory 1022a.

On the other hand, the second summation unit 1041b likewise performs a summation operation on the third operation result R21 and the second operation result R12 for each channel to produce a summation result. In FIGS. 4 and 5, since the third operation result R21 includes the values of three of the six channels and the second operation result R12 likewise includes the values of three channels, the summation is performed over three channels.

Subsequently, the second activation unit 1043b performs an activation operation on the result of the summation operation for each channel to produce an activation result. The second write-back unit 1045b performs, for each channel, a write-back operation for providing the result of the activation operation to the second frontend module 102b. For example, the second write-back unit 1045b may write the data of the activation result corresponding to the first channel back to the second internal memory 1021b, and may write the data corresponding to the second and third channels back to the second internal memory 1022b.

FIG. 7 is a schematic diagram illustrating a computing system according to another embodiment of the present disclosure, and FIG. 8 is a block diagram illustrating a neural processing system according to another embodiment of the present disclosure.

Referring to FIGS. 7 and 8, unlike the embodiment shown in FIG. 1, the neural processing system 10 of the computing system 2 according to the present embodiment further includes a workload manager 120. As explained herein, the use of a workload manager such as the workload manager 120 enhances the practical ability to selectively control individual neural processing units among the plurality of neural processing units in a manner that reduces power consumption, increases power consumption, reduces processing speed, or increases processing speed.

The workload manager 120 distributes first data DATA1, among the data DATA on which feature extraction is to be performed, to the first neural processing unit 100a, and distributes second data DATA2 among the data DATA to the second neural processing unit 100b. Specifically, the workload manager 120 distributes the first data DATA1 to the first frontend module 102a and the second data DATA2 to the second frontend module 102b.
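
A sketch of such a distribution policy is shown below. The 2:1 split ratio is a hypothetical choice (for instance, matching a first unit clocked twice as fast); the disclosure does not fix any particular ratio.

```python
import numpy as np

def distribute(data: np.ndarray, ratio: float = 2 / 3):
    """Give `ratio` of the samples to the first unit and the rest to the second."""
    cut = int(len(data) * ratio)
    return data[:cut], data[cut:]

batch = np.random.rand(12, 4, 4)     # twelve input tiles to be feature-extracted
data1, data2 = distribute(batch)     # 8 tiles -> first NPU, 4 tiles -> second NPU
```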

Accordingly, the first frontend module 102a executes a feature extraction operation on the first data DATA1 using the first feature map and the first weight, and the second frontend module 102b may execute a feature extraction operation on the second data DATA2 using the second feature map and the second weight.

In particular, in some embodiments of the present disclosure, the amount of the first data DATA1 and the amount of the second data DATA2 may differ from each other.

The clock management unit 20 controls the frequency of at least one of the first clock signal CLK1 and the second clock signal CLK2, and may thereby control the performance and power of the first neural processing unit 100a and the second neural processing unit 100b according to the distribution operation of the workload manager 120. For example, the clock management unit 20 may perform clock gating on at least one of the first frontend module 102a, the first backend module 104a, the second frontend module 102b, and the second backend module 104b according to the distribution operation of the workload manager 120.

In this way, the neural processing system 10 according to various embodiments of the present disclosure may control the clock signals of the first neural processing unit 100a and the second neural processing unit 100b therein to control performance or power consumption. For example, to improve the performance of the first neural processing unit 100a while reducing the power consumption of the second neural processing unit 100b, the clock management unit 20 may increase the frequency of the first clock signal CLK1 that drives the first neural processing unit 100a and decrease the frequency of the second clock signal CLK2 that drives the second neural processing unit 100b. As another example, in the specific case where only the first neural processing unit 100a is used and the second neural processing unit 100b is not, clock gating may be performed by controlling the second clock signal CLK2 that drives the second neural processing unit 100b. Accordingly, a computing system including the neural processing system 10 according to various embodiments of the present disclosure can implement artificial intelligence while reducing cost and power consumption.

FIG. 9 is a schematic diagram illustrating a computing system according to yet another embodiment of the present disclosure, and FIG. 10 is a block diagram illustrating a neural processing system according to another embodiment of the present disclosure.

Referring to FIGS. 9 and 10, unlike the embodiment shown in FIGS. 7 and 8, the computing system 3 according to the present embodiment further includes a power management unit (PMU) 50. As explained herein, the use of a power management unit such as the power management unit 50 enhances the practical ability to selectively control the power of individual neural processing units among the plurality of neural processing units in a manner that reduces power consumption, increases power consumption, reduces processing speed, or increases processing speed.

As described above, the workload manager 120 distributes the first data DATA1, among the data DATA on which feature extraction is to be performed, to the first frontend module 102a, and distributes the second data DATA2 among the data DATA to the second frontend module 102b.

Accordingly, the first frontend module 102a may execute a feature extraction operation on the first data DATA1 using the first feature map and the first weight, and the second frontend module 102b may execute a feature extraction operation on the second data DATA2 using the second feature map and the second weight.

The power management unit 50 provides a first power gating signal PG1 to the first neural processing unit 100a and a second power gating signal PG2 to the second neural processing unit 100b. Specifically, the power management unit 50 may provide the first power gating signal PG1 to the first frontend module 102a and the first backend module 104a, and may provide the second power gating signal PG2 to the second frontend module 102b and the second backend module 104b.

The power management unit 50 may control the value of at least one of the first power gating signal PG1 and the second power gating signal PG2, thereby performing power control of the first neural processing unit 100a and the second neural processing unit 100b in response to the distribution operation of the workload manager 120. For example, the power management unit 50 may perform power gating on at least one of the first frontend module 102a, the first backend module 104a, the second frontend module 102b, and the second backend module 104b.
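
A behavioral sketch of this power-gating control follows, with hypothetical method names around the PG1/PG2 signals. Unlike clock gating, which only stops switching activity, power gating also cuts leakage by removing supply power; the class here models only the on/off bookkeeping.

```python
class PowerManagementUnit:
    """Behavioral model: tracks which power-gating signals are asserted."""

    def __init__(self):
        self.gated = {"PG1": False, "PG2": False}   # False: unit powered on

    def power_gate(self, signal: str):
        """Assert the gating signal: the corresponding unit powers down entirely."""
        self.gated[signal] = True

    def power_on(self, signal: str):
        self.gated[signal] = False

pmu = PowerManagementUnit()
pmu.power_gate("PG2")   # only the first neural processing unit is needed
```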

In this way, the neural processing system 10 according to various embodiments of the present disclosure may perform power gating on at least part of the first neural processing unit 100a and the second neural processing unit 100b as needed, thereby reducing the power consumption of the neural processing system 10. Accordingly, a computing system including the neural processing system 10 according to various embodiments of the present disclosure can implement artificial intelligence while reducing cost and power consumption.

FIG. 11 is a schematic diagram illustrating a computing system according to another embodiment of the present disclosure.

Referring to FIG. 11, the computing system 4 according to the present embodiment includes a first neural processing unit 100a, a second neural processing unit 100b, a third neural processing unit 100c, and a fourth neural processing unit 100d. For convenience of explanation, the neural processing system 10 is shown in this embodiment as including the first through fourth neural processing units 100a to 100d, but the scope of the present disclosure is not limited thereto.

The clock management unit 20 generates a first clock signal CLK1, a second clock signal CLK2, a third clock signal CLK3, and a fourth clock signal CLK4 for driving the neural processing system 10, and provides a clock signal to each of the first to fourth neural processing units 100a, 100b, 100c, and 100d. Accordingly, the first neural processing unit 100a is driven according to the first clock signal CLK1, the second neural processing unit 100b is driven according to the second clock signal CLK2, the third neural processing unit 100c is driven according to the third clock signal CLK3, and the fourth neural processing unit 100d is driven according to the fourth clock signal CLK4.

In some embodiments of the present disclosure, the frequencies of the first to fourth clock signals CLK1, CLK2, CLK3, and CLK4 may not all be the same. In other words, the clock domains in which the first to fourth neural processing units 100a, 100b, 100c, and 100d operate may not all be the same.

The clock management unit 20 may control the frequency of each of the first to fourth clock signals CLK1, CLK2, CLK3, and CLK4 as needed. In addition, the clock management unit 20 may perform clock gating on at least one of the first to fourth clock signals CLK1, CLK2, CLK3, and CLK4 as needed.
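To make the per-unit clock control concrete, here is a minimal sketch of a clock manager that keeps an independent frequency for each clock signal and can gate any of them; the class and method names are hypothetical, and real clock gating is performed in hardware rather than software.

```python
class ClockManager:
    """Toy model of a CMU driving CLK1..CLK4 at independent frequencies."""

    def __init__(self, freqs_mhz):
        self.freqs = dict(freqs_mhz)                    # per-clock frequency
        self.gated = {clk: False for clk in self.freqs}

    def set_frequency(self, clk, mhz):
        self.freqs[clk] = mhz                           # frequencies may differ

    def gate(self, clk):
        self.gated[clk] = True                          # stop toggling the clock

    def ungate(self, clk):
        self.gated[clk] = False

cmu = ClockManager({"CLK1": 800, "CLK2": 800, "CLK3": 400, "CLK4": 400})
cmu.set_frequency("CLK3", 600)   # distinct clock domains are allowed
cmu.gate("CLK4")                 # CLK4 gated while unit 100d is idle
```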

FIGS. 12 and 13 are block diagrams illustrating a neural processing system according to yet another embodiment of the present disclosure.

Referring to FIG. 12, the neural processing system 10 according to the present embodiment includes first to fourth neural processing units 100a to 100d, and one or more bridges 1112, 1113, and 1114 are disposed between the first to fourth neural processing units 100a to 100d.

In the present embodiment, the third neural processing unit 100c includes a third front-end module 102c and a third back-end module 104c, and the fourth neural processing unit 100d includes a fourth front-end module 102d and a fourth back-end module 104d. The third neural processing unit 100c may process third data DATA3 among the data to be processed by the neural processing system 10, and the fourth neural processing unit 100d may process fourth data DATA4 among the data.

The bridge 1112 transmits an intermediate result R12 generated by the operation of the first neural processing unit 100a to the second neural processing unit 100b, and the bridge 1113 transmits an intermediate result R13 generated by the operation of the first neural processing unit 100a to the third neural processing unit 100c. In addition, the bridge 1114 transmits an intermediate result R14 generated by the operation of the first neural processing unit 100a to the fourth neural processing unit 100d.

To this end, the first neural processing unit 100a and the second neural processing unit 100b may operate in mutually different clock domains. In this case, the bridge 1112 may be electrically connected to the first neural processing unit 100a and to the second neural processing unit 100b operating in a clock domain different from that of the first neural processing unit 100a. Similarly, the bridge 1113 may be electrically connected to the first neural processing unit 100a and to the third neural processing unit 100c operating in a clock domain different from that of the first neural processing unit 100a, and the bridge 1114 may be electrically connected to the first neural processing unit 100a and to the fourth neural processing unit 100d operating in a clock domain different from that of the first neural processing unit 100a.

Accordingly, the bridges 1112, 1113, and 1114 are implemented as asynchronous bridges to enable data transfer between the different clock domains.
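An asynchronous bridge safely hands data between two clock domains that share no common clock. As a loose software analogue only (the actual bridge is a hardware structure with synchronizers; all names below are hypothetical), a thread-safe FIFO between a producer and a consumer running at unrelated rates behaves similarly:

```python
import queue
import threading
import time

bridge = queue.Queue()  # software stand-in for an asynchronous bridge FIFO

def producer_npu(results):
    # "Clock domain" of the producing unit: pushes intermediate results
    # (e.g., R12) into the bridge at its own rate.
    for r in results:
        bridge.put(r)
        time.sleep(0.001)      # models the producer's clock period
    bridge.put(None)           # end-of-stream marker

def consumer_npu(out):
    # "Clock domain" of the consuming unit: drains the bridge at a
    # slower, unrelated rate; no data is lost or corrupted.
    while (r := bridge.get()) is not None:
        out.append(r)
        time.sleep(0.003)

received = []
t1 = threading.Thread(target=producer_npu, args=([10, 20, 30],))
t2 = threading.Thread(target=consumer_npu, args=(received,))
t1.start(); t2.start(); t1.join(); t2.join()
print(received)  # [10, 20, 30] despite the mismatched rates
```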

Next, referring to FIG. 13, one or more bridges 1122, 1123, and 1124 are disposed among the first to fourth neural processing units 100a to 100d.

The bridge 1122 transmits an intermediate result R22 generated by the operation of the second neural processing unit 100b to the first neural processing unit 100a, and the bridge 1123 transmits an intermediate result R33 generated by the operation of the third neural processing unit 100c to the first neural processing unit 100a. In addition, the bridge 1124 transmits an intermediate result R44 generated by the operation of the fourth neural processing unit 100d to the first neural processing unit 100a.

To this end, the first neural processing unit 100a and the second neural processing unit 100b may operate in mutually different clock domains. In this case, the bridge 1122 may be electrically connected to the first neural processing unit 100a and to the second neural processing unit 100b operating in a clock domain different from that of the first neural processing unit 100a. Similarly, the bridge 1123 may be electrically connected to the first neural processing unit 100a and to the third neural processing unit 100c operating in a clock domain different from that of the first neural processing unit 100a, and the bridge 1124 may be electrically connected to the first neural processing unit 100a and to the fourth neural processing unit 100d operating in a clock domain different from that of the first neural processing unit 100a.

Accordingly, the bridges 1122, 1123, and 1124 are implemented as asynchronous bridges to enable data transfer between the different clock domains.

Although the embodiments shown in FIGS. 12 and 13 describe bridges between the first neural processing unit 100a and the second to fourth neural processing units 100b, 100c, and 100d, the scope of the present disclosure is not limited thereto; the same arrangement may similarly be applied between the second neural processing unit 100b and the third or fourth neural processing unit 100c or 100d, and between the third neural processing unit 100c and the fourth neural processing unit 100d.

FIG. 14 is a block diagram illustrating a computing system according to yet another embodiment of the present disclosure.

Referring to FIG. 14, the neural processing system 10 of the computing system 5 according to the present embodiment further includes a workload manager 120. Similarly to the description of FIGS. 7 and 8, the workload manager 120 may distribute and allocate the data DATA, on which feature extraction is to be performed, to the first to fourth neural processing units 100a, 100b, 100c, and 100d. In addition, the amounts of data distributed to the first to fourth neural processing units 100a to 100d may not all be the same.

The clock management unit 20 may control the frequency of at least one of the first to fourth clock signals CLK1 to CLK4 in the same manner as explained with reference to FIGS. 7 and 8, thereby controlling the performance and power of the first to fourth neural processing units 100a to 100d in response to the allocation operation of the workload manager 120.

In this way, the neural processing system 10 according to various embodiments of the present disclosure may control the clock signals of the first to fourth neural processing units 100a, 100b, 100c, and 100d therein, thereby controlling performance or power consumption. For example, to improve the performance of the first to third neural processing units 100a, 100b, and 100c while reducing the power consumption of the fourth neural processing unit 100d, the clock management unit 20 may increase the frequencies of the first to third clock signals CLK1, CLK2, and CLK3 driving the first to third neural processing units 100a to 100c, and may decrease the frequency of the fourth clock signal CLK4 driving the fourth neural processing unit 100d. As another example, when only the first and second neural processing units 100a and 100b are used and the third and fourth neural processing units 100c and 100d are not used, clock gating may be performed by controlling the third and fourth clock signals CLK3 and CLK4 driving the third and fourth neural processing units 100c and 100d. Accordingly, a computing system including the neural processing system 10 according to various embodiments of the present disclosure can realize artificial intelligence while reducing cost and power consumption.
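The frequency-scaling example above reduces to a simple policy: raise the clocks of heavily loaded units, lower those of lightly loaded ones, and gate the clocks of idle ones. The sketch below illustrates only that policy; the threshold, function name, and return convention are hypothetical assumptions, not the patent's mechanism.

```python
def plan_clocks(allocations, high_mhz=1000, low_mhz=200, threshold=512):
    """Map each unit's assigned workload to a clock plan.

    allocations: dict mapping a clock name to the amount of data the
    workload manager assigned to the corresponding unit. Returns a dict
    of clock name -> frequency in MHz, where 0 means clock-gated.
    """
    plan = {}
    for clk, amount in allocations.items():
        if amount == 0:
            plan[clk] = 0            # idle unit: gate its clock entirely
        elif amount > threshold:
            plan[clk] = high_mhz     # heavy share: raise the frequency
        else:
            plan[clk] = low_mhz      # light share: lower the frequency
    return plan

print(plan_clocks({"CLK1": 1024, "CLK2": 768, "CLK3": 0, "CLK4": 0}))
# {'CLK1': 1000, 'CLK2': 1000, 'CLK3': 0, 'CLK4': 0}
```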

FIG. 15 is a block diagram illustrating a computing system according to yet another embodiment of the present disclosure.

Referring to FIG. 15, unlike the embodiment shown in FIG. 14, the neural processing system 10 of the computing system 6 according to the present embodiment further includes a power management unit (PMU) 50.

As described above, the workload manager 120 allocates and distributes the data DATA, on which feature extraction is to be performed, to the first to fourth neural processing units 100a, 100b, 100c, and 100d.

The power management unit 50 provides a first power gating signal PG1, a second power gating signal PG2, a third power gating signal PG3, and a fourth power gating signal PG4 to the first to fourth neural processing units 100a, 100b, 100c, and 100d, respectively.

The power management unit 50 may control at least one value of the first to fourth power gating signals PG1, PG2, PG3, and PG4 in the same manner as described with reference to FIGS. 9 and 10, thereby performing power control of the first to fourth neural processing units 100a, 100b, 100c, and 100d in response to the allocation operation of the workload manager 120.

In this way, the neural processing system 10 according to various embodiments of the present disclosure may reduce its power consumption by performing power gating on one or more of the first to fourth neural processing units 100a, 100b, 100c, and 100d as needed. Accordingly, a computing system including the neural processing system 10 according to various embodiments of the present disclosure can realize artificial intelligence while reducing cost and power consumption.

FIG. 16 is a block diagram illustrating a computing system according to yet another embodiment of the present disclosure.

Referring to FIG. 16, the computing system 7 according to the present embodiment may be a computing system including a neural processing system 10, a clock management unit 20, a processor 30, a memory 40, a power management unit 50, a storage 60, a display 70, and a camera 80. The neural processing system 10, the clock management unit 20, the processor 30, the memory 40, the power management unit 50, the storage 60, the display 70, and the camera 80 may transmit and receive data through a bus 90.

In some embodiments of the present disclosure, the computing system 7 may be a mobile computing system. For example, the computing system 7 may be a smartphone, a tablet computer, a laptop computer, or the like. Of course, the scope of the present disclosure is not limited thereto.

As explained so far, the neural processing system 10 according to various embodiments of the present disclosure can perform feature extraction operations, using a CNN, on image data generated by the camera 80 or image data stored in the storage 60, at low cost and with low power.

As described above, the neural processing system 10 adopts an architecture including a plurality of neural processing units whose clocks and power can be individually controlled, thereby faithfully implementing and executing artificial intelligence while reducing cost and power consumption.

In summarizing the detailed description, those skilled in the art will appreciate that many changes and modifications may be made to the preferred embodiments without materially departing from the principles of the present disclosure. Therefore, the disclosed preferred embodiments are to be used in a generic and descriptive sense only and not for purposes of limitation.

1, 2, 3, 4, 5, 6, 7: computing system
10: neural processing system
20: clock management unit (CMU)
30: processor
40: memory
50: power management unit (PMU)
60: storage
70: display
80: camera
90: bus
100a: first neural processing unit
100b: second neural processing unit
100c: third neural processing unit
100d: fourth neural processing unit
102a: first front-end module
102b: second front-end module
102c: third front-end module
102d: fourth front-end module
104a: first back-end module
104b: second back-end module
104c: third back-end module
104d: fourth back-end module
110: bridge unit
111: first bridge
112: second bridge
120: workload manager
1021a, 1022a: first internal memory
1021b, 1022b: second internal memory
1023a, 1024a: first extraction unit
1023b, 1024b: second extraction unit
1025a, 1026a: first dispatch unit
1025b, 1026b: second dispatch unit
1027a: first MAC array
1027b: second MAC array
1041a: first summation unit
1041b: second summation unit
1043a: first activation unit
1043b: second activation unit
1045a: first write-back unit
1045b: second write-back unit
1112, 1113, 1114, 1122, 1123, 1124: bridge
CLK1: first clock signal
CLK2: second clock signal
CLK3: third clock signal
CLK4: fourth clock signal
DATA, DATA3, DATA4, DATA11, DATA12, DATA21, DATA22: data
DATA1: first data
DATA2: second data
DATA3: third data
DATA4: fourth data
PG1: first power gating signal
PG2: second power gating signal
PG3: third power gating signal
PG4: fourth power gating signal
R11: first operation result
R12: second operation result / intermediate result
R13, R14: intermediate results
R21: third operation result
R22: fourth operation result / intermediate result
R33, R44: intermediate results
WB DATA1: first write-back data
WB DATA2: second write-back data

The above and other aspects and features of the present disclosure will become more apparent by describing exemplary embodiments of the present disclosure in detail with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram illustrating a computing system according to an embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating a neural processing system according to an embodiment of the present disclosure.
FIG. 3 is a block diagram illustrating a neural processing system according to an embodiment of the present disclosure.
FIGS. 4 and 5 are block diagrams illustrating front-end modules of a neural processing system according to an embodiment of the present disclosure.
FIG. 6 is a block diagram illustrating a back-end module of a neural processing system according to an embodiment of the present disclosure.
FIG. 7 is a schematic diagram illustrating a computing system according to another embodiment of the present disclosure.
FIG. 8 is a block diagram illustrating a neural processing system according to another embodiment of the present disclosure.
FIG. 9 is a schematic diagram illustrating a computing system according to yet another embodiment of the present disclosure.
FIG. 10 is a block diagram illustrating a neural processing system according to yet another embodiment of the present disclosure.
FIG. 11 is a schematic diagram illustrating a computing system according to yet another embodiment of the present disclosure.
FIGS. 12 and 13 are block diagrams illustrating a neural processing system according to yet another embodiment of the present disclosure.
FIG. 14 is a block diagram illustrating a computing system according to yet another embodiment of the present disclosure.
FIG. 15 is a block diagram illustrating a computing system according to yet another embodiment of the present disclosure.
FIG. 16 is a block diagram illustrating a computing system according to yet another embodiment of the present disclosure.

1: computing system
10: neural processing system
20: clock management unit (CMU)
30: processor
40: memory
90: bus
100a: first neural processing unit
100b: second neural processing unit
CLK1: first clock signal
CLK2: second clock signal

Claims (20)

1. A neural processing system, comprising: a first front-end module performing a feature extraction operation using a first feature map and a first weight, and outputting a first operation result and a second operation result; a second front-end module performing the feature extraction operation using a second feature map and a second weight, and outputting a third operation result and a fourth operation result; a first back-end module receiving, as inputs, the first operation result provided from the first front-end module and the fourth operation result provided from the second front-end module through a second bridge, so as to sum the first operation result and the fourth operation result; and a second back-end module receiving, as inputs, the third operation result provided from the second front-end module and the second operation result provided from the first front-end module through a first bridge, so as to sum the third operation result and the second operation result.

2. The neural processing system of claim 1, wherein the first front-end module and the first back-end module are driven according to a first clock signal, and the second front-end module and the second back-end module are driven according to a second clock signal having a frequency different from that of the first clock signal.

3. The neural processing system of claim 1, wherein the first bridge and the second bridge are asynchronous bridges.

4. The neural processing system of claim 1, wherein the first back-end module provides first write-back data to the first front-end module, and the second back-end module provides second write-back data to the second front-end module.

5. The neural processing system of claim 1, wherein the first front-end module comprises: a plurality of first internal memories storing the first feature map and the first weight; a plurality of first extraction units extracting the first feature map and the first weight from each of the plurality of first internal memories; a plurality of first dispatch units transmitting, for each channel, the extracted first feature map and first weight to a first multiplication-and-accumulation array; and the first multiplication-and-accumulation array performing a multiply-accumulate operation on the data transmitted from the plurality of first dispatch units.

6. The neural processing system of claim 5, wherein the first multiplication-and-accumulation array outputs the first operation result and the second operation result, the first operation result being provided to the first back-end module and the second operation result being provided to the second back-end module through the first bridge.

7. The neural processing system of claim 1, wherein the second front-end module comprises: a plurality of second internal memories storing the second feature map and the second weight; a plurality of second extraction units extracting the second feature map and the second weight from each of the plurality of second internal memories; a plurality of second dispatch units transmitting, for each channel, the extracted second feature map and second weight to a second multiplication-and-accumulation array; and the second multiplication-and-accumulation array performing a multiply-accumulate operation on the data transmitted from the plurality of second dispatch units.

8. The neural processing system of claim 1, further comprising: a workload manager allocating first data, of data on which feature extraction is to be performed, to the first front-end module, and allocating second data of the data to the second front-end module, wherein the first front-end module performs the feature extraction operation on the first data using the first feature map and the first weight, and the second front-end module performs the feature extraction operation on the second data using the second feature map and the second weight.

9. The neural processing system of claim 8, wherein an amount of the first data and an amount of the second data are different from each other.

10. The neural processing system of claim 8, further comprising: a clock management unit providing a first clock signal to the first front-end module and the first back-end module, and providing a second clock signal to the second front-end module and the second back-end module, wherein the clock management unit controls a frequency of at least one of the first clock signal and the second clock signal to perform clock gating on at least one of the first front-end module, the first back-end module, the second front-end module, and the second back-end module according to an allocation operation of the workload manager.

11. The neural processing system of claim 8, further comprising: a power management unit providing a first power gating signal to the first front-end module and the first back-end module, and providing a second power gating signal to the second front-end module and the second back-end module, wherein the power management unit controls at least one value of the first power gating signal and the second power gating signal to perform power gating on at least one of the first front-end module, the first back-end module, the second front-end module, and the second back-end module according to an allocation operation of the workload manager.

12. A neural processing system, comprising: a first neural processing unit including a first front-end module and a first back-end module; and a bridge unit electrically connected to the first neural processing unit and to a second neural processing unit operating in a clock domain different from that of the first neural processing unit, wherein the first front-end module provides a part of a first operation result, obtained by performing a feature extraction operation using a first feature map and a first weight, to the first back-end module, the bridge unit provides a part of a second operation result performed in the second neural processing unit to the first back-end module, and the first back-end module sums the part of the first operation result and the part of the second operation result.

13. The neural processing system of claim 12, wherein the bridge unit is electrically connected to a third neural processing unit operating in a clock domain different from that of the first neural processing unit, the first front-end module provides another part of the first operation result to the bridge unit, and the bridge unit provides the other part of the first operation result to the third neural processing unit.

14. The neural processing system of claim 12, wherein the first front-end module comprises: a plurality of first internal memories storing the first feature map and the first weight; a plurality of first extraction units extracting the first feature map and the first weight from each of the plurality of first internal memories; a plurality of first dispatch units transmitting, for each channel, the extracted first feature map and first weight to a first multiplication-and-accumulation array; and the first multiplication-and-accumulation array performing a multiply-accumulate operation on the data transmitted from the plurality of first dispatch units and outputting the first operation result.

15. A neural processing system, comprising: a first neural processing unit including a first front-end module and a first back-end module; a second neural processing unit including a second front-end module and a second back-end module; and a workload manager allocating first data, of data on which feature extraction is to be performed, to the first neural processing unit, and allocating second data of the data to the second neural processing unit, wherein the first front-end module performs a feature extraction operation on the first data using a first feature map and a first weight and outputs a first operation result and a second operation result, the second front-end module performs the feature extraction operation on the second data using a second feature map and a second weight and outputs a third operation result and a fourth operation result, the first back-end module sums the first operation result and the fourth operation result, and the second back-end module sums the third operation result and the second operation result.

16. The neural processing system of claim 15, further comprising: a clock management unit providing a first clock signal to the first front-end module and the first back-end module, and providing a second clock signal to the second front-end module and the second back-end module, wherein the clock management unit controls a frequency of at least one of the first clock signal and the second clock signal to perform clock gating on at least one of the first front-end module, the first back-end module, the second front-end module, and the second back-end module according to an allocation operation of the workload manager.

17. The neural processing system of claim 15, further comprising: a power management unit providing a first power gating signal to the first front-end module and the first back-end module, and providing a second power gating signal to the second front-end module and the second back-end module, wherein the power management unit controls at least one value of the first power gating signal and the second power gating signal to perform power gating on at least one of the first front-end module, the first back-end module, the second front-end module, and the second back-end module according to an allocation operation of the workload manager.

18. The neural processing system of claim 15, wherein the first neural processing unit is driven according to a first clock signal, and the second neural processing unit is driven according to a second clock signal having a frequency different from that of the first clock signal.

19. The neural processing system of claim 15, wherein the first front-end module comprises: a plurality of first internal memories storing the first feature map and the first weight; a plurality of first extraction units extracting the first feature map and the first weight from each of the plurality of first internal memories; a plurality of first dispatch units transmitting, for each channel, the extracted first feature map and first weight to a first multiplication-and-accumulation array; and the first multiplication-and-accumulation array performing a multiply-accumulate operation on the data transmitted from the plurality of first dispatch units.

20. The neural processing system of claim 15, wherein the second front-end module comprises: a plurality of second internal memories storing the second feature map and the second weight; a plurality of second extraction units extracting the second feature map and the second weight from each of the plurality of second internal memories; a plurality of second dispatch units transmitting, for each channel, the extracted second feature map and second weight to a second multiplication-and-accumulation array; and the second multiplication-and-accumulation array performing a multiply-accumulate operation on the data transmitted from the plurality of second dispatch units.
TW108127870A 2018-09-07 2019-08-06 Neural processing system TWI805820B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020180106917A KR20200029661A (en) 2018-09-07 2018-09-07 Neural processing system
KR10-2018-0106917 2018-09-07

Publications (2)

Publication Number Publication Date
TW202011279A TW202011279A (en) 2020-03-16
TWI805820B true TWI805820B (en) 2023-06-21

Family

ID=69718889

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108127870A TWI805820B (en) 2018-09-07 2019-08-06 Neural processing system

Country Status (4)

Country Link
US (2) US11443183B2 (en)
KR (1) KR20200029661A (en)
CN (1) CN110889499A (en)
TW (1) TWI805820B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102506622B1 (en) * 2022-04-01 2023-03-07 리벨리온 주식회사 Method for measuring performance of neural processing device and Device for measuring performance
KR20230142336A (en) 2022-04-01 2023-10-11 리벨리온 주식회사 Method for measuring performance of neural processing device and Device for measuring performance
US11954587B2 (en) * 2023-08-30 2024-04-09 Deepx Co., Ltd. System-on-chip for artificial neural network being operated according to dynamically calibrated phase of clock signal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201640422A (en) * 2014-12-19 2016-11-16 英特爾股份有限公司 Method and apparatus for distributed and cooperative computation in artificial neural networks
US20170011288A1 (en) * 2015-07-10 2017-01-12 Samsung Electronics Co., Ltd. Neural network processor
TW201734894A (en) * 2014-07-22 2017-10-01 英特爾股份有限公司 Weight-shifting processor, method and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5627943A (en) 1993-02-17 1997-05-06 Kawasaki Steel Corporation Neural network processor including systolic array of two-dimensional layers
US5799134A (en) 1995-03-13 1998-08-25 Industrial Technology Research Institute One dimensional systolic array architecture for neural network
US9477925B2 (en) 2012-11-20 2016-10-25 Microsoft Technology Licensing, Llc Deep neural networks training for speech and pattern recognition
KR102084547B1 (en) 2013-01-18 2020-03-05 삼성전자주식회사 Nonvolatile memory device, memory system having the same, external power controlling method thereof
US10331997B2 (en) 2014-05-07 2019-06-25 Seagate Technology Llc Adaptive configuration of a neural network device
CN106575379B (en) 2014-09-09 2019-07-23 英特尔公司 Improved fixed point integer implementation for neural network
US10373050B2 (en) 2015-05-08 2019-08-06 Qualcomm Incorporated Fixed point neural network based on floating point neural network quantization
CN105892989B (en) 2016-03-28 2017-04-12 中国科学院计算技术研究所 Neural network accelerator and operational method thereof
US10650303B2 (en) 2017-02-14 2020-05-12 Google Llc Implementing neural networks in fixed point arithmetic computing systems

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201734894A (en) * 2014-07-22 2017-10-01 英特爾股份有限公司 Weight-shifting processor, method and system
TW201640422A (en) * 2014-12-19 2016-11-16 英特爾股份有限公司 Method and apparatus for distributed and cooperative computation in artificial neural networks
US20170011288A1 (en) * 2015-07-10 2017-01-12 Samsung Electronics Co., Ltd. Neural network processor

Also Published As

Publication number Publication date
US11443183B2 (en) 2022-09-13
KR20200029661A (en) 2020-03-19
US20220405593A1 (en) 2022-12-22
US20200082263A1 (en) 2020-03-12
TW202011279A (en) 2020-03-16
US11625606B2 (en) 2023-04-11
CN110889499A (en) 2020-03-17

Similar Documents

Publication Publication Date Title
CN110678843B (en) Dynamic partitioning of workload in deep neural network modules to reduce power consumption
JP7474586B2 (en) Tensor Computation Data Flow Accelerator Semiconductor Circuit
TWI805820B (en) Neural processing system
Motamedi et al. Design space exploration of FPGA-based deep convolutional neural networks
JP2022070955A (en) Scheduling neural network processing
JP2022046552A (en) Neural network compute tile
KR20190084705A (en) Neural network processing unit including approximate multiplier and system on chip including the same
US20210357732A1 (en) Neural network accelerator hardware-specific division of inference into groups of layers
JP2018073414A (en) Method of controlling work flow in distributed computation system comprising processor and memory units
Pullini et al. A heterogeneous multicore system on chip for energy efficient brain inspired computing
WO2020163315A1 (en) Systems and methods for artificial intelligence with a flexible hardware processing framework
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN109753319B (en) Device for releasing dynamic link library and related product
WO2020253383A1 (en) Streaming data processing method based on many-core processor, and computing device
US11847507B1 (en) DMA synchronization using alternating semaphores
CN111047022A (en) Computing device and related product
Chen et al. An efficient accelerator for multiple convolutions from the sparsity perspective
EP4206999A1 (en) Artificial intelligence core, artificial intelligence core system, and loading/storing method of artificial intelligence core system
Bruel et al. Generalize or die: Operating systems support for memristor-based accelerators
CN116484909A (en) Vector engine processing method and device for artificial intelligent chip
US11221979B1 (en) Synchronization of DMA transfers for large number of queues
Ozaki et al. Cool mega-array: A highly energy efficient reconfigurable accelerator
Aghapour et al. Integrated ARM big. Little-Mali pipeline for high-throughput CNN inference
US20230195836A1 (en) One-dimensional computational unit for an integrated circuit
CN114595813B (en) Heterogeneous acceleration processor and data computing method