TW201636823A - Systems, apparatuses, and methods for K nearest neighbor search - Google Patents

Info

Publication number
TW201636823A
Authority
TW
Taiwan
Prior art keywords
vector
distance
bit
global
circuit
Prior art date
Application number
TW104138748A
Other languages
Chinese (zh)
Other versions
TWI604379B (en
Inventor
希曼休 科爾
馬克A 安德斯
沙努K 馬修
Original Assignee
英特爾公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US14/582,607 external-priority patent/US9626334B2/en
Priority claimed from US14/944,828 external-priority patent/US10303735B2/en
Application filed by 英特爾公司 filed Critical 英特爾公司
Publication of TW201636823A publication Critical patent/TW201636823A/en
Application granted granted Critical
Publication of TWI604379B publication Critical patent/TWI604379B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations

Abstract

Systems, apparatuses, and methods for k-nearest neighbor (KNN) searches are described. In particular, embodiments of a KNN accelerator and its uses are described. In some embodiments, the KNN accelerator includes a plurality of vector partial distance computation circuits each to calculate a partial sum, a minimum sort network to sort partial sums from the plurality of vector partial distance computation circuits to find k nearest neighbor matches and a global control circuit to control aspects of operations of the plurality of vector partial distance computation circuits.

Description

Systems, apparatuses, and methods for K nearest neighbor search

The field of the invention relates generally to computer processor architecture and, more specifically, to nearest neighbor search.

Background of the invention

In many applications, a fast and efficient nearest neighbor search over the multidimensional features (points) of a data set is desirable. For example, this type of search is useful in fields such as image reconstruction and machine learning. There are several approaches to nearest neighbor search in a data set. In a nearest neighbor search, given a set of points in a space and an input instance (query point), a search is performed to find the point in the set that is closest to the input instance.

According to an embodiment of the present invention, an apparatus is provided that comprises: at least one vector partial distance computation circuit to calculate a partial sum and an accumulated distance for a set of vectors in a search space; a minimum sort network to sort a selected set of bits from the accumulated distances and to indicate a minimum of the selected sets of bits from the vectors in the search space and whether that minimum is unique; and a global control circuit to receive an output of the minimum sort network and to control aspects of the operation of the at least one vector partial distance computation circuit.

101‧‧‧Query object vector

103_0~103_N‧‧‧Vector 0~N partial distance computation

105‧‧‧Global control

107‧‧‧Minimum sort network

109‧‧‧Level 0 comparison node

111‧‧‧Level "k" comparison node

203‧‧‧Vector partial distance computation

205‧‧‧Data element calculator circuit

207‧‧‧Local control circuit

209‧‧‧Shifter

211‧‧‧Compressor tree

213‧‧‧Flip-flop

215‧‧‧Adder

219‧‧‧Selector

301‧‧‧Query object

303‧‧‧Stored object latch

305‧‧‧|a-b|

307‧‧‧Select from control

309‧‧‧Selector

311‧‧‧Multiplier

313‧‧‧Adder

401‧‧‧Query object

403‧‧‧Stored object latch

405‧‧‧|a-b|

407‧‧‧Select from control

409‧‧‧Selector

501‧‧‧Global pointer

503‧‧‧Vector address

505‧‧‧Minimum address

509‧‧‧Minimum precise

511‧‧‧AND gate

513‧‧‧AND gate

515‧‧‧Remove

517‧‧‧Local compute control

901‧‧‧OR gate

903‧‧‧12b global mask

905‧‧‧Priority encoder

907‧‧‧OR tree

911‧‧‧Selector

913‧‧‧12x12b lookup table

1001‧‧‧OR gate

1003‧‧‧XOR gate

1005‧‧‧OR gate

1007‧‧‧|a-b|>ε

1009‧‧‧AND gate

1501~1527‧‧‧Blocks

1600‧‧‧Pipeline

1602‧‧‧Fetch

1604‧‧‧Length decode

1606‧‧‧Decode

1608‧‧‧Allocation

1610‧‧‧Renaming

1612‧‧‧Scheduling

1614‧‧‧Register read/memory read

1616‧‧‧Execute stage

1618‧‧‧Write back/memory write

1622‧‧‧Exception handling

1624‧‧‧Commit

1630‧‧‧Front-end unit

1632‧‧‧Branch prediction unit

1634‧‧‧Instruction cache unit

1636‧‧‧Instruction TLB unit

1638‧‧‧Instruction fetch

1640‧‧‧Decode unit

1650‧‧‧Execution engine unit

1652‧‧‧Rename/allocator unit

1654‧‧‧Retirement unit

1656‧‧‧Scheduler unit

1658‧‧‧Physical register file unit

1660‧‧‧Execution cluster

1662‧‧‧Execution unit

1664‧‧‧Memory access unit

1670‧‧‧Memory unit

1672‧‧‧Data TLB unit

1674‧‧‧Data cache unit

1676‧‧‧L2 cache unit

1690‧‧‧Core

1700‧‧‧Instruction code

1702‧‧‧Ring network

1704‧‧‧Local subset of the L2 cache

1706‧‧‧L1 cache

1708‧‧‧Scalar unit

1710‧‧‧Vector unit

1712‧‧‧Scalar registers

1714‧‧‧Vector registers

1706A‧‧‧L1 data cache

1720‧‧‧Swizzle unit

1720A, 1722B‧‧‧Numeric convert

1724‧‧‧Replicate

1726‧‧‧Write mask registers

1728‧‧‧16-wide vector ALU

1800‧‧‧Processor

1802A~N‧‧‧Cores

1804A~N‧‧‧Cache units

1806‧‧‧Shared cache unit

1808‧‧‧Special-purpose logic

1810‧‧‧System agent unit

1812‧‧‧Ring

1814‧‧‧Integrated memory controller unit

1816‧‧‧Bus controller unit

1900‧‧‧System

1910‧‧‧Processor

1915‧‧‧Processor

1920‧‧‧Controller hub

1940‧‧‧Memory

1945‧‧‧Coprocessor

1950‧‧‧IOH

1960‧‧‧I/O

1990‧‧‧GMCH

1995‧‧‧Connection

2000‧‧‧System

2014‧‧‧I/O devices

2015‧‧‧Processor

2016‧‧‧First bus

2018‧‧‧Bus bridge

2020‧‧‧Second bus

2022‧‧‧Keyboard and/or mouse

2024‧‧‧Audio I/O

2027‧‧‧Communication devices

2028‧‧‧Data storage

2030‧‧‧Code and data

2032, 2034‧‧‧Memory

2038‧‧‧Coprocessor

2039‧‧‧High-performance interface

2050, 2052‧‧‧Point-to-point interconnect

2072, 2082‧‧‧IMC

2076, 2078, 2086, 2088, 2094, 2098‧‧‧P-P

2070‧‧‧Processor

2072, 2082‧‧‧CL

2080‧‧‧Processor/coprocessor

2090‧‧‧Chipset

2092, 2096‧‧‧I/F

2100‧‧‧System

2114‧‧‧I/O devices

2115‧‧‧Legacy I/O

2200‧‧‧System on a chip

2202‧‧‧Interconnect unit

2210‧‧‧Application processor

2220‧‧‧Coprocessor

2230‧‧‧SRAM unit

2232‧‧‧DMA unit

2240‧‧‧Display unit

2302‧‧‧High-level language

2304‧‧‧x86 compiler

2306‧‧‧x86 binary code

2308‧‧‧Alternative instruction set compiler

2310‧‧‧Alternative instruction set binary code

2312‧‧‧Instruction converter

2314‧‧‧Processor without an x86 instruction set core

2316‧‧‧Processor with at least one x86 instruction set core

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals indicate similar elements and in which:

FIG. 1 illustrates a high-level kNN accelerator organization according to an embodiment.

FIG. 2 illustrates an exemplary vector partial distance computation circuit according to an embodiment.

FIG. 3 illustrates an exemplary sum-of-squared-differences data element calculator circuit for vector partial distance computation, according to an embodiment.

FIG. 4 illustrates an exemplary sum-of-absolute-differences data element calculator circuit for vector partial distance computation, according to an embodiment.

FIG. 5 illustrates an exemplary local control circuit according to an embodiment.

FIG. 6 illustrates an exemplary Manhattan distance sort process according to an embodiment.

FIG. 7 illustrates an exemplary data element Euclidean distance sort process according to an embodiment.

FIG. 8 illustrates an exemplary sort process using partial distances according to an embodiment.

FIG. 9 illustrates an exemplary global control circuit according to an embodiment.

FIG. 10 illustrates an exemplary level 0 comparison node circuit according to an embodiment.

FIG. 11 illustrates an exemplary level k comparison node circuit according to an embodiment.

FIG. 12 illustrates an exemplary 8-bit/16-bit reconfigurable computation circuit according to an embodiment.

FIG. 13 illustrates an exemplary partial distance computation of the sum of squares for 16-bit elements according to an embodiment.

FIG. 14 illustrates a cosine similarity computation (1d distance) circuit according to an embodiment and an exemplary partial distance computation for inner products according to an embodiment.

FIGS. 15A-B illustrate an exemplary method of a kNN search according to an embodiment.

FIG. 16A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention.

FIG. 16B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention.

FIGS. 17A-B illustrate a block diagram of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip.

FIG. 18 is a block diagram of a processor 1800 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the invention.

FIGS. 19-22 are block diagrams of exemplary computer architectures.

FIG. 23 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention.

Detailed description of the preferred embodiment

In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.

References in this specification to "one embodiment," "an embodiment," "an example embodiment," and the like indicate that the embodiment described may include a particular feature, structure, or characteristic, but not every embodiment necessarily includes that feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is understood to be within the knowledge of one skilled in the art to apply that feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described.

One approach to nearest neighbor search is to compute the distance from the input instance to every point in a data set and to keep track of the shortest distance. However, this simple approach may not be feasible for larger data sets. The distance computation may be done using a k-dimensional (k-d) tree to perform an exhaustive examination of all features, one feature at a time. This approach is therefore slow and also has high power consumption.
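
As a point of reference, the exhaustive approach described above can be sketched in a few lines of Python. This is only an illustrative software baseline, not anything taken from the patent: the function name `nearest_neighbor` and the choice of Manhattan distance are assumptions made for the example.

```python
def nearest_neighbor(query, points):
    """Exhaustive scan: return (index, distance) of the point closest to query.

    Manhattan distance is an assumption chosen for this illustration.
    """
    best_index, best_distance = -1, None
    for index, point in enumerate(points):
        # Full-precision distance to every stored point, one point at a time.
        distance = sum(abs(q - p) for q, p in zip(query, point))
        if best_distance is None or distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


result = nearest_neighbor([1, 2], [[5, 5], [1, 3], [0, 0]])
```

Every stored point costs a full distance computation here, which is the cost the accelerator described below tries to avoid.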

Another nearest neighbor approach uses Voronoi diagrams. Each Voronoi diagram partitions a plane into regions of equal nearest neighbors, called cells. This can be illustrated as a number of cells, each containing one feature (point). In theory, a "best matched" feature may be found for any input instance by using a Voronoi diagram to locate the feature in a particular cell. However, as shown, Voronoi cells are highly irregular in shape and are difficult to compute (they are both time and processor intensive) and to use. In other words, Voronoi diagrams by themselves are not a convenient or efficient method for nearest neighbor feature search.

Detailed herein are embodiments of systems, apparatuses, and methods for improved nearest neighbor search that overcome the shortcomings of the above approaches. In short, given an input (i.e., an observation), a search is made for the best matching feature in a feature space (i.e., a dictionary of features). This approach is especially well suited to feature vectors that are typically sparsely present in a high-dimensional vector space (note that features in this specification are vectors, so feature and feature vector are used interchangeably).

Detailed below are embodiments of a k-nearest neighbor (kNN) accelerator that adjusts the precision of the distance computations to the minimum level needed to find each nearest neighbor. Many candidate vectors are removed from the search space using only low-precision computations, while the remaining candidates, which are closer to the nearest neighbor, are removed in later iterations using higher precision until a winner is declared. Because most of the computations require lower precision and consume less energy, overall kNN energy efficiency is significantly improved. Typically, this kNN accelerator is a part of a central processing unit (CPU), graphics processing unit (GPU), etc. However, the kNN accelerator may be external to the CPU, GPU, etc.
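
The idea of discarding most candidates at low precision can be illustrated with a small software model. The sketch below is an assumption-laden illustration, not the circuit described in this document: it uses Manhattan distance, refines each candidate two bits at a time from the MSB, and prunes with a simple upper/lower bound test rather than the windowed 3-bit comparison described later. The name `progressive_nearest` and the `work` counter are hypothetical.

```python
def progressive_nearest(query, candidates, element_bits=8, window_bits=2):
    """Find the nearest candidate, refining distances MSB-first.

    Returns (winner index, exact distance, number of partial computations).
    """
    mask = (1 << window_bits) - 1
    diffs = [[abs(q - c) for q, c in zip(query, cand)] for cand in candidates]
    accumulated = [0] * len(candidates)
    active = set(range(len(candidates)))
    work = 0
    for shift in range(element_bits - window_bits, -1, -window_bits):
        for i in active:
            # Add this iteration's 2-bit slice of every element difference.
            accumulated[i] += sum((d >> shift) & mask for d in diffs[i]) << shift
            work += 1
        # Bits below `shift` are still unknown: each element can add
        # at most 2**shift - 1 to a candidate's final distance.
        slack = len(query) * ((1 << shift) - 1)
        best_upper = min(accumulated[i] + slack for i in active)
        # A candidate whose lower bound exceeds the best upper bound
        # can never be the nearest neighbor.
        active = {i for i in active if accumulated[i] <= best_upper}
    winner = min(active, key=lambda i: (accumulated[i], i))
    return winner, accumulated[winner], work


example = progressive_nearest(
    [10, 200, 30, 40],
    [[12, 198, 33, 41], [250, 0, 255, 0], [10, 200, 30, 44], [100, 100, 100, 100]],
)
```

In this example the two distant candidates drop out after the first two low-precision iterations, so fewer partial computations are performed than the 16 an exhaustive four-iteration pass over four candidates would need.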

FIG. 1 illustrates a high-level kNN accelerator according to an embodiment. The main components of this accelerator include a plurality of vector partial distance computation circuits 103_0 to 103_N, a global control circuit 105, and a minimum sort network 107. Each of these components is discussed in detail below.

A query object vector 101 is input to the plurality of vector partial distance computation circuits 103_0 to 103_N for partial distance computation. Storage for this object vector is not shown. The partial distance computation circuits 103_0 to 103_N compute a partial distance and an accumulated distance for each reference vector and provide a valid indication to the minimum sort network 107. Computing partial distances between a query 101 and the stored vectors over multiple iterations at lower significant-bit precision, and refining the result in each iteration as detailed herein, is more energy efficient than prior approaches. Partial distance computation involves computing, in each iteration, a few bits of the complete distance, starting from the most significant bit (MSB), for different distance metrics such as Euclidean (sum of squares) distance and Manhattan (sum of absolute differences) distance. The partial result is added to the accumulated distance at the correct significance, so that precision in the lower significant bits improves as the computation proceeds.
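
A minimal software sketch of this MSB-first accumulation, for the Manhattan metric, is given here for illustration. It assumes 8-bit elements processed in 2-bit slices; the function name `manhattan_partial_iterations` and its parameters are hypothetical and only model the arithmetic, not the hardware.

```python
def manhattan_partial_iterations(query, stored, element_bits=8, window_bits=2):
    """Return the accumulated distance after each MSB-first iteration."""
    diffs = [abs(q - s) for q, s in zip(query, stored)]
    mask = (1 << window_bits) - 1
    accumulated = 0
    history = []
    for shift in range(element_bits - window_bits, -1, -window_bits):
        # Partial distance: sum of one 2-bit slice of every element difference.
        partial = sum((d >> shift) & mask for d in diffs)
        # Align the partial sum to its significance and accumulate.
        accumulated += partial << shift
        history.append(accumulated)
    return history


history = manhattan_partial_iterations([200, 17, 3], [10, 250, 3])
```

For this input the element differences are 190, 233, and 0, and the accumulated distance grows as 320, 400, 420, 423, ending at the exact sum of absolute differences.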

FIG. 2 illustrates an exemplary vector partial distance computation circuit 203 according to an embodiment. The vector is made up of many dimensions, each represented by 8 bits (8b) in this example. The individual distance in each dimension is first computed by the data element calculator circuits 205, and these distances are then summed by the compressor tree 211 to obtain the total distance. A local control circuit 207 provides an indication of which bits are to be selected in the data element calculator circuits 205.

As noted above, several different types of distance metric may be used, and so there are different data element calculator circuits 205. When a sum of absolute differences (Manhattan distance) metric is used, the appropriate two bits (2b) are selected from the absolute difference of each vector element and summed in the vector partial distance computation circuit 203. FIG. 3 illustrates an exemplary sum-of-squared-differences (Euclidean distance) data element calculator circuit 205 for vector partial distance computation, according to an embodiment. As shown, the absolute difference (|a-b|) between a portion of a query object (8 bits as shown) and a portion of a stored object (the same number of bits) is computed in hardware, and particular bits of the result are selected using multiplexers and a control signal. In some embodiments, the control signal is provided by a local control circuit, detailed below. The multiplexer outputs are multiplied (a pair of 2b x 2b multiplications) and the products are then added. In this example, the output is a 5-bit value that represents a partial distance in the computation of the squared difference. These values are added by the compressor tree 211 to compute a partial Euclidean distance for the entire vector.
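
To see why small 2b x 2b multipliers are sufficient, note that an 8-bit absolute difference can be split into four 2-bit components, and its square is then the sum of all pairwise component products, each shifted to its significance. The sketch below only checks that identity in software. The order in which products are accumulated (highest significance first) is an assumption; the actual schedule of FIG. 7 is not reproduced here, and the function name is hypothetical.

```python
def squared_difference_by_2bit_products(a, b):
    """Rebuild (a - b)**2 from 2-bit x 2-bit products of |a - b|."""
    diff = abs(a - b)
    shifts = [6, 4, 2, 0]
    # Four 2-bit components of the 8-bit absolute difference, MSB first.
    parts = [(diff >> shift) & 0b11 for shift in shifts]
    # Every pairwise product, tagged with the significance it belongs to.
    terms = sorted(
        ((shifts[i] + shifts[j], parts[i] * parts[j])
         for i in range(4) for j in range(4)),
        reverse=True,
    )
    accumulated = 0
    for shift, product in terms:
        # Each product is at most 3 * 3 = 9, so a 2b x 2b multiplier suffices.
        accumulated += product << shift
    return accumulated


value = squared_difference_by_2bit_products(200, 10)
```

Here `value` equals 190 squared. Adding a pair of such products gives at most 18, which fits the 5-bit per-element output mentioned above.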

When a sum of absolute differences (Manhattan distance) metric is used, the appropriate two bits (2b) are selected from the absolute difference of each vector element and summed in the vector partial distance computation circuit 203. FIG. 4 illustrates an exemplary sum-of-absolute-differences data element calculator circuit 205 for vector partial distance computation, according to an embodiment. As shown, the absolute difference (|a-b|) between a portion of a query object (8 bits as shown) and a portion of a stored object (the same number of bits) is computed in hardware, and particular bits of the result are selected using a multiplexer and a control signal. In some embodiments, the control signal is provided by a local control circuit, detailed below. In this example, the output is a 2-bit value.

The partial SAD computation reduces the compressor tree size by 4x, while the partial Euclidean metric computation replaces the 8-bit by 8-bit multiplier for each vector element with a pair of trivial 2-bit multipliers and so also reduces the area of the compressor tree 211 by 3x. This non-obvious construction of the Euclidean distance guarantees that, once a higher MSB position has been processed, any subsequent lower-order refinement affects the higher-order bits by no more than 1, as discussed below with respect to FIGS. 3, 4, 6, and 7. FIG. 7 illustrates an embodiment of a complete squared-difference operation (for computing Euclidean distance) broken up into partial computation iterations computed by the exemplary circuit of FIG. 3. FIG. 7 shows that, by performing the computations and sorts in the order shown, computations on the lower-order bits do not disturb the already processed higher-order bits by more than 1. Similarly, FIG. 4 illustrates an embodiment of the per-element circuit, and FIG. 6 illustrates an example of the corresponding computation performed by that circuit for Manhattan distance. FIG. 6 shows that, by performing the computations and sorts in the order shown, computations on the lower-order bits do not disturb the already processed higher-order bits by more than 1.

In some embodiments, because of the common hardware datapath, a single circuit is used that can be reconfigured between the different metrics.

A compressor tree 211 adds the outputs of each of the data element distance calculators. For Euclidean distance over a 256-dimensional vector, this output is a 13-bit value. The output of the compressor tree 211 is sent to a shifter 209. Typically, this shifter is a right shifter; however, depending on the bit ordering, it may be a left shifter. In most embodiments, the shift amount is controlled by the local control circuit 207. The shifter aligns the partial distance to the correct significance relative to the accumulated distance.
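
The 5-bit per-element output and the 13-bit compressor tree output quoted in this description can be checked with a short calculation. The snippet below is only a sanity check under the stated 2-bit component and 256-element assumptions; the 10-bit Manhattan figure it also computes is derived here and is not stated in the text.

```python
ELEMENTS = 256        # vector dimensions in the example
COMPONENT_MAX = 0b11  # largest value of a 2-bit component

# Manhattan: each element contributes one 2-bit component per iteration.
manhattan_per_element = COMPONENT_MAX
# Euclidean: each element contributes the sum of a pair of 2b x 2b products.
euclidean_per_element = 2 * COMPONENT_MAX * COMPONENT_MAX


def bits_needed(max_value):
    """Number of bits required to represent max_value as an unsigned integer."""
    return max_value.bit_length()


euclidean_element_bits = bits_needed(euclidean_per_element)        # 18 -> 5 bits
euclidean_tree_bits = bits_needed(ELEMENTS * euclidean_per_element)  # 4608 -> 13 bits
manhattan_tree_bits = bits_needed(ELEMENTS * manhattan_per_element)  # 768 -> 10 bits
```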

The flip-flop 213 stores the accumulated distance from the previous iteration, and the output of the adder 215 is the accumulated distance in the current iteration. At the start of the next iteration, this value is written into the flip-flop 213. The selector 219 selects the 2b from the accumulated distance based on the global pointer. It also selects the carry-out, at that 2b position, produced by adding the partial sum to the previous accumulated distance.

The local control circuit 207 takes psumi as an input and modifies only its upper bit before passing it to the sort network 107 as a psum value (e.g., 3 bits). A valid bit is also passed to the minimum sort network 107. FIG. 5 illustrates an exemplary local control circuit according to an embodiment. As noted above, the local control circuit controls several different aspects of the vector partial distance computation circuit. This circuit receives a global pointer from a global control unit (detailed below), psumi from the selector 219, and a minimum sum, address, and precise indication from the minimum sort network 107.

As shown, the local control circuit takes the address of the vector being processed and a minimum address from the minimum sort network 107 and uses comparison circuitry to determine whether they are equal. The output of the comparison is logically ANDed with the minimum precise signal from the minimum sort network 107 to help determine whether the object vector needs no further processing. In particular, this output is used in computing a valid bit, such as through the AND gate shown. The local control circuit uses the minimum sum, psumi, and the global pointer to generate a remove signal and local compute signals (for controlling the data element calculator circuits 205), as shown. Processing of a vector can stop for one of two reasons: 1) the current vector is declared a winner, or 2) the current vector is removed from the search space because it is guaranteed not to be the nearest neighbor. The description above covers the former case, which asserts the "done" signal. The inverse of this signal (shown as passing through a bubble into the AND gate 513) affects the valid signal: if "done" is asserted, the valid signal is deasserted. The remaining logic affecting valid determines whether the vector is not done but is still part of the search space. The remove signal indicates that the vector was removed from the search space in the current iteration, and this information is used by the global control. The circuit shown in FIG. 5 also generates a clock from the global CLK signal to clock the storage elements (including 213). The local control circuit also receives "compute control" from the global control circuit, which it passes on to the partial distance computation 205 and the shifter circuit 209.

In essence, the local control circuit provides local state control for each vector, alongside the global state control across all vectors, to keep track of the distance computation state and the iteration at which each vector was removed, so that earlier distance computations and comparisons can be reused when computing a sorted list for k > 1.

部分距離計算和排序迭代被交錯地如在圖6和7中所示。圖6根據一實施例圖示出一示例性曼哈坦距離計算與排序處理,而圖7根據一實施例圖示出一示例性資料元件 歐幾里得距離計算與排序處理。在這些圖示中該等字母(a、b、c、和d)是在該等查詢和參考向量之8b元件之間該絕對差的2b成分。如圖所示,該典型的程序是由向量部分距離計算電路103_0到103_N進行一計算迭代隨後跟著在該最小排序網路107中的一排序。然而,有些時候在連續排序迭代之間可以沒有計算迭代諸如在圖7中所示。 Partial distance calculations and sorting iterations are interleaved as shown in Figures 6 and 7. 6 illustrates an exemplary Manhattan shift calculation and sorting process in accordance with an embodiment, and FIG. 7 illustrates an exemplary data component in accordance with an embodiment. Euclidean distance calculation and sorting processing. In these figures the letters (a, b, c, and d) are the 2b components of the absolute difference between the 8b elements of the queries and reference vectors. As shown, the typical procedure is a computational iteration by vector partial distance calculation circuits 103_0 through 103_N followed by a ranking in the minimum ordering network 107. However, there may be times when there is no computational iteration between consecutive sorting iterations as shown in Figure 7.

該最小排序網路107進行基於窗口的排序。特別的是,這種排序網路處理從一部分計算出距離之最高有效位元(MSB)開始的一個小得多的位元窗口,使得進一步的部分計算可用小得多的比較器電路以及可做向量候選者的早先移除。 The minimum ordering network 107 performs window based ordering. In particular, this sorting network handles a much smaller bit window starting from a portion of the most significant bit (MSB) of the calculated distance, allowing further partial calculations to be used with much smaller comparator circuits and The earlier removal of the vector candidate.

例如,在一些實施例中,該排序網路在每次迭代中僅處理該從MSB到LSB累積向量距離之一2位元的窗口(圖6和7)。這使得可用非常低的硬體複雜度來有一種高度的平行化。由於在較低位元的該計算精煉可以影響在該MSB中已處理的位元最多為1,該排序網路107還需要處理產生自在目前2b窗口該計算迭代的該進位輸出。因此,該排序網路107比較,舉例來說,在每一個節點109和111之3b的數字(進位輸出與2b之和)。就比較而言,對於一256維向量距離比較(每元件8b),一常規排序網路在每一個節點會需要24b的比較器。 For example, in some embodiments, the sequencing network processes only one window of one bit from the MSB to the LSB cumulative vector distance in each iteration (Figures 6 and 7). This allows for a high degree of parallelism with very low hardware complexity. Since this computational refinement at lower bits can affect the number of bits processed in the MSB up to one, the sequencing network 107 also needs to process the carry output generated from the current iteration of the current 2b window. Thus, the sorting network 107 compares, for example, the number of 3b at each node 109 and 111 (the sum of the carry output and 2b). In comparison, for a 256-dimensional vector distance comparison (8b per component), a conventional sorting network would require a 24b comparator at each node.

The sort network 107 globally broadcasts the resulting minimum 3b value. The local control for each individual vector distance computation compares its own 3b psum (the carry-out and the 2b sum) against the broadcast result to determine whether that particular vector can be eliminated from further distance refinement computation and comparison, using the global control circuit 105 and the local control circuit.

Since lower-order computation in a future iteration can affect the currently processed window by 1, all 3b comparisons in the local control and in the sort network 107 require a difference of more than 1 before a candidate is eliminated. For the same reason, the local control also takes into account whether a particular vector was greater than the minimum by 1 in a previous iteration. Using a precise signal, the sort network 107 indicates whether the minimum it found is unique. Sort iterations continue until a unique nearest vector is found or the LSB is reached. Feedback from the sort network eliminates candidate vectors from further distance computation and comparison, resulting in up to a 3x reduction in computation.
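The elimination rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name and arguments are hypothetical, each psum is treated as a plain integer, and the dependence on a previous iteration's comparison is deliberately left out.

```python
def eliminate_candidates(psums, valid, margin=1):
    """One sort iteration: return (new_valid, minimum, precise).

    psums  -- windowed partial sums (carry-out plus window bits), one per vector
    valid  -- True for vectors still in the search space
    margin -- a vector is eliminated only if it exceeds the minimum by more than this
    """
    active = [p for p, v in zip(psums, valid) if v]
    if not active:
        return list(valid), None, False
    minimum = min(active)
    # Keep a vector only if it is valid and within the margin of the minimum.
    new_valid = [bool(v) and (p - minimum) <= margin for p, v in zip(psums, valid)]
    # The minimum is "precise" when no other surviving vector is within the margin.
    precise = sum(new_valid) == 1
    return new_valid, minimum, precise


print(eliminate_candidates([3, 4, 7, 5, 3], [True] * 5))
print(eliminate_candidates([6, 2, 7, 4], [True] * 4))
```

In the first call the vectors with psums 7 and 5 are dropped while 3, 4, and 3 survive, so the minimum is not yet unique. In the second call only the vector with psum 2 survives.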

Figure 8 illustrates an exemplary sort process using partial distances, according to an embodiment. Because vector candidates are eliminated while the best candidate is being found, their local state control stores the iteration at which they were eliminated (as shown by the 1ptr signal in Figure 5 and by the boxes in Figure 8).

At the same time, as the global control pointer moves forward (toward the LSB), a 1 is written into the corresponding bit position of a global binary mask (stored in the global control circuit 105) if even a single vector is discarded. After the first vector is found, the global binary mask indicates to the global control logic 105 where the global pointer needs to jump back to in order to reach the set of vectors that may contain the next nearest neighbor. The process continues iteratively and is illustrated for the second and third nearest-neighbor searches in Figure 8. When the global pointer has jumped back toward the MSB, only those vectors whose stored iteration state matches the global pointer position become active. Vectors closer to the nearest neighbor will have been eliminated at a closer global pointer position. Compared with a conventional sorting approach, which simply removes the nearest neighbor and repeats the entire computation and sort process from scratch, this technique of maintaining state has three major advantages: (a) it reuses the partial distance computations already completed while finding the previous sort ranks, (b) it reduces the number of vectors that need to be compared by exploiting comparisons already computed, and (c) it does not require k to be predefined in order to minimize computation and comparison for any sort rank.

The benefit of this computation and comparison reuse can be quantified by the incremental cost of finding the next nearest neighbor among X (e.g., 256) vectors (e.g., 256 8b elements per vector) after 3 nearest neighbors have already been found. A conventional sorting technique would have to find the nearest vector among the remaining 253 candidate vectors, whereas the proposed control reduces this search space by 19X and the associated computation by 20X.

In this illustration, a 2-bit window of the partial distances of the vectors is processed. The arrows pointing to the right are the valid bits from the local control circuits. In cycle 0, the seventh comparison is turned off because its value exceeds the minimum by more than 1, so that vector can be eliminated. This elimination is stored in the local control circuit. The process then proceeds as detailed above. If all vectors underwent all partial-sum computations, the resulting distances would match the full distances, which are shown at the far left for reference only.

Figure 9 illustrates an exemplary global control circuit according to an embodiment. As shown in Figure 1, the global control circuit 105 receives the minimum precise, address, and sum values from the minimum sort network 107. The minimum precise signal and a signal indicating that the global pointer is at the LSB position are logically ORed and used as a select signal for the global pointer output, which is either the previous global pointer incremented by one or a value (priority) encoded from a global binary mask, as shown. The global binary mask is built from an OR tree of the elimination signals received from the local control circuits 207. The computation control signals are obtained from a lookup table, with the global pointer serving as the index into the table. As long as neither a unique minimum has been found nor the LSB has been reached, the global pointer keeps moving toward the LSB on each iteration (the pointer is incremented by 1 to achieve this). This condition is tested by OR gate 901. Otherwise, the pointer rolls back to the nearest 1 in the binary mask in order to find the next nearest neighbor. If even one vector is eliminated, a 1 is written to the pointer position in the binary mask; otherwise a 0 is written. The OR tree 907 detects whether even one vector was eliminated (the elimination signals are generated by all of the individual local control circuits). The demultiplexer that follows uses the global pointer to set the input for the appropriate position to 1, and this value is written into the global binary mask (held in storage 903) when the next iteration begins (rising edge of CLK). The position of the nearest 1 is computed by the priority encoder 905. The computation control broadcast to all vectors is based on the pointer position. This control can be made programmable by storing the control signals in a lookup table 913, from which the appropriate control signals are read out based on the pointer.
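The pointer and mask updates described for Figure 9 can be summarized in software form. The sketch below is only an illustration: the function names are hypothetical, the mask is modeled as a Python list indexed from the MSB window (position 0) toward the LSB window, and the fallback to position 0 when the mask holds no 1 is an assumption that the text does not specify.

```python
def update_mask(mask, global_ptr, eliminate_signals):
    """Model of the OR tree plus demultiplexer: record whether any vector
    was eliminated at the current pointer position."""
    new_mask = list(mask)
    new_mask[global_ptr] = any(eliminate_signals)
    return new_mask


def next_global_pointer(global_ptr, min_precise, lsb_pos, mask):
    """Model of the pointer select: advance toward the LSB, or roll back."""
    at_lsb = global_ptr == lsb_pos
    if min_precise or at_lsb:
        # Priority encode: nearest 1 in the mask at or before the current position.
        for pos in range(global_ptr, -1, -1):
            if mask[pos]:
                return pos
        return 0  # assumption: restart from the MSB window if no 1 is recorded
    return global_ptr + 1


mask = [False, True, False, True, False, False, False, False]
print(next_global_pointer(2, False, 7, mask))  # keeps moving toward the LSB
print(next_global_pointer(5, True, 7, mask))   # unique minimum found: roll back
```

The sketch treats the mask write and the pointer update as separate calls. In the described circuit the mask write takes effect at the start of the next iteration.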

Looking at the minimum sort network 107 in more detail, there are two types of compare nodes: level-0 nodes and level-"k" nodes. Figure 10 illustrates an exemplary level-0 compare node circuit according to an embodiment. As shown, the circuit accepts valid bits, which indicate whether a psum comes from a vector that is part of the search space. If the valid bit accompanying a psum is 0, that psum is ignored in the comparison at the node.

The adjacent valid bits are logically ORed to provide a level-0 valid bit. The valid bits are also XORed, and the result is ORed with a signal indicating that the absolute difference between the input sums is greater than a threshold ε, to produce a precise bit. A precise bit of "1" means that no other vector is close. Finally, the adjacent sums are also compared with each other, and the result is ANDed with one of the valid bits to form an address and to select which sum is output. The overall output of the level-0 compare node is an address, a valid bit, a precise bit, and a sum. The output valid bit indicates whether the result is valid (at least one of the input valid bits must be true to satisfy this condition). The comparison result is added at the most significant position of the address of the minimum vector found (bit [0] in this case, since this is the first compare level). The output precise signal indicates that the two vectors are neither equal nor close (if ε is 1, the difference is greater than 1; if ε is 0, the difference is greater than 0). If only one of the inputs is valid, XOR 1003 asserts the precise signal regardless of the comparison result, since the comparison is irrelevant when one of the sums is invalid. The comparison result passes the smaller sum, along with its input address, to the next node.

Figure 11 illustrates an exemplary level-k compare node circuit according to an embodiment. The circuit accepts the adjacent outputs of the preceding level (e.g., level 0), each consisting of an address, a valid bit, a precise bit, and a sum, and supplies them to the circuit shown. The operation is similar to that shown in Figure 10. The comparison result of the sum signals now also selects between the incoming precise signals, and the selected precise signal is ANDed with the precise signal computed at this node to produce the output precise signal. The output precise signal indicates whether the output sum is unique, that is, whether it is the minimum among all vectors from level 0 onward and lies beyond the ε bound of any nearest vector.
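The behavior of the two node types can be modeled with a single function, as in the following sketch. Several simplifications are assumptions made for illustration only: the names are hypothetical, each leaf is seeded with a precise bit of 1 so that the first level behaves like a level-0 node, the address is carried as a whole vector index rather than being built one bit per level, and the selection logic is written so that an invalid input is never chosen, which is only one way to realize the single AND with a valid bit described above.

```python
from dataclasses import dataclass


@dataclass
class NodeOut:
    addr: int      # index of the vector that supplied the smaller psum
    valid: bool    # at least one input was valid
    precise: bool  # the minimum beats every other valid input by more than eps
    psum: int      # the smaller windowed partial sum


def compare_node(a, b, eps=1):
    """Compare two adjacent inputs and forward the smaller valid one."""
    valid = a.valid or b.valid
    pick_b = b.valid and (not a.valid or b.psum < a.psum)
    chosen = b if pick_b else a
    # XOR of the valid bits, ORed with "difference exceeds eps".
    local_precise = (a.valid != b.valid) or abs(a.psum - b.psum) > eps
    # AND the selected incoming precise bit with the locally computed one.
    return NodeOut(chosen.addr, valid, chosen.precise and local_precise, chosen.psum)


def min_sort_network(psums, valids, eps=1):
    """Reduce all inputs pairwise, level by level, to a single minimum."""
    level = [NodeOut(i, v, True, p) for i, (p, v) in enumerate(zip(psums, valids))]
    while len(level) > 1:
        if len(level) % 2:  # pad odd levels with an invalid input
            level.append(NodeOut(-1, False, True, 0))
        level = [compare_node(level[i], level[i + 1], eps)
                 for i in range(0, len(level), 2)]
    return level[0]


print(min_sort_network([5, 2, 7, 6], [True] * 4))
print(min_sort_network([5, 3, 7, 3], [True] * 4))
```

In the first call the minimum 2 is more than ε away from every other input, so the precise bit is set. In the second call two inputs tie at 3, so it is cleared.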

Different embodiments of the kNN accelerator described above add flexibility and/or broaden the application space that benefits from the accelerator. For example, in some embodiments, to enable computation on vector elements larger than 8b, the distance computation circuits designed for 8b elements can be reused for 16b elements by combining pairs of adjacent 8b element circuits. Figure 12 illustrates an exemplary 8-bit/16-bit reconfigurable computation circuit according to an embodiment. In this circuit, the control broadcasts two sets of select signals for the even-numbered and odd-numbered 8b computation circuits. For a fixed circuit and storage size, either the vector dimension or the number of stored vectors is halved when operating in 16b mode. The number of iterations required to compute a complete sum of squares increases from 6 (for 8b elements) to 15 in 16b mode. Multiple computation iterations may be required between consecutive sort iterations to ensure that higher-order bits are not affected by more than 1 when lower-order bits are processed. Even in 16b mode, accelerator sorting based on partial computation greatly reduces the computation needed to find the nearest neighbors. Figure 13 illustrates an exemplary partial distance computation for the sum of squares of 16-bit elements according to an embodiment. In some embodiments, only a 16b width or reconfiguration is used. Other bit widths or reconfigurations may of course also be used.

In some embodiments, the kNN accelerator can be reconfigured to support larger vector dimensions, with additional stages in the compressor tree of the distance computation unit to add in the results from other distance computation unit blocks. As a consequence, the number of stored vectors decreases as the dimension of each vector increases.

In some embodiments, the functionality of the kNN accelerator is extended to operate on data sets that are larger than the accelerator's storage capacity. The sorted k-nearest candidates from the database stored in the accelerator are computed first. The eliminated candidates are then replaced by any remaining object descriptors from memory, and the process continues until all object candidates have been iterated through to find the overall k-nearest descriptor vectors. For an accelerator with a capacity of 256 objects, where each object feature descriptor is a 256-dimensional vector (8b per dimension), the accelerator consistently reduces the sum-of-squares computation for a sorted list of the 16 nearest candidates across object database sizes from 512 to 2048 objects.
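The batching scheme described above can be pictured with the following Python sketch. It is a software analogy only: the function name and parameters are hypothetical, distances are assumed to be available as plain numbers, and an ordinary sort stands in for the accelerator's k-nearest search over the objects currently held in storage.

```python
def knn_over_large_dataset(distances, k, capacity):
    """Find the k nearest items when only `capacity` items fit at once."""
    assert capacity > k, "room is needed to load new candidates next to the kept ones"
    kept = []   # (distance, index) pairs of the current k-nearest candidates
    pos = 0
    while pos < len(distances):
        free = capacity - len(kept)
        # Refill the slots vacated by eliminated candidates with unseen objects.
        batch = [(d, i) for i, d in enumerate(distances[pos:pos + free], start=pos)]
        pos += len(batch)
        # Stand-in for the accelerator's sorted k-nearest search over its storage.
        kept = sorted(kept + batch)[:k]
    return [i for _, i in kept]


print(knn_over_large_dataset([9, 1, 8, 2, 7, 3, 6, 4], k=3, capacity=4))
```

Each pass keeps the current k best candidates resident and refills only the slots vacated by eliminated candidates, so every object is loaded exactly once.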

In some embodiments, in addition to finding vectors by minimum distance, the accelerator can be reconfigured to find vectors in descending order of distance by inverting the output of the 3b comparator circuits in the compare nodes of the sort network. Alternatively, the descending order is computed by subtracting the accumulated partial distance from the maximum possible distance and then processing the resulting numbers with the same window-based minimum sort network.

In some embodiments, various distance metrics can be accommodated by reconfiguring only the 1D distance circuits in the network. Besides the Euclidean and Manhattan distances, another popular metric for finding the nearest match to a vector is cosine similarity, which uses the angular distance between vectors to find the nearest match. The cosine of the angle between two vectors A and B is computed as $\cos\theta = \dfrac{\sum_i a_i b_i}{\left(\sum_i a_i^2\right)^{1/2}\left(\sum_i b_i^2\right)^{1/2}}$, and a smaller angle yields a larger cosine. For cosine-based similarity, no normalization is needed if the stored database has already been normalized, and the optimization then reduces to finding the vector whose dot product $\sum_i a_i b_i$ has the largest magnitude. The dot product between the query and the stored objects can be partially computed using the existing 2b multipliers used for the Euclidean metric.
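The reduction from cosine similarity to a plain dot product can be illustrated with a small floating-point example. The vectors and function names below are made up for illustration, and the sketch says nothing about how the 2b multipliers compute the products. Note that the query itself does not need to be normalized, since its norm is a common factor across all candidates.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


query = [1.0, 1.0]
references = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]

# Nearest match by cosine similarity on the raw vectors.
best_by_cosine = max(range(len(references)), key=lambda i: cosine(query, references[i]))

# Same search as a dot product once the stored vectors are normalized.
unit_refs = [normalize(r) for r in references]
best_by_dot = max(range(len(references)), key=lambda i: dot(query, unit_refs[i]))

print(best_by_cosine, best_by_dot)
```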

Figure 14 illustrates a cosine similarity computation (1D distance) circuit according to an embodiment, together with an exemplary partial distance computation for the dot product according to an embodiment. Multiple computation iterations may be required between consecutive sort iterations to ensure that higher-order bits are not affected by more than 1 when lower-order bits are processed. For dot products with signed elements, each computation iteration requires two steps: first, all positive products are added to the accumulated partial distance, and then the sum of all negative products is subtracted from the accumulated partial distance.
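The two-step handling of signed products can be written out as follows. This is a sketch of the arithmetic only: the function name is hypothetical, and full products are used in place of the per-iteration partial products that the circuit would generate.

```python
def signed_dot_two_step(a, b, accumulated=0):
    """Accumulate a signed dot product in two unsigned steps."""
    products = [x * y for x, y in zip(a, b)]
    # Step 1: add all positive products to the accumulated partial distance.
    accumulated += sum(p for p in products if p > 0)
    # Step 2: subtract the summed magnitudes of all negative products.
    accumulated -= sum(-p for p in products if p < 0)
    return accumulated


print(signed_dot_two_step([1, -2, 3], [4, 5, -6]))
```

Splitting the products by sign means each step only ever sums non-negative magnitudes.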

In some embodiments, as the iterations proceed, candidate vectors can also be eliminated early by comparing the accumulated partial distance against a predetermined absolute threshold. In addition, the declaration of a winning vector need not be exact, and the iterations for picking a winner can be stopped earlier based on a predetermined relative accuracy (using the global pointer position) or on the absolute accumulated partial distance. Such schemes can reduce energy consumption for algorithms optimized to perform approximate nearest neighbor (ANN) searches.

Figure 15 illustrates an exemplary method of kNN search according to an embodiment, using embodiments of the kNN accelerator detailed above. At a high level, the method of kNN search comprises computing partial distances, accumulating those distances, and sorting the accumulated distances in an interleaved manner. A more detailed description of this procedure follows.

In some embodiments, one or more variables are reset: for example, an accumulated distance for each reference vector, a global pointer, a global binary mask, a k value, a valid bit for each reference vector (set to 1), a "done" bit for each reference vector (set to 0), and a local pointer for each reference vector.

At 1501, for each reference vector and for each element of that reference vector, an absolute difference between the element and the corresponding element of the query vector is computed.

At 1503, a compare threshold is set based on the global pointer.

At 1505, a determination is made as to whether a partial distance is to be computed. When the partial distance should be computed, at 1507, for each reference object vector whose valid bit is set to 1 (indicating valid), a partial distance is computed (e.g., using the circuits of Figures 3 and 4 and compressor tree 211), shifted, and added to an accumulated distance.

At 1509, when the partial distance should not be computed, or after 1507 has occurred, for each reference object vector whose valid bit is set to 1 (indicating valid), a global-pointer-dependent subset of the bits of the accumulated distance (the psum) is sent to the minimum sort network.

At 1511, the sort network finds a global minimum and a second minimum.

At 1513, a determination is made as to whether the second minimum exceeds the global minimum by more than the set threshold. If so, precise is set to 1. At 1515, when precise is 1, or when the global pointer is at the LSB of the accumulated distance, minimum-found is set to 1.

At 1517, typically in parallel with 1513, for each reference object vector whose valid bit equals 1, the psum is compared with the global minimum based on the set threshold and on the comparison from a previous iteration. Based on this comparison, the valid bit is updated: it either remains 1 or is invalidated to 0. If the valid bit is updated to 0 in the current iteration, the current global pointer is written to the local pointer storage associated with that reference vector, and a 1 is written into the global binary mask at the global pointer position.

At 1519, a determination is made as to whether minimum-found equals 1. If so, k is incremented by 1 at 1521. In addition, for the global minimum vector, done is set to 1 and valid is set to 0 at 1521. If not, the global pointer is incremented by 1 at 1527 and the compare threshold is set again.

After k is incremented, at 1523 the global pointer is decremented to the position of the nearest 1 in the global binary mask. In essence, the global pointer rolls back to the last position at which a reference object vector was eliminated from the search space.

At 1525, for each reference object vector, if the local pointer is greater than or equal to the global pointer and the done bit equals 0, the valid bit is set to 1 and the compare threshold is set again. This re-inserts the reference vector into the search space when the next nearest vector is to be computed.
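The bookkeeping in steps 1509 through 1527 (valid and done bits, local pointers, the global binary mask, and the rollback of the global pointer) can be exercised with the simplified software model below. It is a sketch under explicit assumptions, not the described method itself: the distances are taken as already fully computed, so no carry can arrive from lower-order bits and the compare threshold is effectively 0, pointer position 0 denotes the MSB window, and the function and variable names are hypothetical.

```python
def knn_windowed_search(distances, k, total_bits=8, window_bits=2):
    """Return indices of the k smallest distances, nearest first."""
    assert total_bits % window_bits == 0
    assert all(0 <= d < (1 << total_bits) for d in distances)
    n = len(distances)
    last_pos = total_bits // window_bits - 1   # pointer position of the LSB window
    valid = [True] * n                         # vector is in the search space
    done = [False] * n                         # vector already reported
    local_ptr = [0] * n                        # iteration at which it was eliminated
    mask = [False] * (last_pos + 1)            # global binary mask
    global_ptr = 0
    found = []

    def psum(i, pos):
        # Window of the distance selected by the pointer (MSB window first).
        shift = total_bits - window_bits * (pos + 1)
        return (distances[i] >> shift) & ((1 << window_bits) - 1)

    while len(found) < min(k, n):
        active = [i for i in range(n) if valid[i]]
        window = {i: psum(i, global_ptr) for i in active}
        minimum = min(window.values())
        eliminated_any = False
        for i in active:                       # compare each psum with the minimum
            if window[i] > minimum:
                valid[i] = False
                local_ptr[i] = global_ptr
                eliminated_any = True
        mask[global_ptr] = eliminated_any
        survivors = [i for i in active if valid[i]]
        if len(survivors) == 1 or global_ptr == last_pos:   # minimum found
            winner = survivors[0]
            found.append(winner)
            done[winner] = True
            valid[winner] = False
            # Roll back to the nearest 1 in the mask (position 0 if none is set).
            set_positions = [p for p in range(global_ptr + 1) if mask[p]]
            global_ptr = set_positions[-1] if set_positions else 0
            for i in range(n):                 # re-insert eligible vectors
                if not done[i] and local_ptr[i] >= global_ptr:
                    valid[i] = True
        else:
            global_ptr += 1                    # refine with the next lower window
    return found


dists = [200, 17, 17, 255, 0, 128, 64, 65]
order = knn_windowed_search(dists, k=len(dists))
print([dists[i] for i in order])
```

Because each eliminated vector keeps its local pointer, the search for the next neighbor resumes at the window where the candidates last differed instead of restarting from the MSB.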

Although the sorting and computation described above are performed in parallel for all reference vectors, these operations can instead be performed sequentially on different vectors using the same circuits in order to save area.

The systems, methods, and apparatuses described above can be used to provide many advantages. The distance between vectors is computed iteratively such that the accuracy of the computed distance improves from MSB to LSB over successive iterations. In each iteration, the computation of a partial distance for a vector serves to improve the accuracy of the full (accumulated) distance at a particular significance or bit position. The full distance computation is decomposed into several partial distance computations for different metrics, such as Euclidean, Manhattan, or dot product, such that once the higher-order bits have been computed, accuracy improvements at lower-order bit positions in subsequent iterations never change the higher-order bits beyond a certain threshold.

The above is accomplished using (i) partial distance computation circuits with 1D computation circuits, which use control signals to compute the correct partial distance and are arranged according to the dimensionality of the vector; (ii) a compressor tree that sums the partial distances of all the 1D computations; and (iii) an accumulator that has storage for the current accumulated distance, to which the partial distance is added at the appropriate significance using a shifter.

Sorting of these accumulated vector distances need not wait until the full distances have been computed; sorting can begin with low-accuracy distances. The sort does not take all bits of the accumulated distances into account. It is done iteratively, starting with only a small window of bits and moving from MSB to LSB. The sort network uses a programmable threshold (1 or 0 in the exemplary case) to declare whether, at each comparison and across the entire sort network in an iteration, the minimum found is smaller than any other number by more than the threshold.

The computation and sorting are interleaved from MSB to LSB, so that many reference vectors are eliminated from the search space using low-accuracy distance computations. Only the remaining vectors proceed to the next iteration, which improves the accuracy of the lower-order bits for determining the nearest neighbor.

The computation associated with each vector has local control, which uses the results of the sort network to determine whether computation and sorting for that vector proceed to the next iteration or whether the vector is eliminated from the search space.

The local control and the distance accumulator of each vector computation retain their state even after the vector has been eliminated from the search space. When the next nearest neighbor is being found, the local control can re-insert the vector into the search space (based on the global pointer) and reuse any prior computation up to the previous elimination point.

Global control coordinates which bits of the accumulated distances are sent to the sort network, using a global pointer that is broadcast to all vectors.

Control signals for the iteration-dependent partial distance computation are also broadcast from the global control to all vectors. These control signals can be stored in a programmable lookup table referenced by the global pointer or implemented as fixed-function logic.

Global control records the iterations at which vectors were eliminated from the search space while a nearest neighbor was being found. To find the next nearest neighbor, the global control jumps back to the most recent iteration state at which vectors were eliminated and begins the search with only those eliminated vectors in the search space.

The kNN accelerator can be made programmable to change the sort order.

Any bit size, dimensionality, or number of vectors can be supported. Additionally, in some embodiments, the kNN accelerator is programmable such that the bit size of each dimension, the number of dimensions, or the number of reference vectors can be programmed.

The operation can be serialized such that computation and sorting for different reference vectors are accomplished using shared partial distance computation and sort circuits.

Exemplary Core Architectures, Processors, and Computer Architectures

Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more high performance general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.

Exemplary Core Architectures

In-Order and Out-of-Order Core Block Diagram

Figure 16A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention. Figure 16B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention. The solid lined boxes in Figures 16A-B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.

In Figure 16A, a processor pipeline 1600 includes a fetch stage 1602, a length decode stage 1604, a decode stage 1606, an allocation stage 1608, a renaming stage 1610, a scheduling (also known as dispatch or issue) stage 1612, a register read/memory read stage 1614, an execute stage 1616, a write back/memory write stage 1618, an exception handling stage 1622, and a commit stage 1624.

Figure 16B shows a processor core 1690 that includes a front end unit 1630 coupled to an execution engine unit 1650, both of which are coupled to a memory unit 1670. The core 1690 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1690 may be a special-purpose core, such as, for example, a network or communication core, a compression engine, a coprocessor core, a general purpose computing graphics processing unit (GPGPU) core, a graphics core, or the like.

The front end unit 1630 includes a branch prediction unit 1632 coupled to an instruction cache unit 1634, which is coupled to an instruction translation lookaside buffer (TLB) 1636, which is coupled to an instruction fetch unit 1638, which is coupled to a decode unit 1640. The decode unit 1640 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1640 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), and the like. In one embodiment, the core 1690 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in the decode unit 1640 or otherwise within the front end unit 1630). The decode unit 1640 is coupled to a rename/allocator unit 1652 in the execution engine unit 1650.

The execution engine unit 1650 includes the rename/allocator unit 1652 coupled to a retirement unit 1654 and a set of one or more scheduler unit(s) 1656. The scheduler unit(s) 1656 represents any number of different schedulers, including reservation stations, a central instruction window, and the like. The scheduler unit(s) 1656 is coupled to the physical register file(s) unit(s) 1658. Each of the physical register file(s) units 1658 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), and the like. In one embodiment, the physical register file(s) unit 1658 comprises a vector register unit, a write mask register unit, and a scalar register unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers.

The physical register file(s) unit(s) 1658 is overlapped by the retirement unit 1654 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using reorder buffer(s) and retirement register file(s); using future file(s), history buffer(s), and retirement register file(s); using a register map and a pool of registers; and the like). The retirement unit 1654 and the physical register file(s) unit(s) 1658 are coupled to the execution cluster(s) 1660. The execution cluster(s) 1660 includes a set of one or more execution units 1662 and a set of one or more memory access units 1664. The execution units 1662 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions.

The scheduler unit(s) 1656, physical register file(s) unit(s) 1658, and execution cluster(s) 1660 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each of which has its own scheduler unit, physical register file(s) unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 1664). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 1664 is coupled to the memory unit 1670, which includes a data TLB unit 1672 coupled to a data cache unit 1674, which is in turn coupled to a level 2 (L2) cache unit 1676. In one exemplary embodiment, the memory access units 1664 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1672 in the memory unit 1670. The instruction cache unit 1634 is further coupled to the level 2 (L2) cache unit 1676 in the memory unit 1670. The L2 cache unit 1676 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1600 as follows: 1) the instruction fetch 1638 performs the fetch and length decoding stages 1602 and 1604; 2) the decode unit 1640 performs the decode stage 1606; 3) the rename/allocator unit 1652 performs the allocation stage 1608 and the renaming stage 1610; 4) the scheduler unit(s) 1656 performs the schedule stage 1612; 5) the physical register file(s) unit(s) 1658 and the memory unit 1670 perform the register read/memory read stage 1614; the execution cluster 1660 performs the execute stage 1616; 6) the memory unit 1670 and the physical register file(s) unit(s) 1658 perform the write back/memory write stage 1618; 7) various units may be involved in the exception handling stage 1622; and 8) the retirement unit 1654 and the physical register file(s) unit(s) 1658 perform the commit stage 1624.
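The stage-to-unit assignment listed above can be written down as a small lookup table. The following Python sketch is illustrative only: the reference numerals come from the text, while the dictionary layout, the name `PIPELINE_1600`, and the helper `stages_for` are assumptions introduced here and are not part of the described embodiments.

```python
# Stage name -> (stage reference numeral, units that carry the stage out).
# Numerals follow the description; the data layout is a hypothetical choice.
PIPELINE_1600 = {
    "fetch": (1602, ["instruction fetch 1638"]),
    "length decode": (1604, ["instruction fetch 1638"]),
    "decode": (1606, ["decode unit 1640"]),
    "allocation": (1608, ["rename/allocator unit 1652"]),
    "renaming": (1610, ["rename/allocator unit 1652"]),
    "schedule": (1612, ["scheduler unit(s) 1656"]),
    "register read/memory read": (
        1614, ["physical register file(s) unit(s) 1658", "memory unit 1670"]),
    "execute": (1616, ["execution cluster 1660"]),
    "write back/memory write": (
        1618, ["memory unit 1670", "physical register file(s) unit(s) 1658"]),
    "exception handling": (1622, ["various units"]),
    "commit": (
        1624, ["retirement unit 1654", "physical register file(s) unit(s) 1658"]),
}


def stages_for(unit_numeral):
    """Return the stage numerals in which the unit with this numeral takes part."""
    suffix = str(unit_numeral)
    return [stage for stage, units in PIPELINE_1600.values()
            if any(unit.endswith(suffix) for unit in units)]
```

For instance, `stages_for(1652)` lists the two stages handled by the rename/allocator unit, which makes it easy to see that a single unit may serve more than one stage.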

The core 1690 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California), including the instruction(s) described herein. In one embodiment, the core 1690 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding followed by simultaneous multithreading, such as in the Intel® Hyper-Threading Technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1634/1674 and a shared L2 cache unit 1676, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

Specific Exemplary In-Order Core Architecture

Figures 17A-B illustrate a block diagram of a more specific exemplary in-order core architecture, in which the core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. The logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed function logic, memory I/O interfaces, and other necessary I/O logic, depending on the application.

Figure 17A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 1702 and its local subset of the level 2 (L2) cache 1704, according to embodiments of the invention. In one embodiment, an instruction encoder 1700 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 1706 allows low-latency accesses to cache memory into the scalar and vector units. While in one embodiment (to simplify the design) a scalar unit 1708 and a vector unit 1710 use separate register sets (respectively, scalar registers 1712 and vector registers 1714) and data transferred between them is written to memory and then read back in from a level 1 (L1) cache 1706, alternative embodiments of the invention may use a different approach (e.g., use a single register set or include a communication path that allows data to be transferred between the two register files without being written and read back).

The local subset of the L2 cache 1704 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1704. Data read by a processor core is stored in its L2 cache subset 1704 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1704 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional to allow agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data path is 1012 bits wide per direction.
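The read and write behaviour of the per-core L2 subsets can be pictured with a toy software model. The sketch below is only an illustration under stated assumptions: the class name `L2Subsets`, the use of dictionaries as cache storage, and the write-through update of `memory` are all hypothetical simplifications, and the ring network and coherency protocol of the embodiment are not modelled.

```python
class L2Subsets:
    """Toy model of a global L2 cache split into one local subset per core."""

    def __init__(self, num_cores):
        # One dictionary (address -> value) per processor core.
        self.subsets = [dict() for _ in range(num_cores)]

    def read(self, core, addr, memory):
        # Data read by a core is stored in that core's own local subset.
        subset = self.subsets[core]
        if addr not in subset:
            subset[addr] = memory[addr]
        return subset[addr]

    def write(self, core, addr, value, memory):
        # Data written by a core is flushed from the other subsets
        # (if they hold a copy) and stored in the writer's own subset.
        for other, subset in enumerate(self.subsets):
            if other != core:
                subset.pop(addr, None)
        self.subsets[core][addr] = value
        memory[addr] = value  # write-through: an assumption of this sketch
```

After core 0 writes an address that core 1 had cached, core 1's stale copy is gone and its next read fetches the new value, which is the effect the flush step is meant to achieve.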

Figure 17B is an expanded view of part of the processor core of Figure 17A according to embodiments of the invention. Figure 17B includes an L1 data cache 1706A, part of the L1 cache 1704, as well as more detail regarding the vector unit 1710 and the vector registers 1714. Specifically, the vector unit 1710 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 1728), which executes one or more of integer, single-precision floating point, and double-precision floating point instructions. The VPU supports swizzling the register inputs with swizzle unit 1720, numeric conversion with numeric convert units 1722A-B, and replication with replication unit 1724 on the memory input. Write mask registers 1726 allow predicating the resulting vector writes.
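Predicated vector writes can be illustrated in software with a per-lane mask. The following Python sketch assumes a merging behaviour, in which lanes whose mask bit is clear keep the previous destination value; the text does not state which masking behaviour the write mask registers 1726 implement, so this choice, the function name `masked_add`, and the constant `VECTOR_WIDTH` are assumptions made for illustration.

```python
VECTOR_WIDTH = 16  # matches the 16-wide VPU described in the text


def masked_add(dst, a, b, mask):
    """Add a and b lane by lane, writing only lanes whose mask bit is set.

    Lanes with a clear mask bit keep the old value from dst (merging
    behaviour, assumed here for illustration).
    """
    assert len(dst) == len(a) == len(b) == VECTOR_WIDTH
    return [a[i] + b[i] if (mask >> i) & 1 else dst[i]
            for i in range(VECTOR_WIDTH)]
```

With `mask = 0b101`, only lanes 0 and 2 receive the sum, and every other lane of the destination is left untouched.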

Processor with Integrated Memory Controller and Graphics

Figure 18 is a block diagram of a processor 1800 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the invention. The solid lined boxes in Figure 18 illustrate a processor 1800 with a single core 1802A, a system agent 1810, and a set of one or more bus controller units 1816, while the optional addition of the dashed lined boxes illustrates an alternative processor 1800 with multiple cores 1802A-N, a set of one or more integrated memory controller unit(s) 1814 in the system agent unit 1810, and special purpose logic 1808.

Thus, different implementations of the processor 1800 may include: 1) a CPU with the special purpose logic 1808 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1802A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 1802A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) workloads; and 3) a coprocessor with the cores 1802A-N being a large number of general purpose in-order cores. Thus, the processor 1800 may be a general-purpose processor, a coprocessor, or a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1800 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1806, and external memory (not shown) coupled to the set of integrated memory controller units 1814. The set of shared cache units 1806 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 1812 interconnects the integrated graphics logic 1808, the set of shared cache units 1806, and the system agent unit 1810/integrated memory controller unit(s) 1814, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between the one or more cache units 1806 and the cores 1802A-N.

In some embodiments, one or more of the cores 1802A-N are capable of multithreading. The system agent 1810 includes those components coordinating and operating the cores 1802A-N. The system agent unit 1810 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include the logic and components needed for regulating the power state of the cores 1802A-N and the integrated graphics logic 1808. The display unit is for driving one or more externally connected displays.

The cores 1802A-N may be homogeneous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1802A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

Exemplary Computer Architectures

Figures 19-22 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cellular phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to Figure 19, shown is a block diagram of a system 1900 in accordance with one embodiment of the present invention. The system 1900 may include one or more processors 1910, 1915, which are coupled to a controller hub 1920. In one embodiment, the controller hub 1920 includes a graphics memory controller hub (GMCH) 1990 and an input/output hub (IOH) 1950 (which may be on separate chips); the GMCH 1990 includes memory and graphics controllers to which are coupled memory 1940 and a coprocessor 1945; the IOH 1950 couples input/output (I/O) devices 1960 to the GMCH 1990. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1940 and the coprocessor 1945 are coupled directly to the processor 1910, and the controller hub 1920 is in a single chip with the IOH 1950.

The optional nature of the additional processor 1915 is denoted in Figure 19 with broken lines. Each processor 1910, 1915 may include one or more of the processing cores described herein and may be some version of the processor 1800.

The memory 1940 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1920 communicates with the processor(s) 1910, 1915 via a multi-drop bus, such as a frontside bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 1995.

In one embodiment, the coprocessor 1945 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like. In one embodiment, the controller hub 1920 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 1910, 1915 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.

In one embodiment, the processor 1910 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1910 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1945. Accordingly, the processor 1910 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect to the coprocessor 1945. The coprocessor 1945 accepts and executes the received coprocessor instructions.
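The hand-off of embedded coprocessor instructions can be sketched as a simple dispatch loop. The Python below is a hypothetical illustration only: the tuple encoding of instructions, the `"coprocessor"` tag, and the list standing in for the coprocessor bus are assumptions made here and do not reflect any actual instruction format of the embodiment.

```python
def run(instructions, coprocessor_queue):
    """Execute general instructions locally and forward coprocessor ones.

    Each instruction is a (kind, payload) tuple; this encoding is an
    assumption made purely for the sketch.
    """
    executed_locally = []
    for kind, payload in instructions:
        if kind == "coprocessor":
            # Recognized as a coprocessor-type instruction: issue it over
            # the (modelled) coprocessor bus instead of executing it here.
            coprocessor_queue.append(payload)
        else:
            executed_locally.append(payload)
    return executed_locally
```

The instruction stream stays in program order on each side: the host keeps its general instructions, and the queue receives the coprocessor instructions in the order they were encountered.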

Referring now to Figure 20, shown is a block diagram of a first more specific exemplary system 2000 in accordance with an embodiment of the present invention. As shown in Figure 20, the multiprocessor system 2000 is a point-to-point interconnect system and includes a first processor 2070 and a second processor 2080 coupled via a point-to-point interconnect 2050. Each of the processors 2070 and 2080 may be some version of the processor 1800. In one embodiment of the invention, the processors 2070 and 2080 are respectively the processors 1910 and 1915, while the coprocessor 2038 is the coprocessor 1945. In another embodiment, the processors 2070 and 2080 are respectively the processor 1910 and the coprocessor 1945.

The processors 2070 and 2080 are shown including integrated memory controller (IMC) units 2072 and 2082, respectively. The processor 2070 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 2076 and 2078; similarly, the second processor 2080 includes P-P interfaces 2086 and 2088. The processors 2070, 2080 may exchange information via a point-to-point (P-P) interface 2050 using P-P interface circuits 2078, 2088. As shown in Figure 20, the IMCs 2072 and 2082 couple the processors to respective memories, namely a memory 2032 and a memory 2034, which may be portions of main memory locally attached to the respective processors.

The processors 2070, 2080 may each exchange information with a chipset 2090 via individual P-P interfaces 2052, 2054 using point-to-point interface circuits 2076, 2094, 2086, 2098. The chipset 2090 may optionally exchange information with the coprocessor 2038 via a high-performance interface 2039. In one embodiment, the coprocessor 2038 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

The chipset 2090 may be coupled to a first bus 2016 via an interface 2096. In one embodiment, the first bus 2016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in Figure 20, various I/O devices 2014 may be coupled to the first bus 2016, along with a bus bridge 2018 that couples the first bus 2016 to a second bus 2020. In one embodiment, one or more additional processor(s) 2015, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to the first bus 2016. In one embodiment, the second bus 2020 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 2020 including, for example, a keyboard and/or mouse 2022, communication devices 2027, and a storage unit 2028 such as a disk drive or other mass storage device, which may include instructions/code and data 2030, in one embodiment. Further, an audio I/O 2024 may be coupled to the second bus 2020. Note that other architectures are possible. For example, instead of the point-to-point architecture of Figure 20, a system may implement a multi-drop bus or other such architecture.

Referring now to Figure 21, shown is a block diagram of a second more specific exemplary system 2100 in accordance with an embodiment of the present invention. Like elements in Figures 20 and 21 bear like reference numerals, and certain aspects of Figure 20 have been omitted from Figure 21 in order to avoid obscuring other aspects of Figure 21.

Figure 21 illustrates that the processors 2070, 2080 may include integrated memory and I/O control logic ("CL") 2072 and 2082, respectively. Thus, the CL 2072, 2082 include integrated memory controller units and include I/O control logic. Figure 21 illustrates that not only are the memories 2032, 2034 coupled to the CL 2072, 2082, but also that I/O devices 2114 are coupled to the control logic 2072, 2082. Legacy I/O devices 2115 are coupled to the chipset 2090.

Referring now to Figure 22, shown is a block diagram of a SoC 2200 in accordance with an embodiment of the present invention. Similar elements in Figure 18 bear like reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In Figure 22, an interconnect unit 2202 is coupled to: an application processor 2210, which includes a set of one or more cores 202A-N and shared cache unit(s) 1806; a system agent unit 1810; bus controller unit(s) 1816; integrated memory controller unit(s) 1814; a set of one or more coprocessors 2220, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 2230; a direct memory access (DMA) unit 2232; and a display unit 2240 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 2220 include a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code, such as the code 2030 illustrated in Figure 20, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium that represent various logic within the processor and that, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.

Emulation (including binary translation, code morphing, etc.)

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.

Figure 23 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Figure 23 shows that a program in a high-level language 2302 may be compiled using an x86 compiler 2304 to generate x86 binary code 2306 that may be natively executed by a processor with at least one x86 instruction set core 2316. The processor with at least one x86 instruction set core 2316 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 2304 represents a compiler that is operable to generate x86 binary code 2306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 2316. Similarly, Figure 23 shows that the program in the high-level language 2302 may be compiled using an alternative instruction set compiler 2308 to generate alternative instruction set binary code 2310 that may be natively executed by a processor without at least one x86 instruction set core 2314 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 2312 is used to convert the x86 binary code 2306 into code that may be natively executed by the processor without an x86 instruction set core 2314. This converted code is not likely to be the same as the alternative instruction set binary code 2310 because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 2312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 2306.
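The conversion described above can be pictured with a minimal sketch, assuming two made-up instruction sets. The opcode names (`INC`, `MOV`, `PUSH`, `ADDI`, `OR`, `STORE`), the register names, and the `TRANSLATION_TABLE` and `convert` helpers are all hypothetical; they are not taken from this document or from any real x86, MIPS, or ARM encoding. The sketch only shows the table-driven idea that one source instruction may become one or more target instructions.

```python
# Hypothetical toy instruction sets for illustration only.
# A source instruction is a tuple: (opcode, operand, ...).
# Each table entry returns the list of target instructions that replaces it.
TRANSLATION_TABLE = {
    # INC r        ->  ADDI r, r, 1
    "INC": lambda ops: [("ADDI", ops[0], ops[0], 1)],
    # MOV dst, src ->  OR dst, src, zero
    "MOV": lambda ops: [("OR", ops[0], ops[1], "zero")],
    # PUSH r       ->  two target instructions (adjust stack pointer, then store)
    "PUSH": lambda ops: [("ADDI", "sp", "sp", -4), ("STORE", ops[0], "sp", 0)],
}


def convert(source_program):
    """Statically translate a list of source instructions into target instructions."""
    target_program = []
    for opcode, *operands in source_program:
        if opcode not in TRANSLATION_TABLE:
            raise ValueError(f"no translation for source opcode {opcode!r}")
        # One source instruction may expand into several target instructions.
        target_program.extend(TRANSLATION_TABLE[opcode](operands))
    return target_program
```

For instance, `convert([("MOV", "r1", "r2"), ("PUSH", "r1")])` yields three target instructions, because the hypothetical `PUSH` has no single-instruction equivalent in the toy target set. A real converter would also have to handle control flow, memory models, and condition codes, which this sketch ignores.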

101‧‧‧Query object vector

103_0~103_N‧‧‧Vector 0 to N partial distance computation

105‧‧‧Global control

107‧‧‧Minimum sort network

109‧‧‧Level 0 comparison nodes

111‧‧‧Level "k" comparison nodes

Claims (22)

1. An apparatus comprising: at least one vector partial distance computation circuit to calculate a partial sum and an accumulated distance of a set of vectors in a search space; a minimum sort network to sort a selected set of bits from the accumulated distances, indicating a minimum of the selected sets of bits from the vectors in the search space and whether the minimum is unique; and a global control circuit to receive an output of the minimum sort network and to control aspects of operations of the at least one vector partial distance computation circuit.

2. The apparatus of claim 1, wherein each vector partial distance computation circuit comprises: a plurality of data element calculator circuits; a compressor tree circuit to sum the results of each of the plurality of data element calculator circuits; a local control circuit to output a smaller window of bits from the accumulated distance and to use a result of the minimum sort network to determine when computation and sorting of a vector is to proceed to a next iteration or be eliminated from the search space; and an accumulator to add the results of the partial distances in a current iteration, wherein a correct significance is provided by a shifter that shifts the partial distance before it is added to the distance accumulated in previous iterations.

3. The apparatus of claim 1, wherein the minimum sort network comprises: a plurality of first-level comparison nodes to receive a partial sum and a valid bit from neighboring vector partial distance computation circuits and to output a valid bit, precise bit, address, and sum, wherein a first-level comparison node is to logically OR the received neighboring valid bits to provide the output valid bit, exclusive OR the received neighboring valid bits, and logically OR a result of the exclusive OR with an output of a comparison of the neighboring sums, which may differ, to generate the output precise bit, wherein the precise bit being 1 indicates whether the difference between the two inputs is larger than a programmable threshold or whether both inputs are invalid; and a plurality of second-level comparison nodes to receive a partial sum, valid bit, address, and precise bit from neighboring comparison nodes and to output a valid bit, precise bit, address, and sum, wherein a result of a comparison of the received sums is used to select from the incoming precise signals, the selected precise signal is logically ANDed with the precise signal computed at this node to generate the output precise signal, the precise signal indicates whether the output sum is unique, and the comparison result forms a most significant bit of the address.

4. The apparatus of claim 3, wherein the global control circuit comprises: an OR tree to receive and OR a plurality of elimination bits from a plurality of local control circuits; a global mask to indicate to the global control logic where the global pointer needs to jump back to for the set of vectors that may contain the next nearest neighbor; and a selector to select the global pointer from between a previous global pointer incremented by one and an output from a priority encoder coupled to the global mask.

5. The apparatus of claim 1, wherein a bit size of each dimension, the dimensionality, and the number of references are reconfigurable.

6. The apparatus of claim 2, wherein each of the plurality of data element calculator circuits is a sum-of-absolute-differences partial distance computation circuit.

7. The apparatus of claim 2, wherein each of the plurality of data element calculator circuits is a sum-of-squares partial distance computation circuit.

8. The apparatus of claim 2, wherein each of the plurality of data element calculator circuits is reconfigurable to operate as part of a larger data element calculator circuit for a plurality of data element bit widths.

9. The apparatus of claim 2, wherein each of the plurality of data element calculator circuits is a partial distance computation dot product circuit.

10. The apparatus of claim 1, wherein the global control circuit is to coordinate which bits of the accumulated distances are sent to the sort network using a global pointer broadcast to all vectors, to broadcast control signals for the iteration-dependent partial distance computation to all vectors, and to record the iteration at which vectors are eliminated from the search space when a nearest neighbor is found.

11. The apparatus of claim 10, wherein the control signals are stored in a programmable lookup table referenced by the global pointer.

12. The apparatus of claim 2, wherein the local control circuit and distance accumulator in each vector partial distance computation circuit maintain state for a vector even after it is eliminated from the search space, and, when finding the next nearest neighbor, the local control circuit can reinsert the vector into the search space and reuse any prior computation up to the previous elimination point, wherein the local control circuit uses the output of the minimum sort network to determine when computation and sorting for a vector is to proceed to the next iteration or be eliminated from the search space.

13. The apparatus of claim 1, wherein the apparatus is configured to sort by increasing distance.

14. The apparatus of claim 1, wherein the apparatus is configured to change an order of sorting.

15. The apparatus of claim 1, wherein the apparatus operates on a data set larger than the storage capacity of the apparatus by computing the sorted k-nearest candidates from a database in the apparatus, replacing the eliminated candidates with remaining object descriptors in memory, and repeating the process until all object candidates have been iterated through to find the overall k-nearest descriptor vectors.

16. A method comprising: performing successive iterations of: using at least one vector partial distance computation circuit to compute partial distances of a plurality of vectors with respect to a query vector, shifting the computed partial distances, and accumulating the shifted computed partial distances; and using a minimum sort network to sort the accumulated distances, starting from a most significant bit and proceeding to a least significant bit.

17. The method of claim 16, wherein each successive iteration improves the precision of the computed partial distances, from a most significant bit to a least significant bit.

18. The method of claim 16, wherein sorting the accumulated distances from a most significant bit to a least significant bit begins with low-precision distances, and only the remaining vectors proceed to the next iteration to improve lower-bit precision for determining the nearest neighbor.

19. The method of claim 16, wherein the sort network performs the sorting using a programmable threshold to declare whether a minimum found in each comparison, and across the entire sort network in an iteration, is smaller than any other number by more than the threshold.

20. The method of claim 16, wherein the accumulated distance computation is decomposed, for different metrics, into a number of partial distance computations such that, after the higher-order bits are computed, precision improvements at lower-order bit positions in successive iterations do not change the higher-order bits by more than a threshold.

21. The method of claim 16, wherein computing partial distances of a plurality of vectors with respect to a query vector comprises: using circuits for 1D computation, which use control signals to compute the appropriate partial distance and are arranged according to the dimensionality of the vector; using a compressor tree to sum the partial distances of all 1D computations; and adding the summed partial distances to a currently accumulated distance.

22. The method of claim 16, wherein each iteration is performed using the at least one vector partial distance computation circuit and the minimum sort network circuit in common.
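The coarse-to-fine idea behind the method claims (start with low-precision distances, keep only the vectors that could still be nearest, then refine) can be illustrated in software with a minimal sketch. It rests on several assumptions that are not in the claims: a Manhattan (sum of absolute differences) metric, non-negative integer elements, k = 1, and a non-empty reference list. It also recomputes each coarse distance from truncated operands instead of shifting and accumulating partial distances as the claimed circuit does, and it replaces the sort network with a plain `min`. The names `truncate` and `nearest_neighbor_progressive` are hypothetical.

```python
def truncate(value, drop):
    """Zero the `drop` low-order bits of a non-negative integer."""
    return (value >> drop) << drop


def nearest_neighbor_progressive(query, refs, bits=8):
    """Return (index, exact distance) of the reference nearest to `query`.

    Iteration i looks only at the i most significant bits of each element.
    Truncating both operands changes each |q - r| by at most 2**drop - 1,
    so the coarse distance brackets the exact distance within +/- slack.
    """
    dims = len(query)
    alive = list(range(len(refs)))  # vectors still in the search space
    for iteration in range(1, bits + 1):
        drop = bits - iteration
        slack = dims * ((1 << drop) - 1)
        coarse = {
            v: sum(abs(truncate(q, drop) - truncate(r, drop))
                   for q, r in zip(query, refs[v]))
            for v in alive
        }
        # Smallest possible upper bound on the true minimum distance.
        best_upper = min(coarse.values()) + slack
        # Eliminate every vector whose lower bound already exceeds that bound.
        alive = [v for v in alive if coarse[v] - slack <= best_upper]
        if len(alive) == 1:
            break  # a unique minimum was found before using all bits
    winner = alive[0]  # lowest index among any exact ties
    exact = sum(abs(q - r) for q, r in zip(query, refs[winner]))
    return winner, exact
```

With `query = [10, 200]` and `refs = [[12, 190], [11, 201], [250, 3]]`, the distant third vector is dropped after only two bits of precision, and the remaining pair is separated in later iterations. The benefit in hardware would come from not spending further iterations on eliminated vectors; this sketch shows only the elimination logic, not any timing or energy behavior.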
TW104138748A 2014-12-24 2015-11-23 Systems, apparatuses, and methods for k nearest neighbor search TWI604379B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/582,607 US9626334B2 (en) 2014-12-24 2014-12-24 Systems, apparatuses, and methods for K nearest neighbor search
US14/944,828 US10303735B2 (en) 2015-11-18 2015-11-18 Systems, apparatuses, and methods for K nearest neighbor search

Publications (2)

Publication Number Publication Date
TW201636823A true TW201636823A (en) 2016-10-16
TWI604379B TWI604379B (en) 2017-11-01

Family

ID=56116747

Family Applications (1)

Application Number Title Priority Date Filing Date
TW104138748A TWI604379B (en) 2014-12-24 2015-11-23 Systems, apparatuses, and methods for k nearest neighbor search

Country Status (3)

Country Link
CN (1) CN105740200B (en)
DE (1) DE102015015182A1 (en)
TW (1) TWI604379B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10649770B2 (en) * 2017-01-31 2020-05-12 Facebook, Inc. κ-selection using parallel processing
CN110019657B (en) * 2017-07-28 2021-05-25 北京搜狗科技发展有限公司 Processing method, apparatus and machine-readable medium
CN108182401B (en) * 2017-12-27 2021-09-03 武汉理工大学 Safe iris identification method based on aggregated block information
CN112749238A (en) * 2020-12-30 2021-05-04 北京金堤征信服务有限公司 Search ranking method and device, electronic equipment and computer-readable storage medium
CN113705858B (en) * 2021-08-02 2023-07-11 西安交通大学 Shortest path planning method, system, equipment and storage medium for multiple target areas

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0694855B1 (en) * 1994-07-28 2002-05-02 International Business Machines Corporation Search/sort circuit for neural networks
GB2476800A (en) * 2010-01-07 2011-07-13 Linear Algebra Technologies Ltd Sparse matrix vector multiplier using a bit map of non-zero elements to control scheduling of arithmetic operations
CN103136535A (en) * 2011-11-29 2013-06-05 南京理工大学常熟研究院有限公司 K nearest neighbor search method for point cloud simplification
US9405538B2 (en) * 2012-12-28 2016-08-02 Intel Corporation Functional unit having tree structure to support vector sorting algorithm and other algorithms

Also Published As

Publication number Publication date
TWI604379B (en) 2017-11-01
CN105740200A (en) 2016-07-06
CN105740200B (en) 2019-07-30
DE102015015182A1 (en) 2016-06-30
