CN105740200A - Systems, Apparatuses, and Methods for K Nearest Neighbor Search - Google Patents

Systems, Apparatuses, and Methods for K Nearest Neighbor Search

Info

Publication number
CN105740200A
CN105740200A (application CN201510823660.4A)
Authority
CN
China
Prior art keywords
vector
distance
bit
circuit
accuracy
Prior art date
Legal status
Granted
Application number
CN201510823660.4A
Other languages
Chinese (zh)
Other versions
CN105740200B (en)
Inventor
H. Kaul
M. A. Anders
S. K. Mathew
Current Assignee
Intel Corp
Original Assignee
Intel Corp
Priority date
Filing date
Publication date
Priority claimed from US14/582,607 external-priority patent/US9626334B2/en
Priority claimed from US14/944,828 external-priority patent/US10303735B2/en
Application filed by Intel Corp
Publication of CN105740200A
Application granted
Publication of CN105740200B
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163: Interprocessor communication
    • G06F 15/173: Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations

Abstract

Systems, apparatuses, and methods for k-nearest neighbor (KNN) searches are described. In particular, embodiments of a KNN accelerator and its uses are described. In some embodiments, the KNN accelerator includes a plurality of vector partial distance computation circuits, each to calculate a partial sum; a minimum sort network to sort partial sums from the plurality of vector partial distance computation circuits to find k nearest neighbor matches; and a global control circuit to control aspects of operation of the plurality of vector partial distance computation circuits.

Description

Systems, Apparatuses, and Methods for K Nearest Neighbor Search
Technical Field
The field of the invention relates generally to computer processor architecture and, more specifically, to nearest neighbor search.
Background
There are many applications in which a fast and efficient nearest neighbor search over the multidimensional features (points) of a data set is desired. Such searches are useful, for example, in fields such as image reconstruction and machine learning. Several approaches exist for searching a data set for nearest neighbors. In a nearest neighbor search, a set of points in a space and a query point are given, and the set is scanned to find the point closest to the query point.
Brief Description of the Drawings
The present invention is illustrated by way of example, and not limitation, in the figures of the accompanying drawings, in which like references indicate similar elements, and in which:
Figure 1 illustrates a high-level kNN accelerator according to an embodiment.
Figure 2 illustrates an exemplary vector partial distance computation circuit according to an embodiment.
Figure 3 illustrates an exemplary vector partial distance squared-difference data element computation circuit according to an embodiment.
Figure 4 illustrates an exemplary vector partial distance absolute-difference data element computation circuit according to an embodiment.
Figure 5 illustrates an exemplary local control circuit according to an embodiment.
Figure 6 illustrates an exemplary Manhattan distance sorting process according to an embodiment.
Figure 7 illustrates an exemplary data element Euclidean distance sorting process according to an embodiment.
Figure 8 illustrates an exemplary sort operation using partial distances according to an embodiment.
Figure 9 illustrates an exemplary global control circuit according to an embodiment.
Figure 10 illustrates an exemplary level 0 comparison node circuit according to an embodiment.
Figure 11 illustrates an exemplary level k comparison node circuit according to an embodiment.
Figure 12 illustrates an exemplary reconfigurable 8-bit/16-bit computation circuit according to an embodiment.
Figure 13 illustrates an exemplary partial distance computation for a sum of squares with 16-bit elements according to an embodiment.
Figure 14 illustrates a cosine similarity (1-D distance) computation circuit and an exemplary partial distance computation for a dot product according to an embodiment.
Figure 15 illustrates an exemplary method of kNN search according to an embodiment.
Figure 16A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention.
Figure 16B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention.
Figures 17A-B illustrate block diagrams of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip.
Figure 18 is a block diagram of a processor 1800 that may have more than one core, may have an integrated memory controller, and may have integrated graphics, according to embodiments of the invention.
Figures 19-22 are block diagrams of exemplary computer architectures.
Figure 23 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the invention.
Detailed Description
In the following description, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
References in the specification to "one embodiment", "an embodiment", "an example embodiment", etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is within the knowledge of one skilled in the art to effect such a feature, structure, or characteristic in connection with other embodiments, whether or not they are explicitly described.
One method of nearest neighbor search is to compute the distance from the input example to every point in the data set and keep track of the shortest distance. For larger data sets, however, this simple approach may be infeasible. The distance computations may be organized with a k-dimensional (k-d) tree, with an exhaustive examination of all features typically performed one feature at a time. This method is therefore slow and, in addition, has high power consumption.
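For reference, this brute-force approach can be sketched in a few lines of Python; the function and data below are illustrative only and are not part of the described apparatus:

```python
import heapq

def brute_force_knn(query, dataset, k,
                    dist=lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))):
    # Compute the full distance to every stored vector and keep the k smallest.
    scored = ((dist(query, ref), idx) for idx, ref in enumerate(dataset))
    return heapq.nsmallest(k, scored)  # [(distance, index), ...], ascending

refs = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 5], [9, 9, 9, 9], [0, 2, 3, 4]]
print(brute_force_knn([1, 2, 3, 4], refs, k=3))  # [(0, 0), (1, 2), (1, 4)]
```

Every candidate is examined at full precision, which is exactly the cost the accelerator described below avoids.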
Another approach to nearest neighbor search uses Voronoi diagrams. A Voronoi diagram divides the plane into neighboring regions (called cells), one feature (point) per cell. In theory, the "best match" feature for any input example can be found by using the Voronoi diagram to locate the feature in a particular cell. In practice, however, Voronoi cells tend to be highly irregular in shape and are difficult to compute (they are time- and processor-intensive) and to use. In other words, Voronoi diagrams do not lend themselves to convenient or efficient nearest neighbor feature searches.
Detailed herein are embodiments of systems, apparatuses, and methods for improved nearest neighbor searching that overcome the shortcomings of the approaches above. In short, given an input (i.e., an observation), a search is performed for the best matching feature in a feature space (i.e., a dictionary of features). This approach is particularly well suited to vectors that are typically sparsely distributed in a high-dimensional vector space (note that a feature is a vector, and therefore feature and feature vector are used interchangeably in this description).
Described below are embodiments of a k-nearest neighbor (kNN) accelerator that adjusts the precision of distance computations to minimize the computation needed to search for each nearest neighbor. Many candidate vectors are eliminated from the search space using only low-precision computations, while later, higher-precision iterations eliminate the remaining candidates that are closer to the nearest neighbor until a winner can be declared. Because most of the computation is performed at lower precision and consumes less energy, overall kNN energy efficiency is significantly increased. Typically, the kNN accelerator is part of a central processing unit (CPU), graphics processing unit (GPU), etc. However, the kNN accelerator may also be external to the CPU, GPU, etc.
Figure 1 illustrates a high-level kNN accelerator according to an embodiment. The accelerator has several major components, including a plurality of vector partial distance computation circuits 103_0 to 103_N, a global control circuit 105, and a minimum sorting network 107. Each of these components is discussed below.
A query object vector 101 is input to the plurality of vector partial distance computation circuits 103_0 to 103_N for partial distance computation. The memory that stores this object vector is not shown, but it is present. The partial distance computation circuits 103_0 to 103_N compute a partial distance and an accumulated distance for each reference vector and provide a valid indication to the minimum sorting network 107. As detailed herein, the distance between the query (101) and each stored vector is computed over many iterations of partial distance computation, with the least-significant-bit precision improved in each iteration; this is more energy efficient than past approaches. For different distance metrics (e.g., Euclidean (sum of squares) distance and Manhattan (sum of absolute differences) distance), the partial distance computation produces a small number of bits of the full distance in each iteration, starting from the MSB. The partial results are added to the accumulated full distance with the appropriate significance, so that the precision of the lower-significance bits improves as the computation proceeds.
Figure 2 illustrates an exemplary vector partial distance computation circuit 203 according to an embodiment. A vector consists of many dimensions, each of which is represented by 8 bits in this example. An individual distance for each dimension is first computed by a circuit 205, and these are then added in 211 to find the total distance. A local control circuit 207 provides an indication of which bits to select to the different data element computation circuits 205.
As noted above, several different types of distance metrics may be used, and therefore different data element computation circuits 205. Figure 3 illustrates an exemplary vector partial distance squared-difference (Euclidean distance) element computation circuit 205 according to an embodiment. As shown, a portion of the query object (shown as 8 bits) and a portion of the stored object (the same number of bits) have their absolute difference (|a-b|) computed in hardware, and selection multiplexers with control signals select particular bits of this result. In some embodiments, the local control circuit provides the control signals, as detailed below. The multiplexed results are multiplied (a pair of 2b x 2b multiplications) and then added. In this example, the output is a 5-bit value, which represents the partial distance when computing the square of the difference. These outputs are added by compressor tree 211 to compute the partial Euclidean distance for the entire vector.
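The arithmetic behind this decomposition can be checked in software. The sketch below (an illustration assuming unsigned 8-bit elements, not the hardware's exact iteration schedule) shows that the square of an element difference can be rebuilt from 2-bit by 2-bit products placed at the proper significance, which is why each element circuit only needs a pair of small multipliers per iteration:

```python
def two_bit_digits(value, n_digits=4):
    # Split an 8-bit value into four 2-bit digits, least significant first.
    return [(value >> (2 * i)) & 0x3 for i in range(n_digits)]

def squared_difference_from_partials(a, b):
    # If |a - b| = sum_i d_i * 4**i, then |a - b|**2 = sum_{i,j} d_i * d_j * 4**(i + j),
    # so the full square is a sum of small 2b x 2b products at the right significance.
    digits = two_bit_digits(abs(a - b))
    return sum(di * dj * 4 ** (i + j)
               for i, di in enumerate(digits)
               for j, dj in enumerate(digits))

# Sanity check against the direct computation over a sample of 8-bit values.
assert all(squared_difference_from_partials(a, b) == (a - b) ** 2
           for a in range(0, 256, 13) for b in range(0, 256, 29))
```

The hardware evaluates these small products starting from the most significant ones, so that later products disturb already-processed bit positions by at most 1, as discussed with respect to Figure 7.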
When a sum of absolute differences (Manhattan distance) metric is used, the vector partial distance computation circuit 203 selects the appropriate two bits (2b) from the absolute difference of each vector element and adds them. Figure 4 illustrates an exemplary vector partial distance absolute-difference data element computation circuit 205 according to an embodiment. As shown, a portion of the query object (shown as 8 bits) and a portion of the stored object (the same number of bits) have their absolute difference (|a-b|) computed in hardware, and a multiplexer with control signals selects particular bits of this result. In some embodiments, the local control circuit provides the control signals, as detailed below. In this example, the output is a 2-bit value.
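As a cross-check on the Manhattan case, the following sketch (a software analogue assuming unsigned 8-bit elements) accumulates a 2-bit window of every |a-b| per iteration at its proper significance and converges to the exact sum of absolute differences, which is the behavior Figures 4 and 6 implement in hardware:

```python
def manhattan_by_windows(query, ref, element_bits=8, window=2):
    # Process 2-bit windows of |a - b| from MSB to LSB; each iteration's partial
    # sum is shifted to its significance and added to the running total.
    diffs = [abs(a - b) for a, b in zip(query, ref)]
    acc = 0
    for shift in range(element_bits - window, -1, -window):
        partial = sum((d >> shift) & ((1 << window) - 1) for d in diffs)
        acc += partial << shift          # align the window sum before accumulating
        yield acc                        # precision of acc improves every iteration

query = [200, 17, 64, 3]
ref = [180, 33, 70, 250]
*_, final = manhattan_by_windows(query, ref)
assert final == sum(abs(a - b) for a, b in zip(query, ref))
```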
The partial SAD computation reduces the size of the compressor tree by a factor of four, and the partial Euclidean metric computation replaces an 8-bit x 8-bit multiplier per vector element with a common pair of 2-bit multipliers, also reducing the area of compressor tree 211 by a factor of three. The organization of this Euclidean distance computation ensures that, after the higher MSB positions have been processed, any subsequent lower-MSB refinement affects higher-order bit positions by no more than 1, as discussed below with respect to Figures 3, 4, 6, and 7. Figure 7 illustrates an embodiment in which the full squared-difference computation (used to compute the Euclidean distance) is decomposed into partial computation iterations performed by the exemplary circuit of Figure 3. Figure 7 shows that, by performing the computations and sorting in the order shown, a computation at a lower-order bit position will not disturb already-processed higher-order bits by more than 1. Similarly, Figure 4 illustrates an embodiment of the per-element circuit and Figure 6 illustrates the corresponding example computations performed by that circuit for the Manhattan distance; Figure 6 likewise shows that, with the order shown, computations at lower-order bit positions do not disturb already-processed higher-order bits by more than 1.
In some embodiments, a single circuit is shared between the different metrics due to the common hardware datapath.
The outputs of the data element distance computation circuits are added by compressor tree 211. For a 256-dimensional vector with Euclidean distance, this output is a 13-bit value. The output of compressor tree 211 is sent to a shifter 209. Typically this is a right shifter, although depending on the configuration it may be a left shifter. In most embodiments, the shift amount is controlled by local control circuit 207. The shifter aligns the partial distance to the appropriate significance relative to the accumulated distance.
Flip-flop 213 stores the accumulated distance from previous iterations, and the output of adder 215 is the accumulated distance for the current iteration. At the start of the following iteration, this value is written into flip-flop 213. Selector 219 selects two bits (2b) from the accumulated distance based on the global pointer. It also selects the carry-out into those 2 bit positions from the addition of the partial sum to the previously accumulated distance.
The local control circuit 207 takes psumi as an input and modifies only the upper bit before passing it to the sorting network 107 as the psum value (e.g., 3 bits). A valid bit is also passed to the minimum sorting network 107. Figure 5 illustrates an exemplary local control circuit according to an embodiment. Several aspects of the vector partial distance computation circuit are controlled by the local control circuit, as detailed above. This circuit receives a global pointer (described below) from the global control unit, and receives psumi from selector 219 together with the minimum sum, address, and precision indicator from the minimum sorting network 107.
As shown, the local control circuit takes the address of the vector being processed and the minimum address from the minimum sorting network 107 and uses a comparison circuit to determine whether they are equal. The output of this comparison is ANDed with the minimum precision from the minimum sorting network 107 to help determine whether the object vector should no longer be processed. Specifically, this output gates the valid bit, as shown by the AND gate. The local control circuit uses the minimum sum, psumi, and the global pointer to generate an eliminate signal, and uses local compute signals as shown (to control the different data element computation circuits 205). Processing of a vector can stop for one of two reasons: 1) the current vector is declared a winner, or 2) the current vector is guaranteed not to be a nearest neighbor, in which case it is removed from the search space. The comparison described above handles the former case and is used to assert a "done" signal. The inversion of this signal (shown entering AND gate 513 through a bubble) affects the valid signal. If "done" is asserted, the valid signal is deasserted. The remaining logic affecting valid determines whether the vector is not done but is no longer part of the search space. The eliminate signal indicates that the vector is removed from the search space in the current iteration, and this information is used by the global control. The circuit shown in Figure 5 also generates a clock from the global CLK signal to clock the storage elements (including 213). The local control circuit also receives "compute control" from the global control circuit and passes it on to the partial distance computation 205 and shifter circuit 209.
In essence, the local control circuit provides per-vector local state control, and the global state control spans all vectors, to keep track of the distance computation state and the iteration at which each vector was eliminated. This enables previously computed distances and comparisons to be reused when computing the sorted list for k > 1.
Partial distance computation and sorting iterations are interleaved, as shown in Figures 6 and 7. Figure 6 illustrates an exemplary Manhattan distance computation and sorting process according to an embodiment, and Figure 7 illustrates an exemplary data element Euclidean distance computation and sorting process according to an embodiment. In these figures, the letters (a, b, c, and d) are the 2-bit components of the absolute difference between the query vector and an 8-bit element of a reference vector. As shown, the typical flow is a computation iteration by the vector partial distance computation circuits 103_0 to 103_N, followed by sorting in the minimum sorting network 107. However, in some cases a computation iteration does not occur between consecutive sorting iterations, for example as shown in Figure 7.
The minimum sorting network 107 performs a window-based sort. In particular, the sorting network processes a much smaller window of bits, starting from the most significant bits (MSBs) of the partially computed distances, enabling much smaller comparator circuits and early elimination of vector candidates from further partial computation.
For example, in some embodiments, the sorting network processes a window of only two bits of the accumulated vector distance in each iteration, moving from MSB to LSB (Figures 6 and 7). This enables a high degree of parallelism with very low hardware complexity. Because a refinement computed at lower bits can affect the already-processed MSB bits by at most 1, the sorting network 107 also needs to process the carry-out produced by the computation iteration at the current 2-bit window. As a result, the sorting network 107 compares numbers of, e.g., 3 bits (carry-out plus the 2-bit sum) at each node 109 and 111. By contrast, a conventional sorting network comparing vector distances for 256 dimensions (8 bits per element) would require a 24-bit comparator at each node.
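To make the node-level comparison concrete, the following sketch models a level 0 comparison node in software. The signal names (valid, psum, precision) follow the text, but the tuple layout and tie-breaking are illustrative assumptions rather than the exact gate-level behavior:

```python
def level0_node(addr_a, psum_a, valid_a, addr_b, psum_b, valid_b):
    # Each psum is a 3-bit value: the 2-bit window of the accumulated distance
    # plus the carry-out from the current computation iteration.
    valid = valid_a or valid_b
    if valid_a and not valid_b:
        winner_addr, winner_psum, precision = addr_a, psum_a, True
    elif valid_b and not valid_a:
        winner_addr, winner_psum, precision = addr_b, psum_b, True
    else:
        if psum_a <= psum_b:
            winner_addr, winner_psum = addr_a, psum_a
        else:
            winner_addr, winner_psum = addr_b, psum_b
        # Lower-order refinements can still move either value by 1, so the
        # minimum is only declared "precise" when the two differ by more than 1.
        precision = abs(psum_a - psum_b) > 1
    return winner_addr, winner_psum, valid, precision

print(level0_node(0, 0b011, True, 1, 0b110, True))  # (0, 3, True, True)
print(level0_node(0, 0b011, True, 1, 0b100, True))  # (0, 3, True, False): 1 apart, not yet unique
```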
The minimum 3-bit result found is broadcast globally by the sorting network 107, and the local control for each individual vector distance computation compares that vector's 3-bit psum (carry-out plus 2-bit sum) with the broadcast result to see whether the vector can be eliminated from further distance refinement computations; this comparison uses the global 105 and local control circuits.
Because lower-order computations in future iterations can affect the currently processed window by 1, the local control and all of the 3-bit comparisons in the sorting network 107 require a difference of more than 1 in order to eliminate a candidate. For the same reason, the local control also considers whether a particular vector exceeds the previous iteration's minimum by 1. Using the precision signal, the sorting network 107 indicates whether the minimum found is unique. Sorting iterations continue until a unique nearest vector is found or the LSB is reached. Feedback from the sorting network eliminates candidate vectors from further distance computations and comparisons, resulting in up to a 3x reduction in computation.
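The resulting elimination rule can be restated as a single comparison (a simplified restatement of the margin argument above, not the exact gate-level condition):

```python
def can_eliminate(psum, broadcast_min):
    # A candidate may be dropped only if its 3-bit window exceeds the broadcast
    # minimum by more than 1, since future low-order refinements can change
    # either value by at most 1.
    return psum > broadcast_min + 1

assert can_eliminate(0b101, 0b010)       # 5 vs 2: safe to eliminate
assert not can_eliminate(0b011, 0b010)   # 3 vs 2: could still become the minimum
```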
Figure 8 illustrates an exemplary sort operation using partial distances according to an embodiment. As vector candidates are eliminated while the best candidate is being found, their local state control stores the iteration at which they were eliminated (shown using the lptr signal in Figures 5 and 8).
Meanwhile, as the global control pointer moves forward (toward the LSB), whenever any single vector is dropped, a 1 is written into the associated bit position of a global binary mask (stored in the global control circuit 105). After the first vector is found, the global binary mask indicates to the global control logic 105 where the global pointer needs to jump back to for the set of vectors that will contain the next nearest neighbor. The process continues iterating, as illustrated in Figure 8 for the second and third nearest neighbor searches. When the global pointer jumps back toward the MSB, only those vectors whose stored iteration state matches the global pointer position come back to life. Vectors closer to the nearest neighbor will have been eliminated at later global pointer positions. This state-retaining technique has three advantages over a conventional sorting approach (which simply removes the nearest neighbor and restarts the entire computation and sorting process from scratch): (a) it reuses the partial distance computations performed while finding the previous rank, (b) it reduces the number of vectors that need to be compared by making use of comparisons that have already been computed, and (c) k does not need to be predefined in order to minimize the computation and comparisons for any rank. With this control, the incremental cost of finding the next nearest neighbor after 3 nearest neighbors have been determined from X (e.g., 256) vectors (e.g., 256 8-bit elements per vector) is greatly reduced. A conventional sorting technique would search for the nearest vector among the remaining 253 candidate vectors, whereas the proposed control reduces this search space by 19x and reduces the associated computation by 20x.
In the example shown, 2-bit windows of the partial distances of the vectors are processed. The arrows pointing to the right originate from the valid bits of the local control circuits. In cycle 0, because its compared value exceeds the minimum by more than 1, vector 7 can be eliminated. This elimination is stored in the local control circuit. The process then proceeds as detailed above. If all partial sums were processed and computed for all vectors, the resulting distances would match the complete distances, which are shown at the far left for reference only.
Figure 9 illustrates an exemplary global control circuit according to an embodiment. As shown in Figure 1, the global control circuit 105 receives the minimum precision, address, and sum values from the minimum sorting network 107. The minimum precision signal is ORed with a signal indicating that the global pointer is at the LSB position, and the result is used as the select signal for the global pointer output (which is either the previous global pointer incremented by 1 or the encoded (priority) global pointer from the global binary mask, as illustrated). The global binary mask is formed by OR-tree-ing the eliminate signals received from the local control circuits 207. The compute control signals are found using a lookup table, with the global pointer as the index into the lookup table. Until a unique minimum or the LSB is found, the global pointer advances by 1 toward the LSB each iteration (i.e., the pointer is incremented). This condition is tested by OR gate 901. Otherwise, the pointer rolls back to the closest 1 in the binary mask in order to find the next nearest neighbor. A 1 is written into the binary mask at the pointer position if any vector is eliminated; otherwise a 0 is written. OR tree 907 detects whether any vector was eliminated (from the eliminate signals generated by all of the individual local control circuits); the following demultiplexer uses the global pointer to set the input at the appropriate position to 1, and at the start of the next iteration (rising edge of CLK) it is written into the global binary mask (kept in memory 903). The position of the closest 1 is computed by priority encoder 905. The compute control is broadcast to all vectors based on the pointer position. It can be made programmable by storing it in lookup table 913, where the appropriate control signals are read based on the pointer.
Examining the minimum sorting network 107 in more detail, there are two types of comparison nodes: level 0 nodes and level "k" nodes. Figure 10 illustrates an exemplary level 0 comparison node circuit according to an embodiment. As shown, the circuit takes valid bits, which indicate whether a psum comes from a vector that is part of the search space. If the valid bit accompanying a psum is 0, the psum is ignored in the comparison at the node.
Adjacent valid bits are ORed to produce the level 0 valid bit. The valid bits are also XORed, and the result is ORed with a signal indicating that the absolute difference between the input sums exceeds a threshold, to generate the precision bit. A precision bit of "1" means that no other vector is close. Finally, the adjacent sums are compared with each other, and the result is ANDed with one of the valid bits to form the address and the selector for the output sum. The overall outputs of a level 0 comparison node are an address, a valid bit, a precision bit, and a sum. The valid output indicates whether the result is valid (at least one of the inputs must be valid for this condition to be met). The comparison result is appended as the most significant bit of the address of the minimum vector found (in this case bit [0], because it is the first comparison level). The output precision signal indicates whether the two vectors are far apart or close (if 1, they differ by more than 1; if 0, they do not). If only one of the inputs is valid, XOR gate 1003 asserts the precision signal regardless of the comparison result (because if one input is not valid, the comparison does not matter). The comparison result is passed to the next node together with the smaller input's address.
Figure 11 illustrates an exemplary level k comparison node circuit according to an embodiment. This circuit takes the adjacent address outputs, valid bits, precision bits, and sums from the preceding level (e.g., level 0) and passes them into the circuit shown. Its operation is similar to that of the circuit shown in Figure 10. The incoming precision signals are now also selected by the comparison result, and the selected precision is ANDed with the precision signal computed at this node to produce the output precision signal. The output precision signal indicates whether the output is unique, i.e., whether, across all vectors starting from level 0, it is the minimum by at least a small margin (more than 1) over any other nearest vector.
The embodiments of the kNN accelerator described above add flexibility and broaden the application space in which the accelerator is beneficial. For example, in some embodiments, if vector elements larger than 8 bits are to be computed, a distance computation circuit designed for 8-bit elements can be reused for 16-bit elements by combining pairs of adjacent 8-bit element circuits. Figure 12 illustrates an exemplary reconfigurable 8-bit/16-bit computation circuit according to an embodiment. In this circuit, the control signals broadcast two sets of selection signals for the even/odd 8-bit computation circuits. For a fixed circuit and storage size, the vector dimensionality or the number of stored vectors is halved when operating in 16-bit mode. In 16-bit mode, the number of iterations required to compute the complete sum of squares increases from 6 (for 8-bit elements) to 15. Multiple computation iterations may be required between consecutive sorting iterations to ensure that higher-order bits are not affected by more than 1 when lower-order bits are processed. Even in 16-bit mode, the accelerator's partial-computation-based sorting greatly reduces the computation needed to find the nearest neighbors. Figure 13 illustrates an exemplary partial distance computation for a sum of squares with 16-bit elements according to an embodiment. In some embodiments, only a 16-bit width or a reconfigurable width is used. Of course, other bit widths or reconfigurable widths may also be used.
In some embodiments, the kNN accelerator is reconfigurable to support larger vector dimensions, with additional stages in the compressor tree of a distance computation unit used to add the results from other distance computation unit blocks. Accordingly, as the dimensionality of each vector increases, the number of stored vectors decreases.
In some embodiments, the functionality of the kNN accelerator is extended to enable operation on data sets larger than the accelerator's storage capacity. The sorted k-nearest candidates of the database stored in the accelerator are computed first, the eliminated candidates are then replaced from memory with any remaining object descriptors, and the process continues iteratively until all object candidates have been processed and the k-nearest descriptor vectors over the whole set have been found. For an accelerator with a capacity of 256 objects, with each object feature described by a 256-dimensional vector (8 bits per dimension), across object database sizes from 512 to 2048 objects, the accelerator consistently reduces the sum-of-squares computations needed to sort a list of the 16 closest candidates.
In some embodiments, in addition to finding vectors by minimum distance, the accelerator can be reconfigured to find vectors in descending order by inverting the outputs of the 3-bit comparator circuits in the comparison nodes of the sorting network. Alternatively, descending order can be computed by subtracting the accumulated partial distance from the maximum possible distance and then processing the resulting numbers with the same window-based minimum sorting network.
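The subtraction-based option is easy to restate in software; the sketch below assumes a known maximum possible distance for the chosen metric and element width (the function name and example values are illustrative):

```python
def descending_keys(accumulated_distances, max_possible_distance):
    # Complementing each accumulated distance lets the unchanged minimum-finding
    # sort network return the largest-distance vectors first.
    return [max_possible_distance - d for d in accumulated_distances]

dists = [120, 3, 77]
keys = descending_keys(dists, max_possible_distance=256 * 255 * 255)  # e.g. 256 dims, 8b squared metric
assert min(range(len(keys)), key=keys.__getitem__) == dists.index(max(dists))
```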
In some embodiments, various distance metrics are accommodated by reconfiguring only the 1-D distance circuits in the network. Besides Euclidean and Manhattan distance, another popular metric for finding the closest match to a vector is cosine similarity, which uses the angular distance between vectors to find the closest match. The cosine of the angle between two vectors A and B is computed as [Σ(a_i * b_i)] / [(Σa_i^2)^(1/2) * (Σb_i^2)^(1/2)], where a smaller angle produces a larger cosine. For cosine-based similarity, if the stored database is normalized, no further normalization is required, and the optimization reduces to finding the vector that yields the dot product Σ(a_i * b_i) with the largest magnitude. The existing 2-bit multipliers for the Euclidean metric can be used to compute the dot product between the query and a stored object in a partial manner.
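A plain software restatement of the metric (not the accelerator's 2-bit partial-product scheme; the function names are illustrative) shows why only the dot product matters once the stored vectors are unit-norm:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_match_by_dot_product(query, normalized_refs):
    # With unit-norm stored vectors, maximizing the dot product is equivalent
    # to maximizing the cosine, so no per-candidate normalization is needed.
    return max(range(len(normalized_refs)),
               key=lambda i: sum(x * y for x, y in zip(query, normalized_refs[i])))

refs = [[0.6, 0.8], [1.0, 0.0]]                  # already unit-norm
print(best_match_by_dot_product([1.0, 2.0], refs))  # 0: [0.6, 0.8] aligns better with the query
```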
Figure 14 illustrates a cosine similarity (1-D distance) computation circuit according to an embodiment and an exemplary partial distance computation for a dot product according to an embodiment. Multiple computation iterations may be required between consecutive sorting iterations to ensure that higher-order bits are not affected by more than 1 when lower-order bits are processed. For dot products with signed elements, each computation iteration requires two steps: first, all positive products are added to the accumulated partial distance, and then all negative products are subtracted from the accumulated partial distance.
In some embodiments, candidate vectors can also be eliminated earlier as the iterations proceed, based on a comparison of the accumulated partial distance with a predetermined absolute threshold. In addition, when the winning vector does not need to be determined exactly, the iterations for selecting the winner can be stopped earlier based on a predetermined relative precision (using the global pointer position) or an absolute incremental partial distance. This scheme can reduce the energy consumption of algorithms optimized for approximate nearest neighbor (ANN) search.
Figure 15 illustrates an embodiment of an exemplary method of kNN search using the kNN accelerator detailed above. At a high level, the method of kNN search includes computing partial distances, accumulating those distances, and sorting the accumulated distances in an interleaved fashion. A more detailed description of this process follows.
In some embodiments, one or more variables are first reset: for example, the accumulated distance for each reference vector, the global pointer, the global binary mask, the k value, the valid bit for each reference vector (set to 1), the "done" bit for each reference vector, and the local pointer for each reference vector.
At 1501, for each reference vector and for each element of that reference vector, the absolute difference between that element and the corresponding element of the query vector is computed.
At 1503, a comparison threshold is set based on the global pointer.
At 1505, a determination is made as to whether a partial distance is to be computed. When a partial distance is to be computed, then for each reference object vector whose valid bit is set to 1 (indicating valid), a partial distance is computed at 1507 (e.g., using the circuits of Figures 3 and 4 and compressor tree 211), shifted, and added to the accumulated distance.
When a partial distance is not to be computed, or after 1507 has occurred, then for each reference object vector whose valid bit is set to 1 (indicating valid), the subset of bits of the accumulated distance selected by the global pointer (psum) is sent to the minimum sorting network at 1509.
At 1511, the sorting network finds the global minimum and the second minimum.
At 1513, a determination is made as to whether the second minimum exceeds the global minimum by more than the configured threshold. If so, the precision is set to 1. At 1515, if the precision is 1, or the global pointer is at the LSB of the accumulated distance, then "minimum found" is set to 1.
At 1517, typically in parallel with 1513, for each reference object vector whose valid bit equals 1, the psum is compared with the global minimum, based on the configured threshold and the comparison of the previous iteration. Based on this comparison, the valid bit is either kept at or updated to 1, or deasserted to 0. If the valid bit is updated to 0 in the current iteration, the current global pointer is written into the local pointer storage associated with that reference vector, and a 1 is written into the global binary mask at the global pointer position.
At 1519, a determination is made as to whether "minimum found" equals 1. If so, k is incremented by 1 at 1521. In addition, at 1521, for the global minimum vector, "done" is set to 1 and valid is set to 0. If not, then at 1527 the global pointer is incremented by 1 and the comparison threshold is set again.
After k is incremented, at 1523 the global pointer is decremented to the position of the closest 1 in the global binary mask. In essence, the global pointer rolls back to the most recent position at which a reference object vector was eliminated from the search space.
At 1525, for each reference object vector, if its local pointer is greater than or equal to the global pointer and its done bit equals 0, its valid bit is set to 1 and the comparison threshold is set again. This reinserts reference vectors into the search space when the next nearest vector is to be computed.
Although the above describes the sorting and computation for all reference vectors as being done in parallel, these operations can be made more serial by performing the computation and sorting operations for different vectors on the same circuits to save area. A simplified software sketch of the overall flow is shown below.
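The following Python sketch is a simplified software analogue of the method of Figure 15 for the Manhattan metric. It is illustrative only: distances are precomputed and each neighbor triggers a fresh MSB-to-LSB sweep, whereas the accelerator accumulates distances incrementally and reuses per-vector elimination state through the global pointer and binary mask. The margin-of-1 window elimination, however, is the same idea as in the hardware.

```python
def nearest_by_windowed_refinement(dist, candidates, n_dims, element_bits=8, window=2):
    # One MSB-to-LSB sweep over 2-bit windows of the (here precomputed) distances.
    total_bits = element_bits + max(n_dims - 1, 1).bit_length()
    total_bits += (-total_bits) % window          # round up to whole windows
    active = set(candidates)
    for shift in range(total_bits - window, -1, -window):
        prefix = {i: dist[i] >> shift for i in active}
        best = min(prefix.values())
        # Margin-of-1 rule: anything more than 1 above the current minimum
        # window can never become the nearest neighbor and is dropped early.
        active = {i for i in active if prefix[i] <= best + 1}
        if len(active) == 1:
            break
    return min(active, key=dist.__getitem__)

def knn(query, refs, k, element_bits=8, window=2):
    # Simplified analogue: each neighbor starts a new sweep over the remaining
    # candidates; the hardware instead rolls the global pointer back and revives
    # vectors from the iteration at which they were eliminated.
    dist = [sum(abs(a - b) for a, b in zip(query, r)) for r in refs]
    remaining = set(range(len(refs)))
    result = []
    for _ in range(min(k, len(refs))):
        best = nearest_by_windowed_refinement(dist, remaining, len(query),
                                              element_bits, window)
        result.append((dist[best], best))
        remaining.discard(best)
    return result

refs = [[10, 20, 30], [11, 19, 33], [200, 5, 90], [10, 21, 29]]
print(knn([10, 20, 31], refs, k=2))   # [(1, 0), (3, 3)] for this toy data
```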
The systems, methods, and apparatuses described above provide many advantages. The distance between vectors is computed iteratively so that, with each subsequent iteration, the precision of the computed distance improves from MSB to LSB. In each iteration, the partial distance computation for a vector serves to improve the precision of the complete (accumulated) distance at a certain significance or bit position. The complete distance computation is decomposed into several partial distance computations for different metrics (e.g., Euclidean, Manhattan, or dot product), such that after higher-order bits have been computed, precision improvements at lower-order bit positions by subsequent iterations never change the higher-order bits by more than a certain threshold.
The above is accomplished with (i) a partial distance computation circuit for computing the correct partial distance, which has circuits that perform the 1-D computations using control signals and is arranged according to the dimensionality of the vector, (ii) a compressor tree that sums the partial distances from all of the 1-D computations, and (iii) an accumulator, which has storage for the current accumulated distance and uses a shifter to add the partial distance to the accumulated distance at the appropriate significance.
Sorting of these accumulated vector distances does not wait until the complete distances have been computed and can start at low precision. The sorting does not consider all bits of the accumulated distance; it is done iteratively on only a small window of bits, moving from the MSB to the LSB. The sorting network uses a programmable threshold (1 or 0 in the exemplary case) to declare, in each comparison, whether a minimum has been found: the minimum found in an iteration of the whole sorting network must be smaller than any other value by more than this threshold.
Computation and sorting are interleaved from MSB to LSB, so that many reference vectors are eliminated from the search space at low precision while the remaining vectors proceed to the next iteration, which improves the precision of the lower-order bits, in order to determine the nearest neighbor.
The computation associated with each vector has a local control, which uses the results of the sorting network to determine whether the computation and sorting for that vector proceed to the following iteration or the vector is removed from the search space.
The local control and distance accumulator for each vector computation retain their state even if the vector is eliminated from the search space. When the next nearest neighbor is to be found, the local control can reinsert the vector into the search space (based on the global pointer), and everything previously computed up to the point of elimination is reused.
The global control coordinates which bits of the accumulated distances are sent to the sorting network, using a global pointer broadcast to all vectors.
The iteration-dependent control signals for the partial distance computation are also broadcast from the global control to all vectors. These control signals can be stored in a programmable lookup table referenced by the global pointer, or implemented as fixed-function logic.
The global control keeps track of the iteration at which each vector was eliminated from the search space when a nearest neighbor is found. When only eliminated vectors remain in the search space, the global control jumps back to the most recent iteration state at which a vector was eliminated and restarts the search to find the next nearest neighbor.
The kNN accelerator can be programmable to change the order of the sort.
Any bit size, number of dimensions, or number of vectors can be supported. In addition, in some embodiments, the kNN accelerator is programmable, so that the bit size per dimension, the number of dimensions, or the number of reference vectors is programmable.
The operation can be serialized so that the computation and sorting for different reference vectors are completed on common partial distance computation and sorting circuits.
Exemplary Core Architectures, Processors, and Computer Architectures
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.
Exemplary Core Architectures
In-order and Out-of-order Core Block Diagram
Figure 16A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention. Figure 16B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention. The solid lined boxes in Figures 16A-B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.
In Figure 16A, a processor pipeline 1600 includes a fetch stage 1602, a length decode stage 1604, a decode stage 1606, an allocation stage 1608, a renaming stage 1610, a scheduling (also known as a dispatch or issue) stage 1612, a register read/memory read stage 1614, an execute stage 1616, a write back/memory write stage 1618, an exception handling stage 1622, and a commit stage 1624.
Figure 16B shows a processor core 1690 including a front end unit 1630 coupled to an execution engine unit 1650, both of which are coupled to a memory unit 1670. The core 1690 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1690 may be a special purpose core, such as, for example, a network or communication core, compression engine, coprocessor core, general purpose computing graphics processing unit (GPGPU) core, graphics core, or the like.
The front end unit 1630 includes a branch prediction unit 1632 coupled to an instruction cache unit 1634, which is coupled to an instruction translation lookaside buffer (TLB) 1636, which is coupled to an instruction fetch unit 1638, which is coupled to a decode unit 1640. The decode unit 1640 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1640 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, lookup tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1690 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 1640 or otherwise within the front end unit 1630). The decode unit 1640 is coupled to a rename/allocator unit 1652 in the execution engine unit 1650.
The execution engine unit 1650 includes the rename/allocator unit 1652 coupled to a retirement unit 1654 and a set of one or more scheduler unit(s) 1656. The scheduler unit(s) 1656 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler unit(s) 1656 is coupled to the physical register file(s) unit(s) 1658. Each of the physical register file(s) units 1658 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 1658 comprises a vector registers unit, a write mask registers unit, and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 1658 is overlapped by the retirement unit 1654 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using a register map and a pool of registers; etc.). The retirement unit 1654 and the physical register file(s) unit(s) 1658 are coupled to the execution cluster(s) 1660. The execution cluster(s) 1660 includes a set of one or more execution units 1662 and a set of one or more memory access units 1664. The execution units 1662 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 1656, physical register file(s) unit(s) 1658, and execution cluster(s) 1660 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster; and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 1664). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access units 1664 is coupled to the memory unit 1670, which includes a data TLB unit 1672 coupled to a data cache unit 1674 coupled to a level 2 (L2) cache unit 1676. In one exemplary embodiment, the memory access units 1664 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1672 in the memory unit 1670. The instruction cache unit 1634 is further coupled to the level 2 (L2) cache unit 1676 in the memory unit 1670. The L2 cache unit 1676 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1600 as follows: 1) the instruction fetch 1638 performs the fetch and length decode stages 1602 and 1604; 2) the decode unit 1640 performs the decode stage 1606; 3) the rename/allocator unit 1652 performs the allocation stage 1608 and renaming stage 1610; 4) the scheduler unit(s) 1656 performs the schedule stage 1612; 5) the physical register file(s) unit(s) 1658 and the memory unit 1670 perform the register read/memory read stage 1614, and the execution cluster 1660 performs the execute stage 1616; 6) the memory unit 1670 and the physical register file(s) unit(s) 1658 perform the write back/memory write stage 1618; 7) various units may be involved in the exception handling stage 1622; and 8) the retirement unit 1654 and the physical register file(s) unit(s) 1658 perform the commit stage 1624.
The core 1690 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California), including the instruction(s) described herein. In one embodiment, the core 1690 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads) and may do so in a variety of ways, including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads the physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter, such as in Intel Hyperthreading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1634/1674 and a shared L2 cache unit 1676, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level 1 (L1) internal cache or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
Concrete exemplary orderly core architecture
Figure 17 A-B illustrates the block diagram of exemplary orderly core architecture more specifically, and this core will be in several logical blocks in chip (including same type and/or other cores different types of).Logical block is by the I/O logic communication of high-bandwidth interconnection network (such as, loop network) with some fixing function logic unit, memory I/O Interface and other necessity, and this depends on application.
Figure 17 A is the block diagram of single according to an embodiment of the invention processor core, together with its connection with on-chip interconnection network 1702, and has the local subset of its grade 2 (L2) high-speed cache 1704.In one embodiment, instruction decoder 1700 supports x86 instruction set and packing data instruction set extension.L1 high-speed cache 1706 allows low latency to access cache memory to scalar sum vector units.Although in one embodiment (in order to simplify design), scalar units 1708 and vector units 1710 use independent Parasites Fauna (being scalar register 1712 and vector register 1714 respectively).And the data transmitted between which are written into memorizer, then read back from grade 1 (L1) high-speed cache 1706, but the alternate embodiments of the present invention can use diverse ways (such as, using single Parasites Fauna or include communication path, this communication path allows data to transmit without being returned by write and read between two register files).
The local subset of L2 high-speed cache 1704 is a part for overall situation L2 high-speed cache, and this overall situation L2 high-speed cache is divided into independent local subset, each processor core one local subset.Each processor core has the direct access path of the local subset of the L2 high-speed cache 1704 to himself.The data read by processor core are stored in its L2 cached subset 1704, and can be accessed quickly, and it is parallel that this and other processor core access themselves local L2 cached subset.The data write by processor core are stored in the L2 cached subset 1704 of their own, and if it is necessary to refresh from other subset.Looped network ensure that the concordance of shared data.Loop network is two-way, to allow agency, L2 high-speed cache and other logical blocks intercommunication in chip such as such as processor core.Each loop data path is each direction 1012 bit width.
Figure 17 B is the expander graphs of a part for the processor core of Figure 17 A according to embodiments of the present invention.Figure 17 B includes L1 data cache 1706A, and it is a part for L1 high-speed cache 1704, and more detailed about vector units 1710 and vector register 1714.Specifically, vector units 1710 is 16-width vector processor unit (VPU) (see 16-width ALU1728), and it performs one or more integers, single-precision floating point and double-precision floating point instruction.This VPU supports to utilize mixed (swizzle) unit 1720 of writing mix and write depositor input, utilize digital conversion unit 1722A-B to carry out numeral to change and utilize copied cells 1724 to replicate on memory cell.Write masks depositor 1726 allows the vector write that prediction obtains.
Processor with integrated memory controller and graphics
Figure 18 is a block diagram of a processor 1800 that may have more than one core, may have an integrated memory controller, and may have integrated graphics, according to an embodiment of the invention. The solid lined boxes in Figure 18 illustrate a processor 1800 with a single core 1802A, a system agent 1810, and a set of one or more bus controller units 1816, while the optional addition of the dashed lined boxes illustrates an alternative processor 1800 with multiple cores 1802A-N, a set of one or more integrated memory controller units 1814 in the system agent unit 1810, and special purpose logic 1808.
Thus, different implementations of the processor 1800 may include: 1) a CPU with the special purpose logic 1808 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1802A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 1802A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) work; and 3) a coprocessor with the cores 1802A-N being a large number of general purpose in-order cores. Thus, the processor 1800 may be a general-purpose processor, coprocessor, or special-purpose processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1800 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1806, and external memory (not shown) coupled to the set of integrated memory controller units 1814. The set of shared cache units 1806 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 1812 interconnects the integrated graphics logic 1808, the set of shared cache units 1806, and the system agent unit 1810/integrated memory controller unit(s) 1814, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1806 and cores 1802A-N.
In some embodiments, one or more of the cores 1802A-N are capable of multithreading. The system agent 1810 includes those components coordinating and operating cores 1802A-N. The system agent unit 1810 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 1802A-N and the integrated graphics logic 1808. The display unit is for driving one or more externally connected displays.
The cores 1802A-N may be homogeneous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1802A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
Exemplary computer architecture
Figures 19-22 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.
Referring now to Figure 19, shown is a block diagram of a system 1900 in accordance with one embodiment of the present invention. The system 1900 may include one or more processors 1910, 1915, which are coupled to a controller hub 1920. In one embodiment, the controller hub 1920 includes a graphics memory controller hub (GMCH) 1990 and an Input/Output Hub (IOH) 1950 (which may be on separate chips); the GMCH 1990 includes memory and graphics controllers to which are coupled memory 1940 and a coprocessor 1945; the IOH 1950 couples input/output (I/O) devices 1960 to the GMCH 1990. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1940 and the coprocessor 1945 are coupled directly to the processor 1910, and the controller hub 1920 is in a single chip with the IOH 1950.
The optional nature of the additional processors 1915 is denoted in Figure 19 with broken lines. Each processor 1910, 1915 may include one or more of the processing cores described herein and may be some version of the processor 1800.
The memory 1940 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1920 communicates with the processor(s) 1910, 1915 via a multi-drop bus such as a frontside bus (FSB), a point-to-point interface such as the QuickPath Interconnect (QPI), or a similar connection 1995.
In one embodiment, the coprocessor 1945 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like. In one embodiment, the controller hub 1920 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 1910, 1915 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.
In one embodiment, the processor 1910 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1910 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1945. Accordingly, the processor 1910 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect to the coprocessor 1945. The coprocessor 1945 accepts and executes the received coprocessor instructions.
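The recognize-and-forward behavior described above can be pictured, purely as a software analogy, as a dispatch loop that inspects each instruction's class and routes coprocessor-class instructions to an attached accelerator. The encoding, the class test, and every name below are assumptions invented for illustration; they do not describe the actual instruction formats or interfaces of this document.

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical decoded-instruction record; all fields are illustrative only. */
typedef struct {
    uint32_t opcode;
    uint64_t operand;
} insn_t;

/* Assumed convention: opcodes whose top byte is 0x0F are "coprocessor class". */
static bool is_coprocessor_class(const insn_t *i) { return (i->opcode >> 24) == 0x0F; }

static void execute_on_host(const insn_t *i)     { printf("host  executes 0x%08X\n", i->opcode); }
static void send_to_coprocessor(const insn_t *i) { printf("copro receives 0x%08X\n", i->opcode); }

int main(void)
{
    const insn_t stream[] = {
        { 0x01000000u, 1 },   /* general-purpose work stays on the host */
        { 0x0F000001u, 2 },   /* coprocessor-class work is forwarded    */
        { 0x02000000u, 3 },
    };

    for (size_t k = 0; k < sizeof stream / sizeof stream[0]; k++) {
        if (is_coprocessor_class(&stream[k]))
            send_to_coprocessor(&stream[k]);   /* analogous to issuing on the coprocessor bus */
        else
            execute_on_host(&stream[k]);
    }
    return 0;
}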
Referring now to Figure 20, shown is a block diagram of a first more specific exemplary system 2000 in accordance with an embodiment of the present invention. As shown in Figure 20, multiprocessor system 2000 is a point-to-point interconnect system, and includes a first processor 2070 and a second processor 2080 coupled via a point-to-point interconnect 2050. Each of processors 2070 and 2080 may be some version of the processor 1800. In one embodiment of the invention, processors 2070 and 2080 are respectively processors 1910 and 1915, while coprocessor 2038 is coprocessor 1945. In another embodiment, processors 2070 and 2080 are respectively processor 1910 and coprocessor 1945.
Processors 2070 and 2080 are shown including integrated memory controller (IMC) units 2072 and 2082, respectively. Processor 2070 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 2076 and 2078; similarly, second processor 2080 includes P-P interfaces 2086 and 2088. Processors 2070, 2080 may exchange information via a point-to-point (P-P) interface 2050 using P-P interface circuits 2078, 2088. As shown in Figure 20, IMCs 2072 and 2082 couple the processors to respective memories, namely a memory 2032 and a memory 2034, which may be portions of main memory locally attached to the respective processors.
Processors 2070, 2080 may each exchange information with a chipset 2090 via individual P-P interfaces 2052, 2054 using point-to-point interface circuits 2076, 2094, 2086, 2098. Chipset 2090 may optionally exchange information with the coprocessor 2038 via a high-performance interface 2039. In one embodiment, the coprocessor 2038 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.
A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Chipset 2090 may be coupled to a first bus 2016 via an interface 2096. In one embodiment, first bus 2016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in Figure 20, various I/O devices 2014 may be coupled to first bus 2016, along with a bus bridge 2018 which couples first bus 2016 to a second bus 2020. In one embodiment, one or more additional processors 2015, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, for example, graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to first bus 2016. In one embodiment, second bus 2020 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 2020 including, for example, a keyboard and/or mouse 2022, communication devices 2027, and a storage unit 2028 such as a disk drive or other mass storage device which, in one embodiment, may include instructions/code and data 2030. Further, an audio I/O 2024 may be coupled to the second bus 2020. Note that other architectures are possible. For example, instead of the point-to-point architecture of Figure 20, a system may implement a multi-drop bus or other such architecture.
Referring now to Figure 21, shown is a block diagram of a second more specific exemplary system 2100 in accordance with an embodiment of the present invention. Like elements in Figures 20 and 21 bear like reference numerals, and certain aspects of Figure 20 have been omitted from Figure 21 in order to avoid obscuring other aspects of Figure 21.
Figure 21 illustrates that the processors 2070, 2080 may include integrated memory and I/O control logic ("CL") 2072 and 2082, respectively. Thus, the CL 2072, 2082 include integrated memory controller units and include I/O control logic. Figure 21 illustrates that not only are the memories 2032, 2034 coupled to the CL 2072, 2082, but also that I/O devices 2114 are coupled to the control logic 2072, 2082. Legacy I/O devices 2115 are coupled to the chipset 2090.
Referring now to Figure 22, shown is a block diagram of an SoC 2200 in accordance with an embodiment of the present invention. Similar elements in Figure 18 bear like reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In Figure 22, an interconnect unit 2202 is coupled to: an application processor 2210, which includes a set of one or more cores 202A-N and shared cache unit(s) 1806; a system agent unit 1810; bus controller unit(s) 1816; integrated memory controller unit(s) 1814; a set of one or more coprocessors 2220, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 2230; a direct memory access (DMA) unit 2232; and a display unit 2240 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 2220 include a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as the code 2030 illustrated in Figure 20, may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which, when read by a machine, causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or the processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.
Emulation (including binary translation, code morphing, etc.)
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
Figure 23 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to one embodiment of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Figure 23 shows a program in a high level language 2302 that may be compiled using an x86 compiler 2304 to generate x86 binary code 2306 that may be natively executed by a processor with at least one x86 instruction set core 2316. The processor with at least one x86 instruction set core 2316 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 2304 represents a compiler operable to generate x86 binary code 2306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 2316. Similarly, Figure 23 shows that the program in the high level language 2302 may be compiled using an alternative instruction set compiler 2308 to generate alternative instruction set binary code 2310 that may be natively executed by a processor without at least one x86 instruction set core 2314 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, CA and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, CA). The instruction converter 2312 is used to convert the x86 binary code 2306 into code that may be natively executed by the processor without an x86 instruction set core 2314. This converted code is not likely to be the same as the alternative instruction set binary code 2310, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 2312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 2306.
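As a deliberately simplified illustration of the kind of mapping an instruction converter performs, the C sketch below translates a tiny invented source encoding into an equally invented target encoding one instruction at a time, falling back to a trap for opcodes it does not recognize (a runtime could then handle such cases by interpretation). Both encodings and all names are hypothetical; they bear no relation to the actual x86 or alternative instruction sets discussed above.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical one-byte source opcodes and their hypothetical target opcodes. */
enum { SRC_ADD = 0x01, SRC_SUB = 0x02, SRC_MOV = 0x03 };
enum { TGT_ADD = 0x90, TGT_SUB = 0x91, TGT_MOV = 0x92, TGT_TRAP = 0xFF };

/* Translate one source opcode into the target encoding. */
static uint8_t translate_opcode(uint8_t src)
{
    switch (src) {
    case SRC_ADD: return TGT_ADD;
    case SRC_SUB: return TGT_SUB;
    case SRC_MOV: return TGT_MOV;
    default:      return TGT_TRAP;   /* unknown opcode: defer to an interpreter */
    }
}

int main(void)
{
    const uint8_t source_code[] = { SRC_MOV, SRC_ADD, SRC_SUB, 0x7E };
    uint8_t target_code[sizeof source_code];

    for (size_t i = 0; i < sizeof source_code; i++) {
        target_code[i] = translate_opcode(source_code[i]);
        printf("src 0x%02X -> tgt 0x%02X\n", source_code[i], target_code[i]);
    }
    return 0;
}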

Claims (22)

1. An apparatus, comprising:
at least one vector partial distance computation circuit to compute partial sums and accumulated distances for a set of vectors in a search space;
a minimum sort network to sort a selected set of bits from the accumulated distances of the vectors in the search space, to indicate a minimum value and whether the minimum value is unique; and
a global control circuit to receive an output of the minimum sort network and to control aspects of the operation of the at least one vector partial distance computation circuit.
2. The apparatus of claim 1, wherein each vector partial distance computation circuit comprises:
a plurality of data element calculation circuits;
a compressor tree circuit to add the results of the plurality of data element calculation circuits;
a local control circuit to output a smaller window of bits from the accumulated distance, and to use the result of the minimum sort network to determine when the computation and sorting for a vector proceed to a next iteration or the vector is removed from the search space; and
an accumulator to add the partial distance result of the current iteration, wherein correct significance is provided by a shifter that shifts the partial distance before it is added to the distance accumulated over previous iterations.
3. The apparatus of claim 1, wherein the minimum sort network comprises:
a plurality of first-level comparison nodes, each to receive partial sums and valid bits from adjacent vector partial distance computation circuits and to output a valid bit, an accuracy bit, an address, and a sum, wherein the first-level comparison nodes are to:
logically OR the received adjacent valid bits to provide the output valid bit,
XOR the received adjacent valid bits, and
logically OR the result of the XOR operation with the output of a comparison of the adjacent sums against a possible difference to generate the output accuracy bit, wherein the accuracy bit is 1 to indicate either that the difference between the two inputs is greater than a programmable threshold or that both inputs are invalid; and
a plurality of second-level comparison nodes, each to receive partial sums, valid bits, addresses, and accuracy bits from adjacent comparison nodes and to output a valid bit, an accuracy bit, an address, and a sum, wherein the result of the comparison of the received sums is used to select among the incoming accuracy signals, the selected accuracy signal is logically ANDed with the accuracy signal computed at that node to generate the output accuracy signal, the output accuracy signal indicates whether the output sum is unique, and the result of the comparison forms the most significant sorted bit of the address.
4. The apparatus of claim 3, wherein the global control circuit comprises:
an OR tree to receive eliminate bits from a plurality of local control circuits and to OR them together;
a global mask to indicate to the global control logic which set of vectors contains the next nearest neighbor and where a global pointer needs to jump back to; and
a selector to select the global pointer from between the previous global pointer incremented by 1 and the output of a priority encoder coupled to the global mask.
5. The apparatus of claim 1, wherein the per-dimension bit size, the number of dimensions, and the number of reference vectors are reconfigurable.
6. The apparatus of claim 2, wherein each of the plurality of data element calculation circuits is a partial distance sum-of-absolute-differences circuit.
7. The apparatus of claim 2, wherein each of the plurality of data element calculation circuits is a partial distance sum-of-squares circuit.
8. The apparatus of claim 2, wherein each of the plurality of data element calculation circuits is reconfigurable to operate as part of a larger data element calculation circuit supporting multiple data element bit widths.
9. The apparatus of claim 2, wherein each of the plurality of data element calculation circuits is a partial distance signed dot-product circuit.
10. The apparatus of claim 1, wherein the global control circuit is to: coordinate, using a global pointer broadcast to all vectors, which bits of the accumulated distances are sent to the sort network; broadcast iteration-dependent control signals for the partial distance computation to all vectors; and keep track of the iteration in which each vector was eliminated from the search space when a nearest neighbor is found.
11. The apparatus of claim 10, wherein the control signals are stored in a programmable look-up table referenced by the global pointer.
12. The apparatus of claim 2, wherein the local control circuit and the distance accumulator in each vector partial distance computation circuit maintain their state even after the vector is eliminated from the search space, such that, when the next nearest neighbor is to be found, the local control circuit can reinsert the vector into the search space and reuse any previous computation up to the point at which the vector was eliminated, and wherein the local control circuit uses the output of the minimum sort network to determine when the computation and sorting for the vector proceed to the next iteration or the vector is removed from the search space.
13. The apparatus of claim 1, wherein the apparatus is configurable to sort in order of increasing distance.
14. The apparatus of claim 1, wherein the apparatus is configurable to change the order of the sort.
15. The apparatus of claim 1, wherein the apparatus operates on a data set larger than its memory capacity, the apparatus being used to compute the k nearest sorted candidates from a database, with eliminated candidates replaced in memory by retained object descriptors, and the process repeating until all object candidates have been iterated over, to find the overall k nearest descriptor vectors.
16. A method, comprising:
performing successive iterations of the following:
using at least one vector partial distance computation circuit to:
compute partial distances of a plurality of vectors with respect to a query vector, shift the computed partial distances, and accumulate the shifted computed distances; and
using a minimum sort network to sort the accumulated distances in order from most significant bit to least significant bit.
17. The method of claim 16, wherein each successive iteration refines the accuracy of the computed partial distances in order from most significant bits to least significant bits.
18. The method of claim 16, wherein sorting the accumulated distances in order from most significant bit to least significant bit starts with low-accuracy distances, and only the remaining vectors proceed to the next iteration to refine the lower-order bit accuracy used to determine the nearest neighbor.
19. The method of claim 16, wherein the sort network performs the sorting using a programmable threshold that indicates whether the minimum found in each comparison throughout the sort network and in each iteration is smaller than any other value by an amount greater than the threshold.
20. The method of claim 16, wherein the accumulated distance computation is decomposed into multiple partial distance computations of unequal lengths, such that, after upper-order bits have been computed and compared, accuracy improvements in lower-order bit positions by subsequent iterations cannot change the upper-order bits by more than the threshold.
21. The method of claim 16, wherein computing the partial distances of the plurality of vectors with respect to the query object vector comprises:
calculating the correct partial distances using circuits that contain circuitry for 1-D calculations configured by control signals and arranged according to the dimensionality of the vectors;
summing the partial distances of all 1-D calculations using a compressor tree; and
adding the summed partial distance to the current accumulated distance.
22. The method of claim 16, wherein each iteration is performed jointly by the at least one vector partial distance computation circuit and the minimum sort network circuit.
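Although the claims above recite hardware, the overall flow of claims 16 through 21 (iteratively refining distances from the most significant bits toward the least significant bits while discarding vectors that can no longer be the minimum) can be mirrored in software. The following C sketch is only a behavioral analogy under assumed parameters: 8-bit elements processed in 4-bit slices MSB-first, a sum-of-absolute-differences metric, a single nearest neighbor rather than k, and invented names throughout; none of these specifics are taken from the claims.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define DIMS        4      /* assumed vector dimensionality               */
#define NUM_VECTORS 6      /* assumed database size                       */
#define ELEM_BITS   8      /* assumed 8-bit unsigned elements             */
#define SLICE_BITS  4      /* assumed digit slice processed per iteration */
#define ITERS       (ELEM_BITS / SLICE_BITS)

/* Per-dimension absolute difference (sum-of-absolute-differences metric). */
static uint32_t absdiff(uint8_t a, uint8_t b) { return a > b ? a - b : b - a; }

int main(void)
{
    const uint8_t query[DIMS] = { 12, 200, 55, 91 };
    const uint8_t db[NUM_VECTORS][DIMS] = {
        { 10, 198, 60, 90 }, { 250,   3,  77,   1 }, { 13, 205, 50, 95 },
        {  0,   0,  0,  0 }, {  12, 200,  56,  91 }, { 100, 100, 100, 100 },
    };

    uint32_t partial[NUM_VECTORS] = { 0 };
    bool     active[NUM_VECTORS];
    for (int v = 0; v < NUM_VECTORS; v++) active[v] = true;

    /* Process the absolute differences one SLICE_BITS-wide digit at a time,
     * starting from the most significant digit. */
    for (int it = 0; it < ITERS; it++) {
        int shift = ELEM_BITS - SLICE_BITS * (it + 1);        /* MSB slice first */

        for (int v = 0; v < NUM_VECTORS; v++) {
            if (!active[v]) continue;
            uint32_t slice_sum = 0;                           /* compressor-tree analog */
            for (int d = 0; d < DIMS; d++)
                slice_sum += (absdiff(query[d], db[v][d]) >> shift) & ((1u << SLICE_BITS) - 1);
            partial[v] += slice_sum << shift;                 /* shifted accumulation */
        }

        /* Current minimum partial (lower-bound) distance among active vectors. */
        uint32_t min_partial = UINT32_MAX;
        for (int v = 0; v < NUM_VECTORS; v++)
            if (active[v] && partial[v] < min_partial) min_partial = partial[v];

        /* A vector whose lower bound already exceeds the minimum plus the largest
         * possible remaining contribution can never win: eliminate it. */
        uint32_t remaining_max = (shift > 0) ? DIMS * ((1u << shift) - 1) : 0;
        for (int v = 0; v < NUM_VECTORS; v++)
            if (active[v] && partial[v] > min_partial + remaining_max)
                active[v] = false;
    }

    int best = -1;
    for (int v = 0; v < NUM_VECTORS; v++)
        if (active[v] && (best < 0 || partial[v] < partial[best])) best = v;
    printf("nearest neighbor: vector %d, distance %u\n", best, (unsigned)partial[best]);
    return 0;
}

The elimination test is conservative: a vector is dropped only when its partial (lower-bound) distance already exceeds the current minimum by more than the largest contribution the unprocessed lower-order bits could still add, which parallels the threshold-based early elimination recited in claims 19 and 20.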
CN201510823660.4A 2014-12-24 2015-11-24 Systems, devices and methods for K nearest neighbor search Active CN105740200B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US14/582,607 US9626334B2 (en) 2014-12-24 2014-12-24 Systems, apparatuses, and methods for K nearest neighbor search
US14/582,607 2014-12-24
US14/944,828 US10303735B2 (en) 2015-11-18 2015-11-18 Systems, apparatuses, and methods for K nearest neighbor search
US14/944,828 2015-11-18

Publications (2)

Publication Number Publication Date
CN105740200A true CN105740200A (en) 2016-07-06
CN105740200B CN105740200B (en) 2019-07-30

Family

ID=56116747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510823660.4A Active CN105740200B (en) 2014-12-24 2015-11-24 Systems, devices and methods for K nearest neighbor search

Country Status (3)

Country Link
CN (1) CN105740200B (en)
DE (1) DE102015015182A1 (en)
TW (1) TWI604379B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182401A (en) * 2017-12-27 2018-06-19 武汉理工大学 A kind of safe iris identification method based on polymerization block message
CN110019657A (en) * 2017-07-28 2019-07-16 北京搜狗科技发展有限公司 Processing method, device and machine readable media
CN110476151A (en) * 2017-01-31 2019-11-19 脸谱公司 It is selected using the K of parallel processing
CN112749238A (en) * 2020-12-30 2021-05-04 北京金堤征信服务有限公司 Search ranking method and device, electronic equipment and computer-readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705858B (en) * 2021-08-02 2023-07-11 西安交通大学 Shortest path planning method, system, equipment and storage medium for multiple target areas

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5740326A (en) * 1994-07-28 1998-04-14 International Business Machines Corporation Circuit for searching/sorting data in neural networks
CN102918495A (en) * 2010-01-07 2013-02-06 线性代数技术有限公司 Hardware for performing arithmetic operations
CN103136535A (en) * 2011-11-29 2013-06-05 南京理工大学常熟研究院有限公司 K nearest neighbor search method for point cloud simplification
US20140189292A1 (en) * 2012-12-28 2014-07-03 Robert M. IOFFE Functional Unit Having Tree Structure To Support Vector Sorting Algorithm and Other Algorithms

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5740326A (en) * 1994-07-28 1998-04-14 International Business Machines Corporation Circuit for searching/sorting data in neural networks
CN102918495A (en) * 2010-01-07 2013-02-06 线性代数技术有限公司 Hardware for performing arithmetic operations
CN103136535A (en) * 2011-11-29 2013-06-05 南京理工大学常熟研究院有限公司 K nearest neighbor search method for point cloud simplification
US20140189292A1 (en) * 2012-12-28 2014-07-03 Robert M. IOFFE Functional Unit Having Tree Structure To Support Vector Sorting Algorithm and Other Algorithms

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110476151A (en) * 2017-01-31 2019-11-19 脸谱公司 It is selected using the K of parallel processing
CN110019657A (en) * 2017-07-28 2019-07-16 北京搜狗科技发展有限公司 Processing method, device and machine readable media
CN110019657B (en) * 2017-07-28 2021-05-25 北京搜狗科技发展有限公司 Processing method, apparatus and machine-readable medium
CN108182401A (en) * 2017-12-27 2018-06-19 武汉理工大学 A kind of safe iris identification method based on polymerization block message
CN108182401B (en) * 2017-12-27 2021-09-03 武汉理工大学 Safe iris identification method based on aggregated block information
CN112749238A (en) * 2020-12-30 2021-05-04 北京金堤征信服务有限公司 Search ranking method and device, electronic equipment and computer-readable storage medium

Also Published As

Publication number Publication date
TW201636823A (en) 2016-10-16
CN105740200B (en) 2019-07-30
TWI604379B (en) 2017-11-01
DE102015015182A1 (en) 2016-06-30

Similar Documents

Publication Publication Date Title
Kim et al. A novel zero weight/activation-aware hardware architecture of convolutional neural network
US10296660B2 (en) Systems, apparatuses, and methods for feature searching
CN105740200A (en) Systems, Apparatuses, and Methods for K Nearest Neighbor Search
CN104300990A (en) Parallel apparatus for high-speed, highly compressed LZ77 tokenization and Huffman encoding for deflate compression
US10303735B2 (en) Systems, apparatuses, and methods for K nearest neighbor search
JP2017107579A (en) Vector move instruction controlled by read and write masks
CN105247475A (en) Packed data element predication processors, methods, systems, and instructions
CN104919432A (en) Instruction for shifting bits left with pulling ones into less significant bits
Li et al. Accelerating binarized neural networks via bit-tensor-cores in turing gpus
US20240112007A1 (en) Neural network accelerator using logarithmic-based arithmetic
CN101907987A (en) Instruction and logic for performing range detection
CN104813279A (en) Instruction to reduce elements in a vector register with strided access pattern
US9626334B2 (en) Systems, apparatuses, and methods for K nearest neighbor search
Younes et al. An efficient selection-based KNN architecture for smart embedded hardware accelerators
CN100543670C (en) The method and apparatus of the quotient and the remainder of the integer division of generation extended precision
Ashok et al. ASIC design of MIPS based RISC processor for high performance
Vieira et al. A product engine for energy-efficient execution of binary neural networks using resistive memories
Taranco et al. LOCATOR: Low-power ORB accelerator for autonomous cars
Hoffmann et al. Using FPGAs to accelerate Myers bit-vector algorithm
CN105278916A (en) Apparatuses and methods for generating a suppressed address trace
TW201732569A (en) Counter to monitor address conflicts
CN104049940B (en) Systems, devices and methods for reducing short integer multiplication quantity
Nishimura et al. Accelerating the Smith-waterman algorithm using bitwise parallel bulk computation technique on GPU
CN104823153B (en) Processor, method, communication equipment, machine readable media, the equipment and equipment for process instruction of normalization add operation for execute instruction
Liang et al. TCX: A RISC style tensor computing extension and a programmable tensor processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant