CN117370563A - Natural language processing using KNN - Google Patents


Info

Publication number
CN117370563A
Authority
CN
China
Prior art keywords
vector
section
softmax
value
memory array
Legal status
Pending
Application number
CN202311388910.7A
Other languages
Chinese (zh)
Inventor
A·奥凯里博
Current Assignee
GSI Technology Inc
Original Assignee
GSI Technology Inc
Application filed by GSI Technology Inc
Publication of CN117370563A

Classifications

    • G06F40/40 - Processing or translation of natural language (G06F40/00 Handling natural language data)
    • G06F12/0223 - User address space allocation, e.g. contiguous or non-contiguous base addressing (G06F12/02 Addressing or allocation; relocation)
    • G06F16/3331 - Query processing (G06F16/33 Querying of unstructured textual data)
    • G06F16/35 - Clustering; classification of unstructured textual data
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification (G06F18/24 Classification techniques)
    • G06N3/02 - Neural networks (G06N3/00 Computing arrangements based on biological models)
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks (G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/048 - Activation functions
    • G06N3/08 - Learning methods

Abstract

A system for natural language processing includes a memory array and an in-memory processor. The memory array is divided into: a similarity section storing a plurality of feature vectors, a SoftMax section for determining the probability of occurrence of the feature vectors, a value section storing a plurality of modified feature vectors, and a tag section indicating the columns to be operated on. The processor activates the array to perform the following operations in parallel in each column indicated by the tag section: a similarity operation in the similarity section between a query vector and the feature vector stored in the indicated column; a SoftMax operation in the SoftMax section for determining an associated SoftMax probability value for the indicated feature vector; a multiplication operation in the value section for multiplying the associated SoftMax value by the modified feature vector stored in the indicated column; and a vector sum in the value section for accumulating the attention vector from the outputs of the multiplication operations.

Description

Natural language processing using KNN
This application is a divisional application of Chinese patent application 201810775578.2 of the same title, filed on July 16, 2018.
Cross Reference to Related Applications
The present application claims priority and benefit from U.S. provisional patent application 62/533,076, filed on July 16, 2017, and U.S. provisional patent application 62/686,114, filed on June 18, 2018, both of which are incorporated herein by reference.
Technical Field
The present invention relates generally to associative computing and, in particular, to data mining algorithms using associative computing.
Background
Data mining is a computational process for discovering patterns in large data sets. It uses different techniques to analyze the data sets. One of these techniques is classification, a technique for predicting the group membership of a new item based on the data of items in a dataset whose group memberships are known. The k-nearest neighbors algorithm (k-NN) is one of the known data mining classification methods, used in many areas where machine learning processes are applied, such as, but not limited to, bioinformatics, speech recognition, image processing, statistical estimation, pattern recognition, and numerous other applications.
In a large dataset of objects (e.g., products, images, faces, voice, text, video, human status, DNA sequences, etc.), each object may be associated with one of several predefined categories (e.g., product category may be a clock, vase, earring, pen, etc.). The number of categories may be small or large, and each object may be described by a set of attributes (e.g., size, weight, price, etc. for the product) in addition to being associated with a category. Each attribute may be further defined by a numerical value (e.g., for a product size: a width such as 20.5 cm, etc.). The goal of the classification process is to identify classes of unclassified objects (for which classes have not been defined) based on the values of the object attributes and their similarity to the classified objects in the dataset.
The K nearest neighbor algorithm first calculates the similarity between the incoming object X (unclassified) and each object in the dataset. Similarity is defined by the distance between objects, such that the smaller the distance, the more similar the objects; several known distance functions can be used. After the distances between the newly introduced object X and all the objects in the dataset have been calculated, the K nearest neighbors of X can be selected, where K is a predefined number set by the user of the K nearest neighbor algorithm. X is assigned to the most common class among its K nearest neighbors.
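By way of illustration only, the following Python sketch shows the classification flow just described; the attribute vectors, class labels and the Euclidean distance function are hypothetical examples and are not part of the invention.

```python
from collections import Counter
import math

def knn_classify(X, dataset, labels, k):
    """Classify X by majority vote among its k nearest neighbors.

    X        -- attribute vector of the unclassified object
    dataset  -- list of attribute vectors of classified objects
    labels   -- class label of each object in dataset
    k        -- number of neighbors to consider
    """
    # Euclidean distance is used here; any distance function may be substituted.
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Distance from X to every object in the dataset.
    dists = [(distance(X, obj), lbl) for obj, lbl in zip(dataset, labels)]

    # k nearest neighbors (smallest distances).
    nearest = sorted(dists, key=lambda d: d[0])[:k]

    # X is assigned the most common class among its k nearest neighbors.
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

# Hypothetical example: classify a product by size and weight attributes.
data = [[20.5, 1.2], [21.0, 1.1], [5.0, 0.1], [5.5, 0.2]]
labels = ["clock", "clock", "pen", "pen"]
print(knn_classify([5.2, 0.15], data, labels, k=3))  # -> "pen"
```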
Among other algorithms, the K-nearest neighbor algorithm requires a very fast and efficient analysis of a large unordered data set in order to quickly access the smallest or largest (i.e., extreme) K-term in the data set.
One method for finding the k min/max items in the dataset may be to first sort the dataset so that the numbers are ordered in sequence; the first (or last) k numbers are then the desired k items of the dataset. Many sorting algorithms are known in the art and may be used.
An in-memory sorting algorithm is described in U.S. patent application 14/594,434, filed on January 12, 2015 and assigned to the common assignee of the present application. The algorithm may be used to sort the numbers in a set by first finding the first minimum (or maximum), then the second minimum (or maximum), and repeating the process until all the numbers in the dataset are sorted from minimum to maximum (or from maximum to minimum). The computational complexity of the sorting algorithm described in U.S. patent application 14/594,434 is O(n), where n is the size of the set (because sorting the entire set takes n iterations). If the computation is stopped at the k-th iteration (when used to find the first k minima/maxima), the complexity is O(k).
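For illustration, the following short sketch (hypothetical data, standard Python only) contrasts the two approaches discussed above: sorting the whole set and taking the first k items, versus stopping after k minimum-extraction iterations, which corresponds to the O(k) behavior described for the in-memory algorithm.

```python
def k_smallest_by_sort(values, k):
    # Sort the whole set, then take the first k items.
    return sorted(values)[:k]

def k_smallest_by_extraction(values, k):
    # Repeatedly find the minimum of the remaining items and remove it,
    # stopping after k iterations -- the O(k) variant described above.
    remaining = list(values)
    result = []
    for _ in range(k):
        m = min(remaining)
        result.append(m)
        remaining.remove(m)
    return result

print(k_smallest_by_sort([88, 14, 92, 56, 200], 4))        # [14, 56, 88, 92]
print(k_smallest_by_extraction([88, 14, 92, 56, 200], 4))  # [14, 56, 88, 92]
```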
Disclosure of Invention
Thus, in accordance with a preferred embodiment of the present invention, a system for natural language processing is provided. The system includes a memory array and an in-memory processor. The memory array has rows and columns and is divided into: a similarity section initially storing a plurality of feature vectors or key vectors, a SoftMax section for determining the probability of occurrence of the feature vectors or key vectors, a value section initially storing a plurality of modified feature vectors, and a tag section. Operations in one or more columns of the memory array are associated with a feature vector to be processed. The in-memory processor activates the memory array to perform the following operations in parallel in each column indicated by the tag section:
similarity operations in the similarity section between a query vector and each feature vector stored in each indicated column;
a SoftMax operation in the SoftMax section for determining an associated SoftMax probability value for each indicated feature vector;
a multiplication operation in the value section for multiplying the associated SoftMax value by each modified feature vector stored in each indicated column; and
vector sum operations in the value section for accumulating an attention vector sum of the outputs of the multiplication operations. The vector sum is used to generate a new query vector for a further iteration, or to generate an output value in the final iteration.
Furthermore, in accordance with a preferred embodiment of the present invention, the memory array includes a plurality of operation portions, one portion per iteration of the natural language processing operation, each portion being divided into sections.
Further in accordance with a preferred embodiment of the present invention, the memory array is an SRAM, nonvolatile, volatile or non-destructive array.
Still further in accordance with a preferred embodiment of the present invention, the memory array includes a plurality of bit line processors, one bit line processor per column of each section, each bit line processor operating on one bit of data of its associated section.
Additionally in accordance with a preferred embodiment of the present invention the system also includes a neural network feature extractor for generating the feature vector and the modified feature vector.
Further in accordance with a preferred embodiment of the present invention, the feature vector includes features of words, sentences or documents.
Still further in accordance with a preferred embodiment of the present invention, the feature vector is an output of a pre-trained neural network.
Additionally in accordance with a preferred embodiment of the present invention, the system also includes a pre-trained neural network for generating the initial query vector.
Furthermore, in accordance with a preferred embodiment of the present invention, the system also includes a query generator for generating further queries from the initial query vector and the attention vector sum.
Further in accordance with a preferred embodiment of the present invention, the query generator is a neural network.
Alternatively, in accordance with a preferred embodiment of the present invention, the query generator is implemented as a matrix multiplier on the bit lines of the memory array.
There is also provided, in accordance with a preferred embodiment of the present invention, a method for natural language processing. The method includes having a memory array with rows and columns, the memory array divided into: a similarity section initially storing a plurality of feature vectors or key vectors, a SoftMax section for determining the probability of occurrence of the feature vectors or key vectors, a value section initially storing a plurality of modified feature vectors, and a tag section, wherein operations in one or more columns of the memory array are associated with one feature vector to be processed; and activating the memory array to perform the following operations in parallel in each column indicated by the tag section: a similarity operation in the similarity section between a query vector and each feature vector stored in each indicated column; a SoftMax operation in the SoftMax section to determine an associated SoftMax probability value for each indicated feature vector; a multiplication operation in the value section to multiply the associated SoftMax value by each modified feature vector stored in each indicated column; and a vector sum operation in the value section to accumulate an attention vector sum of the outputs of the multiplication operations. The vector sum is used to generate a new query vector for a further iteration, or to generate an output value in the final iteration.
Further in accordance with a preferred embodiment of the present invention, the memory array includes a plurality of bit line processors, one bit line processor per column of each section, and the method additionally includes each bit line processor operating on one bit of data of its associated section.
Still further in accordance with a preferred embodiment of the present invention the method further includes generating a feature vector and a modified feature vector using the neural network and storing the feature vector and the modified feature vector in a similarity section and a value section, respectively.
Furthermore, in accordance with a preferred embodiment of the present invention, the method also includes generating an initial query vector using a pre-trained neural network.
Additionally in accordance with a preferred embodiment of the present invention, the method further includes generating further queries from the initial query vector and the attention vector sum.
Further in accordance with a preferred embodiment of the present invention, the further queries are generated using a neural network.
Finally, in accordance with a preferred embodiment of the present invention, generating the further queries includes performing a matrix multiplication on bit lines of the memory array.
Drawings
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
FIGS. 1A and 1B are logical and physical schematic diagrams of a memory computing device for computing k extrema in a constant time constructed and operative in accordance with a preferred embodiment of the present invention.
FIG. 2 is a schematic diagram of a data set C stored in a memory array;
FIG. 3 is an example of a dataset C;
FIGS. 4 and 5 are schematic diagrams of temporary storage devices for computing;
FIG. 6 is a flowchart describing the calculation steps of the k-Mins processor;
FIGS. 7-11 are illustrations of an example of the steps of calculation of the exemplary dataset of FIG. 3 by a k-Mins processor constructed and operative in accordance with a preferred embodiment of the present invention;
FIG. 12 is a schematic diagram of one embodiment of efficient shifting for use in counting operations used by a k-Mins processor;
FIG. 13 is a schematic illustration of an event flow for a number of data mining cases;
FIG. 14 is a schematic diagram of a memory array having multiple bit line processors;
FIG. 15 is a schematic diagram of an associative memory layout of an end-to-end memory network constructed and operative to implement natural language processing; and
FIG. 16 is a schematic diagram of an associative processing unit for implementing all hops of the network within a memory in constant time.
It will be appreciated that for simplicity and clarity of illustration, elements illustrated in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Detailed Description
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the present invention.
Applicants have appreciated that ordering the data sets to find k minima is not efficient when the data sets are very large, as the complexity of known ordering mechanisms is proportional to the data set size. As the data set grows, the effective time to respond to a request to retrieve k minimum values from the data set will increase.
The applicant has further appreciated that associative memory devices may be used to store large data sets, and that associative computation may provide an in-memory method for finding the k minima in a data set of any size with a constant computational complexity O(1), proportional only to the size of the objects in the data set and not to the size of the data set itself.
Memory devices that may provide such constant-complexity computation are described in U.S. patent application 12/503,916 (now U.S. Patent No. 8,238,173), filed on July 16, 2009; U.S. patent application Ser. No. 14/588,419, filed on January 1, 2015; U.S. patent application Ser. No. 14/594,434 (now U.S. Patent No. 9,859,005), filed on January 12, 2015; U.S. patent application Ser. No. 14/555,638 (now U.S. Patent No. 9,418,719), filed on November 27, 2014; and U.S. patent application Ser. No. 15/146,908 (now U.S. Patent No. 9,558,812), filed on May 5, 2016; all of which are assigned to the common assignee of the present invention.
Applicants have further appreciated that in addition to a constant computational complexity, the associative computation may also provide a fast and efficient method to find k minima with minimum latency for each request. In addition, the data in the associative memory may not move during the computation and may remain in its original memory location prior to the computation.
It will be appreciated that increasing the size of the data set may not affect the computational complexity nor the response time of the k-Mins query.
Reference is now made to fig. 1A and 1B, which are schematic illustrations of a memory computing device 100 constructed and operative in accordance with a preferred embodiment of the present invention. As shown in FIG. 1A, the device 100 may include a memory array 110 for storing a data set, a k-Mins processor 120 implemented on memory logic elements to perform the k-Mins operations, and a k-Mins temporary store 130 that may be used to store intermediate and final results of the operations performed by the k-Mins processor 120 on the data stored in the memory array 110. In FIG. 1B, the physical aspects of the k-Mins processor 120 and the k-Mins temporary store 130 are shown as an associative memory array 140. The associative memory array 140 combines the operations of the k-Mins processor 120 with the storage of the k-Mins temporary store 130. The memory array 110 may store a very large data set of binary numbers. Each binary number is made up of a fixed number of bits and is stored in a different column of the memory array 110. The k-Mins temporary store 130 may store a copy of the information stored in the memory array 110, temporary information related to the computational steps performed by the k-Mins processor 120, and several vectors, including a final result vector indicating the k columns that store the k lowest values in the dataset.
It is understood that the data stored in memory array 110 and associative memory array 140 may be stored in columns (so as to be able to perform Boolean operations as described in the above-referenced U.S. patent applications). However, for clarity, the specification and drawings provide a logical view of the information in which the numbers are displayed horizontally (in rows). It will be appreciated that the actual storage and computation are done vertically.
Referring now to fig. 2, fig. 2 is a schematic diagram of a data set C stored in the memory array 110. As described above, the rows of data set C are stored as columns in memory array 110. The data set C may store multi-bit binary numbers in q rows. Each binary number in data set C is denoted C_p, where p is the identifier of the row in array C in which the binary number is stored. Each number C_p is composed of m bits C_p^i, where C_p^i represents bit i of the binary number stored in row p. The value of m (the number of bits of a binary number) may be 8, 16, 32, 64, 128, etc.
As described above, C_p represents row p in array C (p = 1..q), C^i represents column i in array C (i = 0..m-1), and C_p^i represents the cell in array C at the intersection of row p and column i (p = 1..q; i = 0..m-1). The item in row 3, column 2 of FIG. 2 is denoted C_3^2 and is marked with a square.
Referring now to fig. 3, fig. 3 is an example of a data set C having 11 binary numbers, i.e. q = 11. Each row is marked with an identifier from 0 to 10. The binary numbers in the exemplary dataset C each have 8 bits, stored in the columns labeled bit 7 through bit 0; m = 8 in this example. The decimal value of each binary number is shown to the right of each row. The desired number of minimum binary numbers to find in this example is set to 4, i.e. k = 4, and it will be appreciated that the four minimum numbers in the data set of fig. 3 are: (a) the number 14 stored in row 9; (b) the number 56 stored in row 5; (c) the number 88 stored in row 1; and (d) the number 92 stored in row 4.
The k-Mins processor 120, constructed and operative in accordance with a preferred embodiment of the present invention, may look up the k minimum binary numbers in the large data set C. The group of the k minimum numbers in data set C is called the k-Mins set, and it may have k members. The k-Mins processor 120 may scan the columns C^i of data set C from the MSB (most significant bit) to the LSB (least significant bit) and, at each column, select the rows C_p whose bit is 0 to proceed to the next step of creating the k-Mins set. It will be appreciated that a binary number having the value 0 at a particular position (its i-th bit) is smaller than a binary number having the value 1 at the same position, given that the more significant bits are identical.
The number of selected rows is compared with the target number k. If the number of selected rows is greater than k, the k-Mins processor 120 may continue scanning the next bit of the already selected rows, because there are too many rows and the set should be further reduced (the unselected rows contain binary numbers with larger values and are therefore not considered in the rest of the calculation). If the number of selected rows is less than k, the k-Mins processor 120 may add the selected rows to the k-Mins set and may continue scanning the next bit in all the remaining binary numbers (the number of selected rows is not large enough, so rows with somewhat larger binary numbers should also be considered). If the number of selected rows is exactly k, the k-Mins processor 120 may stop its processing, because the k-Mins set includes k items as required.
It may be noted that when k = 1, the k-Mins set contains a single number, which is the global minimum of the entire data set. It will also be appreciated that there may be more than one instance of that value in the dataset, and that the first instance of the value will be selected as a member of the k-Mins set.
It will be appreciated that the k-Mins processor 120 may exploit the manner in which the bits of the binary numbers of data set C are stored in the memory array 110. In the example of fig. 3, the binary numbers are shown in rows, where the MSB is the leftmost bit, the LSB is the rightmost bit, and all other bits are located in between. Furthermore, the arrangement of the binary numbers in the memory array 110 is such that the bits in the i-th position of all the binary numbers of data set C are located in the same row C^i of the memory array 110. That is, the MSBs of all binary numbers in data set C may be in one row, the LSBs of all binary numbers in data set C may be in another row, and likewise for all the bits in between.
Reference is now made to fig. 4 and 5, which are schematic illustrations of the k-Mins temporary store 130, constructed and operative in accordance with a preferred embodiment of the present invention. The k-Mins temporary store 130 may hold intermediate information stored in vectors. The vectors used by the k-Mins processor 120 are: vector D, a temporary inverse vector; vector V, a qualified k-Mins marker vector; vector M, a candidate vector; vector N, a temporary candidate vector; and vector T, a temporary membership vector. The size (number of rows) of all the vectors used by the k-Mins processor 120 is q, the same as the number of rows in dataset C. In each row, each vector stores an indication related to the k-Mins set for the binary number stored in the corresponding row of dataset C, e.g. whether it is part of the set, a candidate to join the set, and so on. It is understood that the vectors are physically stored in rows in the memory array 110, but are drawn as columns for clarity.
Vector D is a temporary inverse vector that may contain the inverse of the bits of the column C^i currently being processed by the k-Mins processor 120. As described above, the bits of the binary numbers of data set C may be processed from MSB to LSB, and at each step the k-Mins processor 120 may process another row i of the memory array 110.
Vector D is the inverse of the currently processed column C^i of dataset C:
D = NOT C^i
A value of 1 in any p-th row of vector D (i.e., D_p = 1) may indicate that the original bit stored in cell C_p^i (row p of dataset C) is 0, indicating that the binary number stored in row p of dataset C may be a candidate to join the k-Mins set. Similarly, a value of 0 in any p-th row of vector D (i.e., D_p = 0) may indicate that the original bit stored in cell C_p^i (row p of dataset C) is 1, indicating that the relevant binary number from dataset C may not be a candidate to join the k-Mins set, because it is larger than other numbers from the dataset being evaluated.
Vector V is the qualified k-Mins marker vector, maintaining a list of all the rows in dataset C whose binary numbers are (already) part of the k-Mins set. Like all the other vectors used by the algorithm, it is a vector of size q, maintaining in every p-th row a final indication V_p of whether the binary number C_p of dataset C belongs to the k-Mins set.
A value of 1 in any p-th row of vector V (i.e., V_p = 1) may indicate that the binary number stored in the same p-th row of data set C qualifies as a k-Mins set member. Similarly, a value of 0 in any p-th row of vector V (i.e., V_p = 0) may indicate that the binary number stored in the p-th row of data set C does not qualify as part of the k-Mins set.
Since the k-Mins set is empty at the beginning of the calculation, vector V may be initialized to all zeros. At the end of the calculation, V may include k qualification indications (i.e., k bits of vector V may have the value 1 and all other bits may have the value 0). Once a bit V_p in vector V is set to 1, the associated binary number C_p is part of the k-Mins set and will not cease to be part of it. The indications in vector V can only be set; an indication is not "unset" later in the calculation process as the k-Mins processor proceeds to the next column of dataset C. (Since the columns are processed from MSB to LSB, a number already determined to be among the smallest cannot become larger when the next column is processed.)
Vector M is the candidate vector, maintaining a list of all the rows in dataset C whose numbers could potentially be part of the k-Mins set. The associated binary numbers in data set C have not yet been added to the k-Mins set, but they have not been excluded from it either, and may be added to the set later as the k-Mins processor 120 proceeds. Like all the other vectors used by the k-Mins processor 120, it is a vector of size q, maintaining in every p-th row an indication M_p of whether the binary number C_p of dataset C can still be considered a candidate to join the k-Mins set.
A value of 1 in any p-th row of vector M (i.e., M_p = 1) may indicate that the binary number stored in the p-th row of data set C may be a candidate to join the k-Mins set. Similarly, a value of 0 (i.e., M_p = 0) may indicate that the binary number stored in row p of data set C may no longer be considered a candidate to join the k-Mins set.
Vector M may be initialized to all 1's, because all the numbers in data set C may potentially be part of the k-Mins set: the set may not be ordered and the numbers may be randomly distributed.
Once a bit M_p in vector M is set to 0, the associated binary number C_p in C may no longer be considered a potential candidate for the k-Mins set, and this indication does not change later in the calculation process as the k-Mins processor 120 continues to the next bit to be evaluated. A binary number that is no longer a candidate is larger than other binary numbers, so it may always be excluded from further evaluation.
Vector N is a temporary candidate vector, maintaining for every p-th row a temporary indication N_p of whether the number C_p, which is not yet in V, can still be considered a candidate to join the k-Mins set, taking into account the current candidate state of C_p expressed in vector M and the inverse of the currently processed bit, stored in vector D. N is the logical AND of vector M and vector D:
N = M AND D
A value of 1 in any p-th row of vector N (i.e., N_p = 1) may indicate that the binary number stored in the p-th row of data set C is still a candidate to join the k-Mins set. Similarly, a value of 0 in vector N (i.e., N_p = 0) may indicate that the binary number stored in row p of data set C may no longer be considered a candidate to join the k-Mins set. N_p will be 1 if and only if the binary number C_p was not previously excluded from the candidates (i.e., M_p = 1) and its currently examined bit in C is 0 (i.e., D_p = 1).
Vector T is a temporary membership vector, maintaining for every p-th row a temporary indication T_p of whether the binary number C_p is potentially a member of the k-Mins set, i.e. whether it is already in the k-Mins set (with an indication in vector V) or is a candidate to join the k-Mins set (with an indication in vector N). T is the logical OR of vector N and vector V:
T = N OR V
A value of 1 in any p-th row of vector T (i.e., T_p = 1) may indicate that the binary number stored in the p-th row of dataset C may be considered a temporary member of the k-Mins set, and a value of 0 in any p-th row of vector T (i.e., T_p = 0) may indicate that the associated binary number is not a member of the k-Mins set.
As described above, the k-Mins processor 120 may operate on all the numbers C_p in data set C simultaneously, and may iterate over their bits from MSB to LSB. It may start with an empty set (V = 0) and may assign the candidate state (M = 1) to all the binary numbers in the dataset. In each step, the k-Mins processor 120 evaluates the inverse of the bits of column C^i (D = NOT C^i); to find the k maxima, C^i itself is evaluated rather than its inverse. If the value of D_p is 0 (i.e., C_p^i = 1), then the number C_p is too large to join the k-Mins set and may be removed from the candidate list N (N = M AND D). The number of candidates is counted (CNT = COUNT(N OR V)) and compared with the required size k of the k-Mins set.
If CNT (the number of potential members of the k-Mins set) is smaller than required (CNT < k), then all the candidates may become qualified (V = N OR V) and the search may continue (because there are not yet enough qualified members in the k-Mins set).
If CNT is greater than required (CNT > k), then all the binary numbers whose currently examined bit has the value 1 may be removed from the candidate list (M = N), reducing the number of candidates. The remaining candidates continue to the next step.
If CNT exactly meets the required value (CNT = k), then all the candidates may become qualified (V = N OR V) and the computation of the k-Mins processor 120 may end.
Referring now to FIG. 6, FIG. 6 is a flowchart of the functional steps of a k-Mins processor 120 constructed and operative in accordance with a preferred embodiment of the present invention. The functional steps of the k-Mins processor 120 include: initialization 610, loop 620, compute vector 630, large set 640, small set 650, and appropriate set 660. The processing steps of the k-Mins processor 120 are also provided below as pseudo code.
Initialization 610 may initialize vector V to 0 because the k-Mins set may start from the empty set and may initialize vector M to 1 because all binary numbers in data set C may be candidates.
The loop 620 may loop over all bits of the binary number of data set C, starting from the MSB and ending at the LSB.
For each processed bit, the calculation vector 630 may calculate temporary vectors D, N and T, and may calculate the amount of candidates. Vector D may be created as the inverse of column i and candidate vector N is created from the existing candidates (in vector M), as well as the value of bit i reflected by vector D, which holds the inverse value of the bit being processed. Vector T may be calculated as a logical or between the current member of the k-Mins set reflected by vector V and the created candidate vector N. The number of candidates in the vector T may be counted as will be further described below.
If the number of candidates is greater than required, then large set 640 may update candidate vector M and may continue with the next bit. If the number of candidates is less than required, small set 650 may add the new candidates to the marker vector V and may continue with the next bit. If the number of candidates is exactly as required, appropriate set 660 may update the qualified marker vector V and may exit the loop, even if the computation has not reached the LSB.
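The pseudo code referred to above is not reproduced in this text. The following Python sketch is a reconstruction of the FIG. 6 flow from the description of steps 610-660; the vectors D, N, T, M and V are modeled as ordinary lists, and the example values for rows other than rows 1, 4, 5 and 9 are placeholders, since FIG. 3 itself is not reproduced here.

```python
def k_mins(C, k, m):
    """Find the rows holding the k smallest m-bit numbers in dataset C.

    Reconstruction of the k-Mins flow (FIG. 6) from the description above;
    in the associative memory every vector operation below runs on all rows
    in parallel, so the loop cost depends only on m, not on len(C).
    """
    q = len(C)
    V = [0] * q          # qualified k-Mins marker vector (initialization 610)
    M = [1] * q          # candidate vector (initialization 610)

    for i in range(m - 1, -1, -1):                 # loop 620: MSB -> LSB
        bit = [(c >> i) & 1 for c in C]            # column C^i
        D = [1 - b for b in bit]                   # D = NOT C^i
        N = [mp & dp for mp, dp in zip(M, D)]      # N = M AND D
        T = [np | vp for np, vp in zip(N, V)]      # T = N OR V
        cnt = sum(T)                               # count of indications (630)

        if cnt > k:        # large set 640: too many candidates, keep only N
            M = N
        elif cnt < k:      # small set 650: all current candidates become qualified
            V = T
        else:              # appropriate set 660: exactly k, done
            V = T
            break
    return V               # V marks the rows holding the k smallest numbers

# Illustrative dataset: rows 1, 4, 5 and 9 hold the values 88, 92, 56 and 14
# given for FIG. 3; the remaining rows are arbitrary larger placeholders.
C = [200, 88, 245, 150, 92, 56, 210, 103, 228, 14, 241]
result = k_mins(C, k=4, m=8)
print([p for p, v in enumerate(result) if v == 1])   # -> [1, 4, 5, 9]
```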
Figs. 7-11 are illustrations of an example of the calculation steps performed by the k-Mins processor 120, constructed and operative in accordance with a preferred embodiment of the present invention, on the exemplary dataset of fig. 3, together with the contents of the result vectors in each step of the algorithm. As previously described, the desired size of the k-Mins set in this example is set to 4.
Fig. 7 is a diagram of the contents of data set C, where the decimal value of each number makes the calculation result clear, and the contents of vectors V and M are 0 and 1, respectively, after their initialization.
Fig. 8 is a diagram of the states of the different vectors after the k-Mins processor 120 iterates over the MSB (bit number 7 in the example of data set C). Vector D may contain the inverse of column 7 of dataset C. Vector N may then be calculated as the logical AND of vector M and vector D. Vector T may then be calculated as the logical OR of vector N and vector V, and the number of indications in T is counted. The value of the count is 5, which is greater than the required k value of 4 in this example. In this case, vector M is updated to the value of N, and the algorithm proceeds to the next bit. Similarly, FIG. 9 is a diagram of the states of the different vectors after the k-Mins processor 120 iterates on the next bit (bit number 6 in the example of dataset C). As can be seen, the value of the count in fig. 9 is 2, which is smaller than the required value k = 4. In this case, vector V is updated with the value of T, and the algorithm proceeds to the next bit.
FIG. 10 is a diagram of the different vectors after the k-Mins processor 120 iterates over the next bit, bit number 5. Vector D may contain the inverse of column 5 of dataset C. Vector N may be calculated as the logical AND of vector M and vector D, as described previously. Vector T may then be calculated as the logical OR of vector N and vector V, and the number of bits having the value "1" is counted. The count value is 4, which is the required set size, so V is updated with the value of T and the algorithm ends. Vector V now contains a mark (bit value "1") in all the rows holding the smallest numbers in data set C, and it will be appreciated that the correct numbers are indicated by vector V.
In the data set of this example, there are exactly 4 binary numbers with the smallest value, and they can be found by the k-Mins processor 120 after 3 iterations, although the number of bits per binary number is 8. It will be appreciated that the processing complexity is limited by the number of bits of the binary number and not by the size of the data set.
When a binary number occurs more than once in the data set, the k-Mins processor 120 may reach the last bit of the binary number in the data set and fail to find exactly k items eligible as k-Mins members. In this case, an additional set of bits representing a unique index for each binary number in the dataset may be used as additional least significant bits. Since each binary number is associated with a unique index, the additional bits can ensure that a unique value is created for each item in the dataset and that an exact amount of items can be provided in the k-Mins set.
Referring now to FIG. 11, FIG. 11 is a diagram of an exemplary dataset C with repeated instances of binary numbers such that the size of the k-Mins set may be greater than k. (in the example of FIG. 11, there are two repeated binary numbers in rows 3 and 5, with a decimal value of 56, and three repeated binary numbers in rows 8, 9, and 10, with a decimal value of 14. Thus, there may be 5 entries in the K-Mins set, with K being 4). To reduce the number of entries in the k-Mins set, the index of each binary number may be processed by the k-Mins processor 120 as the least significant bit of the binary number of data set C. Since the index is unique, only k indexes will be in the k-Mins set. As shown in fig. 11, the addition of the index bit results in a k-Mins set with exactly k=4 members.
As detailed above, a k-Mins processor 120 constructed and operative in accordance with an embodiment of the present invention may count the number of indications in a vector, i.e., the set bits in vector T. There are a number of ways to count the number of set bits in a vector, one of which is the known pyramid count, which adds each number to its nearest neighbor, then adds the result to the result two columns away, then to the result 4 columns away, and so on, until the entire vector is counted.
Applicants have recognized that efficient counting can be achieved in associative memory using the RSP signal described in detail in U.S. application 14/594,434 (now issued as U.S. Patent No. 9,859,005), filed on January 12, 2015 and assigned to the common assignee of the present invention. The RSP signal may be used for the efficient large shifts of bits required when counting the indications in large vectors. When the vector is large, a large shift, such as a shift by 16, 256, 2K, etc., may be required as an immediate shift, rather than a bit-by-bit shift operation.
The RSP is a wired-OR circuit that can generate a signal in response to a positive identification of a data candidate in at least one column.
Reference is now made to fig. 12, which is a schematic diagram of one embodiment of using the RSP signal to enable efficient shifting for the counting operation, using the example array 1200. The array 1200 may include the following: rows 1210, vector 1220, position column 1230, X-hold 1240, RSP signal 1245, and RSP column 1250.
Row 1210 may be an index of rows in array 1200. There may be 16 rows in the array 1200, but the array 1200 may have any number of rows, such as 32, 64, 128, 256, 512, 1024, 2K, etc. Vector 1220 may be a vector of such bits: wherein the bit from row n should be relocated to row 0, i.e. the value of the bit in position n should be copied to position 0 (in order to e.g. add it to the bit in row 0 of another column). In each row, the value of the bit may be labeled "y", except for the value stored in row n, which is the value to be shifted, which is labeled "X". All bits of vector 1220 may have a value of "0" or "1". The position column 1230 may be a column having a value of "0" in all rows except that in the nth row, the bit (labeled X) in which the value is set to "1" is shifted from that row. X-hold 1240 may be the result of a Boolean AND operation between the value of vector 1220 and position 1230. X-hold 1240 may hold the value X stored in the nth row of vector 1220 and may null the values of all other rows of vector 1220.
The RSP signal 1245 is the result of an OR operation performed on all the cells of X-hold 1240 and may therefore have the value X. It will be appreciated that, since all the bits of X-hold 1240 have the value "0" except for the value X stored in the n-th row, the value of the Boolean OR operation over all the cells of X-hold 1240 will be the value X. The value received in the RSP signal 1245 may then be written into all the cells of RSP column 1250, including cell 0, effectively shifting the value X from row n to row 0.
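The following is a software model, for illustration only, of the RSP-based shift described for FIG. 12; in the device itself the wired-OR is a circuit on the bit lines, not a loop.

```python
def rsp_shift(vector, n):
    """Software model of the RSP-based shift of FIG. 12.

    vector -- list of 0/1 bits (column 1220)
    n      -- index of the row whose bit X must be relocated to row 0
    """
    q = len(vector)
    position = [1 if row == n else 0 for row in range(q)]   # position column 1230
    x_hold = [v & p for v, p in zip(vector, position)]      # AND: isolate the bit X
    rsp = 0
    for bit in x_hold:                                      # wired-OR over all cells
        rsp |= bit
    return [rsp] * q                                        # broadcast into RSP column 1250

column = [0, 1, 1, 0, 1, 0, 0, 1]
print(rsp_shift(column, n=4)[0])   # bit from row 4 (value 1) is now available in row 0
```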
The k-Mins algorithm described above may be used by a K nearest neighbor (K-NN) data mining algorithm. In K-NN, D may represent a large dataset containing q objects D_p (q is extremely large), where D_p ∈ D, and A is the object to be classified. An object is defined by a vector of numerical attributes: A is defined by a vector of n attributes [A_0, A_1, ... A_n], and each D_p is defined by a vector of the same n attributes. For the introduced object A and each object D_p in data set D, the distance between object A and object D_p is expressed as an m-bit binary number C_p. The distance C_p may represent the cosine similarity between two non-zero vectors. Cosine similarity, as known in the art, associates each pair of vectors with a scalar and is based on the inner product of the vectors.
The cosine distance may be calculated using the following formula:
cos(A, D_p) = (A · D_p) / (||A|| × ||D_p||)
For object A and each object D_p in the data set, the distance C_p is calculated and stored as a binary number in the large dataset C. The k-Mins algorithm can then find the k minimum binary numbers in C, representing the K nearest neighbors of A, in constant time.
It will be appreciated that the number of steps required to complete the calculation of the k-Mins algorithm (as used, for example, by the K-NN algorithm) depends only on the size of the objects stored in the dataset (the number of bits m that make up the binary number representing the distance between A and an object in the dataset), and not on the number of objects in the dataset (q), which may be very large. The calculation of the algorithm may be performed on all rows of the dataset at the same time. It will also be appreciated that adding objects to the dataset does not extend the processing time of the k-Mins processor 120. If used in an online application, the retrieval time of objects from the data set may remain the same as the data set grows.
It will be appreciated that the throughput of queries using the invention described above may be improved by starting the calculation of the next query before returning the results of the current query to the user. It will also be appreciated that the k-Mins processor 120 may create an ordered list of items, rather than a set, by adding to each binary number a numerical indication marking the iteration in which the object changed from the candidate state to the qualified state. Since smaller binary numbers become qualified earlier than larger binary numbers, the iteration identifier of a smaller binary number will be smaller than that of a larger binary number in data set C.
Unless specifically stated otherwise, as is apparent from the foregoing, it should be understood that, throughout the specification, the discussion of the k minimum numbers applies mutatis mutandis to the k maximum numbers, and vice versa; both may also be referred to as extreme numbers.
Applicants have recognized that the K-NN process may be utilized to increase the speed of classifiers and recognition systems in a wide variety of fields, such as speech recognition, image and video recognition, recommendation systems, natural language processing, and the like. The applicant has also appreciated that a K-NN algorithm constructed and operative in accordance with the preferred embodiment of the present invention can be used in areas where it was not previously practical, as it provides an excellent computational complexity of O(1).
Referring now to fig. 13, a flow of events for a large number of data mining cases that can be categorized using the K-NN algorithm at some point is illustrated. The system 1300 may include a feature extractor 1320 for extracting features 1330 from the input signal 1310 and a K-NN classifier 1340 for generating an identification and/or classification 1350 of items in the input signal 1310.
The signal 1310 may be an image, voice, a document, video, etc. For images, the feature extractor 1320 may be a Convolutional Neural Network (CNN) in its learning phase, or the like. For speech, the features 1330 may be mel-frequency cepstral coefficients (MFCC). For documents, the features may be Information Gain (IG), CHI-square (CHI), Mutual Information (MI), calculated Ng-Goh-Low coefficient values (NGL), calculated Galavotti-Sebastiani-Simi coefficient values (GSS), Relevance Scores (RS), MSF DF, term frequency of document frequency (TFDF), and so on. The extracted features may be stored on a device, such as the memory computing device 100 of fig. 1, on which the K-NN classifier 1340 may operate. Classification 1350 may be a predictive classification of the item, such as image recognition or classification for an image signal; speech detection or noise cancellation for an audio signal; document classification or spam detection for a document signal; etc.
For example, it will be appreciated that a CNN network may begin learning using a training set of items whose classifications are known. After a short learning time, a first convergence of the network is observed. The learning phase typically lasts hours and days to achieve complete convergence of a stable and reliable network.
According to a preferred embodiment of the invention, learning may be stopped immediately after convergence begins and the network may be stored in this "transitional" state before full convergence is reached.
According to a preferred embodiment of the present invention, the activation values of the training set, calculated using the network in its "transitional" state, may be defined as the features 1330 of each item of the training set, and may be stored along with the classification of each such item. It will be appreciated that the features may be normalized, i.e. the sum of the squares of all the activations of each item may be set to 1.0.
When a new item to be classified is received, the CNN is run on the item using the network in its transitional state, and a K-NN process using the stored features may be used to classify the new item. The K-NN classification of the new item may be performed by calculating the cosine similarity between the feature set of the new object and the items in the database, and classifying the new item with the class of its K nearest neighbors, as described in detail above.
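The following sketch outlines, under the stated assumptions, the classification flow just described: normalized activations from the partially trained ("transitional") network are stored with their labels, and a new item is classified by cosine similarity against them. The feature-extraction function is a placeholder, since the actual network is application-specific.

```python
import math
from collections import Counter

def normalize(features):
    # Scale so that the sum of squares of all activations is 1.0, as described above.
    norm = math.sqrt(sum(f * f for f in features))
    return [f / norm for f in features]

def classify_with_knn(new_item, extract_features, stored_features, stored_labels, k):
    """Classify new_item using the transitional-state network plus K-NN.

    extract_features -- placeholder for the CNN in its transitional state
    stored_features  -- normalized activations of the training items
    stored_labels    -- known class of each training item
    """
    query = normalize(extract_features(new_item))
    # With normalized vectors, cosine similarity reduces to a dot product.
    sims = [sum(q * f for q, f in zip(query, feats)) for feats in stored_features]
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return Counter(stored_labels[i] for i in nearest).most_common(1)[0][0]
```

Because every stored feature vector may occupy its own column of the associative memory, all of the dot products above can be computed simultaneously, which is what makes replacing the fully connected layers with K-NN attractive.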
It will be appreciated that the K-NN algorithm using the K-mins method described above may replace the last part of the standard CNN.
It should be appreciated that the addition of the K-NN algorithm may provide high classification accuracy with a partially trained neural network, while significantly reducing training cycle time.
The use of CNNs with K-NN for classification may replace fully connected portions of networks in applications such as image and video recognition, recommendation systems, natural language processing, and the like.
Applicant has appreciated that the KNN process described above may be used for Natural Language Processing (NLP).
Consider a long text such as a book, a collection of papers, or even the complete Wikipedia. Prior art natural language processors (NLPs) generate a neural network that can be interrogated with a long set of questions and can give correct answers. For this, they use Recurrent Neural Networks (RNNs). According to a preferred embodiment of the present invention, the long text may be stored in memory 110, and an associative memory array 140 using the KNN process described above may answer a variety of questions with a constant computational complexity of O(1). It should be appreciated that NLP may also be used for language translation, malware detection, and the like.
The input to the neural network is a key vector and the output is a value vector, generated internally within the neural network by a similarity search between the input key and all the other keys in the neural network. To answer a question, the output may be looped back as the next query, iterating as many times as necessary until an answer is found. Applicants have appreciated that an Associative Processing Unit (APU), such as memory computing device 100, may perform any search function (e.g., cosine similarity rather than an exact match), achieving all of the functionality required for natural language processing with a neural network.
End-to-end memory network architecture-prior art
The input representation: a story composed of sentences {x_i} is converted into a set of feature vectors m_i, generated by a pre-trained RNN, an autoencoder, or any other method (e.g., k-NN). These features are stored in the neural network. Another pre-trained embedding is then used to convert the question q into a feature vector (of the same dimension as the sentence vectors). The neural network then calculates the similarity as a matrix multiplication of q with each feature m_i. SoftMax is then calculated to obtain a probability vector. SoftMax may be performed over the entire neural network or over only the K nearest neighbor vectors.
The output representation: to generate an output, the probability vector is multiplied by modified feature vectors c_i (typically identical or very similar to the features m_i). After the multiplication, the processor accumulates all N products, or only the k nearest neighbors, to obtain an output support vector (the result is an intermediate answer that helps in arriving at the correct answer).
Generating the final prediction: the intermediate answer is combined with the original question as a new query for another hop (in the multi-layer variant of the model) or used in the final stage (after 3 hops). The predicted answer is then generated by multiplying the value vectors by their associated SoftMax probabilities and adding all the vectors into one vector, called the "attention vector".
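The three stages described above (input representation, output representation, final prediction) can be summarized in the following sketch of the prior art flow; the way the support vector is combined with the query, and the number of hops, are simplifying assumptions for illustration.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_hop(q, memory_keys, memory_values):
    """One hop of an end-to-end memory network.

    q             -- query feature vector
    memory_keys   -- feature vectors m_i of the stored sentences
    memory_values -- modified feature vectors c_i
    Returns the support (attention) vector for this hop.
    """
    # Similarity of the query with every key, as a dot product per sentence.
    scores = [sum(qj * mj for qj, mj in zip(q, m)) for m in memory_keys]
    # SoftMax turns the similarities into occurrence probabilities.
    probs = softmax(scores)
    # Weighted sum of the value vectors -- the attention vector / support answer.
    dim = len(memory_values[0])
    return [sum(p * c[d] for p, c in zip(probs, memory_values)) for d in range(dim)]

def answer(q, memory_keys, memory_values, hops=3):
    # The support vector of each hop is combined with the query to form the
    # next query; after the final hop it is used to generate the prediction.
    for _ in range(hops):
        support = memory_hop(q, memory_keys, memory_values)
        q = [qi + si for qi, si in zip(q, support)]   # simple combination, for illustration
    return q
```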
Associative implementation
In accordance with a preferred embodiment of the present invention, since the memory computing device 100 is fully scalable, it does not impose any limitation on the size of the text. The memory computing device may store millions of sentences. A typical associative memory server card may hold tens of millions of sentences, enough to store a vast database. For example, Wikipedia has 2 billion English words. Assuming these words are divided into 500 million sentences, the entire Wikipedia may be stored in 30-50 associative memory servers, or in a single server if pre-hashing is used. According to a preferred embodiment of the present invention, and as described in more detail below, all the execution steps occur in parallel for all sentences and have O(1) complexity.
Memory computing device 100 may be formed of any suitable memory array, such as SRAM, nonvolatile, volatile, and non-destructive arrays, and may be formed as a plurality of bit line processors 114, each processing a bit of a word and each storing the word in a column of an associated memory array 140, as discussed in US 9,418,719 (P-13001-US), assigned to the common assignee of the present invention, and incorporated herein by reference.
Thus, each column of array 140 may have multiple bit line processors. This is seen in fig. 14, to which reference is now briefly made, which shows a portion of array 140 in which 6 exemplary 2-bit words A, B, Q, R, X and Y are to be processed. Bits A1 and B1 may be stored in bit line processor 114A along bit line 156, while bits A2 and B2 may be stored in bit line processor 114B along bit line 158. Bits Q1 and R1 may be stored in bit line processor 114A along bit line 170, while bits Q2 and R2 may be stored in bit line processor 114B along bit line 172. Bits X1 and Y1 may be stored in bit line processor 114A along bit line 174, while bits X2 and Y2 may be stored in bit line processor 114B along bit line 176.
Typically, for an M-bit word, there may be M sections, each section storing a different bit of the word. Each segment may have an effective number N (e.g., 2048) of bit lines, and thus an effective number N of bit line processors. Each sector may provide a column of bit line processors. Thus, N M-bit words may be processed in parallel, where each bit may be processed in parallel by a separate bit line processor.
A typical column of cells (e.g., cell column 150) may store the input data to be processed in the first few cells of the column. In fig. 14, the bits of words A, Q and X are stored in the first cell of each column, while the bits of words B, R and Y are stored in the second cell of each column. According to a preferred embodiment of the invention, the remaining cells in each column (there may be 20-30 cells in a column) may be left as temporary storage for use during processing operations.
The multiplexers may be connected to rows of bit line processors and the row decoder may activate the appropriate cells in each bit line processor. As described above, the rows of cells in the memory array are connected by word lines, and thus the decoder can activate the relevant word lines of the cells of the bit line processor for reading and activate the relevant word lines in a different set of the bit line processor for writing.
For the natural language processing described above, the organization of the data in the associative memory is shown in fig. 15, to which reference is now made. There are three main portions 1410-j, one for each of the three iterations required to generate a result. Each portion may in turn be divided into three operation sections: a similarity section 1412-j for computing a similarity value in each column, a SoftMax section 1414-j for performing the SoftMax calculation on the similarity results, and a value section 1416-j for determining the attention vector or supporting answer. It will be appreciated that the columns of each section are aligned with each other, as are the columns of the different iterations. Thus, the operations on a feature x will typically occur within the same column in all operations.
The feature vectors, or key vectors, M^1_i of the N input sentences are stored in similarity section 1412-1 of memory 110, each feature vector M^1_i in a separate column. Thus, feature vector M^1_0 is stored in column 0, M^1_1 in column 1, and so on, and each vector M^1_i may be stored in its own bit line processor 114. As discussed above, the feature vectors may be the output of a pre-trained neural network or of any other vectorized feature extractor, and may be features of words, sentences, documents, or the like, as desired.
A modified feature vector C^1_i associated with each of the N input sentences may be the same as its associated key vector M^1_i, or some or all of the vectors may be modified in some suitable manner. The modified feature vectors C^1_i may initially be stored in value section 1416-1. Similar data may be stored in the similarity sections 1412-j and value sections 1416-j for the other iterations j, respectively.
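Purely as a non-limiting software sketch of this column-per-sentence layout (the number of sentences N, the feature length D, and the use of the numpy library are assumptions):

import numpy as np

# Conceptual layout for iteration 1: column i of the similarity section holds key
# vector M^1_i, and the same column of the value section holds the modified vector C^1_i.
N, D = 1024, 10                    # hypothetical: 1024 sentences, 10 features each
keys   = np.random.randn(D, N)     # similarity section 1412-1, one key vector per column
values = np.random.randn(D, N)     # value section 1416-1, one modified feature vector per column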
For similarity section 1412-j, memory computing device 100 may multiply the input vector q^j with each column in parallel, and may store the results, which may be the distances between the input vector and the features in each column of similarity section 1412-j, in the associated bit line processors 114, as discussed above. Exemplary matrix multiplication operations are described in U.S. patent application 15/466,889, assigned to the common assignee of the present invention and incorporated herein by reference. The input vector may be the initial question for iteration 1 and the follow-up question for each other iteration j.
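By way of a non-limiting software sketch of this similarity step (in the device it is an in-memory, column-parallel operation; here it is shown as an ordinary matrix-vector product, with the sizes and the numpy library as assumptions):

import numpy as np

# The question vector q is compared with every key column at once; the hardware does
# this in parallel over all bit lines, here it is a single matrix-vector product.
N, D = 1024, 10
keys = np.random.randn(D, N)             # one key vector per column
q    = np.random.randn(D)                # question vector for this iteration

dot_similarity = q @ keys                # dot-product similarity, one value per column
cos_similarity = dot_similarity / (np.linalg.norm(q) * np.linalg.norm(keys, axis=0))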
The tag vector T may be used to designate the selected columns and, when needed, to forget, or to insert and update, a new input vector; it may be implemented as a row 1420, which may be operated on for all iterations.
The SoftMax operation, described in the Wikipedia article "SoftMax function", may be implemented in SoftMax section 1414-j on the results of the dot-product or cosine similarity operation (on the columns selected by the tag vector T) performed in the associated similarity section 1412-j. The SoftMax operation may determine the probability of occurrence for each active column based on the similarity results of sections 1412-j. Each probability of occurrence has a value between 0 and 1, and the sum of the probabilities is 1.0.
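Purely as a non-limiting software sketch of the SoftMax step over the tagged columns (the column count, the tag pattern, and the numpy library are assumptions):

import numpy as np

# SoftMax over the columns selected by tag vector T: the selected similarity values
# become probabilities between 0 and 1 that sum to 1.0.
N = 1024
similarity = np.random.randn(N)          # outputs of the similarity section
T = np.ones(N, dtype=bool)               # tag vector: here every column is active

s = similarity[T]
p = np.exp(s - s.max())                  # subtracting the maximum keeps the exponentials stable
p /= p.sum()                             # probabilities summing to 1.0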
The SoftMax operation may include a number of exponential operations, which may be implemented as Taylor series approximations, where the intermediate data of each operation is stored in the bit line processors of the associated SoftMax section 1414-j.
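As a non-limiting illustration of such an approximation (the number of Taylor terms is an assumption; the device may use any suitable approximation):

import numpy as np

# Truncated Taylor series for exp(x); accurate for small arguments, which is the regime
# obtained after subtracting the maximum similarity value.
def exp_taylor(x, terms=8):
    result = np.ones_like(x, dtype=float)
    term = np.ones_like(x, dtype=float)
    for k in range(1, terms):
        term = term * x / k              # builds x**k / k! incrementally
        result = result + term
    return result

print(exp_taylor(np.array([0.5])), np.exp(0.5))   # the two values agree to several digits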
In value section 1416-j, each modified feature vector C^j_i is multiplied by its associated SoftMax value in its own bit line processor 114. The supporting answer may then be generated as the vector sum of the multiplied C^j_i vectors. In this attention operation, the sum may be accumulated horizontally over all columns selected by the tag vector T. The resulting SoftMax-weighted vector sum may be provided to the controller for generating the question for the next hop or iteration. Fig. 15 shows the initial iteration at the bottom of memory array 110, with the data for the further iterations stored in the portions above it. Three iterations are shown, each taking a question q^j as input and producing a supporting answer, or the final answer, as output.
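Purely as a non-limiting software sketch of this value step (the sizes and the numpy library are assumptions; in the device the per-column multiplications run in parallel in the bit line processors):

import numpy as np

# Each modified feature vector is scaled by its SoftMax probability in its own column,
# and the scaled columns are summed horizontally to give the attention / supporting-answer vector.
N, D = 1024, 10
values = np.random.randn(D, N)           # value section, one modified feature vector per column
p = np.random.rand(N); p /= p.sum()      # SoftMax probabilities from the previous step

weighted = values * p                    # per-column scaling
attention = weighted.sum(axis=1)         # horizontal accumulation over all tagged columns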
It should be appreciated that the initial question q^1 may be generated by a question generator using a pre-trained neural network external to memory computing device 100. The remaining questions q^j, up to the solution (typically reached at the third iteration, though more iterations are possible), may each be a combination of the original question vector and the attention vector.
The combination may be based on a neural network having two input vectors and one output vector. The input vectors are the original question vector q^1 and the attention vector of the previous iteration, and the output is the new question vector. This neural network may be implemented by matrix multiplication on the bit lines of the memory, or may be implemented externally.
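As a non-limiting software sketch of this hop update (the dimensions and the single random matrix standing in for the trained combining network are assumptions; in practice the combination is produced by a pre-trained network):

import numpy as np

# The next question is formed from the original question q1 and the current attention
# vector; a single linear map over their concatenation stands in for the trained network.
D = 10
q1 = np.random.randn(D)
attention = np.random.randn(D)

W = np.random.randn(D, 2 * D)            # hypothetical stand-in for the learned combination weights
q_next = W @ np.concatenate([q1, attention])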
It should be appreciated that the initial data stored in the similarity sections 1412-j may be the same for all iterations (i.e., the distances to question q^j are computed with respect to the same data). Similarly, the initial value data stored in the value sections 1416-j may be identical (i.e., the data to be multiplied by the SoftMax values is the same in each iteration).
Performance of
With all of the sentence features stored in memory, the matrix multiplication takes roughly 100 clock cycles per element of the query vector. Assuming a maximum of 10 features per sentence, this gives about 1000 clocks, executed in parallel for all N sentences (where N may be millions), or about 1 μsec with a 1 GHz clock. The SoftMax takes about 1 μsec and the multiply-and-accumulate operation about 4 μsec. Three hops/iterations therefore take 3 × (1 + 1 + 4) ≈ 20 μsec, achieving roughly 50,000 questions per second.
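The same arithmetic, written out explicitly (all figures are the ones quoted above):

# Timing estimate per question, using the figures quoted in the text.
cycles_per_feature = 100
features_per_sentence = 10
clock_hz = 1e9

similarity_us = cycles_per_feature * features_per_sentence / clock_hz * 1e6   # 1 usec
softmax_us = 1.0
mac_us = 4.0
per_question_us = 3 * (similarity_us + softmax_us + mac_us)                    # ~18 usec for 3 hops

print(per_question_us, 1e6 / per_question_us)   # ~18 usec -> ~55,000 questions/second,
                                                # in line with the ~20 usec / 50,000 figure above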
Reference is now made briefly to the alternative system 1500 shown in Fig. 16, which may include an associative memory 1510 that may be large enough to handle only a single iteration, together with other elements that perform the remaining computations.
As in the previous embodiment, associative memory 1510 may include a similarity section 1512 for operating on the feature vectors (referred to herein as "keys"), a SoftMax section 1514 for implementing the SoftMax operation, and a value section 1516 for operating on the values associated with the feature vectors. This embodiment may perform each hop within memory 1510 in constant time. As can be seen in Fig. 16, some operations occur within memory 1510 while other operations occur outside it. The performance is about the same as for the end-to-end implementation, each hop taking about 6 μsec.
Flexibility for any memory network
It should be appreciated that, since the associative processor provides all the capabilities of search by content, computed in parallel on all bit lines of the memory in constant time, it may represent various types of memory networks, such as key-value memory networks for directly reading documents (Miller et al., EMNLP 2016).
While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

Claims (19)

1. A system for natural language processing, the system comprising:
a memory array having rows and columns, the memory array being divided into: a similarity section for initially storing a plurality of feature vectors or key vectors, a SoftMax section for determining an occurrence probability for each feature vector or key vector, a value section for initially storing a plurality of modified feature vectors, and a tag section for storing a tag vector specifying the columns to be operated on, wherein the operations in one or more columns of the memory array are associated with one feature vector to be processed; and
an in-memory processor for activating the memory array to perform the following operations in parallel in each column specified by the tag vector:
a similarity operation in the similarity section between a question vector and each of the feature vectors stored in each of the indicated columns, for generating a similarity output in each of the indicated columns;
a SoftMax operation in the SoftMax section for each of the similarity outputs in the similarity section for determining an associated SoftMax value for each indicated feature vector, wherein an intermediate output of an exponential operation of the SoftMax operation is stored in a bit line processor of the SoftMax section for each indicated column; and
a multiplication operation in the value section for multiplying each said associated SoftMax value in the SoftMax section by each said modified feature vector stored in each indicated column, for generating a multiplication output in each indicated column;
wherein the in-memory processor is further configured to perform a horizontal vector sum of the multiplication outputs in each indicated column of the value section to accumulate an attention vector sum, the vector sum being used to generate a new question vector for a further iteration or to generate an output value in a final iteration.
2. The system of claim 1, wherein the memory array comprises operation portions, one portion per iteration of the natural language processing operation, each portion divided into the similarity section, the SoftMax section, and the value section.
3. The system of claim 1, wherein the memory array is one of: SRAM, nonvolatile, volatile, and non-destructive array.
4. The system of claim 1, wherein the memory array comprises a plurality of bit line processors, one bit line processor per column of each of the segments, each bit line processor operating on one bit of data of its associated segment.
5. The system of claim 1, and further comprising a neural network feature extractor for generating the feature vector and the modified feature vector.
6. The system of claim 1, and wherein the feature vector comprises a feature of a word, sentence, or document.
7. The system of claim 1, wherein the feature vector is an output of a pre-trained neural network.
8. The system of claim 1, and further comprising a pre-trained neural network for generating an initial question vector.
9. The system of claim 8, and further comprising a question generator for generating further questions from the initial question vector and the attention vector sums.
10. The system of claim 9, wherein the question generator is a neural network.
11. The system of claim 9, and wherein the question generator is implemented as a matrix multiplier on bit lines of the memory array.
12. A method for natural language processing, the method comprising:
providing a memory array having rows and columns, the memory array being divided into: a similarity section for initially storing a plurality of feature vectors or key vectors, a SoftMax section for determining an occurrence probability for each feature vector or key vector, a value section for initially storing a plurality of modified feature vectors, and a tag section for storing a tag vector specifying the columns to be operated on, wherein the operations in one or more columns of the memory array are associated with one feature vector to be processed; and
Activating the memory array to perform the following operations in parallel in each column specified by the tag vector:
performing a similarity operation in the similarity section between a question vector and each of the feature vectors stored in each of the indicated columns to generate a similarity output in each of the indicated columns;
performing a SoftMax operation in the SoftMax section on each of the similarity outputs in the similarity section to determine an associated SoftMax value for each indicated feature vector, wherein an intermediate output of an exponential operation of the SoftMax operation is stored in a bit line processor of the SoftMax section for each indicated column; and
performing a multiplication operation in the value section to multiply each of the associated SoftMax values in the SoftMax section by each of the modified feature vectors stored in each indicated column to generate a multiplication output in each indicated column; and
performing a horizontal vector sum operation of the multiplication outputs in each indicated column of the value section to accumulate an attention vector sum, for generating a new question vector for a further iteration or for generating an output value in a final iteration.
13. The method of claim 12, wherein the memory array comprises a plurality of bit line processors, one bit line processor per column of each of the sections, the method further comprising each of the bit line processors operating on one bit of data of its associated section.
14. The method of claim 12, and further comprising generating the feature vector and the modified feature vector with a neural network and storing the feature vector and the modified feature vector in the similarity section and the value section, respectively.
15. The method of claim 12, and wherein the feature vector comprises a feature of a word, sentence, or document.
16. The method of claim 12, and further comprising generating an initial question vector using a pre-trained neural network.
17. The method of claim 16, and further comprising generating further questions from the initial question vector and the attention vector sum.
18. The method of claim 17, wherein generating the further questions utilizes a neural network.
19. The method of claim 17, and wherein generating the further questions comprises performing a matrix multiplication on bit lines of the memory array.
CN202311388910.7A 2017-07-16 2018-07-16 Natural language processing using KNN Pending CN117370563A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201762533076P 2017-07-16 2017-07-16
US62/533,076 2017-07-16
US201862686114P 2018-06-18 2018-06-18
US62/686,114 2018-06-18
CN201810775578.2A CN110019815B (en) 2017-07-16 2018-07-16 Natural language processing using KNN

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201810775578.2A Division CN110019815B (en) 2017-07-16 2018-07-16 Natural language processing using KNN

Publications (1)

Publication Number Publication Date
CN117370563A true CN117370563A (en) 2024-01-09

Family

ID=65277406

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201810775578.2A Active CN110019815B (en) 2017-07-16 2018-07-16 Natural language processing using KNN
CN202311388910.7A Pending CN117370563A (en) 2017-07-16 2018-07-16 Natural language processing using KNN

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201810775578.2A Active CN110019815B (en) 2017-07-16 2018-07-16 Natural language processing using KNN

Country Status (2)

Country Link
KR (1) KR102608683B1 (en)
CN (2) CN110019815B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190266482A1 (en) * 2018-02-26 2019-08-29 Gsi Technology Inc. Distance based deep learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1182577A1 (en) * 2000-08-18 2002-02-27 SER Systeme AG Produkte und Anwendungen der Datenverarbeitung Associative memory
US20070205969A1 (en) * 2005-02-23 2007-09-06 Pixtronix, Incorporated Direct-view MEMS display devices and methods for generating images thereon
US8005782B2 (en) * 2007-08-10 2011-08-23 Microsoft Corporation Domain name statistical classification using character-based N-grams
US20160086222A1 (en) * 2009-01-21 2016-03-24 Truaxis, Inc. Method and system to remind users of targeted offers in similar categories
US9747547B2 (en) * 2013-10-22 2017-08-29 In2H2 Hardware enhancements to radial basis function with restricted coulomb energy learning and/or k-Nearest Neighbor based neural network classifiers
US9418719B2 (en) * 2013-11-28 2016-08-16 Gsi Technology Israel Ltd. In-memory computational device
US9859005B2 (en) * 2014-01-12 2018-01-02 Gsi Technology Inc. Memory device
EP3218854B1 (en) * 2014-11-14 2021-01-06 Google LLC Generating natural language descriptions of images
US10303735B2 (en) * 2015-11-18 2019-05-28 Intel Corporation Systems, apparatuses, and methods for K nearest neighbor search
US9646243B1 (en) * 2016-09-12 2017-05-09 International Business Machines Corporation Convolutional neural networks using resistive processing unit array

Also Published As

Publication number Publication date
KR20190008514A (en) 2019-01-24
CN110019815A (en) 2019-07-16
CN110019815B (en) 2023-11-17
KR102608683B1 (en) 2023-11-30

Similar Documents

Publication Publication Date Title
US20210158164A1 (en) Finding k extreme values in constant processing time
US20180341642A1 (en) Natural language processing with knn
US11494615B2 (en) Systems and methods for deep skip-gram network based text classification
Nadif et al. Unsupervised and self-supervised deep learning approaches for biomedical text mining
Socher et al. Learning continuous phrase representations and syntactic parsing with recursive neural networks
CN111914054A (en) System and method for large scale semantic indexing
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN112395438A (en) Hash code generation method and system for multi-label image
Lian et al. Product quantized collaborative filtering
CN115438160A (en) Question and answer method and device based on deep learning and electronic equipment
Chatterjee et al. A pipeline for automating labeling to prediction in classification of nfrs
Köksal et al. Improving automated Turkish text classification with learning‐based algorithms
CN110019815B (en) Natural language processing using KNN
Wang et al. $ k $-Nearest Neighbor Augmented Neural Networks for Text Classification
Azunre et al. Semantic classification of tabular datasets via character-level convolutional neural networks
Bai et al. Neural maximum common subgraph detection with guided subgraph extraction
CN114595389A (en) Address book query method, device, equipment, storage medium and program product
WO2023044927A1 (en) Rna-protein interaction prediction method and apparatus, and medium and electronic device
WO2023044931A1 (en) Rna-protein interaction prediction method and apparatus, and medium and electronic device
Ding et al. Partial Annotation Learning for Biomedical Entity Recognition
Waradpande et al. Predicting Completeness of Unstructured Shipping Addresses Using Ensemble Models
Bitto et al. Approach of Different Classification Algorithms to Compare in N-gram Feature Between Bangla Good and Bad Text Discourses
Kolkman Cross-domain textual geocoding: the influence of domain-specific training data
Xu et al. Personalized Repository Recommendation Service for Developers with Multi-modal Features Learning
Pazhouhi Automatic product name recognition from short product descriptions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination