CN110197252A - Deep learning based on distance - Google Patents

Deep learning based on distance

Info

Publication number
CN110197252A
CN110197252A (application CN201910136561.7A)
Authority
CN
China
Prior art keywords
vector
item
distance
similarity scores
unclassified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910136561.7A
Other languages
Chinese (zh)
Inventor
E. Erez (埃雷兹)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GSI Technology Inc
Original Assignee
GSI Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GSI Technology Inc filed Critical GSI Technology Inc
Publication of CN110197252A
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/045Combinations of networks
    • G06N3/048Activation functions
    • G06N3/08Learning methods
    • G06N20/00Machine learning
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks


Abstract

A method for a neural network includes concurrently computing a distance vector between an output feature vector and each of a plurality of qualified feature vectors, where the output feature vector describes an unclassified item and each qualified feature vector describes one classified item of a set of classified items. The method also includes concurrently computing a similarity score for each distance vector and creating a similarity score vector of the plurality of computed similarity scores. A system for a neural network includes an associative memory array, an input composer, a hidden layer computer and an output processor. The input composer manipulates information, stored in the memory array, that describes an unclassified item. The hidden layer computer computes a hidden layer vector. The output processor computes an output feature vector, concurrently computes a distance vector between the output feature vector and each of a plurality of qualified feature vectors, and concurrently computes a similarity score for each distance vector.

Description

Deep learning based on distance
Technical field
The present invention relates to associative memory devices generally, and to deep learning in associative memory devices in particular.
Background
A neural network is a computing system that learns to perform tasks by considering examples, generally without being programmed with any task-specific rules. A typical neural network is an interconnected group of nodes organized in layers; each layer may perform a different transformation on its input. Mathematically, a neural network may be represented by vectors, which store the activations of the nodes of a layer, and by matrices, which store the weights of the interconnections between nodes of adjacent layers. The network function is a series of mathematical operations performed on and between the vectors and matrices, together with nonlinear operations applied to the values stored in them.
Throughout this application, matrices are denoted by bold capital letters, e.g. A, vectors by bold lowercase letters, e.g. a, and the entries of vectors and matrices by italic letters, e.g. A and a. Thus, the (i, j) entry of matrix A is denoted A_ij, row i of matrix A is denoted A_i-, column j of matrix A is denoted A_-j, and entry i of vector a is denoted a_i.
A recurrent neural network (RNN) is a particular type of neural network that operates on a sequence of values and is useful when the current output depends on previously computed values. LSTM (long short-term memory) and GRU (gated recurrent unit) networks are examples of RNNs.
The output feature vector of a network (recurrent or not) is a vector h storing m numerical values. In language modeling, h may be an output embedding vector (a vector of numbers - real, integer, finite precision, etc. - representing a word or phrase of a vocabulary), and in other deep learning disciplines h may be the features of the object in question. The application may need to determine which item the vector h represents. In language modeling, h may represent one word out of a vocabulary of v words that the application needs to recognize. It will be appreciated that v may be very large; for English, v is approximately 170,000.
Fig. 1 shows an RNN in two representations: folded 100A and unfolded 100B. The unfolded representation 100B depicts the RNN at times t-1, t and t+1, showing its evolution over time. In the folded representation, vector x is the "generic" input vector, while in the unfolded representation x_t is the input vector at time t. It may be appreciated that the input vector x_t represents one item in the sequence of items handled by the RNN. The vector x_t may represent item k out of a set of v items as a "one-hot" vector, i.e. a vector that is all zeros except for a single "1" at position k. The matrices W, U and Z are parameter matrices, created with specific dimensions to suit the planned operations. The matrices are initialized with random values and are updated during the operation of the RNN - during the training phase and sometimes also during the inference phase.
In the folded representation, the vector h represents the hidden layer of the RNN. In the unfolded representation, h_t is the value of the hidden layer at time t, computed from the value of the hidden layer at time t-1 according to equation 1:
h_t = f(U*x_t + W*h_(t-1))    (Equation 1)
In the folded representation, y represents the output vector. In the unfolded representation, y_t is the output vector at time t, holding, for each item in the set of v items, the probability that it is the class of the item at time t. The probabilities may be computed by a nonlinear function such as SoftMax, according to equation 2:
y_t = SoftMax(Z*h_t)    (Equation 2)
where Z is a size adjustment matrix intended to adjust the size of h_t to the size of y_t.
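For illustration only (not part of the original patent text), the following is a minimal NumPy sketch of one prior-art RNN step as described by equations 1 and 2; the choice of f = tanh and the example sizes v and m are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable SoftMax over a score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, U, W, Z):
    """One prior-art RNN step (equations 1 and 2).

    x_t    : one-hot input vector of size v
    h_prev : hidden state at time t-1, of size m
    U, W, Z: parameter matrices of sizes (m, v), (m, m), (v, m)
    """
    h_t = np.tanh(U @ x_t + W @ h_prev)   # equation 1, with f = tanh assumed
    y_t = softmax(Z @ h_t)                # equation 2, probabilities over the v items
    return h_t, y_t

# Example with assumed sizes: v = 10 items, m = 4 hidden units.
v, m = 10, 4
rng = np.random.default_rng(0)
U, W, Z = rng.normal(size=(m, v)), rng.normal(size=(m, m)), rng.normal(size=(v, m))
x_t = np.eye(v)[3]                        # one-hot vector for item k = 3
h_t, y_t = rnn_step(x_t, np.zeros(m), U, W, Z)
```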
RNNs are used in many applications that process sequences of items, such as: language modeling (processing sequences of words); machine translation; speech recognition; dialogue; video annotation (processing sequences of pictures); handwriting recognition (processing sequences of marks); image-based sequence recognition; and the like.
For example, language modeling computes the probability that a number of words occur in a particular sequence. A sequence of m words is given by {w_1, ..., w_m}. The probability of the sequence is defined by p(w_1, ..., w_m), and the probability of a word w_i conditioned on all of the previous words in the sequence may be approximated by a window of the n previous words, as defined in equation 3:
p(w_i | w_1, ..., w_(i-1)) ≈ p(w_i | w_(i-n), ..., w_(i-1))    (Equation 3)
The probability of a word sequence may be estimated empirically by counting the number of times each combination of words occurs in a corpus of text. For a window of n words the model is known as an n-gram language model; for two words it is known as a bi-gram. The memory needed to count n-gram occurrences grows exponentially with the window size n, which makes modeling large windows without running out of memory extremely difficult.
An RNN may be used to model the likelihood of word sequences without explicitly storing the probability of each sequence. The complexity of the computation of an RNN used for language modeling is proportional to the size v of the vocabulary of the modeled language. It requires numerous matrix-vector multiplications and SoftMax operations, all of which are computationally heavy.
Summary of the invention
There is provided, in accordance with a preferred embodiment of the present invention, a method for a neural network. The method includes concurrently computing a distance vector between an output feature vector of the neural network and each of a plurality of qualified feature vectors. The output feature vector describes an unclassified item, and each of the plurality of qualified feature vectors describes one classified item in a set of classified items. The method also includes concurrently computing a similarity score for each distance vector, and creating a similarity score vector of the plurality of computed similarity scores.
Moreover, in accordance with a preferred embodiment of the present invention, the method also includes reducing the size of an input vector of the neural network by concurrently multiplying the input vector by a plurality of columns of an input embedding matrix.
Further, in accordance with a preferred embodiment of the present invention, the method also includes concurrently activating a nonlinear function on all elements of the similarity score vector, thereby providing a probability distribution vector.
Still further, in accordance with a preferred embodiment of the present invention, the nonlinear function is a SoftMax function.
Additionally, in accordance with a preferred embodiment of the present invention, the method also includes finding an extreme value in the probability distribution vector to find the classified item most similar to the unclassified item, with a computational complexity of O(1).
Moreover, in accordance with a preferred embodiment of the present invention, the method also includes activating a K-nearest-neighbors (KNN) function on the similarity score vector, thereby providing the k classified items most similar to the unclassified item.
There is provided, in accordance with a preferred embodiment of the present invention, a system for a neural network. The system includes an associative memory array, an input composer, a hidden layer computer and an output processor. The associative memory array includes rows and columns. The input composer stores information about an unclassified item in the associative memory array, manipulates the information and creates an input to the neural network. The hidden layer computer receives the input and runs it through the neural network to compute a hidden layer vector. The output processor transforms the hidden layer vector into an output feature vector, concurrently computes, in the associative memory array, a distance vector between the output feature vector and each of a plurality of qualified feature vectors, each qualified feature vector describing one classified item, and also concurrently computes, in the associative memory array, a similarity score for each distance vector.
Moreover, in accordance with a preferred embodiment of the present invention, the input composer reduces the size of the information.
Further, in accordance with a preferred embodiment of the present invention, the output processor also includes a linear module and a nonlinear module.
Still further, in accordance with a preferred embodiment of the present invention, the nonlinear module implements a SoftMax function to create a probability distribution vector from the similarity score vector.
Additionally, in accordance with a preferred embodiment of the present invention, the system also includes an extremum finder to find an extreme value in the probability distribution vector.
Moreover, in accordance with a preferred embodiment of the present invention, the nonlinear module is a k-nearest-neighbors module that provides the k classified items most similar to the unclassified item.
Further, in accordance with a preferred embodiment of the present invention, the linear module is a distance transformer that generates the similarity scores.
Still further, in accordance with a preferred embodiment of the present invention, the distance transformer also includes a vector adjuster and a distance calculator.
Additionally, in accordance with a preferred embodiment of the present invention, the distance transformer initially stores the columns of an adjustment matrix in first computation columns of the memory array and distributes the hidden layer vector to each computation column, and the vector adjuster computes the output feature vector in the first computation columns.
Moreover, in accordance with a preferred embodiment of the present invention, the distance transformer initially stores the columns of an output embedding matrix in second computation columns of the associative memory array and distributes the output feature vector to all the second computation columns, and the distance calculator computes the distance vectors in the second computation columns.
There is provided, in accordance with a preferred embodiment of the present invention, a method for comparing an unclassified item, described by an unclassified vector of features, with a plurality of classified items, each classified item described by a classified vector of features. The method includes concurrently computing a distance vector between the unclassified vector and each classified vector; and concurrently computing a distance scalar for each distance vector, each distance scalar providing a similarity score between the unclassified item and one of the plurality of classified items, thereby creating a similarity score vector that includes the plurality of distance scalars.
Moreover, in accordance with a preferred embodiment of the present invention, the method also includes activating a nonlinear function on the similarity score vector to create a probability distribution vector.
Further, in accordance with a preferred embodiment of the present invention, the nonlinear function is a SoftMax function.
Still further, in accordance with a preferred embodiment of the present invention, the method also includes finding an extreme value in the probability distribution vector to find the classified item most similar to the unclassified item.
Additionally, in accordance with a preferred embodiment of the present invention, the method also includes activating a K-nearest-neighbors (KNN) function on the similarity score vector, thereby providing the k classified items most similar to the unclassified item.
Detailed description of the invention
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings, in which:
Fig. 1 is a schematic illustration of a prior-art RNN in folded and unfolded representations;
Fig. 2 is an illustration of a neural network output processor, constructed and operative in accordance with the present invention;
Fig. 3 is a schematic illustration of an RNN computation system, constructed and operative in accordance with an embodiment of the present invention;
Fig. 4 is a schematic illustration of an input composer, forming part of the RNN computation system of Fig. 3, constructed and operative in accordance with an embodiment of the present invention;
Fig. 5 is a schematic illustration of a hidden layer computer, forming part of the RNN computation system of Fig. 3, constructed and operative in accordance with an embodiment of the present invention;
Fig. 6 is a schematic illustration of an output processor, forming part of the RNN processor of Fig. 3, constructed and operative in accordance with an embodiment of the present invention;
Fig. 7A is a schematic illustration of a linear module, forming part of the output processor of Fig. 6, that provides a linear transformation by means of a standard transformer;
Fig. 7B is a schematic illustration of a distance transformer, constructed and operative in accordance with an embodiment of the present invention, which may replace the linear module of the output processor of Fig. 6;
Fig. 8 is a schematic illustration of the data arrangement of matrices in the associative memory used by the distance transformer of Fig. 7B;
Fig. 9 is a schematic illustration of the data arrangement of the hidden layer vector and of the computation steps performed by the distance transformer of Fig. 7B; and
Fig. 10 is a schematic flow chart, operative in accordance with the present invention, showing the operations performed by the RNN computation system of Fig. 3.
It will be appreciated that, for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Specific embodiment
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be understood by those skilled in the art, however, that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the present invention.
Applicant has realized that associative memory devices may be utilized to efficiently implement parts of artificial neural networks such as RNNs (including LSTMs (long short-term memory) and GRUs (gated recurrent units)). The system described in U.S. Patent Publication US 2017/0277659, entitled "IN MEMORY MATRIX MULTIPLICATION AND ITS USAGE IN NEURAL NETWORKS" (assigned to the common assignee of the present invention and incorporated herein by reference), may provide linear or even constant complexity for the matrix multiplication part of the neural network computation. The system described in U.S. Patent Application 15/784,152, filed October 15, 2017 and entitled "PRECISE EXPONENT AND EXACT SOFTMAX COMPUTATION" (assigned to the common assignee of the present invention and incorporated herein by reference), may provide constant complexity for the nonlinear part of the RNN computation in the training and inference phases, and the system described in U.S. Patent Application 15/648,475, filed July 13, 2017 and entitled "FINDING K EXTREME VALUES IN CONSTANT PROCESSING TIME" (assigned to the common assignee of the present invention and incorporated herein by reference), may provide constant complexity for the K-nearest-neighbors (KNN) computation of a trained RNN.
Applicant has realized that the complexity of preparing the output of an RNN computation is proportional to the size v of the set, i.e. the complexity is O(v). For language modeling the set is the entire vocabulary, which may be very large, and the RNN computation may include large matrix-vector multiplications and complicated SoftMax operations to create a probability distribution vector that provides an indication of the class of the next item in the sequence.
Applicant has also realized that a similar probability distribution vector, indicating the class of the next item in the sequence, may be created by replacing the numerous matrix-vector multiplications with lighter distance computations, where the computational complexity is O(d) and d is much smaller than v. In language modeling, for example, d may be chosen to be 100 (or 200, 500, etc.), compared to a vocabulary size v of about 170,000. It may be appreciated that the vector-matrix computations may be implemented by the system of U.S. Patent Publication US 2017/0277659.
Reference is now made to Fig. 2, which is a schematic illustration of a neural network output processing system 200, constructed and operative in accordance with the present invention, which includes a neural network 210, an output processor 220 and an associative memory array 230.
Associative memory array 230 may store the information needed to perform the computations of the RNN and may be a multi-purpose associative memory device, such as the ones described in U.S. Patent No. 8,238,173, entitled "USING STORAGE CELLS TO PERFORM COMPUTATION"; U.S. Patent Application No. 14/588,419, filed January 1, 2015 and entitled "NON-VOLATILE IN-MEMORY COMPUTING DEVICE"; U.S. Patent Application No. 14/555,638, filed November 27, 2014 and entitled "IN-MEMORY COMPUTATIONAL DEVICE"; U.S. Patent No. 9,558,812, entitled "SRAM MULTI-CELL OPERATIONS"; and U.S. Patent Application 15/650,935, filed July 16, 2017 and entitled "IN-MEMORY COMPUTATIONAL DEVICE WITH BIT LINE PROCESSORS", all assigned to the common assignee of the present invention and incorporated herein by reference.
Neural network 210 may be any neural network package that receives an input vector x and provides an output vector h. Output processor 220 may receive the vector h as input and may create an output vector y containing a probability for each item of the set. For each possible item of the set, output vector y may provide the probability that it is the class of the expected item in the sequence. In word modeling, for example, the class of the next expected item may be the next word in the sentence. Output processor 220 is described in detail with respect to Figs. 7-10.
Reference is now made to Fig. 3, which is a schematic illustration of an RNN computation system 300, constructed and operative in accordance with an embodiment of the present invention, which includes an RNN processor 310 and associative memory array 230.
RNN processor 310 may include the neural network package 210 and output processor 220. Neural network package 210 may also include an input composer 320, a hidden layer computer 330 and a cross-entropy (CE) loss optimizer 350.
In one embodiment, input composer 320 may receive a sequence of items to analyze (a sequence of words, a sequence of pictures, a sequence of symbols, etc.) and may transform each item in the sequence to a form suitable for the RNN. For example, an RNN used for language modeling may need to handle a very large vocabulary (as mentioned hereinabove, the size v of the English vocabulary is about 170,000 words). An RNN used for language modeling may receive a plurality of one-hot vectors as input, each one-hot vector representing one word in the sequence of words. It may be appreciated that the size v of a one-hot vector representing an English word may be 170,000 bits. Input composer 320 may transform the large input vector to a vector of a smaller size that may be used as the input of the RNN.
Hidden layer computer 330 may compute the values of the activations in the hidden layer using any available RNN package, and CE loss optimizer 350 may optimize the loss.
Reference is now made to Fig. 4, which is a schematic illustration of input composer 320, constructed and operative in accordance with an embodiment of the present invention. Input composer 320 may receive a sparse vector as input. The vector may be a one-hot vector s_x representing a particular item out of a set of v possible items, and input composer 320 may create a smaller vector d_x (whose size is d) representing the same item of the set. Input composer 320 may perform the transformation of vector s_x to vector d_x using a matrix L of size d x v. After the RNN has been trained, matrix L may contain, in each column k, a set of features characterizing item k of the set. Matrix L may be referred to as the input embedding matrix, or input dictionary, and is defined in equation 4:
d_x = L * s_x    (Equation 4)
Input composer 320 may initially store row L_i- of matrix L in the first row of the i-th section of associative memory array 230. Input composer 320 may concurrently distribute bit i of input vector s_x to each computation column j of the second row of section i. Input composer 320 may concurrently, in all sections i and all computation columns j, multiply the value L_ij by the distributed bit of s_x, as indicated by arrow 410. Input composer 320 may then add the multiplication results p_ij from all sections, per computation column j, as indicated by arrow 520, to provide the output vector d_x of equation 4.
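For illustration only (not part of the original patent text), here is a minimal NumPy sketch of the input composer's transformation of equation 4; the example sizes are assumptions, and the concurrent, column-parallel in-memory arrangement of Fig. 4 is simply emulated by a dense matrix-vector product.

```python
import numpy as np

def compose_input(s_x, L):
    """Equation 4: d_x = L @ s_x, reducing a size-v one-hot vector to size d.

    Because s_x is one-hot with a single 1 at position k, the product is
    equivalent to selecting column k of the input embedding matrix L.
    """
    return L @ s_x

# Example with assumed sizes (real vocabularies are far larger, ~170,000 for English).
v, d = 1_000, 64
rng = np.random.default_rng(1)
L = rng.normal(size=(d, v))          # input embedding matrix / dictionary
s_x = np.zeros(v); s_x[42] = 1.0     # one-hot vector for item k = 42
d_x = compose_input(s_x, L)          # same result as L[:, 42]
assert np.allclose(d_x, L[:, 42])
```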
Reference is now made to Fig. 5, which is a schematic illustration of hidden layer computer 330. Hidden layer computer 330 may include any available neural network package. Hidden layer computer 330 may compute the value of the activations h_t of the hidden layer at time t from the dense representation of the input vector at time t, d_x_t, and from the previous value of the activations at time t-1, h_(t-1), according to equation 5:
h_t = σ(W*h_(t-1) + U*d_x_t + b)    (Equation 5)
As described above, the size of h may be predefined and is small relative to v (comparable to the smaller dimension d of the embedding matrix L). σ is a nonlinear function, such as the sigmoid function, operated on each element of the resulting vector. W and U are predefined parameter matrices and b is a bias vector. W and U may typically be initialized to random values and updated during the training phase. The sizes of parameter matrices W (m x m) and U (m x d) and of bias vector b (m) may be defined to fit the sizes of h and d_x respectively.
Hidden layer computer 330 may use the dense vector d_x and the result h_(t-1) of the previous step of the RNN to compute the value of the hidden layer vector at time t. The result of the hidden layer is h. The initial value of h, h_0, may be random.
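For illustration only (not part of the original patent text), a minimal sketch of the hidden layer update of equation 5 follows; the sigmoid nonlinearity is one of the options named above, and any available RNN package could supply this step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(d_x_t, h_prev, W, U, b):
    """Equation 5: h_t = sigma(W @ h_(t-1) + U @ d_x_t + b).

    W is (m, m), U is (m, d) and b has size m, so h_t has size m;
    d_x_t is the dense size-d input vector produced by the input composer.
    """
    return sigmoid(W @ h_prev + U @ d_x_t + b)
```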
Reference is now made to Fig. 6, which is a schematic illustration of output processor 220, constructed and operative in accordance with an embodiment of the present invention.
Output processor 220 may use a linear module 610 to arrange the vector h (the output of hidden layer computer 330) to fit the size v of the set, followed by a nonlinear module 620 to create a probability for each item, thereby creating the output vector y_t. Linear module 610 may implement a linear function g and nonlinear module 620 may implement a nonlinear function f. The probability distribution vector y_t may be computed according to equation 6:
y_t = f(g(h_t))    (Equation 6)
The linear function g may transform the received embedding vector h of size m (created by hidden layer computer 330) into an output score vector. During the transformation of embedding vector h, the linear function g may create an extreme score value (a maximum or a minimum) at the position k of the resulting vector.
Reference is now made to Fig. 7A, which is a schematic illustration of a linear module 610A that provides the linear transformation by means of a standard transformer 710 implemented by a standard package.
Standard transformer 710 may be provided by a standard package and may use equation 7 to transform the embedding vector h_t to a vector of size v:
g(h_t) = H*h_t + b    (Equation 7)
where H is an output representation matrix (v x m). Each row of matrix H may store the embedding, learned during training, of one item (of the set), and vector b may be a bias vector of size v. Matrix H may be initialized to random values and may be updated during the training phase so as to minimize the cross-entropy loss, as known in the art.
It may be appreciated that multiplying vector h_t by row j of matrix H (which stores the embedding vector of classified item j) may provide a scalar score indicating the similarity between classified item j and the unclassified object represented by vector h_t. The higher the score, the more similar the vectors. The result g(h) is a vector of scores (of size v) in which the score at each position j indicates the similarity between the input item and the item whose embedding is stored in row j of matrix H. The position k having the highest score value in g(h) points to item k of matrix H (which stores the embedding of each item of the set) as the class of the unclassified item.
It may also be appreciated that H*h_t requires a heavy matrix-vector multiplication operation, since H has v rows, each row storing the embedding of a particular item, and v, the size of the entire set (the vocabulary), may be very large, as noted hereinabove. Computing all the inner products (between h_t and each row of H) may become very slow during training, even when modern GPUs are used.
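For illustration only (not part of the original patent text), the following NumPy sketch makes the cost of the standard transformer of equation 7 explicit: the score vector is produced by a (v x m) matrix-vector product, the O(v) operation the distance transformer is meant to avoid.

```python
import numpy as np

def standard_transformer(h_t, H, b):
    """Equation 7: g(h_t) = H @ h_t + b.

    H is (v, m): one learned embedding row per item of the set, so this
    inner-product scoring costs O(v*m) per step. The highest score marks
    the predicted class of the unclassified item.
    """
    scores = H @ h_t + b
    return scores, int(np.argmax(scores))
```

With v around 170,000 and m in the hundreds, this single product dominates the per-step cost, which motivates the distance transformer described next.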
Applicant has realized that output processor 220 may utilize memory array 230 to significantly reduce the computational complexity of linear module 610.
Reference is now made to Fig. 7B, which is a schematic illustration of a linear module 610B, constructed and operative in accordance with an embodiment of the present invention. Rather than multiplying the output embedding vector h by the large matrix H, a distance transformer 720 may compute the distance between h and each column j of an output embedding matrix O, stored column by column, as defined in equation 8:
(g(h_t))_j = distance(M*h_t + c, O_-j)    (Equation 8)
where (g(h_t))_j is the scalar computed for column j of output embedding matrix O, providing a distance score between h_t and column j of matrix O. The size of vector h_t may differ from the size of the columns; therefore, a size adjustment matrix M may be needed to adjust the size of embedding vector h_t to the size of the columns of O so that the distance can be computed. The size of M may be d x m, much smaller than the size of the matrix H used by standard transformer 710, and therefore the computation of distance transformer 720 may be faster and consume fewer resources than that of standard transformer 710. Vector c is a bias vector.
Output embedding matrix O may be initialized to random values and may be updated during the training session. Output embedding matrix O may store, in each column j, the computed embedding of item j (of the set). Output embedding matrix O may be similar to the input embedding matrix L used by input composer 320 (Fig. 4), and may even be identical to L. It may be appreciated that, when used in applications other than language modeling, matrix O may store the features of item j in each column j.
Any distance or similarity method, such as an L1 or L2 norm, Hamming distance, cosine similarity or any other similarity or distance method, may be used to compute the distance or similarity between the unclassified object defined by h_t and the database of classified objects stored in matrix O.
A norm is a distance function that may assign a strictly positive value to each vector in a vector space and may provide a numerical value expressing the similarity between vectors. The norm may be computed between h_t and each column j of matrix O (denoted O_-j). Output embedding matrix O is analogous to matrix H, but it may be trained differently and may have a different number of columns.
The result of multiplying hidden layer vector h by size adjustment matrix M may create a vector o having the same size as the columns of matrix O, so that the distance computation may subtract vector o from each column of matrix O. It may be appreciated that distance transformer 720 may add the bias vector c to the resulting vector o; for simplicity, the result is still referred to as vector o.
As already mentioned, distance transformer 720 may use an L1 or L2 norm to compute the distance. It may be appreciated that the L1 norm, known as the "least absolute deviations" norm, is defined as the sum of the absolute differences between target and estimated values, while the L2 norm, known as the "least squares error" norm, is the sum of the squares of the differences between target and estimated values. The result of each distance computation is a scalar, and the results of all the computed distances (the distances between vector o and each column of matrix O) may provide the vector g(h).
The distance computation may provide a scalar score indicating the difference or similarity between the output embedding vector o and the item stored in column j of matrix O. When the distance is computed by a norm, the lower the score, the more similar the vectors; when the distance is computed by cosine similarity, the higher the score, the more similar the vectors. The resulting vector g(h) (of size v) is a vector of scores. The position k having the extreme (lowest or highest, depending on the distance computation method) score value in vector g(h) may point to item k of matrix O (which stores the embedding of each item of the set) as the class of the unclassified item h_t.
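For illustration only (not part of the original patent text), the following NumPy sketch emulates the distance transformer of equation 8 with ordinary dense arithmetic; the metric options shown are among those named above, and in the described embodiment the per-column operations run concurrently inside the associative memory array rather than in a loop over columns.

```python
import numpy as np

def distance_transformer(h_t, M, c, O, metric="l2"):
    """Equation 8: score each column O_-j of O against o = M @ h_t + c.

    M: (d, m) size adjustment matrix, c: bias vector of size d,
    O: (d, v) output embedding matrix, one classified item per column.
    Returns the score vector g(h) of size v and the index of the best item.
    """
    o = M @ h_t + c                              # adjust h_t to the column size d
    if metric == "l1":
        g = np.abs(O - o[:, None]).sum(axis=0)   # least absolute deviations
        best = int(np.argmin(g))                 # lower score = more similar
    elif metric == "l2":
        g = ((O - o[:, None]) ** 2).sum(axis=0)  # least squares error
        best = int(np.argmin(g))
    else:  # cosine similarity
        g = (O.T @ o) / (np.linalg.norm(O, axis=0) * np.linalg.norm(o) + 1e-12)
        best = int(np.argmax(g))                 # higher score = more similar
    return g, best
```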
Reference is now made to Fig. 8, which is a schematic illustration of the data arrangement of matrices M and O in memory array 230. Distance transformer 720 may utilize memory array 230 such that one part, 230-M, stores matrix M and another part, 230-O, stores matrix O. Distance transformer 720 may store each row i of matrix M in the first row of section i of memory array part 230-M (each bit i of column j of matrix M may be stored in the same computation column j of a different section i), as indicated by arrows 911, 912 and 913.
Similarly, distance transformer 720 may store each row i of matrix O in the first row of section i of memory array part 230-O, as indicated by arrows 921, 922 and 923.
Reference is now made to Fig. 9, which is a schematic illustration of the data arrangement of vector h and of the computation steps performed by distance transformer 720. Distance transformer 720 may also include a vector adjuster 970 and a distance calculator 980. Vector adjuster 970 may distribute each bit i of embedding vector h_t to all the computation columns of the second row of section i of memory array part 230-M, so that bit i of vector h_t is stored repeatedly along the entire second row of section i, the same section i that stores row i of matrix M. Bit h1 may be distributed to the second row of section 1, as indicated by arrows 911 and 912, and bit hm may be distributed to the second row of section m, as indicated by arrows 921 and 922.
Vector adjuster 970 may concurrently, on all the computation columns in all the sections, multiply M_ij by h_i and may store the results p_ij in the third row, as indicated by arrow 950. Vector adjuster 970 may concurrently add the values of p_i over the computation columns to generate the values o_i of vector o, as indicated by arrow 960.
Once vector o has been computed for embedding vector h_t, distance transformer 720 may add the bias vector c (not shown in the figure) to the resulting vector o.
Distance transformer 720 may distribute vector o to memory array part 230-O, so that each value o_i is distributed to the entire second row of section i. Bit o1 may be distributed to the second row of section 1, as indicated by arrows 931 and 932, and bit od may be distributed to the second row of section d, as indicated by arrows 933 and 934.
Distance calculator 980 may concurrently, on all the computation columns in all the sections, subtract o_i from O_ij to create the distance vectors. Distance calculator 980 may then complete the computation of g(h) by computing the L1 or L2 norm, or any other distance, of each resulting vector, and may provide the result g(h) as output, as indicated by arrows 941 and 942.
It may be appreciated that, in another embodiment, distance transformer 720 may write each addition result o_i of vector o directly to its final location in memory array part 230-O.
During the inference phase, system 300 (Fig. 3) may use the system of U.S. Patent Application 14/594,434, filed January 12, 2015, entitled "MEMORY DEVICE" and published as US 2015/0200009 (incorporated herein by reference), to find the extreme (minimum or maximum) value in vector g(h) in order to determine the class of the expected next item.
Nonlinear module 620 (Fig. 6) may implement a nonlinear function f that transforms the arbitrary values, created by linear function g and stored in g(h), into probabilities. The function f may be, for example, a SoftMax operation, in which case nonlinear module 620 may use the exact SoftMax system of U.S. Patent Application 15/784,152, filed October 15, 2017 and entitled "PRECISE EXPONENT AND EXACT SOFTMAX COMPUTATION" (incorporated herein by reference).
Additionally or alternatively, RNN computation system 300 may use U.S. Patent Application 15/648,475, filed July 7, 2017 and entitled "FINDING K EXTREME VALUES IN CONSTANT PROCESSING TIME", to find the k nearest neighbors during inference when several results, rather than a single result, are needed. An example of such a use by RNN computation system 300 is beam search, where nonlinear module 620 may be replaced by a KNN module to find the k items with extreme values, each representing a potential class for the unclassified item.
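For illustration only (not part of the original patent text), a small NumPy sketch of the KNN alternative follows; it simply selects the k best-scoring items from the similarity score vector g(h), whereas the embodiment cited above performs this selection in constant time in associative memory.

```python
import numpy as np

def k_nearest_items(g, k, lower_is_better=True):
    """Return the indices of the k classified items most similar to the
    unclassified item, given the score vector g(h) of size v.

    lower_is_better=True suits norm-based distances; set it to False for
    cosine-style similarity scores.
    """
    scores = g if lower_is_better else -g
    top = np.argpartition(scores, k)[:k]     # k best items, unordered
    return top[np.argsort(scores[top])]      # ordered best-first
```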
CE loss optimizer 350 (Fig. 3) may compute the cross-entropy loss, using any standard package, during the learning phase, and may optimize it using equation 9:
Loss = -Σ_k (y_expected)_k * log((y_t)_k)    (Equation 9)
where y_expected is the one-hot vector of the expected output and y_t is the probability vector storing, in each position k, the probability that the item at position k is the class of the unclassified expected item.
Reference is now made to Fig. 10, which is a schematic flow 1000, operative in accordance with the present invention, executed by RNN computation system 300 (Fig. 3) and including steps performed in neural network 210 and output processor 220 of system 200. In step 1010, RNN computation system 300 may transform the sparse vector s_x into a dense vector d_x by multiplying it by input embedding matrix L. In step 1020, RNN computation system 300 may run hidden layer computer 330 on dense vector d_x, using parameter matrices U and W, to compute the hidden layer vector h.
In step 1030, RNN computation system 300 may transform hidden layer vector h into the output embedding vector o using size adjustment matrix M. In step 1032, computation system 300 may replace part of the RNN computation with KNN; this is particularly useful during the inference phase. In step 1040, RNN computation system 300 may compute the distance between embedding vector o and each item of output embedding matrix O, and may use step 1042 to find the minimum distance. In step 1050, RNN computation system 300 may use a nonlinear function such as SoftMax to compute and provide the probability vector y, as indicated in step 1052, and in step 1060, computation system 300 may optimize the loss during the training session. Those skilled in the art will appreciate that the steps shown are not intended to be limiting and that the flow may be practiced with more or fewer steps, with the steps in a different order, or with any combination thereof.
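For illustration only (not part of the original patent text), the following NumPy sketch strings the steps of flow 1000 together for a single inference step, reusing the helper functions sketched above (softmax, compose_input, hidden_layer, distance_transformer, k_nearest_items); the parameter names, the sign flip before SoftMax for norm-based distances, and the sequential emulation of the concurrent in-memory steps 1030-1050 are all assumptions of the sketch.

```python
def rnn_inference_step(s_x, h_prev, params, k=None):
    """One pass of flow 1000 (steps 1010-1052) for a single input item.
    params is a dict holding the matrices and vectors L, U, W, b, M, c, O."""
    d_x = compose_input(s_x, params["L"])                        # step 1010: sparse -> dense
    h_t = hidden_layer(d_x, h_prev,
                       params["W"], params["U"], params["b"])    # step 1020: equation 5
    g, _ = distance_transformer(h_t, params["M"],
                                params["c"], params["O"])        # steps 1030-1042 (second
                                                                 # value is the min-distance item)
    if k is not None:                                            # step 1032: KNN variant
        return h_t, k_nearest_items(g, k)
    y_t = softmax(-g)   # step 1050: negate norm distances so nearer items get higher probability
    return h_t, y_t
```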
It may be appreciated that the total complexity of an RNN using distance transformer 720 is lower than the complexity of an RNN using standard transformer 710. The complexity of computing the linear part is O(d), whereas the complexity of the standard RNN computation is O(v), and v may be very large. Since d is much smaller than v, the O(d) complexity provides a large saving.
It may also be appreciated that the total complexity of an RNN using RNN computation system 300 may be lower than that of the prior art, since the complexities of SoftMax, KNN and finding the minimum value are constant (O(1)).
While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

Claims (21)

1. A method for a neural network, the method comprising:
concurrently computing a distance vector between an output feature vector of the neural network and each of a plurality of qualified feature vectors, wherein the output feature vector describes an unclassified item and each of the plurality of qualified feature vectors describes one classified item in a set of classified items;
concurrently computing a similarity score for each distance vector; and
creating a similarity score vector of the plurality of computed similarity scores.
2. The method of claim 1, further comprising reducing the size of an input vector of the neural network by concurrently multiplying the input vector by a plurality of columns of an input embedding matrix.
3. The method of claim 1, further comprising concurrently activating a nonlinear function on all elements of the similarity score vector, thereby providing a probability distribution vector.
4. The method of claim 3, wherein the nonlinear function is a SoftMax function.
5. The method of claim 3, further comprising finding an extreme value in the probability distribution vector to find the classified item most similar to the unclassified item, with a computational complexity of O(1).
6. The method of claim 1, further comprising activating a K-nearest-neighbors (KNN) function on the similarity score vector, thereby providing the k classified items most similar to the unclassified item.
7. A system for a neural network, the system comprising:
an associative memory array comprising rows and columns;
an input composer to store information about an unclassified item in the associative memory array, to manipulate the information and to create an input to the neural network;
a hidden layer computer to receive the input and to run the input through the neural network to compute a hidden layer vector; and
an output processor to transform the hidden layer vector into an output feature vector, to concurrently compute, in the associative memory array, a distance vector between the output feature vector and each of a plurality of qualified feature vectors, each qualified feature vector describing one classified item, and to concurrently compute, in the associative memory array, a similarity score for each distance vector.
8. The system of claim 7, wherein the input composer also reduces the size of the information.
9. The system of claim 7, wherein the output processor further comprises a linear module and a nonlinear module.
10. The system of claim 8, wherein the nonlinear module implements a SoftMax function to create a probability distribution vector from the similarity score vector.
11. The system of claim 10, further comprising an extremum finder to find an extreme value in the probability distribution vector.
12. The system of claim 8, wherein the nonlinear module is a k-nearest-neighbors module that provides the k classified items most similar to the unclassified item.
13. The system of claim 8, wherein the linear module is a distance transformer that generates the similarity scores.
14. The system of claim 13, wherein the distance transformer comprises a vector adjuster and a distance calculator.
15. The system of claim 14, wherein the distance transformer initially stores the columns of an adjustment matrix in first computation columns of the memory array and distributes the hidden layer vector to each computation column, and the vector adjuster computes the output feature vector in the first computation columns.
16. The system of claim 15, wherein the distance transformer initially stores the columns of an output embedding matrix in second computation columns of the associative memory array and distributes the output feature vector to all of the second computation columns, and the distance calculator computes the distance vectors in the second computation columns.
17. A method for comparing an unclassified item, described by an unclassified vector of features, with a plurality of classified items, each classified item described by a classified vector of features, the method comprising:
concurrently computing a distance vector between the unclassified vector and each classified vector; and
concurrently computing a distance scalar for each distance vector, each distance scalar providing a similarity score between the unclassified item and one classified item of the plurality of classified items, thereby creating a similarity score vector comprising the plurality of distance scalars.
18. The method of claim 17, further comprising activating a nonlinear function on the similarity score vector to create a probability distribution vector.
19. The method of claim 18, wherein the nonlinear function is a SoftMax function.
20. The method of claim 18, further comprising finding an extreme value in the probability distribution vector to find the classified item most similar to the unclassified item.
21. The method of claim 18, further comprising activating a K-nearest-neighbors (KNN) function on the similarity score vector, thereby providing the k classified items most similar to the unclassified item.
CN201910136561.7A 2018-02-26 2019-02-25 Deep learning based on distance Pending CN110197252A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/904,486 US20190266482A1 (en) 2018-02-26 2018-02-26 Distance based deep learning
US15/904,486 2018-02-26

Publications (1)

Publication Number Publication Date
CN110197252A true CN110197252A (en) 2019-09-03

Family

ID=67683942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910136561.7A Pending CN110197252A (en) 2018-02-26 2019-02-25 Deep learning based on distance

Country Status (3)

Country Link
US (1) US20190266482A1 (en)
KR (1) KR20190103011A (en)
CN (1) CN110197252A (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10721190B2 (en) * 2018-07-31 2020-07-21 Microsoft Technology Licensing, Llc Sequence to sequence to classification model for generating recommended messages
US10956474B2 (en) 2019-03-14 2021-03-23 Microsoft Technology Licensing, Llc Determination of best set of suggested responses
JP7420210B2 (en) * 2020-02-17 2024-01-23 日本電気株式会社 Communication system, transmitting device, receiving device, matrix generating device, communication method, transmitting method, receiving method, matrix generating method, and recording medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6078513A (en) * 1999-06-09 2000-06-20 Neomagic Corp. NMOS dynamic content-addressable-memory CAM cell with self-booting pass transistors and local row and column select
US8406456B2 (en) * 2008-11-20 2013-03-26 Workshare Technology, Inc. Methods and systems for image fingerprinting
US10268646B2 (en) * 2017-06-06 2019-04-23 Facebook, Inc. Tensor-based deep relevance model for search on online social networks

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
US20160163310A1 (en) * 2014-12-08 2016-06-09 Samsung Electronics Co., Ltd. Method and apparatus for training language model and recognizing speech
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
EP3153997A2 (en) * 2015-10-08 2017-04-12 VIA Alliance Semiconductor Co., Ltd. Neural network unit with output buffer feedback and masking capability
US20170262737A1 (en) * 2016-03-11 2017-09-14 Magic Leap, Inc. Structure learning in convolutional neural networks
US20170277659A1 (en) * 2016-03-23 2017-09-28 Gsi Technology Inc. In memory matrix multiplication and its usage in neural networks
US20170323636A1 (en) * 2016-05-05 2017-11-09 Conduent Business Services, Llc Semantic parsing using deep neural networks for predicting canonical forms
US20180018566A1 (en) * 2016-07-17 2018-01-18 Gsi Technology Inc. Finding k extreme values in constant processing time
CN108351974A (en) * 2016-07-17 2018-07-31 Gsi 科技公司 K extreme value is searched within constant processing time
US20180341642A1 (en) * 2016-07-17 2018-11-29 Gsi Technology Inc. Natural language processing with knn
US20180341862A1 (en) * 2016-07-17 2018-11-29 Gsi Technology Inc. Integrating a memory layer in a neural network for one-shot learning
US20180046901A1 (en) * 2016-08-12 2018-02-15 Beijing Deephi Intelligence Technology Co., Ltd. Hardware accelerator for compressed gru on fpga
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
US20180217836A1 (en) * 2017-01-31 2018-08-02 Facebook, Inc. k-Selection Using Parallel Processing
CN110476151A (en) * 2017-01-31 2019-11-19 脸谱公司 It is selected using the K of parallel processing
CN107316643A (en) * 2017-07-04 2017-11-03 科大讯飞股份有限公司 Voice interactive method and device
CN110019815A (en) * 2017-07-16 2019-07-16 Gsi 科技公司 Utilize the natural language processing of KNN
CN107529650A (en) * 2017-08-16 2018-01-02 广州视源电子科技股份有限公司 Network model construction and closed loop detection method, corresponding device and computer equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SAIZHENG ZHANG et al.: Architectural Complexity Measures of Recurrent Neural Networks, arXiv:1602.08210v2, pages 1-19 *
SHENGYONG DING et al.: Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition, vol. 48, pages 2993-3003, XP029177353, DOI: 10.1016/j.patcog.2015.04.005 *
DAI, Yu: Research on the application of memristors and their crossbar arrays in data access and image recognition, China Doctoral Dissertations Full-text Database (Information Science and Technology), no. 2016, pages 138-48 *
ZHAO, Xiaoqun et al.: A survey of acoustic model construction for speech keyword recognition systems, Journal of Yanshan University, vol. 41, no. 6, pages 471-481 *

Also Published As

Publication number Publication date
KR20190103011A (en) 2019-09-04
US20190266482A1 (en) 2019-08-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination