ES2558952B2

ES2558952B2 - Scalable system and hardware acceleration method to store and retrieve information

Info

Publication number: ES2558952B2
Application number: ES201500841A
Authority: ES
Inventors: José Ángel GREGORIO MONASTERIO; Valentín PUENTE VARONA
Original assignee: Universidad de Cantabria
Current assignee: Universidad de Cantabria
Priority date: 2015-11-20
Filing date: 2015-11-20
Publication date: 2016-06-30
Anticipated expiration: 2035-11-20
Also published as: WO2017085337A1; ES2558952A1

Abstract

The present invention relates to a hardware acceleration method and system for storing and retrieving information, which implements a cortical learning algorithm over a packet-switched network. The system comprises: an encoder module that provides an SDR input and sends multicast packets to certain columnar modules connected to one another through the packet-switched network, where each columnar module in turn comprises a router, a plurality of memory modules configured to store the inputs received from the router and to store context information, and a calculation module that computes the overlap of the inputs, selects the memory modules with the highest overlap, determines a temporal context for the selected memory modules and sends a prediction of the system output to a classifier module, which selects a system output from a group of preset outputs according to that prediction.

Description

SCALABLE HARDWARE ACCELERATION SYSTEM AND METHOD FOR STORING AND RETRIEVING INFORMATION

DESCRIPTION

TECHNICAL FIELD OF THE INVENTION

The present invention relates to the technical field of artificial intelligence and, more specifically, to artificial neural networks implemented in hardware to store and retrieve information.


BACKGROUND

The application of algorithms based on neural networks consists of the automatic processing of information, inspired by the way the nervous system of animals, its neurons and their connections work. Neurons can be seen to be grouped in columns, which are connected through low-range axons to other nearby columns (through layer I of the cortex) or to other distant columns and to the sensor-motor interface, that is, the thalamus (through layer VI). Figure 1 represents these structures: micro-cortical columns (1) and cortical hyper-columns (2). Regardless of the function of each area, the cortex is morphologically very regular.

Empirical evidence suggests that the neuronal system represents information using a sparse distributed representation (SDR) to store and retrieve information. In this representation, in contrast to the conventional binary data representation (also known as a localist representation), each bit has semantic meaning and the data representation is highly resilient to a noisy, failure-prone environment (such as the biological one); that is, an unwanted change of a small number of bits in the original representation always produces a value similar to the original.
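The noise tolerance described above can be illustrated with a minimal sketch. It assumes a 2048-bit SDR with 40 active bits and measures similarity as the number of shared active bits; these sizes and the helper names (`random_sdr`, `flip_bits`, `overlap`) are illustrative assumptions, not part of the source.

```python
import random

def random_sdr(size, active, rng):
    """Return an SDR as the set of indices of its active bits."""
    return set(rng.sample(range(size), active))

def flip_bits(sdr, size, n_flips, rng):
    """Return a copy of `sdr` with `n_flips` distinct bit positions toggled."""
    noisy = set(sdr)
    for pos in rng.sample(range(size), n_flips):
        noisy ^= {pos}  # toggle: active -> inactive, inactive -> active
    return noisy

def overlap(a, b):
    """Number of active bits shared by two SDRs."""
    return len(a & b)

rng = random.Random(0)
SIZE, ACTIVE = 2048, 40            # illustrative sizes (assumption)
original = random_sdr(SIZE, ACTIVE, rng)
noisy = flip_bits(original, SIZE, 4, rng)

# Toggling 4 positions can remove at most 4 active bits,
# so the noisy SDR still shares at least 36 of the 40 active bits.
print(overlap(original, noisy))
```

Because the similarity is a count of shared bits, a handful of corrupted positions changes the score only slightly, which is the property the paragraph relies on.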


The cortex can be understood as an auto-associative memory, hierarchically structured as a hierarchical temporal memory (HTM). This proposition, based purely on observations from neuroscience, comes with a precise algorithm, called the cortical learning algorithm (CLA), which provides the rules for storing and retrieving information, that is, for learning and making predictions. This concept has been used in practical problems such as anomaly detection, sequence prediction and pattern identification, imitating the behavior of the upper layers of the cortical column.


The CLA algorithm focuses on partially replicating the functionality of cortical micro-columns, where layer I is mainly used to interconnect the different columns within the same hyper-column; layer II/III, generally called the inference layer, is believed to be dedicated to predicting the state of the column in the next steps of the input; and layer IV, called the sensory layer, handles the input signals to the column. The operating principles of layers V and VI are not yet well understood and CLA currently does not model them. The key point of this organization is that the same hyper-column can be reused by different hyper-columns at the next level and, across the whole hierarchy, the level of information that a column can identify grows progressively (condensing the semantics of the lower levels).

The CLA algorithm defines the term column (20), represented in Figure 2, which is sufficient to handle prediction without the hierarchical structure. At the bottom, a proximal dendritic segment (21) may be connected to a subset of the bits of the SDR input. This restriction models the fact that the activity of an input axon will be observed by only a subset of the columns. These segments model the dendritic growth of the feed-forward connections of the system, which is well known to be responsible for learning in the cortex. In contrast to other artificial neural networks, each synapse of the segment is characterized by a binary value, that is, it is either connected or not. For a given encoded input, the number of active synapses is determined in each proximal dendritic segment, that is, the number of active inputs connected to the segment through a connected synapse (this is called the overlap of the input). Once this is known, as in biological systems, an inhibition process begins and only about 2% of the columns, those with the most active synapses, are selected. The remaining columns are inhibited. In the winning columns, the synapses that were activated by the input are strengthened and the synapses connected to inactive inputs are weakened. To support learning, each synaptic connection is tracked with a permanence value. If the value is above a predefined threshold, the synapse is considered connected. At start-up, the values are chosen at random, close to the threshold value. Typically, three or four bits may be sufficient. In software implementations, by default, the threshold can be 0.2, the maximum 1.0 and the learning increment 0.1 (the decay, or forgetting, is usually one order of magnitude smaller, but the need for higher resolution can be avoided by using a randomized subtraction). In this way layer IV of the cortical micro-columns is emulated; in CLA/HTM terminology this is called spatial pooling. The intuition behind this pooling is to "filter" the most salient features of the input in order to store the sequence afterwards.
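The spatial pooling rules above (overlap, inhibition to roughly 2% of the columns, and permanence updates) can be sketched in software as follows. This is a minimal illustration under stated assumptions: the threshold 0.2, maximum 1.0 and increment 0.1 come from the text, the decrement 0.01 follows the "one order of magnitude smaller" remark, and the function names, the dictionary layout and the initial spread of ±0.05 around the threshold are hypothetical choices.

```python
import random

CONNECTED_THRESHOLD = 0.2   # permanence above this => synapse is connected
PERM_MAX = 1.0
PERM_INC = 0.1              # learning increment
PERM_DEC = 0.01             # forgetting, one order of magnitude smaller
SPARSITY = 0.02             # about 2% of the columns win the inhibition

def make_columns(n_columns, n_inputs, synapses_per_column, rng):
    """Each column maps a random subset of input bits to a permanence near the threshold."""
    columns = []
    for _ in range(n_columns):
        inputs = rng.sample(range(n_inputs), synapses_per_column)
        columns.append({i: CONNECTED_THRESHOLD + rng.uniform(-0.05, 0.05)
                        for i in inputs})
    return columns

def overlap(column, active_inputs):
    """Count active inputs that reach the column through a connected synapse."""
    return sum(1 for i, p in column.items()
               if p > CONNECTED_THRESHOLD and i in active_inputs)

def spatial_pool(columns, active_inputs, learn=True):
    """Return the indices of the winning columns and optionally update permanences."""
    scores = [overlap(c, active_inputs) for c in columns]
    n_winners = max(1, round(SPARSITY * len(columns)))
    ranked = sorted(range(len(columns)), key=lambda c: scores[c], reverse=True)
    winners = ranked[:n_winners]       # all other columns are inhibited
    if learn:
        for c in winners:
            for i, p in list(columns[c].items()):
                if i in active_inputs:
                    columns[c][i] = min(PERM_MAX, p + PERM_INC)   # strengthen
                else:
                    columns[c][i] = max(0.0, p - PERM_DEC)        # weaken
    return winners

rng = random.Random(1)
columns = make_columns(200, 256, 32, rng)
active = set(rng.sample(range(256), 20))
winners = spatial_pool(columns, active)
```

With 200 columns the sketch keeps 4 winners per input; a hardware design would replace the global sort with a distributed selection, which this example does not attempt to model.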

On the other hand, when a column is activated, that is, when it wins the inhibition process, the (temporal) cells have to process that information. Each column will have a few tens of cells (22). An SDR-compatible number of columns will represent an encoded input. Therefore, after inhibition, the winning columns represent the most salient features of the input. In a context-free setting, a single cell would be enough to make the prediction. However, to obtain a context-dependent prediction, both a current value (26) and a temporal sequence context are needed. To this end, each cell of a column represents the value of the input within a temporal sequence (that is, the memory must be able to predict successive sequences). Even with a low number of cells per column, the number of "contexts" that the system can store for the same value is enormous. For example, a system with 2048 columns and 32 cells per column will be able to capture 4032 different temporal contexts for the same input.

Each cell can predict the state of the column at the next input in the sequence. To do so, it uses dendritic segments to model the relationships between columns. Each distal dendritic segment (23) stores potential synapses with other columns of the cortex. The rules for handling these synapses are similar to those of the proximal segment. If any of the segments of a cell reaches a given threshold, the cell enters the predictive state (24), which means that the column will be activated (25) in the next period or epoch. When a column was not correctly predicted, all the cells of the column try to connect with the previously seen sequence. First, new distal segments are built on the fly according to the previous remote activations and, second, cells are sought that should predict the activation in the next period, which mimics layers II/III of biological columns. The intuition is to use the synapses between the different columns of the system to obtain a meandering path between cells that represent the different temporal contexts. The CLA/HTM terminology for this task is temporal pooling.
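A heavily simplified software sketch of the temporal pooling behavior described above follows. It is an illustration only: the class name, the 4 cells per column, the segment threshold of 2, and the rule of growing one new segment on the cell with the fewest segments are assumptions, and permanence values on distal synapses are omitted for brevity.

```python
CELLS_PER_COLUMN = 4      # illustrative; the text mentions a few tens of cells
SEGMENT_THRESHOLD = 2     # active synapses needed for a segment to fire (assumption)

class TemporalMemorySketch:
    def __init__(self):
        self.segments = {}       # (column, cell) -> list of sets of presynaptic cells
        self.active = set()      # currently active (column, cell) pairs
        self.predictive = set()  # cells predicting activation at the next step

    def step(self, active_columns):
        """Process one input and return the columns predicted for the next one."""
        previous = self.active
        now_active = set()
        for col in sorted(active_columns):
            cells = [(col, k) for k in range(CELLS_PER_COLUMN)]
            predicted = [c for c in cells if c in self.predictive]
            if predicted:
                # Correctly predicted column: only the predicting cells fire.
                now_active.update(predicted)
            else:
                # Unpredicted column: every cell fires, and one cell grows a new
                # distal segment pointing at the previously active cells.
                now_active.update(cells)
                if previous:
                    learner = min(cells, key=lambda c: len(self.segments.get(c, [])))
                    self.segments.setdefault(learner, []).append(set(previous))
        self.active = now_active
        self.predictive = {
            cell for cell, segs in self.segments.items()
            if any(len(seg & now_active) >= SEGMENT_THRESHOLD for seg in segs)
        }
        return {col for col, _ in self.predictive}

tm = TemporalMemorySketch()
A, B = {0, 1}, {2, 3}
history = [tm.step(x) for x in (A, B, A, B)]
```

After the alternating sequence A, B has been seen once, presenting A makes the sketch predict the columns of B, and the correctly predicted B then activates a single cell per column instead of the whole column.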

Today, the advances made in HTM are implemented in software, which in practice limits the systems to a few thousand columns. Instead of precise weighted connections, HTM uses a complex dynamic topology to store and retrieve information, which is not feasible from a simplistic hardware perspective (a single column can potentially be related to tens of thousands of different columns). Existing solutions are very memory-hungry and require millions of clock cycles to produce each prediction. Problems such as pattern recognition based on the saccadic mechanism require much larger and faster systems. Although efforts are being made on FPGA-based approaches and on emerging technologies such as 3D stacking and non-volatile memories, which could alleviate these strict requirements to some extent, the state of the art would welcome as a valuable contribution any solution that offers a feasible hardware implementation to overcome this problem and reduces cost and execution time.

DESCRIPTION OF THE INVENTION

The present invention solves the problems mentioned above by presenting the architecture of a hardware implementation that employs architectural techniques and methodologies such as those used in general-purpose chips or multiprocessors. Specifically, the limitations of the software implementations of the cortical learning algorithm (CLA) known in the state of the art are overcome by the present invention, which relates in a first aspect to a hardware acceleration system for storing and retrieving information, which implements said cortical learning algorithm over a packet-switched network. The system comprises:

- at least one encoder module configured to encode a binary input into a sparse distributed representation (SDR) and to send, for each active bit of the SDR, a multicast packet to a given columnar module through the packet-switched network, according to a previously established mapping table;

- a plurality of columnar modules connected through said packet-switched network and configured to receive the multicast packets sent from the encoder, where each columnar module in turn comprises:

  o a router with multicast support configured to receive packets from the encoder module, deliver said packets to certain memory modules of the columnar module and send packets from the memory modules to an output classifier;

  o a plurality of memory modules configured to store the inputs received from the router and to store context information;

  o a calculation module configured to determine a degree of overlap between the content of said memory modules and the current input, select a given number of memory modules with the highest degree of overlap, determine a temporal context for each of the selected memory modules, make a prediction of the system output based on the current input and the temporal context information, and send an output packet containing said prediction to an output classifier module;

- an output classifier module configured to receive an output packet, sent through the switching network from any of the columnar modules, and to select a system output from a group of preset outputs according to the received output packet.
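The dataflow from the encoder to the columnar modules listed above can be sketched as a small software simulation. Everything in it is a hypothetical stand-in for the hardware: the toy encoder, the packet layout (a dictionary with `bit` and `memories` fields), the mapping table format (bit to module to memory-module indices) and the class and function names are assumptions made for illustration, and the temporal context and classifier stages are left out.

```python
def encode(value, sdr_size=16, active_bits=4):
    """Toy encoder (assumption): a contiguous run of active bits placed by `value`."""
    start = value % (sdr_size - active_bits + 1)
    return set(range(start, start + active_bits))

class ColumnarModule:
    def __init__(self, module_id, n_memory_modules):
        self.module_id = module_id
        # Overlap counter per memory module for the current input.
        self.hits = [0] * n_memory_modules

    def receive(self, packet):
        """Router role: deliver the packet to the memory modules it addresses."""
        for mem in packet["memories"]:
            self.hits[mem] += 1

    def best_memories(self, k):
        """Calculation-module role: memory modules with the highest overlap."""
        ranked = sorted(range(len(self.hits)), key=lambda m: self.hits[m], reverse=True)
        return ranked[:k]

def send_multicast(sdr, table, modules):
    """Encoder role: one multicast packet per active bit, routed by the mapping table."""
    for bit in sorted(sdr):
        for module_id, memories in table.get(bit, {}).items():
            modules[module_id].receive({"bit": bit, "memories": memories})

# Hypothetical mapping table: SDR bit -> {columnar module id: [memory module indices]}
table = {0: {0: [0, 1]}, 1: {0: [0], 1: [2]}, 2: {1: [2]}, 3: {0: [0]}}
modules = [ColumnarModule(0, 4), ColumnarModule(1, 4)]
send_multicast(encode(0), table, modules)
```

In this run the input activates bits 0 to 3, so memory module 0 of the first columnar module accumulates an overlap of 3 and memory module 2 of the second accumulates an overlap of 2, and these are the ones each calculation module would select.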

In one of its particular embodiments, the system of the present invention provides that the calculation module comprises a comparator, an adder and a counter.

In one of its embodiments, the present invention provides that each memory module of the plurality of memory modules comprises a plurality of temporal cells, each of which adopts an active or a non-active state, and whose combination represents a given temporal context for the memory module. This advantageously allows different temporal contexts to be represented, which makes it possible to predict the following inputs; in addition, the sequences that the system can store contribute directly to learning for future inputs.

Additionally, the present invention provides that the calculation module is configured to check whether its output prediction is correct; in the case of a wrong prediction, a burst occurs that puts all the temporal cells of the memory module in the active state. This advantageously refines the learning of the system.


Optionally, in one of its embodiments, the present invention provides that the calculation module is further configured to overlap stages and, given an input sequence, to produce a prediction within three intervals of said sequence. This advantageously exploits the capabilities of the network, and the CLA algorithm can be pipelined so that the results of each stage are injected into the network without waiting for all the calculation phases to finish.

Additionally, one of the embodiments of the present invention provides that the calculation module may be further configured to aggregate traffic from different stages in the same packet. This is a further optimization that the present invention can incorporate to strengthen the advantages of the pipelining of the algorithm discussed above.

According to one of the embodiments of the invention, the columnar modules located at the ends of the network are configured to inject a broom packet into the packet-switched network. The broom packet is replicated to the remaining columnar modules only when the corresponding router has no more queued packets, until it reaches the opposite end of the network, which indicates that the network has been drained. This advantageously serves as a mechanism to guarantee the correct execution of the overlap calculation and temporal context determination stages.
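The broom-packet mechanism can be illustrated with a small simulation. The sketch assumes a one-dimensional chain of routers in which data packets move one hop per cycle towards the far end and no new traffic is injected while the sweep is in progress; the function name and the cycle model are hypothetical simplifications of the real network.

```python
from collections import deque

def drain_with_broom(queues):
    """Simulate a broom packet sweeping a linear chain of routers.

    `queues[i]` lists the data packets still queued at router i. Each cycle,
    every router forwards at most one queued packet to its right-hand
    neighbour (the last router delivers it out of the network). The broom
    starts at router 0 and advances only past a router whose queue is empty.
    Returns (cycles, final_queues) once the broom has left the far end.
    """
    queues = [deque(q) for q in queues]
    broom_at = 0
    cycles = 0
    while broom_at < len(queues):
        cycles += 1
        # Move data packets, right to left, so each packet advances one hop per cycle.
        for i in range(len(queues) - 1, -1, -1):
            if queues[i]:
                packet = queues[i].popleft()
                if i + 1 < len(queues):
                    queues[i + 1].append(packet)
        # The broom is forwarded only when the local router has nothing queued.
        if not queues[broom_at]:
            broom_at += 1
    return cycles, queues

cycles, final_queues = drain_with_broom([["a", "b"], [], ["c"]])
```

Because the broom never overtakes a queued packet, every router behind it is empty, so its arrival at the opposite end implies that the whole chain has been drained.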

One of the particular embodiments of the invention provides that the number of memory modules in each columnar module may be determined by a trade-off between the propagation delay and the clock cycle of the system.

The encoder module of the present invention, or one of the encoder modules, can be configured to send the input packets to a randomly preset selection of columnar modules that represents around 20% of the total number of columnar modules. This advantageously provides another optimization of the present invention in which, depending on the inputs or on the specific application, the size of the selection, or proximal patch, could be varied dynamically.
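One possible way to preset such a proximal patch is sketched below. The assumption made here, which the source does not state explicitly, is that a separate random selection of about 20% of the columnar modules is drawn for each SDR bit and stored in the encoder's mapping table; the function name and parameters are hypothetical.

```python
import random

def build_mapping_table(n_bits, n_modules, patch_fraction=0.20, seed=0):
    """Preset, for each SDR bit, a random selection of destination columnar modules.

    `patch_fraction` is the share of columnar modules in the proximal patch
    (around 20% in the text); it is a parameter so that the patch size can be
    varied for different inputs or applications.
    """
    rng = random.Random(seed)
    patch_size = max(1, round(patch_fraction * n_modules))
    return {bit: sorted(rng.sample(range(n_modules), patch_size))
            for bit in range(n_bits)}

table = build_mapping_table(n_bits=8, n_modules=50)
```

With 50 columnar modules each bit is mapped to 10 distinct destinations; rebuilding the table with a different `patch_fraction` is the software analogue of resizing the proximal patch.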

According to different particular embodiments, the system proposed by the present invention is implemented on a silicon die, a chip or a microprocessor using CMOS technology.

A second aspect of the invention relates to a scalable hardware acceleration method for storing and retrieving information through a packet-switched network, the method comprising the steps of:

a) encoding, in an encoder module, a binary input into a sparse distributed representation (SDR);

b) sending, for each active bit of the SDR, a multicast packet from the encoder module to a given columnar module of a plurality of columnar modules through the packet-switched network, according to a previously established mapping table;

c) receiving the packets sent from the encoder module, through the packet-switched network, at a router of the columnar module;

d) delivering said packets to certain memory modules of the columnar module;

e) storing the received packets in said memory modules;

f) determining, in a calculation module of the columnar module, a degree of overlap between the content of the memory modules that have received the input packet and the current input;

g) selecting, by the calculation module, a given number of memory modules with the highest degree of overlap;

h) determining, by the calculation module, a temporal context for each of the selected memory modules;

i) making, by the calculation module, a prediction of the system output based on the current input and the temporal context information stored in the memory modules;

j) sending an output packet containing said prediction to an output classifier module;

k) receiving, at the output classifier, an output packet sent through the switching network from any of the columnar modules;

l) seleccionar, en el clasificador de salida, una salida del sistema entre un grupol) select, in the output classifier, a system output from a group

30 de salidas preestablecidas en función del paquete de salida recibido.30 of preset outputs depending on the output package received.

According to one of the embodiments of the present invention, the proposed method checks whether the output prediction made by the calculation module is correct; in the event of a wrong prediction, a burst occurs that puts all the temporal cells of the memory module into the active state.

Additionally, the present invention may include the step of checking that the packet switching network is empty before executing the steps of calculating the overlap and determining the temporal context. To check that the network is empty, a broom packet is provided that traverses the packet switching network.

Optionally, one of the embodiments of the present invention includes the step of restricting the packets sent by the encoder module to a randomly preset selection of columnar modules representing about 20% of the total number of columnar modules.
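As an illustration only, the randomly preset selection (proximal patch) could be sketched as follows. The function name `select_proximal_patch`, the fixed seed and the integer percentage parameter are assumptions made for this example; the source only states that the selection is preset at random and covers about 20% of the columnar modules.

```python
import random


def select_proximal_patch(num_modules, percent=20, seed=0):
    """Return the sorted ids of the columnar modules in the proximal patch.

    Hypothetical helper: the patch is drawn once, with a fixed seed, so that
    it stays the same for every input (it is "preset", not re-drawn).
    """
    rng = random.Random(seed)
    # Integer arithmetic keeps the patch size exact: 20% of 256 -> 51 modules.
    patch_size = max(1, num_modules * percent // 100)
    return sorted(rng.sample(range(num_modules), patch_size))


# Example: a 16x16 mesh has 256 columnar modules.
patch = select_proximal_patch(num_modules=256)
print(len(patch))  # number of modules that may receive encoder packets
```

Varying `percent` corresponds to the dynamic resizing of the proximal patch mentioned as a possible optimization.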

Inspired by the biological properties of axons and dendrites, the present invention therefore defines a system that uses a logical construction to provide the topological flexibility of the known CLA algorithm through an on-chip network. Unlike other state-of-the-art learning systems, the calculations of the CLA algorithm are simple (low-precision additions and subtractions, simple comparisons). Consequently, by adding some calculation logic to the routers of that network and some memory modules to store the connectivity state, the present invention implements the CLA algorithm without the need for complex general-purpose processors. The communication substrate, and the procedures for achieving a feasible hardware implementation of the known CLA algorithm, are based on a packet switching network and on several techniques used in computer architecture, which also guarantee the scalability of the system. The combination of all the techniques presented in the present invention reduces the network delay and the energy required by approximately 95% on average.

In addition, the hardware implementation proposed by the present invention brings many additional advantages, such as broadening the range of application of the algorithm. For example, it can be easily combined with von Neumann-type computing, used as a neural processing accelerator, used to explore the potential of hierarchical organization, or used to investigate the underlying and unknown mechanisms of the neocortex. A silicon-based implementation such as the one proposed in the present invention is therefore a valuable contribution to the state of the art.

DESCRIPTION OF THE DRAWINGS

To complement the description and to aid a better understanding of the characteristics of the invention, a set of figures is attached as an integral part of said description, in which the following is represented for illustrative and non-limiting purposes:

Figures 1a, 1b show the micro-cortical column structures (1a) and cortical hyper-column structures (1b) on which the present invention is based.

Figure 2 shows one of the columns according to the CLA algorithm.

Figure 3a shows a high-level description of the architecture proposed by one of the embodiments of the present invention.

Figure 3b shows a high-level sketch of one of the columnar modules of Figure 3a.

Figure 4 shows the stages required by the CLA algorithm.

Figure 5 shows the pipelining of the CLA algorithm and how the stages are overlapped.

Figure 6 shows an example of optimization according to one of the embodiments of the invention, in which a proximal patch is shown in a honeycomb topology.

Figure 7 shows an example of optimization according to one of the embodiments of the invention, in which several scale-out zones are shown.

Figure 8 plots the number of clock cycles per interval for different sizes of 2D square mesh.

Figure 9 plots the number of clock cycles per interval for different 2D square meshes, using aggregated traffic.

Figure 10 plots the number of clock cycles per interval for different 2D square meshes, with the pipelined algorithm, traffic aggregation and proximal patches applied.

Figure 11 plots the number of clock cycles as the link width is varied.

Figure 12 plots the clock cycles required by the network to process an interval with different link widths.

Figure 13 plots the dynamic energy required by the network to process an interval from the input stream.

Figure 14 plots the cycles per input interval, normalized to the base algorithm (16x16 mesh).

Figure 15 plots the dynamic energy of the network per interval, normalized to the base algorithm.

Figure 16 plots the probability of mispredicted columns, normalized to the base algorithm.

DETAILED DESCRIPTION OF THE INVENTION

What is defined in this detailed description is provided to aid a thorough understanding of the invention. Accordingly, persons of ordinary skill in the art will recognize that variations, changes and modifications of the embodiments described herein are possible without departing from the scope of the invention. In addition, the description of functions and elements well known in the state of the art is omitted for clarity and conciseness.

Of course, the embodiments of the invention can be implemented on a wide variety of architectural platforms, protocols, devices and systems, so the specific designs and implementations presented in this document are provided solely for illustration and understanding, and never to limit aspects of the invention.

The present invention discloses the implementation of a hardware accelerator based on the cortical learning algorithm for storing and retrieving information. The details, from the perspective of computer architecture, are given below.

The basic assumption of HTM memories and CLA algorithms is that synaptic plasticity (through dendritic growth) is the key element the cortex uses for learning. This means that information is stored in the relationships between columns, defined dynamically by the connections established during learning. The storage capacity is therefore proportional to the product of the number of columns and the maximum number of connections per column.

Although the connectivity of neurons can potentially be very high (dendritic spines can provide up to tens of thousands of potential synapses), many of these synapses are not active (that is, the pre-synaptic axon is too far from the dendrite), or multiple active synapses correspond to the same pair of neurons (as a redundancy mechanism). Therefore, instead of electrically replicating the morphology of biological systems, which would currently be impossible, the present invention implements this functionality in a packet switching network.

It is mainly the communication substrate that is organized and optimized to emulate axon activity and to correctly apply the HTM/CLA prediction and learning algorithms. Instead of using synapses to establish an active connection between two columns, the present invention uses memory structures associated with a plurality of routers to model the dendritic segments, together with simple calculation logic to perform the spatial pooling and temporal pooling tasks.

Figure 3a presents a high-level description of the proposed architecture. It shows an encoder (31) at the system input, responsible for converting a localist input into an SDR representation, and a classifier (32) at the system output, responsible for carrying out the intended purpose, for example detecting anomalies in an input sequence, predicting the next value in the input sequence, or comparing certain patterns. Between the encoder and classifier modules, the CLA mechanics are implemented by a component referred to in this document as a Columnar Core (CC) or columnar module (33). The particular embodiment illustrated in Figure 3a, for explanatory purposes only, uses a system of 16 columnar modules, CC0-CC15, connected by a packet switching network, for example with a square mesh topology. Configurations of other sizes would be equally possible, taking advantage of one of the greatest advantages of the present invention: its scalability.

Figure 3b shows a high-level sketch of a CC. In this particular case, each CC is assumed to have B columns and t temporal cells per column. The system is homogeneous, like the biological cortex, and for its hardware implementation the present invention takes into account the following requirements for the three different sections: communication (34), calculation (35) and memory (36):

A. Communication Requirements

The interconnection network has to handle all the traffic generated by the CLA algorithm, that is, the input traffic coming from the encoder, the inhibition traffic and the lateral activity of the temporal cell activations (38), as well as the sending of the activation pattern to the classifier. In the present invention this activity is carried out at the logical level, using packets instead of physical wires. For example, according to one of the embodiments, each output bit of the encoder is connected to a statically defined set of columns (37), so that, for a given input, each of the active bits in the SDR representation sends a multicast packet to the CC where the columns, or memory modules, that have to receive it reside. In one of the embodiments, the encoder has a table that relates columns and inputs. The multicast packet used by the present invention therefore emulates the activity of each axon. Similarly, when a column reaches a predictive state, a single packet is sent to the classifier.
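The table-driven multicast described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the packet layout (a dictionary with `source_bit` and `destinations`), the function name `build_multicast_packets`, and the rule that column `c` resides in CC number `c // columns_per_cc` are all hypothetical choices made for the example.

```python
from collections import defaultdict


def build_multicast_packets(active_bits, bit_to_columns, columns_per_cc):
    """Build one multicast packet per active SDR bit.

    bit_to_columns is the encoder's static table relating inputs and columns.
    Assumption: columns are numbered contiguously, so column c lives in
    CC number c // columns_per_cc.
    """
    packets = []
    for bit in sorted(active_bits):
        destinations = defaultdict(list)  # CC id -> columns inside that CC
        for column in bit_to_columns.get(bit, ()):
            destinations[column // columns_per_cc].append(column)
        if destinations:
            packets.append({"source_bit": bit, "destinations": dict(destinations)})
    return packets


# Hypothetical table: input bit -> statically connected columns.
table = {0: [1, 5, 9], 3: [2, 5], 7: [14]}
packets = build_multicast_packets({0, 7}, table, columns_per_cc=4)
for packet in packets:
    print(packet)
```

Only the active bits (0 and 7 here) generate traffic; bit 3 is inactive, so no packet is built for it.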

Internally, the router (39) receives inputs from the logic of a calculation module (35) for the spatial pooler (the column overlap used in the inhibition procedure) and for the temporal pooler (cell activation events). These inputs must be sent to the potential recipients of the packets. State-of-the-art CLA algorithms implemented in software assume that, in most cases, all the columns of the system must be aware of them, that is, the potential recipients are all the columns. For example, in global inhibition (the default method), every column must be aware of the overlap with the input of all the other columns. The overlap is calculated as the number of connected synapses in the proximal segment of the column for a given input. With this information, the calculation logic can determine whether the current column is within the 2% of columns with the highest overlap and feed the temporal pooling logic. Similarly, for the construction of the distal segments, although probabilistically limited, the algorithm assumes that each column is aware of all the (temporal) cells in the predictive state, which is equivalent to assuming that the effects of the axons spread to all the cells in the system.
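The overlap computation and the global-inhibition check can be sketched as follows. This is a simplified software model under stated assumptions: the permanence threshold of 0.5 for a "connected" synapse, the function names, and the tie handling (columns tied with a winner also win) are illustrative choices, not details given in the source.

```python
def overlap(proximal_segment, active_inputs, connected_threshold=0.5):
    """Count connected synapses whose input bit is active.

    proximal_segment maps input-bit index -> permanence. The threshold
    value is an assumption made for this sketch.
    """
    return sum(
        1
        for bit, permanence in proximal_segment.items()
        if permanence >= connected_threshold and bit in active_inputs
    )


def is_winner(own_overlap, all_overlaps, top_percent=2):
    """Decide locally whether a column is within the top 2% of overlaps.

    all_overlaps stands for the overlap values received from every column
    under global inhibition.
    """
    k = max(1, len(all_overlaps) * top_percent // 100)
    strictly_higher = sum(1 for value in all_overlaps if value > own_overlap)
    return own_overlap > 0 and strictly_higher < k


segment = {3: 0.7, 8: 0.2, 10: 0.6, 42: 0.9}
print(overlap(segment, active_inputs={3, 8, 42}))  # bits 3 and 42 count

# 100 columns, so the top 2% is the 2 columns with the highest overlap.
overlaps = [5] * 97 + [9, 8, 7]
print([is_winner(v, overlaps) for v in (9, 8, 7)])
```

Each column needs only its own overlap and the values carried by the incoming packets, which matches the requirement that the calculation logic works on local information.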

There is therefore a large amount of multicast traffic, which requires considerable network bandwidth and significant energy consumption. In addition, any calculation performed in the calculation section must access only local information, so no centralized component can be relied upon if the system is to scale to thousands of CCs. Synchronizing the system is therefore a challenge.

B. Computational Requirements

As already mentioned, most of the axon activity is modeled in the present invention as multicast traffic. The calculation logic at the destination is responsible for carrying out the prediction and learning procedures for each incoming packet.

There are two stages in the CLA algorithm that have to be applied sequentially, once all the traffic of the current cycle has been drained from the network:

- Spatial pooling, in which the calculation module evaluates the overlap of the input with its proximal segment (that is, the number of active synapses) and, assuming that inhibition is global, broadcasts its value to the rest of the columns in the system. With local inhibition, a multicast packet is used instead. Each column knows its own overlap, so once all the traffic has been drained, a simple comparison with the incoming packets tells it whether it is among the 2% most active columns. If the column was active, the synapses of the active inputs are updated in the proximal segment table. Therefore, according to one of the embodiments of the invention, the calculation module comprises, to perform these spatial logic operations, a comparator, a 4-bit adder and a counter. Note that the maximum overlap requires about log2(Input) bits; for an encoder with 2048 inputs, 12 bits are sufficient.

- Temporal pooling, in which the calculation module evaluates any lateral activity. Assuming that the axon of the (temporal) cells is global, a broadcast is generated. The incoming packet includes the source column and temporal cell, which are kept in a list of current activations. Once the current cycle has finished, the logic determines, for each column, whether the activation was correctly predicted. If so, the corresponding distal segment of the temporal cell in the predictive state is updated accordingly (this emulates the growth of dendrites). If the column was not correctly predicted, the logic must keep the activations of the previous cycle in order to search for the closest distal segment (or create a new one if none exists). From the hardware perspective, this requires an extensive search through all the dendritic segments of the column to determine which of them are active. The (temporal) cells with an active dendritic segment generate a broadcast/multicast in the network and, finally, the columns that were not correctly predicted produce a burst, as the biological system does, which is equivalent to putting all the temporal cells of the column into the active state (selecting only one of them for learning).
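The burst behaviour described in the temporal pooling stage can be illustrated with a short sketch. It is a simplified model under assumptions: the function name `activate_column`, the `match_scores` argument (a stand-in for how closely each cell's distal segments match the previous cycle's activations) and the tie-break towards the lowest cell index are hypothetical and not specified by the source.

```python
def activate_column(predicted_cells, cells_per_column, match_scores=None):
    """Return (active_cells, learning_cells, burst) for an active column.

    predicted_cells: cells of this column that were in the predictive state.
    match_scores: hypothetical map cell -> similarity between its closest
    distal segment and the activations of the previous cycle.
    """
    if predicted_cells:
        # Correct prediction: only the predicted cells become active, and
        # their distal segments are the ones to be reinforced.
        active = sorted(predicted_cells)
        return active, active, False

    # Misprediction: burst, i.e. every temporal cell of the column is active,
    # but only one cell is selected for learning.
    scores = match_scores or {}
    learning_cell = max(
        range(cells_per_column), key=lambda cell: (scores.get(cell, 0), -cell)
    )
    return list(range(cells_per_column)), [learning_cell], True


print(activate_column({2}, cells_per_column=4))
print(activate_column(set(), cells_per_column=4, match_scores={1: 3, 3: 5}))
```

In the second call no cell predicted the activation, so all four cells become active and cell 3, which has the best-matching segment in this example, is chosen for learning.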

C. Memory Requirements

The proximal segments store the permanence of the synapses with each potentially connected input bit. Note that each bit of the SDR representation produced by the encoder is potentially connected (that is, a synapse could be formed) to a subset of columns chosen at startup (which can be selected uniformly). In general, it could be assumed that each bit can connect to any column in the system, so the proximal segment would need one entry for every potential input. According to the present invention, however, each column connects (that is, forms a synapse) to a very small subset of encoder inputs, as happens in practice. The proximal segment is therefore structured, according to one of the embodiments, as a conventional cache memory indexed by the input index. In practice, a capacity of 64 to 128 entries appears to be sufficient for a system of 2K columns. The permanence value needs to be stored there. As in biological systems, the precision required by the algorithm is low (typically fewer than 4 bits are sufficient). For example, assuming a system of 2K columns with 1K inputs, the aggregate of all the proximal segments of the cortex requires (including tags) between 0.25 MB and 0.5 MB (12 bits · 64 · 2K and 12 bits · 128 · 2K), so from a hardware implementation perspective handling these segments appears straightforward.
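The proximal-segment estimate can be reproduced with a few lines of arithmetic. The sketch below assumes the helper name `proximal_bytes` and takes 2K as 2,048; it is only a check of the figures quoted above.

```python
def proximal_bytes(columns, entries_per_column, bits_per_entry):
    """Total storage, in bytes, for the proximal segments of all columns."""
    return columns * entries_per_column * bits_per_entry // 8


for entries in (64, 128):
    for bits in (12, 16):
        size = proximal_bytes(2048, entries, bits)
        print(f"{entries} entries x {bits} bits: {size / 2**20:.4f} MiB")
```

With 12 bits per entry the totals come out at roughly 0.19 MiB and 0.38 MiB. The quoted 0.25 MB and 0.5 MB are matched exactly when each entry occupies 16 bits; whether the text intends such padding is an assumption here, not something it states.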

In contrast, the distal segments appear significantly harder to handle. In a simplistic approach, each distal segment requires as many synapses as there are columns in the system. In addition, each temporal cell may require multiple segments (typically in the range of 128 to 256). For a system with 2K columns, 32 temporal cells per column and 256 segments per cell, and assuming 4 bits to store the permanence, the segments of each cell will require 8 MB. The total memory required would therefore be prohibitive for a real physical system. However, as in biology, only a few of the potential connections are needed. For example, by restricting each segment to its most active synapses (using, for example, a stack-based approach), the number of potential connections that must be reserved can be reduced considerably.
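A rough sizing of the simplistic (dense) distal layout, next to a restricted layout that keeps only a few synapses per segment, might look like this. The function names, the per-synapse column tag and the value of 32 synapses per segment are hypothetical choices for illustration only.

```python
def dense_bytes_per_cell(columns, segments_per_cell, bits_per_synapse):
    """Dense layout: one permanence value per column in every segment."""
    return columns * segments_per_cell * bits_per_synapse // 8


def sparse_bytes_per_cell(kept_synapses, segments_per_cell,
                          bits_per_synapse, columns):
    """Restricted layout: only `kept_synapses` per segment, each stored
    with a tag wide enough to name its source column (assumed encoding)."""
    tag_bits = (columns - 1).bit_length()
    bits = segments_per_cell * kept_synapses * (bits_per_synapse + tag_bits)
    return bits // 8


per_cell = dense_bytes_per_cell(2048, 256, 4)
print(per_cell, per_cell * 32, per_cell * 32 * 2048)
print(sparse_bytes_per_cell(32, 256, 4, 2048))
```

Under these assumptions the dense layout gives 256 KiB per cell, 8 MiB once the 32 cells of a column are added up, and 16 GiB for 2,048 columns. The 8 MB figure in the text therefore appears to correspond to the per-column total; that reading is an inference from the arithmetic, not a statement in the source.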

Therefore, of the three problems identified above for the columnar modules (communication and synchronization, complexity of the temporal pooling logic, and organization of the distal segments), the most relevant from the point of view of scalability, and hence for the present invention, is the first. Given that in biological systems the fundamental difference between species appears to be dominated by the number of neurons rather than by the number of synapses per neuron, it seems reasonable to conclude that the internal logic of the CCs is not a major obstacle, since neither the tables nor the time required by the calculation logic for temporal pooling need to scale with the total number of columns.

On the other hand, it is evident that, as the number of columns in the CLA algorithm increases, the requirements for communicating and synchronizing the different CCs become substantially greater. The communications substrate is responsible for enabling the learning of new temporal and spatial patterns, and from this perspective the present invention addresses the most demanding problem: the communications substrate needed to model axon activity and to synchronize the CCs efficiently and quickly. The key aspects of this communications substrate are detailed below:

A. Network Characteristics

Since all axon activity is modeled as multicast packets, the router used by the present invention requires multicast support. With network-level support, the energy cost of multicast packets is lower, because the packet is copied close to the destination. Latency is also lower, because the packet does not need to be replicated at injection and so no delay is incurred there.

The required packet size is quite small. For example, inhibition traffic requires the identifier of the source column and the overlap (log2(NumColumns) + log2(NumEncoderInputs) bits). Lateral activity requires the source column and the identifier of the temporal cell (log2(NumColumns) + log2(NumTemporalCells) bits). Input activity requires the identifier of the source (log2(NumInputs) bits). For a system with 2,048 columns and 2,048 inputs, with 32 temporal cells per column, the required sizes are 22, 16 and 11 bits respectively. Some embodiments of the invention therefore contemplate narrow communication links, which further reduce the energy requirements and the cost of the router.
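The three payload sizes follow directly from the field widths. The sketch below assumes the helper names `field_bits` and `packet_bits` and counts payload bits only (no header or routing information).

```python
def field_bits(count):
    """Bits needed to encode `count` distinct values."""
    return (count - 1).bit_length()


def packet_bits(num_columns, num_inputs, cells_per_column):
    """Payload size, in bits, of each traffic class described above."""
    return {
        "inhibition": field_bits(num_columns) + field_bits(num_inputs),
        "lateral": field_bits(num_columns) + field_bits(cells_per_column),
        "input": field_bits(num_inputs),
    }


print(packet_bits(2048, 2048, 32))
# prints {'inhibition': 22, 'lateral': 16, 'input': 11}
```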

According to one particular embodiment of the invention, assuming links that are log2(NumColumns) bits wide, packets consisting of a single flit (flow control digit) can be used in most cases and, under those circumstances, the storage requirements inside the routers are low.

In steady state, approximately 2% of the columns of the cortex are active. One of the embodiments of the present invention therefore contemplates a low-degree network with narrow links to meet the requirements, since high-degree networks would increase the complexity of the routers and the cost of the wiring. For example, two-dimensional tori or meshes, or even honeycomb networks, could satisfy these requirements. Another embodiment of the invention, in which a fault-tolerant network is assumed, contemplates increasing the size of the system without being limited by manufacturing defects (yield), and even the use of wafer-to-wafer integration techniques in a 3D environment.

B. Synchronization

The CLA algorithm consists mainly of four phases: computing the overlap of the proximal dendrites with the current encoded input, determining the winning columns of the cortex, determining the lateral activity in each temporal cell of the column, and producing the prediction. The adaptation (that is, the learning) of the synaptic segments is carried out concurrently with these phases.

The difficulty in executing these phases in a fully distributed manner lies in knowing when each one should run. For example, the overlap of the input must not be computed until all the input activity has been received (that is, until each column is aware of all the axon activity). Since there is no acknowledgment message for the reception of axon activity, each CC must itself know when to execute the corresponding part of the algorithm. Likewise, inhibition cannot be triggered until each column knows whether it is among the most active ones and, finally, the prediction cannot be made until the lateral activity of the related temporal cells is known. The simplest yet effective way to avoid this problem is to drain the network before advancing to the next phase. If the network is empty, all the relevant packets are guaranteed to have reached their destinations.

Figure 4 details all the stages required by the CLA algorithm. In addition to encoding (40) and classification (50), there are nine further stages: three perform the spatial and temporal logic computation (43, 46 and 49), three correspond to axon activity (41, 44 and 47) and, finally, three are needed to drain the network (42, 45 and 48).

The synchronization problem in the present invention is thus reduced to providing a scalable mechanism for draining the network. To guarantee the scalability of this mechanism, a simple and effective way of doing so within the network itself is needed. According to one of the embodiments, the present invention contemplates using dimension-order routing and injecting a special broadcast packet, called a broom packet, at the CCs at the ends of the network, that is, those with the smallest and the largest identifiers (IDs) (CC0 and CC15 in the example of Figure 3a). Each broom packet is allowed to move on to the next router only if the local router has no more packets and the transit buffers on the ports through which the router received the copies of the packet are empty. The packet is then replicated on all the remaining ports. For example, when CC5 receives the CC0 broom packet from CC4 and CC1, it is known that no activity packets remain that could affect the columns handled by CC5. When the West and North transit queues are empty, the router replicates the CC0 broom packet through the South and East ports. This operation is repeated across the whole cortex until CC15 receives the CC0 broom packet. At that point, CC15 knows that there are no more packets in the network for it and can advance to the next stage of the algorithm. Similarly, when an intermediate CC has received all the broom packets from CC0 and CC15, it knows that no packets are pending for it in the network. Note that this mechanism operates in a fully distributed manner and scales with the available network bandwidth.
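The propagation rule for the broom packet can be illustrated with a small timing model of a 2D mesh. This is a simplified sketch under stated assumptions: it follows only the CC0 broom packet (the CC15 one is symmetric), charges one cycle per hop, and summarizes each router's queues by a hypothetical `busy_until` cycle at which its local and transit queues become empty.

```python
def broom_forward_cycles(width, height, busy_until):
    """Cycle at which each router forwards the CC0 broom packet.

    busy_until[y][x]: assumed cycle at which router (x, y) has emptied
    its local and transit queues. Router IDs are row-major, so CC5 in a
    4x4 mesh is (x=1, y=1), with CC4 to its West and CC1 to its North.
    """
    forward = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            arrivals = []
            if x > 0:
                arrivals.append(forward[y][x - 1] + 1)  # copy from West
            if y > 0:
                arrivals.append(forward[y - 1][x] + 1)  # copy from North
            # Forward only after every upstream copy has arrived and the
            # router's own queues are empty.
            all_copies_in = max(arrivals, default=0)
            forward[y][x] = max(all_copies_in, busy_until[y][x])
    return forward


idle = [[0] * 4 for _ in range(4)]
print(broom_forward_cycles(4, 4, idle)[3][3])
```

In an idle 4x4 mesh the packet reaches the far corner after as many cycles as there are hops. If one intermediate router is still busy, every router downstream of it is held back, which is the property that lets a CC conclude that no packets remain in flight for it.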

In biological systems such drains do not appear to be necessary, because input changes are spaced far enough apart to guarantee that the spatial and temporal activity completes satisfactorily. When the input rate is too high, the system is unable to learn or to predict. As a simple example, an excessively fast change of image is perceived as noise by the visual cortex. Although a similar solution could be applied to the present invention, the encoder and the data are not as finely tuned as in biological systems, which makes it highly advisable to incorporate the proposed network draining solution.

C. Algorithm Pipelining

The stages of the algorithm require a substantial amount of time and energy. However, as can be seen in Figure 4, distinct stages can be identified, as in a general-purpose processor. The present invention therefore resorts to the same optimization techniques used there. In particular, according to one of the embodiments, the algorithm is pipelined so that activities from different stages overlap, reducing the number of stages to only three per input datum. Figure 5 shows how this organization pays off once the pipeline is full. The idea is to start computing the overlap of the next input as soon as the current overlap has been computed. Thus, in interval 54, two operations are carried out in the network simultaneously. Moving forward in time, three different input operations can be overlapped in a single stage. In interval 57, the network is carrying the distal activity of the first input value, the inhibition traffic of the second input value and the proximal activity of the third input datum. In interval 59, the prediction for the first epoch, the computation of the lateral activity for the second and the computation of the overlap for the last one are carried out simultaneously. More importantly, only one network drain is needed per input value. Once the pipeline is full, only three intervals of the input sequence are needed to produce a prediction.
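The overlap of three inputs in one interval can be sketched as a classic three-stage pipeline schedule. The grouping of traffic and computation into the three stages named below is one reading of the paragraph above, not a transcription of Figure 5, and the names `STAGES` and `schedule` are assumptions.

```python
STAGES = (
    "proximal traffic + overlap",
    "inhibition traffic + lateral activity",
    "distal traffic + prediction",
)


def schedule(num_inputs):
    """Map each pipeline slot to the (input, stage) pairs active in it."""
    slots = {}
    for i in range(num_inputs):
        for s in range(len(STAGES)):
            # Input i enters the pipeline at slot i and advances one
            # stage per slot.
            slots.setdefault(i + s, []).append((i, s))
    return slots


for slot, work in schedule(3).items():
    print(slot, [(i, STAGES[s]) for i, s in work])
```

From the third slot onward the pipeline is full: the first input is in its last stage while the third input is in its first, so one prediction completes per slot.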

D. Overlapping Communication and Computation

Overlapping the stages of the algorithm, as explained above, opens the door to further improvements, since the computation phases (that is, the computation of the overlap, of the lateral activity and of the prediction) need not finish before their results start being sent. As soon as the calculation logic begins to generate axon activity, it can be injected into the network. The number of clock cycles needed to process one value of the input sequence is therefore determined by the slower part, communication or computation. The number of cycles required by the slower part, together with the clock cycle time, determines the time needed to process one sample of the input sequence. Finally, the draining of the network must be handled carefully: the broom packets are sent to the router of each CC only once all the local columns have completed the actions to be performed in the current interval.

E. Aggregated Traffic

The optimal organization from the point of view of latency, according to one of the embodiments of the present invention, is to combine several columns in a single CC. Using many routers with very short links can unnecessarily increase the average latency of the network. To optimize this latency, the size of the CC (that is, the number of columns it handles) should be adjusted so that the propagation delay and the network clock cycle are similar. With this approach it is possible to aggregate multiple axon activations from columns in the same CC into a single packet. Although this may increase the number of flits in the packet (its length), it significantly reduces the network load.

In addition, the pipelined algorithm makes it possible to combine, in a single packet, actions coming from different stages of the algorithm. For example, the inhibition information can be combined with the lateral activations of the previous epoch and thus grouped into a single packet. To do this, coalescing injection queues are assumed (similar to the structure used to support non-blocking caches, usually called MSHRs, miss information/status handling registers), in which any newly injected packet is compared against those waiting to be injected. If there is a match on the destination mask, the earlier packet is modified to include the information of the newly arrived one, and the new packet can then be discarded.
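A coalescing injection queue of the kind described can be sketched in software as a table keyed by destination mask. The class name, the integer mask and the string payloads are hypothetical; a hardware queue would use a small associative structure rather than a dictionary.

```python
class CoalescingQueue:
    """Injection queue that merges packets sharing a destination mask."""

    def __init__(self):
        # destination mask -> payloads waiting to be injected; a dict
        # keeps insertion order, so the oldest packet is listed first.
        self._pending = {}

    def inject(self, dest_mask, payload):
        """Queue a packet; return True if it was merged into a waiting one."""
        if dest_mask in self._pending:
            # Match on the destination mask: extend the waiting packet
            # and drop the new one.
            self._pending[dest_mask].append(payload)
            return True
        self._pending[dest_mask] = [payload]
        return False

    def pop(self):
        """Remove and return the oldest waiting packet as (mask, payloads)."""
        mask = next(iter(self._pending))
        return mask, self._pending.pop(mask)


queue = CoalescingQueue()
queue.inject(0b1010, "inhibition: column 7")
queue.inject(0b1010, "lateral: column 7, cell 3")
print(queue.pop())
```

Merging lengthens the packet by one payload but removes a whole packet from the network, which is the trade-off the text argues for.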

F. Scalability

The biological system suggests that the best approach to increasing storage is to increase the number of columns rather than the number of temporal cells (and distal segments) per column. From a practical perspective, increasing the number of columns could reduce the number of distal segments required per temporal cell. Although this is interesting from the software perspective, it is truly relevant from the hardware point of view, because it could reduce the cost of the interconnect and perhaps the complexity of each CC. The present invention, in one of its embodiments, therefore contemplates increasing the number of columns as far as the technology allows. Unfortunately, the communication system as described so far can scale only up to a limited number of columns, and the energy requirements clearly will not scale using CMOS technology. This aspect is addressed below, according to different embodiments of the invention that introduce two complementary strategies:

1) Scalable proximal traffic: Proximal patches

Software implementations of the algorithm known in the state of the art assume that the proximal synapses of a given column do not take the system topology into account. At start-up, the encoded input is potentially connected to a randomly chosen subset of the columns (by default, around 20%). During spatial pooling, the system learns the relevant interrelationships according to the input sequence. From the software perspective this is beneficial, since it balances the utilization of the columns, but it is very demanding for a hardware implementation. For example, with CCs of 5 columns each and a 20% subset of columns, this approach implies that the activation of an axon in the encoder requires a multicast to all CCs in the system. This approach certainly departs from how biological systems work. For this reason, the present invention, according to one of its embodiments, limits the potentially connected columns to a topologically restricted area of the network, referred to as a proximal patch. Figure 6 shows an example of a proximal patch (60) in a honeycomb topology. According to this particular embodiment of the invention, the encoder is connected to the network through injection queues, and each bit is connected to a different router on the periphery of the circuit. For simplicity, Figure 6 illustrates this for only two separate bits, although the encoder will have thousands of output bits. In this way, the columns that could be connected to each CC are restricted to the proximal patch.
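As a rough illustration of the proximal patch idea, the following Python sketch places a patch covering about 20% of the CCs at a random position. It assumes a rectangular 2D mesh and a square patch for simplicity (Figure 6 uses a honeycomb topology); the function name `proximal_patch` and its parameters are hypothetical and are not taken from the invention.

```python
import math
import random


def proximal_patch(width, height, fraction=0.20, rng=None):
    """Return the (x, y) coordinates of the CCs inside a random square patch.

    Assumption: the patch is a square whose area is close to
    `fraction` of the whole mesh, placed uniformly at random.
    """
    rng = rng or random.Random()
    # Side length whose square is closest to the requested fraction.
    side = max(1, round(math.sqrt(fraction * width * height)))
    side = min(side, width, height)
    # Random top-left corner such that the patch fits inside the mesh.
    x0 = rng.randrange(width - side + 1)
    y0 = rng.randrange(height - side + 1)
    return {(x, y)
            for x in range(x0, x0 + side)
            for y in range(y0, y0 + side)}


patch = proximal_patch(16, 16, rng=random.Random(1))
print(len(patch), "of", 16 * 16, "CCs are potential destinations")
```

Under these assumptions, an active encoder bit in a 16x16 mesh targets a 7x7 patch (49 CCs) instead of all 256 CCs.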

In one of its embodiments, the present invention places the patch at a random position at start-up. Its size is a design parameter and can be redefined according to the nature of the input or the specific application. Experimentally, a size rule of 20% has been found to be valid, although the patch may also be grown or shrunk dynamically to balance column utilization.

Under these circumstances, when the input attached to columnar module R1 (61) is activated, a multicast is generated to the CCs within the patch. The packet is injected into the network and behaves as a unicast packet until it reaches columnar module CC1 (62). The header must include columnar module CC1 (62) as the intermediate node, together with the multicast mask for the remaining nodes.
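The header described above can be sketched as follows. This is only one possible encoding, assumed for illustration: the patch is given as an ordered list of CC identifiers, and bit i of the mask marks the i-th CC of the patch as a remaining destination. The names `MulticastHeader`, `build_header` and `remaining_destinations` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MulticastHeader:
    intermediate: int  # CC where the unicast phase ends
    mask: int          # bit i set: i-th CC of the patch is still a destination


def build_header(patch_nodes, entry_node):
    """Build a header that travels as unicast to `entry_node` and then
    fans out to every other CC of the patch."""
    if entry_node not in patch_nodes:
        raise ValueError("entry node must belong to the patch")
    mask = 0
    for i, node in enumerate(patch_nodes):
        if node != entry_node:
            mask |= 1 << i
    return MulticastHeader(intermediate=entry_node, mask=mask)


def remaining_destinations(header, patch_nodes):
    """Decode the mask back into the list of CCs still to be reached."""
    return [node for i, node in enumerate(patch_nodes)
            if (header.mask >> i) & 1]


patch = [5, 6, 9, 10]
header = build_header(patch, entry_node=5)
print(header, remaining_destinations(header, patch))
```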

2) Scalable distal and inhibition traffic

Similarly, distal and inhibition traffic is assumed to be global in the software implementation (although inhibition may be local). From the network perspective, the required delay and power increase significantly as the number of CCs grows. Note that the number of columns involved in the inhibition process is considerably larger than the number of active inputs in the encoder (any non-zero input overlap requires a multicast). Biological systems, however, certainly do not use global communication. Based on this hypothesis, the present invention, according to one of its embodiments, divides the network into separate zones and confines inhibition and distal traffic to the interior of each zone. These regions are referred to as scale-out zones.

Figure 7 shows how the number of CCs is increased from 16 to 64 in one of the embodiments of the invention. Instead of requiring multicasts or full broadcasts, the traffic generated by the columns in any of the four zones shown (71-74) is restricted to circulating within that zone. To increase the number of columns further, it suffices to increase the number of zones, so traffic remains almost constant even as the system scales up enormously.
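The zone restriction can be illustrated with a short Python sketch. It assumes an 8x8 mesh of 64 CCs split into four square 4x4 zones, which is one plausible reading of Figure 7; the zone numbering and the helper names `zone_of` and `zone_destinations` are hypothetical.

```python
def zone_of(x, y, zone_side=4, mesh_side=8):
    """Index of the scale-out zone that contains CC (x, y)."""
    zones_per_row = mesh_side // zone_side
    return (y // zone_side) * zones_per_row + (x // zone_side)


def zone_destinations(src, zone_side=4, mesh_side=8):
    """CCs that receive distal or inhibition traffic generated at `src`:
    every other CC in the same zone, and none outside it."""
    zone = zone_of(src[0], src[1], zone_side, mesh_side)
    return [(x, y)
            for y in range(mesh_side)
            for x in range(mesh_side)
            if (x, y) != src
            and zone_of(x, y, zone_side, mesh_side) == zone]


dests = zone_destinations((5, 2))
print(len(dests), "destinations instead of", 8 * 8 - 1)
```

With these assumed sizes, each multicast reaches 15 CCs regardless of how many zones are added, which is why the per-zone traffic stays roughly constant as the system grows.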

The encoder, that is, the proximal traffic, sees the network globally and selects the potentially connected columns without distinguishing between zones. This increases the number of available columns and, therefore, the capacity to represent values (precision). For example, with approximately 2K-4K columns per zone, this additional flexibility might not help reduce the memory requirements of the distal segments, and it could increase the complexity of the encoder. The present invention, in one of its embodiments, contemplates using as many consecutive values of the encoded input sequence as there are scale-out zones. In this example, 4 encoders are used to encode four different intervals of the input sequence simultaneously. In this way, not only does system throughput increase, but so does the load on each individual column. In addition, increasing the number of zones keeps the total proximal traffic constant (since each entry in the sequence activates a number of input bits proportional to the number of columns per zone).

Different embodiments of the present invention have been tested in simulators adapted to use data structures and mechanisms appropriate for a feasible hardware implementation and to obtain accurate energy and performance results. The input SDR encoder was implemented using a Mersenne Twister pseudo-random generator. Synthetic time series were used to model the input, specifically periodic series of 32-bit integer data generated from randomly defined polynomials (up to fourth degree, with randomly chosen coefficients). Each time series is defined by twenty values of its polynomial. Each time series is repeated until the system has learned it, which occurs when the number of elements in the sequence whose columns have no incorrect predictions (that is, no column bursts) equals half of all data points. In this way, the system spends half of the time learning new sequences and the other half simply predicting them. Therefore, during half of the intervals there is additional traffic caused by column bursts, or inhibition traffic caused by the appearance of a new input sequence. During the second half of the time, the system has a stable representation of the input, which is much more benign for network traffic. The number of time series (that is, polynomials) needed to meet a 98% confidence interval is approximately 500. Finally, the classifier used in these tests is the simplest one in the NuPIC application, which provides an anomaly score defined as the fraction of columns with a prediction failure.
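The synthetic input described above could be generated along the following lines. This is a minimal sketch, not the simulator's actual code: it relies on Python's `random` module, whose core generator is the Mersenne Twister, and the coefficient range, the sampling points t = 0..19 and the wrap-around to unsigned 32 bits are assumptions made for illustration.

```python
import random


def polynomial_series(rng, length=20, max_degree=4, coeff_range=100):
    """Return `length` 32-bit integers sampled from a random polynomial.

    Assumptions: degree drawn uniformly from 1..max_degree, integer
    coefficients in [-coeff_range, coeff_range], evaluation at
    t = 0, 1, ..., length - 1, and values wrapped to unsigned 32 bits.
    """
    degree = rng.randint(1, max_degree)
    coeffs = [rng.randint(-coeff_range, coeff_range)
              for _ in range(degree + 1)]
    values = []
    for t in range(length):
        value = sum(c * t ** k for k, c in enumerate(coeffs))
        values.append(value & 0xFFFFFFFF)  # keep the low 32 bits
    return values


rng = random.Random(2015)  # Mersenne Twister with a fixed seed
series = [polynomial_series(rng) for _ in range(500)]
print(len(series), "series of", len(series[0]), "values each")
```

Each series would then be presented repeatedly to the system until it is learned, as the text describes.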

Having described the test conditions, some of the results obtained with the present invention are presented below. They show that the invention improves the delay and energy results of the system, while the scalability analysis demonstrates its feasibility for tens of thousands of columnar modules (CCs) in the system.

In a first embodiment of the invention, the default configuration of the NuPIC application is used: 2048 columns, with 32 temporal cells per column, and global inhibition. A 2D mesh topology is used, with a conventional basic router, deterministic dimension-order routing (DOR), a 4-cycle pipeline, and virtual cut-through flow control. Links are assumed to be low-swing wires that take one clock cycle to move a flit from one router to the next, and input buffers of 1280 bytes are included, without virtual channels. Note that, with this configuration, all multicast traffic is transmitted to every CC in the system. Therefore, by using dimension-order replication at the intermediate routers, a deadlock-free network is obtained. The router incorporates the network draining mechanism described above.
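For readers unfamiliar with dimension-order routing, the following Python sketch computes the path of a unicast packet in a 2D mesh. It assumes the common X-then-Y ordering; the text does not state which dimension is traversed first, and the function name `dor_route` is hypothetical.

```python
def dor_route(src, dst):
    """Return the list of routers visited from `src` to `dst` under
    dimension-order routing, correcting X first and then Y (assumed order)."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:          # first travel along the X dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:          # then travel along the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(dor_route((0, 0), (2, 1)))
```

Because every packet corrects the dimensions in the same fixed order, the route between two routers is unique, which is the property that deterministic routing relies on.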

Figure 8 plots the number of clock cycles per interval for different sizes of square 2D mesh, that is, the number of network cycles needed to carry out the tasks of each input interval, following a sequential approach and a segmented (pipelined) approach, for the different network sizes (different numbers of columns per CC). As can be seen, up to 300 cycles can be saved by segmenting and overlapping the learning process. Another observation, which may not be intuitive, is the behavior of the segmented approach: as the network size increases, the time required barely changes. The reason for this behavior is contention. In this case the network receives a higher load (since the three communication phases overlap). Consequently, when the network size is reduced, the available bandwidth is reduced and contention grows. Under these conditions, the bandwidth advantage appears to compensate for the increase in average distance. With the sequential approach, contention is noticeable only in the 4x4 mesh.

Figure 9 plots the number of clock cycles per interval for different square 2D meshes when traffic aggregation is used (16-byte-wide links with 5-flit packets). Under the same conditions as the previous embodiment, the packet length is modeled carefully: packets are assumed to be 5 flits long on 16-byte-wide links, and an additional packet is added whenever the aggregation process exceeds that limit. As the figure shows, the benefits are remarkable: the communication needs of the system for one interval can be handled in fewer than 60 network clock cycles. Regarding the network configuration, this result reverses the previous observation about network size, since the traffic reduction is so drastic that contention is absent in every case. Under this configuration, therefore, the dominant factor is the average distance of the network.
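The packet-size rule used for aggregation can be made concrete with a small calculation. The sketch below assumes that a 5-flit packet on 16-byte links carries at most 80 bytes and that an optional header takes part of that capacity; the header size is not given in the text, so `header_bytes` is a hypothetical parameter that defaults to zero.

```python
import math

FLITS_PER_PACKET = 5
LINK_WIDTH_BYTES = 16  # one flit per link width


def packets_needed(payload_bytes, header_bytes=0):
    """Number of 5-flit packets required to carry an aggregated payload."""
    capacity = FLITS_PER_PACKET * LINK_WIDTH_BYTES - header_bytes
    if capacity <= 0:
        raise ValueError("header does not fit in a packet")
    return math.ceil(payload_bytes / capacity)


for size in (40, 80, 81, 200):
    print(size, "bytes ->", packets_needed(size), "packet(s)")
```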

Figure 10 plots the number of clock cycles per interval for different square 2D meshes, with the segmented algorithm, traffic aggregation, and proximal patches applied. Introducing proximal patches in this embodiment yields a significant benefit. In both cases, 20% of the columns in the system are selected. With more than 5 columns in each CC, the uniform distribution implies a broadcast. Proximal patches reduce this significantly, especially as the network grows (and the benefit of converting a broadcast into a localized multicast becomes larger). Restricting proximal traffic in this way does not significantly change the anomaly score (that is, the probability of missing a column activation in temporal pooling), which is around 6% at the end of the simulation in both cases.
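The anomaly score mentioned here can be sketched as the fraction of active columns that were not predicted. This is an illustrative reading of "fraction of columns with a prediction failure", not the NuPIC source code; the function name `anomaly_score` and the convention of returning 0.0 when no column is active are assumptions.

```python
def anomaly_score(active_columns, predicted_columns):
    """Fraction of active columns that the previous step failed to predict."""
    active = set(active_columns)
    if not active:
        return 0.0  # assumed convention for an empty input
    missed = active - set(predicted_columns)
    return len(missed) / len(active)


# One of four active columns was not predicted.
print(anomaly_score({1, 2, 3, 4}, {2, 3, 4, 9}))
```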

Since contention has so little impact on the network, it seems reasonable to reduce the link bandwidth. The results above correspond to 16-byte-wide links, a fairly conventional size for many contemporary systems that use networks-on-chip. For example, for a 16x16 mesh the bisection bandwidth is approximately 512 GB/s, assuming a 1 ns clock cycle. In these circumstances, it is worth exploring the effect of reducing the link width in order to reduce energy consumption and area.
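The 512 GB/s figure can be reproduced with a short calculation. The sketch assumes that the bisection of a k x k mesh is crossed by k links in each direction and that both directions are counted, which is the convention that matches the number quoted in the text; the function name is hypothetical.

```python
def bisection_bandwidth_gb_s(mesh_side, link_bytes, cycle_ns=1.0,
                             count_both_directions=True):
    """Bisection bandwidth of a square 2D mesh in GB/s.

    One flit of `link_bytes` crosses each link per clock cycle, and
    bytes per nanosecond are numerically equal to GB/s.
    """
    links = mesh_side * (2 if count_both_directions else 1)
    return links * link_bytes / cycle_ns


print(bisection_bandwidth_gb_s(16, 16))  # 16-byte links
print(bisection_bandwidth_gb_s(16, 2))   # 2-byte links
```

Under the same assumptions, narrowing the links from 16 bytes to 2 bytes reduces the bisection bandwidth of the 16x16 mesh by a factor of 8.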

Figure 11 plots the number of clock cycles as the link width varies. A specific example is used to show how the time required to process an interval changes as the link width is varied from 16 bytes down to 1 byte. At this point, in order to select the best network configuration, the network delay must be balanced against the cost of the temporal/spatial computation logic in the CCs, for which at least the following three aspects are important: (1) learning, in both cases, is off the critical path; (2) the spatial logic is quite simple (it needs to compute the overlap of the inputs) and, since it operates in parallel with the temporal logic, it is not on the critical path of the circuit either; (3) the generation of lateral activity in the temporal cell, which is on the critical path of the algorithm, is dominated by accesses to the memory where the distal segments are stored. The number of accesses to that memory is therefore a key element of the whole logic, which, according to different embodiments of the invention, can be structured in different ways. In any case, the optimistic hypothesis is that only 1 clock cycle is needed per distal segment, that each column has 32 temporal cells, and that each temporal cell requires 1 segment on average, so 32 clock cycles are needed to process a single column. In a 2K-column system, such as the one used for these results, the computation requires from approximately 4000 cycles (32 × 2048/16) in a 4x4 network down to 64 clock cycles in a 32x32 network. Under these assumptions, therefore, the most appropriate network is a 256-node network with 2-byte-wide links. The scalability of the network delay allows it to be matched appropriately to the cost of the computation logic.
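The computation-cost estimate above follows directly from the stated hypothesis and can be checked with a few lines of Python. The function name and parameter names are hypothetical; the default values are the ones given in the text (2048 columns, 32 temporal cells per column, 1 segment per cell on average, 1 clock cycle per segment).

```python
def compute_cycles(mesh_side, columns=2048, cells_per_column=32,
                   segments_per_cell=1, cycles_per_segment=1):
    """Clock cycles a CC needs to process all of its columns,
    assuming the columns are spread evenly over a square mesh."""
    nodes = mesh_side * mesh_side
    columns_per_cc = columns // nodes
    cycles_per_column = (cells_per_column * segments_per_cell
                         * cycles_per_segment)
    return columns_per_cc * cycles_per_column


for side in (4, 8, 16, 32):
    print(f"{side}x{side} mesh: {compute_cycles(side)} cycles")
```

The 4x4 case gives 4096 cycles, which the text rounds to approximately 4000, and the 32x32 case gives 64 cycles.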

One of the main advantages sought, which solves a major problem of the state of the art, is to increase the capacity of these systems considerably and to offer real scalability to millions of columns. The present invention achieves these advantages, as shown in Figure 12.

In Figure 12, starting from a fixed network configuration that retains the previous optimizations, the results are compared with those of a system with 4 scale-out zones, that is, 8K columns in total, with the network configuration unchanged. The figure shows the network clock cycles required to process one interval of the current input, using two link widths, for different network sizes, with and without scale-out zones. As can be seen, the delay is much less sensitive to the network diameter: a 32x32 mesh (M1024) with 2-byte-wide links can be used with results similar to those of the flat system (around 200 clock cycles). According to these results, inhibition and distal traffic dominate the network load, since the increase in proximal traffic at the destinations is negligible. Using 8 scale-out zones (that is, 16K columns) gives the same results. It is therefore evident that the present invention offers a system in which the communication delay is independent of the number of columns.

At this point, assuming a network clock cycle of 1 ns and that communication is the critical element of the algorithm, the present invention is able to process up to 100 million values of the input sequence per second. This performance is not attainable by any software approach. For example, the fastest current implementation of NuPIC can process around 1000 inputs per second (for a similar system size, running on a current mid-range machine).

Increasing the number of CCs in the system increases the dynamic energy of the network (each packet traverses more links and routers). Figure 13 shows the dynamic energy required by the network to process one interval. As can be seen, using scale-out zones reduces the network requirements by up to a factor of 3, making it possible to reduce, almost quadratically, the negative effects of network size.

To conclude this description, the performance, storage requirements, and prediction efficiency of all the previous embodiments are detailed below. Figure 14 shows that, from the performance point of view, up to 92% of the network effects could be overlapped with the other activities on the critical path.

Figure 15 shows how the dynamic energy of the network is reduced in the same way, which confirms that communication problems are almost non-existent and therefore supports the premise of the present invention that a solution based on packet switching is feasible.

Figure 16 shows the probability of prediction failures per column for each experiment (each experiment has approximately 3 million different intervals). Some of the changes introduced slightly alter the original CLA algorithm, but in most cases the confidence margins indicate that the averages are similar (that is, the accuracy of the results does not change). As noted above, the non-normalized value is around 6%. However, the proximal patches appear to improve this figure slightly, to less than 5%. This biologically inspired traffic optimization therefore appears to be beneficial for accuracy. As expected, using scale-out zones slightly decreases the accuracy of the system, back to the results obtained with the base algorithm.

Claims

1. A hardware acceleration system for storing and retrieving information, which implements a cortical learning algorithm through a packet switching network, the system comprises:

at least one encoder module configured to encode a binary input into a sparse distributed representation (SDR), and to send, for each active bit of the SDR, a multicast packet to a given columnar module through the packet switching network, based on a previously established correspondence table;

a plurality of columnar modules connected by said packet switching network, configured to receive the multicast packets sent from the encoder, where each of the columnar modules in turn comprises:

- a router with multicast support configured to receive packets from the encoder module, deliver said packets to certain memory modules of the columnar module and send packets from the memory modules to an output classifier module;

- a plurality of memory modules configured to store the inputs received from the router and to store context information;

- a calculation module configured to determine a degree of overlap between the content of certain memory modules and the current input, select a specific number of memory modules with the greatest degree of overlap, determine a temporal context for each of the selected memory modules, make a prediction of the system output based on the current input and the temporal context information, and send an output packet containing said prediction to an output classifier module;

an output classifier module configured to receive an output packet, sent through the packet switching network from any of the columnar modules, and to select a system output from a group of preset outputs based on the received output packet.
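By way of illustration only, the overlap calculation and the selection of memory modules performed by the calculation module might be sketched in software as follows. This is a minimal sketch, not the claimed hardware: representing each memory module's stored content as a set of input-bit indices, the function names `overlap` and `select_winners`, and breaking ties by lower index are all assumptions made for the example.

```python
from typing import List, Set


def overlap(stored: Set[int], current: Set[int]) -> int:
    """Count the active input bits shared by a module's stored content and the current input."""
    return len(stored & current)


def select_winners(modules: List[Set[int]], current: Set[int], k: int) -> List[int]:
    """Return the indices of the k memory modules with the greatest overlap.

    Ties are broken in favour of the lower index (an assumption of this sketch).
    """
    scores = [overlap(m, current) for m in modules]
    ranked = sorted(range(len(modules)), key=lambda i: (-scores[i], i))
    return ranked[:k]


# Hypothetical stored contents of four memory modules and one current input.
modules = [{0, 3, 7}, {1, 3, 9}, {3, 7, 9}, {2, 4}]
current_input = {3, 7, 9}
winners = select_winners(modules, current_input, k=2)
```

In hardware, the set intersection would correspond to the comparator and the count to the adder or counter of the calculation module; the sketch only fixes the arithmetic being performed.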

2. System according to claim 1, wherein the calculation module comprises a comparator, an adder and a counter.


3. System according to any of the preceding claims, wherein each memory module of the plurality of memory modules comprises a plurality of temporal cells that adopt an active state or a non-active state, and their combination represents a certain temporal context for the memory module.

4. System according to claim 3, wherein the calculation module is further configured to check whether its output prediction is correct; in case of a wrong prediction, a burst occurs that puts all the temporal cells of the memory module in the active state.
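The burst behaviour can be illustrated with a short software sketch. It is a simplified model under stated assumptions: the class `MemoryModule`, the function `check_prediction`, the use of four temporal cells per module, and the representation of predictions as integers are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryModule:
    """Hypothetical model of one memory module and its temporal cells."""
    cells: List[bool] = field(default_factory=lambda: [False] * 4)

    def burst(self) -> None:
        # A burst puts every temporal cell of the module in the active state.
        self.cells = [True] * len(self.cells)


def check_prediction(module: MemoryModule, predicted: int, observed: int) -> bool:
    """Compare the prediction with the observed input; burst on a wrong prediction."""
    correct = predicted == observed
    if not correct:
        module.burst()
    return correct


# One cell is active (a specific temporal context) before a wrong prediction.
module = MemoryModule(cells=[False, True, False, False])
was_correct = check_prediction(module, predicted=5, observed=7)
```

After the wrong prediction, all cells of `module` are active, so the module no longer encodes a single temporal context.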

5. System according to any of the preceding claims, wherein the calculation module is configured to combine stages and, given an input sequence, produce a prediction in three intervals of said sequence.

6. System according to claim 5, wherein the calculation module is further configured to aggregate traffic from different stages in the same packet.

7. System according to any of the preceding claims, wherein the columnar modules at the ends of the network are configured to inject a broom packet into the packet switching network, which is replicated in the rest of the columnar modules only when the corresponding router has no more queued packets, until said broom packet reaches the opposite end of the network, indicating that the network has been emptied.
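The broom-packet mechanism can be illustrated with a deliberately simplified simulation. The assumptions here are not taken from the claims: the network is reduced to a one-dimensional chain of routers, the broom packet is injected at one end only, each router delivers one queued packet per cycle to its local memory modules (ordinary packets are not forwarded), and the function name `drain_cycles` is hypothetical.

```python
from typing import List


def drain_cycles(queue_lengths: List[int]) -> int:
    """Simulate a broom packet crossing a chain of routers from the first to the last.

    queue_lengths[i] is the number of ordinary packets queued at router i when
    the broom packet is injected. The broom packet advances one hop per cycle,
    but only when the router currently holding it has an empty queue.
    Returns the number of cycles until the broom packet leaves the last router.
    """
    queues = list(queue_lengths)
    position = 0  # index of the router currently holding the broom packet
    cycles = 0
    while position < len(queues):
        if queues[position] == 0:
            # Router is empty: replicate the broom packet to the next module.
            position += 1
        # Every router delivers one pending packet per cycle (sketch assumption).
        queues = [max(q - 1, 0) for q in queues]
        cycles += 1
    return cycles


idle_network = drain_cycles([0, 0, 0])
busy_network = drain_cycles([2, 0, 3])
```

The arrival of the broom packet at the far end is what signals that every router along the path had emptied its queue; the cycle count only shows that a loaded network delays that signal.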

8. System according to any of the preceding claims, wherein the number of memory modules comprising each of the columnar modules is determined by a balance between the propagation delay and the system clock cycle.

9. System according to any of the preceding claims wherein at least one encoder module is configured to send an input packet to a selection of randomly preset columnar modules representing around 20% of the total columnar modules.

10. System according to any of the preceding claims, implemented on a silicon wafer, a chip or a microprocessor using CMOS technology.

11. Scalable hardware acceleration method for storing and retrieving information through a packet switching network, the method comprising the steps of:

a) encoding, in an encoder module, a binary input into a sparse distributed representation (SDR);

b) sending, for each active bit of the SDR, a multicast packet from the encoder module to a given columnar module of a plurality of columnar modules through the packet switching network, based on a previously established correspondence table;

c) receiving the packets sent from the encoder module, through the packet switching network, in a router of the columnar module;

d) delivering said packets to certain memory modules of the columnar module;

e) storing the received packets in certain memory modules;

f) determining, in a calculation module of the columnar module, a degree of overlap between the contents of the memory modules that have received the input packet and the current input;

g) selecting, by the calculation module, a certain number of memory modules with the greatest degree of overlap;

h) determining, by the calculation module, a temporal context for each of the selected memory modules;

i) making, by the calculation module, a prediction of the system output based on the current input and the temporal context information stored in the memory modules;

j) sending an output packet containing said prediction to an output classifier module;

k) receiving, in the output classifier module, an output packet sent through the packet switching network from any of the columnar modules;

l) selecting, in the output classifier module, a system output from a group of preset outputs based on the received output packet.
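Steps a) and b) can be illustrated with a minimal software sketch of the encoder side. The packet layout (an active bit index plus a tuple of destination columnar modules), the function name `encode_to_packets` and the contents of the example table are assumptions made for the illustration; the claims do not specify a packet format.

```python
from typing import Dict, List, Tuple

# Hypothetical packet: (index of the active SDR bit, destination columnar modules).
Packet = Tuple[int, Tuple[int, ...]]


def encode_to_packets(sdr: List[int], table: Dict[int, Tuple[int, ...]]) -> List[Packet]:
    """Emit one multicast packet per active SDR bit, addressed via the correspondence table."""
    return [(bit, table[bit]) for bit, value in enumerate(sdr) if value == 1]


# Previously established correspondence table: SDR bit -> columnar module identifiers.
table = {0: (0, 2), 1: (1,), 2: (0, 1), 3: (2, 3)}
sdr = [1, 0, 0, 1]
packets = encode_to_packets(sdr, table)
```

Because an SDR has few active bits, only those bits generate traffic; inactive bits produce no packets at all.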

12. Method according to claim 11, further comprising checking whether the output prediction made by the calculation module is correct, wherein, in case of a wrong prediction, a burst is produced that puts all the temporal cells of the memory module in the active state.

13. Method according to any of claims 10-12, further comprising verifying that the packet switching network is empty before executing the step of calculating the overlap and before determining the temporal context, wherein, to verify that the network is empty, a broom packet is provided that traverses the packet switching network.

14. Method according to any of claims 10-13, further comprising the step of restricting the packets sent by the encoder module to a selection of randomly preset columnar modules representing around 20% of the total columnar modules.
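One possible reading of the 20% restriction is that each input bit is wired in advance to a random subset of the columnar modules. The sketch below builds such a table under that assumption; the function name `build_table`, the fixed seed and the rounding of the subset size are hypothetical choices for the example, not details given in the claims.

```python
import random
from typing import Dict, Tuple


def build_table(num_bits: int, num_columns: int,
                fraction: float = 0.2, seed: int = 0) -> Dict[int, Tuple[int, ...]]:
    """Preset, for each input bit, a random selection of about `fraction` of the columnar modules."""
    rng = random.Random(seed)  # fixed seed: the selection is random but preset
    fan_out = max(1, round(fraction * num_columns))
    return {
        bit: tuple(sorted(rng.sample(range(num_columns), fan_out)))
        for bit in range(num_bits)
    }


# Hypothetical sizes: 8 input bits, 50 columnar modules, 10 destinations per bit.
table = build_table(num_bits=8, num_columns=50)
```

Fixing the seed reflects that the selection is established once, before operation, so that the same input bit always reaches the same columnar modules.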