WO2013186402A1 - Classification method and device for large volumes of data - Google Patents


Info

Publication number
WO2013186402A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
classification
procedure
svm
training
Prior art date
Application number
PCT/ES2012/000293
Other languages
Spanish (es)
French (fr)
Inventor
Javier MARTINEZ MOGUERZA
Javier Castillo Villar
José Ignacio MARTINEZ TORRE
David RIOS INSUA
Javier CANO MONTERO
Original Assignee
Universidad Rey Juan Carlos
Priority date
Filing date
Publication date
Application filed by Universidad Rey Juan Carlos
Publication of WO2013186402A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines

Definitions

  • The present invention falls within the field of classification systems for large volumes of data. More specifically, it describes a new method and device that substantially reduces classification time through the use of Support Vector Machines (SVM).
  • SVM: Support Vector Machine
  • Classification and pattern-recognition techniques have a wide field of application. In recent years, areas such as the life sciences, meteorology and financial analysis have begun to use these techniques to statistically study groups of interest within large datasets.
  • In the field of medicine, for example, the search for certain protein markers in breast cancer patients, the study of DNA, and diabetes research have found in these techniques a powerful working tool.
  • In the first case, the problem lies in the need, during the clinical-trial phase, to test the efficacy of a cancer drug while the disease is still at an initial stage. For this, it is necessary to identify which patients are developing it as soon as possible.
  • Another technical field is text mining, where analysing texts in several documents simultaneously represents a very high computational load.
  • In this case, matrices appear whose dimensions are the product of the number of documents by the total vocabulary appearing in each document.
  • Huge matrices are thus generated as a result of the Cartesian product of these two magnitudes. This is, for example, the case of non-targeted web searches.
  • For a set of data, a subset of a larger one (the space), in which each datum belongs to one of two possible categories, the SVM is able to predict whether a new point, whose category we do not know, belongs to one or the other. Before carrying out this classification it is necessary to train the system on control data. Once trained, the SVM searches for a line, more correctly a hyperplane, that optimally separates the points of each of the classes. This concept of optimal separation is where the fundamental characteristic of the use of SVMs resides.
  • Support vector machines have become a novel tool for pattern recognition.
  • The simplest application of this technique is the binary classification problem, in which only two classes are defined.
  • The underlying idea is to find a separating function for the two classes whose empirical error probability is minimal. This function, using the appropriate transformations, can be represented as a hyperplane.
  • The present patent application presents a novel method for the classification of new data, within a very large set, using one or several Support Vector Machines that can work in parallel.
  • This classification process is carried out in two stages: a first "training" stage of the device and a second classification stage.
  • In the first stage, starting from an initial population used as a training sample (Figure 1.a), the data are divided into two categories (Figure 1.b).
  • This first separation into two classes can be done by any known statistical method that groups data that have some similarity.
  • One possibility is k-means, which partitions a large dataset into k smaller sets. Each of these sets groups the data closest to its mean.
  • Although this is a suitable method, any other that separates the data into two classes appropriate for the analysis to be performed can be used.
  • This group assignment can be performed either manually or by a statistical method: again, k-means can be used, as can other density-estimation techniques or even mode-estimation methods based on mixtures and Monte Carlo simulation. The method that best fits the given distribution of the data will be used. The decision to use one mechanism or another will depend either on the person responsible for the classification or on the number of data items and the amount to be loaded into each memory.
  • Voronoi regions are a very simple interpolation method based on computing the Euclidean distance between data. When they are calculated over many points, the area is divided into a series of polygons whose perimeters are equidistant from the neighbouring points. This allows a class to be subdivided into a series of sets that share the characteristics of the class to which they belong. It is done this way, using the previous groups, because computing the Voronoi regions over all the data of the entire class would have a very high computational cost. The division into regions for subsequent parallelization is the key to the success of this method.
  • Pairs of Voronoi regions are randomly selected, one of each of the two classes (14), and the training of SVMs (15a, 15i) is started.
  • The number of SVMs will depend on the electronic architecture chosen to execute this method.
  • The parallel training, key to this invention, substantially decreases the time spent on this operation.
  • The result is a hyperplane that separates all the data of the training sample (Figure 1.d).
  • The second part of the process is the classification itself, shown in Figure 3.
  • Once all the SVMs have been trained, which means they are able to recognize the class to which any datum of the training set belongs, when a new datum arrives (21), each SVM votes on the category it falls into, based on the pairs of Voronoi regions assigned to that SVM (22a, 22i).
  • The result obtained from each SVM can be used as such or can be weighted according to criteria obtained during training (23a, 23i). This weighting is a correction factor associated with the result of each SVM.
  • The results of all the SVMs are added, and the new datum is assigned to the most voted category (24).
  • The physical device that performs this categorization is composed of a data storage unit, which can be either an independent memory or a set of memories, where the information of each cluster is stored: the matrices of individuals, the distance matrices with their associated variables, and any intermediate data needed to perform the calculations.
  • The device onto which the SVMs are loaded can be built as a custom device (ASIC), as a programmable unit of the FPGA type, or with any other electronic technology that allows its implementation. As with the memories, there may be more than one module of this type.
  • Internally, the electronic device will consist of a calculation module in which the procedure described in this patent is implemented. It will also include a memory controller, responsible for managing access to the memory banks, and a control unit that manages and synchronizes all data flow within the device and with external devices.
  • This system can be developed on a printed circuit board from discrete components, or manufactured as an electronic device that integrates all of them, or at least the most important ones, into a single unit.
  • A possible implementation of this invention is the development of a specific printed circuit card that can be inserted into a personal computer or a computer server.
  • Alternatively, an independent unit including the functionality described above can be developed.
  • Figure 1 shows, graphically, how the process of separation of the data that will be used to train the system is performed, before starting the classification procedure.
  • Figure 2 is a flow chart of the training process.
  • Figure 3 shows a flow chart of the process of classifying a new data.
  • Figures 4 and 5 show diagrams of the electronic devices with which the examples in this patent have been developed.
  • A Virtex 5 ML505 FPGA connected to a PC via PCI-Express x1 has been used.
  • the FPGA system is composed of a Microblaze processor with 4 KB cache memory and 256 MB of DDR RAM; all connected through a PLB bus.
  • First, the procedure that combines the sets generated by k-means and generates the training sets is executed.
  • The method begins by writing one of the training sets into the DDR memory of the FPGA; the Microblaze is then notified to begin executing a standard optimization procedure, Sequential Minimal Optimization (SMO).
  • SMO: Sequential Minimal Optimization
  • Next, a classification of part of the data with which the SVM has been trained is performed to verify correct functioning.
  • The percentage of correctly classified data can be used to weight the SVM's vote in the voting system. For example, if the SVM correctly classifies 75% of the control sample and, when classifying a new datum, places it in class -1, the vote of this SVM will be worth -0.75.
  • The system then returns the SVM and its weighting to the PC, and the next one begins to train.
  • For classification, an SVM is loaded into the DDR memory, restored by the Microblaze, and classifies the new data sent to it from the PC. Once this SVM has finished, it returns the weighted classification of each new datum, and the next SVM is loaded to continue the classification of the data.
  • Finally, the voting process is carried out, which consists of adding the weighted classifications of all the SVMs. If, for a given datum, the result is greater than zero, it belongs to class +1; if it is less than zero, it belongs to class -1.
  • Once the voting process is finished, the overall classification of each datum is returned to the PC.
  • Since the training and classification of several SVMs can be done in parallel, a system with multiple Microblazes was built.
  • Each Microblaze has access to an exclusive memory area in which it receives the data for training, and to a shared memory area in which the classification data are received.
  • The PC maintains a structure with the availability of the Microblazes.
  • The PC loads the data of an SVM into the associated memory of an available Microblaze so that it begins training as in the sequential version.
  • The PC continues to load data into the associated memories as Microblazes become available.
  • The classification is done in a similar way to the sequential version.
  • The PC loads the data to be classified into a shared memory area, and the SVMs are loaded into the different Microblazes. The Microblazes classify the data in parallel and vote on each datum. When all the SVMs have completed the classification process, all the weighted votes are added, as in the sequential version, and each datum is classified.
  • The experiment measures the average training and classification time over 1000 data clouds.
  • The clouds are formed by two Poisson distributions with 5000 points each.
  • The PC on which the experiments have been carried out has an Intel i7 processor with 8 GB of RAM.
  • The sequential FPGA version took 93.47 seconds, compared with 57.11 seconds for the version with two Microblazes. The classification time of the data is also observed to decrease almost linearly, due to the independence of the data when they are classified.

Abstract

The invention relates to a classification method for large volumes of data. The method first trains the system, using a known data sample, and subsequently classifies the data. Training, carried out in parallel on different support vector machines (SVM), comprises the following steps: a. assigning each data item of the sample with membership of a predetermined class within a group of two classes; b. assigning the number of groups to be included in each of the classes of the sample; c. for each class, forming as many groups as assigned in step b. and grouping together all the data of the sample into one of said groups; d. selecting pairs of groups in which each of the members of the pair belongs to a different class; and e. training the support vector machine (SVM). Classification comprises the following steps: a. each SVM votes for the class that contains the new data; b. once all the SVMs have voted, all the results of the votes are tallied; and c. the new data item is assigned to the class receiving the most votes.

Description

CLASSIFICATION METHOD AND DEVICE FOR LARGE VOLUMES OF DATA
Technical field to which the invention belongs
The present invention falls within the field of classification systems for large volumes of data. More specifically, it describes a new method and device that substantially reduces classification time through the use of Support Vector Machines (SVM).
State of the art
Classification and pattern-recognition techniques have a wide field of application. In recent years, areas such as the life sciences, meteorology and financial analysis have begun to use these techniques to statistically study groups of interest within large datasets.
In the field of medicine, to cite one, the search for certain protein markers in breast cancer patients, the study of DNA, and diabetes research have found in these techniques a powerful working tool. In the first case, for example, the problem lies in the need, during the clinical-trial phase, to test the efficacy of a cancer drug while the disease is still at an initial stage. For this, it is necessary to identify which patients are developing it as soon as possible.
Another technical field is text mining, where analysing texts in several documents simultaneously represents a very high computational load. In this case, matrices appear whose dimensions are the product of the number of documents by the total vocabulary appearing in each document. Huge matrices are thus generated as a result of the Cartesian product of these two magnitudes. This is, for example, the case of non-targeted web searches.
Several data-analysis techniques are currently being used to attack this type of problem: decision trees, principal component analysis (PCA), Bayesian analysis and neural networks. These techniques produce many "false alarms", that is, classification errors. In cases such as cancer analysis, this means that false alarms cause great anxiety in patients and unnecessary use of biopsies or, on the opposite side, failure to detect the disease in patients who are developing it.
A novel alternative is Support Vector Machines (SVM from now on), which offer a new approach to these pattern-classification problems and are especially robust for high-dimensional data, where other classification systems collapse due to the high computational resources they require.
For a set of data, a subset of a larger one (the space), in which each datum belongs to one of two possible categories, the SVM is able to predict whether a new point, whose category we do not know, belongs to one or the other. Before carrying out this classification it is necessary to train the system on control data. Once trained, the SVM searches for a line, more correctly a hyperplane, that optimally separates the points of each of the classes. This concept of optimal separation is where the fundamental characteristic of the use of SVMs resides.
The technical literature has begun to describe the use of several SVMs that process information simultaneously, as in patents US7519563 and US7865898, which discuss the possibility of splitting the data supplied to the system to make it computationally more effective, but which do not solve the problem of how to classify information with this architecture.
Technical problem to be solved
At present, when a very large volume of data is to be classified, on the order of 10 million items or more, the techniques that can be used are nearest neighbours or Bayesian inference methods, whose computational cost is very high. The time needed to solve these problems is several hours, and on occasion it is not possible to find a solution to the problem posed due to the collapse of the computing system. In 2010, a technique based on the use of decision trees was proposed [Chang, Fu, Guo, Chien-Yanh, et al. Tree decomposition for large-scale SVM problems, Journal of Machine Learning Research 11 (2010) 2855-2892] that can solve some of these problems, but with much smaller dimensions.
This is why the technical problem this invention solves is the development of a new method, and of an electronic device, capable of solving classification problems on very large datasets (on the order of millions of data items) in very short time spans (on the order of seconds or minutes, depending on the volume of information), using ordinary computing systems.
Detailed description of the invention
Support vector machines have become a novel tool for pattern recognition. The simplest application of this technique is the binary classification problem, in which only two classes are defined. The underlying idea is to find a separating function for the two classes whose empirical error probability is minimal. This function, using the appropriate transformations, can be represented as a hyperplane.
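The binary-classification idea above can be illustrated with a minimal soft-margin linear SVM trained by sub-gradient descent on the hinge loss. This is only a sketch, not the patent's implementation (the examples later in this document use SMO), and all function names are illustrative:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via sub-gradient descent on the hinge loss.
    X: (n, d) data; y: labels in {-1, +1}. Returns the hyperplane (w, b)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points violating the margin
        # sub-gradient of  1/2 ||w||^2 + C * sum(max(0, 1 - y (w.x + b)))
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

def svm_predict(w, b, X):
    """Assign each row of X to class +1 or -1 by the side of the hyperplane."""
    return np.sign(X @ w + b)
```

The returned pair (w, b) defines the separating hyperplane w.x + b = 0; the sign of w.x + b places a new point into one category or the other.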
The present patent application presents a novel method for the classification of new data, within a very large set, using one or several Support Vector Machines that can work in parallel.
This classification process is carried out in two stages: a first "training" stage of the device and a second classification stage. In the first, starting from an initial population used as a training sample (Figure 1.a), the data are divided into two categories (Figure 1.b). This first separation into two classes can be performed by any known statistical method that groups data sharing some similarity. One possibility is k-means, which partitions a large dataset into k smaller sets, each of which groups the data closest to its mean. Although this is a suitable method, any other that separates the data into two classes appropriate for the analysis to be performed can be used.
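The k-means partitioning mentioned above can be sketched as a plain Lloyd's-algorithm implementation. The names are illustrative, and any equivalent clustering routine would serve:

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm): partition X into k smaller sets,
    each grouping the data closest to its mean. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

With k=2 this yields the initial two-class split of the training sample; larger k values give the smaller groups used later for parallel processing.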
The complete training process is shown schematically in the flowchart of Figure 2. Once all the individuals in the training set are associated with a single class (11), the number of groups into which the dataset will be divided is assigned (12). The aim is to split each class into smaller groups that will later allow parallel processing.
This group assignment can be performed either manually or by a statistical method: again, k-means can be used, as can other density-estimation techniques or even mode-estimation methods based on mixtures and Monte Carlo simulation. The method that best fits the given distribution of the data will be used. The decision to use one mechanism or another will depend either on the person responsible for the classification or on the number of data items and the amount to be loaded into each memory.
Each of these groups will be assigned a centroid (13) that will serve as a reference to later calculate the areas of influence of these groups, that is, the membership of the data in these groups. For this, the Voronoi regions associated with these groups are calculated (Figure 1.c). Voronoi regions are a very simple interpolation method based on computing the Euclidean distance between data. When they are calculated over many points, the area is divided into a series of polygons whose perimeters are equidistant from the neighbouring points. This allows a class to be subdivided into a series of sets that share the characteristics of the class to which they belong. It is done this way, using the previous groups, because computing the Voronoi regions over all the data of the entire class would have a very high computational cost. The division into regions for subsequent parallelization is the key to the success of this method.
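Since the Voronoi region of a centroid is, by definition, the set of points closer to that centroid than to any other, deciding which region a datum belongs to reduces to a nearest-centroid lookup in Euclidean distance. A minimal sketch, with illustrative names:

```python
import numpy as np

def voronoi_region(x, centroids):
    """Index of the Voronoi region containing x: the region of a centroid is
    exactly the set of points nearer to it than to any other centroid."""
    d = np.linalg.norm(centroids - x, axis=1)   # Euclidean distances
    return int(d.argmin())
```

This is why computing the regions over the group centroids (13) rather than over every datum of the class keeps the cost low: only one distance per centroid is needed per query.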
Pairs of Voronoi regions are randomly selected, one from each of the two classes (14), and the training of the SVMs (15a, 15i) begins. The number of SVMs will depend on the electronic architecture chosen to execute this method. The parallel training, key to this invention, substantially decreases the time spent on this operation. The result is a hyperplane that separates all the data of the training sample (Figure 1.d).
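Steps (14)-(15) above, random pair selection followed by parallel per-pair training, can be sketched as below. `train_pair` is a hypothetical stand-in for the real per-pair SVM training (SMO in the patent's examples), and a thread pool stands in for the multiple hardware units:

```python
import random
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def train_pair(pair):
    """Hypothetical stand-in for training one SVM on a pair of Voronoi
    regions (one region per class). A real device would run SMO here; this
    sketch returns the hyperplane through the midpoint of the two region
    means, normal to the line joining them."""
    X_neg, X_pos = pair
    w = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    b = -w @ (X_pos.mean(axis=0) + X_neg.mean(axis=0)) / 2.0
    return w, b

def train_all(regions_neg, regions_pos, n_pairs, workers=4, seed=0):
    """Randomly pair Voronoi regions across the two classes (step 14) and
    train the per-pair SVMs in parallel (steps 15a..15i)."""
    rng = random.Random(seed)
    pairs = [(rng.choice(regions_neg), rng.choice(regions_pos))
             for _ in range(n_pairs)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_pair, pairs))
```

Threads are used here only for brevity; the patent maps each training task to a hardware unit (e.g. a Microblaze), and a process pool would serve the same purpose on a PC.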
The second part of the process is the classification itself, shown in Figure 3. Once all the SVMs have been trained, which means they are able to recognize the class to which any datum of the training set belongs, when a new datum arrives (21), each SVM votes on the category it falls into, based on the pairs of Voronoi regions assigned to that SVM (22a, 22i). The result obtained from each SVM can be used as such or can be weighted according to criteria obtained during training (23a, 23i). This weighting is a correction factor associated with the result of each SVM. To classify the new datum into one of the two categories, the results of all the SVMs are added and it is assigned to the most voted category (24).
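The weighted vote (22)-(24) can be sketched as follows, assuming each trained SVM is represented by a linear hyperplane (w, b) and each weight is the SVM's control-sample accuracy, so that, as in the worked example elsewhere in this document, a 75%-accurate SVM voting for class -1 contributes -0.75:

```python
import numpy as np

def classify(x, svms, weights):
    """Weighted voting over the trained SVMs (steps 22-24): each SVM votes
    sign(w.x + b), scaled by its weight; the sign of the summed votes
    selects the final class."""
    total = sum(wgt * np.sign(w @ x + b) for (w, b), wgt in zip(svms, weights))
    return 1 if total > 0 else -1   # ties fall to class -1 in this sketch
```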
The physical device that performs this categorization comprises a data storage unit, which can be either a single memory or a set of memories, where the information of each cluster is kept: the matrices of individuals, the distance matrices with their associated variables, and any intermediate data needed for the calculations. The device onto which the SVMs are loaded can be built as a custom device (ASIC), as a programmable unit of the FPGA type, or with any other electronic technology that allows its implementation. As with the memories, there may be more than one module of this type. Internally, the electronic device consists of a computation module in which the procedure described in this patent is implemented. It also includes a memory controller, which manages access to the memory banks, and a control unit, which manages and synchronizes all data flow within the device and with external devices.
This system can be developed on a printed circuit board from discrete components, or a custom electronic device can be manufactured that integrates all of these elements, or at least the most important ones, into a single unit. One possible implementation of this invention is a specific printed circuit board that can be inserted into a personal computer or a server. Alternatively, a stand-alone unit including the functionality described above can be developed.
Description of the figures
Figure 1 shows graphically how the data that will be used to train the system are separated before the classification procedure begins.
Figure 2 is a flow chart of the training process.
Figure 3 shows a flow chart of the process of classifying a new data point.
Figures 4 and 5 show diagrams of the electronic devices with which the examples in this patent were developed.
Detailed description of particular embodiments
Example 1
Comparison of a standard sequential procedure with the procedure described in this patent application. In both cases the training is carried out first, followed by the classification of new data.
1.1 Sequential process

For this example a Virtex 5 ML505 FPGA connected to a PC via PCI-Express x1 was used. The FPGA system consists of a Microblaze processor with 4 KB caches and 256 MB of DDR RAM, all connected through a PLB bus.
1.1.a Training
The procedure that combines the sets generated by k-means and produces the training sets runs on the PC. The method begins by writing one of the training sets into the DDR memory of the FPGA; the Microblaze is then notified to start executing a standard sequential training procedure, Sequential Minimal Optimization (SMO). When the training procedure finishes, part of the data with which the SVM was trained is classified in order to verify correct operation. As indicated above, the percentage of correctly classified data can be used to weight the vote of the SVM in the voting system. For example, if the SVM classifies 75% of the control sample correctly and, when classifying a new data point, places it in class -1, the vote of this SVM is worth -0.75. When the SVM finishes the test, the system returns the SVM and its weighting to the PC, and training of the next one begins.
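The weighting rule of this example is simple enough to state in code. A hedged sketch (function names are assumptions): the weight is the fraction of the control sample the SVM got right, and the vote is the raw class scaled by that weight.

```python
def svm_weight(predictions, truth):
    """Fraction of the control sample classified correctly by one SVM."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)

def weighted_vote(raw_vote, weight):
    """Scale the raw {+1, -1} vote by the SVM's control accuracy:
    a vote of -1 from an SVM with 75% accuracy contributes -0.75."""
    return raw_vote * weight
```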
When all the SVMs (k x k of them) have been trained, the system ends the training phase and allows the classification of new data using the SVMs stored during training.
1.1.b Classification
To classify new data, an SVM is loaded into the DDR memory, restored by the Microblaze, and used to classify the new data sent to it from the PC. Once the SVM has finished classifying the new data, it returns the weighted classification of each data point, and the next SVM is loaded to begin its classification of the data.
When all the SVMs have finished, the voting process takes place, which consists of adding up the weighted classification of each of the SVMs. If the result for a data point is greater than zero, it belongs to class +1; if it is less than zero, it belongs to class -1. When the voting process ends, the overall classification of each data point is returned to the PC.
1.2 Parallel process
Since different SVMs are trained on different, mutually independent data, the training and classification of several SVMs can be carried out in parallel, so a system with multiple Microblazes was built. Each Microblaze has access to an exclusive memory area in which it receives the data for training, and to a shared memory area in which the classification data are received.
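Because the k x k training sets are mutually independent, the dispatch loop needs no synchronization between tasks. The sketch below illustrates this on a general-purpose CPU with a thread pool (names are assumptions; the patent dispatches to Microblaze cores, and a CPU-bound trainer would normally use processes rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def train_one(task):
    """Placeholder per-pair trainer: returns a midpoint hyperplane (w, b)
    between the pair's two centroids, standing in for SMO on one core."""
    c_pos, c_neg = task
    w = [p - n for p, n in zip(c_pos, c_neg)]
    b = -sum(wi * (p + n) / 2.0 for wi, p, n in zip(w, c_pos, c_neg))
    return w, b

def train_all(tasks, workers=4):
    """Dispatch the independent training tasks to parallel workers;
    results come back in task order, one model per (pos, neg) pair."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_one, tasks))
```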
1.2.a Training
In this version of the system, the PC maintains a structure recording the availability of the Microblazes. When a Microblaze becomes free, the PC loads the data of an SVM into the memory associated with that Microblaze so that it starts training, as in the sequential version. As long as Microblazes remain available, the PC keeps loading data into their associated memories.
Once the training of all the SVMs has finished and their votes have been weighted, the classification process can begin.
1.2.b Classification
Classification is carried out in a similar way to the sequential version. The PC loads the data to be classified into a shared memory area and the SVMs are loaded onto the different Microblazes. The Microblazes classify the data in parallel and vote on each data point. When all the SVMs have completed the classification process, all the weighted votes are added up, as in the sequential version, and each data point is classified.
1.3 Comparison between the parallel and the sequential procedure
A comparison was made between the standard SMO procedure and the parallel procedure described in this patent application with 10 SVMs in parallel.
The experiment measures the average training and classification time over 1000 data clouds. Each cloud is formed by two Poisson distributions with 5000 points each.
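A benchmark cloud of the kind described above can be generated as follows (illustrative sketch; the Poisson rates are assumptions, as the text does not state them):

```python
import numpy as np

def make_cloud(lam_pos=(20, 20), lam_neg=(5, 5), n=5000, seed=0):
    """Two 2-D Poisson-distributed point clouds with 5000 points each,
    one per class; the lambda values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X_pos = rng.poisson(lam_pos, size=(n, 2)).astype(float)
    X_neg = rng.poisson(lam_neg, size=(n, 2)).astype(float)
    return X_pos, X_neg
```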
The PC on which the experiments were carried out has an Intel i7 processor with 8 GB of RAM.
The average training time for the SMO procedure was 9.19 seconds, compared with 0.89 seconds for the parallel procedure, a speed-up of roughly 10.3x with 10 parallel SVMs. This time difference is due to the independence of the data among the 10 SVMs, which makes the improvement almost linear.

                    Training (s)   Classification (s)
    Sequential SMO      9.19             1.59
    Parallel            0.89             1.23
Example 2
2.1 Comparison of the single-processor and multiprocessor versions
The experiment performed to compare the versions was the same as in the previous example. Using a single-FPGA architecture (figure 4), the sequential FPGA system achieved an average training time of 79.58 seconds, while the parallel version with two Microblazes (figure 5) took 43.27 seconds. As in the comparison between SMO and the parallel process, owing to the independence of the data, adding more processors to train the SVMs reduces the training time almost linearly (a factor of about 1.84 with two processors).
Regarding classification time, the sequential FPGA version took 93.47 seconds, compared with 57.11 seconds for the version with two Microblazes. The classification time also decreases almost linearly, again due to the independence of the data being classified.
              Correctly classified
    1 FPGA          67.53%
    2 FPGA          67.38%

Claims

1. Procedure for the classification of new individuals in a data set using Support Vector Machines, characterized in that it consists first of a procedure for training the system from a known data sample, and subsequently of a procedure for classifying new individuals.
2. Procedure for the classification of new individuals in a data set according to claim 1, wherein the training of the Support Vector Machines (SVMs) that will perform the classification is characterized by the following steps:

a. each data point in the training sample is assigned membership of a given class, within a group of two classes;

b. the number of groups in each of the classes of the training data sample is set;

c. for each class, as many groups as assigned in step b are formed, each data point of the sample being placed in one of these groups;

d. pairs of groups are selected, where each member of the pair belongs to a different class;

e. the Support Vector Machines (SVMs) are trained.
3. Procedure for the classification of new individuals in a data set, where the Support Vector Machines (SVMs) that will perform the classification are trained according to claim 2, characterized in that the number of groups can be assigned either manually or as a function of the number of data points and the maximum size of each group.
4. Procedure for the classification of new individuals in a data set, where the Support Vector Machines (SVMs) that will perform the classification are trained according to claim 2, characterized in that the position of the centroid of each group is calculated using the k-means algorithm.
5. Procedure for the classification of new individuals in a data set, where the Support Vector Machines (SVMs) that will perform the classification are trained according to claim 2, characterized in that the grouping of the data in each class is done using Voronoi regions.
6. Procedure for the classification of new individuals in a data set, where the Support Vector Machines (SVMs) that will perform the classification are trained according to claim 2, characterized in that the different SVMs are all trained in parallel at the same time.
7. Procedure for the classification of new individuals within the data set according to claim 1, characterized in that it consists of the following steps:

a. each Support Vector Machine (SVM) votes on which class the new data point belongs to;

b. once all the Support Vector Machines (SVMs) have voted on the class of the new data point, all the voting results are added up;

c. the new data point is assigned to the class with the most votes.
8. Procedure for the classification of new individuals within the data set according to claim 7, characterized in that the vote cast by each Support Vector Machine (SVM) can be weighted according to a predetermined criterion.
9. Procedure for the classification of new individuals within the data set according to claim 7, characterized in that the vote is carried out simultaneously in all the Support Vector Machines (SVMs).
10. Electronic device for the classification of data, comprising at least one data storage memory, a data processing unit, a communication bus and an input/output interface, characterized in that it is capable of carrying out the procedures described in the preceding claims.
11. Electronic device according to claim 10, characterized in that the data processing unit can be an FPGA, an integrated circuit designed specifically for this task (ASIC) or any other technology that allows its manufacture in an electronic system.
12. Electronic device according to claims 10 and 11, characterized in that the electronic device is integrated in a printed circuit board.
13. Use of the procedure for the classification of new individuals in a data set, according to all the preceding claims, for the evaluation of financial risks.
14. Use of the procedure for the classification of new individuals in a data set, according to all the preceding claims, for searching for texts in documentary databases.
15. Use of the procedure for the classification of new individuals in a data set, according to all the preceding claims, for the search for protein markers in DNA studies.
PCT/ES2012/000293 2012-06-11 2012-11-27 Classification method and device for large volumes of data WO2013186402A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ESP201230903 2012-06-11
ES201230903A ES2438366B1 (en) 2012-06-11 2012-06-11 CLASSIFICATION PROCEDURE AND DEVICE FOR LARGE DATA VOLUMES

Publications (1)

Publication Number Publication Date
WO2013186402A1 true WO2013186402A1 (en) 2013-12-19

Family

ID=49757622

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/ES2012/000293 WO2013186402A1 (en) 2012-06-11 2012-11-27 Classification method and device for large volumes of data

Country Status (2)

Country Link
ES (1) ES2438366B1 (en)
WO (1) WO2013186402A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014176514A2 (en) 2013-04-26 2014-10-30 Genomatica, Inc. Microorganisms and methods for production of 4-hydroxybutyrate, 1,4-butanediol and related compounds
US9424530B2 (en) 2015-01-26 2016-08-23 International Business Machines Corporation Dataset classification quantification

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030200188A1 (en) * 2002-04-19 2003-10-23 Baback Moghaddam Classification with boosted dyadic kernel discriminants

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030200188A1 (en) * 2002-04-19 2003-10-23 Baback Moghaddam Classification with boosted dyadic kernel discriminants

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BAO-LIANG LU ET AL.: "Comparison of parallel and cascade methods for training support vector machines on large-scale problems.", PROCEEDINGS OF 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS (IEEE CAT. NO.04EX826), vol. 5, 30 November 2003 (2003-11-30), PISCATAWAY, NJ, USA, pages 3056 - 3061, XP010760108, DOI: doi:10.1109/ICMLC.2004.1378557 *
COLLOBERT R ET AL.: "A parallel mixture of SVMs for very large scale problems.", NEURAL COMPUTATION., vol. 14, no. 5, May 2002 (2002-05-01), pages 1105 - 1114 *
JIAN-PEI ZHANG ET AL.: "A parallel SVM training algorithm on large-scale classification problems.", PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS (IEEE CAT. NO. 05EX1059) 2005 IEEE, vol. 3, 30 November 2004 (2004-11-30), PISCATAWAY, NJ, USA, pages 1637 - 1641, XP010847010, DOI: doi:10.1109/ICMLC.2005.1527207 *
YAXIN BI ET AL.: "Combining Multiple Classifiers Using Dempster's Rule of Combination for Text Categorization.", MODELING DECISIONS FOR ARTIFICIAL INTELLIGENCE. LECTURE NOTES IN COMPUTER SCIENCE., vol. 3131, 2004, BERLIN HEIDELBERG., pages 127 - 138, XP019009157 *
YI-MIN WEN ET AL.: "A cascade method for reducing training time and the number of support vectors." ADVANCES IN NEURAL NETWORKS - ISNN 2004. INTERNATIONAL SYMPOSIUM ON NEURAL NETWORKS. PROCEEDINGS (LECTURE NOTES IN COMPUT. SCI., vol. 3173, 30 November 2003 (2003-11-30), BERLIN, GERMANY, pages 480 - 486 *
YI-MIN WEN ET AL.: "A confident majority voting strategy for parallel and modular support vector machines.", ADVANCES IN NEURAL NETWORKS. 4TH INTERNATIONAL SYMPOSIUM ON NEURAL NETWORKS, vol. 4493, 30 November 2006 (2006-11-30), BERLIN, GERMANY, pages 525 - 534 *


Also Published As

Publication number Publication date
ES2438366B1 (en) 2014-10-22
ES2438366A1 (en) 2014-01-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12878875

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12878875

Country of ref document: EP

Kind code of ref document: A1