WO2013186402A1 - Classification method and device for large volumes of data - Google Patents


Info

Publication number
WO2013186402A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
classification
procedure
svm
training
Prior art date
Application number
PCT/ES2012/000293
Other languages
Spanish (es)
French (fr)
Inventor
Javier MARTINEZ MOGUERZA
Javier Castillo Villar
José Ignacio MARTINEZ TORRE
David RIOS INSUA
Javier CANO MONTERO
Original Assignee
Universidad Rey Juan Carlos
Priority date
Filing date
Publication date
Application filed by Universidad Rey Juan Carlos
Publication of WO2013186402A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines

Definitions

  • The present invention falls within the field of classification systems for large volumes of data. More specifically, it describes a new method and device that substantially reduces classification time through the use of Support Vector Machines (SVM).
  • SVM: Support Vector Machine
  • Classification and pattern-recognition techniques have a wide field of application. In recent years, areas such as the life sciences, meteorology and financial analysis have begun to use these techniques to statistically study groups of interest within large datasets.
  • In the field of medicine, for example, the search for certain protein markers in breast cancer patients, the study of DNA, and diabetes research have found in these techniques a powerful working tool.
  • In the first case, the problem lies in the need, during the clinical-trial phase, to test the efficacy of a cancer drug while the disease is still at an initial stage. For this, it is necessary to identify which patients are developing it as soon as possible.
  • Another technical field is text mining, where analysing texts in several documents simultaneously represents a very high computational load.
  • In this case, matrices appear whose dimensions are the product of the number of documents by the total vocabulary appearing in each document.
  • Huge matrices are thus generated as a result of the Cartesian product of these two magnitudes. This is, for example, the case of non-targeted web searches.
  • For a set of data, a subset of a larger one (the space), in which each datum belongs to one of two possible categories, the SVM is able to predict whether a new point, whose category we do not know, belongs to one or the other. Before carrying out this classification it is necessary to train the system on control data. Once trained, the SVM searches for a line, more correctly a hyperplane, that optimally separates the points of each of the classes. This concept of optimal separation is where the fundamental characteristic of the use of SVMs resides.
  • Support vector machines have become a novel tool for pattern recognition.
  • The simplest application of this technique is the binary classification problem, in which only two classes are defined.
  • The underlying idea is to find a separating function for the two classes whose empirical error probability is minimal. This function, using the appropriate transformations, can be represented as a hyperplane.
  • The present patent application presents a novel method for the classification of new data, within a very large set, using one or several Support Vector Machines that can work in parallel.
  • This classification process is carried out in two stages: a first "training" stage of the device and a second classification stage.
  • In the first stage, starting from an initial population used as a training sample (Figure 1.a), the data are divided into two categories (Figure 1.b).
  • This first separation into two classes can be done by any known statistical method that groups data that have some similarity.
  • One possibility is k-means, which partitions a large dataset into k smaller sets. Each of these sets groups the data closest to its mean.
  • Although this is a suitable method, any other that separates the data into two classes appropriate for the analysis to be performed can be used.
  • This group assignment can be performed either manually or by a statistical method: again, k-means can be used, as can other density-estimation techniques or even mode-estimation methods based on mixtures and Monte Carlo simulation. The method that best fits the given distribution of the data will be used. The decision to use one mechanism or another will depend either on the person responsible for the classification or on the number of data items and the amount to be loaded into each memory.
  • Voronoi regions are a very simple interpolation method based on computing the Euclidean distance between data. When they are calculated over many points, the area is divided into a series of polygons whose perimeters are equidistant from the neighbouring points. This allows a class to be subdivided into a series of sets that share the characteristics of the class to which they belong. It is done this way, using the previous groups, because computing the Voronoi regions over all the data of the entire class would have a very high computational cost. The division into regions for subsequent parallelization is the key to the success of this method.
  • Pairs of Voronoi regions are randomly selected, one of each of the two classes (14), and the training of SVMs (15a, 15i) is started.
  • The number of SVMs will depend on the electronic architecture chosen to execute this method.
  • The parallel training, key to this invention, substantially decreases the time spent on this operation.
  • The result is a hyperplane that separates all the data of the training sample (Figure 1.d).
  • The second part of the process is the classification itself, shown in Figure 3.
  • Once all the SVMs have been trained, which means they are able to recognize the class to which any datum of the training set belongs, when a new datum arrives (21), each SVM votes on the category it falls into, based on the pairs of Voronoi regions assigned to that SVM (22a, 22i).
  • The result obtained from each SVM can be used as such or can be weighted according to criteria obtained during training (23a, 23i). This weighting is a correction factor associated with the result of each SVM.
  • The results of all the SVMs are added, and the new datum is assigned to the most voted category (24).
  • The physical device that performs this categorization is composed of a data storage unit, which can be either an independent memory or a set of memories, where the information of each cluster is stored: the matrices of individuals, the distance matrices with their associated variables, and any intermediate data needed to perform the calculations.
  • The device onto which the SVMs are loaded can be built as a custom device (ASIC), as a programmable unit of the FPGA type, or with any other electronic technology that allows its implementation. As with the memories, there may be more than one module of this type.
  • Internally, the electronic device will consist of a calculation module in which the procedure described in this patent is implemented. It will also include a memory controller, responsible for managing access to the memory banks, and a control unit that manages and synchronizes all data flow within the device and with external devices.
  • This system can be developed on a printed circuit board from discrete components, or manufactured as an electronic device that integrates all of them, or at least the most important ones, into a single unit.
  • A possible implementation of this invention is the development of a specific printed circuit card that can be inserted into a personal computer or a computer server.
  • Alternatively, an independent unit including the functionality described above can be developed.
  • Figure 1 shows, graphically, how the process of separation of the data that will be used to train the system is performed, before starting the classification procedure.
  • Figure 2 is a flow chart of the training process.
  • Figure 3 shows a flow chart of the process of classifying a new data.
  • Figures 4 and 5 show diagrams of the electronic devices with which the examples in this patent have been developed.
  • A Virtex 5 ML505 FPGA connected to a PC via PCI-Express x1 has been used.
  • the FPGA system is composed of a Microblaze processor with 4 KB cache memory and 256 MB of DDR RAM; all connected through a PLB bus.
  • First, the procedure that combines the sets generated by k-means and generates the training sets is executed.
  • The method begins by writing one of the training sets into the DDR memory of the FPGA; the Microblaze is then notified to begin executing a standard optimization procedure, Sequential Minimal Optimization (SMO).
  • SMO: Sequential Minimal Optimization
  • Next, a classification of part of the data with which the SVM has been trained is performed to verify correct functioning.
  • The percentage of correctly classified data can be used to weight the SVM's vote in the voting system. For example, if the SVM correctly classifies 75% of the control sample and, when classifying a new datum, places it in class -1, the vote of this SVM will be worth -0.75.
  • The system then returns the SVM and its weighting to the PC, and the next one begins to train.
  • For classification, an SVM is loaded into the DDR memory, restored by the Microblaze, and classifies the new data sent to it from the PC. Once this SVM has finished, it returns the weighted classification of each new datum, and the next SVM is loaded to continue the classification of the data.
  • Finally, the voting process is carried out, which consists of adding the weighted classifications of all the SVMs. If, for a given datum, the result is greater than zero, it belongs to class +1; if it is less than zero, it belongs to class -1.
  • Once the voting process is finished, the overall classification of each datum is returned to the PC.
  • Since the training and classification of several SVMs can be done in parallel, a system with multiple Microblazes was built.
  • Each Microblaze has access to an exclusive memory area in which it receives the data for training, and to a shared memory area in which the classification data are received.
  • The PC maintains a structure with the availability of the Microblazes.
  • The PC loads the data of an SVM into the associated memory of an available Microblaze so that it begins training as in the sequential version.
  • The PC continues to load data into the associated memories as Microblazes become available.
  • The classification is done in a similar way to the sequential version.
  • The PC loads the data to be classified into a shared memory area, and the SVMs are loaded into the different Microblazes. The Microblazes classify the data in parallel and vote on each datum. When all the SVMs have completed the classification process, all the weighted votes are added, as in the sequential version, and each datum is classified.
  • The experiment measures the average training and classification time over 1000 data clouds.
  • The clouds are formed by two Poisson distributions with 5000 points each.
  • The PC on which the experiments have been carried out has an Intel i7 processor with 8 GB of RAM.
  • The sequential FPGA version took 93.47 seconds, compared with 57.11 seconds for the version with two Microblazes. The classification time of the data is also observed to decrease almost linearly, due to the independence of the data when they are classified.

Abstract

The invention relates to a classification method for large volumes of data. The method first trains the system, using a known data sample, and subsequently classifies the data. Training, carried out in parallel on different support vector machines (SVM), comprises the following steps: a. assigning each data item of the sample with membership of a predetermined class within a group of two classes; b. assigning the number of groups to be included in each of the classes of the sample; c. for each class, forming as many groups as assigned in step b. and grouping together all the data of the sample into one of said groups; d. selecting pairs of groups in which each of the members of the pair belongs to a different class; and e. training the support vector machine (SVM). Classification comprises the following steps: a. each SVM votes for the class that contains the new data; b. once all the SVMs have voted, all the results of the votes are tallied; and c. the new data item is assigned to the class receiving the most votes.

Description

CLASSIFICATION METHOD AND DEVICE FOR LARGE VOLUMES OF DATA
Technical field to which the invention belongs
The present invention falls within the field of classification systems for large volumes of data. More specifically, it describes a new method and device that substantially reduces classification time through the use of Support Vector Machines (SVM).
State of the art
Classification and pattern-recognition techniques have a wide field of application. In recent years, areas such as the life sciences, meteorology and financial analysis have begun to use these techniques to statistically study groups of interest within large datasets.
In the field of medicine, to cite one, the search for certain protein markers in breast cancer patients, the study of DNA, and diabetes research have found in these techniques a powerful working tool. In the first case, for example, the problem lies in the need, during the clinical-trial phase, to test the efficacy of a cancer drug while the disease is still at an initial stage. For this, it is necessary to identify which patients are developing it as soon as possible.
Another technical field is text mining, where analysing texts in several documents simultaneously represents a very high computational load. In this case, matrices appear whose dimensions are the product of the number of documents by the total vocabulary appearing in each document. Huge matrices are thus generated as a result of the Cartesian product of these two magnitudes. This is, for example, the case of non-targeted web searches.
Several data-analysis techniques are currently being used to attack this type of problem: decision trees, principal component analysis (PCA), Bayesian analysis and neural networks. These techniques produce many "false alarms", that is, classification errors. In cases such as cancer analysis, this means that false alarms cause great anxiety in patients and unnecessary use of biopsies or, on the opposite side, failure to detect the disease in patients who are developing it.
A novel alternative is Support Vector Machines (SVM from now on), which offer a new approach to these pattern-classification problems and are especially robust for high-dimensional data, where other classification systems collapse due to the high computational resources they require.
For a set of data, a subset of a larger one (the space), in which each datum belongs to one of two possible categories, the SVM is able to predict whether a new point, whose category we do not know, belongs to one or the other. Before carrying out this classification it is necessary to train the system on control data. Once trained, the SVM searches for a line, more correctly a hyperplane, that optimally separates the points of each of the classes. This concept of optimal separation is where the fundamental characteristic of the use of SVMs resides.
The technical literature has begun to describe the use of several SVMs that process information simultaneously, as in patents US7519563 and US7865898, which discuss the possibility of splitting the data supplied to the system to make it computationally more effective, but which do not solve the problem of how to classify information with this architecture.
Technical problem to be solved
At present, when a very large volume of data is to be classified, on the order of 10 million items or more, the techniques that can be used are nearest neighbours or Bayesian inference methods, whose computational cost is very high. The time needed to solve these problems is several hours, and on occasion it is not possible to find a solution to the problem posed due to the collapse of the computing system. In 2010, a technique based on the use of decision trees was proposed [Chang, Fu, Guo, Chien-Yanh, et al. Tree decomposition for large-scale SVM problems, Journal of Machine Learning Research 11 (2010) 2855-2892] that can solve some of these problems, but with much smaller dimensions.
This is why the technical problem this invention solves is the development of a new method, and of an electronic device, capable of solving classification problems on very large datasets (on the order of millions of data items) in very short time spans (on the order of seconds or minutes, depending on the volume of information), using ordinary computing systems.
Detailed description of the invention
Support vector machines have become a novel tool for pattern recognition. The simplest application of this technique is the binary classification problem, in which only two classes are defined. The underlying idea is to find a separating function for the two classes whose empirical error probability is minimal. This function, using the appropriate transformations, can be represented as a hyperplane.
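The binary-classification idea above can be illustrated with a minimal soft-margin linear SVM trained by sub-gradient descent on the hinge loss. This is only a sketch, not the patent's implementation (the examples later in this document use SMO), and all function names are illustrative:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via sub-gradient descent on the hinge loss.
    X: (n, d) data; y: labels in {-1, +1}. Returns the hyperplane (w, b)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points violating the margin
        # sub-gradient of  1/2 ||w||^2 + C * sum(max(0, 1 - y (w.x + b)))
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

def svm_predict(w, b, X):
    """Assign each row of X to class +1 or -1 by the side of the hyperplane."""
    return np.sign(X @ w + b)
```

The returned pair (w, b) defines the separating hyperplane w.x + b = 0; the sign of w.x + b places a new point into one category or the other.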
The present patent application presents a novel method for the classification of new data, within a very large set, using one or several Support Vector Machines that can work in parallel.
This classification process is carried out in two stages: a first "training" stage of the device and a second classification stage. In the first, starting from an initial population used as a training sample (Figure 1.a), the data are divided into two categories (Figure 1.b). This first separation into two classes can be performed by any known statistical method that groups data sharing some similarity. One possibility is k-means, which partitions a large dataset into k smaller sets, each of which groups the data closest to its mean. Although this is a suitable method, any other that separates the data into two classes appropriate for the analysis to be performed can be used.
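The k-means partitioning mentioned above can be sketched as a plain Lloyd's-algorithm implementation. The names are illustrative, and any equivalent clustering routine would serve:

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm): partition X into k smaller sets,
    each grouping the data closest to its mean. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

With k=2 this yields the initial two-class split of the training sample; larger k values give the smaller groups used later for parallel processing.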
The complete training process is shown schematically in the flowchart of Figure 2. Once all the individuals in the training set are associated with a single class (11), the number of groups into which the dataset will be divided is assigned (12). The aim is to split each class into smaller groups that will later allow parallel processing.
This group assignment can be performed either manually or by a statistical method: again, k-means can be used, as can other density-estimation techniques or even mode-estimation methods based on mixtures and Monte Carlo simulation. The method that best fits the given distribution of the data will be used. The decision to use one mechanism or another will depend either on the person responsible for the classification or on the number of data items and the amount to be loaded into each memory.
Each of these groups will be assigned a centroid (13) that will serve as a reference to later calculate the areas of influence of these groups, that is, the membership of the data in these groups. For this, the Voronoi regions associated with these groups are calculated (Figure 1.c). Voronoi regions are a very simple interpolation method based on computing the Euclidean distance between data. When they are calculated over many points, the area is divided into a series of polygons whose perimeters are equidistant from the neighbouring points. This allows a class to be subdivided into a series of sets that share the characteristics of the class to which they belong. It is done this way, using the previous groups, because computing the Voronoi regions over all the data of the entire class would have a very high computational cost. The division into regions for subsequent parallelization is the key to the success of this method.
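Since the Voronoi region of a centroid is, by definition, the set of points closer to that centroid than to any other, deciding which region a datum belongs to reduces to a nearest-centroid lookup in Euclidean distance. A minimal sketch, with illustrative names:

```python
import numpy as np

def voronoi_region(x, centroids):
    """Index of the Voronoi region containing x: the region of a centroid is
    exactly the set of points nearer to it than to any other centroid."""
    d = np.linalg.norm(centroids - x, axis=1)   # Euclidean distances
    return int(d.argmin())
```

This is why computing the regions over the group centroids (13) rather than over every datum of the class keeps the cost low: only one distance per centroid is needed per query.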
Pairs of Voronoi regions are randomly selected, one from each of the two classes (14), and the training of the SVMs (15a, 15i) begins. The number of SVMs will depend on the electronic architecture chosen to execute this method. The parallel training, key to this invention, substantially decreases the time spent on this operation. The result is a hyperplane that separates all the data of the training sample (Figure 1.d).
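Steps (14)-(15) above, random pair selection followed by parallel per-pair training, can be sketched as below. `train_pair` is a hypothetical stand-in for the real per-pair SVM training (SMO in the patent's examples), and a thread pool stands in for the multiple hardware units:

```python
import random
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def train_pair(pair):
    """Hypothetical stand-in for training one SVM on a pair of Voronoi
    regions (one region per class). A real device would run SMO here; this
    sketch returns the hyperplane through the midpoint of the two region
    means, normal to the line joining them."""
    X_neg, X_pos = pair
    w = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    b = -w @ (X_pos.mean(axis=0) + X_neg.mean(axis=0)) / 2.0
    return w, b

def train_all(regions_neg, regions_pos, n_pairs, workers=4, seed=0):
    """Randomly pair Voronoi regions across the two classes (step 14) and
    train the per-pair SVMs in parallel (steps 15a..15i)."""
    rng = random.Random(seed)
    pairs = [(rng.choice(regions_neg), rng.choice(regions_pos))
             for _ in range(n_pairs)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_pair, pairs))
```

Threads are used here only for brevity; the patent maps each training task to a hardware unit (e.g. a Microblaze), and a process pool would serve the same purpose on a PC.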
The second part of the process is the classification itself, shown in Figure 3. Once all the SVMs have been trained, which means they are able to recognize the class to which any datum of the training set belongs, when a new datum arrives (21), each SVM votes on the category it falls into, based on the pairs of Voronoi regions assigned to that SVM (22a, 22i). The result obtained from each SVM can be used as such or can be weighted according to criteria obtained during training (23a, 23i). This weighting is a correction factor associated with the result of each SVM. To classify the new datum into one of the two categories, the results of all the SVMs are added and it is assigned to the most voted category (24).
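The weighted vote (22)-(24) can be sketched as follows, assuming each trained SVM is represented by a linear hyperplane (w, b) and each weight is the SVM's control-sample accuracy, so that, as in the worked example elsewhere in this document, a 75%-accurate SVM voting for class -1 contributes -0.75:

```python
import numpy as np

def classify(x, svms, weights):
    """Weighted voting over the trained SVMs (steps 22-24): each SVM votes
    sign(w.x + b), scaled by its weight; the sign of the summed votes
    selects the final class."""
    total = sum(wgt * np.sign(w @ x + b) for (w, b), wgt in zip(svms, weights))
    return 1 if total > 0 else -1   # ties fall to class -1 in this sketch
```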
The physical device that performs this categorization comprises a data storage unit, which can be either a single memory or a set of memories, where the information of each cluster is kept: the matrices of individuals, the distance matrices with their associated variables, and any intermediate data needed for the calculations. The device onto which the SVMs are loaded can be built as a custom device (ASIC), as a programmable unit of the FPGA type, or with any other electronic technology that allows its implementation. As with the memories, there may be more than one module of this type. Internally, the electronic device consists of a computation module in which the procedure described in this patent is implemented. It also includes a memory controller, which manages access to the memory banks, and a control unit, which manages and synchronizes all data flow within the device and with external devices.
This system can be developed on a printed circuit board from discrete components, or a custom electronic device can be manufactured that integrates all of these elements, or at least the most important ones, into a single unit. One possible implementation of this invention is a specific printed circuit board that can be inserted into a personal computer or a server. Alternatively, a stand-alone unit including the functionality described above can be developed.
Description of the figures
Figure 1 shows graphically how the data that will be used to train the system are separated before the classification procedure begins.
Figure 2 is a flow chart of the training process.
Figure 3 shows a flow chart of the process of classifying a new data point.
Figures 4 and 5 show diagrams of the electronic devices with which the examples in this patent were developed.
Detailed description of particular embodiments
Example 1
Comparison of a standard sequential procedure with the procedure described in this patent application. In both cases the training is carried out first, followed by the classification of new data.
1.1 Sequential process

For this example a Virtex 5 ML505 FPGA connected to a PC via PCI-Express x1 was used. The FPGA system consists of a Microblaze processor with 4 KB caches and 256 MB of DDR RAM, all connected through a PLB bus.
1.1.a Training
The procedure that combines the sets generated by k-means and produces the training sets runs on the PC. The method begins by writing one of the training sets into the DDR memory of the FPGA; the Microblaze is then notified to start executing a standard sequential training procedure, Sequential Minimal Optimization (SMO). When the training procedure finishes, part of the data with which the SVM was trained is classified in order to verify correct operation. As indicated above, the percentage of correctly classified data can be used to weight the vote of the SVM in the voting system. For example, if the SVM classifies 75% of the control sample correctly and, when classifying a new data point, places it in class -1, the vote of this SVM is worth -0.75. When the SVM finishes the test, the system returns the SVM and its weighting to the PC, and training of the next one begins.
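The weighting rule of this example is simple enough to state in code. A hedged sketch (function names are assumptions): the weight is the fraction of the control sample the SVM got right, and the vote is the raw class scaled by that weight.

```python
def svm_weight(predictions, truth):
    """Fraction of the control sample classified correctly by one SVM."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)

def weighted_vote(raw_vote, weight):
    """Scale the raw {+1, -1} vote by the SVM's control accuracy:
    a vote of -1 from an SVM with 75% accuracy contributes -0.75."""
    return raw_vote * weight
```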
When all the SVMs (k x k of them) have been trained, the system ends the training phase and allows the classification of new data using the SVMs stored during training.
1.1.b Classification
To classify new data, an SVM is loaded into the DDR memory, restored by the Microblaze, and used to classify the new data sent to it from the PC. Once the SVM has finished classifying the new data, it returns the weighted classification of each data point, and the next SVM is loaded to begin its classification of the data.
When all the SVMs have finished, the voting process takes place, which consists of adding up the weighted classification of each of the SVMs. If the result for a data point is greater than zero, it belongs to class +1; if it is less than zero, it belongs to class -1. When the voting process ends, the overall classification of each data point is returned to the PC.
1.2 Parallel process
Since different SVMs are trained on different, mutually independent data, the training and classification of several SVMs can be carried out in parallel, so a system with multiple Microblazes was built. Each Microblaze has access to an exclusive memory area in which it receives the data for training, and to a shared memory area in which the classification data are received.
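Because the k x k training sets are mutually independent, the dispatch loop needs no synchronization between tasks. The sketch below illustrates this on a general-purpose CPU with a thread pool (names are assumptions; the patent dispatches to Microblaze cores, and a CPU-bound trainer would normally use processes rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def train_one(task):
    """Placeholder per-pair trainer: returns a midpoint hyperplane (w, b)
    between the pair's two centroids, standing in for SMO on one core."""
    c_pos, c_neg = task
    w = [p - n for p, n in zip(c_pos, c_neg)]
    b = -sum(wi * (p + n) / 2.0 for wi, p, n in zip(w, c_pos, c_neg))
    return w, b

def train_all(tasks, workers=4):
    """Dispatch the independent training tasks to parallel workers;
    results come back in task order, one model per (pos, neg) pair."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_one, tasks))
```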
1.2.a Training
In this version of the system, the PC maintains a structure recording the availability of the Microblazes. When a Microblaze becomes free, the PC loads the data of an SVM into the memory associated with that Microblaze so that it starts training, as in the sequential version. As long as Microblazes remain available, the PC keeps loading data into their associated memories.
Once the training of all the SVMs has finished and their votes have been weighted, the classification process can begin.
1.2.b Classification
Classification is carried out in a similar way to the sequential version. The PC loads the data to be classified into a shared memory area and the SVMs are loaded onto the different Microblazes. The Microblazes classify the data in parallel and vote on each data point. When all the SVMs have completed the classification process, all the weighted votes are added up, as in the sequential version, and each data point is classified.
1.3 Comparison between the parallel and the sequential procedure
A comparison was made between the standard SMO procedure and the parallel procedure described in this patent application with 10 SVMs in parallel.
The experiment measures the average training and classification time over 1000 data clouds. Each cloud is formed by two Poisson distributions with 5000 points each.
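A benchmark cloud of the kind described above can be generated as follows (illustrative sketch; the Poisson rates are assumptions, as the text does not state them):

```python
import numpy as np

def make_cloud(lam_pos=(20, 20), lam_neg=(5, 5), n=5000, seed=0):
    """Two 2-D Poisson-distributed point clouds with 5000 points each,
    one per class; the lambda values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X_pos = rng.poisson(lam_pos, size=(n, 2)).astype(float)
    X_neg = rng.poisson(lam_neg, size=(n, 2)).astype(float)
    return X_pos, X_neg
```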
The PC on which the experiments were carried out has an Intel i7 processor with 8 GB of RAM.
The average training time for the SMO procedure was 9.19 seconds, compared with 0.89 seconds for the parallel procedure, a speed-up of roughly 10.3x with 10 parallel SVMs. This time difference is due to the independence of the data among the 10 SVMs, which makes the improvement almost linear.

                    Training (s)   Classification (s)
    Sequential SMO      9.19             1.59
    Parallel            0.89             1.23
Example 2
2.1 Comparison of the single-processor and multiprocessor versions
The experiment performed to compare the versions was the same as in the previous example. Using a single-FPGA architecture (figure 4), the sequential FPGA system achieved an average training time of 79.58 seconds, while the parallel version with two Microblazes (figure 5) took 43.27 seconds. As in the comparison between SMO and the parallel process, owing to the independence of the data, adding more processors to train the SVMs reduces the training time almost linearly (a factor of about 1.84 with two processors).
Regarding classification time, the sequential FPGA version took 93.47 seconds, compared with 57.11 seconds for the version with two Microblazes. The classification time also decreases almost linearly, again due to the independence of the data being classified.
              Correctly classified
    1 FPGA          67.53%
    2 FPGA          67.38%

Claims

1. Procedure for the classification of new individuals in a data set using Support Vector Machines, characterized in that it consists first of a procedure for training the system from a known data sample, and subsequently of a procedure for classifying new individuals.
2. Procedure for the classification of new individuals in a data set according to claim 1, wherein the training of the Support Vector Machines (SVMs) that will perform the classification is characterized by the following steps:

a. each data point in the training sample is assigned membership of a given class, within a group of two classes;

b. the number of groups in each of the classes of the training data sample is set;

c. for each class, as many groups as assigned in step b are formed, each data point of the sample being placed in one of these groups;

d. pairs of groups are selected, where each member of the pair belongs to a different class;

e. the Support Vector Machines (SVMs) are trained.
3. Procedure for the classification of new individuals in a data set, where the Support Vector Machines (SVMs) that will perform the classification are trained according to claim 2, characterized in that the number of groups can be assigned either manually or as a function of the number of data points and the maximum size of each group.
4. Procedure for the classification of new individuals in a data set, where the Support Vector Machines (SVMs) that will perform the classification are trained according to claim 2, characterized in that the position of the centroid of each group is calculated using the k-means algorithm.
5. Procedure for the classification of new individuals in a data set, where the Support Vector Machines (SVMs) that will perform the classification are trained according to claim 2, characterized in that the grouping of the data in each class is done using Voronoi regions.
6. Procedure for the classification of new individuals in a data set, where the Support Vector Machines (SVMs) that will perform the classification are trained according to claim 2, characterized in that the different SVMs are all trained in parallel at the same time.
7. Procedure for the classification of new individuals within the data set according to claim 1, characterized in that it consists of the following steps:

a. each Support Vector Machine (SVM) votes on which class the new data point belongs to;

b. once all the Support Vector Machines (SVMs) have voted on the class of the new data point, all the voting results are added up;

c. the new data point is assigned to the class with the most votes.
8. Procedure for the classification of new individuals within the data set according to claim 7, characterized in that the vote cast by each Support Vector Machine (SVM) can be weighted according to a predetermined criterion.
9. Procedure for the classification of new individuals within the data set according to claim 7, characterized in that the vote is carried out simultaneously in all the Support Vector Machines (SVMs).
10. Electronic device for the classification of data, comprising at least one data storage memory, a data processing unit, a communication bus and an input/output interface, characterized in that it is capable of carrying out the procedures described in the preceding claims.
11. Electronic device according to claim 10, characterized in that the data processing unit can be an FPGA, an integrated circuit designed specifically for this task (ASIC) or any other technology that allows its manufacture in an electronic system.
12. Electronic device according to claims 10 and 11, characterized in that the electronic device is integrated in a printed circuit board.
13. Use of the procedure for the classification of new individuals in a data set, according to all the preceding claims, for the evaluation of financial risks.
14. Use of the procedure for the classification of new individuals in a data set, according to all the preceding claims, for searching for texts in documentary databases.
15. Use of the procedure for the classification of new individuals in a data set, according to all the preceding claims, for the search for protein markers in DNA studies.
PCT/ES2012/000293 2012-06-11 2012-11-27 Classification method and device for large volumes of data WO2013186402A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ESP201230903 2012-06-11
ES201230903A ES2438366B1 (en) 2012-06-11 2012-06-11 CLASSIFICATION PROCEDURE AND DEVICE FOR LARGE DATA VOLUMES

Publications (1)

Publication Number Publication Date
WO2013186402A1 true WO2013186402A1 (en) 2013-12-19

Family

ID=49757622

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/ES2012/000293 WO2013186402A1 (en) 2012-06-11 2012-11-27 Classification method and device for large volumes of data

Country Status (2)

Country Link
ES (1) ES2438366B1 (en)
WO (1) WO2013186402A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014176514A2 (en) 2013-04-26 2014-10-30 Genomatica, Inc. Microorganisms and methods for production of 4-hydroxybutyrate, 1,4-butanediol and related compounds
US9424530B2 (en) 2015-01-26 2016-08-23 International Business Machines Corporation Dataset classification quantification

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030200188A1 (en) * 2002-04-19 2003-10-23 Baback Moghaddam Classification with boosted dyadic kernel discriminants

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030200188A1 (en) * 2002-04-19 2003-10-23 Baback Moghaddam Classification with boosted dyadic kernel discriminants

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BAO-LIANG LU ET AL.: "Comparison of parallel and cascade methods for training support vector machines on large-scale problems.", PROCEEDINGS OF 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS (IEEE CAT. NO.04EX826), vol. 5, 30 November 2003 (2003-11-30), PISCATAWAY, NJ, USA, pages 3056 - 3061, XP010760108, DOI: doi:10.1109/ICMLC.2004.1378557 *
COLLOBERT R ET AL.: "A parallel mixture of SVMs for very large scale problems.", NEURAL COMPUTATION., vol. 14, no. 5, May 2002 (2002-05-01), pages 1105 - 1114 *
JIAN-PEI ZHANG ET AL.: "A parallel SVM training algorithm on large-scale classification problems.", PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS (IEEE CAT. NO. 05EX1059) 2005 IEEE, vol. 3, 30 November 2004 (2004-11-30), PISCATAWAY, NJ, USA, pages 1637 - 1641, XP010847010, DOI: doi:10.1109/ICMLC.2005.1527207 *
YAXIN BI ET AL.: "Combining Multiple Classifiers Using Dempster's Rule of Combination for Text Categorization.", MODELING DECISIONS FOR ARTIFICIAL INTELLIGENCE. LECTURE NOTES IN COMPUTER SCIENCE., vol. 3131, 2004, BERLIN HEIDELBERG., pages 127 - 138, XP019009157 *
YI-MIN WEN ET AL.: "A cascade method for reducing training time and the number of support vectors." ADVANCES IN NEURAL NETWORKS - ISNN 2004. INTERNATIONAL SYMPOSIUM ON NEURAL NETWORKS. PROCEEDINGS (LECTURE NOTES IN COMPUT. SCI., vol. 3173, 30 November 2003 (2003-11-30), BERLIN, GERMANY, pages 480 - 486 *
YI-MIN WEN ET AL.: "A confident majority voting strategy for parallel and modular support vector machines.", ADVANCES IN NEURAL NETWORKS. 4TH INTERNATIONAL SYMPOSIUM ON NEURAL NETWORKS, vol. 4493, 30 November 2006 (2006-11-30), BERLIN, GERMANY, pages 525 - 534 *


Also Published As

Publication number Publication date
ES2438366B1 (en) 2014-10-22
ES2438366A1 (en) 2014-01-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12878875

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12878875

Country of ref document: EP

Kind code of ref document: A1