CN103838617A

CN103838617A - Method for constructing data mining platform in big data environment

Info

Publication number: CN103838617A
Application number: CN201410055529.3A
Authority: CN
Inventors: 叶枫; 王亚普; 周发超; 周远超
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2014-02-18
Filing date: 2014-02-18
Publication date: 2014-06-04

Abstract

The invention discloses a method for constructing a data mining platform in a big data environment. The platform is suited to processing data sets of different scales and types, and it analyzes and displays data using the rich functionality of the R language. The system architecture, shown in the accompanying figure, comprises a physical layer, a virtualization layer, a service layer and an application layer, from bottom to top. Heterogeneous hardware resources are deployed on the physical layer. On the virtualization layer, a virtual machine cluster is built with CloudStack, and a Hadoop environment is then deployed on that cluster. On the service layer, the R language is integrated, and multiple data mining functions are implemented and packaged as services. On the application layer, a clear operating interface lets the user customize workflows and configure parameters. The method processes big data effectively, displays analysis results effectively, and achieves high processing efficiency.

Description

Method for constructing a data mining platform in a big data environment

Technical field

The present invention relates to a method for constructing a data mining platform in a big data environment. The platform combines technologies such as cloud computing, virtualization and Hadoop, integrates the R language, is suited to processing data sets of different scales and types, and lets users perform data mining and analysis through a Web interface.

Background technology

With the advance of informatization, enterprises and public institutions generate or already hold massive amounts of business data containing a large amount of unknown, potentially valuable information. Data mining is a newer business information processing technology that is widely applied in fields such as banking, telecommunications, insurance, transportation and retail. By extracting, transforming, analyzing and otherwise modeling large volumes of business data, it can extract key information that supports correct decision-making. As data volumes keep growing, the mining and analysis of big data receives increasing attention. However, single-machine analysis is limited by memory size and computing power, so traditional data mining and analysis methods are no longer effective in a big data environment.

The emergence of cloud computing offers an effective way to solve big data problems. Cloud computing and virtualization technology can integrate infrastructure resources effectively and provide computing and storage capacity for mining and analyzing big data. Hadoop is an open-source implementation of the MapReduce programming model and provides a usable framework for computing on and storing big data. The open-source software R is a currently popular language for data analysis and statistical graphics; it has rich analysis modules and utilities and is widely used in industry. To fully mine and analyze the value of big data and to give users powerful data mining and analysis functions, designing an easy-to-use big data mining platform that integrates the R language has good practical value.

Summary of the invention

Object of the invention: the invention provides a method for constructing a data mining platform in a big data environment. With the R language integrated as the data analysis engine, the platform is designed to handle data mining in a big data environment. Using the platform, users can solve typical data mining problems such as customer segmentation, cross-selling, customer churn analysis and customer credit evaluation.

To achieve this object, the architecture of the constructed system is as follows:

Physical layer: consists of hardware such as servers, PCs and network equipment, and provides the necessary hardware foundation for big data processing.

Virtualization layer: the open-source cloud platform solution CloudStack 4.0 is used to build a virtual machine cluster and integrate infrastructure resources, providing the whole system with scalable, manageable computing and storage capacity. A Hadoop environment and a MySQL cluster are then deployed on the virtual machines to support reading, writing and storing big data.

Service layer: an RHadoop environment is deployed so that the R engine can run on the Hadoop cluster. This makes full use of R's strengths in statistical computing and graphics, while Hadoop's parallel computing capability and scalability make up for R's shortcomings when processing big data. Services are developed that encapsulate implementations of commonly used data mining methods, covering 10 data mining algorithms in 4 categories: decision tree, support vector machine (SVM) and neural network algorithms for classification and prediction; K-means, PAM, CLARA, AGNES and DIANA algorithms for cluster analysis; multiple regression for regression analysis; and the ARIMA model for time series analysis.

Application layer: presents the functions implemented by the service layer to the user through a Web interface. The user can build an analysis workflow that includes setting the data source, selecting the analysis method, setting the analysis parameters, running data mining and analysis, and producing and displaying the analysis results.
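
As an illustration of the service layer's algorithm inventory, the following minimal sketch records the ten algorithms in their four categories. The dictionary name, the identifiers and the helper functions are hypothetical and chosen only for this example; the text does not specify how the services are organized internally.

```python
# Hypothetical catalog of the ten algorithms named for the service layer,
# grouped into the four categories listed in the text.
ALGORITHM_CATALOG = {
    "classification_and_prediction": ["decision_tree", "svm", "neural_network"],
    "cluster_analysis": ["kmeans", "pam", "clara", "agnes", "diana"],
    "regression_analysis": ["multiple_regression"],
    "time_series_analysis": ["arima"],
}


def category_of(algorithm: str) -> str:
    """Return the category that an algorithm identifier belongs to."""
    for category, names in ALGORITHM_CATALOG.items():
        if algorithm in names:
            return category
    raise KeyError(f"unknown algorithm: {algorithm}")


def total_algorithms() -> int:
    """Count the algorithms across all categories."""
    return sum(len(names) for names in ALGORITHM_CATALOG.values())
```

A Web front end could use such a catalog to populate the "select analysis method" step, although the actual platform may organize its services differently.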

Technical solution: a method for constructing a data mining platform in a big data environment, comprising the following steps:

Step 1: infrastructure virtualization. Virtualization technology enables the integration, sharing and utilization of host and storage resources. Infrastructure virtualization includes server virtualization, storage virtualization and network virtualization. Virtualization is carried out mainly from two aspects by establishing two virtual pools: a computing virtualization pool and a storage virtualization pool. The computing virtualization pool mainly implements application virtualization; at the computing resource level it includes server virtualization and application middleware virtualization. The storage virtualization pool mainly implements data storage virtualization; at the storage level it includes virtualization of the storage hardware architecture and virtualization of the storage software. Following this approach, the invention sets up hardware such as a host, a management node, multiple computing nodes and network equipment to provide the necessary hardware foundation for big data processing.

Step 2: deploy the virtual appliance, i.e. the virtual machine instantiation stage. The process is roughly divided into the following steps:

(1) select a virtual appliance and customize it;

(2) save the customization parameter file;

(3) select the target physical server for deployment;

(4) copy the files associated with the virtual appliance;

(5) start the deployed virtual appliance on the target machine.
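
The five deployment steps above can be sketched as a small local simulation. This is only an illustration under stated assumptions: directories stand in for the appliance template and the target physical server, the parameter file name `customization.json` and the default settings are hypothetical, and no real hypervisor or CloudStack API is called.

```python
import json
import shutil
from pathlib import Path


def deploy_appliance(template_dir: Path, staging_dir: Path,
                     target_dir: Path, params: dict) -> dict:
    """Simulate the five deployment steps on the local filesystem."""
    # (1) Select the appliance (template_dir) and customize it by
    #     overriding hypothetical default settings.
    config = {"cpu_cores": 1, "memory_mb": 1024, **params}

    # (2) Save the customization parameter file.
    staging_dir.mkdir(parents=True, exist_ok=True)
    param_file = staging_dir / "customization.json"
    param_file.write_text(json.dumps(config, indent=2))

    # (3) The target physical server is modeled as a directory chosen by the caller.
    target_dir.mkdir(parents=True, exist_ok=True)

    # (4) Copy the appliance files and the parameter file to the target.
    for item in sorted(template_dir.iterdir()):
        if item.is_file():
            shutil.copy2(item, target_dir / item.name)
    shutil.copy2(param_file, target_dir / param_file.name)

    # (5) Starting the appliance is represented by a status record only.
    return {"target": str(target_dir), "config": config, "state": "started"}
```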

Step 3: install the open-source cloud computing solution CloudStack. With CloudStack as the foundation, a user can quickly and easily create a private cloud computing platform on existing infrastructure. The installation process mainly includes the following steps:

(1) configure the installation source (required on both the management node and the computing nodes);

(2) install the CloudStack Management Server;

(3) install the MySQL database;

(4) install the host (HOST);

(5) configure the security policy, network bridge, firewall, NFS share, etc.

Step 4: service layer. Deploy the RHadoop environment so that the R engine can run on the Hadoop cluster. To hide the complexity of the R language, the JRI dynamic link library needs to be configured; the actual computation is performed by calling R at the bottom layer.

Step 5: process massive data in a relational database. R and Hadoop are combined to operate on large-scale data in a relational database. The invention adopts a solution that reads and processes massive numbers of relational database records more efficiently: the open-source tool Sqoop exports the large volume of data to be analyzed as text data files and uploads them to HDFS, so that the task becomes distributed processing of a text data set.

Step 6: workflow-based operation. The functions implemented by the service layer are presented to the user through a Web interface. The user can define an analysis workflow according to their own needs, including setting the data source, selecting the analysis method, setting the analysis parameters, running data mining and analysis, and producing and displaying the analysis results.

By adopting the above technical solution, the present invention has the following beneficial effects:

(1) Cloud computing and virtualization technology are used to integrate infrastructure resources, providing the platform with computing and storage capacity that is easy to manage centrally and highly scalable.

(2) The most suitable processing mode is used for data sets of different scales: when a data set is too large for single-machine processing, the Hadoop cluster provides support. In addition, Hadoop's multi-replica storage policy, the heartbeat mechanism during task execution, and database clustering and replication ensure that the platform has high fault tolerance.

(3) To make the data mining algorithms extensible, multiple design patterns are used to optimize the interface design, so that the parameter configuration interface of the presentation layer is loosely coupled to the R logic that analyzes the data.

(4) Mainstream data mining algorithms are provided, and three major classes of data are supported: structured data (MySQL, SQL Server, and files in formats such as txt, csv and xls), semi-structured data (files in formats such as XML and HTML), and unstructured data (image files such as jpg and bmp, and GIS base maps).

(5) Eight kinds of open-source software are integrated in the platform, giving a high cost-performance ratio.
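
Effect (2) describes choosing a processing mode according to data scale. A minimal sketch of such a decision follows; the function name, the mode labels and the `memory_fraction` threshold are assumptions made for illustration, since the text does not state how the platform decides that single-machine processing is no longer feasible.

```python
def choose_processing_mode(dataset_bytes: int,
                           available_memory_bytes: int,
                           memory_fraction: float = 0.5) -> str:
    """Pick single-machine R or the Hadoop cluster based on data size.

    memory_fraction is a hypothetical safety margin: the data set must fit
    within this share of memory to be processed on a single machine.
    """
    if dataset_bytes < 0 or available_memory_bytes <= 0:
        raise ValueError("sizes must be non-negative and memory must be positive")
    limit = available_memory_bytes * memory_fraction
    return "single_machine" if dataset_bytes <= limit else "hadoop_cluster"
```

With the default margin, a data set of 600 bytes against 1000 bytes of memory would be routed to the cluster, while 100 bytes would stay on a single machine.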

Brief description of the drawings

Fig. 1 is the architecture diagram of the data mining platform in a big data environment.

Fig. 2 is the business flow chart of the application layer.

Embodiment

The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the invention and not to limit its scope. After reading the present disclosure, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.

The architecture of the data mining platform in a big data environment is shown in Fig. 1. The construction method comprises the following steps:

Step 1: infrastructure virtualization. Virtualization technology enables the integration, sharing and utilization of host and storage resources, which improves resource utilization, reduces costs and reduces management complexity. Infrastructure virtualization includes server virtualization, storage virtualization and network virtualization. The invention carries out virtualization mainly from two aspects by establishing two virtual pools: a computing virtualization pool and a storage virtualization pool. The computing virtualization pool mainly implements application virtualization; at the computing resource level it includes server virtualization and application middleware virtualization. The storage virtualization pool mainly implements data storage virtualization; at the storage level it includes virtualization of the storage hardware architecture and virtualization of the storage software. Following this approach, the invention sets up hardware such as a host, a management node, multiple computing nodes and network equipment to provide the necessary hardware foundation for big data processing.

Step 2: deploy the virtual appliance, i.e. the virtual machine instantiation stage. The process is roughly divided into the following steps:

(1) select a virtual appliance and customize it;

(2) save the customization parameter file;

(3) select the target physical server for deployment;

(4) copy the files associated with the virtual appliance;

(5) start the deployed virtual appliance on the target physical server.

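Step 3: install the open-source cloud computing solution CloudStack. The installation process mainly includes the following steps:

(1) configure the installation source (required on both the management node and the computing nodes);

(2) install the CloudStack Management Server;

(3) install the MySQL database;

(4) install the host (HOST);

(5) configure the security policy, network bridge, firewall, NFS share, etc.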

Step 4: service layer. Deploy the RHadoop environment so that the R engine can run on the Hadoop cluster. This makes full use of R's strengths in statistical computing and graphics, while Hadoop's parallel computing capability and scalability make up for R's shortcomings when processing big data. The configuration steps are: 1. install the Ubuntu operating system; 2. set up the Java environment; 3. set up the Hadoop environment; 4. install the dependency packages (rmr, rhdfs, rhbase). To hide the complexity of the R language, Rserve or the JRI dynamic link library needs to be configured so that R can be called across platforms; the actual computation is performed by calling R at the bottom layer. Rserve is a client/server (C/S) program based on the TCP/IP protocol that lets R communicate with other languages. It is used as follows: 1. install the dependency package (Rserve): install.packages("Rserve"); 2. start the service by entering R CMD Rserve on the command line.
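
One way to picture how a service can hide R from the user is to generate the R expression from parameters collected elsewhere and hand the resulting string to an R bridge such as Rserve. The sketch below only builds the expression text. The template table and function name are hypothetical, the R functions referenced (`kmeans`, `cluster::pam`, `lm`, `arima`) are standard R functions, and whether the platform uses these exact calls is an assumption. Sending the string to Rserve is left out, and no input validation is performed.

```python
# Hypothetical mapping from service-level algorithm names to R call templates.
R_CALL_TEMPLATES = {
    "kmeans": "kmeans({data}, centers = {k})",
    "pam": "cluster::pam({data}, k = {k})",
    "multiple_regression": "lm({formula}, data = {data})",
    "arima": "arima({data}, order = c({p}, {d}, {q}))",
}


def build_r_call(algorithm: str, **params) -> str:
    """Build the R expression text for an algorithm and its parameters."""
    try:
        template = R_CALL_TEMPLATES[algorithm]
    except KeyError:
        raise ValueError(f"unsupported algorithm: {algorithm}") from None
    # A missing parameter raises KeyError from str.format.
    return template.format(**params)
```

Keeping the templates in a table means a new algorithm can be added without touching the calling code, which is in the spirit of the loose coupling the text describes.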

Step 5: process massive data in a relational database. R has several interfaces to relational database management systems, but for massive numbers of records R is again limited by memory and by low processing efficiency. The invention combines R and Hadoop to operate on large-scale data in a relational database. Hadoop provides interfaces for querying and reading data from a relational database; although these interfaces allow records read directly from the database to serve as MapReduce input, the processing efficiency is low, and frequent large-volume queries and reads from MapReduce programs may greatly increase the access load on the database. The invention therefore adopts a solution that reads and processes massive numbers of relational database records more efficiently: the open-source tool Sqoop exports the large volume of data to be analyzed as text data files and uploads them to HDFS, so that the task becomes distributed processing of a text data set.
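
The Sqoop step can be illustrated by assembling a `sqoop import` command line. This is a sketch under assumptions: the JDBC URL, user name, table and HDFS directory are placeholders, and the function only builds the argument list without running it. Note that Sqoop's import tool writes delimited text files directly into an HDFS directory, which may differ in detail from the separate export and upload described in the text.

```python
def sqoop_import_command(jdbc_url: str, username: str, table: str,
                         target_dir: str, num_mappers: int = 4) -> list:
    """Build the argument list for importing one table into HDFS as text."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,            # JDBC connection string of the source database
        "--username", username,
        "--table", table,                 # table whose records are to be analyzed
        "--target-dir", target_dir,       # HDFS directory that receives the text files
        "--as-textfile",                  # plain delimited text output
        "--fields-terminated-by", ",",
        "--num-mappers", str(num_mappers),
    ]
```

The resulting list could be passed to `subprocess.run` on a machine where Sqoop is installed; credentials handling (for example a password file) is deliberately omitted here.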

Step 6: workflow-based operation. The functions implemented by the service layer are presented to the user through a Web interface. The user can define an analysis workflow according to their own needs, including setting the data source, selecting the analysis method, setting the analysis parameters, running data mining and analysis, and producing and displaying the analysis results. The specific business flow is shown in Fig. 2.
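
The user-defined workflow can be modeled as a small record that tracks which configuration steps are still missing before mining can run. The class and field names below are hypothetical; the sketch only covers the three configuration steps (data source, method, parameters) and does not execute any analysis.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisWorkflow:
    """Hypothetical record of what the user has configured so far."""
    data_source: str = ""
    method: str = ""
    parameters: dict = field(default_factory=dict)

    def missing_steps(self) -> list:
        """List the configuration steps that have not been completed yet."""
        missing = []
        if not self.data_source:
            missing.append("set data source")
        if not self.method:
            missing.append("select analysis method")
        if not self.parameters:
            missing.append("set analysis parameters")
        return missing

    def is_ready(self) -> bool:
        """True when mining and analysis can be started."""
        return not self.missing_steps()
```

A Web interface could call `missing_steps()` to tell the user what remains to be configured before the run button is enabled.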

Claims

1. A method for constructing a data mining platform in a big data environment, characterized in that it comprises the following steps:

Step 1: infrastructure virtualization; virtualization technology is used to virtualize the infrastructure, including server virtualization, storage virtualization and network virtualization of the physical layer, to form a virtualization layer; virtualization is carried out mainly from two aspects by establishing two virtual pools, a computing virtualization pool and a storage virtualization pool; the computing virtualization pool mainly implements application virtualization and, at the computing resource level, includes server virtualization and application middleware virtualization; the storage virtualization pool mainly implements data storage virtualization and, at the storage level, includes virtualization of the storage hardware architecture and virtualization of the storage software;

Step 2: deploy the virtual appliance, i.e. the virtual machine instantiation stage; the process is roughly divided into the following steps:

(1) select a virtual appliance and customize it;

(2) save the customization parameter file;

(3) select the target physical server for deployment;

(4) copy the files associated with the virtual appliance;

(5) start the deployed virtual appliance on the target machine;

Step 3: install the open-source cloud computing solution CloudStack; with CloudStack as the foundation, a virtual machine cluster is built, and a user can quickly and easily create a private cloud computing platform on existing infrastructure; the installation process mainly includes the following steps:

(1) configure the installation source;

(2) install the CloudStack Management Server;

(3) install the MySQL database;

(4) install the host (HOST);

(5) configure the security policy, network bridge, firewall and NFS share;

Step 4: service layer; deploy the RHadoop environment so that the R engine can run on the Hadoop cluster; configure the JRI dynamic link library so that the actual computation is performed by calling R at the bottom layer;

Step 5: process massive data in a relational database; R and Hadoop are combined to operate on large-scale data in a relational database: the open-source tool Sqoop exports the large volume of data to be analyzed as text data files, the text data files are uploaded to HDFS, and the task thereby becomes distributed processing of a text data set;

Step 6: workflow-based operation; in the application layer, the functions implemented by the service layer are presented to the user through a Web interface; the user can define an analysis workflow according to their own needs, including setting the data source, selecting the analysis method, setting the analysis parameters, running data mining and analysis, and producing and displaying the analysis results.

2. The method for constructing a data mining platform in a big data environment according to claim 1, characterized in that: in the service layer, the replication technology of the MySQL database and the open-source tool Sqoop are used to implement a customizable data transfer mechanism between Hadoop and the database.

3. The method for constructing a data mining platform in a big data environment according to claim 1, characterized in that: in the application layer, a user interface in B/S (browser/server) mode is designed; the user only needs to operate a graphical interface to carry out data analysis and statistics, without writing R code directly; the actual computation is performed by calling R at the bottom layer, which fundamentally hides the complexity of the R language.