CN104320460A

CN104320460A - Big data processing method

Info

Publication number: CN104320460A
Application number: CN201410577834.9A
Authority: CN
Inventors: 王茜; 李安颖; 史晨昱; 葛新; 梁小江
Original assignee: Xi'an Following International Information Ltd Co
Current assignee: Xi'an Following International Information Ltd Co
Priority date: 2014-10-24
Filing date: 2014-10-24
Publication date: 2015-01-28

Abstract

The invention discloses a big data processing method comprising the following steps: creating a Hadoop cluster on an OpenStack cloud platform to provide the basic environment for big data processing; importing data into HDFS and Swift to create data sources; and having a user process the data in the data sources created in step 2, with the processing result either displayed on a Web page or written to an output file under a specified path. This method, based on OpenStack and Hadoop, improves server resource utilization and lowers the barrier to entry for big data.

Description

A big data processing method

Technical field

The invention belongs to the technical field of big data processing and relates to a big data processing method.

Background art

As the network information era takes hold, the mobile Internet, social networks, and e-commerce have greatly expanded the boundaries and applications of the Internet. We are in a "big data" era of explosive data growth. Big data has a far-reaching influence on the social economy, politics, culture, and people's daily lives, and the big data era poses new opportunities and challenges to humanity's ability to manage data. Big data is characterized by massive volume, diversity, high velocity, and variability; its data types are varied, its value density is relatively low, and its timeliness requirements are high, all of which exceed the processing capacity of traditional database systems. Valuable patterns and information are hidden in the data, but mining them from big data with traditional data processing methods takes a long time and incurs huge cost, and some data cannot be processed at all. The wave set off by cloud computing and the big data revolution has driven the development of the data analysis industry: cloud computing provides the base platform, and big data applications run on that platform, which is currently recognized as one of the most efficient ways to process big data. Using cloud computing for big data analysis is bound to become one of the future development trends. Big data analysis, with Hadoop applications as its representative, is among the workloads best suited to running on a cloud platform.

OpenStack is an open-source cloud computing technology whose main goal is to simplify cloud deployment and give the cloud good scalability.

To process and analyze big data conveniently and quickly and to extract the value of the data, we propose a new processing method based on OpenStack Sahara, which makes it possible to mine the information in big data quickly and at low cost.

Summary of the invention

The object of the invention is to provide a big data processing method that improves server resource utilization and lowers the barrier to entry for big data.

The technical solution of the invention is a big data processing method, implemented according to the following steps:

Step 1, create a Hadoop cluster on the OpenStack cloud platform to provide the basic environment for big data processing;

Step 2, create data sources by importing data into HDFS and Swift;

Step 3, the user processes the data in the data sources created in step 2, and the result is either displayed on a Web page or written to an output file under a specified path.

The invention is further characterized as follows.

Step 1 is implemented according to the following steps:

Step 1.1, the user applies for an OpenStack account and uses it to log in to the OpenStack cloud platform;

Step 1.2, the user uploads an image to the OpenStack cloud platform and registers the image;

Step 1.3, the user creates a network and a router, node group templates, and a cluster template;

Step 1.4, the user creates the Hadoop cluster by selecting the plugin and Hadoop version, entering a cluster name, and selecting the cluster template, base image, key pair, and network.
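The choices made in step 1.4 can be pictured as a single cluster-creation request. The following is a minimal sketch, not the method's actual implementation: the key names are modeled on the Sahara cluster-creation request body and should be treated as an assumption to check against the Sahara release in use, and every value (cluster name, plugin, version, identifiers) is a hypothetical placeholder. The code only assembles and validates the request; it does not contact any service.

```python
import json

# Settings the user supplies in step 1.4. The key names are an assumption
# modeled on the Sahara cluster-creation request body; verify them against
# the API reference for your release.
REQUIRED_FIELDS = (
    "name",                        # cluster name
    "plugin_name",                 # selected plugin
    "hadoop_version",              # selected Hadoop version
    "cluster_template_id",         # cluster template from step 1.3
    "default_image_id",            # base image registered in step 1.2
    "user_keypair_id",             # key pair
    "neutron_management_network",  # network from step 1.3
)


def build_cluster_request(**choices):
    """Return a JSON-serializable request body, or raise if a setting is missing."""
    missing = [field for field in REQUIRED_FIELDS if not choices.get(field)]
    if missing:
        raise ValueError("missing cluster settings: " + ", ".join(missing))
    return {field: choices[field] for field in REQUIRED_FIELDS}


# All values below are hypothetical placeholders.
request_body = build_cluster_request(
    name="demo-cluster",
    plugin_name="vanilla",
    hadoop_version="2.4.1",
    cluster_template_id="TEMPLATE-ID-PLACEHOLDER",
    default_image_id="IMAGE-ID-PLACEHOLDER",
    user_keypair_id="demo-keypair",
    neutron_management_network="NETWORK-ID-PLACEHOLDER",
)
print(json.dumps(request_body, indent=2))
```

Validating the settings before submission mirrors the form in the user interface, where the cluster cannot be created until every selection has been made.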

In step 2, the data sources comprise an HDFS data source and a Swift data source.

In step 3, the user processes the data by either a user-interface method or a command-line method.

In the user-interface method, the user interacts through the user interface to create Job Binaries and a Job, runs the job, and views the execution result on a Web page.

In the command-line method, the user submits and runs the job with commands at the command-line interface, and views the result in the output file under the specified output path.
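The command-line method can be illustrated with a short sketch. The source does not name the command used, so this assumes the standard `hadoop jar` launcher; the jar name, main class, and paths are hypothetical. The code only builds and prints the command rather than executing it.

```python
import shlex


def build_hadoop_command(jar_path, main_class, input_path, output_path):
    """Build the argument list for submitting a job with `hadoop jar`."""
    return ["hadoop", "jar", jar_path, main_class, input_path, output_path]


command = build_hadoop_command(
    jar_path="wordcount.jar",         # hypothetical job binary
    main_class="WordCount",           # hypothetical main class
    input_path="/user/demo/input",    # hypothetical input directory
    output_path="/user/demo/output",  # hypothetical output directory
)

# Print a shell-safe rendering of the command instead of running it.
print(" ".join(shlex.quote(part) for part in command))
```

Once such a job finishes, the result would be read from the files under the chosen output path, which corresponds to the "output file under the specified output path" in the text.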

In step 3, the user processes the data with Hadoop's Map-Reduce framework.

The beneficial effects of the invention are as follows. Sahara makes it possible to deploy a Hadoop cluster rapidly in an OpenStack cloud environment. Acting as a bridge between cloud computing and big data, it promotes the integration of the OpenStack cloud platform with Hadoop, so that the information in big data can be mined quickly and at low cost. This improves server resource utilization and greatly lowers the barrier to entry for big data; running big data applications on a cloud platform is one of the most efficient ways to process big data.

Brief description of the drawings

Fig. 1 is a flow chart of the big data processing method of the invention;

Fig. 2 is a schematic diagram of the Hadoop cluster creation process in the method of the invention;

Fig. 3 is a flow chart of the Map-Reduce processing in the method of the invention.

Embodiments

The invention is described in detail below with reference to the drawings and specific embodiments.

The big data processing method of the invention, as shown in Fig. 1, comprises the following steps:

As shown in Fig. 2, step 1 is implemented according to the following steps:

Step 1.4, the user creates the Hadoop cluster by selecting the plugin and Hadoop version, entering a cluster name, and selecting the cluster template, base image, key pair, and network;

Step 2, create data sources by importing data into HDFS and Swift;

In step 2, the data sources comprise an HDFS data source and a Swift data source.

An HDFS data source is defined by an input/output data source name, the data source type HDFS, and an input/output URL path.

A Swift data source is defined by an input/output data source name, the data source type Swift, an input/output URL path, and a user name and password.
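The two kinds of data source differ only in that Swift also needs credentials. A minimal sketch of that distinction follows; the class, its field names, the `hdfs://` and `swift://` URL prefixes, and all example values are illustrative assumptions rather than the platform's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSource:
    """Hypothetical record of the fields listed for a data source."""
    name: str
    source_type: str                # "hdfs" or "swift"
    url: str                        # input or output URL path
    username: Optional[str] = None  # Swift only
    password: Optional[str] = None  # Swift only

    def validate(self):
        if self.source_type not in ("hdfs", "swift"):
            raise ValueError(f"unsupported data source type: {self.source_type}")
        # Assumption: the URL scheme matches the data source type.
        if not self.url.startswith(self.source_type + "://"):
            raise ValueError(f"URL must start with {self.source_type}://")
        if self.source_type == "swift" and not (self.username and self.password):
            raise ValueError("a Swift data source needs a user name and password")
        return self


# Example values are placeholders.
hdfs_input = DataSource("demo-input", "hdfs", "hdfs://namenode/user/demo/input").validate()
swift_output = DataSource(
    "demo-output", "swift", "swift://demo-container/output",
    username="demo-user", password="demo-password",
).validate()
print(hdfs_input.source_type, swift_output.source_type)
```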

Step 3, the user can process the data in two ways. One is through the user interface: the user creates Job Binaries, creates a job, runs the job, and views the execution result on a Web page. The other is through the command-line interface: the user submits and runs the job with commands and views the result in the output file under the specified output path. The data processing itself uses Hadoop's Map-Reduce framework. Map-Reduce amounts to decomposing a task and aggregating the results. The processing procedure is shown in Fig. 3:

Map stage: the Hadoop Map/Reduce framework spawns one map task for each InputSplit, and each InputSplit is generated by the job's InputFormat. The framework groups all intermediate values associated with a given key; the Mapper output is sorted and then partitioned among the Reducers.
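The Map stage can be sketched in plain Python as a small simulation. This is not Hadoop code: splitting by a fixed number of lines stands in for the InputFormat, a word-count mapper is assumed as the example job, and a CRC32-based partitioner stands in for the framework's own partitioning. All function names are hypothetical.

```python
import zlib


def make_splits(lines, lines_per_split):
    """Stand-in for an InputFormat: cut the input into fixed-size splits."""
    return [lines[i:i + lines_per_split] for i in range(0, len(lines), lines_per_split)]


def map_task(split):
    """One map task per split: emit (word, 1) pairs, then sort them by key."""
    pairs = [(word.lower(), 1) for line in split for word in line.split()]
    return sorted(pairs)


def partition(sorted_pairs, num_reducers):
    """Assign each pair to a reducer with a deterministic hash of its key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in sorted_pairs:
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index].append((key, value))
    return buckets


lines = [
    "big data on the cloud",
    "the cloud runs hadoop",
    "hadoop processes big data",
]
splits = make_splits(lines, lines_per_split=2)
map_outputs = [partition(map_task(split), num_reducers=2) for split in splits]

for task_id, buckets in enumerate(map_outputs):
    for reducer_id, bucket in enumerate(buckets):
        print(f"map task {task_id} -> reducer {reducer_id}: {bucket}")
```

Because the partition function depends only on the key, every occurrence of a given word lands at the same reducer no matter which map task produced it, which is what allows the later grouping by key.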

Reduce stage: the Reducer has three main phases: shuffle, sort, and reduce.

Shuffle: the input to the Reducer is the sorted output of the Mappers. In this phase, the framework fetches for each Reducer the relevant partition of the output of all Mappers over HTTP.

Sort: in this phase, the framework groups the Reducer input by key (because different Mappers may have output the same key).

The shuffle and sort phases run simultaneously: map outputs are merged while they are being fetched.
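Merging while fetching works because each Mapper's output is already sorted. The sketch below simulates this for a single Reducer, using `heapq.merge` as a stand-in for the framework's merge and in-memory lists in place of the partitions fetched over HTTP; the sample data is hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Sorted partitions that one reducer would fetch from three mappers
# (hypothetical sample data).
mapper_outputs = [
    [("big", 1), ("cloud", 1), ("data", 1)],
    [("cloud", 1), ("hadoop", 1)],
    [("big", 1), ("data", 1), ("hadoop", 1)],
]


def shuffle_and_sort(sorted_runs):
    """Lazily merge sorted runs by key, then group the values of equal keys."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


grouped = list(shuffle_and_sort(mapper_outputs))
print(grouped)
# [('big', [1, 1]), ('cloud', [1, 1]), ('data', [1, 1]), ('hadoop', [1, 1])]
```

The merge is lazy, so grouping can begin before all input has been consumed, which loosely mirrors the overlap of the two phases described above.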

Reduce: in this phase, the framework calls the reduce method once for each <key, (list of values)> pair in the grouped input. The output of the reduce task is typically written to the file system by calling OutputCollector.collect.
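The reduce phase can be sketched in the same simulated style. Here a word-count reducer is assumed as the example job, and a plain Python function named `collect` stands in for OutputCollector.collect by appending to a list instead of writing to the file system; the names and sample data are hypothetical.

```python
def reduce_word_count(key, values):
    """Reduce method for a word count: sum the values grouped under one key."""
    return key, sum(values)


def run_reduce(grouped_input, reduce_fn, collect):
    """Call the reduce method once per <key, (list of values)> pair."""
    for key, values in grouped_input:
        out_key, out_value = reduce_fn(key, values)
        collect(out_key, out_value)


results = []


def collect(key, value):
    # Stand-in for OutputCollector.collect: keep the output in memory.
    results.append((key, value))


# Grouped input as it would arrive after the shuffle and sort phases.
grouped_input = [("big", [1, 1]), ("cloud", [1, 1, 1]), ("hadoop", [1])]
run_reduce(grouped_input, reduce_word_count, collect)
print(results)
# [('big', 2), ('cloud', 3), ('hadoop', 1)]
```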

Claims

1. A big data processing method, characterized in that it is implemented according to the following steps:

Step 2, create data sources by importing data into HDFS and Swift;

2. The big data processing method according to claim 1, characterized in that step 1 is implemented according to the following steps:

3. The big data processing method according to claim 1, characterized in that, in step 2, the data sources comprise an HDFS data source and a Swift data source.

4. The big data processing method according to claim 1, characterized in that, in step 3, the user processes the data by either a user-interface method or a command-line method;

in the user-interface method, the user interacts through the user interface to create Job Binaries and a Job, runs the job, and views the execution result on a Web page;

in the command-line method, the user submits and runs the job with commands at the command-line interface, and views the result in the output file under the specified output path.

5. The big data processing method according to any one of claims 1 to 4, characterized in that, in step 3, the user processes the data with Hadoop's Map-Reduce framework.