CN104881476A - Cloud computing based mass data processing system

Info

Publication number: CN104881476A
Application number: CN201510296226.5A
Authority: CN
Inventors: 陈勇; 胡中骥
Original assignee: Jiangsu Mashangyou Technology Co., Ltd.
Current assignee: Jiangsu Mashangyou Technology Co., Ltd.
Priority date: 2015-06-03
Filing date: 2015-06-03
Publication date: 2015-09-02

Abstract

The invention discloses a mass data processing system based on cloud computing. The system comprises a Hadoop system, distributed regional small clusters, a master node and a distributed file system. Each distributed regional small cluster is treated as a node in a larger shared-nothing cluster and is managed by the Hadoop system, the master node is the coordinator of the Hadoop system, and data are stored in the distributed file system. The system provides excellent load balancing and fault tolerance, meets the requirements of distributed and parallel processing, and can greatly reduce communication overhead.

Description

A mass data processing system based on cloud computing

Technical field

The present invention relates to data processing systems, and more specifically to a mass data processing system based on cloud computing.

Background art

A major issue in a cloud computing architecture is how to design an efficient storage layer for processing mass data on the cloud computing platform. In the design of the Mashangyou cloud platform, data are naturally managed and stored in a distributed manner; that is, all data are connected into one data group by a high-speed local area network. Massive amounts of data are generated by the various applications on the cloud platform system. One possible storage and query method is to use a centralized relational database management system (DBMS) as the underlying data storage layer. However, this approach has several limitations, especially in a distributed system.

First, a central database server makes it difficult to balance load across the multiple nodes in the system.

Second, it is prone to a single point of failure; that is, fault-tolerance problems may threaten the functioning of the system.

Third, it produces a very heavy communication load, because the data distributed on each node must be delivered to the central server over the underlying network.

Finally, this model makes it difficult to achieve parallel processing and thus to exploit the computational advantages of the cloud platform architecture.

Summary of the invention

The object of the present invention is to overcome the above defects of the prior art by proposing a mass data processing system based on cloud computing.

The technical solution adopted by the present invention is as follows:

An extensible distributed storage layer is provided using a Hadoop system. The distributed regional small clusters are retained, and each of these clusters is then treated as a node in a larger shared-nothing cluster managed by the Hadoop system. Each small cluster node is treated as a slave node in the Hadoop system, and two master nodes are designated as the coordinators of the Hadoop system. We call this design a distributed data warehouse using Hadoop. We store data in the Hadoop Distributed File System (HDFS) and design the Map and Reduce functions that the applications need, in order to accommodate and reduce the computation and communication volume of user applications in the cloud computing system.
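The role of application-specific Map and Reduce functions in cutting communication volume can be illustrated with a minimal, framework-free sketch. It is not the patented implementation and does not use the Hadoop API; the node names, the sample records and the helper names (`map_fn`, `combine`, `reduce_fn`, `run_job`) are assumptions made for illustration. The local combining step shows one common way to shrink the data sent between nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical input: each small cluster (slave node) holds its own records.
NODE_RECORDS: Dict[str, List[str]] = {
    "cluster-a": ["login", "query", "login"],
    "cluster-b": ["query", "query", "upload"],
}


def map_fn(record: str) -> Iterable[Tuple[str, int]]:
    """Map: emit one (key, 1) pair per record."""
    yield record, 1


def combine(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Combiner: pre-aggregate on the local node so fewer pairs leave it."""
    local: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return dict(local)


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce: sum the partial counts received for one key."""
    return key, sum(values)


def run_job(node_records: Dict[str, List[str]]) -> Tuple[Dict[str, int], int]:
    """Run map, combine and reduce; also count the pairs sent between nodes."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    pairs_sent = 0
    for records in node_records.values():
        mapped = [pair for record in records for pair in map_fn(record)]
        partial = combine(mapped)          # stays on the local node
        pairs_sent += len(partial)         # only the partial sums are shipped
        for key, value in partial.items():
            shuffled[key].append(value)
    counts = dict(reduce_fn(key, values) for key, values in shuffled.items())
    return counts, pairs_sent


counts, pairs_sent = run_job(NODE_RECORDS)
print(counts, pairs_sent)
```

In this toy input the six mapped pairs shrink to four shipped pairs because each node sums its own counts before anything crosses the network.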

This distributed data warehouse is designed specifically for the cloud computing architecture, because it naturally provides excellent load balancing and fault tolerance and meets the requirements of distributed and parallel processing. For example, our system can automatically dispatch computation requests to lightly loaded nodes. It also uses a data reloading technique, so the task that a failed node was executing can be moved to other healthy nodes, which continue the computation.

Another attractive feature of our system is that it can greatly reduce the system's communication overhead. Our main challenge is to design and implement customized Map and Reduce functions that reduce the communication cost and the overall computation cost (for example, by pruning unnecessary node visits and data transfers).

We also integrate a traditional relational database management system into our Hadoop distributed data warehouse, especially for processing structured data. To this end, we make use of HadoopDB technology as an extension. Each slave node uses a local relational database management system instance as its storage layer, instead of relying only on HDFS. It can therefore process structured data more efficiently (for example, by using an index structure of the database management system to speed up access to local data).
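The combination of a local relational store on each slave node with pruning of unnecessary node visits can be sketched as follows. This is an illustrative assumption, not HadoopDB code: SQLite stands in for the local relational database, and the `SlaveNode` class, the `orders` table, its columns and the per-node `regions` summary are all hypothetical.

```python
import sqlite3
from typing import List, Tuple


class SlaveNode:
    """Hypothetical slave node whose storage layer is a local relational database."""

    def __init__(self, name: str, rows: List[Tuple[str, int]]) -> None:
        self.name = name
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
        # Local index so that lookups by region do not scan the whole table.
        self.db.execute("CREATE INDEX idx_orders_region ON orders (region)")
        self.db.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        # Small summary kept by the coordinator, used to prune node visits.
        self.regions = {region for region, _ in rows}

    def local_sum(self, region: str) -> int:
        """Aggregate locally and return a single number, not the raw rows."""
        row = self.db.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE region = ?",
            (region,),
        ).fetchone()
        return row[0]


def query_total(nodes: List[SlaveNode], region: str) -> Tuple[int, List[str]]:
    """Visit only the nodes that can hold matching rows and add their partial sums."""
    visited: List[str] = []
    total = 0
    for node in nodes:
        if region not in node.regions:
            continue  # pruned: no request and no data transfer for this node
        visited.append(node.name)
        total += node.local_sum(region)
    return total, visited


nodes = [
    SlaveNode("node-1", [("east", 10), ("east", 5)]),
    SlaveNode("node-2", [("west", 7)]),
    SlaveNode("node-3", [("east", 3), ("west", 1)]),
]
total, visited = query_total(nodes, "east")
print(total, visited)
```

Here `node-2` is never contacted for the `east` query, and each visited node returns one partial sum rather than its rows.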

HBase is adopted as our data storage and computing system. HBase is an open-source project that supports random, real-time read/write access to big data. Its goal is to host very large tables, with billions of rows and millions of columns, on clusters of commodity hardware.
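For readers unfamiliar with the HBase data model, the following toy sketch mimics its main ideas in plain Python: rows sorted by row key, columns named as `family:qualifier`, and cell versions ordered by timestamp. It does not call the HBase client API; the `ToyTable` class and the sample keys are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional, Tuple


class ToyTable:
    """In-memory stand-in for a wide table: row key -> column -> versions."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, List[Tuple[int, str]]]] = {}

    def put(self, row_key: str, column: str, value: str, timestamp: int) -> None:
        """Store a new version of one cell; the column is 'family:qualifier'."""
        versions = self._rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)  # newest version first

    def get(self, row_key: str, column: str) -> Optional[str]:
        """Random read: return the newest value of one cell, if any."""
        versions = self._rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row: str, stop_row: str) -> List[str]:
        """Range scan over sorted row keys, start inclusive and stop exclusive."""
        return [key for key in sorted(self._rows) if start_row <= key < stop_row]


table = ToyTable()
table.put("user#001", "info:name", "alice", timestamp=1)
table.put("user#001", "info:name", "alicia", timestamp=2)
table.put("user#002", "info:name", "bob", timestamp=1)
table.put("video#001", "meta:title", "intro", timestamp=1)

latest_name = table.get("user#001", "info:name")
user_rows = table.scan("user#", "user#~")
print(latest_name, user_rows)
```

Choosing row keys with a common prefix, as in the hypothetical `user#` keys above, keeps related rows adjacent so that a range scan touches only a contiguous part of the table.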

The beneficial effects of the present invention are as follows.

The mass data processing system based on cloud computing of the present invention:

1. provides excellent load balancing and fault tolerance, and meets the requirements of distributed and parallel processing;

2. can greatly reduce the communication overhead of the system.

The present invention is described in further detail below with reference to the accompanying drawing.

Brief description of the drawing

Fig. 1 shows the data storage and processing flow of the mass data processing system based on cloud computing of the present invention.

Embodiment

To aid understanding of the present invention, it is described in further detail below with reference to the drawing and embodiments. The following embodiments serve only to illustrate the technical solution of the present invention more clearly and do not limit its scope of protection.

A specific embodiment of the present invention is as follows.

As shown in Fig. 1, an extensible distributed storage layer is provided using a Hadoop system. The distributed regional small clusters are retained, and each of these clusters is then treated as a node in a larger shared-nothing cluster managed by the Hadoop system. Each small cluster node is treated as a slave node in the Hadoop system, and two master nodes are designated as the coordinators of the Hadoop system. We call this design a distributed data warehouse using Hadoop. We store data in the Hadoop Distributed File System (HDFS) and design the Map and Reduce functions that the applications need, in order to accommodate and reduce the computation and communication volume of user applications in the cloud computing system.
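The relationship between a coordinating master node and the small clusters acting as slave nodes can be pictured with a short sketch. It is a simplified assumption rather than Hadoop's actual scheduler: a single `Coordinator` object stands in for the two master nodes, the cluster and task names are hypothetical, and the policy shown (send work to the least-loaded slave, then reassign the tasks of a failed slave) is only one possible reading of the load-balancing and fault-tolerance behaviour described in this document.

```python
from typing import Dict, List


class Coordinator:
    """Hypothetical master node that tracks which tasks each slave node holds."""

    def __init__(self, slave_names: List[str]) -> None:
        self.tasks: Dict[str, List[str]] = {name: [] for name in slave_names}

    def assign(self, task: str) -> str:
        """Give the task to the slave with the fewest tasks (ties broken by name)."""
        target = min(self.tasks, key=lambda name: (len(self.tasks[name]), name))
        self.tasks[target].append(task)
        return target

    def handle_failure(self, failed: str) -> Dict[str, str]:
        """Remove a failed slave and move its tasks to the remaining slaves."""
        orphaned = self.tasks.pop(failed)
        return {task: self.assign(task) for task in orphaned}


coordinator = Coordinator(["cluster-a", "cluster-b", "cluster-c"])
placement = {task: coordinator.assign(task) for task in ["t1", "t2", "t3", "t4"]}
moved = coordinator.handle_failure("cluster-a")
print(placement, moved)
```

After `cluster-a` fails, its two tasks are spread over the two surviving clusters, so every task still has an owner and no surviving cluster holds more than two tasks.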

It should be noted that the above embodiment illustrates the technical solution of the present invention and does not limit it. Equivalent substitutions by those of ordinary skill in the art, or other modifications made according to the prior art, shall fall within the scope of the claims of the present invention as long as they do not depart from the idea and scope of its technical solution.

Claims

1. A mass data processing system based on cloud computing, characterized in that it comprises a Hadoop system, distributed regional small clusters, a master node and a distributed file system, wherein each distributed regional small cluster is treated as a node in a larger shared-nothing cluster managed by the Hadoop system, the master node is the coordinator of the Hadoop system, and data are stored in the distributed file system.

2. The mass data processing system based on cloud computing according to claim 1, characterized in that the Hadoop system further comprises MapReduce nodes, so as to accommodate and reduce the computation and communication volume of user applications in the cloud computing system.