CN103067501B

CN103067501B - Big data processing method for a PaaS platform

Info

Publication number: CN103067501B
Application number: CN201210581670.8A
Authority: CN
Inventors: 李进
Original assignee: GCI Science and Technology Co Ltd
Current assignee: GCI Science and Technology Co Ltd
Priority date: 2012-12-28
Filing date: 2012-12-28
Publication date: 2015-12-09
Anticipated expiration: 2032-12-28
Also published as: CN103067501A

Abstract

The invention discloses a big data processing method for a PaaS platform, comprising: a PaaS platform server receives cluster creation parameters input by a user; the PaaS platform server generates a distributed processing cluster by virtualization technology according to the cluster creation parameters; the PaaS platform server transmits a script for analyzing data to the distributed processing cluster, and the distributed processing cluster processes the data to be analyzed; the PaaS platform server provides the data processing result to the user. Embodiments of the invention can solve the problem of processing massive data on a PaaS platform and improve data processing efficiency.

Description

Big data processing method for a PaaS platform

Technical field

The present invention relates to the field of cloud computing technology, and in particular to a big data processing method for a PaaS (Platform-as-a-Service) platform.

Background art

Cloud computing is developing rapidly, and PaaS, as a key area of the cloud computing industry, has become an important battleground for major enterprises competing for the future. Because IaaS (Infrastructure-as-a-Service) and SaaS (Software-as-a-Service) have already been commercialized and much application software in cloud environments has been standardized, users need to make full use of the innovative solutions that PaaS brings, while service providers need such solutions to differentiate themselves from competitors. As a service model, PaaS can advance the development of SaaS and increase the amount of resources available on the Web platform. PaaS solutions make application deployment convenient, reduce the complexity of purchasing and managing the underlying hardware and software, and lower cost.

As PaaS platforms develop, more and more applications are deployed on them. Because data is increasingly generated automatically, a growing number of applications require these continuously growing data streams to be stored persistently and then queried, analyzed and mined. This poses a severe challenge to the management of massive data on PaaS platforms, and so the problem of big data processing on PaaS platforms has emerged.

Summary of the invention

Embodiments of the present invention propose a big data processing method for a PaaS platform, which can solve the problem of processing massive data on the PaaS platform and improve data processing efficiency.

An embodiment of the present invention provides a big data processing method for a PaaS platform, comprising:

S1, the PaaS platform server receives cluster creation parameters input by a user; the cluster creation parameters comprise the number of nodes of the distributed processing cluster to be created, the memory size of each node and the storage space size of each node;

S2, the PaaS platform server generates a distributed processing cluster by virtualization technology according to the cluster creation parameters;

S3, the PaaS platform server configures the data source to be analyzed according to a log file storage address input by the user or the name of an application deployed by the user;

S4, the PaaS platform server transmits a script for analyzing data to the distributed processing cluster, and the distributed processing cluster processes the data to be analyzed;

S5, the PaaS platform server provides the data processing result to the user.

Wherein, each node is a virtual machine in the distributed processing cluster; the nodes comprise a control node and computing nodes, the control node being used to manage the cluster and distribute data processing tasks, and the computing nodes being used to analyze and process data.

The big data processing method for a PaaS platform provided by the embodiments of the present invention uses the existing resources of the PaaS platform: the PaaS platform generates each node of the distributed processing cluster through the virtualization technology of the underlying IaaS layer. The generated distributed processing cluster gives the PaaS platform big data processing capability, which solves the problem of processing massive data on the PaaS platform and improves data processing efficiency.

Brief description of the drawings

Fig. 1 is a schematic flow chart of an embodiment of the big data processing method for a PaaS platform provided by the present invention;

Fig. 2 is a schematic structural diagram of an embodiment of the big data processing system for a PaaS platform provided by the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Refer to Fig. 1, which is a schematic flow chart of an embodiment of the big data processing method for a PaaS platform provided by the present invention.

An embodiment of the present invention provides a big data processing method for a PaaS platform, comprising steps S1 to S5, as follows:

S1, the PaaS platform server receives cluster creation parameters input by a user.

The cluster creation parameters comprise the number of nodes of the distributed processing cluster to be created, the memory size of each node, the storage space size of each node, and other parameters.

Each node is a virtual machine in the distributed processing cluster; the nodes comprise a control node and computing nodes, the control node being used to manage the cluster and distribute data processing tasks, and the computing nodes being used to analyze and process data.

In addition, the PaaS platform server checks, according to the cluster creation parameters, whether system resources meet the requirements. If they do, step S2 is performed to create the distributed processing cluster.
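As an illustrative sketch only, the parameter set of step S1 and the resource check that precedes step S2 might be modeled as follows. The names ClusterParams, FreeResources and resources_sufficient are hypothetical, as are the units and the rule that a cluster needs at least two nodes; the description does not specify how the check is performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterParams:
    """Cluster creation parameters entered by the user (step S1)."""
    node_count: int   # number of virtual machines in the cluster
    memory_mb: int    # memory size of each node (unit is an assumption)
    storage_gb: int   # storage space of each node (unit is an assumption)


@dataclass(frozen=True)
class FreeResources:
    """Hypothetical snapshot of what the IaaS layer can still allocate."""
    memory_mb: int
    storage_gb: int


def resources_sufficient(params: ClusterParams, free: FreeResources) -> bool:
    """Return True when the requested cluster fits in the free resources."""
    if params.node_count < 2:
        # Assumption: one control node plus at least one computing node.
        return False
    needed_memory = params.node_count * params.memory_mb
    needed_storage = params.node_count * params.storage_gb
    return needed_memory <= free.memory_mb and needed_storage <= free.storage_gb
```

Under these assumptions, step S2 would be started only when resources_sufficient returns True for the parameters received in step S1.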

S2, the PaaS platform server generates a distributed processing cluster by virtualization technology according to the cluster creation parameters.

Step S2 specifically comprises steps S201 to S204, as follows:

S201, generating one virtual machine by virtualization technology according to the cluster creation parameters, and configuring the runtime environment of the virtual machine.

For example, software such as the JDK, MySQL and Hadoop is installed and configured on the generated virtual machine. The required software can be copied from the software library of the big data processing service component. In one embodiment, the virtual machine runs the CentOS 5.5 operating system, with JDK version 1.6.23, MySQL version 5.5 and Hadoop version 1.0.2.

S202, copying the virtual machine generated in step S201 according to the number of nodes in the cluster creation parameters, to generate the required number of virtual machines.

S203, setting up passwordless communication between the virtual machines.

Step S203 specifically comprises: controlling each virtual machine to start a key generation program and generate its own public key and private key; then copying the public key generated by each virtual machine to the other virtual machines, thereby achieving passwordless communication.

In a specific implementation, the ssh-keygen -t dsa program can be run on each virtual machine to generate its public key and private key. The content of each public key file is then copied into the authorized_keys file of the other virtual machines, and each virtual machine logs in to the others once to generate its known_hosts file, thereby achieving passwordless communication.
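The key exchange of step S203 can be pictured with the following minimal sketch, which only computes the authorized_keys content each virtual machine should end up with; it does not run ssh-keygen or copy files. The function name build_authorized_keys and the sample host names and key strings used with it are hypothetical.

```python
def build_authorized_keys(public_keys: dict[str, str]) -> dict[str, str]:
    """Map each host to the authorized_keys content it should hold.

    public_keys maps a host name to the single-line content of that
    host's public key file. Each host receives the public keys of all
    the other hosts, one key per line.
    """
    result = {}
    for host in public_keys:
        others = [key for name, key in sorted(public_keys.items()) if name != host]
        result[host] = "\n".join(others) + "\n"
    return result
```

For three hosts vm1, vm2 and vm3, the entry for vm1 would contain the keys of vm2 and vm3, so that either of them can log in to vm1 without a password.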

S204, setting the control node and the computing nodes in the distributed processing cluster.

By default, this embodiment takes the first virtual machine generated as the control node and the remaining virtual machines as computing nodes. Further, the slaves, masters, mapred-site.xml, hdfs-site.xml, hadoop-env.sh and core-site.xml files in Hadoop are modified to configure the parameters of the distributed processing cluster.
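A minimal sketch of how the configuration files named in step S204 might be generated is given below. It assumes Hadoop 1.x property names (fs.default.name, mapred.job.tracker, dfs.replication). The ports 9000 and 9001, the default replication factor and the function names site_xml and hadoop_config are illustrative assumptions, not values taken from this description; hadoop-env.sh is omitted.

```python
from xml.sax.saxutils import escape


def site_xml(properties: dict[str, str]) -> str:
    """Render a Hadoop *-site.xml document from property name/value pairs."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


def hadoop_config(hosts: list[str], replication: int = 2) -> dict[str, str]:
    """Return file name -> file content, taking hosts[0] as the control node."""
    master, workers = hosts[0], hosts[1:]
    # Never ask for more replicas than there are computing nodes.
    replicas = max(1, min(replication, len(workers)))
    return {
        "masters": master + "\n",
        "slaves": "".join(host + "\n" for host in workers),
        "core-site.xml": site_xml({"fs.default.name": f"hdfs://{master}:9000"}),
        "mapred-site.xml": site_xml({"mapred.job.tracker": f"{master}:9001"}),
        "hdfs-site.xml": site_xml({"dfs.replication": str(replicas)}),
    }
```

Calling hadoop_config(["vm1", "vm2", "vm3"]) treats vm1 as the control node, mirroring the default of this embodiment in which the first virtual machine generated becomes the control node.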

S3, the PaaS platform server configures the data source to be analyzed according to a log file storage address input by the user or the name of an application deployed by the user.

Step S3 specifically comprises:

The PaaS platform server receives the log file storage address input by the user, or obtains the corresponding log file storage address according to the name of the application that the user has deployed on the PaaS platform;

The PaaS platform server checks whether the file at the log file storage address is a log file (that is, determines whether the log file exists); if so, the data to be analyzed is imported from the log file storage address; otherwise, configuration of the data source to be analyzed fails.

The log file at the log file storage address is the data source to be analyzed, and is imported into the distributed processing cluster for data processing in the subsequent step S4.
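The two ways of locating the data source in step S3 can be sketched as below. The function configure_data_source, the exception DataSourceError and the app_log_paths lookup table are hypothetical; the description does not say how the platform maps an application name to its log file, and the check here is reduced to the existence test mentioned above.

```python
import os
from typing import Optional


class DataSourceError(Exception):
    """Raised when the data source to be analyzed cannot be configured."""


def configure_data_source(
    log_path: Optional[str],
    app_name: Optional[str],
    app_log_paths: dict[str, str],
) -> str:
    """Return the path of the log file to analyze, or raise DataSourceError.

    log_path is the storage address typed in by the user, if any.
    app_name is the name of an application deployed on the platform.
    app_log_paths is an assumed lookup from application name to log path.
    """
    if log_path is None:
        if app_name is None or app_name not in app_log_paths:
            raise DataSourceError("no log file address for this application")
        log_path = app_log_paths[app_name]
    if not os.path.isfile(log_path):
        raise DataSourceError(f"log file not found: {log_path}")
    return log_path
```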

S4, the PaaS platform server transmits a script for analyzing data to the distributed processing cluster, and the distributed processing cluster processes the data to be analyzed.

Step S4 above specifically comprises:

S401, the PaaS platform server transmits the script for analyzing data to the control node in the distributed processing cluster; the script for analyzing data is a MapReduce script, which specifies the method for importing the data to be analyzed and the method for executing the MapReduce job.

S402, the control node selects idle computing nodes in the distributed processing cluster to execute data processing tasks in parallel, and the computing nodes process the data to be analyzed.

The control node in the distributed processing cluster mainly supervises and manages the execution of MapReduce jobs in the cluster, while the computing nodes are responsible for actually executing the Map tasks and Reduce tasks of a MapReduce job. When a MapReduce job is submitted to the distributed processing cluster, the relevant input data is first divided into multiple segments, and the control node then selects idle computing nodes to execute Map tasks on the data segments in parallel. The intermediate records produced by the Map tasks are then partitioned again, and the control node selects idle computing nodes to execute Reduce tasks on them in parallel, yielding the data set corresponding to each key as the result of the job. This process is repeated until all Map tasks and Reduce tasks in the MapReduce job have been executed.
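The split, Map, regroup and Reduce sequence described above can be illustrated with a small single-process simulation. It is not Hadoop code: a thread pool stands in for the idle computing nodes, and the example job (counting log lines per log level, taken to be the first token of each line) is an assumption chosen only to match the log file data source of step S3.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, n_splits):
    """Divide the input into at most n_splits roughly equal segments."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_task(segment):
    """Map: emit (log level, 1) for every non-empty line of a segment."""
    return [(line.split()[0], 1) for line in segment if line.strip()]


def reduce_task(item):
    """Reduce: sum the values collected for one key."""
    key, values = item
    return key, sum(values)


def run_job(lines, workers=3):
    """Run the map, regroup and reduce phases with a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one task per input segment, executed in parallel.
        mapped = list(pool.map(map_task, split_input(lines, workers)))
        # Regroup the intermediate records by key.
        groups = defaultdict(list)
        for records in mapped:
            for key, value in records:
                groups[key].append(value)
        # Reduce phase: one task per key, executed in parallel.
        return dict(pool.map(reduce_task, groups.items()))
```

With the five lines "ERROR disk full", "INFO started", "ERROR timeout", "WARN slow" and "INFO stopped" as input, run_job returns a count of 2 for ERROR, 2 for INFO and 1 for WARN.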

In a specific implementation, the PaaS platform server also checks, according to the script type, whether the script for analyzing data meets the requirements. For example, the script may be required to be of the jar type. If the script meets the requirements, steps S401 and S402 are performed.
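One possible form of this script type check is sketched below. The function name is_acceptable_script is hypothetical, and testing both the .jar extension and the ZIP container format is an assumption; the description only states that the script may be required to be of the jar type.

```python
import zipfile


def is_acceptable_script(path: str) -> bool:
    """Return True when the file is named *.jar and is a readable ZIP archive.

    A jar file is a ZIP archive, so zipfile.is_zipfile is used as a
    cheap structural check in addition to the file name.
    """
    return path.lower().endswith(".jar") and zipfile.is_zipfile(path)
```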

S5, the PaaS platform server provides the data processing result to the user.

The big data processing method for a PaaS platform provided by the embodiments of the present invention uses the existing resources of the PaaS platform: the PaaS platform generates each node of the distributed processing cluster through the virtualization technology of the underlying IaaS layer. The generated distributed processing cluster gives the PaaS platform big data processing capability, thereby solving the problem of processing massive data on the PaaS platform and improving data processing efficiency.

In a specific implementation, the PaaS platform is configured on the PaaS platform server and integrates a big data processing service component, which performs the big data processing flow of steps S1 to S5 above.

Refer to Fig. 2, which is a schematic structural diagram of an embodiment of the big data processing system for a PaaS platform provided by the present invention.

An embodiment of the present invention provides a big data processing system for a PaaS platform, comprising: a PaaS platform layer, a virtual distributed processing cluster, cloud storage and servers. Details are as follows:

The PaaS platform layer provides various service components, including the big data processing service component, and provides a user interface (UI) for users to operate. The PaaS platform adopts the OSGi (Open Service Gateway Initiative) framework; services such as middleware services, data services, monitoring services and the big data processing service are plugged into the PaaS platform as components, forming a stable and efficient system that is pluggable and can change its behavior dynamically. The big data processing service component lets the user input the configuration parameters needed to generate the virtual distributed processing cluster and presents the results to the user; it also provides management functions for the virtual distributed processing cluster, including controlling the life cycle of the cluster and monitoring the progress of data processing on the cluster.

The virtual distributed processing cluster provides the system's core data analysis and processing capability. The cluster is generated by the PaaS platform through virtualization technology, according to the parameter configuration provided by the big data processing service component. The cluster obtains the data to be analyzed from cloud storage, processes and analyzes the data according to the script provided by the big data processing service component, and presents the analysis results to the user through the user interface of the big data processing service component of the PaaS platform. The cluster adopts the Hadoop cluster architecture, which implements a distributed file system (Hadoop Distributed File System, HDFS). HDFS is highly fault-tolerant and is designed to be deployed on inexpensive hardware. HDFS also provides high-throughput access to application data. Through the Hadoop framework, the existing resources of the PaaS platform are used to provide big data processing capability with high reliability, high scalability, high efficiency and high fault tolerance.

The cloud storage and servers can be built from the existing resources of the PaaS platform and provide the hardware resource foundation for the whole system. All disk devices in the cloud storage come from inexpensive PC equipment and are consolidated into a single shared storage pool that is provided to the front-end application servers, which greatly improves disk utilization. Distributed storage improves file read/write efficiency; cloud storage can reach large capacity through linear expansion while providing high I/O (input/output) bandwidth for unstructured data. The storage backup strategy eliminates single points of failure of disks, ensures high reliability, and has a cost advantage over conventional storage.

The big data processing method and system for a PaaS platform provided by the embodiments of the present invention have the following beneficial effects:

(1) The present invention makes full use of the existing storage and computing resources of the PaaS platform and improves the resource utilization of the PaaS platform; users no longer need to purchase new storage and servers, which effectively reduces cost; meanwhile, the big data processing service is integrated into the PaaS platform as a component, which makes it easy to extend and speeds up development.

(2) As PaaS platforms develop, more and more applications are deployed on them, and processing massive data on a PaaS platform becomes inevitable; the present invention can effectively solve the problem of massive data processing on the PaaS platform and improve data processing efficiency.

The above are preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims

1. A big data processing method for a PaaS platform, characterized by comprising:

S5, the PaaS platform server provides the data processing result to the user;

Wherein, the PaaS platform is configured on the PaaS platform server and integrates a big data processing service component; the big data processing service component is used to perform the method flow of steps S1 to S5 above.

2. The big data processing method for a PaaS platform according to claim 1, characterized in that each node is a virtual machine in the distributed processing cluster; the nodes comprise a control node and computing nodes, the control node being used to manage the cluster and distribute data processing tasks, and the computing nodes being used to analyze and process data.

3. The big data processing method for a PaaS platform according to claim 2, characterized in that step S2 specifically comprises:

S201, generating one virtual machine by virtualization technology according to the cluster creation parameters, and configuring the runtime environment of the virtual machine;

S202, copying the virtual machine generated in step S201 according to the number of nodes in the cluster creation parameters, to generate the required number of virtual machines;

S203, setting up passwordless communication between the virtual machines;

4. The big data processing method for a PaaS platform according to claim 3, characterized in that step S3 specifically comprises:

The PaaS platform server checks whether the file at the log file storage address is a log file; if so, the data to be analyzed is imported from the log file storage address; otherwise, configuration of the data source to be analyzed fails.

5. The big data processing method for a PaaS platform according to claim 4, characterized in that step S4 specifically comprises:

S401, the PaaS platform server transmits the script for analyzing data to the control node in the distributed processing cluster; the script for analyzing data is a MapReduce script, which specifies the method for importing the data to be analyzed and the method for executing the MapReduce job;