CN103067501A

CN103067501A - Large data processing method of PaaS platform

Info

Publication number: CN103067501A
Application number: CN2012105816708A
Authority: CN
Inventors: 李进
Original assignee: GCI Science and Technology Co Ltd
Current assignee: GCI Science and Technology Co Ltd
Priority date: 2012-12-28
Filing date: 2012-12-28
Publication date: 2013-04-24
Anticipated expiration: 2032-12-28
Also published as: CN103067501B

Abstract

The invention discloses a big data processing method for a PaaS platform. In the method, a PaaS platform server receives cluster creation parameters entered by a user. The PaaS platform server generates a distributed processing cluster by virtualization technology according to the cluster creation parameters. The PaaS platform server transmits a script for analyzing data to the distributed processing cluster, and the distributed processing cluster processes the data to be analyzed. The PaaS platform server provides the data processing results to the user. The method can solve the problem of processing massive data on the PaaS platform and improves data processing efficiency.

Description

Big data processing method for a PaaS platform

Technical field

The present invention relates to the field of cloud computing technology, and in particular to a big data processing method for a PaaS (Platform-as-a-Service) platform.

Background technology

Cloud computing is developing rapidly, and PaaS, as a key area of the cloud computing industry, has become an important battleground over which major enterprises will compete. Because IaaS (Infrastructure-as-a-Service) and SaaS (Software-as-a-Service) have already been commercialized and much application software has been standardized in the cloud environment, users need to take full advantage of the innovative solutions that PaaS brings, while service providers need those solutions to differentiate themselves from competitors. As a service model, PaaS can advance the development of SaaS and can increase the quantity of resources available on the Web platform. PaaS solutions make it more convenient to deploy applications, simplify the purchase and management of the underlying software and hardware, and reduce cost.

As PaaS platforms develop, more and more applications, of ever greater size, are deployed on them. Because data is increasingly produced automatically, a growing number of applications need to persist these continuously growing data streams and to run subsequent query analysis and data mining on them. This poses a severe challenge to the management of massive data on the PaaS platform, and so the problem of big data processing on the PaaS platform arises.

Summary of the invention

The embodiments of the invention propose a big data processing method for a PaaS platform that can solve the problem of processing massive data on the PaaS platform and improve data processing efficiency.

An embodiment of the invention provides a big data processing method for a PaaS platform, comprising:

S1, a PaaS platform server receives cluster creation parameters entered by a user; the cluster creation parameters comprise the number of nodes of the distributed processing cluster to be created, the memory size of each node, and the storage size of each node;

S2, the PaaS platform server generates the distributed processing cluster by virtualization technology according to the cluster creation parameters;

S3, the PaaS platform server configures the data source to be analyzed according to a log file storage address entered by the user or the name of an application deployed by the user;

S4, the PaaS platform server transmits a script for analyzing data to the distributed processing cluster, and the distributed processing cluster processes the data to be analyzed;

S5, the PaaS platform server provides the data processing result to the user.

Here, each node is a virtual machine in the distributed processing cluster; the nodes comprise a control node and computing nodes, the control node being used to manage the cluster and distribute data processing tasks, and the computing nodes being used to analyze and process data.

The big data processing method provided by the embodiments of the invention uses the existing resources of the PaaS platform: the PaaS platform generates each node of the distributed processing cluster through the virtualization technology of the underlying IaaS layer. The generated distributed processing cluster gives the PaaS platform big data processing capability, which can solve the problem of processing massive data on the PaaS platform and improve data processing efficiency.

Description of drawings

Fig. 1 is a schematic flowchart of an embodiment of the big data processing method for a PaaS platform provided by the invention;

Fig. 2 is a schematic structural diagram of an embodiment of the big data processing system for a PaaS platform provided by the invention.

Embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention, without creative effort, fall within the scope of protection of the invention.

Refer to Fig. 1, a schematic flowchart of an embodiment of the big data processing method for a PaaS platform provided by the invention.

An embodiment of the invention provides a big data processing method for a PaaS platform, comprising steps S1 to S5, as follows:

S1, the PaaS platform server receives cluster creation parameters entered by a user.

The cluster creation parameters comprise the number of nodes of the distributed processing cluster to be created, the memory size of each node, the storage size of each node, and other parameters.

Each node is a virtual machine in the distributed processing cluster; the nodes comprise a control node and computing nodes, the control node being used to manage the cluster and distribute data processing tasks, and the computing nodes being used to analyze and process data.

In addition, the PaaS platform server checks, according to the cluster creation parameters, whether system resources meet the requirements. If they do, step S2 is executed to create the distributed processing cluster.
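
As a minimal sketch of step S1 and of the resource check that precedes step S2, the parameters can be modeled as a small record and compared against the capacity the platform still has free. The names `ClusterParams`, `FreeResources`, and `resources_sufficient` are hypothetical and do not come from the patent; the units (MB, GB) and the minimum of two nodes are likewise assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterParams:
    """Cluster creation parameters entered by the user (step S1)."""
    node_count: int   # number of virtual machines in the cluster
    memory_mb: int    # memory per node, in MB (unit is an assumption)
    storage_gb: int   # storage per node, in GB (unit is an assumption)


@dataclass(frozen=True)
class FreeResources:
    """Hypothetical view of the capacity the platform still has available."""
    memory_mb: int
    storage_gb: int


def resources_sufficient(params: ClusterParams, free: FreeResources) -> bool:
    """Return True if the platform can host every requested node."""
    if params.node_count < 2:
        # Assumption: one control node plus at least one computing node.
        return False
    need_memory = params.node_count * params.memory_mb
    need_storage = params.node_count * params.storage_gb
    return need_memory <= free.memory_mb and need_storage <= free.storage_gb
```

Only when such a check passes would the server go on to create virtual machines; how the free capacity is actually measured is not described in the text.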

S2, the PaaS platform server generates the distributed processing cluster by virtualization technology according to the cluster creation parameters.

Step S2 specifically comprises steps S201 to S204, as follows:

S201, according to the cluster creation parameters, generate one virtual machine by virtualization technology and configure the runtime environment of the virtual machine.

For example, software such as the JDK, MySQL, and Hadoop is installed and configured on the generated virtual machine. The required software can be copied from the software files under the big data processing service component. In one embodiment, the virtual machine runs the CentOS 5.5 operating system, the JDK version is 1.6.23, the MySQL version is 5.5, and the Hadoop version is 1.0.2.

S202, according to the number of nodes in the cluster creation parameters, copy the virtual machine generated in step S201 to produce the required number of virtual machines.

S203, set up password-free communication between the virtual machines.

Step S203 specifically comprises: directing every virtual machine to start a key generator and generate its own public key and private key, and then copying the public key generated by each virtual machine to the other virtual machines, thereby achieving password-free communication.

In a specific implementation, the command ssh-keygen -t dsa can be run on each virtual machine to generate its public key and private key. The contents of each public key file are then copied into the authorized_keys file of the other virtual machines, and each virtual machine logs in to the others once so that the known_hosts file is generated, achieving password-free communication.
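
The key exchange in step S203 can be sketched as a pure function that works out, for every virtual machine, which public keys belong in its authorized keys file. This is only an illustration: `build_authorized_keys` is a hypothetical helper, the key strings are placeholders, and the actual copying of files between machines is left out.

```python
from typing import Dict


def build_authorized_keys(public_keys: Dict[str, str]) -> Dict[str, str]:
    """For each node, collect the public keys of all other nodes.

    public_keys maps a node name to the single-line content of that
    node's public key file. The result maps each node name to the text
    that would be appended to that node's authorized keys file.
    """
    result: Dict[str, str] = {}
    for node in public_keys:
        others = [
            key.strip()
            for name, key in sorted(public_keys.items())
            if name != node
        ]
        # One key per line, newline-terminated; empty if there are no peers.
        result[node] = ("\n".join(others) + "\n") if others else ""
    return result
```

For three machines, each result entry holds exactly the two keys of its peers, which is the all-to-all exchange the step describes.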

S204, set the control node and the computing nodes in the distributed processing cluster.

By default, this embodiment uses the first virtual machine generated as the control node and the remaining virtual machines as computing nodes. In addition, the Hadoop files slaves, masters, mapred-site.xml, hdfs-site.xml, hadoop-env.sh, and core-site.xml are modified to configure the parameters of the distributed processing cluster.
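
The role assignment in step S204 can be sketched as follows. The helper names are hypothetical, and the port numbers 9000 and 9001 are assumptions chosen for illustration; `fs.default.name` and `mapred.job.tracker` are the Hadoop 1.x property names for the file system URI and the JobTracker address. Note that in Hadoop 1.x the masters file conventionally names the secondary NameNode host, so listing the control node there is also an assumption about how this embodiment uses it.

```python
from typing import Dict, List


def assign_roles(vm_hosts: List[str]) -> Dict[str, List[str]]:
    """Make the first generated VM the control node, the rest computing nodes."""
    if len(vm_hosts) < 2:
        raise ValueError("need one control node and at least one computing node")
    return {"control": vm_hosts[:1], "compute": vm_hosts[1:]}


def render_host_files(roles: Dict[str, List[str]]) -> Dict[str, str]:
    """Render the contents of the masters and slaves files, one host per line."""
    return {
        "masters": "\n".join(roles["control"]) + "\n",
        "slaves": "\n".join(roles["compute"]) + "\n",
    }


def site_properties(control_host: str) -> Dict[str, Dict[str, str]]:
    """Properties that point every node at the control node (ports assumed)."""
    return {
        "core-site.xml": {"fs.default.name": f"hdfs://{control_host}:9000"},
        "mapred-site.xml": {"mapred.job.tracker": f"{control_host}:9001"},
    }
```

The rendered text would then be written into the Hadoop configuration directory of each virtual machine; the remaining files named in the step (hdfs-site.xml, hadoop-env.sh) are not covered by this sketch.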

S3, the PaaS platform server configures the data source to be analyzed according to a log file storage address entered by the user or the name of an application deployed by the user.

Step S3 specifically comprises:

The PaaS platform server receives the log file storage address entered by the user, or obtains the corresponding log file storage address according to the name of the application that the user has deployed on the PaaS platform;

The PaaS platform server checks whether the file at the log file storage address is a log file (that is, it determines whether the log file exists); if so, the data to be analyzed is imported from the log file storage address; otherwise, configuration of the data source to be analyzed fails.

The log file at the log file storage address is the data source to be analyzed; in the subsequent step S4 it is imported into the distributed cluster for data processing.
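
A minimal sketch of the lookup and existence check in step S3 follows. The function `resolve_log_path` and the `app_log_paths` registry are hypothetical; the text does not say how the platform maps an application name to its log location, so a plain dictionary stands in for that lookup, and a local file path stands in for the storage address.

```python
import os
from typing import Dict, Optional


def resolve_log_path(user_path: Optional[str],
                     app_name: Optional[str],
                     app_log_paths: Dict[str, str]) -> str:
    """Return the log file path of the data source, or raise on failure.

    A path entered by the user takes precedence; otherwise the path is
    looked up by the name of the deployed application.
    """
    path = user_path if user_path else app_log_paths.get(app_name or "")
    if not path or not os.path.isfile(path):
        # Corresponds to the failure branch of step S3.
        raise FileNotFoundError("data source configuration failed: no log file")
    return path
```

A caller would treat the raised exception as the configuration failure and otherwise pass the returned path on to the import in step S4.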

S4, the PaaS platform server transmits a script for analyzing data to the distributed processing cluster, and the distributed processing cluster processes the data to be analyzed.

Step S4 specifically comprises:

S401, the PaaS platform server transmits the script for analyzing data to the control node of the distributed processing cluster; the script for analyzing data is a MapReduce script, which specifies the method for importing the data to be analyzed and the method for executing the MapReduce job.

S402, the control node selects idle computing nodes in the distributed processing cluster, and the computing nodes execute the data processing tasks in parallel to process the data to be analyzed.

The control node in the distributed processing cluster mainly supervises and manages the execution of MapReduce jobs in the cluster, while the computing nodes are responsible for actually executing the Map tasks and Reduce tasks of a MapReduce job. When a MapReduce job is submitted to the distributed processing cluster, the relevant input data is first divided into multiple segments, and the control node then selects idle computing nodes to execute Map tasks on the data segments in parallel. The intermediate records produced by the Map tasks are then partitioned again, and the control node selects idle computing nodes to execute Reduce tasks on them in parallel, yielding as the result the data set corresponding to each key. This process is repeated until all Map tasks and Reduce tasks in the MapReduce job have been executed.
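
The split, Map, regroup, and Reduce sequence described above can be illustrated with a single-process simulation. This is not Hadoop code: the tasks run sequentially rather than on computing nodes, and the example job (counting log lines per level, assuming each line starts with its level) is an assumption chosen because the data source in this embodiment is a log file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_input(lines: List[str], segment_size: int) -> List[List[str]]:
    """Divide the input into segments, one per Map task."""
    return [lines[i:i + segment_size] for i in range(0, len(lines), segment_size)]


def map_task(segment: Iterable[str]) -> List[Tuple[str, int]]:
    """Emit (log_level, 1) for every non-blank line of one segment."""
    return [(line.split()[0], 1) for line in segment if line.strip()]


def shuffle(records: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the intermediate records by key, ready for the Reduce tasks."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def reduce_task(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values that share one key."""
    return key, sum(values)


def run_job(lines: List[str], segment_size: int = 2) -> Dict[str, int]:
    """Run the whole job in-process; a cluster would run the tasks in parallel."""
    intermediate: List[Tuple[str, int]] = []
    for segment in split_input(lines, segment_size):
        intermediate.extend(map_task(segment))
    grouped = shuffle(intermediate)
    return dict(reduce_task(key, values) for key, values in grouped.items())
```

In a real cluster, each call to `map_task` and `reduce_task` would be dispatched by the control node to an idle computing node; here the loop order plays that role.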

In a specific implementation, the PaaS platform server also checks, according to the script type, whether the script for analyzing data meets the requirements. For example, the script may be required to be of the jar type. Once the script meets the requirements, steps S401 and S402 are executed.
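
One way to picture the script type check is shown below. The function `is_acceptable_script` is hypothetical. The text only says the script type is checked; testing the file extension and additionally verifying that the file is a valid ZIP archive (the container format of jar files) is an assumption about what such a check might involve.

```python
import zipfile
from pathlib import Path


def is_acceptable_script(path: str) -> bool:
    """Accept only existing files named *.jar that are valid ZIP archives."""
    p = Path(path)
    if p.suffix.lower() != ".jar" or not p.is_file():
        return False
    # A jar file is a ZIP archive, so a non-ZIP file with a .jar name is rejected.
    return zipfile.is_zipfile(str(p))
```

A script that fails this check would be rejected before anything is transmitted to the control node.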

S5, the PaaS platform server provides the data processing result to the user.

The big data processing method provided by the embodiments of the invention uses the existing resources of the PaaS platform: the PaaS platform generates each node of the distributed processing cluster through the virtualization technology of the underlying IaaS layer. The generated distributed processing cluster gives the PaaS platform big data processing capability, thereby solving the problem of processing massive data on the PaaS platform and improving data processing efficiency.

In a specific implementation, a PaaS platform is configured on the PaaS platform server. This PaaS platform integrates a big data processing service component, and the big data processing flow of steps S1 to S5 above is executed by that component.

Refer to Fig. 2, a schematic structural diagram of an embodiment of the big data processing system for a PaaS platform provided by the invention.

An embodiment of the invention provides a big data processing system for a PaaS platform, comprising: a PaaS platform layer, a virtual distributed processing cluster, and cloud storage and servers. The details are as follows:

The PaaS platform layer provides various service components, including the big data processing service component, and provides the user with a user interface (UI) for operation. The PaaS platform adopts the OSGi (Open Service Gateway Initiative) framework: middleware services, data services, monitoring services, the big data processing service, and other services are plugged into the PaaS platform as components, forming a stable and efficient system that is pluggable and whose behavior can be changed dynamically. The big data processing service component lets the user enter the configuration parameters required to generate the virtual distributed processing cluster and presents the results; it also provides management functions for the virtual distributed processing cluster, including controlling the life cycle of the cluster and monitoring the cluster's data processing.

The virtual distributed processing cluster provides the core data analysis and processing capability of the system. The cluster is generated by the PaaS platform through virtualization technology, according to the parameter configuration provided by the big data processing service component. The cluster obtains the data to be analyzed from cloud storage, processes and analyzes the data according to the script provided by the big data processing service component, and presents the analysis results to the user through the user interface of the big data processing service component of the PaaS platform. The cluster adopts the Hadoop cluster architecture, which implements a distributed file system (Hadoop Distributed File System, HDFS). HDFS is highly fault tolerant and is designed to be deployed on inexpensive hardware. HDFS also provides high-throughput access to application data. Through the Hadoop framework, and using the existing resources of the PaaS platform, a big data processing capability with high reliability, high scalability, high efficiency, and high fault tolerance is provided.

The cloud storage and servers can be built from the existing resources of the PaaS platform and provide the hardware resource base for the whole system. All disk devices in the cloud storage come from inexpensive PC equipment and are consolidated into a single shared storage pool that is offered to the front-end application servers, which greatly improves disk utilization. Distributed storage improves file read and write efficiency; cloud storage can reach large capacity through linear expansion and can also provide high I/O (input/output) bandwidth for unstructured data. The storage backup strategy eliminates the disk as a single point of failure and ensures high reliability, and the approach costs less than conventional storage.

The big data processing method and system for a PaaS platform provided by the embodiments of the invention have the following beneficial effects:

(1) The invention makes full use of the existing storage and computing resources of the PaaS platform and improves the utilization of PaaS platform resources. The user no longer needs to buy new storage and servers, which effectively reduces cost. In addition, the big data processing service is integrated into the PaaS platform as a component, so it can be extended easily, which speeds up development.

(2) As PaaS platforms develop, more and more applications, of ever greater size, are deployed on them, so processing massive data on the PaaS platform is inevitable. The invention can effectively solve the problem of massive data processing on the PaaS platform and improve data processing efficiency.

The above is a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art can make improvements and refinements without departing from the principle of the invention, and such improvements and refinements are also regarded as falling within the scope of protection of the invention.

Claims

1. A big data processing method for a PaaS platform, characterized in that it comprises:

S5, the PaaS platform server provides the data processing result to the user.

2. The big data processing method for a PaaS platform according to claim 1, characterized in that each node is a virtual machine in the distributed processing cluster; the nodes comprise a control node and computing nodes, the control node being used to manage the cluster and distribute data processing tasks, and the computing nodes being used to analyze and process data.

3. The big data processing method for a PaaS platform according to claim 2, characterized in that step S2 specifically comprises:

S201, according to the cluster creation parameters, generate one virtual machine by virtualization technology and configure the runtime environment of the virtual machine;

S202, according to the number of nodes in the cluster creation parameters, copy the virtual machine generated in step S201 to produce the required number of virtual machines;

S203, set up password-free communication between the virtual machines;

4. The big data processing method for a PaaS platform according to claim 3, characterized in that step S3 specifically comprises:

The PaaS platform server checks whether the file at the log file storage address is a log file; if so, the data to be analyzed is imported from the log file storage address; otherwise, configuration of the data source to be analyzed fails.

5. The big data processing method for a PaaS platform according to claim 4, characterized in that step S4 specifically comprises:

S401, the PaaS platform server transmits the script for analyzing data to the control node of the distributed processing cluster; the script for analyzing data is a MapReduce script, which specifies the method for importing the data to be analyzed and the method for executing the MapReduce job;