CN106657099B

CN106657099B - Spark data analysis service publishing system

Info

Publication number: CN106657099B
Application number: CN201611248761.4A
Authority: CN
Inventors: 王莹; 张立军; 孙丙聪
Original assignee: Beijing Tianyuan Innovation Technology Co ltd
Current assignee: Beijing Tianyuan Innovation Technology Co ltd
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2020-06-16
Anticipated expiration: 2036-12-29
Also published as: CN106657099A

Abstract

The invention provides a data analysis service publishing system comprising a Spark data analysis module, a service scheduling module and a service standard formulation module. The service standard formulation module formulates a uniform service publishing standard; the service scheduling module receives a service request and sends it to an idle service; the Spark data analysis module constructs a service container and analyzes and processes the service request according to the service publishing standard. Because a uniform service standard is formulated, a third-party client or business system performs big data analysis by calling the data analysis service, so the business system is effectively isolated from the big data analysis and its development cost is reduced. The services run on a Spark distributed computing system, which greatly improves the speed and efficiency of data analysis.

Description

Spark data analysis service publishing system

Technical Field

The invention relates to the technical field of data analysis and mining, and in particular to a Spark data analysis service publishing system.

Background

With the advent of the information age, the volume of accumulated data has grown geometrically. Various data analysis algorithms have emerged to mine valid information from this mass of existing data. In practice, the most suitable algorithm cannot be determined immediately: different algorithms or combinations of algorithms must be tried repeatedly, and their results compared, to arrive at the optimal algorithm scheme and analysis result and thus the most useful information from the data.

Data analysts therefore need to understand both the principles of the algorithms and their specific code implementations, which places high demands on technical personnel. When different algorithms are combined to analyze data, the code must be adjusted continually, which makes the process cumbersome. The internet has now entered the era of information and data; as data grows rapidly, companies and research institutions attach increasing importance to mining effective information from existing data, and a variety of data mining system architectures have appeared.

Traditional business systems rarely involve data mining, so traditional software companies must spend a great deal of time and money building an analysis and mining platform to keep pace with the development of big data.

Disclosure of Invention

The invention provides a data analysis service publishing system that overcomes, or at least partially solves, the above problems. It unifies the form of services, makes reasonable use of cluster resources, and builds low-cost big data analysis services on a Spark distributed architecture.

According to one aspect of the invention, the system comprises a Spark data analysis module, a service scheduling module and a service standard formulation module; the service standard formulation module is used for formulating a uniform service publishing standard; the service scheduling module is used for receiving a service request and sending the service request to an idle service; the Spark data analysis module is used for constructing a service container and analyzing and processing the service request according to the service publishing standard.

Preferably, a B/S (browser/server) architecture is adopted, through which the user views service information in a browser, adjusts the service state, and sets the service execution form and the service scale.

Preferably, the service standard formulation module specifies a unified service standard for different algorithms, specifically including a service parameter, a service result combination mode, and a service invocation mode.

Preferably, the service scheduling module is further configured to expose the data analysis function as an HTTP interface of an open API.
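As a non-limiting illustration of such an HTTP interface, the following Python sketch uses only the standard-library WSGI tooling. The route `/api/v1/analyze`, the JSON request shape, and the placeholder function `run_analysis` are assumptions made for this example and are not specified by the invention.

```python
import json
from wsgiref.simple_server import make_server

ANALYZE_PATH = "/api/v1/analyze"  # hypothetical route, not from the source


def run_analysis(params):
    """Placeholder for the real analysis service; here it only sums numbers."""
    return {"sum": sum(params.get("values", []))}


def app(environ, start_response):
    """WSGI application exposing the analysis function over HTTP POST."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != ANALYZE_PATH:
        body = json.dumps({"error": "not found"}).encode("utf-8")
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [body]
    # Read exactly the declared number of bytes from the request body.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = environ["wsgi.input"].read(length)
    params = json.loads(raw) if raw else {}
    body = json.dumps({"status": "ok", "result": run_analysis(params)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Start a development server; call this explicitly to run the sketch."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

A third-party client would then POST a JSON body such as `{"values": [1, 2, 3]}` to the route and read the JSON result, without any knowledge of the cluster behind it.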

Preferably, the Spark data analysis module comprises a Spark data analysis unit and a distributed cluster;

the Spark data analysis unit is used for performing analysis and computation on the dispatched service request through a Spark distributed computing system;

the distributed cluster is used for providing a distributed computing running environment for the Spark data analysis unit.

Preferably, the distributed clusters include Spark clusters and Hadoop clusters.

Preferably, the Spark data analysis unit includes a business subunit and a flow publishing subunit;

the business subunit is used for arbitrarily combining the algorithms that implement the service request and drawing them as a flow chart according to the service publishing standard;

the flow publishing subunit is used for combining all the nodes of the flow chart to generate a task, making the task into a service, and analyzing and processing the service request.

Preferably, the service scheduling module is configured to send the service request to an idle service according to a random load-balancing algorithm, using cluster data provided by the distributed cluster.
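The dispatch step described above might be sketched as follows. The class name `ServiceScheduler`, its methods, and the in-memory busy table are hypothetical; in the described system the state of each service would come from the cluster data provided by the distributed cluster.

```python
import random
from typing import Dict, Iterable, Optional


class ServiceScheduler:
    """Tracks which services are busy and picks an idle one at random."""

    def __init__(self, service_ids: Iterable[str],
                 rng: Optional[random.Random] = None) -> None:
        # False means idle, True means currently handling a request.
        self._busy: Dict[str, bool] = {sid: False for sid in service_ids}
        self._rng = rng or random.Random()

    def dispatch(self) -> Optional[str]:
        """Return a randomly chosen idle service and mark it busy.

        Returns None when every service is busy; queueing or rejecting the
        request in that case is left to the caller.
        """
        idle = [sid for sid, busy in self._busy.items() if not busy]
        if not idle:
            return None
        chosen = self._rng.choice(idle)
        self._busy[chosen] = True
        return chosen

    def release(self, service_id: str) -> None:
        """Mark a service idle again once it has returned its result."""
        self._busy[service_id] = False
```

Choosing uniformly among idle services needs no per-service counters, which keeps the scheduler simple; the trade-off is that balance is only approximate over a small number of requests.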

Preferably, the service scheduling module communicates with the service through a socket, and the communication content includes service request data, service result data, service state data, and service calculation process data.
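One possible wire format for this socket communication is sketched below, assuming length-prefixed JSON frames. The framing, the type labels `request`, `result`, `state` and `progress`, and the helper names are assumptions for illustration; the invention names only the four kinds of content.

```python
import json
import socket
import struct

# Hypothetical labels for the four kinds of content named in the text.
MESSAGE_TYPES = {"request", "result", "state", "progress"}


def send_message(sock: socket.socket, msg_type: str, payload: dict) -> None:
    """Send one message as a 4-byte big-endian length followed by JSON."""
    if msg_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {msg_type}")
    data = json.dumps({"type": msg_type, "payload": payload}).encode("utf-8")
    sock.sendall(struct.pack("!I", len(data)) + data)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_message(sock: socket.socket):
    """Receive one framed message and return (type, payload)."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    msg = json.loads(_recv_exact(sock, length))
    return msg["type"], msg["payload"]
```

The length prefix is what lets the receiver separate consecutive messages on a stream socket, where message boundaries are otherwise not preserved.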

According to the data analysis service publishing system provided by the invention, a uniform service standard is formulated so that a third-party client or business system performs big data analysis by calling a data analysis service; the business system is thereby effectively isolated from the big data analysis, and its development cost is reduced. The services run on a Spark distributed computing system, which greatly improves the speed and efficiency of data analysis.

Drawings

Fig. 1 is a block diagram of a data analysis service publishing system according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Fig. 1 shows a data analysis service publishing system, which includes a Spark data analysis module, a service scheduling module, and a service standard formulation module. The service standard formulation module formulates a uniform service publishing standard, specifically comprising a service production standard, a parameter transmission standard and a result return standard; the standard ensures the uniformity of services and makes them easier for users to use. The service scheduling module receives service requests, sends each request to an idle service, allocates data analysis tasks, balances cluster resources, executes task cycles, and starts and stops services. The Spark data analysis module constructs a service container and analyzes and processes the service request according to the service publishing standard. The services run on a Spark distributed computing system, one of the mainstream cloud computing frameworks. Adopting this cloud computing mode greatly improves the speed and efficiency of data analysis, and also allows algorithms to be combined in different orders to analyze and process data, which diversifies the analysis process.

In this embodiment, a B/S (browser/server) architecture is adopted. Through a browser, the user views service information, such as service parameters, the combination form of service return values, the service state, the flow chart, and the service call log; adjusts the service state; sets the service execution form, such as timed execution or periodic execution; and sets the service scale, such as the number of concurrent executions.
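The settings a user adjusts in the browser could be represented along the following lines. This is only a sketch: the names `ExecutionForm` and `ServiceSettings`, the `ON_DEMAND` form, and the rule for computing the next run are assumptions, since the embodiment lists timed execution, periodic execution and the number of concurrencies without fixing a data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ExecutionForm(Enum):
    ON_DEMAND = "on_demand"  # assumed default: run only when called
    TIMED = "timed"          # run once at a fixed time
    PERIODIC = "periodic"    # run repeatedly at a fixed interval


@dataclass
class ServiceSettings:
    form: ExecutionForm
    concurrency: int = 1                   # service scale
    start_at: Optional[datetime] = None
    interval: Optional[timedelta] = None

    def __post_init__(self) -> None:
        if self.concurrency < 1:
            raise ValueError("concurrency must be at least 1")

    def next_run(self, now: datetime) -> Optional[datetime]:
        """Return the next scheduled run strictly after `now`, if any."""
        if self.form is ExecutionForm.TIMED:
            if self.start_at is not None and self.start_at > now:
                return self.start_at
            return None
        if self.form is ExecutionForm.PERIODIC:
            if self.start_at is None or self.interval is None:
                raise ValueError("periodic execution needs start_at and interval")
            if now < self.start_at:
                return self.start_at
            # Whole intervals already elapsed, plus one to land after `now`.
            periods = (now - self.start_at) // self.interval + 1
            return self.start_at + periods * self.interval
        return None
```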

Preferably, the service standard formulation module specifies a uniform service standard for different algorithms, specifically comprising service parameters, a service result combination mode and a service invocation mode. The standard ensures the uniformity of services, makes them easier for users to use, and improves both the availability of services and the reusability of the business system's code.
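By way of example only, a service standard of this kind could be captured in a small descriptor that every published algorithm fills in. The class `ServiceStandard`, the mode names `merge` and `list`, and the invocation labels are hypothetical; the text names the three parts of the standard but not their concrete values.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ServiceStandard:
    """Hypothetical descriptor covering the three parts of the standard."""
    name: str
    parameters: Dict[str, type]  # service parameters: name -> expected type
    result_mode: str             # result combination mode: "merge" or "list"
    invocation: str              # invocation mode, e.g. "http" or "socket"

    def validate(self, request: Dict[str, Any]) -> None:
        """Reject requests that do not match the declared parameters."""
        missing = [p for p in self.parameters if p not in request]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        for key, expected in self.parameters.items():
            if not isinstance(request[key], expected):
                raise TypeError(f"{key} must be of type {expected.__name__}")

    def combine(self, partial_results: List[Dict[str, Any]]) -> Any:
        """Combine per-algorithm results according to the declared mode."""
        if self.result_mode == "merge":
            merged: Dict[str, Any] = {}
            for part in partial_results:
                merged.update(part)  # later results overwrite equal keys
            return merged
        if self.result_mode == "list":
            return list(partial_results)
        raise ValueError(f"unknown result mode: {self.result_mode}")
```

Because every service declares the same three things, a caller can validate a request and interpret a result without knowing which algorithm sits behind the service.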

Preferably, the Spark data analysis unit further includes a business subunit and a flow publishing subunit;

the business subunit is used for arbitrarily combining the algorithms that implement the service request and drawing them as a flow chart according to the service publishing standard; the flow chart comprises algorithm instance nodes and the relationships between them, and these relationships are determined by the connecting lines drawn between the algorithms.

The flow publishing subunit is used for combining all the nodes of the flow chart to generate a task and making the task into a service.
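The combination of flow chart nodes into a single task can be illustrated with a topological ordering of the nodes. The sketch below is an assumption about how such a step might be realized, not the patented implementation: `build_task` is a hypothetical helper, plain Python callables stand in for algorithm instances, and `graphlib` (Python 3.9+) stands in for whatever ordering the real system uses.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple


def build_task(nodes: Dict[str, Callable[..., Any]],
               edges: List[Tuple[str, str]]) -> Callable[[Any], Dict[str, Any]]:
    """Combine flow chart nodes into one callable task.

    nodes maps a node name to its algorithm; an edge (src, dst) is a
    connecting line meaning dst consumes the output of src.
    """
    predecessors = {name: set() for name in nodes}
    has_successor = set()
    for src, dst in edges:
        predecessors[dst].add(src)
        has_successor.add(src)
    # Raises graphlib.CycleError if the connecting lines form a loop.
    order = list(TopologicalSorter(predecessors).static_order())

    def task(data: Any) -> Dict[str, Any]:
        outputs: Dict[str, Any] = {}
        for name in order:
            preds = sorted(predecessors[name])
            # Entry nodes receive the request data; other nodes receive the
            # outputs of their predecessors, in sorted name order.
            inputs = [outputs[p] for p in preds] if preds else [data]
            outputs[name] = nodes[name](*inputs)
        # The task result is the output of every node with no successor.
        return {n: outputs[n] for n in order if n not in has_successor}

    return task


task = build_task(
    {
        "clean": lambda rows: [r for r in rows if r is not None],
        "scale": lambda rows: [r * 2 for r in rows],
        "total": lambda rows: sum(rows),
    },
    [("clean", "scale"), ("scale", "total")],
)
print(task([1, None, 2]))  # {'total': 6}
```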

When a service request arrives, the service scheduling module sends it to an idle service according to a random load-balancing algorithm, using cluster resource data provided by the distributed cluster. The service scheduling module records the current state of each service and uses a random algorithm to select one of the idle background services. Under the same execution environment, probability dictates that as the number of requests grows, each service is called roughly the same number of times, so the load is balanced.
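The probabilistic argument can be made concrete under a simplifying assumption that is not stated in the source: all k services are idle whenever one of n requests arrives, so each request is assigned independently and uniformly at random. Writing X_i for the number of requests received by service i:

```latex
X_i \sim \mathrm{Binomial}\!\left(n, \tfrac{1}{k}\right), \qquad
\mathbb{E}[X_i] = \frac{n}{k}, \qquad
\operatorname{Var}(X_i) = n \cdot \frac{1}{k}\left(1 - \frac{1}{k}\right), \qquad
\frac{\sqrt{\operatorname{Var}(X_i)}}{\mathbb{E}[X_i]} = \sqrt{\frac{k-1}{n}}
\xrightarrow[n \to \infty]{} 0
```

The relative spread of the per-service call counts therefore shrinks as requests accumulate; for instance, with k = 10 services and n = 9000 requests it is about sqrt(9/9000), roughly 3%.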

The invention provides a Spark data analysis service publishing system. By specifying a uniform service publishing standard, it broadens the applicability of services and reduces both errors and the complexity of using services. It builds a data analysis platform on the Spark data analysis architecture to carry out analysis computation and analysis flows, and adopts a cloud computing mode that greatly improves the speed and efficiency of data analysis. The business system is effectively isolated from the big data analysis, which reduces the development cost of the business system, and the data analysis function is exposed as an HTTP interface of an open API so that third parties can call it conveniently.

Finally, the above is only a preferred embodiment of the present application and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A data analysis service publishing system, characterized by comprising a Spark data analysis module, a service scheduling module and a service standard formulation module; the service standard formulation module is used for formulating a uniform service publishing standard; the service scheduling module is used for receiving a service request and sending the service request to an idle service; the Spark data analysis module is used for constructing a service container and analyzing and processing the service request according to the service publishing standard;

the system further adopts a B/S architecture, through which a user views service information and adjusts the service state in a browser, and sets a service execution form and a service scale;

the Spark data analysis module comprises a Spark data analysis unit and a distributed cluster;

the distributed cluster is used for providing a distributed computing operation environment for the Spark data analysis unit;

the Spark data analysis unit further comprises a business subunit and a flow publishing subunit;

2. The data analysis service publishing system of claim 1, wherein the service standard formulation module specifies a uniform service standard for different algorithms, specifically comprising service parameters, a service result combination mode, and a service invocation mode.

3. The data analysis service publishing system of claim 1, wherein the service scheduling module is further configured to expose the data analysis function as an HTTP interface of an open API.

4. The data analysis service publishing system of claim 1, wherein the distributed cluster comprises a Spark cluster and a Hadoop cluster.

5. The data analysis service publishing system of claim 1, wherein the service scheduling module is configured to send the service request to the idle service according to a random load-balancing algorithm, using the cluster data provided by the distributed cluster.

6. The data analysis service publishing system of claim 1, wherein the service scheduling module communicates with the service through a socket, and the communication content includes service request data, service result data, service state data, and service calculation process data.